Consider the following DataFrame:
link            tags  views
  /a  [tag_a, tag_b]    100
  /b  [tag_a, tag_c]    200
  /c  [tag_b, tag_c]    150
What would be an efficient way to 'groupby' items within a list in the tags column? For instance, if one were to find the total views for each tag in the DataFrame above, the result would be:
tag    views
tag_a    300
tag_b    250
tag_c    350
So far, this is what I have come up with:
# get all unique tags
all_tags = list(set([item for sublist in df.tags.tolist() for item in sublist]))
# sum the views for each tag
tag_views = {tag: df[df.tags.map(lambda x: tag in x)].views.sum() for tag in all_tags}
This approach is rather slow for a large dataset. Is there a more efficient way (perhaps using the built-in groupby function) of doing this?
You could split the tags column into multiple rows and then groupby:
df = pd.DataFrame(...)
tag = pd.DataFrame(df.tags.tolist()).stack()
tag.index = tag.index.droplevel(-1)
tag.name = 'tag'
df.join(tag).groupby('tag').sum()
Result:
       views
tag
tag_a    300
tag_b    250
tag_c    350
This will not be very space-efficient because of the join, especially for a high number of tags per URL. For a small number of tags, I would be interested to hear about the timings.
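For reference, here is one way to make this approach self-contained, rebuilding the example DataFrame from the question's data (a sketch; it assumes pandas is importable as pd):

import pandas as pd

# Rebuild the example DataFrame from the question.
df = pd.DataFrame({
    'link': ['/a', '/b', '/c'],
    'tags': [['tag_a', 'tag_b'], ['tag_a', 'tag_c'], ['tag_b', 'tag_c']],
    'views': [100, 200, 150],
})

# One row per (original row, tag). The inner index level records the
# position of each tag within its list and is dropped afterwards.
tag = pd.DataFrame(df.tags.tolist(), index=df.index).stack()
tag.index = tag.index.droplevel(-1)
tag.name = 'tag'

# Attach the expanded tag column back to the original rows and aggregate.
print(df.join(tag).groupby('tag').views.sum())

On pandas 0.25 or newer, df.explode('tags').groupby('tags').views.sum() should give the same result in a single step, although I have not compared the timings.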
Alternatively, use a multi-index:
df = pd.DataFrame(...)
all_tags = [...]
groups = df.tags.map(lambda cell: tuple(tag in cell for tag in all_tags))
df.index = pd.MultiIndex.from_tuples(groups.values, names=all_tags)
for t in all_tags:
    print(t, df.xs(True, level=t).views.sum())
Result:
tag_a 300
tag_b 250
tag_c 350
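And a self-contained sketch of the multi-index variant, again built from the question's example data; deriving all_tags from the tags column and collecting the sums into a Series instead of printing them are my own assumptions about how the placeholders would be filled in:

import pandas as pd

# Rebuild the example DataFrame from the question.
df = pd.DataFrame({
    'link': ['/a', '/b', '/c'],
    'tags': [['tag_a', 'tag_b'], ['tag_a', 'tag_c'], ['tag_b', 'tag_c']],
    'views': [100, 200, 150],
})

# Derive the unique tags, much as the question does.
all_tags = sorted({t for tags in df.tags for t in tags})

# One boolean index level per tag: True if the row carries that tag.
groups = df.tags.map(lambda cell: tuple(tag in cell for tag in all_tags))
df.index = pd.MultiIndex.from_tuples(list(groups), names=all_tags)

# For each tag, sum the views of every row whose level for that tag is True.
result = pd.Series({t: df.xs(True, level=t).views.sum() for t in all_tags})
print(result)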