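I have a DataFrame in which every row is duplicated, so when I plot it the graph contains duplicated bars. I tried deleting the duplicates, but then the graph is drawn incorrectly. My csv is here.
The DataFrame common_users: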
used_at common users pair of websites
0 2014 1364 avito.ru and e1.ru
1 2014 1364 e1.ru and avito.ru
2 2014 1716 avito.ru and drom.ru
3 2014 1716 drom.ru and avito.ru
4 2014 1602 avito.ru and auto.ru
5 2014 1602 auto.ru and avito.ru
6 2014 299 avito.ru and avtomarket.ru
7 2014 299 avtomarket.ru and avito.ru
8 2014 579 avito.ru and am.ru
9 2014 579 am.ru and avito.ru
10 2014 602 avito.ru and irr.ru/cars
11 2014 602 irr.ru/cars and avito.ru
12 2014 424 avito.ru and cars.mail.ru/sale
13 2014 424 cars.mail.ru/sale and avito.ru
14 2014 634 e1.ru and drom.ru
15 2014 634 drom.ru and e1.ru
16 2014 475 e1.ru and auto.ru
17 2014 475 auto.ru and e1.ru
.....
You can see that the website names are inverted. I tried to sort by pair of websites, but I get a KeyError. I use this code:
import itertools
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("avito_trend.csv", parse_dates=[2])

def f(df):
    dfs = []
    for x in [list(x) for x in itertools.combinations(df['address'].unique(), 2)]:
        c1 = df.loc[df['address'].isin([x[0]]), 'ID']
        c2 = df.loc[df['address'].isin([x[1]]), 'ID']
        c = pd.Series(list(set(c1).intersection(set(c2))))
        #add inverted intersection c2 vs c1
        c_invert = pd.Series(list(set(c2).intersection(set(c1))))
        dfs.append(pd.DataFrame({'common users':len(c), 'pair of websites':' and '.join(x)}, index=[0]))
        #swap values in x
        x[1],x[0] = x[0],x[1]
        dfs.append(pd.DataFrame({'common users':len(c_invert), 'pair of websites':' and '.join(x)}, index=[0]))
    return pd.concat(dfs)

common_users = df.groupby([df['used_at'].dt.year]).apply(f).reset_index(drop=True, level=1).reset_index()
graph_by_common_users = common_users.pivot(index='pair of websites', columns='used_at', values='common users')
#sort by column 2014
graph_by_common_users = graph_by_common_users.sort_values(2014, ascending=False)

ax = graph_by_common_users.plot(kind='barh', width=0.5, figsize=(10,20))
[label.set_rotation(25) for label in ax.get_xticklabels()]

rects = ax.patches
labels = [int(round(graph_by_common_users.loc[i, y])) for y in graph_by_common_users.columns.tolist() for i in graph_by_common_users.index]
for rect, label in zip(rects, labels):
    height = rect.get_height()
    ax.text(rect.get_width() + 3, rect.get_y() + rect.get_height(), label, fontsize=8)
plt.show()
My graph looks like this:
You can first add a new column sort in function f, then sort the values by column pair of websites, and finally drop_duplicates by columns used_at and sort:
import pandas as pd
import itertools

df = pd.read_csv("avito_trend.csv", parse_dates=[2])

def f(df):
    dfs = []
    i = 0
    for x in [list(x) for x in itertools.combinations(df['address'].unique(), 2)]:
        i += 1
        c1 = df.loc[df['address'].isin([x[0]]), 'ID']
        c2 = df.loc[df['address'].isin([x[1]]), 'ID']
        c = pd.Series(list(set(c1).intersection(set(c2))))
        #add inverted intersection c2 vs c1
        c_invert = pd.Series(list(set(c2).intersection(set(c1))))
        dfs.append(pd.DataFrame({'common users':len(c), 'pair of websites':' and '.join(x), 'sort': i}, index=[0]))
        #swap values in x
        x[1],x[0] = x[0],x[1]
        dfs.append(pd.DataFrame({'common users':len(c_invert), 'pair of websites':' and '.join(x), 'sort': i}, index=[0]))
    return pd.concat(dfs)

common_users = df.groupby([df['used_at'].dt.year]).apply(f).reset_index(drop=True, level=1).reset_index()
common_users = common_users.sort_values('pair of websites')
common_users = common_users.drop_duplicates(subset=['used_at','sort'])
#print common_users

graph_by_common_users = common_users.pivot(index='pair of websites', columns='used_at', values='common users')
#print graph_by_common_users

#change order of columns
graph_by_common_users = graph_by_common_users[[2015,2014]]
graph_by_common_users = graph_by_common_users.sort_values(2014, ascending=False)

ax = graph_by_common_users.plot(kind='barh', width=0.5, figsize=(10,20))
[label.set_rotation(25) for label in ax.get_xticklabels()]

rects = ax.patches
labels = [int(round(graph_by_common_users.loc[i, y])) for y in graph_by_common_users.columns.tolist() for i in graph_by_common_users.index]
for rect, label in zip(rects, labels):
    height = rect.get_height()
    ax.text(rect.get_width() + 20, rect.get_y() - 0.25 + rect.get_height(), label, fontsize=8)

#sorting values of legend
handles, labels = ax.get_legend_handles_labels()
# sort both labels and handles by labels
labels, handles = zip(*sorted(zip(labels, handles), key=lambda t: t[0]))
ax.legend(handles, labels)
My graph:
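As a side note, the duplicates can also be removed without a separate sort counter. The following is only a minimal sketch of that alternative, not the approach used in the answer above: it rebuilds a few rows of common_users by hand from the table in the question, and the column pair_key and the helper canonical_pair are hypothetical names introduced for illustration.

```python
import pandas as pd

# A few rows copied from the common_users frame shown in the question.
common_users = pd.DataFrame({
    'used_at': [2014, 2014, 2014, 2014],
    'common users': [1364, 1364, 1716, 1716],
    'pair of websites': ['avito.ru and e1.ru', 'e1.ru and avito.ru',
                         'avito.ru and drom.ru', 'drom.ru and avito.ru'],
})

def canonical_pair(label):
    """Order the two site names alphabetically so 'b and a' equals 'a and b'."""
    return ' and '.join(sorted(label.split(' and ')))

# 'pair_key' is a hypothetical helper column, not part of the original code.
common_users['pair_key'] = common_users['pair of websites'].map(canonical_pair)

# Keep one row per (year, unordered pair); the first occurrence wins.
deduped = common_users.drop_duplicates(subset=['used_at', 'pair_key'])
print(deduped[['used_at', 'common users', 'pair of websites']])
```

Because the key is order-independent, this should give the same label for a pair in every year, whichever orientation appeared first.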
EDIT:
Regarding the comment: the combinations differ between years 2014 and 2015, so 4 values are missing in the first column and 4 in the second:
used_at 2015 2014
pair of websites
avito.ru and drom.ru 1491.0 1716.0
avito.ru and auto.ru 1473.0 1602.0
avito.ru and e1.ru 1153.0 1364.0
drom.ru and auto.ru NaN 874.0
e1.ru and drom.ru 539.0 634.0
avito.ru and irr.ru/cars 403.0 602.0
avito.ru and am.ru 262.0 579.0
e1.ru and auto.ru 451.0 475.0
avito.ru and cars.mail.ru/sale 256.0 424.0
drom.ru and irr.ru/cars 277.0 423.0
auto.ru and irr.ru/cars 288.0 409.0
auto.ru and am.ru 224.0 408.0
drom.ru and am.ru 187.0 394.0
auto.ru and cars.mail.ru/sale 195.0 330.0
avito.ru and avtomarket.ru 205.0 299.0
drom.ru and cars.mail.ru/sale 189.0 292.0
drom.ru and avtomarket.ru 175.0 247.0
auto.ru and avtomarket.ru 162.0 243.0
e1.ru and irr.ru/cars 148.0 235.0
e1.ru and am.ru 99.0 224.0
am.ru and irr.ru/cars NaN 223.0
irr.ru/cars and cars.mail.ru/sale 94.0 197.0
am.ru and cars.mail.ru/sale NaN 166.0
e1.ru and cars.mail.ru/sale 105.0 154.0
e1.ru and avtomarket.ru 105.0 139.0
avtomarket.ru and irr.ru/cars NaN 139.0
avtomarket.ru and am.ru 72.0 133.0
avtomarket.ru and cars.mail.ru/sale 48.0 105.0
auto.ru and drom.ru 799.0 NaN
cars.mail.ru/sale and am.ru 73.0 NaN
irr.ru/cars and am.ru 102.0 NaN
irr.ru/cars and avtomarket.ru 73.0 NaN
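When I then create all the inverted combinations as well, the problem is solved. But why are there NaN values? Why do the combinations differ between 2014 and 2015?
I added a print to function f: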
def f(df):
    print(df['address'].unique())
    dfs = []
    i = 0
    for x in [list(x) for x in itertools.combinations((df['address'].unique()), 2)]:
        ...
        ...
and the output is (why the first group is printed twice is described in the warning here):
['avito.ru' 'e1.ru' 'drom.ru' 'auto.ru' 'avtomarket.ru' 'am.ru'
'irr.ru/cars' 'cars.mail.ru/sale']
['avito.ru' 'e1.ru' 'drom.ru' 'auto.ru' 'avtomarket.ru' 'am.ru'
'irr.ru/cars' 'cars.mail.ru/sale']
['avito.ru' 'e1.ru' 'auto.ru' 'drom.ru' 'irr.ru/cars' 'avtomarket.ru'
'cars.mail.ru/sale' 'am.ru']
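So the site lists come in a different order for each year, which makes the combinations different too, and that is why I get some NaN values.
The solution is to sort the list of sites before building the combinations.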
def f(df):
    #print (sorted(df['address'].unique()))
    dfs = []
    for x in [list(x) for x in itertools.combinations(sorted(df['address'].unique()), 2)]:
        c1 = df.loc[df['address'].isin([x[0]]), 'ID']
        ...
        ...
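To see why the order matters, here is a small standalone check using only itertools. The two site lists are copied from the printed output above; the helper pair_labels is a hypothetical name used only in this sketch.

```python
import itertools

# Site order as printed for the two year groups in the output above.
sites_2014 = ['avito.ru', 'e1.ru', 'drom.ru', 'auto.ru', 'avtomarket.ru',
              'am.ru', 'irr.ru/cars', 'cars.mail.ru/sale']
sites_2015 = ['avito.ru', 'e1.ru', 'auto.ru', 'drom.ru', 'irr.ru/cars',
              'avtomarket.ru', 'cars.mail.ru/sale', 'am.ru']

def pair_labels(sites):
    """Return the set of 'a and b' labels built from combinations of sites."""
    return {' and '.join(pair) for pair in itertools.combinations(sites, 2)}

# Without sorting, some labels exist in only one of the two years.
only_2014 = pair_labels(sites_2014) - pair_labels(sites_2015)
only_2015 = pair_labels(sites_2015) - pair_labels(sites_2014)
print(sorted(only_2014))
print(sorted(only_2015))

# After sorting the inputs, both years produce identical labels.
same_after_sort = pair_labels(sorted(sites_2014)) == pair_labels(sorted(sites_2015))
print(same_after_sort)
```

itertools.combinations emits pairs in the order of its input, so a pair whose two sites swap relative position between years gets two different labels. These are the rows that show NaN in the pivoted table.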
The full code is:
import pandas as pd
import itertools

df = pd.read_csv("avito_trend.csv", parse_dates=[2])

def f(df):
    #print (sorted(df['address'].unique()))
    dfs = []
    for x in [list(x) for x in itertools.combinations(sorted(df['address'].unique()), 2)]:
        c1 = df.loc[df['address'].isin([x[0]]), 'ID']
        c2 = df.loc[df['address'].isin([x[1]]), 'ID']
        c = pd.Series(list(set(c1).intersection(set(c2))))
        dfs.append(pd.DataFrame({'common users':len(c), 'pair of websites':' and '.join(x)}, index=[0]))
    return pd.concat(dfs)

common_users = df.groupby([df['used_at'].dt.year]).apply(f).reset_index(drop=True, level=1).reset_index()
#print common_users

graph_by_common_users = common_users.pivot(index='pair of websites', columns='used_at', values='common users')

#change order of columns
graph_by_common_users = graph_by_common_users[[2015,2014]]
graph_by_common_users = graph_by_common_users.sort_values(2014, ascending=False)
#print graph_by_common_users

ax = graph_by_common_users.plot(kind='barh', width=0.5, figsize=(10,20))
[label.set_rotation(25) for label in ax.get_xticklabels()]

rects = ax.patches
labels = [int(round(graph_by_common_users.loc[i, y]))
          for y in graph_by_common_users.columns.tolist()
          for i in graph_by_common_users.index]
for rect, label in zip(rects, labels):
    height = rect.get_height()
    ax.text(rect.get_width()+20, rect.get_y() - 0.25 + rect.get_height(), label, fontsize=8)

handles, labels = ax.get_legend_handles_labels()
# sort both labels and handles by labels
labels, handles = zip(*sorted(zip(labels, handles), key=lambda t: t[0]))
ax.legend(handles, labels)
And the graph:
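For anyone who wants to try the sorted-combinations idea without the csv, below is a compact sketch on toy data. Everything in it is an assumption made for illustration: the visits frame, its year column and the function common_users_per_pair are hypothetical, and the loop over groups is a simplified stand-in for the groupby(...).apply(f) call above, not a replacement the answer prescribes.

```python
import itertools
import pandas as pd

# Hypothetical toy data: which user ID visited which site in which year.
# The sites first appear in a different order in each year on purpose.
visits = pd.DataFrame({
    'ID': [1, 1, 2, 2, 3, 3, 1, 1, 2, 2],
    'address': ['drom.ru', 'auto.ru', 'drom.ru', 'auto.ru', 'drom.ru', 'am.ru',
                'auto.ru', 'drom.ru', 'am.ru', 'auto.ru'],
    'year': [2014] * 6 + [2015] * 4,
})

def common_users_per_pair(group):
    """Count users shared by every pair of sites within one year's rows."""
    ids_by_site = {site: set(ids) for site, ids in group.groupby('address')['ID']}
    rows = []
    # Sorting the site names gives the same pair labels in every year.
    for a, b in itertools.combinations(sorted(ids_by_site), 2):
        rows.append({'pair of websites': a + ' and ' + b,
                     'common users': len(ids_by_site[a] & ids_by_site[b])})
    return pd.DataFrame(rows)

frames = []
for year, group in visits.groupby('year'):
    counts = common_users_per_pair(group)
    counts['used_at'] = year
    frames.append(counts)
common_users = pd.concat(frames, ignore_index=True)

table = common_users.pivot(index='pair of websites', columns='used_at',
                           values='common users')
print(table)
```

With sorted site names, each pair maps to a single row of the pivoted table, so no NaN appears as long as both years contain the same sites.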