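I get the above TypeError when creating subplots of different histograms.
For some background: I have a large dataset that I had to split into several separate chunks for cleaning, to avoid memory problems. I saved each chunk separately and then concatenated them in another notebook.
When I run the code to create the subplots on a chunked DataFrame, it works fine, but when I run the same subplot code on the concatenated data, I get a TypeError. I don't understand why, since I haven't really changed anything.
The error occurs here:
My full code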
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
#Overall
CRev_All_age1 = df_optimized.groupby(['YearOnboarded', 'age_buckets']).sum().reset_index()
#Europe
CRev_EU = df_optimized.loc[df_optimized['Continents'] == 'Europe']
Plot_CRev_EU_age1 = CRev_EU.groupby(['YearOnboarded', 'age_buckets']).sum().reset_index()
#Asia
CRev_Asia = df_optimized.loc[df_optimized['Continents'] == 'Asia']
Plot_CRev_Asia_age1 = CRev_Asia.groupby(['YearOnboarded', 'age_buckets']).sum().reset_index()
#Other
CRev_Other = df_optimized.loc[(df_optimized['Continents'] != 'Europe') & (df_optimized['Continents'] != 'Asia')]
Plot_CRev_Other_age1 = CRev_Other.groupby(['YearOnboarded', 'age_buckets']).sum().reset_index()
fig, axes = plt.subplots(2,2, constrained_layout=True, figsize=(14,12))
ax1, ax2, ax3, ax4 =axes.flatten()
#plot1
ax1 = sns.histplot( data=CRev_All_age1, x="YearOnboarded", hue="age_buckets",weights="Revenue2", multiple="stack", discrete=True, shrink=.9, ax=ax1)
ax1.set_title('Overall - Client Revenue (Million)', fontsize=16, fontweight='bold')
ax1.tick_params('x', labelrotation=15)
ax1.set_ylabel('Revenue', fontsize=12)
ax1.set_xlabel('Year Onboarded', fontsize=12)
#plot2
ax2 = sns.histplot( data=Plot_CRev_EU_age1, x="YearOnboarded", hue="age_buckets",weights="Revenue2", multiple="stack", discrete=True, shrink=.9, ax=ax2)
ax2.set_title('Europe - Client Revenue (Million)', fontsize=14, fontweight='bold')
plt.setp(ax2.xaxis.get_majorticklabels(), rotation=15)
ax2.set_ylabel('Revenue', fontsize=12)
ax2.set_xlabel('Year Onboarded', fontsize=12)
#plot3
ax3 = sns.histplot( data=Plot_CRev_Asia_age1, x="YearOnboarded", hue="age_buckets",weights="Revenue2", multiple="stack", discrete=True, shrink=.9, ax=ax3)
ax3.set_title('Asia - Client Revenue (Million)', fontsize=14, fontweight='bold')
for tick in ax3.get_xticklabels():
    tick.set_rotation(15)
ax3.set_ylabel('Revenue', fontsize=12)
ax3.set_xlabel('Year Onboarded', fontsize=12)
#plot4
ax4 = sns.histplot( data=Plot_CRev_Other_age1, x="YearOnboarded", hue="age_buckets",weights="Revenue2", multiple="stack", discrete=True, shrink=.9, ax=ax4)
ax4.set_title('Other Continents - Client Revenue (Million)', fontsize=14, fontweight='bold')
ax4.tick_params(labelrotation=15)
ax4.set_ylabel('Revenue', fontsize=12)
ax4.set_xlabel('Year Onboarded', fontsize=12)
plt.show()
Toy data
dataset = {'YearOnboarded': [2018,2019,2020,2016,2019,2020,2017,2019,2020,2018,2019,2020,2016,2016,2016,2017,2016,2018,2016],
'Revenue2': [100,50,25,30,40,50,60,100,20,40,100,20,5,5,8,4,10,20,8],
'age_buckets': ['18-30','30-39','40-49','50-59','18-30','30-39','40-49','50-59','18-30','30-39','40-49','50-59',
'18-30','30-39','40-49','50-59','18-30','30-39','40-49'],
'Continents': ['Europe','Asia','Africa','Africa','Other','Asia','Africa','Other','America','America','Europe','Europe',
'Other','Europe','Asia','Africa','Asia','Europe','Other']}
df_optimized = pd.DataFrame(data=dataset)
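If anyone can help me understand why this happens and how to fix it, I would really appreciate it.
Thanks!
Edit: I found the root of the problem and how to fix it. When each chunked dataset was imported into the new kernel, one column ended up with mixed data types. Converting that column with .astype('category') did not solve my problem, so I had to set the data type at import time with the dtype parameter of read_csv, and then it worked.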
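The fix described in the edit can be sketched roughly as follows. This is a minimal illustration, not the asker's actual code: the CSV contents, the variable names `chunk_a_csv`, `chunk_b_csv` and `dtypes`, and the choice of `float64` for Revenue2 are all assumptions, since the post does not say which column was mixed or which type was used.

```python
import io
import pandas as pd

# Hypothetical stand-ins for two saved chunk files (assumed contents).
chunk_a_csv = "YearOnboarded,Revenue2\n2018,100\n2019,50\n"
chunk_b_csv = "YearOnboarded,Revenue2\n2020,25\n2016,30\n"

# Assumed dtypes: declare them up front so every chunk is parsed the same way.
dtypes = {"YearOnboarded": "int64", "Revenue2": "float64"}

chunks = [
    pd.read_csv(io.StringIO(text), dtype=dtypes)
    for text in (chunk_a_csv, chunk_b_csv)
]

# Concatenate the chunks; each column keeps one consistent dtype.
combined = pd.concat(chunks, ignore_index=True)
print(combined.dtypes)
```

Declaring the dtype on import also tends to surface bad values early: a cell that cannot be parsed as the requested type should raise when the file is loaded, rather than silently turning the whole column into strings.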
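This problem probably stems from a character in Revenue2 that pandas cannot recognize as an integer when you load the data from whatever file type you used to save the chunks. Even if only one element in the column cannot be interpreted as an integer, pandas reads the entire column as object. In the example, I use the string '-', which has no integer equivalent, to stand in for that character.
If you run this code: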
import pandas as pd
import seaborn as sns
df = pd.DataFrame({'YearOnboarded': [2018,2019,2020,2016,2019,2020,2017,2019,2020,2018,2019,2020,2016,2016,2016,2017,2016,2018,2016],
'Revenue2': ["-",50,25,30,40,50,60,100,20,40,100,20,5,5,8,4,10,20,8],
'age_buckets': ['18-30','30-39','40-49','50-59','18-30','30-39','40-49','50-59','18-30','30-39','40-49','50-59',
'18-30','30-39','40-49','50-59','18-30','30-39','40-49'],
'Continents': ['Europe','Asia','Africa','Africa','Other','Asia','Africa','Other','America','America','Europe','Europe',
'Other','Europe','Asia','Africa','Asia','Europe','Other']})
df['Revenue2'] = df['Revenue2'].astype(int)
you get this error:
ValueError: invalid literal for int() with base 10: '-'
This is useful because it points to the first offending character, which you can then replace with a filler value before trying again:
df['Revenue2'] = df.Revenue2.astype(str).str.replace('-','0').astype(int)
df['Revenue2'] = df['Revenue2'].astype(int)
Ultimately, I think you should be able to remove all the invalid characters and end up with a column that contains only integers.
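If there are several different invalid values, fixing them one error at a time gets tedious. One alternative, which is not part of the original answer, is to let `pd.to_numeric` with `errors='coerce'` flag every unparseable entry at once. The sketch below uses a shortened, made-up Revenue2 column, and the choice of 0 as the filler value is an assumption.

```python
import pandas as pd

# Shortened, hypothetical column mixing integers, a numeric string and '-'.
df = pd.DataFrame({"Revenue2": ["-", 50, 25, "30", 40]})

# Entries that cannot be parsed as numbers become NaN instead of raising.
numeric = pd.to_numeric(df["Revenue2"], errors="coerce")

# List the original values that failed to parse, so they can be inspected.
bad_values = df.loc[numeric.isna() & df["Revenue2"].notna(), "Revenue2"]
print(bad_values.unique())

# Assumed policy: treat unparseable entries as 0, then cast to int.
df["Revenue2"] = numeric.fillna(0).astype(int)
print(df["Revenue2"].tolist())
```

Unlike a blanket string replacement of '-', this leaves negative numbers such as '-5' intact, because only values that fail to parse as a whole are replaced.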