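Let me explain my problem:
Suppose I have training data and test data, and both have NaN values in the same column. My imputation strategy is to group by some column and fill each NaN with the mean of its group. Example: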
import numpy as np
import pandas as pd

x_train = pd.DataFrame({
    'Occupation': ['driver', 'driver', 'mechanic', 'teacher', 'mechanic', 'teacher',
                   'unemployed', 'driver', 'mechanic', 'teacher'],
    'salary': [100, 150, 70, 300, 90, 250, 10, 90, 110, 350],
    'expenditure': [20, 40, 10, 100, np.nan, 80, 0, np.nan, 40, 120]})
Occupation salary expenditure
0 driver 100 20.0
1 driver 150 40.0
2 mechanic 70 10.0
3 teacher 300 100.0
4 mechanic 90 NaN
5 teacher 250 80.0
6 unemployed 10 0.0
7 driver 90 NaN
8 mechanic 110 40.0
9 teacher 350 120.0
For the training data, I can do this:
x_train['expenditure'] = x_train.groupby('Occupation')['expenditure'].transform(lambda x: x.fillna(x.mean()))
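But how do I do the same for the test data, where the fill values must be the group means from the training data? I tried doing it with a for loop, but it takes forever.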
Create a Series of the group means:
mean = x_train.groupby('Occupation')['expenditure'].mean()
print (mean)
Occupation
driver 30.0
mechanic 25.0
teacher 100.0
unemployed 0.0
Name: expenditure, dtype: float64
Then replace the missing values with Series.map and Series.fillna:
x_train['expenditure'] = x_train['expenditure'].fillna(x_train['Occupation'].map(mean))
print (x_train)
Occupation salary expenditure
0 driver 100 20.0
1 driver 150 40.0
2 mechanic 70 10.0
3 teacher 300 100.0
4 mechanic 90 25.0
5 teacher 250 80.0
6 unemployed 10 0.0
7 driver 90 30.0
8 mechanic 110 40.0
9 teacher 350 120.0
Then apply the same means to the test data in the same way:
x_test['expenditure'] = x_test['expenditure'].fillna(x_test['Occupation'].map(mean))
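The line above assumes x_test already exists. As a minimal self-contained sketch, the following uses a made-up x_test (its rows and the 'nurse' occupation are hypothetical, not from the question) to show the training means being applied to test rows. An occupation that never appears in training maps to NaN, so the sketch falls back to the overall training mean. That fallback is an assumption added here, not part of the answer; another default may suit your data better.

```python
import numpy as np
import pandas as pd

x_train = pd.DataFrame({
    'Occupation': ['driver', 'driver', 'mechanic', 'teacher', 'mechanic', 'teacher',
                   'unemployed', 'driver', 'mechanic', 'teacher'],
    'salary': [100, 150, 70, 300, 90, 250, 10, 90, 110, 350],
    'expenditure': [20, 40, 10, 100, np.nan, 80, 0, np.nan, 40, 120]})

# Hypothetical test data; 'nurse' does not occur in x_train
x_test = pd.DataFrame({
    'Occupation': ['driver', 'teacher', 'nurse', 'mechanic'],
    'salary': [120, 280, 200, 80],
    'expenditure': [np.nan, 90, np.nan, np.nan]})

# Group means computed from the training data only (NaNs are skipped)
mean = x_train.groupby('Occupation')['expenditure'].mean()

# Assumed fallback: overall training mean for occupations unseen in training
overall = x_train['expenditure'].mean()

# Look up each test row's group mean, then patch unseen groups with the fallback
fill = x_test['Occupation'].map(mean).fillna(overall)
x_test['expenditure'] = x_test['expenditure'].fillna(fill)
print(x_test)
```

With these made-up rows, the filled column becomes 30.0, 90.0, 51.25, 25.0: the driver and mechanic rows receive their training group means, the existing teacher value is left alone, and the unseen 'nurse' row receives the overall training mean.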
Edit:
Solution for multiple columns: use DataFrame.join instead of map:
x_train = pd.DataFrame({
'Occupation': ['driver', 'driver', 'mechanic', 'teacher', 'mechanic', 'teacher',
'unemployed', 'driver', 'mechanic', 'teacher'],
'salary': [100, 150, 70, 300, 90, 250, 10, 90, 110, 350],
'expenditure': [20, 40, 10, 100, np.nan, 80, 0, np.nan, 40, 120],
'expenditure1': [20, 40, 10, 100, np.nan, 80, 0, np.nan, 40, 120],
'col':list('aabbddeehh')})
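# numeric_only=True skips the non-numeric 'col' column, which recent pandas versions refuse to average
mean = x_train.groupby('Occupation').mean(numeric_only=True)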
print (mean)
salary expenditure expenditure1
Occupation
driver 113.333333 30.0 30.0
mechanic 90.000000 25.0 25.0
teacher 300.000000 100.0 100.0
unemployed 10.000000 0.0 0.0
x_train = x_train.fillna(x_train[['Occupation']].join(mean, on='Occupation'))
print (x_train)
Occupation salary expenditure expenditure1 col
0 driver 100 20.0 20.0 a
1 driver 150 40.0 40.0 a
2 mechanic 70 10.0 10.0 b
3 teacher 300 100.0 100.0 b
4 mechanic 90 25.0 25.0 d
5 teacher 250 80.0 80.0 d
6 unemployed 10 0.0 0.0 e
7 driver 90 30.0 30.0 e
8 mechanic 110 40.0 40.0 h
9 teacher 350 120.0 120.0 h
x_test = x_test.fillna(x_test[['Occupation']].join(mean, on='Occupation'))
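Since the same join-and-fill step runs on both data sets, it can be wrapped in a small helper. The sketch below is one possible arrangement, not the answer's own code: the function name fill_with_group_means and the x_test rows are hypothetical, invented for illustration.

```python
import numpy as np
import pandas as pd

x_train = pd.DataFrame({
    'Occupation': ['driver', 'driver', 'mechanic', 'teacher', 'mechanic', 'teacher',
                   'unemployed', 'driver', 'mechanic', 'teacher'],
    'salary': [100, 150, 70, 300, 90, 250, 10, 90, 110, 350],
    'expenditure': [20, 40, 10, 100, np.nan, 80, 0, np.nan, 40, 120],
    'expenditure1': [20, 40, 10, 100, np.nan, 80, 0, np.nan, 40, 120],
    'col': list('aabbddeehh')})

# Hypothetical test data with gaps in several numeric columns
x_test = pd.DataFrame({
    'Occupation': ['teacher', 'driver', 'mechanic'],
    'salary': [np.nan, 95, 100],
    'expenditure': [110, np.nan, np.nan],
    'expenditure1': [np.nan, 35, np.nan],
    'col': list('abd')})


def fill_with_group_means(df, means, key):
    """Fill NaNs in df with the per-group means looked up through df[key]."""
    # join lines up each row with the means of its group and keeps df's index
    lookup = df[[key]].join(means, on=key)
    # fillna with a DataFrame fills by matching index and column labels
    return df.fillna(lookup)


# Means come from the training data only
mean = x_train.groupby('Occupation').mean(numeric_only=True)

x_train_filled = fill_with_group_means(x_train, mean, 'Occupation')
x_test_filled = fill_with_group_means(x_test, mean, 'Occupation')
print(x_test_filled)
```

The helper returns a new DataFrame rather than modifying its input, so the same mean table can be reused on any later batch of data without recomputing it.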