我有一张看起来像这样的桌子。完整的示例代码和源数据在底部,包括提供的两个答案。
id technology question response
0 subj1 technology1 Q1 3
1 subj1 technology2 Q1 4
...
10 subj1 technology3 Q3 6
11 subj1 technology4 Q3 2
12 subj1 technology4 Q4 7
13 subj1 technology3 Q4 5
14 subj1 technology1 Q4 5
15 subj1 technology2 Q4 9
16 subj2 technology2 Q1 1
17 subj2 technology1 Q1 4
...
29 subj2 technology3 Q4 0
我想要的是一个表格,其中“问题”列的不同值本身成为列,问题列单元格中的值是给定主题和技术的给定问题的响应值,如下所示(只是一个例子):
id technology Q1 Q2 Q3 Q4
0 subj1 technology1 3 3 2 1
1 subj1 technology2 4 4 3 1
...
10 subj1 technology3 6 3 7 2
...
16 subj2 technology2 4 5 7 3
如果我像这样旋转表格,我可以接近它(请注意下面完整代码中的改进版本,基于目前的建议):
source_data_df_pvt1 = pd.pivot_table(source_data_df, index = ['id'],
columns = ['technology', 'question'],
values = 'response', aggfunc='first')
这给了我这个多维数据框:
technology technology1 technology2 technology3 technology4 technology5
question Q1 Q2 Q3 Q4 Q1 Q2 Q3 Q4 Q1 Q2 Q3 Q4 Q1 Q2 Q3 Q4 Q1
id
subj1 3.0 9.0 7.0 5.0 4.0 5.0 3.0 9.0 3.0 8.0 6.0 5.0 5.0 8.0 2.0 7.0 NaN
subj2 4.0 9.0 8.0 7.0 1.0 5.0 8.0 20.0 20.0 9.0 4.0 0.0 3.0 0.0 8.0 6.0 NaN
subj3 14.0 NaN 10.0 0.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 15.0
subj4 13.0 4.0 5.0 11.0 17.0 NaN NaN NaN 0.0 NaN NaN NaN 0.0 NaN NaN NaN NaN
subj5 3.0 20.0 4.0 8.0 2.0 20.0 3.0 2.0 3.0 5.0 7.0 5.0 4.0 2.0 7.0 5.0 NaN
subj6 2.0 8.0 1.0 6.0 0.0 7.0 4.0 1.0 20.0 6.0 1.0 0.0 6.0 8.0 7.0 3.0 NaN
我不希望我的数据框是多维的,我只想要一个单维表。
熊猫可以做到这一点吗?
完整的示例代码和输出,包括工作解决方案:
import pandas as pd
import numpy as np
pd.set_option('display.width', 1000)
#See https://gist.github.com/NathanDotTo/e506c1946c23234d2c24a2bd27e570a0#file-technology_skills-csv
sample_data = "../test_data/technology_skills.csv"
#The sample data has these columns:
column_names = ["id", "technology", "question", "response"]
source_data_df = pd.read_csv(sample_data, names = column_names, header = None)
#Change response to be numeric
pd.to_numeric(source_data_df.response)
#Strip white spaces from questions
source_data_df['question'] = source_data_df['question'].str.strip()
#Pivot the table to create columns from response values for each question
source_data_df_pvt1 = pd.pivot_table(source_data_df, index = ['id','technology'],
columns = [ 'question'],
values = 'response',
aggfunc='first',
fill_value=np.nan).reset_index()
print('source_data_df_pvt1 *******************')
print (source_data_df_pvt1)
source_data_df_pvt2 = source_data_df.pivot_table(index=['id','technology'],
columns='question',
values='response',
aggfunc='sum',
fill_value=np.nan).reset_index()
print('source_data_df_pvt2 *******************')
print (source_data_df_pvt2)
结果如下:
source_data_df_pvt1 *******************
question id technology Q1 Q2 Q3 Q4
0 subj1 technology1 3 9.0 7.0 5.0
1 subj1 technology2 4 5.0 3.0 9.0
2 subj1 technology3 3 8.0 6.0 5.0
3 subj1 technology4 5 8.0 2.0 7.0
4 subj2 technology1 4 9.0 8.0 7.0
5 subj2 technology2 1 5.0 8.0 20.0
6 subj2 technology3 20 9.0 4.0 0.0
7 subj2 technology4 3 0.0 8.0 6.0
8 subj3 technology1 14 NaN 10.0 0.0
9 subj3 technology5 15 NaN NaN NaN
10 subj4 technology1 13 4.0 5.0 11.0
11 subj4 technology2 17 NaN NaN NaN
12 subj4 technology3 0 NaN NaN NaN
13 subj4 technology4 0 NaN NaN NaN
14 subj5 technology1 3 20.0 4.0 8.0
15 subj5 technology2 2 20.0 3.0 2.0
16 subj5 technology3 3 5.0 7.0 5.0
17 subj5 technology4 4 2.0 7.0 5.0
18 subj6 technology1 2 8.0 1.0 6.0
19 subj6 technology2 0 7.0 4.0 1.0
20 subj6 technology3 20 6.0 1.0 0.0
21 subj6 technology4 6 8.0 7.0 3.0
source_data_df_pvt2 *******************
question id technology Q1 Q2 Q3 Q4
0 subj1 technology1 3 9.0 7.0 5.0
1 subj1 technology2 4 5.0 3.0 9.0
2 subj1 technology3 3 8.0 6.0 5.0
3 subj1 technology4 5 8.0 2.0 7.0
4 subj2 technology1 4 9.0 8.0 7.0
5 subj2 technology2 1 5.0 8.0 20.0
6 subj2 technology3 20 9.0 4.0 0.0
7 subj2 technology4 3 0.0 8.0 6.0
8 subj3 technology1 14 NaN 10.0 0.0
9 subj3 technology5 15 NaN NaN NaN
10 subj4 technology1 13 4.0 5.0 11.0
11 subj4 technology2 17 NaN NaN NaN
12 subj4 technology3 0 NaN NaN NaN
13 subj4 technology4 0 NaN NaN NaN
14 subj5 technology1 3 20.0 4.0 8.0
15 subj5 technology2 2 20.0 3.0 2.0
16 subj5 technology3 3 5.0 7.0 5.0
17 subj5 technology4 4 2.0 7.0 5.0
18 subj6 technology1 2 8.0 1.0 6.0
19 subj6 technology2 0 7.0 4.0 1.0
20 subj6 technology3 20 6.0 1.0 0.0
21 subj6 technology4 6 8.0 7.0 3.0
作为奖励,就使用的简单性而言,这正是我想要的。它对两种枢轴样式的工作方式相同。
for row in source_data_df_pvt1.itertuples():
print(row)
print(row.id)
print(row.technology)
print(row.Q1)
print(row.Q2)
print(row.Q3)
print(row.Q4)
df = (source_data_df.set_index(['id','technology','question'])['response']
.unstack(fill_value=0)
.reset_index())
但如果出现错误:
ValueError:索引包含重复条目,无法重塑
这意味着在三元组id
, technology
, 中有重复项question
,因此有必要通过first
以下方式删除重复项或聚合:
source_data_df = source_data_df.drop_duplicates(['id','technology','question'])
df = (source_data_df.set_index(['id','technology','question'])['response']
.unstack(fill_value=0)
.reset_index())
什么是相同的:
df = pd.pivot_table(source_data_df, index = ['id','technology'],
columns = [ 'question'],
values = 'response',
aggfunc='first',
fill_value=0).reset_index()
print (df)
question id technology Q1 Q3 Q4
0 subj1 technology1 3 0 5
1 subj1 technology2 4 0 9
2 subj1 technology3 0 6 5
3 subj1 technology4 0 2 7
4 subj2 technology1 4 0 0
5 subj2 technology2 1 0 0
6 subj2 technology3 0 0 0
本文收集自互联网,转载请注明来源。
如有侵权,请联系[email protected] 删除。
我来说两句