没有多维表的熊猫从长到宽

内森·索瓦茨基

我有一张看起来像这样的桌子。完整的示例代码和源数据在底部,包括提供的两个答案。

           id   technology     question  response
0        subj1  technology1       Q1         3  
1        subj1  technology2       Q1         4
...
10       subj1  technology3      Q3          6
11       subj1  technology4       Q3         2
12       subj1  technology4       Q4         7
13       subj1  technology3       Q4         5
14       subj1  technology1       Q4         5
15       subj1  technology2       Q4         9
16       subj2  technology2       Q1         1
17       subj2  technology1       Q1         4
...
29       subj2  technology3       Q4         0

我想要的是一个表格,其中“问题”列的不同值本身成为列,问题列单元格中的值是给定主题和技术的给定问题的响应值,如下所示(只是一个例子):

         id      technology      Q1 Q2  Q3  Q4
0        subj1  technology1      3  3   2   1
1        subj1  technology2      4  4   3   1
...
10       subj1  technology3      6  3   7   2
...
16       subj2  technology2      4  5   7   3

如果我像这样旋转表格,我可以接近它(请注意下面完整代码中的改进版本,基于目前的建议):

source_data_df_pvt1 = pd.pivot_table(source_data_df, index = ['id'],  
                                    columns = ['technology', 'question'], 
                                    values = 'response', aggfunc='first')

这给了我这个多维数据框:

technology technology1                   technology2                      technology3                technology4                technology5
question            Q1   Q2    Q3     Q4          Q1    Q2   Q3    Q4            Q1  Q2   Q3    Q4          Q1   Q2   Q3   Q4          Q1
id                                                                                                                                     
subj1              3.0   9.0   7.0   5.0         4.0   5.0  3.0   9.0           3.0  8.0  6.0  5.0         5.0  8.0  2.0  7.0         NaN
subj2              4.0   9.0   8.0   7.0         1.0   5.0  8.0  20.0        20.0  9.0  4.0  0.0         3.0  0.0  8.0  6.0         NaN
subj3             14.0   NaN  10.0   0.0         NaN   NaN  NaN   NaN         NaN  NaN  NaN  NaN         NaN  NaN  NaN  NaN        15.0
subj4             13.0   4.0   5.0  11.0        17.0   NaN  NaN   NaN         0.0  NaN  NaN  NaN         0.0  NaN  NaN  NaN         NaN
subj5              3.0  20.0   4.0   8.0         2.0  20.0  3.0   2.0         3.0  5.0  7.0  5.0         4.0  2.0  7.0  5.0         NaN
subj6              2.0   8.0   1.0   6.0         0.0   7.0  4.0   1.0        20.0  6.0  1.0  0.0         6.0  8.0  7.0  3.0         NaN

我不希望我的数据框是多维的,我只想要一个单维表。

熊猫可以做到这一点吗?

完整的示例代码和输出,包括工作解决方案:

import pandas as pd
import numpy as np

pd.set_option('display.width', 1000)

#See https://gist.github.com/NathanDotTo/e506c1946c23234d2c24a2bd27e570a0#file-technology_skills-csv
sample_data = "../test_data/technology_skills.csv"

#The sample data has these columns:
column_names = ["id", "technology", "question", "response"]

source_data_df = pd.read_csv(sample_data, names = column_names, header = None)

#Change response to be numeric
pd.to_numeric(source_data_df.response)

#Strip white spaces from questions
source_data_df['question'] = source_data_df['question'].str.strip()

#Pivot the table to create columns from response values for each question     

source_data_df_pvt1 = pd.pivot_table(source_data_df, index = ['id','technology'],  
                                columns = [ 'question'], 
                                values = 'response', 
                                aggfunc='first',
                                fill_value=np.nan).reset_index()

print('source_data_df_pvt1 *******************')
print (source_data_df_pvt1)

source_data_df_pvt2 = source_data_df.pivot_table(index=['id','technology'],
                                 columns='question',
                                 values='response',
                                 aggfunc='sum',
                                 fill_value=np.nan).reset_index()

print('source_data_df_pvt2 *******************')
print (source_data_df_pvt2)

结果如下:

source_data_df_pvt1 *******************
question     id   technology  Q1    Q2    Q3    Q4
0         subj1  technology1   3   9.0   7.0   5.0
1         subj1  technology2   4   5.0   3.0   9.0
2         subj1  technology3   3   8.0   6.0   5.0
3         subj1  technology4   5   8.0   2.0   7.0
4         subj2  technology1   4   9.0   8.0   7.0
5         subj2  technology2   1   5.0   8.0  20.0
6         subj2  technology3  20   9.0   4.0   0.0
7         subj2  technology4   3   0.0   8.0   6.0
8         subj3  technology1  14   NaN  10.0   0.0
9         subj3  technology5  15   NaN   NaN   NaN 
10        subj4  technology1  13   4.0   5.0  11.0
11        subj4  technology2  17   NaN   NaN   NaN
12        subj4  technology3   0   NaN   NaN   NaN
13        subj4  technology4   0   NaN   NaN   NaN
14        subj5  technology1   3  20.0   4.0   8.0
15        subj5  technology2   2  20.0   3.0   2.0
16        subj5  technology3   3   5.0   7.0   5.0
17        subj5  technology4   4   2.0   7.0   5.0
18        subj6  technology1   2   8.0   1.0   6.0
19        subj6  technology2   0   7.0   4.0   1.0
20        subj6  technology3  20   6.0   1.0   0.0
21        subj6  technology4   6   8.0   7.0   3.0

source_data_df_pvt2 *******************
question     id   technology  Q1    Q2    Q3    Q4
0         subj1  technology1   3   9.0   7.0   5.0
1         subj1  technology2   4   5.0   3.0   9.0
2         subj1  technology3   3   8.0   6.0   5.0
3         subj1  technology4   5   8.0   2.0   7.0
4         subj2  technology1   4   9.0   8.0   7.0
5         subj2  technology2   1   5.0   8.0  20.0
6         subj2  technology3  20   9.0   4.0   0.0
7         subj2  technology4   3   0.0   8.0   6.0
8         subj3  technology1  14   NaN  10.0   0.0
9         subj3  technology5  15   NaN   NaN   NaN
10        subj4  technology1  13   4.0   5.0  11.0
11        subj4  technology2  17   NaN   NaN   NaN
12        subj4  technology3   0   NaN   NaN   NaN
13        subj4  technology4   0   NaN   NaN   NaN
14        subj5  technology1   3  20.0   4.0   8.0
15        subj5  technology2   2  20.0   3.0   2.0
16        subj5  technology3   3   5.0   7.0   5.0
17        subj5  technology4   4   2.0   7.0   5.0
18        subj6  technology1   2   8.0   1.0   6.0
19        subj6  technology2   0   7.0   4.0   1.0
20        subj6  technology3  20   6.0   1.0   0.0
21        subj6  technology4   6   8.0   7.0   3.0

作为奖励,就使用的简单性而言,这正是我想要的。它对两种枢轴样式的工作方式相同。

for row in source_data_df_pvt1.itertuples():
    print(row)
    print(row.id)
    print(row.technology)
    print(row.Q1)
    print(row.Q2)
    print(row.Q3)
    print(row.Q4)
耶斯列

使用set_indexunstack

df = (source_data_df.set_index(['id','technology','question'])['response']
                   .unstack(fill_value=0)
                   .reset_index())

但如果出现错误:

ValueError:索引包含重复条目,无法重塑

这意味着在三元组id, technology, 中有重复项question,因此有必要通过first以下方式删除重复项或聚合

source_data_df = source_data_df.drop_duplicates(['id','technology','question'])
df = (source_data_df.set_index(['id','technology','question'])['response']
                   .unstack(fill_value=0)
                   .reset_index())

什么是相同的:

df = pd.pivot_table(source_data_df, index = ['id','technology'],  
                                    columns = [ 'question'], 
                                    values = 'response', 
                                    aggfunc='first',
                                    fill_value=0).reset_index()
print (df)
question     id   technology  Q1  Q3  Q4
0         subj1  technology1   3   0   5
1         subj1  technology2   4   0   9
2         subj1  technology3   0   6   5
3         subj1  technology4   0   2   7
4         subj2  technology1   4   0   0
5         subj2  technology2   1   0   0
6         subj2  technology3   0   0   0

本文收集自互联网,转载请注明来源。

如有侵权,请联系[email protected] 删除。

编辑于
0

我来说两句

0条评论
登录后参与评论

相关文章

来自分类Dev

Python熊猫从长到宽

来自分类Dev

从长到宽没有id.var的形式?

来自分类Dev

从长到宽的数据。大熊猫

来自分类Dev

大熊猫,长到宽窗

来自分类Dev

使用熊猫将数据框从长到宽转换-单行输出

来自分类Dev

在大熊猫中,我如何从长到宽?

来自分类Dev

BigQuery SQL一种从长到宽表的热编码

来自分类Dev

R:从长到宽换位并聚集具有匹配ID的行

来自分类Dev

R长到宽,带有一对圆柱

来自分类Dev

具有选定列的长到宽数据框

来自分类Dev

与dplyr长到宽

来自分类Dev

长到宽-重复列

来自分类Dev

grep 输出从长到宽

来自分类Dev

r 从长到宽重塑

来自分类Dev

填充每个类别的缺失值后,如何重塑从长到宽的数据框熊猫?

来自分类Dev

从长到宽重塑数据帧

来自分类Dev

从长到宽重塑面板数据

来自分类Dev

从R到NA的长到宽

来自分类Dev

HIVE如何从长到宽的结构?

来自分类Dev

从长到宽塑造R挑战

来自分类Dev

数据框长到宽格式

来自分类Dev

从长到宽整形并创建具有二进制值的列

来自分类Dev

R中具有非唯一键组合的长到宽

来自分类Dev

是否有一种通过计数类别数将长到宽转换的整洁方法?

来自分类Dev

有条件地填充缺失值,同时重塑R中从长到宽的数据集

来自分类Dev

在大多数列具有多个值的地方从长到宽重塑

来自分类Dev

长到宽 - 将一列(具有多个度量)转换为一对列

来自分类Dev

R:有没有一种方法可以对杂乱的数据进行排序,这些杂乱的数据会从长到宽旋转,并随着它们在变量之间移动而进入一个逻辑键:值列?

来自分类Dev

没有分组索引的熊猫数据透视表