Dynamically splitting rows in a DataFrame

Posted by debugcn in Dev

mrh5028

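I need to take a CSV file and split each row in a cascading fashion. The input CSV can have a varying number of columns (always an even number), but it is always split the same way. I decided to use Pandas because the output will be 500,000 rows for some files, and I thought it would speed things up.

Input: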

h1  h2  h3  h4  h5  h6
A1  A2  A3  A4  A5  A6
B1  B2  B3  B4  B5  B6

Expected output:

h1  h2  h3  h4  h5  h6
A1  A2
A1  A2  A3  A4
A1  A2  A3  A4  A5  A6
B1  B2
B1  B2  B3  B4
B1  B2  B3  B4  B5  B6

I tried the code below (pieced together from some searching and my own edits). As you can see, it gets close, but it is not what I need.

import pandas as pd

importFile = pd.read_csv('file.csv')
df = df_importFile = pd.DataFrame(importFile)

index_cols = ['h1']
cols = [c for c in df if c not in index_cols]

df2 = df.set_index(index_cols).stack().reset_index(level=1, drop=True).to_frame('Value')

df2 = pd.concat([pd.Series([v if i % len(cols) == n else ''
                        for i, v in enumerate(df2.Value)], name=col)
             for n, col in enumerate(cols)], axis=1).set_index(df2.index)


df2.to_csv('output.csv')

This gives the following:

h1  h2  h3  h4  h5  h6
A1  A2
A1      A3
A1          A4
A1              A5
A1                  A6

Pirate

import numpy as np
import pandas as pd

# take number of columns and divide by 2
# this is the number of pairs
pairs = df.shape[1] // 2

# np.repeat builds an index that repeats each row number `pairs` times;
# indexing the dataframe array df.values with it gives an array
# of length pairs * len(df)
a = df.values[np.repeat(np.arange(df.shape[0]), pairs)]

# row values to condition with as column vector
dim0 = (np.arange(a.shape[0]) % pairs)[:, None]

# column values to condition with as row vector
dim1 = np.repeat(np.arange(pairs), 2)

# boolean mask to use in np.where generated
# via the magic of numpy broadcasting
mask = dim0 >= dim1

# QED
pd.DataFrame(np.where(mask, a, ''), columns=df.columns)
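
To check the approach against the sample data from the question, here is a minimal self-contained sketch. The helper name `cascade_rows` is hypothetical (it does not appear in the answer above), and the sketch assumes the frame has an even number of columns and holds string data.

```python
import numpy as np
import pandas as pd


def cascade_rows(df: pd.DataFrame) -> pd.DataFrame:
    """Repeat each row once per column pair, revealing one more pair each time.

    Hypothetical helper wrapping the masking approach above; assumes an
    even number of columns.
    """
    pairs = df.shape[1] // 2
    # Repeat every row `pairs` times: row 0, row 0, ..., row 1, row 1, ...
    a = df.to_numpy(dtype=object)[np.repeat(np.arange(len(df)), pairs)]
    # Position of each repeated row within its group (0 .. pairs-1), as a column vector
    step = (np.arange(a.shape[0]) % pairs)[:, None]
    # Pair index of each column (0, 0, 1, 1, ...), as a row vector
    pair_of_col = np.repeat(np.arange(pairs), 2)
    # Broadcasting: keep a cell only if its pair has been reached at this step
    mask = step >= pair_of_col
    return pd.DataFrame(np.where(mask, a, ""), columns=df.columns)


# Sample input from the question
sample = pd.DataFrame(
    [["A1", "A2", "A3", "A4", "A5", "A6"],
     ["B1", "B2", "B3", "B4", "B5", "B6"]],
    columns=["h1", "h2", "h3", "h4", "h5", "h6"],
)
result = cascade_rows(sample)
print(result.to_string(index=False))
```

To write the result out without the extra index column, `result.to_csv('output.csv', index=False)` should do.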