How to sort each row of a pandas DataFrame by value and return the column labels in sorted order

Ravi

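I am trying to sort each row of a pandas DataFrame and get the column labels of the sorted values in a new DataFrame. I have a way to do it, but it is slow. Can anyone suggest how to improve it with parallelized or vectorized code? I have posted an example below.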

import pandas as pd

data_url = 'https://raw.githubusercontent.com/resbaz/r-novice-gapminder-files/master/data/gapminder-FiveYearData.csv'

# read data from url as pandas dataframe
gapminder = pd.read_csv(data_url)

# drop the categorical columns
gapminder.drop(['country', 'continent'], axis=1, inplace=True) 

# print the first three rows
print(gapminder.head(n=3))

   year         pop  lifeExp   gdpPercap
0  1952   8425333.0   28.801  779.445314
1  1957   9240934.0   30.332  820.853030
2  1962  10267083.0   31.997  853.100710

The result I am looking for is this:

  tag_0 tag_1      tag_2    tag_3
0   pop  year  gdpPercap  lifeExp
1   pop  year  gdpPercap  lifeExp
2   pop  year  gdpPercap  lifeExp

In this case pop always comes first, because it is always larger than year, gdpPercap and lifeExp.

I can get the desired output with the code below, but it takes much longer when df has many rows or columns. Can anyone suggest an improvement?

def sort_df(df):
    sorted_tags = pd.DataFrame(index = df.index, columns = ['tag_{}'.format(i) for i in range(df.shape[1])])
    for i in range(df.shape[0]):
        sorted_tags.iloc[i,:] = list( df.iloc[i, :].sort_values(ascending=False).index)
    return sorted_tags

sort_df(gapminder)

Matthias Ossadnik

This is probably about as fast as it gets with numpy:

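import numpy as np

def sort_df(df):
    return pd.DataFrame(
        # per row, column labels ordered by descending value
        data=df.columns.values[np.argsort(-df.values, axis=1)],
        # keep the original row index, as the function in the question does
        index=df.index,
        columns=['tag_{}'.format(i) for i in range(df.shape[1])]
    )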

print(sort_df(gapminder.head(3)))

  tag_0 tag_1      tag_2    tag_3
0   pop  year  gdpPercap  lifeExp
1   pop  year  gdpPercap  lifeExp
2   pop  year  gdpPercap  lifeExp

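Explanation: np.argsort sorts the values along each row, but instead of the sorted values it returns the indices that would sort the array, which can then be used to co-sort another array. The minus sign gives descending order. In your case, you use those indices to reorder the column labels. Indexing the 1-D array of column names with the 2-D index array returns an array with the same shape as the index array.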
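To make the explanation concrete, here is a minimal sketch on a hand-made two-row array. The array and the names `columns`, `values`, `order` and `labels` are made up for illustration and are not taken from the gapminder data.

```python
import numpy as np

# hypothetical column labels and values, for illustration only
columns = np.array(['a', 'b', 'c'])
values = np.array([[1.0, 3.0, 2.0],
                   [9.0, 7.0, 8.0]])

# argsort returns, per row, the positions that would sort that row;
# negating the values turns the ascending sort into a descending one
order = np.argsort(-values, axis=1)
print(order)
# [[1 2 0]
#  [0 2 1]]

# indexing the 1-D label array with the 2-D position array
# yields a 2-D array of labels with the same shape as `order`
labels = columns[order]
print(labels)
# [['b' 'c' 'a']
#  ['a' 'c' 'b']]
```

Negating the values only works for numeric data, which is the case here once the categorical columns have been dropped.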

On your example this runs in about 3 ms, versus 2.5 s for your function.
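If you want to check the difference on your own machine, something along these lines should work. This is only a sketch: it uses random stand-in data rather than the gapminder file, the names `sort_df_loop` and `sort_df_vectorized` are made up here to keep the two versions apart, and the timings you get will depend on your hardware and pandas version.

```python
import timeit

import numpy as np
import pandas as pd


def sort_df_loop(df):
    # row-by-row version from the question
    tags = ['tag_{}'.format(i) for i in range(df.shape[1])]
    sorted_tags = pd.DataFrame(index=df.index, columns=tags)
    for i in range(df.shape[0]):
        sorted_tags.iloc[i, :] = list(df.iloc[i, :].sort_values(ascending=False).index)
    return sorted_tags


def sort_df_vectorized(df):
    # argsort-based version from the answer
    return pd.DataFrame(
        data=df.columns.values[np.argsort(-df.values, axis=1)],
        index=df.index,
        columns=['tag_{}'.format(i) for i in range(df.shape[1])],
    )


# synthetic stand-in data (an assumption): 500 rows, 4 columns of random floats
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((500, 4)),
                  columns=['year', 'pop', 'lifeExp', 'gdpPercap'])

# sanity check: both versions should produce the same labels
same = bool((sort_df_loop(df).values == sort_df_vectorized(df).values).all())
print(same)

# single-run timings, good enough for an order-of-magnitude comparison
t_loop = timeit.timeit(lambda: sort_df_loop(df), number=1)
t_vec = timeit.timeit(lambda: sort_df_vectorized(df), number=1)
print('loop: {:.4f} s, vectorized: {:.4f} s'.format(t_loop, t_vec))
```

Random floats are used so that no two values in a row are equal. With ties, the two versions may order the tied columns differently.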
