Pandas: build a new DataFrame from all combinations of the values in a column


Background

I have been collecting bus location data for a while and want to build a model that predicts when a bus will arrive at a particular station.

In its simplest form, I have a DataFrame like this:

import pandas as pd

df = pd.DataFrame({'station': ['Station 1', 'Station 2', 'Station 3', 'Station 4'], 
                    'arrival_time': ['10:00', '10:02', '10:03', '10:05']})
print(df)

     station arrival_time
0  Station 1        10:00
1  Station 2        10:02
2  Station 3        10:03
3  Station 4        10:05

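I want to map the arrival time at each station to the arrival time at every later station on the same trip. The expected output looks like this: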

  station_prev arrival_time_prev station_next arrival_time_next
0    Station 1             10:00    Station 2             10:02
1    Station 2             10:02    Station 3             10:03
2    Station 3             10:03    Station 4             10:05
3    Station 1             10:00    Station 3             10:03
4    Station 2             10:02    Station 4             10:05
5    Station 1             10:00    Station 4             10:05

I have experimented with df.shift(), and the following works for a single DataFrame:

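import pandas as pd
import numpy as np

def combos(df):
    
    columns_prev = np.array(df.columns) + '_prev'
    columns_next = np.array(df.columns) + '_next'
 
    # Collect one frame per shift; DataFrame.append was removed in pandas 2.0.
    frames = []
    
    for i in range(1, df.shape[0]):
        # Shift by i rows so each row lines up with the row i positions earlier.
        df_prev = df.shift(i)
        df_prev.columns = columns_prev
        df_next = df.copy()
        df_next.columns = columns_next
        # Rows without an earlier partner contain NaN and are dropped.
        combo = pd.concat([df_prev, df_next], axis=1).dropna()
        frames.append(combo)
    
    # Concatenate once at the end; a single-row input yields an empty frame.
    return pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()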

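However, it is rather slow on larger DataFrames, and it breaks regularly when I try to wrap it in a larger function that aggregates data from multiple trips (I keep getting KeyErrors and cannot work out why). Any ideas on how to do this more elegantly, efficiently, and reliably? Thanks in advance!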

cs95

Convert "station" to an ordered categorical column (stored here as its integer codes):

df['station'] = pd.Categorical(df['station'], ordered=True).codes
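One caveat worth checking before relying on the codes: when no categories are passed, pd.Categorical uses the sorted unique values, and strings sort lexically. The sketch below uses a made-up three-stop route (the station names are illustrative, not from the question) to show how 'Station 10' would land before 'Station 2', and how passing the route order explicitly avoids that.

```python
import pandas as pd

# Hypothetical route, listed in the order the bus visits the stops.
stations = ['Station 1', 'Station 2', 'Station 10']

# Default behaviour: categories are the sorted unique strings,
# so 'Station 10' sorts before 'Station 2'.
default_codes = pd.Categorical(stations, ordered=True).codes

# Passing the categories explicitly keeps the route order.
route_codes = pd.Categorical(stations, categories=stations, ordered=True).codes

print(list(default_codes))  # [0, 2, 1]
print(list(route_codes))    # [0, 1, 2]
```

With four single-digit station names, as in the question, both variants give the same codes, so this only matters for longer routes or names that do not sort in travel order.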

You can now do a cross join and filter:

tmp = df.assign(key=1)
(tmp.merge(tmp, on='key', suffixes=('_prev', '_next'))
    .drop(columns='key')   # positional axis argument was removed in pandas 2.0
    .query('station_prev < station_next'))

    station_prev arrival_time_prev  station_next arrival_time_next
1              0             10:00             1             10:02
2              0             10:00             2             10:03
3              0             10:00             3             10:05
6              1             10:02             2             10:03
7              1             10:02             3             10:05
11             2             10:03             3             10:05
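The output above shows integer codes in place of the station names. As a rough sketch of a variant that keeps the names and also handles several trips, the code below orders stops by row position instead of by category code. It assumes pandas 1.2 or later (for how='cross'), that rows are already in route order, and, for the multi-trip case, a hypothetical trip_id column that is not in the original data. The function names stop_pairs and stop_pairs_by_trip are likewise made up for illustration.

```python
import pandas as pd

def stop_pairs(df):
    """Pair every stop with each later stop of a single trip."""
    # Assumption: rows are in route order, so the row position is the stop order.
    tmp = df.assign(order=range(len(df)))
    pairs = tmp.merge(tmp, how='cross', suffixes=('_prev', '_next'))
    pairs = pairs[pairs['order_prev'] < pairs['order_next']]
    return pairs.drop(columns=['order_prev', 'order_next']).reset_index(drop=True)

def stop_pairs_by_trip(trips, trip_col='trip_id'):
    """Same pairing, but only between stops that share a trip id."""
    # Number the stops within each trip (rows assumed to be in route order per trip).
    tmp = trips.assign(order=trips.groupby(trip_col).cumcount())
    # Self-join on the trip id: each stop meets every stop of the same trip.
    pairs = tmp.merge(tmp, on=trip_col, suffixes=('_prev', '_next'))
    pairs = pairs[pairs['order_prev'] < pairs['order_next']]
    return pairs.drop(columns=['order_prev', 'order_next']).reset_index(drop=True)

df = pd.DataFrame({'station': ['Station 1', 'Station 2', 'Station 3', 'Station 4'],
                   'arrival_time': ['10:00', '10:02', '10:03', '10:05']})
single = stop_pairs(df)

# Two illustrative trips: 'A' visits all four stops, 'B' only the first two.
trips = pd.concat([df.assign(trip_id='A'), df.iloc[:2].assign(trip_id='B')],
                  ignore_index=True)
multi = stop_pairs_by_trip(trips)

print(single)
print(multi)
```

Joining on the trip id keeps the intermediate result at the sum of squared trip lengths rather than the square of the total row count, which may matter once many trips are combined.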
