Applying a groupby apply function iteratively

Posted by debugcn in Dev

强子

I have a working script that returns a df containing the points that lie within a given radius. An example df is below.

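Currently, the function is applied to Label A and returns the other points within the specified radius.
What is the most efficient way to apply this function to every unique value in Label, rather than passing one value at a time?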

Code:

import pandas as pd
import numpy as np

df = pd.DataFrame({
        'Time' : ['09:00:00.1','09:00:00.1','09:00:00.1','09:00:00.1','09:00:00.1','09:00:00.2','09:00:00.2','09:00:00.2','09:00:00.2','09:00:00.2'],                 
        'Label' : ['A','B','C','D','E','A','B','C','D','E'],                 
        'X' : [8,4,3,8,7,7,3,3,4,6],
        'Y' : [3,3,3,4,3,2,1,2,4,2],
        })

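def countPoints(coordinates, ID, radius):
    """Return the rows of `coordinates` that lie within `radius` of the point labelled `ID`."""

    points = coordinates[['X', 'Y']].values

    # Pairwise differences via broadcasting: shape (n, n, 2)
    array = points[:, None, :] - points

    # Euclidean distance between every pair of points: shape (n, n)
    distance = np.linalg.norm(array, axis=2)

    # Row of the distance matrix that belongs to the first point labelled ID
    row = coordinates['Label'].eq(ID).values.argmax()

    # Copy the slice so that adding a column does not trigger SettingWithCopyWarning
    df = coordinates[distance[row] <= radius].copy()

    df['Point'] = ID

    return df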

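At the moment I apply the function separately to each value in Label and then concatenate the resulting dfs. This becomes inefficient when Label has many unique values.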

Is there a way to apply it iteratively?

# Label A
df_A = df.groupby('Time').apply(countPoints, ID = 'A', radius = 1).reset_index(drop = True)

# Label B
df_B = df.groupby('Time').apply(countPoints, ID = 'B', radius = 1).reset_index(drop = True)

# Label C
df_C = df.groupby('Time').apply(countPoints, ID = 'C', radius = 1).reset_index(drop = True)

# Combine df's
df1 = pd.concat([df_A, df_B, df_C]).sort_values(by = 'Time').reset_index(drop = True)

Expected output:

          Time Label  X  Y Point
0   09:00:00.1     A  8  3     A
1   09:00:00.1     D  8  4     A
2   09:00:00.1     E  7  3     A
3   09:00:00.1     B  4  3     B
4   09:00:00.1     C  3  3     B
5   09:00:00.1     B  4  3     C
6   09:00:00.1     C  3  3     C
7   09:00:00.2     A  7  2     A
8   09:00:00.2     E  6  2     A
9   09:00:00.2     B  3  1     B
10  09:00:00.2     C  3  2     B
11  09:00:00.2     B  3  1     C
12  09:00:00.2     C  3  2     C

Andy L.

Just move pd.concat inside the countPoints function, as follows:

def countPoints(coordinates, radius):  # parameter `ID` removed, since the function now handles every ID
    """Create df that returns coordinates within unique id radius."""

    points = coordinates[['X', 'Y']].values

    # Pairwise differences via broadcasting: shape (n, n, 2)
    array = points[:, None, :] - points

    distance = np.linalg.norm(array, axis=2)

    # Each row m of the boolean matrix marks the points within `radius` of one label
    df = pd.concat([coordinates[m].assign(Point=label) for label, m in
                    zip(coordinates['Label'], (distance <= radius))],
                   ignore_index=True)

    return df


df_out = df.groupby('Time').apply(countPoints, radius = 1).reset_index(drop=True)

Out[175]:
          Time Label  X  Y Point
0   09:00:00.1     A  8  3     A
1   09:00:00.1     D  8  4     A
2   09:00:00.1     E  7  3     A
3   09:00:00.1     B  4  3     B
4   09:00:00.1     C  3  3     B
5   09:00:00.1     B  4  3     C
6   09:00:00.1     C  3  3     C
7   09:00:00.1     A  8  3     D
8   09:00:00.1     D  8  4     D
9   09:00:00.1     A  8  3     E
10  09:00:00.1     E  7  3     E
11  09:00:00.2     A  7  2     A
12  09:00:00.2     E  6  2     A
13  09:00:00.2     B  3  1     B
14  09:00:00.2     C  3  2     B
15  09:00:00.2     B  3  1     C
16  09:00:00.2     C  3  2     C
17  09:00:00.2     D  4  4     D
18  09:00:00.2     A  7  2     E
19  09:00:00.2     E  6  2     E

The above is the output for all IDs, whereas your expected output covers only A, B, and C. So just slice df_out to select those 3 IDs:

df_ABC = df_out[df_out.Point.isin(['A', 'B', 'C'])].reset_index(drop=True)

Out[180]:
          Time Label  X  Y Point
0   09:00:00.1     A  8  3     A
1   09:00:00.1     D  8  4     A
2   09:00:00.1     E  7  3     A
3   09:00:00.1     B  4  3     B
4   09:00:00.1     C  3  3     B
5   09:00:00.1     B  4  3     C
6   09:00:00.1     C  3  3     C
7   09:00:00.2     A  7  2     A
8   09:00:00.2     E  6  2     A
9   09:00:00.2     B  3  1     B
10  09:00:00.2     C  3  2     B
11  09:00:00.2     B  3  1     C
12  09:00:00.2     C  3  2     C
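
As a possible variation on the approach above, the selection of IDs could happen inside the function, so that only the requested labels are used as centres and no per-label pd.concat is needed. The following is a minimal sketch, not the answer's own method: the function name `count_points_for` and the `ids` parameter are hypothetical names introduced here for illustration. It iterates over the groups directly instead of calling `groupby.apply`, on the assumption that this keeps the `Time` column regardless of how the installed pandas version treats grouping columns inside `apply`.

```python
import numpy as np
import pandas as pd


def count_points_for(coordinates, radius, ids=None):
    """For each selected label, return the rows lying within `radius` of it.

    `ids` is a hypothetical parameter: an iterable of labels to use as
    centres, or None to use every row as a centre.
    """
    points = coordinates[['X', 'Y']].to_numpy()
    labels = coordinates['Label'].to_numpy()

    # Pairwise differences via broadcasting: shape (n, n, 2).
    diff = points[:, None, :] - points[None, :, :]
    # within[i, j] is True when row j lies within `radius` of row i.
    within = np.linalg.norm(diff, axis=2) <= radius

    # Boolean mask of the rows that act as centres.
    if ids is None:
        keep = np.ones(len(labels), dtype=bool)
    else:
        keep = np.isin(labels, list(ids))

    # Indices of every (centre, neighbour) pair, in row-major order.
    centre_idx, neighbour_idx = np.nonzero(within[keep])

    return (coordinates.iloc[neighbour_idx]
            .assign(Point=labels[keep][centre_idx])
            .reset_index(drop=True))


df = pd.DataFrame({
    'Time': ['09:00:00.1'] * 5 + ['09:00:00.2'] * 5,
    'Label': ['A', 'B', 'C', 'D', 'E'] * 2,
    'X': [8, 4, 3, 8, 7, 7, 3, 3, 4, 6],
    'Y': [3, 3, 3, 4, 3, 2, 1, 2, 4, 2],
})

# Apply the function to each Time group and stack the pieces.
result = pd.concat(
    [count_points_for(group, radius=1, ids=['A', 'B', 'C'])
     for _, group in df.groupby('Time')],
    ignore_index=True,
)
print(result)
```

With the example df, `result` should hold the same 13 rows as df_ABC. Passing `ids=None` would instead use every label as a centre, as df_out does.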
