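I have a working script that returns a df containing the points that fall within a supplied radius. For the example df below, the function is applied to A in Label and returns the other points within the specified radius.
What is the most efficient way to pass every unique value in Label iteratively, rather than one value at a time? Code: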
import pandas as pd
import numpy as np
df = pd.DataFrame({
    'Time' : ['09:00:00.1','09:00:00.1','09:00:00.1','09:00:00.1','09:00:00.1','09:00:00.2','09:00:00.2','09:00:00.2','09:00:00.2','09:00:00.2'],
    'Label' : ['A','B','C','D','E','A','B','C','D','E'],
    'X' : [8,4,3,8,7,7,3,3,4,6],
    'Y' : [3,3,3,4,3,2,1,2,4,2],
})
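def countPoints(coordinates, ID, radius):
    """Return the rows of `coordinates` that lie within `radius` of the point labelled `ID`."""
    points = coordinates[['X', 'Y']].values
    # Pairwise differences between all points, shape (n, n, 2)
    array = points[:, None, :] - points[None, :, :]
    distance = np.linalg.norm(array, axis=2)
    # Row of the distance matrix that belongs to `ID`
    row = coordinates['Label'].eq(ID).values.argmax()
    # assign() returns a new frame, so no column is set on a slice
    df = coordinates[distance[row] <= radius].assign(Point=ID)
    return df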
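To make the distance step concrete, here is a minimal sketch of the broadcasting that the function relies on, using only the five points at Time 09:00:00.1 from the sample df. The names `labels`, `mask`, `row_a` and `neighbours_of_a` are illustrative and not part of the original script.

```python
import numpy as np

# X/Y coordinates of labels A-E at Time 09:00:00.1, taken from the sample df
labels = np.array(['A', 'B', 'C', 'D', 'E'])
points = np.array([[8, 3], [4, 3], [3, 3], [8, 4], [7, 3]])

# (n, 1, 2) - (1, n, 2) broadcasts to (n, n, 2): every pairwise difference
diff = points[:, None, :] - points[None, :, :]
distance = np.linalg.norm(diff, axis=2)   # shape (n, n)

# Row i of the mask marks the points within radius 1 of point i
mask = distance <= 1
row_a = int(np.argmax(labels == 'A'))     # position of label A
neighbours_of_a = labels[mask[row_a]].tolist()
print(neighbours_of_a)                    # ['A', 'D', 'E']
```

Each row of `mask` already holds the answer for one label, which is why the whole matrix can be reused for every label instead of being recomputed per call.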
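At the moment I apply the function to each value in Label separately and then concatenate the resulting dfs. This becomes inefficient when Label has many unique values.
Is there a way to apply it iteratively?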
# Label A
df_A = df.groupby('Time').apply(countPoints, ID = 'A', radius = 1).reset_index(drop = True)
# Label B
df_B = df.groupby('Time').apply(countPoints, ID = 'B', radius = 1).reset_index(drop = True)
# Label C
df_C = df.groupby('Time').apply(countPoints, ID = 'C', radius = 1).reset_index(drop = True)
# Combine df's
df1 = pd.concat([df_A, df_B, df_C]).sort_values(by = 'Time').reset_index(drop = True)
Expected output:
Time Label X Y Point
0 09:00:00.1 A 8 3 A
1 09:00:00.1 D 8 4 A
2 09:00:00.1 E 7 3 A
3 09:00:00.1 B 4 3 B
4 09:00:00.1 C 3 3 B
5 09:00:00.1 B 4 3 C
6 09:00:00.1 C 3 3 C
7 09:00:00.2 A 7 2 A
8 09:00:00.2 E 6 2 A
9 09:00:00.2 B 3 1 B
10 09:00:00.2 C 3 2 B
11 09:00:00.2 B 3 1 C
12 09:00:00.2 C 3 2 C
Just move pd.concat inside the countPoints function, as follows:
def countPoints(coordinates, radius):  # `ID` parameter removed since all IDs are processed
    """Return, for every label, the rows of `coordinates` within `radius` of it."""
    points = coordinates[['X', 'Y']].values
    array = points[:, None, :] - points[None, :, :]
    distance = np.linalg.norm(array, axis=2)
    # Each row of (distance <= radius) is the boolean mask for one label
    df = pd.concat([coordinates[m].assign(Point=label) for label, m in
                    zip(coordinates['Label'], (distance <= radius))],
                   ignore_index=True)
    return df
df_out = df.groupby('Time').apply(countPoints, radius = 1).reset_index(drop=True)
Out[175]:
Time Label X Y Point
0 09:00:00.1 A 8 3 A
1 09:00:00.1 D 8 4 A
2 09:00:00.1 E 7 3 A
3 09:00:00.1 B 4 3 B
4 09:00:00.1 C 3 3 B
5 09:00:00.1 B 4 3 C
6 09:00:00.1 C 3 3 C
7 09:00:00.1 A 8 3 D
8 09:00:00.1 D 8 4 D
9 09:00:00.1 A 8 3 E
10 09:00:00.1 E 7 3 E
11 09:00:00.2 A 7 2 A
12 09:00:00.2 E 6 2 A
13 09:00:00.2 B 3 1 B
14 09:00:00.2 C 3 2 B
15 09:00:00.2 B 3 1 C
16 09:00:00.2 C 3 2 C
17 09:00:00.2 D 4 4 D
18 09:00:00.2 A 7 2 E
19 09:00:00.2 E 6 2 E
The above is the output for all IDs. Your expected output only contains A, B, and C, so just slice df_out to select those 3 IDs:
df_ABC = df_out[df_out.Point.isin(['A', 'B', 'C'])].reset_index(drop=True)
Out[180]:
Time Label X Y Point
0 09:00:00.1 A 8 3 A
1 09:00:00.1 D 8 4 A
2 09:00:00.1 E 7 3 A
3 09:00:00.1 B 4 3 B
4 09:00:00.1 C 3 3 B
5 09:00:00.1 B 4 3 C
6 09:00:00.1 C 3 3 C
7 09:00:00.2 A 7 2 A
8 09:00:00.2 E 6 2 A
9 09:00:00.2 B 3 1 B
10 09:00:00.2 C 3 2 B
11 09:00:00.2 B 3 1 C
12 09:00:00.2 C 3 2 C
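If the per-label pd.concat inside the function becomes a bottleneck, one possible variation is to take all (centre, neighbour) pairs from the boolean matrix at once with np.nonzero. This is only a sketch under assumptions: the names `count_points_all`, `df_all` and `df_abc` are hypothetical, and the groups are iterated explicitly instead of using groupby.apply, since how apply treats the grouping column has changed across pandas versions.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Time': ['09:00:00.1'] * 5 + ['09:00:00.2'] * 5,
    'Label': list('ABCDE') * 2,
    'X': [8, 4, 3, 8, 7, 7, 3, 3, 4, 6],
    'Y': [3, 3, 3, 4, 3, 2, 1, 2, 4, 2],
})

def count_points_all(coordinates, radius):
    """Return, for every label in one time group, the rows within `radius` of it."""
    points = coordinates[['X', 'Y']].to_numpy()
    distance = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    # centre[k] and neighbour[k] are the row/column positions of each in-range pair
    centre, neighbour = np.nonzero(distance <= radius)
    out = coordinates.iloc[neighbour].reset_index(drop=True)
    out['Point'] = coordinates['Label'].to_numpy()[centre]
    return out

# One call per time group, then a single concat over the groups
pieces = [count_points_all(group, radius=1)
          for _, group in df.groupby('Time', sort=False)]
df_all = pd.concat(pieces, ignore_index=True)

# Keep only the three IDs from the expected output
df_abc = df_all[df_all['Point'].isin(['A', 'B', 'C'])].reset_index(drop=True)
```

Because np.nonzero walks the matrix row by row, the rows come out grouped by Point in the same order as the zip-based version above.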