The sample code below has three nested for loops:
import numpy as np
import pandas as pd
#Generating a sample (ndarray) of 25 particles with 3 random coordinates in the range between 0 and 3.
#Maybe think of the particles as contained in a cube of 3 x 3 x 3 units.
sample_data = np.random.uniform(0, 3, (25,3))
#Converting the ndarray into a dataframe
df = pd.DataFrame(data = sample_data, columns = ['A', 'B', 'C'])
print(df)
#Generating another ndarray which will store the number of particles in each cell of the cube
#Each cell has dimensions 1 x 1 x 1; total cells = 27
counts_in_cells = np.empty((3, 3, 3))
counts_in_cells[:] = np.nan
#Three nested loops to count the number of particles in each cell
for i, x_low in enumerate(np.arange(0, 3, 1)):
    for j, y_low in enumerate(np.arange(0, 3, 1)):
        for k, z_low in enumerate(np.arange(0, 3, 1)):
            #Specifying filtering conditions for the three dimensions of the cells
            x_condition = (df['A'] >= x_low) & (df['A'] < (x_low + 1))
            y_condition = (df['B'] >= y_low) & (df['B'] < (y_low + 1))
            z_condition = (df['C'] >= z_low) & (df['C'] < (z_low + 1))
            #Applying the filtering conditions
            df_select = df[x_condition & y_condition & z_condition]
            #Counting the particles in cells (desired outcome)
            counts_in_cells[i][j][k] = len(df_select)
#Particles in each cell
print(counts_in_cells)
Sample input
Desired output
Quick run
The sample code can be run immediately on Kaggle: https://www.kaggle.com/awaismirza/counting-particles-in-each-cell-of-a-cube
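Question
I want to avoid the three loops, because the actual version of this code takes several minutes to run. (It has 90k particles and a larger cube.) On top of that, I have to run the actual code 6k times, which would take many days.
Is there a way (a pandas function, NumPy masking, etc.) to avoid the loops and make the code run faster?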
Original code
The actual version of the code is available here, but the sample code above should be enough to understand the problem.
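Since your actual data differ considerably from this example, more processing is needed. Take this code with a grain of salt and check the results yourself.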
import numpy as np
import pandas as pd
# Reading a data file
df_gal = pd.read_csv('massive_galaxies.csv') # modified for my purposes
def density_field_calc(clus_x, clus_y, clus_z):
    # Converting strings into floats
    clus_x = float(clus_x)
    clus_y = float(clus_y)
    clus_z = float(clus_z)
    # Filtering the input dataframe using the arguments of the function
    df_gal_selected = df_gal[(df_gal['x[kpc/h]'] >= (clus_x - 120000)) & (df_gal['x[kpc/h]'] <= (clus_x + 120000))
                             & (df_gal['y[kpc/h]'] >= (clus_y - 120000)) & (df_gal['y[kpc/h]'] <= (clus_y + 120000))
                             & (df_gal['z[kpc/h]'] >= (clus_z - 120000)) & (df_gal['z[kpc/h]'] <= (clus_z + 120000))]
    # copy the filtered values and normalize: subtract the lower bound so we start at 0
    # max upper values of the filtered data are just shy of 240000
    dfs = df_gal_selected.copy()
    dfs['x[kpc/h]'] -= clus_x - 120000
    dfs['y[kpc/h]'] -= clus_y - 120000
    dfs['z[kpc/h]'] -= clus_z - 120000
    # now divide by 5000 (integer division) so we get bin numbers
    dfs['x[kpc/h]'] = dfs['x[kpc/h]'] // 5000
    dfs['y[kpc/h]'] = dfs['y[kpc/h]'] // 5000
    dfs['z[kpc/h]'] = dfs['z[kpc/h]'] // 5000
    # same trick as earlier: make tuples, convert tuples to running bin numbers
    dfs["cell"] = list(zip(dfs['x[kpc/h]'].astype(int), dfs['y[kpc/h]'].astype(int), dfs['z[kpc/h]'].astype(int)))
    # running bin number of cell (x, y, z) in a 48 x 48 x 48 grid
    lu = {(x, y, z): x*48*48 + y*48 + z for x in range(48) for y in range(48) for z in range(48)}
    dfs["idx"] = dfs["cell"].map(lu)
    # print(dfs)
    # occurrences grouped by cell tuple
    print(dfs.groupby(["cell"]).count()["idx"])
    # Creating an array of counts initialized with zeros
    counts_in_cells = np.zeros((48, 48, 48))
    for cell in dfs["cell"]:
        x, y, z = cell
        counts_in_cells[x, y, z] += 1
    np.set_printoptions(precision=1, suppress=True)
    print(counts_in_cells)

density_field_calc('416658.59', '455771.69', '72710.742')
Output:
# from groupby and count()
cell
(0, 6, 10) 1
(0, 6, 24) 1
(0, 6, 25) 1
(0, 6, 40) 1
(0, 8, 12) 1
..
(47, 44, 14) 1
(47, 44, 15) 3
(47, 45, 33) 1
(47, 45, 43) 1
(47, 47, 44) 1
Name: idx, Length: 3407, dtype: int64
# the np counted - mostly 0's
[[[0. 0. 0. ... 0. 0. 0.]
[0. 0. 0. ... 0. 0. 0.]
[0. 0. 0. ... 0. 0. 0.]
...
[0. 0. 0. ... 0. 0. 0.]
[0. 0. 0. ... 1. 0. 0.]
[0. 0. 0. ... 0. 0. 0.]]
[[0. 0. 0. ... 0. 0. 0.]
[0. 0. 0. ... 0. 0. 0.]
[0. 0. 0. ... 0. 0. 0.]
...
[0. 0. 0. ... 0. 0. 0.]
[0. 0. 0. ... 0. 0. 1.]
[0. 0. 0. ... 0. 0. 0.]]
[[0. 0. 0. ... 0. 0. 1.]
[0. 0. 0. ... 0. 0. 0.]
[0. 0. 0. ... 0. 0. 0.]
...
[0. 0. 0. ... 0. 0. 0.]
[0. 0. 0. ... 0. 0. 0.]
[0. 0. 0. ... 0. 0. 0.]]
...
[[0. 0. 0. ... 0. 0. 1.]
[0. 0. 0. ... 0. 0. 0.]
[0. 0. 0. ... 0. 0. 0.]
...
[0. 0. 0. ... 0. 0. 0.]
[0. 0. 0. ... 0. 0. 0.]
[0. 0. 0. ... 0. 0. 0.]]
[[0. 0. 0. ... 0. 0. 0.]
[0. 0. 0. ... 0. 0. 0.]
[0. 0. 0. ... 0. 0. 0.]
...
[0. 0. 0. ... 0. 0. 0.]
[0. 0. 0. ... 1. 0. 0.]
[0. 0. 0. ... 0. 0. 0.]]
[[0. 0. 0. ... 0. 0. 0.]
[0. 0. 0. ... 0. 0. 0.]
[0. 0. 0. ... 0. 0. 0.]
...
[0. 0. 0. ... 0. 0. 0.]
[0. 0. 0. ... 0. 0. 0.]
[0. 0. 0. ... 0. 0. 0.]]]
The selection contains 4401 rows, and the NumPy sum of the counts (sum(sum(sum(counts_in_cells)))) is also 4401.0, so it probably works, and it finishes within a few seconds.
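The remaining Python-level loop over the cell tuples can probably be removed as well. What follows is only a minimal sketch, not part of the code above: it uses the same idea (shift to the lower bound, divide by the cell size to get bin numbers) and then swaps in np.add.at for the loop. The function name count_particles_in_cells and its parameters (lower, cell_size, n_cells) are made up for illustration, and it runs on the 3 x 3 x 3 toy data from the question rather than on the galaxy file. Points outside the half-open cube are dropped, which is an assumption about the desired edge handling.

```python
import numpy as np

def count_particles_in_cells(points, lower, cell_size, n_cells):
    """Count points per cubic cell without Python-level loops.

    Hypothetical helper for illustration only:
    points    : (N, 3) array-like of coordinates
    lower     : length-3 lower corner of the cube
    cell_size : edge length of one cell
    n_cells   : number of cells along each axis
    """
    points = np.asarray(points, dtype=float)
    lower = np.asarray(lower, dtype=float)
    # Shift so the cube starts at 0, then floor-divide to get integer bin numbers.
    idx = np.floor((points - lower) / cell_size).astype(np.int64)
    # Assumption: keep only points inside [lower, lower + n_cells * cell_size).
    inside = np.all((idx >= 0) & (idx < n_cells), axis=1)
    idx = idx[inside]
    counts = np.zeros((n_cells, n_cells, n_cells), dtype=np.int64)
    # np.add.at accumulates correctly even when the same cell index repeats.
    np.add.at(counts, (idx[:, 0], idx[:, 1], idx[:, 2]), 1)
    return counts

# Toy data as in the question: 25 particles in a 3 x 3 x 3 cube.
rng = np.random.default_rng(0)
sample = rng.uniform(0, 3, (25, 3))
counts = count_particles_in_cells(sample, lower=(0, 0, 0), cell_size=1.0, n_cells=3)
print(counts)
print(counts.sum())  # every sampled point lies in [0, 3), so all 25 are counted
```

For the real data this would correspond to something like lower=(clus_x - 120000, clus_y - 120000, clus_z - 120000), cell_size=5000 and n_cells=48, but that has not been run against the actual file, so compare the result with the loop version before relying on it. np.histogramdd with explicit bins and range is another possible route to the same counts.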