How can I avoid loops when filtering a pandas DataFrame by specified limits?

Posted by Awais Mirza in Dev

Awais Mirza

The sample code below uses three nested for loops:

import numpy as np 
import pandas as pd

#Generating a sample (ndarray) of 25 particles with 3 random coordinates in the range between 0 and 3. 
#Maybe think of the particles as contained in a cube of 3 x 3 x 3 units.
sample_data = np.random.uniform(0, 3, (25,3))

#Converting the ndarray into a dataframe
df = pd.DataFrame(data = sample_data, columns = ['A', 'B', 'C'])
print(df)

#Generating another ndarray which will store the number of particles in each cell of the cube
#Each cell has dimensions 1 x 1 x 1; total cells = 27
counts_in_cells = np.empty((3, 3, 3))
counts_in_cells[:] = np.nan  # np.NaN was removed in NumPy 2.0; np.nan works in all versions

#Three nested loops to count the number of particles in each cell
for i, x_low in enumerate(np.arange(0, 3, 1)):
    for j, y_low in enumerate(np.arange(0, 3, 1)):
        for k, z_low in enumerate(np.arange(0, 3, 1)):
            
            #Specifying filtering conditions for the three dimensions of the cells
            x_condition = (df['A'] >= x_low) & (df['A'] < (x_low + 1))
            y_condition = (df['B'] >= y_low) & (df['B'] < (y_low + 1))
            z_condition = (df['C'] >= z_low) & (df['C'] < (z_low + 1))
            
            #Applying the filtering conditions
            df_select = df[x_condition & y_condition & z_condition]
            
            #Counting the particles in cells (desired outcome)
            counts_in_cells[i][j][k] = len(df_select)

#Particles in each cell
print(counts_in_cells)

Sample input

Desired result

Quick run

The sample code can be run directly on Kaggle: https://www.kaggle.com/awaismirza/counting-particles-in-each-cell-of-a-cube

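Question

I want to avoid the three loops because the actual version of this code takes a few minutes to run. (It has 90k particles and a larger cube.) On top of that, I have to run the actual code 6k times, which would take many days.

Is there a way (a pandas function, NumPy masking, etc.) to avoid the loops and make the code run faster?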

Original code

The actual version of the code is available here, but the sample code above should be enough to understand the problem.

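Patrick Artner

Because your actual data differs considerably from this example, it needs some additional processing. Take this code with a grain of salt and check the results yourself.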

import numpy as np
import pandas as pd

# Reading a data file 
df_gal = pd.read_csv('massive_galaxies.csv')  # modified for my purposes

 
def density_field_calc(clus_x, clus_y, clus_z): 
    #Converting strings into floats
    clus_x = float(clus_x)
    clus_y = float(clus_y)
    clus_z = float(clus_z)

    # Filtering the input dataframe using the arguments of the function
    df_gal_selected = df_gal[(df_gal['x[kpc/h]'] >= (clus_x - 120000)) & (df_gal['x[kpc/h]'] <= (clus_x + 120000))
                            & (df_gal['y[kpc/h]'] >= (clus_y - 120000)) & (df_gal['y[kpc/h]'] <= (clus_y + 120000)) 
                             & (df_gal['z[kpc/h]'] >= (clus_z - 120000)) & (df_gal['z[kpc/h]'] <= (clus_z + 120000))]

    # copy the filtered value and normalize - subtract the lower bound so we start at 0 
    # max upper value of filtered data are just shy of 240000
    dfs = df_gal_selected.copy()
    dfs['x[kpc/h]'] -= clus_x-120000
    dfs['y[kpc/h]'] -= clus_y-120000
    dfs['z[kpc/h]'] -= clus_z-120000
    
    # now divide by 5000 (integer-div) so we get bin-numbers

    dfs['x[kpc/h]'] = dfs['x[kpc/h]'] // 5000
    dfs['y[kpc/h]'] = dfs['y[kpc/h]'] // 5000
    dfs['z[kpc/h]'] = dfs['z[kpc/h]'] // 5000
    
    # same trick as earlier, make tuples, convert tuples to running bin numbers
    dfs["cell"] = list(zip(dfs['x[kpc/h]'].astype(int), dfs['y[kpc/h]'].astype(int), dfs['z[kpc/h]'].astype(int)))
    
    # flat (row-major) index of each cell in a 48 x 48 x 48 grid
    lu = {(x, y, z): x*48*48 + y*48 + z for x in range(48) for y in range(48) for z in range(48)}
    
    dfs["idx"] = dfs["cell"].map(lu)
    # print(dfs)

    # occurrences by tuples grouped
    print(dfs.groupby(["cell"]).count()["idx"])

    # Creating an array of per-cell counts, initialized to 0
    counts_in_cells = np.zeros((48, 48, 48))

    for cell in dfs["cell"]:
        x, y, z = cell
        counts_in_cells[x, y, z] += 1

    np.set_printoptions(precision=1, suppress=True )
    print(counts_in_cells)

density_field_calc('416658.59', '455771.69', '72710.742')

Output:

# from groupby and count()
cell
(0, 6, 10)      1
(0, 6, 24)      1
(0, 6, 25)      1
(0, 6, 40)      1
(0, 8, 12)      1
               ..
(47, 44, 14)    1
(47, 44, 15)    3
(47, 45, 33)    1
(47, 45, 43)    1
(47, 47, 44)    1
Name: idx, Length: 3407, dtype: int64

# the np counted - mostly 0's
[[[0. 0. 0. ... 0. 0. 0.]
  [0. 0. 0. ... 0. 0. 0.]
  [0. 0. 0. ... 0. 0. 0.]
  ...
  [0. 0. 0. ... 0. 0. 0.]
  [0. 0. 0. ... 1. 0. 0.]
  [0. 0. 0. ... 0. 0. 0.]]

 [[0. 0. 0. ... 0. 0. 0.]
  [0. 0. 0. ... 0. 0. 0.]
  [0. 0. 0. ... 0. 0. 0.]
  ...
  [0. 0. 0. ... 0. 0. 0.]
  [0. 0. 0. ... 0. 0. 1.]
  [0. 0. 0. ... 0. 0. 0.]]

 [[0. 0. 0. ... 0. 0. 1.]
  [0. 0. 0. ... 0. 0. 0.]
  [0. 0. 0. ... 0. 0. 0.]
  ...
  [0. 0. 0. ... 0. 0. 0.]
  [0. 0. 0. ... 0. 0. 0.]
  [0. 0. 0. ... 0. 0. 0.]]

 ...

 [[0. 0. 0. ... 0. 0. 1.]
  [0. 0. 0. ... 0. 0. 0.]
  [0. 0. 0. ... 0. 0. 0.]
  ...
  [0. 0. 0. ... 0. 0. 0.]
  [0. 0. 0. ... 0. 0. 0.]
  [0. 0. 0. ... 0. 0. 0.]]

 [[0. 0. 0. ... 0. 0. 0.]
  [0. 0. 0. ... 0. 0. 0.]
  [0. 0. 0. ... 0. 0. 0.]
  ...
  [0. 0. 0. ... 0. 0. 0.]
  [0. 0. 0. ... 1. 0. 0.]
  [0. 0. 0. ... 0. 0. 0.]]

 [[0. 0. 0. ... 0. 0. 0.]
  [0. 0. 0. ... 0. 0. 0.]
  [0. 0. 0. ... 0. 0. 0.]
  ...
  [0. 0. 0. ... 0. 0. 0.]
  [0. 0. 0. ... 0. 0. 0.]
  [0. 0. 0. ... 0. 0. 0.]]]

The selection contains 4401 rows, and the NumPy sum, sum(sum(sum(counts_in_cells))), is also 4401.0, so it probably works, and it finishes within a few seconds.
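The same binning idea can be tried on the small 3 x 3 x 3 sample from the question, with the per-row Python loop replaced by array operations. The following is only a minimal sketch, not the code from either post: `count_in_cells` and its parameters are hypothetical names introduced here, the random data is regenerated with a fixed seed, and cells are assumed to be half-open intervals as in the question's filter conditions. It floor-divides the coordinates into integer cell indices, accumulates with `np.add.at`, and cross-checks the result against `np.histogramdd`.

```python
import numpy as np
import pandas as pd


def count_in_cells(df, cols, lows, cell_size, n_cells):
    """Count rows of df per cubic cell without Python-level loops.

    Hypothetical helper for illustration. Along each axis, cell i covers
    the half-open interval [low + i*cell_size, low + (i+1)*cell_size).
    """
    coords = df[cols].to_numpy(dtype=float)
    # Shift to the grid origin and floor-divide to get integer cell indices
    idx = np.floor((coords - np.asarray(lows, dtype=float)) / cell_size).astype(int)
    # Drop points that fall outside the grid
    inside = np.all((idx >= 0) & (idx < n_cells), axis=1)
    idx = idx[inside]
    counts = np.zeros((n_cells, n_cells, n_cells), dtype=int)
    # np.add.at accumulates correctly when the same cell occurs many times
    np.add.at(counts, (idx[:, 0], idx[:, 1], idx[:, 2]), 1)
    return counts


# Sample data shaped like the question's: 25 particles in a 3 x 3 x 3 cube
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.uniform(0, 3, (25, 3)), columns=['A', 'B', 'C'])

counts = count_in_cells(df, ['A', 'B', 'C'], lows=(0, 0, 0), cell_size=1.0, n_cells=3)

# Cross-check against NumPy's N-dimensional histogram
hist, _ = np.histogramdd(df[['A', 'B', 'C']].to_numpy(), bins=(3, 3, 3), range=[(0, 3)] * 3)

print(counts.sum())                               # total particles counted
print(np.array_equal(counts, hist.astype(int)))   # both methods agree
```

For the real data, the same call would presumably take the `x[kpc/h]`, `y[kpc/h]`, `z[kpc/h]` columns, the lower bounds `clus_x - 120000` and so on, a cell size of 5000, and 48 cells per axis. Comparing `counts.sum()` with the number of selected rows gives the same sanity check as above.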
