Vectorizing a loop with NumPy: broadcasting with np.all and testing for unique elements

韦斯

Working loop, expected result

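I'm trying to vectorize a slow for loop in my code, which runs over very large datasets and removes duplicates based on a test. The result should keep only rows whose first three elements are unique, with the fourth element set to the largest value among all the duplicates. For example: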

in = np.array(((0, 12, 13, 1), (0, 12, 13, 10), (1, 12, 13, 2)))

should become

out = np.array(((0, 12, 13, 10), (1, 12, 13, 2)))

This is trivial with a for loop, but as I mentioned, it is very slow.

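import numpy as np

# `in` is a reserved word in Python, so the example input is bound to in_array here
in_array = np.array(((0, 12, 13, 1), (0, 12, 13, 10), (1, 12, 13, 2)))
unique = np.unique(in_array[:, :3], axis=0)
out = np.empty((0, 4))
for i in unique:
    out = np.vstack((out, np.hstack((i, np.max(in_array[np.all(in_array[:, :3] == i, axis=1)][:, 3])))))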

What I tried (1)

When I try to remove the for loop by replacing each i[:] with the indexed expression unique[np.arange(unique.shape[0])]:

out = np.vstack((out, np.hstack((unique[np.arange(unique.shape[0])], np.max(in[np.all(in[:, :3].astype(int) == unique[np.arange(unique.shape[0])], axis=1)][:, 3])))))

NumPy complains about the shape of the input to np.all:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<__array_function__ internals>", line 6, in all
  File "/usr/local/lib/python3.6/dist-packages/numpy/core/fromnumeric.py", line 2351, in all
    return _wrapreduction(a, np.logical_and, 'all', axis, None, out, keepdims=keepdims)
  File "/usr/local/lib/python3.6/dist-packages/numpy/core/fromnumeric.py", line 90, in _wrapreduction
    return ufunc.reduce(obj, axis, dtype, out, **passkwargs)
numpy.AxisError: axis 1 is out of bounds for array of dimension 0
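
A short shape check may help explain this error. The sketch below uses the example input from the question, with the illustrative names in_array, keys, and same (none of them from the original post), and assumes a NumPy version that provides np.broadcast_shapes (1.20 or later). Indexing unique with np.arange over all of its rows simply returns the same rows, so the comparison is between a (3, 3) and a (2, 3) array, which do not broadcast. Older NumPy releases appear to return a scalar False for such a comparison, which would explain why np.all then sees a 0-dimensional array; newer releases raise an error at the comparison itself.

```python
import numpy as np

# Example input from the question; `in` is a Python keyword, hence in_array.
in_array = np.array(((0, 12, 13, 1), (0, 12, 13, 10), (1, 12, 13, 2)))
keys = in_array[:, :3]
unique = np.unique(keys, axis=0)

# Indexing with np.arange over every row returns the same rows again,
# so the comparison is between shapes (3, 3) and (2, 3).
same = unique[np.arange(unique.shape[0])]
print(keys.shape, same.shape)

# The leading dimensions (3 and 2) differ and neither is 1,
# so the two shapes are not broadcast-compatible.
try:
    np.broadcast_shapes(keys.shape, same.shape)
    compatible = True
except ValueError:
    compatible = False
print(compatible)
```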

What I tried (2)

Based on a suggestion StackOverflow offered while I was typing this question (Broadcast / Vectorizing inner and outer for loops in python / NumPy):

newout = np.vstack((newout, np.hstack((tempunique[:, None], np.max(inout[np.all(inout[:, :3].astype(int) == tempunique[:, None], axis=1)][:, 3])))))

I get an error complaining about a size mismatch between the input and the output:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
IndexError: boolean index did not match indexed array along dimension 0; dimension is 3 but corresponding boolean dimension is 2
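
The mismatch in this second attempt can be traced through the shapes. The following is only a sketch on the question's example input, assuming integer data (np.iinfo is used for the fill value); the names in_array, keys, eq, mask, and maxima are illustrative and not from the original post. Adding an axis with unique[:, None] does make the comparison broadcast, but reducing over axis=1 leaves a mask whose first dimension is the number of unique keys (2), which cannot index the 3-row input. Reducing over the last axis instead yields one row of matches per unique key, and the per-key maximum can then be taken without a Python loop.

```python
import numpy as np

# Example input from the question; `in` is a Python keyword, hence in_array.
in_array = np.array(((0, 12, 13, 1), (0, 12, 13, 10), (1, 12, 13, 2)))
keys = in_array[:, :3]                 # shape (3, 3)
unique = np.unique(keys, axis=0)       # shape (2, 3)

# unique[:, None] has shape (2, 1, 3); compared with keys (3, 3) it
# broadcasts to (2, 3, 3), indexed as [unique row, input row, column].
eq = keys == unique[:, None]

# Reduce over the column axis: mask[u, r] is True when input row r
# has the same first three elements as unique row u.
mask = np.all(eq, axis=-1)             # shape (2, 3)

# Per-key maximum of column 3. Non-matching entries are replaced by the
# smallest value of the (integer) dtype so they never win the max.
fill = np.iinfo(in_array.dtype).min
maxima = np.where(mask, in_array[:, 3], fill).max(axis=1)

out = np.column_stack((unique, maxima))
print(out)
```

Note that mask holds one entry per (unique key, input row) pair, so its memory use grows with the product of the two counts; for very large inputs this pure-broadcast form may not be practical.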

Restating the question

Is there a correct way to broadcast my indexing so that the for loop can be eliminated?

xcmkz

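I don't know enough about your use case to say whether it's worth bringing in Pandas, but doing this efficiently in Pandas takes only a few lines of code: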

import numpy as np
import pandas as pd

in_array = np.array(((0, 12, 13, 1), (0, 12, 13, 10), (1, 12, 13, 2)))
in_df = pd.DataFrame(in_array)


# Group by unique combinations of the 0th, 1st, and 2nd columns, then take the
# max of the 3rd column in each group. `reset_index` turns columns 0-2 from
# the index back into regular columns.
out_df = in_df.groupby([0, 1, 2])[3].max().reset_index()
out_array = out_df.values

print(out_array)
# Output:
# [[ 0 12 13 10]
#  [ 1 12 13  2]]

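A simple timing test shows that processing a randomly generated input array of 100000 rows takes 0.0117 seconds with Pandas, versus 2.6103 seconds with the for-loop implementation.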
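
For readers who would rather stay within NumPy, the same grouping can be sketched with a sort followed by a segmented maximum. This is not part of the answer above, only one possible alternative: the function name max_per_key is hypothetical, and the approach relies on np.lexsort to bring equal keys together and np.maximum.reduceat to take the maximum of column 3 within each run of equal keys. It has not been timed against the Pandas version.

```python
import numpy as np

def max_per_key(arr):
    """Keep one row per unique (col0, col1, col2) key, with the max of col3."""
    # Sort rows by columns 0, 1, 2. np.lexsort uses the LAST key as the
    # primary sort key, so column 0 is passed last.
    order = np.lexsort((arr[:, 2], arr[:, 1], arr[:, 0]))
    s = arr[order]

    # Mark rows whose key differs from the previous row's key; the first
    # row always starts a new group.
    new_group = np.ones(len(s), dtype=bool)
    new_group[1:] = np.any(s[1:, :3] != s[:-1, :3], axis=1)
    starts = np.flatnonzero(new_group)

    # Maximum of column 3 over each run [starts[i], starts[i + 1]).
    maxima = np.maximum.reduceat(s[:, 3], starts)
    return np.column_stack((s[starts, :3], maxima))

in_array = np.array(((0, 12, 13, 1), (0, 12, 13, 10), (1, 12, 13, 2)))
result = max_per_key(in_array)
print(result)
```

Because the work is dominated by a single sort, memory use stays proportional to the input size, unlike a mask with one entry per (unique key, input row) pair.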
