Random sampling in a pandas DataFrame


Francisco Pega

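I'm fairly new to pandas (and to Python... and to programming), and I'm trying to run a Monte Carlo simulation, but I haven't been able to find a solution that doesn't take a large amount of time.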

The data is stored in a DataFrame named "YTDSales", which holds the daily sales for each product.

Date          Product_A     Product_B     Product_C     Product_D     ...   Product_XX
01/01/2014         1000           300            70         34500     ...          780   
02/01/2014          400           400            70            20     ...           10   
03/01/2014         1110           400          1170            60     ...           50   
04/01/2014           20           320             0         71300     ...           10   
       ...
15/10/2014         1000           300            70         34500     ...         5000

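What I want to do is simulate different scenarios for the rest of the year (from October 15 to year end), using the historical distribution of each product. For example, with the data shown, I would like to fill the rest of the year with sales between 20 and 1100.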

What I've done is the following:

import datetime as dt
import numpy as np
import pandas as pd

# creates range of "future dates"
last_historical = YTDSales.index.max()
year_end = dt.datetime(2014, 12, 30)
DatesEOY = pd.date_range(start=last_historical, end=year_end).shift(1)

# function that obtains a random sales number per product, between min and max
f = lambda x: np.random.randint(x.min(), x.max())

# create all the "future" dates and fill each one with the output of f
for i in DatesEOY:
    YTDSales.loc[i] = YTDSales.apply(f)
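This solution works, but it takes about 3 seconds, which adds up to a lot if I plan to run 1,000 iterations... Is there a way to do this without iterating?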

Thanks

Ely

Use the size option of np.random.randint to get a sample of the needed size all at once. One approach I would consider is, briefly, as follows.

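Allocate the space you'll need in a new DataFrame that has the values of DatesEOY as its index, the columns of the original DataFrame, and all NaN values. Then concatenate it onto the original data.
Now that you know the length of each random sample you need, use the extra size keyword of numpy.random.randint to draw the whole sample for each column at once, instead of looping over dates.
Overwrite the placeholder data with this batch of samples.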

This could look like this:

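import numpy as np
import pandas as pd

# Placeholder frame: future dates as the index, same columns, all NaN
new_df = pd.DataFrame(index=DatesEOY, columns=YTDSales.columns)

num_to_sample = len(new_df)

# x is a (column name, Series) pair; draw every future value for that column at once
f = lambda x: np.random.randint(x[1].min(), x[1].max(), num_to_sample)

output = pd.concat([YTDSales, new_df], axis=0)

# items() replaces the removed iteritems(); transpose to shape (dates, products)
output.iloc[len(YTDSales):] = np.asarray([f(item) for item in YTDSales.items()]).T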
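
As a variation on the per-column lambda, the draw for all columns can be made in a single call, because NumPy's newer Generator.integers accepts arrays for the lower and upper bounds. The sketch below is only an illustration under stated assumptions: the small YTDSales frame and the DatesEOY range are made-up stand-ins for the real data, and endpoint=True is my choice so that the historical maximum can itself be drawn (and so that a column whose minimum equals its maximum does not raise an error).

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the real data (values are made up for illustration)
hist_index = pd.date_range("2014-01-01", periods=4, freq="D")
YTDSales = pd.DataFrame(
    {"Product_A": [1000, 400, 1110, 20], "Product_B": [300, 400, 400, 320]},
    index=hist_index,
)
DatesEOY = pd.date_range("2014-01-05", "2014-01-10", freq="D")

rng = np.random.default_rng(0)  # seeded only so the sketch is repeatable

low = YTDSales.min().to_numpy()   # per-column minimum, shape (n_products,)
high = YTDSales.max().to_numpy()  # per-column maximum, shape (n_products,)

# One call draws every future value; low/high broadcast across the rows.
# endpoint=True makes the maximum itself a possible draw.
samples = rng.integers(low, high, size=(len(DatesEOY), len(low)), endpoint=True)

# Build the future rows directly and append them to the history
future = pd.DataFrame(samples, index=DatesEOY, columns=YTDSales.columns)
output = pd.concat([YTDSales, future])
```

Building the future rows directly from the sampled array also keeps the columns as integers, since no NaN placeholder is ever created.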

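In doing this, I chose to make a completely new DataFrame by concatenating the old one with the new "placeholder" one. This could obviously be inefficient for very large data.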
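
If the cost of building a new DataFrame matters mainly because the simulation is repeated (the question mentions 1,000 iterations), one option is to skip the per-scenario DataFrame and draw every scenario into a single 3-D array. This is a rough sketch, not something from the original answer: the toy data, n_future_days, n_scenarios, and the choice to summarise year-end totals are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the historical data (values are made up for illustration)
hist_index = pd.date_range("2014-01-01", periods=4, freq="D")
YTDSales = pd.DataFrame(
    {"Product_A": [1000, 400, 1110, 20], "Product_B": [300, 400, 400, 320]},
    index=hist_index,
)
n_future_days = 6    # hypothetical number of remaining days
n_scenarios = 1000   # hypothetical number of Monte Carlo iterations

rng = np.random.default_rng(0)
low = YTDSales.min().to_numpy()
high = YTDSales.max().to_numpy()

# All scenarios at once: shape (n_scenarios, n_future_days, n_products)
draws = rng.integers(
    low, high, size=(n_scenarios, n_future_days, len(low)), endpoint=True
)

# Year-end total per scenario and product: sales so far plus the simulated remainder
ytd_total = YTDSales.sum().to_numpy()
year_end_totals = ytd_total + draws.sum(axis=1)  # shape (n_scenarios, n_products)

# Summary statistics of the simulated totals, one column per product
summary = pd.DataFrame(year_end_totals, columns=YTDSales.columns).describe()
```

Whether a total, a percentile, or something else is the right summary depends on what the simulation is meant to answer; the array of draws is the reusable part.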

Another way to approach this is setting with enlargement, as you've done in your for-loop solution.

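I did not play around with that approach long enough to figure out how to "enlarge" batches of indexes all at once. But if you figure that out, you can "enlarge" the original data frame with all NaN values (at the index values from DatesEOY) and then apply the function to YTDSales instead of bringing output into it at all.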
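
One possible way to add a whole batch of index values in one step is DataFrame.reindex, sketched below. Note that this is my substitution rather than true setting with enlargement: reindex returns a new frame instead of growing the original in place. The toy YTDSales and DatesEOY are made-up stand-ins, and the sketch assumes every future date is later than every historical date, so the new rows land at the end.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the real data (values are made up for illustration)
hist_index = pd.date_range("2014-01-01", periods=4, freq="D")
YTDSales = pd.DataFrame(
    {"Product_A": [1000, 400, 1110, 20], "Product_B": [300, 400, 400, 320]},
    index=hist_index,
)
DatesEOY = pd.date_range("2014-01-05", "2014-01-10", freq="D")

n_hist = len(YTDSales)
low = YTDSales.min().to_numpy()
high = YTDSales.max().to_numpy()

# Add all future dates at once; the new rows are NaN, so columns become float
enlarged = YTDSales.reindex(YTDSales.index.union(DatesEOY))

# Fill only the new rows with one batch of random draws
rng = np.random.default_rng(0)
enlarged.iloc[n_hist:] = rng.integers(
    low, high, size=(len(DatesEOY), len(low)), endpoint=True
)

# Optional: restore integer columns now that no NaN values remain
enlarged = enlarged.astype("int64")
```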
