NumPy random choice in multiple loops


username

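I am trying to run several simulations many times to obtain a desired simulation distribution. I have a dataset that looks like the one below.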

fruit_type, reading, prob
Apple, 12, .05
apple, 15, .5
orange, 18, .99

My code sample is below.

def sim(seconds):
    output = pd.DataFrame()
    current = []
    #output = pd.DataFrame()
    for i in range(1, 100000000):
        if data2['fruit_type'].all() == 'Apple':
            hostrecord1 = np.random.choice(data2['reading'], size=23, replace=True, p=data2['prob'])
            current = hostrecord1.sum() + 150

        if data2['fruit_type'].all() == 'Orange':
            hostrecord2 = np.random.choice(data2['reading'], size=23, replace=True, p=data2['prob'])
            current = hostrecord2.sum() + 150

        if data2['fruit_type'].all() == 'Peach':
            hostrecord3 = np.random.choice(data2['reading'], size=20, replace=True, p=data2['prob'])
            current = hostrecord3.sum() + 150

    #put all records in one array
    #return all records 
    output = pd.concat(current)
    return output

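I am trying to figure out how to run multiple simulations under different conditions on fruit_type, but I can't work out the logic. Each simulation should select only the rows for a specific fruit_type, so the simulations are partitioned by fruit_type. The sample size differs by design, because each fruit_type has different conditions.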

My expected output is an array of all simulated values. I would also like to append all the results to a single pandas DataFrame.

wflynny

Your explanation is not very clear, but here is a guess:

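# initialize data
In [1]: import numpy as np; import pandas as pd
In [2]: fruits = ['apple', 'peach', 'orange']
# build from a dict so that reading and prob stay numeric
In [3]: df = pd.DataFrame({'fruit_type': np.random.choice(fruits, size=10),
                           'reading': np.random.randint(0, 100, size=10),
                           'prob': np.random.rand(10)})
# the p argument of np.random.choice must sum to 1, so normalize within each fruit
In [4]: df['prob'] /= df.groupby('fruit_type')['prob'].transform('sum')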

The key is to index df like this: df[df.fruit_type == fruit_of_interest]. Here is an example function:

def simulate(df, N_trials):
    # sample sizes for ['apple', 'peach', 'orange'] respectively
    # (taken from the question's code; replace with your actual sizes)
    sample_sizes = [23, 20, 23]
    fruits = ['apple', 'peach', 'orange']

    results = np.empty((N_trials, len(fruits)))
    for i in range(N_trials):  # use xrange on Python 2
        for j, (fruit, size) in enumerate(zip(fruits, sample_sizes)):
            # keep only the rows for this fruit
            sim_data = df[df.fruit_type == fruit]
            # p must sum to 1 over the selected rows
            record = np.random.choice(sim_data.reading, size=size, p=sim_data.prob)
            # do something with the record
            results[i, j] = record.sum()
    return results

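Note that if you run 100 million trials, the results array may be too large to fit in memory (about 2.4 GB as float64 with three fruits). It may also be faster to swap the for loops so that the fruit/size loop is the outermost one, since the rows for each fruit are then filtered only once.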
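
As a rough illustration of the note above, the loops can be swapped so that the fruit loop is outermost and each fruit's rows are selected only once. This is a minimal sketch, not the original answer's code: the name simulate_by_fruit, the toy DataFrame, and the rescaling of prob so that it sums to 1 within each fruit are all assumptions, and it uses NumPy's Generator API (np.random.default_rng) instead of the legacy np.random.choice so the run can be seeded.

```python
import numpy as np
import pandas as pd


def simulate_by_fruit(df, n_trials, sample_sizes, rng=None):
    """Return an (n_trials, n_fruits) array of summed readings, one column per fruit."""
    rng = np.random.default_rng() if rng is None else rng
    results = np.empty((n_trials, len(sample_sizes)))
    for j, (fruit, size) in enumerate(sample_sizes.items()):
        # Filter once per fruit instead of once per trial.
        sim_data = df[df.fruit_type == fruit]
        readings = sim_data.reading.to_numpy()
        # Assumption: weights are rescaled so they sum to 1 within the fruit.
        p = sim_data.prob.to_numpy(dtype=float)
        p = p / p.sum()
        for i in range(n_trials):
            record = rng.choice(readings, size=size, p=p)
            results[i, j] = record.sum()
    return results


# Hypothetical toy data, loosely based on the table in the question.
df = pd.DataFrame({
    "fruit_type": ["apple", "apple", "orange", "peach", "peach"],
    "reading": [12, 15, 18, 7, 9],
    "prob": [0.05, 0.5, 0.99, 0.3, 0.7],
})
# Sample sizes per fruit, taken from the question's code.
sizes = {"apple": 23, "peach": 20, "orange": 23}
results = simulate_by_fruit(df, n_trials=1000, sample_sizes=sizes,
                            rng=np.random.default_rng(0))

# Rough memory estimate for the full run: float64 is 8 bytes per value.
full_run_bytes = 100_000_000 * len(sizes) * 8
print(results.shape, full_run_bytes / 1e9, "GB")
```

The memory estimate is the reason the full 100 million trials are better processed in pieces than stored in one array.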

It's also worth noting that, instead of for-looping, you can always generate one huge sample with np.random.choice and then reshape it:

np.random.choice([0, 1], size=1000000).reshape(10000, 100)

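This gives you 10000 trials with 100 samples each. It could be useful if your 100 million trials take too long: you could split the work into 100 loops over choice, each handling 1 million samples at a time. An example might be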

def simulate(df, N_trials, chunk_size=10000):
    # sample sizes for ['apple', 'peach', 'orange'] respectively
    # (taken from the question's code; replace with your actual sizes)
    sample_sizes = [23, 20, 23]
    fruits = ['apple', 'peach', 'orange']

    # integer division: any remainder of N_trials / chunk_size is dropped
    for i in range(N_trials // chunk_size):  # use xrange on Python 2
        chunk_results = np.empty((chunk_size, len(fruits)))
        for j, (fruit, size) in enumerate(zip(fruits, sample_sizes)):
            sim_data = df[df.fruit_type == fruit]
            # draw chunk_size trials of `size` samples each in a single call
            record = np.random.choice(sim_data.reading, size=(chunk_size, size),
                                      p=sim_data.prob)
            chunk_results[:, j] = record.sum(axis=1)
        # do something intermediate with this chunk
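
One possible way to "do something intermediate" with each chunk, given that the question asks for everything in a single pandas DataFrame, is to yield each chunk as a small DataFrame and concatenate them at the end. The sketch below is only an illustration under assumptions: simulate_chunks, the toy data, and the per-fruit rescaling of prob are hypothetical, and it uses np.random.default_rng rather than the legacy np.random.choice.

```python
import numpy as np
import pandas as pd


def simulate_chunks(df, n_trials, sample_sizes, chunk_size=10_000, seed=None):
    """Yield one DataFrame of per-trial sums per chunk (one column per fruit)."""
    rng = np.random.default_rng(seed)
    done = 0
    while done < n_trials:
        n = min(chunk_size, n_trials - done)  # the last chunk may be shorter
        chunk = {}
        for fruit, size in sample_sizes.items():
            sim_data = df[df.fruit_type == fruit]
            # Assumption: weights are rescaled so they sum to 1 within the fruit.
            p = sim_data.prob.to_numpy(dtype=float)
            p = p / p.sum()
            # A tuple size draws n trials of `size` samples in one call.
            draws = rng.choice(sim_data.reading.to_numpy(), size=(n, size), p=p)
            chunk[fruit] = draws.sum(axis=1)
        yield pd.DataFrame(chunk)
        done += n


# Hypothetical toy data, loosely based on the table in the question.
df = pd.DataFrame({
    "fruit_type": ["apple", "apple", "orange", "peach", "peach"],
    "reading": [12, 15, 18, 7, 9],
    "prob": [0.05, 0.5, 0.99, 0.3, 0.7],
})
# Sample sizes per fruit, taken from the question's code.
sizes = {"apple": 23, "peach": 20, "orange": 23}

all_results = pd.concat(list(simulate_chunks(df, 25_000, sizes, seed=0)),
                        ignore_index=True)
print(all_results.shape)
print(all_results.describe())
```

For a run too large to hold in memory, each yielded chunk could instead be reduced (for example to a histogram) or written to disk before the next one is generated.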
