Efficiently create arrays from the next n elements of an array

trust512

Short version:

I'm trying to efficiently create an array like x:

input = [0, 1, 2, 3, 4, 5, 6]

x = [ [0,1,2], [1,2,3], [2,3,4], [3,4,5], [4,5,6] ]

I've tried a simple for loop, and it takes too long for the real use case.

Long version:

(extends short version)

I've got a 400k-row dataframe, which I need to partition into arrays of the next n elements, starting from the element currently being iterated over. Currently I group it as shown below in the process_data function.

A simple for-based iteration takes forever here (2.5 minutes on my hardware, to be specific). I've searched the itertools and pandas documentation, tried searching here too, and couldn't find a fitting solution.

My current, very time-consuming implementation:

import numpy as np

class ModelInputParsing(object):
    def __init__(self, data):
        self.parsed_dataframe = data.fillna(0)

    def process_data(self, lb=50):
        self.X, self.Y = [], []
        for i in range(len(self.parsed_dataframe) - lb):
            # window of lb consecutive values from the second-to-last column
            self.X.append(self.parsed_dataframe.iloc[i:(i+lb), -2])
            # target: last-column value on the row right after the window
            self.Y.append(self.parsed_dataframe.iloc[(i+lb), -1])
        return (np.array(self.X), np.array(self.Y))

The input data looks like this (where Bid is the mentioned input):

    Bid     Changes     Expected
0   1.20102 NaN         0.000000
1   1.20102 0.000000    0.000000
2   1.20102 0.000000    0.000042
3   1.20102 0.000000    0.000017
4   1.20102 0.000000    0.000025
5   1.20102 0.000000    0.000025
6   1.20102 0.000000    0.000100
...

And the output should look like this:

array([[  0.00000000e+00,   0.00000000e+00,   0.00000000e+00, ...,
          8.34465027e-06,  -8.34465027e-06,   0.00000000e+00],
       [  0.00000000e+00,   0.00000000e+00,   0.00000000e+00, ...,
         -8.34465027e-06,   0.00000000e+00,   3.33786011e-05],
       [  0.00000000e+00,   0.00000000e+00,   0.00000000e+00, ...,
          0.00000000e+00,   3.33786011e-05,   0.00000000e+00],
       ..., 
       [  0.00000000e+00,   8.34465027e-06,   1.66893005e-05, ...,
         -8.34465027e-06,   0.00000000e+00,   0.00000000e+00],
       [  8.34465027e-06,   1.66893005e-05,  -8.34465027e-06, ...,
          0.00000000e+00,   0.00000000e+00,   0.00000000e+00],
       [  1.66893005e-05,  -8.34465027e-06,   0.00000000e+00, ...,
          0.00000000e+00,   0.00000000e+00,   1.66893005e-05]], dtype=float32)
len(x)
399950

Below I've presented x[0] and x[1]. The key point is how the values move one position back in the next array. For example, the first non-zero value moved from position 7 to position 6 (0-based).

The first element:

x[0]
array([  0.00000000e+00,   0.00000000e+00,   0.00000000e+00,
         0.00000000e+00,   0.00000000e+00,   0.00000000e+00,
         0.00000000e+00,  -4.16040421e-05,   2.49147415e-05,
        -8.34465027e-06,   0.00000000e+00,  -7.49230385e-05,
         ...,
         2.50339508e-05,  -8.34465027e-06,   3.33786011e-05,
        -2.50339508e-05,  -8.34465027e-06,   8.34465027e-06,
        -8.34465027e-06,   0.00000000e+00], dtype=float32)
len(x[0])
50

The second element:

x[1]
array([  0.00000000e+00,   0.00000000e+00,   0.00000000e+00,
         0.00000000e+00,   0.00000000e+00,   0.00000000e+00,
        -4.16040421e-05,   2.49147415e-05,  -8.34465027e-06,
         0.00000000e+00,  -7.49230385e-05,  -1.58131123e-04,
         ....,
        -8.34465027e-06,   3.33786011e-05,  -2.50339508e-05,
        -8.34465027e-06,   8.34465027e-06,  -8.34465027e-06,
         0.00000000e+00,   3.33786011e-05], dtype=float32)
len(x[1])
50

I'm curious whether there is a way to get this done more efficiently, as I'm planning to parse datasets of 20M+ rows soon.

fferri

zip() plus some slicing can do that:

>>> list(zip(input[0:], input[1:], input[2:]))
[(0, 1, 2), (1, 2, 3), (2, 3, 4), (3, 4, 5), (4, 5, 6)]

If you need the inner elements to be lists, use this:

>>> list(map(list, zip(input[0:], input[1:], input[2:])))
[[0, 1, 2], [1, 2, 3], [2, 3, 4], [3, 4, 5], [4, 5, 6]]

In general, if you need n-tuples instead of triples, you can do:

>>> list(zip(*(input[i:] for i in range(3))))
[(0, 1, 2), (1, 2, 3), (2, 3, 4), (3, 4, 5), (4, 5, 6)]

or

>>> list(map(list, zip(*(input[i:] for i in range(3)))))
[[0, 1, 2], [1, 2, 3], [2, 3, 4], [3, 4, 5], [4, 5, 6]]

Another way to do it:

>>> [input[i:i+3] for i in range(len(input)-3+1)]
[[0, 1, 2], [1, 2, 3], [2, 3, 4], [3, 4, 5], [4, 5, 6]]
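
The snippets above hardcode a window length of 3. As a minimal sketch, both the zip form and the slicing form can be wrapped in helpers that take the window length as a parameter. The names windows_zip and windows_slice are hypothetical and only used for illustration:

```python
def windows_zip(seq, n):
    """Return consecutive length-n windows of seq as tuples.

    zip() stops at the shortest of the n shifted slices, so only
    complete windows are produced.
    """
    return list(zip(*(seq[i:] for i in range(n))))


def windows_slice(seq, n):
    """Return consecutive length-n windows of seq as slices (lists for a list input)."""
    return [seq[i:i + n] for i in range(len(seq) - n + 1)]


data = [0, 1, 2, 3, 4, 5, 6]
print(windows_zip(data, 3))    # [(0, 1, 2), (1, 2, 3), (2, 3, 4), (3, 4, 5), (4, 5, 6)]
print(windows_slice(data, 3))  # [[0, 1, 2], [1, 2, 3], [2, 3, 4], [3, 4, 5], [4, 5, 6]]
```

The zip form builds n shifted copies of the sequence up front, while the slicing form builds one small slice per window, so which one is faster likely depends on n and on the input length.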

Some benchmarks:

Setup:

import timeit

def ff1(input):
    return list(map(list, zip(input[0:], input[1:], input[2:])))

def ff2(input):
    return list(map(list, zip(*(input[i:] for i in range(3)))))

def ff3(input):
    return [input[i:i+3] for i in range(len(input)-3+1)]

def jg(input):
    for i in range(0, len(input) - 2):
        yield input[i:i+3]

def jg1(input):
    return list(jg(input))

import itertools

def n(input, n=3):
    # n independent iterators over the same input
    i = list(itertools.tee(input, n))
    for p, it in enumerate(i):
        # advance the p-th iterator by p positions
        next(itertools.islice(it, p, p), None)
    return zip(*i)

def n1(input, _n=3):
    return list(map(list, n(input, _n)))

import numpy as np
from numpy.lib.stride_tricks import as_strided

def strided_groupby(n, l=3):
    s = n.strides[0]
    return as_strided(n, shape=(n.size-l+1,l), strides=(s,s))

Results:

>>> input = list(range(10000))
>>> timeit.timeit(stmt='ff1(input)', globals=globals(), number=1000)
1.4750333260162733
>>> timeit.timeit(stmt='ff2(input)', globals=globals(), number=1000)
1.486136345018167
>>> timeit.timeit(stmt='ff3(input)', globals=globals(), number=1000)
1.6864491199958138
>>> timeit.timeit(stmt='jg1(input)', globals=globals(), number=1000)
2.300399674975779
>>> timeit.timeit(stmt='n1(input)', globals=globals(), number=1000)
2.2269885840360075
>>> input_arr = np.array(input)
>>> timeit.timeit(stmt='strided_groupby(input_arr)', globals=globals(), number=1000)
0.01855822204379365

Note that the inner list conversion wastes a significant number of CPU cycles. If you can afford to have tuples instead of lists as the innermost sequences (i.e. (0,1,2), (1,2,3), ...), that is going to perform better.

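For fairness of comparison I applied the same list conversion to all the pure-Python algorithms. strided_groupby is the exception: it returns a NumPy view over the input array and does no list conversion.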
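
To connect the strided result back to the question, here is a minimal sketch of how the X/Y windows from process_data might be built without a Python loop. It assumes the two columns have already been pulled out of the dataframe as 1-D NumPy arrays (for example with .to_numpy()), and it uses numpy.lib.stride_tricks.sliding_window_view, which needs NumPy 1.20 or later, in place of the as_strided call from the benchmark. The name make_windows is hypothetical:

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def make_windows(changes, expected, lb=50):
    """Vectorised equivalent of the loop in process_data (sketch).

    X[i] is changes[i:i+lb]; Y[i] is expected[i+lb].
    """
    changes = np.asarray(changes)
    expected = np.asarray(expected)
    # All windows of length lb; the last one has no following target row, so drop it.
    X = sliding_window_view(changes, lb)[:-1]
    # Targets start right after the first window.
    Y = expected[lb:]
    return X, Y


# Small demonstration with made-up data
changes = np.arange(7, dtype=np.float32)
expected = np.arange(7, dtype=np.float32) * 10
X, Y = make_windows(changes, expected, lb=3)
print(X.shape)  # (4, 3)
print(Y)        # targets 30, 40, 50, 60
```

X is a read-only view that shares memory with the input, so no window data is copied. Calling X.copy() or np.array(X) materialises it, which for 20 million windows of 50 float32 values would take roughly 4 GB.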
