How to find number of rows that fall within a time period of each row, while satisfying criteria in other columns?

WillacyMe

I have an example dataframe with a column of names and a column of datetimes.

import random
np.random.seed(1)

numberList = ['Mark','James','Sarah']
df = pd.DataFrame({'Date':pd.date_range(start='1/1/2020', freq='BH', periods=20),
             'Name':[random.choice(numberList) for x in range(20)]})

    Date                Name
0   2020-01-01 09:00:00 James
1   2020-01-01 10:00:00 Sarah
2   2020-01-01 11:00:00 Sarah
3   2020-01-01 12:00:00 James
4   2020-01-01 13:00:00 Mark
5   2020-01-01 14:00:00 James
6   2020-01-01 15:00:00 Mark
7   2020-01-01 16:00:00 Sarah
8   2020-01-02 09:00:00 Mark
9   2020-01-02 10:00:00 Sarah
10  2020-01-02 11:00:00 Sarah
11  2020-01-02 12:00:00 Mark
12  2020-01-02 13:00:00 Sarah
13  2020-01-02 14:00:00 Sarah
14  2020-01-02 15:00:00 Mark
15  2020-01-02 16:00:00 Mark
16  2020-01-03 09:00:00 Sarah
17  2020-01-03 10:00:00 Sarah
18  2020-01-03 11:00:00 Mark
19  2020-01-03 12:00:00 Sarah

For each row I am trying to find the total number of rows that have a datetime within 10 hours after, and the name matches.

I've managed to do this with the code below, however on a much larger dataset this takes forever. Is there a better was accomplish this?

df['Total'] = 0
for i in df.Name.unique():
    df2 = df[df.Name == i]
    total = df2['Date'].apply(lambda x: len(df2[(df2.Date>=x) & (df2.Date<x + datetime.timedelta(hours = 10))]))
    df.loc[total.index,'Total'] = total.values
df

Result:

    Date                Name    Total
0   2020-01-01 09:00:00 James   3
1   2020-01-01 10:00:00 Sarah   3
2   2020-01-01 11:00:00 Sarah   2
3   2020-01-01 12:00:00 James   2
4   2020-01-01 13:00:00 Mark    2
5   2020-01-01 14:00:00 James   1
6   2020-01-01 15:00:00 Mark    1
7   2020-01-01 16:00:00 Sarah   1
8   2020-01-02 09:00:00 Mark    4
9   2020-01-02 10:00:00 Sarah   4
10  2020-01-02 11:00:00 Sarah   3
11  2020-01-02 12:00:00 Mark    3
12  2020-01-02 13:00:00 Sarah   2
13  2020-01-02 14:00:00 Sarah   1
14  2020-01-02 15:00:00 Mark    2
15  2020-01-02 16:00:00 Mark    1
16  2020-01-03 09:00:00 Sarah   3
17  2020-01-03 10:00:00 Sarah   2
18  2020-01-03 11:00:00 Mark    1
19  2020-01-03 12:00:00 Sarah   1

EDIT: The actual data is at least 80000 rows and with 200+ names. The Date column is specific up to the second., The Date column contains duplicate entries where two different Names can have the same datetime, but no single Name will have more than one of the same datetime entries.

EDIT-----------------------------------------------

I've marked Rik Kraan's answer although it did produce slower results when using my own data. Because of this I wanted to compare the performance of both methods. Below runs a test of both comparing sample sizes up to 50000 rows in 1000 row increments. For my specific use case it looks like Rik's solution is faster up 48/49 thousand rows, after which the original solution seems better.

import time
import random
import datetime

Rows = []
Rik_Kraan = []
Willacya = []

for i in range(1000,50000,1000):
    
    Rows.append(i)
    
    # Creates Dataframe where number of names is 20% the length of the Dataframe.
    numberList = ["Name_"+str(j) for j in range(1,int(i*.2))]
    df_test = pd.DataFrame({'Date':pd.date_range(start='1/1/2020', freq='S', periods=i),
                 'Name':[random.choice(numberList) for x in range(i)]})
    
    # Rik_Kraan solution using masking
    start = time.time() 
    dates = df_test['Date'].values
    name = df_test['Name'].values
    df_test.assign(Total=np.sum((dates[:, None] <= (dates+pd.Timedelta(10, 'H'))) & (dates[:, None] >= dates) & (name[:, None] == name), axis=0))
    end = time.time()
    Rik_Kraan.append(end-start)

    # Original Solution
    start = time.time()
    for j in df_test.Name.unique():
        df2 = df_test[df_test.Name == j].copy()
        total = df2['Date'].apply(lambda x: len(df2[(df2.Date<=x) & (df2.Date>x - datetime.timedelta(hours = 1))]))
        df_test.loc[total.index,'Total'] = total.values
    end = time.time()
    Willacya.append(end-start)   
    
pd.DataFrame({'Num_Rows':Rows,'Rik_Kraan':Rik_Kraan,'Willacya':Willacya}).set_index('Num_Rows').plot()

Comparison of algorithms

Rik Kraan

We can also use numpy broadcasting. Essentially, for each row we want to count how many rows within a time interval of 10 hours have the same name .

First make numpy arrays of the columns of interest

dates = df['Date'].values
name = df['Name'].values

Second create a mask by comparing the rows with each other. This yields an array of shape number_of_rows * number_of_rows

(dates[:, None] <= (dates+pd.Timedelta(10, 'H'))) & (dates[:, None] >= dates) & (name[:, None] == name)

Finally We can take the sum of each columns, which provide us with the total number of identical names in the next 10 hours, and assign this to a new column.

df.assign(Total=np.sum((dates[:, None] <= (dates+pd.Timedelta(10, 'H'))) & (dates[:, None] >= dates) & (name[:, None] == name), axis=0))

この記事はインターネットから収集されたものであり、転載の際にはソースを示してください。

侵害の場合は、連絡してください[email protected]

編集
0

コメントを追加

0

関連記事

分類Dev

Select rows where initial criteria is met whilst other criteria isn't for each row

分類Dev

SparkR. SQL. Count records satisfying criteria within rolling time window using timestamps

分類Dev

In a Pandas dataframe, how to filter a set of rows based on a start row and end row both satisfying different conditions?

分類Dev

Count how many times a rows enter time is within the enter and exit times of all other rows

分類Dev

How to get rows with min values in one column, grouped by other column, while keeping other columns?

分類Dev

How to span columns differently for each row within a nested grid using CSS Grid?

分類Dev

FInd number of rows in given time window in Pyspark

分類Dev

Collapse columns in some rows to the row with least number of columns in R

分類Dev

How do I determine the number of rows and columns in a gnome-terminal window while resizing?

分類Dev

How to copy rows of specific columns based on a criteria (and then continuously update)?

分類Dev

How to find row which has the minimum overall distance from the other rows of the pandas dataframe

分類Dev

for each data frame row, find points lying within a certain range

分類Dev

Transposing rows into a single column while copying data in other columns

分類Dev

How to find number of rows column-wise while parsing an excel sheet

分類Dev

Mapping columns/rows from one dataframe to another based on row number

分類Dev

Pandas - Stack dataframes with different name and number of columns on top of each other

分類Dev

Count number of columns with some values for each row in pandas

分類Dev

How to find if a date is within a given time interval

分類Dev

R find min and max for each group based on other row

分類Dev

How to get the row number of the cell in a for each loop

分類Dev

How to find an element in other rows of a dataframe in R

分類Dev

How to find two strings that are close to each other

分類Dev

Find pairs of teams that played against each other most number of times

分類Dev

How to find position of nonzero elements in each row?

分類Dev

How can find size of each Row in Apache spark sql dataframe and discrad the rows having size more than a threshold size in Kilobyte

分類Dev

how to find time difference from row names?

分類Dev

How to reduce some values from a list within each row in a dataframe

分類Dev

INDEX/MATCH with multiple criteria in rows and columns

分類Dev

Find rows in data frame with certain columns are duplicated, then combine the the elements in other columns

Related 関連記事

  1. 1

    Select rows where initial criteria is met whilst other criteria isn't for each row

  2. 2

    SparkR. SQL. Count records satisfying criteria within rolling time window using timestamps

  3. 3

    In a Pandas dataframe, how to filter a set of rows based on a start row and end row both satisfying different conditions?

  4. 4

    Count how many times a rows enter time is within the enter and exit times of all other rows

  5. 5

    How to get rows with min values in one column, grouped by other column, while keeping other columns?

  6. 6

    How to span columns differently for each row within a nested grid using CSS Grid?

  7. 7

    FInd number of rows in given time window in Pyspark

  8. 8

    Collapse columns in some rows to the row with least number of columns in R

  9. 9

    How do I determine the number of rows and columns in a gnome-terminal window while resizing?

  10. 10

    How to copy rows of specific columns based on a criteria (and then continuously update)?

  11. 11

    How to find row which has the minimum overall distance from the other rows of the pandas dataframe

  12. 12

    for each data frame row, find points lying within a certain range

  13. 13

    Transposing rows into a single column while copying data in other columns

  14. 14

    How to find number of rows column-wise while parsing an excel sheet

  15. 15

    Mapping columns/rows from one dataframe to another based on row number

  16. 16

    Pandas - Stack dataframes with different name and number of columns on top of each other

  17. 17

    Count number of columns with some values for each row in pandas

  18. 18

    How to find if a date is within a given time interval

  19. 19

    R find min and max for each group based on other row

  20. 20

    How to get the row number of the cell in a for each loop

  21. 21

    How to find an element in other rows of a dataframe in R

  22. 22

    How to find two strings that are close to each other

  23. 23

    Find pairs of teams that played against each other most number of times

  24. 24

    How to find position of nonzero elements in each row?

  25. 25

    How can find size of each Row in Apache spark sql dataframe and discrad the rows having size more than a threshold size in Kilobyte

  26. 26

    how to find time difference from row names?

  27. 27

    How to reduce some values from a list within each row in a dataframe

  28. 28

    INDEX/MATCH with multiple criteria in rows and columns

  29. 29

    Find rows in data frame with certain columns are duplicated, then combine the the elements in other columns

ホットタグ

アーカイブ