How do I optimize sapply in R to calculate running totals on a dataframe

Jonathan

I wrote a function in R to calculate cumulative totals by month number, but the execution time of my method grows exponentially as the dataset gets larger. I'm a novice R programmer, can you help me make this more efficient?
The function and the way I invoke the function:

accumulate <- function(recordnum,df){
    sumthese <- (df$subject == df$subject[recordnum]) &
        (df$month <= df$month[recordnum])
    sum(df$measurement[sumthese])
}
set.seed(42)
datalength = 10
df <- data.frame(measurement = runif(1:datalength),
                 subject=rep(c("dog","cat"),each =datalength/2),
                 month=rep(seq(datalength/2,1,by=-1)))
system.time(df$cumulative <- sapply(1:datalength,accumulate,df))

The input dataframe:

> df
   measurement subject month
1    0.4577418     dog     5
2    0.7191123     dog     4
3    0.9346722     dog     3
4    0.2554288     dog     2
5    0.4622928     dog     1
6    0.9400145     cat     5
7    0.9782264     cat     4
8    0.1174874     cat     3
9    0.4749971     cat     2
10   0.5603327     cat     1

The output dataframe:

> df
   measurement subject month cumulative
1    0.9148060     dog     5  3.6102141
2    0.9370754     dog     4  2.6954081
3    0.2861395     dog     3  1.7583327
4    0.8304476     dog     2  1.4721931
5    0.6417455     dog     1  0.6417455
6    0.5190959     cat     5  2.7524079
7    0.7365883     cat     4  2.2333120
8    0.1346666     cat     3  1.4967237
9    0.6569923     cat     2  1.3620571
10   0.7050648     cat     1  0.7050648

Notice the cumulative column shows the accumulation of all measurements up to and including the current month. The function does not require the dataframe to be sorted. When the datalength equals 100, the elapsed time is 0.3. 1000 is 0.58. 10,000 = 27.72. I need this to run for 200K+ records.
Thanks!

G. Grothendieck

This is non-destructive, i.e. the original df is not modified. No packages are used. The original order of the rows of df is preserved; however, if that is not important then [order(o), ] on the last line can be omitted.

o <- order(df$subject, df$month)
transform(df[o, ], cumulative = ave(measurement, subject, FUN = cumsum))[order(o), ]

giving:

   measurement subject month cumulative
1   0.37955924     dog     5  2.2580530
2   0.43577158     dog     4  1.8784938
3   0.03743103     dog     3  1.4427222
4   0.97353991     dog     2  1.4052912
5   0.43175125     dog     1  0.4317512
6   0.95757660     cat     5  4.0751151
7   0.88775491     cat     4  3.1175385
8   0.63997877     cat     3  2.2297836
9   0.97096661     cat     2  1.5898048
10  0.61883821     cat     1  0.6188382

Collected from the Internet

Please contact [email protected] to delete if infringement.

edited at
0

Comments

0 comments
Login to comment

Related

From Dev

How do I calculate partial totals in crossfilter?

From Dev

How do i do running totals from second column

From Dev

Function to calculate multiple running totals

From Dev

How do I optimize* this?

From Dev

How Do I Calculate A Running Total That Factors Both Date and Time?

From Dev

R: How to make sapply calculate mean over rows?

From Dev

How do I optimize TeamViewer?

From Dev

How do I calculate total number of unique values in a dataframe?

From Dev

How do I calculate an average of a range from a series within in a dataframe?

From Dev

How do I calculate total number of unique values in a dataframe?

From Dev

How to efficiently calculate running maxima in a Pandas dataframe?

From Dev

How do I add a totals / averages row to a crosstab row group?

From Dev

How do I transform my maths totals into a time duration?

From Dev

How do I add variable pricing and totals to my booking form?

From Dev

elasticsearch: how do I groupby a field and avg up totals?

From Dev

How do I update data frame variables with sapply results?

From Dev

How do I use arguments of a function when using sapply?

From Dev

How do I add rows to a data frame using sapply?

From Dev

How do I calculate the gradient of a matrix to draw a vector field in R?

From Dev

How do I calculate R-squared value in JavaScript?

From Dev

How do I calculate mean on filtered rows of a pandas dataframe and append means to all columns of original dataframe?

From Dev

How do I optimize the SQL for this Django view?

From Dev

How do I optimize my query in MySQL?

From Dev

How do I optimize a select query in Activerecord?

From Dev

How do I optimize this SharePoint query for Terms?

From Dev

How do I best optimize Picasso in a GridView?

From Dev

How do I optimize this query with whereHas?

From Dev

How do I optimize my query in MySQL?

From Dev

How do I optimize this simple matlab code?

Related Related

  1. 1

    How do I calculate partial totals in crossfilter?

  2. 2

    How do i do running totals from second column

  3. 3

    Function to calculate multiple running totals

  4. 4

    How do I optimize* this?

  5. 5

    How Do I Calculate A Running Total That Factors Both Date and Time?

  6. 6

    R: How to make sapply calculate mean over rows?

  7. 7

    How do I optimize TeamViewer?

  8. 8

    How do I calculate total number of unique values in a dataframe?

  9. 9

    How do I calculate an average of a range from a series within in a dataframe?

  10. 10

    How do I calculate total number of unique values in a dataframe?

  11. 11

    How to efficiently calculate running maxima in a Pandas dataframe?

  12. 12

    How do I add a totals / averages row to a crosstab row group?

  13. 13

    How do I transform my maths totals into a time duration?

  14. 14

    How do I add variable pricing and totals to my booking form?

  15. 15

    elasticsearch: how do I groupby a field and avg up totals?

  16. 16

    How do I update data frame variables with sapply results?

  17. 17

    How do I use arguments of a function when using sapply?

  18. 18

    How do I add rows to a data frame using sapply?

  19. 19

    How do I calculate the gradient of a matrix to draw a vector field in R?

  20. 20

    How do I calculate R-squared value in JavaScript?

  21. 21

    How do I calculate mean on filtered rows of a pandas dataframe and append means to all columns of original dataframe?

  22. 22

    How do I optimize the SQL for this Django view?

  23. 23

    How do I optimize my query in MySQL?

  24. 24

    How do I optimize a select query in Activerecord?

  25. 25

    How do I optimize this SharePoint query for Terms?

  26. 26

    How do I best optimize Picasso in a GridView?

  27. 27

    How do I optimize this query with whereHas?

  28. 28

    How do I optimize my query in MySQL?

  29. 29

    How do I optimize this simple matlab code?

HotTag

Archive