基本上,我有一个函数需要运行接近1M次,并且要花很多时间,因为它没有向量化(我的猜测)
这个想法是,有一个pf.d.day
包含日期的参数,输出将是该日期的转换(添加/删除天数)
pf.s.Freq
将日期偏移到下一个期间。
2020年4月8日,频率=“月”将变为2020年5月1日
2020年4月8日,Freq =“ week”将变为2020年4月13日,#week从星期一开始
2020年4月8日,频率=“年”将变为2021年1月1日
library(dplyr)
library(lubridate)
fn.Delay <- function(pf.d.day, pf.s.Freq){
d.DateWithouthDelay <- as.Date(
#note: using chained ifs instead of parsing pf.s.Freq into unit to avoid errors from misspells on excel file
ifelse(pf.s.Freq == "day", as.character(ceiling_date(pf.d.day + days(1), unit = "day" )),
ifelse(pf.s.Freq == "week", as.character(ceiling_date(pf.d.day + days(1), unit = "week", week_start = 1)),
ifelse(pf.s.Freq == "month", as.character(ceiling_date(pf.d.day + days(1), unit = "month" )),
ifelse(pf.s.Freq == "quarter", as.character(ceiling_date(pf.d.day + days(1), unit = "quarter")),
ifelse(pf.s.Freq == "year", as.character(ceiling_date(pf.d.day + days(1), unit = "year" )),
ifelse(pf.s.Freq != "BiWeek", "1900-1-2", #default date if pf.s.Freq is wrong
ifelse( day(pf.d.day) < 15,
as.character(pf.d.day - day(pf.d.day) +15),
as.character(ceiling_date(pf.d.day, unit = "month")))
)))))))
return(d.DateWithouthDelay)
}
举个小例子:
data.frame(
Di = as.Date(c("2020-4-8", "2020-4-8", "2020-4-8", "2020-4-8", "2020-4-8", "2020-4-8", "2020-4-8")),
Fr = c("day", "week", "month", "quarter", "year", "BiWeek", "ups")) %>%
rowwise() %>%
mutate(Df = fn.Delay(Di, Fr)) %>%
data.frame()
此代码的主要问题是速度。主要是因为它不是矢量化的,但也可能是因为我不得不不断地在日期和字符之间切换,只是因为ifelse
喜欢弄乱日期
您的功能已向量化。删除rowwise
可以提高速度并获得相同的结果:
identical(
dd %>% mutate(Df = fn.Delay(Di, Fr)) %>% pull(Df),
dd %>%rowwise() %>% mutate(Df = fn.Delay(Di, Fr)) %>% pull(Df)
)
# TRUE
ifelse
其实不是那么糟糕。这是使用的简化版本case_when
,但性能差异vsifelse
可以忽略不计-实际上要慢一点。但是代码更干净。
fn.Delay2 <- function(pf.d.day, pf.s.Freq){
case_when(
pf.s.Freq == "day" ~ ceiling_date(pf.d.day + days(1), unit = "day"),
pf.s.Freq == "week" ~ ceiling_date(pf.d.day + days(1), unit = "week", week_start = 1),
pf.s.Freq == "month" ~ ceiling_date(pf.d.day + days(1), unit = "month" ),
pf.s.Freq == "quarter" ~ ceiling_date(pf.d.day + days(1), unit = "quarter"),
pf.s.Freq == "year" ~ ceiling_date(pf.d.day + days(1), unit = "year" ),
pf.s.Freq != "BiWeek" ~ as.Date("1900-1-2"), #default date if pf.s.Freq is wrong
day(pf.d.day) < 15 ~ pf.d.day - day(pf.d.day) + 15,
TRUE ~ ceiling_date(pf.d.day, unit = "month")
)
}
microbenchmark::microbenchmark(
rowwise = dd %>%rowwise() %>% mutate(Df = fn.Delay(Di, Fr)),
vectorized = dd %>% mutate(Df = fn.Delay(Di, Fr)),
case_when = dd %>% mutate(Df = fn.Delay2(Di, Fr))
)
# Unit: milliseconds
# expr min lq mean median uq max neval
# rowwise 10.0593 12.47230 13.59725 13.00590 14.1138 30.3810 100
# vectorized 7.5237 7.97235 10.21504 10.26205 10.7905 25.7858 100
# case_when 7.7331 8.43595 10.42024 10.54705 11.1035 21.4732 100
本文收集自互联网,转载请注明来源。
如有侵权,请联系[email protected] 删除。
我来说两句