我想从开始日期和结束日期重叠的数据框中删除产品,以避免在后续步骤中重复。
示例数据:
library(dplyr)
d <-
bind_rows(
data.frame(product = 1,
start_date = as.Date("2016-01-01"),
end_date = as.Date("2016-01-10"),
stringsAsFactors = FALSE),
data.frame(product = 1,
start_date = as.Date("2016-01-02"),
end_date = as.Date("2016-01-04"),
stringsAsFactors = FALSE),
data.frame(product = 1,
start_date = as.Date("2016-01-05"),
end_date = as.Date("2016-06-09"),
stringsAsFactors = FALSE),
data.frame(product = 2,
start_date = as.Date("2016-01-03"),
end_date = as.Date("2016-01-07"),
stringsAsFactors = FALSE)
)
product start_date end_date
1 1 2016-01-01 2016-01-10
2 1 2016-01-02 2016-01-04
3 1 2016-01-05 2016-06-09
4 2 2016-01-03 2016-01-07
由于重叠,我想从此示例中删除第2行和第3行。
我使用了lag函数来移除彼此相邻的重叠:
d_cleaned <-
d %>%
arrange(product, start_date, end_date) %>%
mutate(overlapping = product == lag(product) & start_date <= lag(end_date) & end_date >= lag(start_date)) %>% # define overlaps
mutate(overlapping = ifelse(is.na(overlapping), FALSE, overlapping)) %>% # dont delete the first row
filter(overlapping == FALSE) %>% # remove overlaps
select(-overlapping)
product start_date end_date
1 1 2016-01-01 2016-01-10
2 1 2016-01-05 2016-06-09
3 2 2016-01-03 2016-01-07
从上面可以看出,该步骤将除去连续行上的重叠,但不是全部。
我可以通过循环来解决此问题,但我希望有人可以提出非循环解决方案,因为数据帧很大,而且每个步骤都需要一段时间。
使用non-equi
来自当前data.table
v1.9.7开发版本的联接:
require(data.table) # v1.9.7+
setDT(d) # convert 'd' to a data.table by reference
idx = d[d, on=.(product, end_date>=start_date, start_date<=end_date), mult="first", which=TRUE]
d[idx == seq_len(.N)] # .N contains the number of rows = nrow(d)
# product start_date end_date
# 1: 1 2016-01-01 2016-01-10
# 2: 1 2016-06-10 2016-06-12
# 3: 2 2016-01-03 2016-01-07
对于每一行d
(方括号内的一个),我们发现任何与样重叠的d
(在外面),即,自连接,基于提供给所述的条件on
参数,我们提取索引的的第一重叠(因为which=TRUE
和mult="first"
)。
当且仅当第一个重叠与自身重叠时,我们才将它们返回。我们丢弃所有其他间隔。
要安装devel版本,请参阅此处的安装说明。
这是更多行的基准(数据绝对不是很大):
set.seed(1L)
require(data.table) # v1.9.7+
dates = as.Date(sample(16000:17000, 1e5, TRUE), origin="1970-01-01")
dt = data.table(product=sample(100, 1e5, TRUE),
start_date = sample(dates, 1e5, TRUE),
end_date = sample(dates, 1e5, TRUE))
dt[, `:=`(start_date = pmin(start_date, end_date),
end_date = pmax(start_date, end_date))]
system.time({
idx = dt[dt, on=.(product, end_date>=start_date, start_date<=end_date), mult="first", which=TRUE, verbose=TRUE]
ans = dt[idx == seq_len(.N)] # .N contains the number of rows = nrow(d)
})
# Non-equi join operators detected ...
# forder took ... 0.01 secs
# Generating group lengths ... done in 0 secs
# Generating non-equi group ids ... done in 0.041 secs
# Recomputing forder with non-equi ids ... done in 0.005 secs
# Found 178 non-equi group(s) ...
# Starting bmerge ...done in 2.359 secs
# user system elapsed
# 2.402 0.011 2.421
head(ans)
# product start_date end_date
# 1: 71 2015-12-04 2016-03-22
# 2: 71 2014-04-12 2015-05-01
# 3: 32 2013-11-23 2015-03-18
# 4: 56 2014-07-29 2015-12-26
# 5: 88 2015-03-08 2015-03-21
# 6: 69 2014-10-31 2015-07-05
nrow(ans)
# [1] 186
本文收集自互联网,转载请注明来源。
如有侵权,请联系[email protected] 删除。
我来说两句