I have a longitudinal dataset in a data.table, similar to the simplified example below:
> head(data)
Country ID Date Value
1: AT AT6306 2012-11-01 16.2
2: AT AT6306 2012-11-02 12.2
3: AT AT6306 2012-11-03 11.3
4: AT AT6306 2012-11-04 14.2
5: AT AT6306 2012-11-05 17.3
6: AT AT6306 2012-11-06 12.5
> tail(data)
Country ID Date Value
1: SE SE0935 2014-06-25 16.2
2: SE SE0935 2014-06-26 12.2
3: SE SE0935 2014-06-27 11.3
4: SE SE0935 2014-06-28 14.2
5: SE SE0935 2014-06-29 17.3
6: SE SE0935 2014-06-30 12.5
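ID is the panel variable; it is unique and does not overlap between countries. Considering only the unique values, Date ranges from 2012-10-23 to 2014-09-30. The Date range is clearly not the same for every ID, and there may also be missing values. To get a balanced panel, I want to fill the gaps in my dataset.
Following @akron's suggestion and adapting the answer here, I do the following: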
data2 <- data[, CJ(ID=unique(ID), Date=unique(Date))]
setkey(data2, ID, Date)
data.new <- merge(data, data2, by=c("ID", "Date"), all.y = TRUE)
setkey(data.new, ID, Date)
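With the all.y = TRUE option, R adds a row for every date missing from data. However, if a combination of ID and Date did not previously exist in data, every field other than ID and Date is left blank. That is, my data looks like this: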
> head(data.new)
Country ID Date Value
1: NA AT6306 2012-10-23 NA
2: NA AT6306 2012-10-24 NA
3: NA AT6306 2012-10-25 NA
4: NA AT6306 2012-10-26 NA
5: NA AT6306 2012-10-27 NA
6: NA AT6306 2012-10-28 NA
I do want Value to be NA, since it is missing. However, because Country does not change for a given ID, I would like that field to be filled in.
library(data.table)
DT <- data.table(dat)
setkey(DT, Date, Country, ID)
res <- DT[CJ(seq(min(Date), max(Date), by='1 day'),
unique(Country), unique(ID))]
head(res)
# Country ID Date Value
#1: AT 935 2012-11-01 NA
#2: AT 6306 2012-11-01 16.2
#3: SE 935 2012-11-01 NA
#4: SE 6306 2012-11-01 NA
#5: AT 935 2012-11-02 NA
#6: AT 6306 2012-11-02 12.2
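The output above contains pairs such as AT/935 that never occur in the data, because unique(Country) and unique(ID) are crossed independently. A minimal sketch of one way to avoid that, assuming the same daily date sequence is wanted, is to expand only the Country/ID pairs that actually exist. The names pairs, days, grid and res2 are illustrative, and the small table below merely mirrors the sample dat.

```r
library(data.table)

# Small stand-in for the sample data (assumption: mirrors `dat`)
DT <- data.table(
  Country = rep(c("AT", "SE"), each = 6),
  ID      = rep(c(6306L, 935L), each = 6),
  Date    = as.Date("2012-11-01") + c(0:5, 24:29),
  Value   = rep(c(16.2, 12.2, 11.3, 14.2, 17.3, 12.5), 2)
)

# Only the Country/ID combinations that really occur
pairs <- unique(DT[, .(Country, ID)])

# Full daily sequence over the observed date range
days <- seq(min(DT$Date), max(DT$Date), by = "1 day")

# One row per existing pair and day
grid <- pairs[, .(Date = days), by = .(Country, ID)]

# Join the original values onto the grid; days without data get Value = NA
res2 <- DT[grid, on = .(Country, ID, Date)]
```

Because Country is part of the grid, it is already filled in for every row, and no AT/935 or SE/6306 rows are created.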
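One option would be:
DT <- data.table(dat)
# Combine Country and ID into one key so only existing pairs are expanded
DT[, CountryID := paste(Country, ID)]
setkey(DT, Date, CountryID)
# Cross join every date with every Country/ID pair, then split the key back out
DT1 <- DT[CJ(unique(Date), unique(CountryID))][,
    c('Country', 'ID') := list(gsub("[ 0-9]", "", CountryID),
        as.integer(gsub("[^0-9]", "", CountryID)))][, -5L]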
head(DT1,3)
# Country ID Date Value
#1: AT 6306 2012-11-01 16.2
#2: SE 935 2012-11-01 NA
#3: AT 6306 2012-11-02 12.2
nrow(DT1)
#[1] 24
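If pasting the columns together and splitting them again with regular expressions feels fragile, a possible alternative is to build the grid on ID and Date only and then fill Country from a lookup table with an update join. This is only a sketch under the question's stated assumption that each ID belongs to exactly one Country. The names grid, balanced and lookup are illustrative, and the table mirrors the sample dat.

```r
library(data.table)

# Small stand-in for the sample data (assumption: mirrors `dat`)
DT <- data.table(
  Country = rep(c("AT", "SE"), each = 6),
  ID      = rep(c(6306L, 935L), each = 6),
  Date    = as.Date("2012-11-01") + c(0:5, 24:29),
  Value   = rep(c(16.2, 12.2, 11.3, 14.2, 17.3, 12.5), 2)
)

# Every ID crossed with every observed date
grid <- CJ(ID = unique(DT$ID), Date = unique(DT$Date))

# Join the data onto the grid; new rows have Country and Value set to NA
balanced <- DT[grid, on = .(ID, Date)]

# One row per ID with its (constant) Country
lookup <- unique(DT[, .(ID, Country)])

# Update join: overwrite Country by reference for every row of each ID
balanced[lookup, on = "ID", Country := i.Country]
```

After this, Value stays NA for the added rows while Country is filled for all of them.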
Data:
dat <- structure(list(Country = c("AT", "AT", "AT", "AT", "AT", "AT",
"SE", "SE", "SE", "SE", "SE", "SE"), ID = c(6306L, 6306L, 6306L,
6306L, 6306L, 6306L, 935L, 935L, 935L, 935L, 935L, 935L), Date = structure(c(15645,
15646, 15647, 15648, 15649, 15650, 15669, 15670, 15671, 15672,
15673, 15674), class = "Date"), Value = c(16.2, 12.2, 11.3, 14.2,
17.3, 12.5, 16.2, 12.2, 11.3, 14.2, 17.3, 12.5)), .Names = c("Country",
"ID", "Date", "Value"), row.names = c("1:", "2:", "3:", "4:",
"5:", "6:", "1:1", "2:1", "3:1", "4:1", "5:1", "6:1"), class = "data.frame")