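I have two datasets. The first holds daily weather information, including the mean temperature and heating degree days (HDD). I have a complete set of weather data for every day. A small sample of the weather data looks like this: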
weather:
weather.station | date | temp | HDD
A | 11/30/2013 | 30 | 35
A | 12/01/2013 | 28 | 37
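The second dataset contains data for multiple homes, each with its own date ranges; it shows the fuel usage (the electric.usage column below) for each range. For example: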
home.bills:
home.id | start.date | end.date | electric.usage | weather.station
1 | 11/15/2013 | 12/14/2013 | 80 | A
1 | 12/15/2013 | 1/14/2014 | 85 | A
2 | 11/18/2013 | 12/15/2013 | 60 | A
2 | 12/16/2013 | 1/13/2014 | 57 | A
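I am looking for an efficient way to combine the two datasets in bulk, so that I can extend the home.bills information with weather conditions (for example, the mean temperature over the date range and the sum of HDD over the date range).
The output might look like this: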
output:
home.id | start.date | end.date | electric.usage | mean.temp | sum.HDD
1 | 11/15/2013 | 12/14/2013 | 80 | 32.8 | 937
1 | 12/15/2013 | 1/14/2014 | 85 | 29.7 | 1122
2 | 11/18/2013 | 12/15/2013 | 60 | 31.7 | 944
2 | 12/16/2013 | 1/13/2014 | 57 | 28.8 | 1201
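Can anyone suggest a simpler way to join these two datasets? I know how to do an ordinary join, but not how to combine the two datasets over a date range.
Here are two ways to do this. The first uses sqldf, but requires(?) renaming the columns (because SQL does not like column names containing "."). Conceptually, it is the simpler of the two.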
weather <- structure(list(weather.station = c("A", "A"), date = c("11/30/2013 ", "12/01/2013 "), temp = c(30, 28), HDD = c(35L, 37L)), .Names = c("weather.station", "date", "temp", "HDD"), class = "data.frame", row.names = c(NA, -2L))
home.bills <- structure(list(home.id = c(1, 1, 2, 2), start.date = c(" 11/15/2013 ", " 12/15/2013 ", " 11/18/2013 ", " 12/16/2013 "), end.date = c(" 12/14/2013 ", " 1/14/2014 ", " 12/15/2013 ", " 1/13/2014 "), electric.usage = c(80, 85, 60, 57), weather.station = c("A", "A", "A", "A")), .Names = c("home.id", "start.date", "end.date", "electric.usage", "weather.station"), class = "data.frame", row.names = c(NA, -4L))
# dates need to be Dates, not character
weather$date <- as.Date(weather$date,format="%m/%d/%Y")
home.bills$start.date <- as.Date(home.bills$start.date,format="%m/%d/%Y")
home.bills$end.date <- as.Date(home.bills$end.date,format="%m/%d/%Y")
# sqldf does not like "." in column names, so replace each "." with "_"
colnames(weather) <- gsub(".","_",colnames(weather),fixed=TRUE)
colnames(home.bills) <- gsub(".","_",colnames(home.bills),fixed=TRUE)
library(sqldf)
sqldf("select a.*, avg(temp) as mean_temp, sum(HDD) as sum_HDD
from [home.bills] a join weather b
on b.date>=a.start_date and b.date<=a.end_date
and a.weather_station=b.weather_station
group by home_id, start_date, end_date")
#   home_id start_date   end_date electric_usage weather_station mean_temp sum_HDD
# 1       1 2013-11-15 2013-12-14             80               A        29      72
# 2       2 2013-11-18 2013-12-15             60               A        29      72
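To make the join condition in the query above concrete, here is a minimal base R sketch of the same logic: for each bill, select the weather rows for the same station whose date falls inside the billing period, then aggregate. It is not part of the original answer; the names per.bill, in.range, n.days and base.result are made up for illustration, and the row-by-row loop is only meant to show the idea, not to be fast on large data.

```r
# Toy data mirroring the example above, with dates already parsed as Date
weather <- data.frame(
  weather.station = c("A", "A"),
  date = as.Date(c("2013-11-30", "2013-12-01")),
  temp = c(30, 28),
  HDD = c(35L, 37L)
)
home.bills <- data.frame(
  home.id = c(1, 1, 2, 2),
  start.date = as.Date(c("2013-11-15", "2013-12-15", "2013-11-18", "2013-12-16")),
  end.date = as.Date(c("2013-12-14", "2014-01-14", "2013-12-15", "2014-01-13")),
  electric.usage = c(80, 85, 60, 57),
  weather.station = c("A", "A", "A", "A")
)

# For each bill, pick the weather rows for the same station whose date
# falls inside [start.date, end.date], then aggregate them.
per.bill <- lapply(seq_len(nrow(home.bills)), function(i) {
  bill <- home.bills[i, ]
  in.range <- weather$weather.station == bill$weather.station &
    weather$date >= bill$start.date &
    weather$date <= bill$end.date
  data.frame(
    n.days = sum(in.range),                    # number of matching weather rows
    mean.temp = mean(weather$temp[in.range]),  # NaN when nothing matches
    sum.HDD = sum(weather$HDD[in.range])       # 0 when nothing matches
  )
})
base.result <- cbind(home.bills, do.call(rbind, per.bill))

# Keep only bills with at least one weather row, like the inner join above
base.result <- base.result[base.result$n.days > 0, ]
print(base.result)
```

With the two-row weather sample, only the first bill of each home overlaps any weather data, which is why the query above returns two rows rather than four.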
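The second uses the foverlaps(...) function from the data.table package. It is very powerful and very fast. Note that this solution is nearly identical to one that @akrun posted a few hours ago and then deleted (I'd love to know why?).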
# data.table solution
weather <- structure(list(weather.station = c("A", "A"), date = c("11/30/2013 ", "12/01/2013 "), temp = c(30, 28), HDD = c(35L, 37L)), .Names = c("weather.station", "date", "temp", "HDD"), class = "data.frame", row.names = c(NA, -2L))
home.bills <- structure(list(home.id = c(1, 1, 2, 2), start.date = c(" 11/15/2013 ", " 12/15/2013 ", " 11/18/2013 ", " 12/16/2013 "), end.date = c(" 12/14/2013 ", " 1/14/2014 ", " 12/15/2013 ", " 1/13/2014 "), electric.usage = c(80, 85, 60, 57), weather.station = c("A", "A", "A", "A")), .Names = c("home.id", "start.date", "end.date", "electric.usage", "weather.station"), class = "data.frame", row.names = c(NA, -4L))
library(data.table)  # requires data.table >= 1.9.4 for foverlaps()
# convert to data.tables and convert date to Date
setDT(weather)[,date:=as.Date(date,format="%m/%d/%Y")]
setDT(home.bills)[,(2:3):=lapply(.SD,as.Date,format="%m/%d/%Y"),.SDcols=2:3]
# need start.date and end.date in weather data.table (both = date)
weather[,c("start.date","end.date"):=list(date,date)]
setkey(home.bills,weather.station,start.date,end.date)
# calculate overlaps
result <- foverlaps(weather,home.bills,nomatch=0)
# aggregate
result[,list(mean.temp=mean(temp),sum.HDD=sum(HDD)),
by=list(home.id,start.date,end.date,electric.usage,weather.station)]
#    home.id start.date   end.date electric.usage weather.station mean.temp sum.HDD
# 1:       1 2013-11-15 2013-12-14             80               A        29      72
# 2:       2 2013-11-18 2013-12-15             60               A        29      72
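As a side note that is not part of the original answer: more recent data.table releases also accept inequality conditions directly in the on= argument (a "non-equi join"), which avoids adding the duplicate start.date/end.date columns to weather. The sketch below assumes such a release is installed (nomatch = NULL in particular is a newer spelling of nomatch = 0) and uses the same toy data; joined and nonequi.result are illustrative names.

```r
library(data.table)

# Toy data mirroring the example above, with dates already parsed as Date
weather <- data.table(
  weather.station = c("A", "A"),
  date = as.Date(c("2013-11-30", "2013-12-01")),
  temp = c(30, 28),
  HDD = c(35L, 37L)
)
home.bills <- data.table(
  home.id = c(1, 1, 2, 2),
  start.date = as.Date(c("2013-11-15", "2013-12-15", "2013-11-18", "2013-12-16")),
  end.date = as.Date(c("2013-12-14", "2014-01-14", "2013-12-15", "2014-01-13")),
  electric.usage = c(80, 85, 60, 57),
  weather.station = c("A", "A", "A", "A")
)

# Non-equi join: for every bill, match the weather rows of the same station
# whose date lies between start.date and end.date (inclusive).
# nomatch = NULL drops bills that have no weather rows at all.
joined <- weather[home.bills,
  .(home.id, start.date, end.date, electric.usage, weather.station, temp, HDD),
  on = .(weather.station, date >= start.date, date <= end.date),
  nomatch = NULL]

# Aggregate the matched weather rows per bill, as in the foverlaps version
nonequi.result <- joined[,
  .(mean.temp = mean(temp), sum.HDD = sum(HDD)),
  by = .(home.id, start.date, end.date, electric.usage, weather.station)]
print(nonequi.result)
```

Both approaches express the same range condition; foverlaps(...) needs a keyed interval table, while the on= form states the inequalities inline.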