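I have two vectors. For each element of vector A, I want to find all the elements of vector B that satisfy a certain condition. For example, here are two data frames containing the vectors: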
person <- data.frame(name = c("Albert", "Becca", "Celine", "Dagwood"),
tickets = c(20, 24, 16, 17))
prize <- data.frame(type = c("potato", "lollipop", "yo-yo", "stickyhand",
"moodring", "figurine", "whistle", "saxophone"),
cost = c(6, 11, 13, 17, 21, 23, 25, 30))
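In this example, each person in the `person` data frame has a number of tickets from a carnival game, and each prize in the `prize` data frame has a cost. But I'm not looking for exact matches. Instead of simply buying a prize, each person randomly receives any prize whose cost is within a tolerance of 5 tickets of the number of tickets they hold.
The output I'm looking for is a data frame of all the possible prizes each person could win. It would look like this: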
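To make the matching rule concrete, here is a minimal sketch that checks the condition for a single person (Albert, with 20 tickets). The variable names `tickets`, `tolerance` and `eligible` are illustrative only and are not part of the original code.

```r
# Prize table from the example above
prize <- data.frame(type = c("potato", "lollipop", "yo-yo", "stickyhand",
                             "moodring", "figurine", "whistle", "saxophone"),
                    cost = c(6, 11, 13, 17, 21, 23, 25, 30),
                    stringsAsFactors = FALSE)

tickets   <- 20  # Albert's ticket count (illustrative variable)
tolerance <- 5   # tolerance stated in the question

# Keep every prize whose cost is within `tolerance` of the ticket count
eligible <- prize$type[abs(tickets - prize$cost) <= tolerance]
print(eligible)
# "stickyhand" "moodring" "figurine" "whistle"
```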
person prize
1 Albert stickyhand
2 Albert moodring
3 Albert figurine
4 Albert whistle
5 Becca moodring
6 Becca figurine
... ...
And so on. Right now I'm doing this with lapply(), but that is really no faster than a for() loop in R.
library(dplyr)
matching_Function <- function(person, prize, tolerance = 5){
matchlist <- lapply(split(person, list(person$name)),
function(x) filter(prize, abs(x$tickets-cost)<=tolerance)$type)
longlist <- data.frame("person" = rep(names(matchlist),
times = unlist(lapply(matchlist, length))),
"prize" = unname(unlist(matchlist))
)
return(longlist)
}
matching_Function(person, prize)
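My actual data sets are much larger (hundreds of thousands of rows), and my matching criteria are more complicated (checking coordinates from B to see whether they fall within a set radius of coordinates from A), so this takes forever (several hours).
Is there a smarter way to solve this than for() and lapply()?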
An alternative using foverlaps from data.table does what you want:
library(data.table)
# Convert both data frames to data.tables by reference
setDT(person)
setDT(prize)
# Tolerance of 5, as in the question
tolerance <- 5
# Add the lower and upper bounds of each person's tolerance window
person[, `:=`(start = tickets - tolerance, end = tickets + tolerance)]
# Add a dummy column so each prize cost forms a zero-width interval [cost, dummy]
prize[, dummy := cost]
# Key the person table on start and end
setkey(person, start, end)
# Use foverlaps to match each prize to the persons whose window contains its cost, drop the unmatched (NA) rows and return only the name and the type of prize
r <- foverlaps(prize, person, type = "within", by.x = c("cost", "dummy"))[!is.na(name), list(name = name, prize = type)]
# Reorder the result by name instead of prize cost
setorder(r, name)
Output:
name prize
1: Albert stickyhand
2: Albert moodring
3: Albert figurine
4: Albert whistle
5: Becca moodring
6: Becca figurine
7: Becca whistle
8: Celine lollipop
9: Celine yo-yo
10: Celine stickyhand
11: Celine moodring
12: Dagwood yo-yo
13: Dagwood stickyhand
14: Dagwood moodring
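I hope the comments in the code are self-explanatory.
For the second part of the question, using coordinates and testing whether they fall within a radius: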
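The same range match can also be expressed as a data.table non-equi join. The following is only a sketch of that alternative technique, not the method used in the answer above; the column names `lo` and `hi` and the result name `res` are assumptions made for illustration.

```r
library(data.table)

person <- data.table(name = c("Albert", "Becca", "Celine", "Dagwood"),
                     tickets = c(20, 24, 16, 17))
prize <- data.table(type = c("potato", "lollipop", "yo-yo", "stickyhand",
                             "moodring", "figurine", "whistle", "saxophone"),
                    cost = c(6, 11, 13, 17, 21, 23, 25, 30))

tolerance <- 5
# Lower and upper bounds of each person's tolerance window (illustrative names)
person[, `:=`(lo = tickets - tolerance, hi = tickets + tolerance)]

# Non-equi join: for each person row, keep the prize rows with lo <= cost <= hi.
# nomatch = NULL drops persons that match no prize at all.
res <- prize[person, on = .(cost >= lo, cost <= hi), nomatch = NULL,
             .(name = i.name, prize = type)]
print(res)
```

Whether this is faster than foverlaps depends on the data, so it is worth timing both on a realistic sample.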
person <- structure(list(name = c("Albert", "Becca", "Celine", "Dagwood"),
x = c(26, 16, 32, 51),
y = c(92, 51, 25, 4)),
.Names = c("name", "x", "y"), row.names = c(NA, -4L), class = "data.frame")
antenas <- structure(list(name = c("A", "B", "C", "D", "E", "F", "G", "H", "I", "J", "K", "L"),
x = c(40, 25, 38, 17, 58, 19, 34, 38, 67, 26, 46, 17),
y = c(36, 72, 48, 6, 78, 41, 18, 28, 54, 8, 28, 47)),
.Names = c("name", "x", "y"), row.names = c(NA, -12L), class = "data.frame")
setDT(person)
setDT(antenas)
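# Search radius
r <- 10
# For each person, compute the offsets to every antenna and keep the antennas within the radius
results <- person[, {dx = x - antenas$x; dy = y - antenas$y; list(antena = antenas$name[dx^2 + dy^2 <= r^2])}, by = name]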
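data.table allows expressions in j, so for each person we can do the math against every antenna (in effect an outer join) and return only the relevant rows with the antenna name.
This should not consume too much memory, since the computation is done one person row at a time rather than for the whole table at once.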
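For contrast, here is a sketch of the all-at-once version, which builds the full person-by-antenna matrix of squared distances with outer(). It is handy as a cross-check on small data, but the matrix has one cell per person/antenna pair, which is the memory cost the grouped approach avoids. The names `radius`, `d2`, `hits` and `check` are illustrative assumptions.

```r
library(data.table)

person <- data.table(name = c("Albert", "Becca", "Celine", "Dagwood"),
                     x = c(26, 16, 32, 51),
                     y = c(92, 51, 25, 4))
antenas <- data.table(name = c("A", "B", "C", "D", "E", "F", "G", "H", "I", "J", "K", "L"),
                      x = c(40, 25, 38, 17, 58, 19, 34, 38, 67, 26, 46, 17),
                      y = c(36, 72, 48, 6, 78, 41, 18, 28, 54, 8, 28, 47))

radius <- 10

# Squared distances for every pair: rows are persons, columns are antennas
d2 <- outer(person$x, antenas$x, "-")^2 + outer(person$y, antenas$y, "-")^2

# Row/column indices of the pairs that fall within the radius
hits <- which(d2 <= radius^2, arr.ind = TRUE)

check <- data.table(name   = person$name[hits[, "row"]],
                    antena = antenas$name[hits[, "col"]])
setorder(check, name, antena)
print(check)
```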
The math was inspired by another question.
This gives:
> results
name antena
1: Becca L
2: Celine G
3: Celine H