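I am taking a series of medians and checking whether each one falls inside any of several ranges, then storing the medians that do match along with the label associated with them. The code works, but my files are too large for this iterative approach. Is there a faster way to make these comparisons and record the matches in a data frame?
tfFile is structured as: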
V1 V2 V3 V4 Center_Point
1 chr3 158289024 158289224 CMYC 158289124
2 chr1 242601432 242601632 KLF4 242601532
3 chr11 85912879 85913079 CMYC 85912979
4 chr14 86369800 86370000 SOX2 86369900
5 chr6 8397251 8397451 SOX2 8397351
6 chr3 123709437 123709637 SOX2 123709537
The ranges are structured as:
V1 V2 V3
1 chr1 11323785 11617177
2 chr1 12645605 13926923
3 chr1 14750216 15119039
4 chr1 18102157 19080189
5 chr1 29491029 30934636
6 chr1 33716472 35395979
Here is the code:
tfFile = read.table("medianfile.txt", sep= "", stringsAsFactors=FALSE)
ranges = read.table("ranges.txt", sep= "", stringsAsFactors=FALSE)
centerdf <- data.frame('Center_Point' = numeric(0))
Center_Point<-apply(tfFile[c('V2', 'V3')], 1, median, na.rm=TRUE)
data<-cbind(tfFile,Center_Point)
tempdf = data.frame( 'Center_Point' = numeric(0), "TF" = character(0),stringsAsFactors = FALSE)
generatedata<-function(data, ranges){
  matchesdf <- data.frame( 'Center_Point' = numeric(0), "TF" = character(0), stringsAsFactors = FALSE)
  #Making the comparisons
  for(j in 1:nrow(data)){
    for(k in 1:nrow(ranges)){
      #if the value falls within the LADs
      if(data$Center_Point[j]< ranges$V3[k] && data$Center_Point[j]>ranges$V2[k]){
        tempdf<-data.frame(Center_Point = data$Center_Point[j], TF = data$V4[j])
        matchesdf <- rbind(matchesdf, tempdf)
      }
    }
  }
  return(matchesdf)
}
a<-generatedata(data, ranges)
See my comment. I'm not sure exactly what you are trying to do, but this smells like a data.table join. I've copied your tables as data.tables, so that:
> d1
chr low high sthg mid
1: chr1 242601432 242601632 KLF4 242601532
2: chr11 85912879 85913079 CMYC 85912979
3: chr14 86369800 86370000 SOX2 86369900
4: chr3 158289024 158289224 CMYC 158289124
5: chr3 123709437 123709637 SOX2 123709537
6: chr6 8397251 8397451 SOX2 8397351
> d2
chr range.low range.high
1: chr1 11323785 11617177
2: chr1 12645605 13926923
3: chr1 14750216 15119039
4: chr1 18102157 19080189
5: chr1 29491029 30934636
6: chr1 33716472 35395979
I have also done:
setkey(d1,chr)
setkey(d2,chr)
Now I can join them on the chr column, so that wherever chr matches you see every range:
> d2[d1]
chr range.low range.high low high sthg mid
1: chr1 11323785 11617177 242601432 242601632 KLF4 242601532
2: chr1 12645605 13926923 242601432 242601632 KLF4 242601532
3: chr1 14750216 15119039 242601432 242601632 KLF4 242601532
4: chr1 18102157 19080189 242601432 242601632 KLF4 242601532
5: chr1 29491029 30934636 242601432 242601632 KLF4 242601532
6: chr1 33716472 35395979 242601432 242601632 KLF4 242601532
7: chr11 NA NA 85912879 85913079 CMYC 85912979
8: chr14 NA NA 86369800 86370000 SOX2 86369900
9: chr3 NA NA 158289024 158289224 CMYC 158289124
10: chr3 NA NA 123709437 123709637 SOX2 123709537
11: chr6 NA NA 8397251 8397451 SOX2 8397351
Now you can make a single pass with one simple data.table operation and determine whether the midpoint falls within the range:
d <- d2[d1]
d[!is.na(range.low+range.high),
falls.in.range:=(range.low <= mid & mid <= range.high)]
d
chr range.low range.high low high sthg mid falls.in.range
1: chr1 11323785 11617177 242601432 242601632 KLF4 242601532 FALSE
2: chr1 12645605 13926923 242601432 242601632 KLF4 242601532 FALSE
3: chr1 14750216 15119039 242601432 242601632 KLF4 242601532 FALSE
4: chr1 18102157 19080189 242601432 242601632 KLF4 242601532 FALSE
5: chr1 29491029 30934636 242601432 242601632 KLF4 242601532 FALSE
6: chr1 33716472 35395979 242601432 242601632 KLF4 242601532 FALSE
7: chr11 NA NA 85912879 85913079 CMYC 85912979 NA
8: chr14 NA NA 86369800 86370000 SOX2 86369900 NA
9: chr3 NA NA 158289024 158289224 CMYC 158289124 NA
10: chr3 NA NA 123709437 123709637 SOX2 123709537 NA
11: chr6 NA NA 8397251 8397451 SOX2 8397351 NA
This is not a great example, because none of the chr1 cases happens to satisfy the condition, but hopefully it gets the point across.
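To get from the falls.in.range flag to the two-column result the question asks for, one option is to filter on the flag and keep only the midpoint and its label. The following is a minimal sketch rather than a definitive solution: the tables are small toy versions of d1 and d2, the first midpoint (11400000) is made up so that at least one row matches, and the output column names Center_Point and TF are borrowed from the question's code. Note that it uses the inclusive bounds from this answer, whereas the question's loop uses strict inequalities.

```r
library(data.table)

# Toy data: the first midpoint is hypothetical, chosen to land inside a range.
d1 <- data.table(
  chr  = c("chr1", "chr1", "chr3"),
  sthg = c("KLF4", "KLF4", "CMYC"),
  mid  = c(11400000, 242601532, 158289124)
)
d2 <- data.table(
  chr        = c("chr1", "chr1"),
  range.low  = c(11323785, 12645605),
  range.high = c(11617177, 13926923)
)
setkey(d1, chr)
setkey(d2, chr)

# Keyed join on chr; allow.cartesian = TRUE permits the many-to-many expansion.
d <- d2[d1, allow.cartesian = TRUE]

# One vectorized pass; rows with no range on their chromosome evaluate to NA.
d[, falls.in.range := range.low <= mid & mid <= range.high]

# Keep only the hits (data.table treats NA in a logical i as FALSE).
matches <- d[falls.in.range == TRUE, .(Center_Point = mid, TF = sthg)]
print(matches)
```

On real data the joined table can get large, since every midpoint is paired with every range on the same chromosome before the filter is applied.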
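The key thing to note is that data.table joins are very fast, so if you choose the join columns well, you should be able to take advantage of a fast join followed by a single pass over the resulting table, even when the tables are large. Depending on your specific problem, you may need to consider a cross join. (See also ?CJ, and possibly allow.cartesian in ?data.table.)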
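Recent versions of data.table also accept inequality conditions in the on argument (a non-equi join), which folds the range test into the join itself so that only matching pairs are returned. This is a different technique from the keyed join shown above, offered as a hedged sketch: it assumes a midpoint should only be compared with ranges on the same chromosome, it uses strict inequalities as in the question's loop, and the toy midpoint 11400000 is made up so that one row matches.

```r
library(data.table)

# Toy data: the first midpoint is hypothetical, chosen to land inside a range.
d1 <- data.table(
  chr  = c("chr1", "chr1", "chr3"),
  sthg = c("KLF4", "KLF4", "CMYC"),
  mid  = c(11400000, 242601532, 158289124)
)
d2 <- data.table(
  chr        = c("chr1", "chr1"),
  range.low  = c(11323785, 12645605),
  range.high = c(11617177, 13926923)
)

# Non-equi join: same chr, and range.low < mid < range.high.
# In on = .(), the left side of each condition is a column of d2 (x)
# and the right side is a column of d1 (i).
# nomatch = NULL drops midpoints that fall in no range.
matches <- d2[d1,
              on = .(chr, range.low < mid, range.high > mid),
              nomatch = NULL,
              .(Center_Point = i.mid, TF = i.sthg)]
print(matches)
```

If chromosome really is irrelevant to the comparison, dropping chr from the on list would compare every midpoint with every range.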
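Edit: If what you really want is to know, for every midpoint and every range, whether that midpoint lies inside that range, then you are in cross-join territory. Note that this means you are essentially treating the "chr1"-style and "KLF4"-style columns as irrelevant to the problem. In that case, I would probably do the following: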
d1[,observation.ID:=.I]
setkey(d1,observation.ID)
d2[,range.ID:=.I]
setkey(d2,range.ID)
d <- CJ(observation.ID=d1[,observation.ID],range.ID=d2[,range.ID])
setkey(d,observation.ID)
d[d1,mid:=i.mid]
setkey(d,range.ID)
d[d2,c("range.low","range.high"):=.(i.range.low,i.range.high)]
d[,falls.in.range:=range.low <= mid & mid <= range.high]
> d
observation.ID range.ID mid range.low range.high falls.in.range
1: 1 1 242601532 11323785 11617177 FALSE
2: 2 1 85912979 11323785 11617177 FALSE
3: 3 1 86369900 11323785 11617177 FALSE
4: 4 1 158289124 11323785 11617177 FALSE
5: 5 1 123709537 11323785 11617177 FALSE
6: 6 1 8397351 11323785 11617177 FALSE
7: 1 2 242601532 12645605 13926923 FALSE
8: 2 2 85912979 12645605 13926923 FALSE
9: 3 2 86369900 12645605 13926923 FALSE
10: 4 2 158289124 12645605 13926923 FALSE
11: 5 2 123709537 12645605 13926923 FALSE
12: 6 2 8397351 12645605 13926923 FALSE
13: 1 3 242601532 14750216 15119039 FALSE
14: 2 3 85912979 14750216 15119039 FALSE
15: 3 3 86369900 14750216 15119039 FALSE
16: 4 3 158289124 14750216 15119039 FALSE
17: 5 3 123709537 14750216 15119039 FALSE
18: 6 3 8397351 14750216 15119039 FALSE
19: 1 4 242601532 18102157 19080189 FALSE
20: 2 4 85912979 18102157 19080189 FALSE
21: 3 4 86369900 18102157 19080189 FALSE
22: 4 4 158289124 18102157 19080189 FALSE
23: 5 4 123709537 18102157 19080189 FALSE
24: 6 4 8397351 18102157 19080189 FALSE
25: 1 5 242601532 29491029 30934636 FALSE
26: 2 5 85912979 29491029 30934636 FALSE
27: 3 5 86369900 29491029 30934636 FALSE
28: 4 5 158289124 29491029 30934636 FALSE
29: 5 5 123709537 29491029 30934636 FALSE
30: 6 5 8397351 29491029 30934636 FALSE
31: 1 6 242601532 33716472 35395979 FALSE
32: 2 6 85912979 33716472 35395979 FALSE
33: 3 6 86369900 33716472 35395979 FALSE
34: 4 6 158289124 33716472 35395979 FALSE
35: 5 6 123709537 33716472 35395979 FALSE
36: 6 6 8397351 33716472 35395979 FALSE
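(You can join the other detail columns back in afterwards, e.g. setkey(d,observation.ID);setkey(d1,observation.ID);d[d1,sthg:=i.sthg], to get the "KLF4" column as I have named it.) Note, however, that this may not save much time. If you are checking every midpoint against every range, the number of comparisons stays the same; the only speedup comes from doing them in a well-vectorized data.table expression instead of in nested for loops. So I am not sure this will be any better for your large tables. Maybe try it and report back?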
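For comparison, the same all-against-all check can be vectorized in base R with outer, which replaces both for loops and the repeated rbind. This is only a sketch under stated assumptions: the vectors are made-up toy values (not taken from the question's files), chromosome is ignored as in the question's loop, and the full logical matrix has one cell per midpoint/range pair, so memory grows with the product of the two table sizes.

```r
# Toy inputs (hypothetical values chosen so that two midpoints match).
mid  <- c(8397351, 11400000, 13000000)
tf   <- c("SOX2", "KLF4", "CMYC")
low  <- c(11323785, 12645605)
high <- c(11617177, 13926923)

# inside[i, j] is TRUE when low[j] < mid[i] < high[j] (strict, as in the loop).
inside <- outer(mid, low, ">") & outer(mid, high, "<")

# Row/column indices of every TRUE cell, ordered by midpoint like the loop.
hits <- which(inside, arr.ind = TRUE)
hits <- hits[order(hits[, "row"]), , drop = FALSE]

# One row per (midpoint, range) hit, built in a single step.
matchesdf <- data.frame(
  Center_Point = mid[hits[, "row"]],
  TF           = tf[hits[, "row"]],
  stringsAsFactors = FALSE
)
print(matchesdf)
```

Because every pair is still tested, this changes the constant factor rather than the amount of work, which is the caveat described above.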
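Update regarding the typo: see the example below, which compares && (incorrect in this case) with & (correct in this case). As you pointed out, && evaluates only the first element of each vector, whereas & compares the vectors element by element and returns a vector. So when you intend a row-by-row comparison, the single value produced by && is recycled down the column and gives wrong results. (Since R 4.3.0, passing a vector longer than one to && is an error rather than a silent use of the first element, so the first line below no longer runs on current R.)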
> d1[,using.double.and:=low < mid && mid==242601532]
> d1[,using.single.and:=low < mid & mid==242601532]
> d1
chr low high sthg mid observation.ID using.double.and using.single.and
1: chr1 242601432 242601632 KLF4 242601532 1 TRUE TRUE
2: chr11 85912879 85913079 CMYC 85912979 2 TRUE FALSE
3: chr14 86369800 86370000 SOX2 86369900 3 TRUE FALSE
4: chr3 158289024 158289224 CMYC 158289124 4 TRUE FALSE
5: chr3 123709437 123709637 SOX2 123709537 5 TRUE FALSE
6: chr6 8397251 8397451 SOX2 8397351 6 TRUE FALSE