How can I speed up or parallelize this R code?


雅克斯科

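This code works fine, but it is rather slow. I noticed that it runs on only one core of the processor. If it used several cores, it would probably be faster.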

### proximity filter
options("scipen"=100)
library(geosphere)   # distHaversine()
library(dplyr)       # %>%, slice(), rowwise(), mutate()
library(data.table)  # setDT()

# split up data into regions
splitdt<-split(geocities, geocities$airport_code)

## reduce cities
dat=geocities[FALSE,][]
currentregion=1

while (currentregion <= NROW(splitdt)){
    workingregion <- as.data.frame(splitdt[[currentregion]]) ## set region
    workingregion$remove = FALSE
    setDT(workingregion)
    #plot(workingregion$longitude,workingregion$latitude)
    currentorigin=1

    while (currentorigin <= NROW(workingregion)) {
        # choose which row to use
        # as the first part of the distance formula
        workingorigin <- workingregion[,c("longitude","latitude")] %>% slice(currentorigin) ## set LeadingRow city
        setDT(workingorigin)

        # calculate the distance from the chosen row and flag every other
        # city closer than 17 km (17000 m) for removal
        workingregion<-workingregion %>% rowwise() %>% mutate(remove =
        ifelse(distHaversine(c(longitude, latitude), workingorigin) != 0 &  # keep workingorigin city
        distHaversine(c(longitude, latitude), workingorigin) < 17000,TRUE,remove)) %>%
        ungroup()  # drop the rowwise grouping so slice() indexes the whole table again

        # remove matched cities
        workingregion <- workingregion[workingregion$remove!=TRUE,]

        currentorigin = currentorigin+1
    }
    currentregion = currentregion+1
    # save results
    workingregion <- workingregion[workingregion$remove!=TRUE,]
    dat <- rbind(dat, workingregion) #, fill=TRUE
}

兹科尔曼

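The first thing I noticed is this line: dat <- rbind(dat, workingregion)

It grows an object (here a data frame) inside the loop. That is discouraged and slow, because rbind builds a new, larger copy of dat on every iteration.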
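
A minimal sketch of the alternative, assuming each iteration produces a small data frame: store every piece in a preallocated list and bind them once after the loop. The `make_piece` helper and its columns are hypothetical stand-ins for the per-region result, not part of the original code.

```r
# Hypothetical stand-in for the per-region work: returns a small data frame
make_piece <- function(i) {
  data.frame(region = i, value = i^2)
}

n <- 5

# Slow pattern: the accumulated data frame is copied on every iteration
grown <- data.frame()
for (i in seq_len(n)) {
  grown <- rbind(grown, make_piece(i))
}

# Faster pattern: fill a preallocated list, then bind once at the end
pieces <- vector("list", n)
for (i in seq_len(n)) {
  pieces[[i]] <- make_piece(i)
}
bound <- do.call(rbind, pieces)
```

Both variants give the same rows; only the second avoids the repeated copying. Since the question already uses data.table, `data.table::rbindlist(pieces)` should be a drop-in replacement for the final `do.call(rbind, pieces)`.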

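I know this does not answer your question about parallelizing the loop. However, I recently went through a similar exercise, collecting the results of 100,000 SQL queries, and sped my code up 60-fold just by being careful about memory.

I also parallelized my code with foreach and %dopar%. This works well on Windows, and it is easy to set up a cluster (one R instance per core).

Here is an example that may help: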

library(parallel)
library(doParallel)  # also attaches foreach

# Use all but one core
cl <- makeCluster(detectCores() - 1)

# Register the cluster as the %dopar% backend;
# without this step the loop runs sequentially with a warning
registerDoParallel(cl)

# Give the R instance on each core the objects and packages
# that the loop body needs
clusterExport(cl, '<variable names>')
clusterEvalQ(cl, library(packages ))

# Parallel loop over the regions (in your case):
# each element of splitdt is handed to a worker
results <- foreach(currentregion = splitdt) %dopar%
{
<body of loop>
}

# Shut down the cluster
stopCluster(cl)
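
To make the template concrete, here is a hedged, self-contained sketch of how it might be applied to the proximity filter from the question. Everything in it is an assumption for illustration: the sample data is made up, `haversine_m` is a base-R stand-in for `geosphere::distHaversine` (using the same default radius of 6378137 m), and `thin_region` is a hypothetical rewrite of the inner loop that computes all distances from one origin in a single vectorized call instead of row by row.

```r
library(parallel)
library(foreach)
library(doParallel)

# Vectorized haversine distance in metres from one origin to many points
# (base-R stand-in for geosphere::distHaversine)
haversine_m <- function(lon, lat, lon0, lat0, r = 6378137) {
  to_rad <- pi / 180
  dlat <- (lat - lat0) * to_rad
  dlon <- (lon - lon0) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat0 * to_rad) * cos(lat * to_rad) * sin(dlon / 2)^2
  2 * r * asin(pmin(1, sqrt(a)))
}

# Greedy filter for one region: walk through the cities in order and drop
# every city closer than min_dist to a city that is still kept
thin_region <- function(region, min_dist = 17000) {
  keep <- rep(TRUE, nrow(region))
  for (i in seq_len(nrow(region))) {
    if (!keep[i]) next
    d <- haversine_m(region$longitude, region$latitude,
                     region$longitude[i], region$latitude[i])
    # d > 0 keeps the origin itself, as in the original code
    keep[d > 0 & d < min_dist] <- FALSE
  }
  region[keep, , drop = FALSE]
}

# Made-up sample data: two regions with a few cities each
geocities <- data.frame(
  airport_code = c("AAA", "AAA", "AAA", "BBB", "BBB"),
  longitude    = c(0.00, 0.05, 1.00, 10.00, 10.01),
  latitude     = c(0.00, 0.00, 0.00, 50.00, 50.00)
)
splitdt <- split(geocities, geocities$airport_code)

cl <- makeCluster(2)
registerDoParallel(cl)

# .combine = rbind binds the per-region results once they come back;
# .export ships the helper that thin_region calls internally
dat <- foreach(region = splitdt, .combine = rbind,
               .export = "haversine_m") %dopar% {
  thin_region(region)
}

stopCluster(cl)
```

With `.combine` there is no need to grow `dat` inside a loop, and the `.export` and `.packages` arguments of `foreach` can take over the role of `clusterExport` and `clusterEvalQ`. Whether the parallel version is actually faster depends on how many regions there are and how large each one is, since starting workers and shipping data has its own cost.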

Here are some resources on speeding up R code: http://adv-r.had.co.nz/Performance.html (written by the man himself) and https://csgillespie.github.io/efficientR/performance.html

Hope this helps, and good luck!
