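I am trying to write an R script that scrapes data from tables on multiple pages of a site. To do this, I first want to create a list of the specific pages to scrape. The page addresses have the format "www.urlpart1/[year]/urlpart2/[page]", where [year] ranges from 2003 to 2015 (13 elements) and [page] runs from 1 to 281 in increments of 40 (8 elements); the final list I want will therefore contain 104 elements. Here is my code: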
#specify components of URLs
url1 <- "www.urlpart1/"
url2 <- "/urlpart2/"
#specify range of years to scrape
years <- as.list(seq(from = 2003, to = 2015, by = 1)) #13 elements
#specify specific pages within each year to scrape
pages <- as.list(seq(from = 1, to = 281, by = 40)) #8 elements
#specify length of final list of URLs for scraping
loops <- as.list(seq(from = 1, to = (length(years)*length(pages)), by = 1)) #104 elements
#create empty list for storing output of for-loop
list1 <- list()
#initialize loop
for (i in loops){
  for (j in years){
    for (k in pages){
      list1[[i]] <- paste0(url1,j,url2,k)
    }
  }
}
list1 #outputs 104 elements of last iteration of loop
Ultimately, the list should contain 104 elements that look like this:
"www.urlpart1/2003/urlpart2/1",
"www.urlpart1/2003/urlpart2/41",
"www.urlpart1/2003/urlpart2/81",
"www.urlpart1/2003/urlpart2/121",
"www.urlpart1/2003/urlpart2/161",
"www.urlpart1/2003/urlpart2/201",
"www.urlpart1/2003/urlpart2/241",
"www.urlpart1/2003/urlpart2/281",
"www.urlpart1/2004/urlpart2/1",
"www.urlpart1/2004/urlpart2/41",
"www.urlpart1/2004/urlpart2/81",
"www.urlpart1/2004/urlpart2/121",
"www.urlpart1/2004/urlpart2/161",
"www.urlpart1/2004/urlpart2/201",
"www.urlpart1/2004/urlpart2/241",
"www.urlpart1/2004/urlpart2/281",
...
"www.urlpart1/2015/urlpart2/1",
"www.urlpart1/2015/urlpart2/41",
"www.urlpart1/2015/urlpart2/81",
"www.urlpart1/2015/urlpart2/121",
"www.urlpart1/2015/urlpart2/161",
"www.urlpart1/2015/urlpart2/201",
"www.urlpart1/2015/urlpart2/241",
"www.urlpart1/2015/urlpart2/281"
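Unfortunately, I get a list of the correct length, but every value is the one from the last iteration of the loop. Earlier threads on similar problems do not seem to cover writing to a list inside nested loops. I am not at all tied to a for-loop solution. I could easily do this in Excel's GUI, but I need to improve my coding skills to make the work more reproducible. Thanks!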
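We can use expand.grid to create all combinations of the variables, which gives a data.frame as output, and then paste the elements of each row together (do.call(paste0, ...)) to convert it to a vector.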
res <- do.call(paste0,expand.grid(url1, years, url2, pages))
length(res)
#[1] 104
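One thing to keep in mind: expand.grid varies its first argument fastest, so the call above cycles through the years before moving to the next page, which is not the order shown in the question. A minimal sketch of one way to get the year-by-year order, assuming plain vectors instead of lists and the hypothetical column names page and year (neither is from the original post):

```r
# Components of the URLs, as in the question
url1 <- "www.urlpart1/"
url2 <- "/urlpart2/"

# Plain vectors instead of lists (an assumption made for this sketch)
years <- 2003:2015                        # 13 elements
pages <- seq(from = 1, to = 281, by = 40) # 8 elements

# pages goes first so that it varies fastest within each year;
# the column names "page" and "year" are illustrative only
grid <- expand.grid(page = pages, year = years)

# Build the URLs from the two columns
urls <- paste0(url1, grid$year, url2, grid$page)

length(urls)
head(urls, 3)
```

Naming the columns also makes it easier to see which part of the URL each value fills.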
If we need a for loop, this may help:
v1 <- c()
for(i in seq_along(url1)){
  for(j in seq_along(years)){
    for(k in seq_along(url2)){
      for(m in seq_along(pages)){
        v1 <- c(v1, paste0(url1[i], years[[j]], url2[k], pages[[m]]))
      }
    }
  }
}
identical(sort(res), sort(v1))
#[1] TRUE
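As for why the loop in the question fails: the outer loop over i re-runs both inner loops in full on every pass, so each list1[[i]] is overwritten 104 times and ends up holding the last combination. A minimal sketch that keeps the two nested loops but drops the outer one in favour of a running counter, assuming plain vectors instead of lists and the hypothetical names urls, idx, yr and pg:

```r
# Components of the URLs, as in the question
url1 <- "www.urlpart1/"
url2 <- "/urlpart2/"

# Plain vectors instead of lists (an assumption made for this sketch)
years <- 2003:2015
pages <- seq(from = 1, to = 281, by = 40)

# Preallocate the result; growing a vector with c() also works but copies each time
urls <- character(length(years) * length(pages))

idx <- 0  # running position in the output vector
for (yr in years) {
  for (pg in pages) {
    idx <- idx + 1
    urls[idx] <- paste0(url1, yr, url2, pg)
  }
}

length(urls)
```

Because the year loop is the outer one here, the result comes out in the year-by-year order shown in the question.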