Parsing and Scraping in R

数据先知

I am new to R and have a question about the most efficient way to build a database. I want to build a database of NFL statistics. These stats are readily available in many places on the web, but I've found the most thorough breakdown to be on Pro-Football-Reference (http://www.pro-football-reference.com/). This would be panel data, where the time intervals are each week of each season, my observations are each player in each game, and my columns are the stats recorded in Pro-Football-Reference's boxscores (http://www.pro-football-reference.com/boxscores/201702050atl.htm).

I can scrape every table for each game of each season with the following:

# PACKAGES
library(rvest)
library(XML)

# Read the boxscore page once; Pro-Football-Reference hides most of its tables
# inside HTML comments, so extract the comment nodes to parse tables out of them
page.201702050atl = read_html("http://www.pro-football-reference.com/boxscores/201702050atl.htm")
comments.201702050atl = page.201702050atl %>% html_nodes(xpath = "//comment()")
scoring.201702050atl = readHTMLTable("http://www.pro-football-reference.com/boxscores/201702050atl.htm", which = 2)
game.info.201702050atl = comments.201702050atl[17] %>% html_text() %>% read_html() %>% html_node("#game_info") %>% html_table()
officials.201702050atl = comments.201702050atl[21] %>% html_text() %>% read_html() %>% html_node("#officials") %>% html_table()
team.stats.201702050atl = comments.201702050atl[27] %>% html_text() %>% read_html() %>% html_node("#team_stats") %>% html_table()
scorebox.201702050atl = readHTMLTable("http://www.pro-football-reference.com/boxscores/201702050atl.htm", which = 1)
expected.points.201702050atl = comments.201702050atl[22] %>% html_text() %>% read_html() %>% html_node("#expected_points") %>% html_table()
player.offense.201702050atl = comments.201702050atl[31] %>% html_text() %>% read_html() %>% html_node("#player_offense") %>% html_table()
player.defense.201702050atl = comments.201702050atl[32] %>% html_text() %>% read_html() %>% html_node("#player_defense") %>% html_table()
returns.201702050atl = comments.201702050atl[33] %>% html_text() %>% read_html() %>% html_node("#returns") %>% html_table()
kicking.201702050atl = comments.201702050atl[34] %>% html_text() %>% read_html() %>% html_node("#kicking") %>% html_table()
home.starters.201702050atl = comments.201702050atl[35] %>% html_text() %>% read_html() %>% html_node("#home_starters") %>% html_table()
vis.starters.201702050atl = comments.201702050atl[36] %>% html_text() %>% read_html() %>% html_node("#vis_starters") %>% html_table()
home.snap.counts.201702050atl = comments.201702050atl[37] %>% html_text() %>% read_html() %>% html_node("#home_snap_counts") %>% html_table()
vis.snap.counts.201702050atl = comments.201702050atl[38] %>% html_text() %>% read_html() %>% html_node("#vis_snap_counts") %>% html_table()
targets.directions.201702050atl = comments.201702050atl[39] %>% html_text() %>% read_html() %>% html_node("#targets_directions") %>% html_table()
rush.directions.201702050atl = comments.201702050atl[40] %>% html_text() %>% read_html() %>% html_node("#rush_directions") %>% html_table()
pass.tackles.201702050atl = comments.201702050atl[41] %>% html_text() %>% read_html() %>% html_node("#pass_tackles") %>% html_table()
rush.tackles.201702050atl = comments.201702050atl[42] %>% html_text() %>% read_html() %>% html_node("#rush_tackles") %>% html_table()
home.drives.201702050atl = comments.201702050atl[43] %>% html_text() %>% read_html() %>% html_node("#home_drives") %>% html_table()
vis.drives.201702050atl = comments.201702050atl[44] %>% html_text() %>% read_html() %>% html_node("#vis_drives") %>% html_table()
pbp.201702050atl = comments.201702050atl[45] %>% html_text() %>% read_html() %>% html_node("#pbp") %>% html_table()

However, the number of lines of code required to clean every scraped table for 256 games per year suggests there may be a more efficient approach.

The NFL officially records statistics in its gamebooks (http://www.nfl.com/liveupdate/gamecenter/57167/ATL_Gamebook.pdf). Since sites like Pro-Football-Reference include stats that are not tallied in the official gamebooks, and since the identifying language needed to derive them is contained in the gamebooks' play-by-play, I deduce that they run a function that parses the play-by-play and tallies their stats. As a novice, I've never written a function or parsed anything in R before; however, I figure I could apply one function to every gamebook, which would be more efficient than scraping each individual table. Am I on the right track here? I don't want to put a lot of effort into heading in the wrong direction.

Another problem arises because the gamebooks are PDFs. Play-by-plays exist in table form on other sites, but none of them is complete. I've read some excellent tutorials on this site on how to convert a PDF to text with

library(tm)

but I haven't yet figured it out for my own purposes.
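From those tutorials, the basic conversion appears to work roughly like this (a sketch, assuming the xpdf pdftotext utility that tm::readPDF wraps is installed, and with a hypothetical local file name standing in for a downloaded gamebook):

library(tm)

# Hypothetical local copy of the gamebook PDF linked above
uri <- "ATL_Gamebook.pdf"

# readPDF() returns a reader function; "-layout" asks pdftotext to preserve columns
reader <- readPDF(engine = "xpdf", control = list(text = "-layout"))
gamebook <- reader(elem = list(uri = uri), language = "en", id = "ATL_Gamebook")

# content() returns the extracted text, one element per line of the PDF
head(content(gamebook))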

Once I've converted an entire PDF to text, do I simply identify the play-by-play section, parse it, and then parse each statistic out of it? Are there other obstacles that my limited experience keeps me from foreseeing?
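To make the question concrete, the kind of parsing I imagine looks something like the following; the play descriptions and line format below are made up for illustration:

# Hypothetical play-by-play lines; the real gamebook text will differ
plays <- c("M.Ryan pass short left to J.Jones for 15 yards",
           "D.Freeman left tackle for 5 yards",
           "M.Ryan pass deep right to T.Gabriel for 35 yards, TOUCHDOWN")

# Keep only passing plays and pull out the passer and the yardage
m <- regmatches(plays, regexec("^([A-Z]\\.[A-Za-z]+) pass .* for (-?[0-9]+) yards", plays))
passes <- Filter(length, m)

pass.yards <- data.frame(passer = sapply(passes, `[`, 2),
                         yards  = as.integer(sapply(passes, `[`, 3)))
pass.yards  # per-play rows that could then be tallied with aggregate()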

This may be too "beginner" a question for this site, but can anyone get me situated here? Or alternatively, point me to a resource that can? Thanks so much for the help.

Parfait

Consider generalizing your one-game parsing to all games by storing the html tables in a growing list across all 256 games. Below is an example with week 1.

# PARSE WEEK 1 SCHEDULE PAGE
doc <- htmlParse(readLines("http://www.pro-football-reference.com/years/2016/week_1.htm"))

# EXTRACT ALL GAME PAGES
games <- xpathSApply(doc, "//a[text()='Final']/@href")

# FUNCTION TO BUILD HTML TABLE LIST
getwebdata <- function(url) {
    print(url)
    boxscoreurl <- paste0("http://www.pro-football-reference.com", url)
    page <- read_html(boxscoreurl)
    comments <- page %>% html_nodes(xpath = "//comment()")

    list(
      scoring = readHTMLTable(boxscoreurl, which = 2),
      game.info = comments[17] %>% html_text() %>% read_html() %>% html_node("#game_info") %>% html_table(),
      officials = comments[21] %>% html_text() %>% read_html() %>% html_node("#officials") %>% html_table(),
      team.stats = comments[27] %>% html_text() %>% read_html() %>% html_node("#team_stats") %>% html_table(),
      scorebox = readHTMLTable(boxscoreurl, which = 1),
      expected.points = comments[22] %>% html_text() %>% read_html() %>% html_node("#expected_points") %>% html_table(),
      player.offense = comments[31] %>% html_text() %>% read_html() %>% html_node("#player_offense") %>% html_table(),
      player.defense = comments[32] %>% html_text() %>% read_html() %>% html_node("#player_defense") %>% html_table(),
      returns = comments[33] %>% html_text() %>% read_html() %>% html_node("#returns") %>% html_table(),
      kicking = comments[34] %>% html_text() %>% read_html() %>% html_node("#kicking") %>% html_table(),
      home.starters = comments[35] %>% html_text() %>% read_html() %>% html_node("#home_starters") %>% html_table(),
      vis.starters = comments[36] %>% html_text() %>% read_html() %>% html_node("#vis_starters") %>% html_table(),
      home.snap.counts = comments[37] %>% html_text() %>% read_html() %>% html_node("#home_snap_counts") %>% html_table(),
      vis.snap.counts = comments[38] %>% html_text() %>% read_html() %>% html_node("#vis_snap_counts") %>% html_table(),
      targets.directions = comments[39] %>% html_text() %>% read_html() %>% html_node("#targets_directions") %>% html_table(),
      rush.directions = comments[40] %>% html_text() %>% read_html() %>% html_node("#rush_directions") %>% html_table(),
      pass.tackles = comments[41] %>% html_text() %>% read_html() %>% html_node("#pass_tackles") %>% html_table(),
      rush.tackles = comments[42] %>% html_text() %>% read_html() %>% html_node("#rush_tackles") %>% html_table(),
      home.drives = comments[43] %>% html_text() %>% read_html() %>% html_node("#home_drives") %>% html_table(),
      vis.drives = comments[44] %>% html_text() %>% read_html() %>% html_node("#vis_drives") %>% html_table(),
      pbp = comments[45] %>% html_text() %>% read_html() %>% html_node("#pbp") %>% html_table()
    )
}

# ALL WEEK ONE LIST OF HTML TABLE(S) DATA
week1datalist <- lapply(games, getwebdata)

# TRY/CATCH VERSION (ANY PARSING ERROR RETURNS AN EMPTY LIST)
week1datalist <- lapply(games, function(g) {
  tryCatch(getwebdata(g), error = function(e) list())
})

# NAME EACH LIST ELEMENT BY CORRESPONDING GAME
shortgames <- gsub("/", "", gsub(".htm", "", games))
week1datalist <- setNames(week1datalist, shortgames)

Finally, you can reference a specific stats table for one game by name:

week1datalist$boxscores201609080den$scoring

week1datalist$boxscores201609110atl$game.info

Also, you may want to keep the tryCatch inside the lapply, as shown above, since some pages may be inconsistent.
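From there, the same pattern should extend to a whole season; here is a sketch, assuming every week page follows the /years/2016/week_N.htm URL pattern used above:

# SKETCH: BUILD THE LIST FOR ALL 17 WEEKS OF THE 2016 REGULAR SEASON
season2016 <- lapply(1:17, function(wk) {
  url <- sprintf("http://www.pro-football-reference.com/years/2016/week_%d.htm", wk)
  doc <- htmlParse(readLines(url))
  games <- xpathSApply(doc, "//a[text()='Final']/@href")

  weekdata <- lapply(games, function(g) {
    tryCatch(getwebdata(g), error = function(e) list())
  })
  setNames(weekdata, gsub("/|\\.htm", "", games))
})
names(season2016) <- paste0("week", 1:17)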
