Extracting values from an XML file to analyze in R in various ways


krofox

I am trying to extract values from an XML file so that I can analyze them in R in various ways. The URL of the XML file is: http://reports.ieso.ca/public/GenOutputbyFuelHourly/PUB_GenOutputbyFuelHourly_2015.xml

library(XML)
library(plyr)
library(ggplot2)
library(gridExtra)  

data<-"http://reports.ieso.ca/public/GenOutputbyFuelHourly/PUB_GenOutputbyFuelHourly_2015.xml"
xmlfile=xmlParse(data)
class(xmlfile) #"XMLInternalDocument" "XMLAbstractDocument"

xmltop = xmlRoot(xmlfile) # root node of the document
class(xmltop) # "XMLInternalElementNode" "XMLInternalNode" "XMLAbstractNode"
xmlName(xmltop) # name of the root node
xmlSize(xmltop) # number of children of the root node
xmlName(xmltop[[1]]) # name of the root's first child

class(data)
str(data)
topxml <- xmlRoot(xmlfile) # xmlRoot() needs the parsed document, not the URL string in `data`
topxml <- xmlSApply(topxml,
                   function(x) xmlSApply(x, xmlValue))
xml_df <- data.frame(t(topxml),
                     row.names=NULL)

Until now I have always worked with CSV files. I am new to XML data and have been trying since yesterday.

Any help would be appreciated.

Thanks.

Carl Boneri

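I assume you are trying to get this into a data.frame. I looked at the XML structure of the document you provided, and from what I could gather, the data you want fits into the following columns: c("date", "hour", "fuel_type", "quality", "fuel_output"). XPath is a good way to pull these out, and it makes a good starting point.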
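To make the XPath idea concrete, here is a minimal sketch on a small inline document. The element names and nesting in `doc` are an assumption made for illustration (a hypothetical miniature of the report, without the namespace a real file may declare), and `sample_df` is just an illustrative name. It selects one node per fuel record and looks up each column relative to that node, so every column has the same length.

```r
library(xml2)

# Hypothetical miniature of the report; the real element names,
# nesting and namespace may differ.
doc <- read_xml('<Document>
  <DailyData>
    <Day>2015-01-01</Day>
    <HourlyData>
      <Hour>1</Hour>
      <FuelTotal>
        <Fuel>NUCLEAR</Fuel>
        <EnergyValue><OutputQuality>0</OutputQuality><Output>11000</Output></EnergyValue>
      </FuelTotal>
      <FuelTotal>
        <Fuel>WIND</Fuel>
        <EnergyValue><OutputQuality>0</OutputQuality><Output>2500</Output></EnergyValue>
      </FuelTotal>
    </HourlyData>
  </DailyData>
</Document>')

# One node per fuel record
fuel_nodes <- xml_find_all(doc, ".//FuelTotal")

# xml_find_first() returns exactly one result per input node
# (a missing node, read as NA, when nothing matches)
sample_df <- data.frame(
  date        = xml_text(xml_find_first(fuel_nodes, "ancestor::DailyData/Day")),
  hour        = as.numeric(xml_text(xml_find_first(fuel_nodes, "ancestor::HourlyData/Hour"))),
  fuel_type   = xml_text(xml_find_first(fuel_nodes, "./Fuel")),
  quality     = as.numeric(xml_text(xml_find_first(fuel_nodes, ".//OutputQuality"))),
  fuel_output = as.numeric(xml_text(xml_find_first(fuel_nodes, ".//Output"))),
  stringsAsFactors = FALSE
)

print(sample_df)
```

Because the lookups are vectorized over the whole node set, this style avoids building one small data.frame per hour, which may help with speed on a large file.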

If so, here is the code I used to do it. Note that it is very slow, and for some reason node 84 throws an error. Still, it should get you started.

# Read in the raw document
library(xml2)
library(dplyr)
library(jsonlite)  # provides rbind_pages()

# read_html() uses the HTML parser, which lowercases tag names
raw <- read_html('http://reports.ieso.ca/public/GenOutputbyFuelHourly/PUB_GenOutputbyFuelHourly_2015.xml')

# Use xpath to get all nodes for <DailyData>, which is lowercase in the parsed doc
daily_nodes <- xml_find_all(raw, ".//dailydata")

# Loop through and extract the information from the child nodes
o <- lapply(daily_nodes, function(i){
    day <- xml_find_all(i, ".//day") %>% xml_text()
    hourly <- xml_find_all(i, ".//hourlydata")
    tryCatch({
        rbind_pages(lapply(hourly, function(j){
            # Query the current hourly node `j`, not the whole `hourly` set;
            # otherwise all 24 hours are recycled against each hour's fuel rows
            n_hour <- xml_find_all(j, ".//hour") %>% xml_text() %>% as.numeric()
            fuel_type <- xml_find_all(j, ".//fueltotal/fuel") %>% xml_text()
            fuel_qual <- xml_find_all(j, ".//descendant::outputquality") %>% xml_text() %>% as.numeric()
            fuel_output <- xml_find_all(j, ".//descendant::output") %>% xml_text() %>% as.numeric()
            data.frame(hour = n_hour, type = fuel_type, qual = fuel_qual, output = fuel_output)
        })) %>% mutate(date = day) %>% select(date, 1:4)
    }, error = function(e){
        # A day that fails to parse becomes NA and is filtered out below
        NA
    })
})
# This will be the data.frame with all info
odf <- rbind_pages(o[vapply(o, is.data.frame, logical(1))])

# The console output below is what was originally posted with this answer.
# It was produced when the hour lookup still ran against `hourly` instead of `j`,
# so every hour was repeated within each hourly node; the code above should
# yield fewer rows per hour and per date.

# Number of rows
> nrow(odf)
[1] 209664
# Showing number of hourly records for just one date...
> filter(odf, date == "2015-01-01") %>% count(hour)
# A tibble: 24 x 2
    hour     n
   <dbl> <int>
 1     1    24
 2     2    24
 3     3    24
 4     4    24
 5     5    24
 6     6    24
 7     7    24
 8     8    24
 9     9    24
10    10    24
# … with 14 more rows

# Showing how many records per date exist
> count(odf, date)
# A tibble: 364 x 2
   date           n
   <chr>      <int>
 1 2015-01-01   576
 2 2015-01-02   576
 3 2015-01-03   576
 4 2015-01-04   576
 5 2015-01-05   576
 6 2015-01-06   576
 7 2015-01-07   576
 8 2015-01-08   576
 9 2015-01-09   576
10 2015-01-10   576
# … with 354 more rows
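
The tryCatch() in the loop turns any failing day into NA, which hides the reason a node such as 84 fails. One way to investigate, sketched here on toy data rather than the real document, is to keep the error message together with the index of the failing item. `collect_with_errors` and `toy_days` are hypothetical names introduced only for this illustration.

```r
# Hypothetical helper: apply `f` to each item, keeping the index and error
# message (instead of a bare NA) whenever a call fails.
collect_with_errors <- function(items, f) {
  lapply(seq_along(items), function(k) {
    tryCatch(
      f(items[[k]]),
      error = function(e) list(index = k, message = conditionMessage(e))
    )
  })
}

# Toy input standing in for the parsed days; the second item has columns of
# incompatible lengths, so data.frame() fails on it.
toy_days <- list(
  list(hour = 1:2, output = c(10, 20)),
  list(hour = 1:2, output = c(10, 20, 30)),
  list(hour = 1:2, output = c(5, 6))
)

results <- collect_with_errors(toy_days, function(d) {
  data.frame(hour = d$hour, output = d$output)
})

# Separate the successful data.frames from the recorded failures
is_df <- vapply(results, is.data.frame, logical(1))
failed <- results[!is_df]
failed_index <- vapply(failed, function(x) x$index, integer(1))

print(failed_index)
print(failed[[1]]$message)
```

Applied to the real loop, the same pattern would report which day failed and with what message, which is a starting point for checking whether that day is missing a value.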

Then run your analysis on odf.
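
As one possible first analysis, the sketch below totals output per date and fuel type with dplyr. It runs on `toy_odf`, a small invented stand-in with the same column names as `odf`; the values are made up for illustration and are not taken from the report.

```r
library(dplyr)

# Toy stand-in for `odf`; the values are invented for illustration only.
toy_odf <- data.frame(
  date   = c("2015-01-01", "2015-01-01", "2015-01-01", "2015-01-02"),
  hour   = c(1, 1, 2, 1),
  type   = c("NUCLEAR", "WIND", "NUCLEAR", "WIND"),
  qual   = c(0, 0, 0, 0),
  output = c(100, 20, 110, 30),
  stringsAsFactors = FALSE
)

# Total output per date and fuel type, ignoring missing values
daily_by_fuel <- toy_odf %>%
  group_by(date, type) %>%
  summarise(total_output = sum(output, na.rm = TRUE), .groups = "drop") %>%
  arrange(date, type)

print(daily_by_fuel)
```

Swapping `toy_odf` for `odf` gives the same summary on the real data, and the result can be passed to ggplot2 for plotting.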
