How do I get values from XML tags in Python?

theteddyboy

I have an XML file like the following.

<?xml version="1.0" encoding="UTF-8"?><searching>
   <query>query01</query>
   <document id="0">
      <title>lord of the rings.</title>
    <snippet>
      this is a snippet of a document.
    </snippet>
      <url>http://www.google.com/</url>
   </document>
   <document id="1">
      <title>harry potter.</title>
    <snippet>
            this is a snippet of a document.
    </snippet>
      <url>http://www.google.com/</url>
   </document>
   ........ #and other documents .....

  <group id="0" size="298" score="145">
      <title>
         <phrase>GROUP A</phrase>
      </title>
      <document refid="0"/>
      <document refid="1"/>
      <document refid="84"/>
   </group>
  <group id="0" size="298" score="55">
      <title>
         <phrase>GROUP B</phrase>
      </title>
      <document refid="2"/>
      <document refid="13"/>
      <document refid="3"/>
   </group>
</searching>

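I would like to get the name of each group above, together with the document IDs (and their titles) that belong to each group. My idea is to store the document IDs and document titles in a dictionary, like this: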

import codecs
documentID = {}    
group = {}

myfile = codecs.open("file.xml", mode = 'r', encoding = "utf8")
for line in myfile:
    line = line.strip()
    #get id from tags
    #get title from tag
    #store in documentID 


    #get group name and document reference

I have also tried BeautifulSoup, but I am very new to it and don't know how to use it. This is the code I have so far:

def outputCluster(rFile):
    documentInReadFile = {}         #dictionary to store all document in readFile

    myfile = codecs.open(rFile, mode='r', encoding="utf8")
    soup = BeautifulSoup(myfile)
    # print all text in readFile:
    # print soup.prettify()

    # print soup.find_all('title')

outputCluster("file.xml")

Any suggestions would be appreciated. Thank you.

TheSoundDefense

The previous poster has it right. The etree documentation is here:

https://docs.python.org/2/library/xml.etree.elementtree.html#module-xml.etree.ElementTree

It should help you. Here is a code sample that might do the trick (partially taken from the link above):

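import xml.etree.ElementTree as ET

tree = ET.parse('your_file.xml')
root = tree.getroot()

# Visit every <group> that is a direct child of the root element.
for group in root.findall('group'):
    title = group.find('title')
    titlephrase = title.find('phrase').text  # the group name, e.g. "GROUP A"
    # Each <document> inside a group carries a refid attribute.
    for doc in group.findall('document'):
        refid = doc.get('refid')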

Or, if you want the ID stored in the group tag, use id = group.get('id') instead of searching for all the refids.
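
To connect this back to the question's idea of keeping IDs and titles in dictionaries, here is a minimal sketch written for Python 3. It assumes the layout shown in the question: top-level document elements carry an id and a title, and each group lists its members by refid. The SAMPLE_XML string and the collect_groups helper are hypothetical names used only for illustration. Groups are keyed by their phrase text rather than their id, because both groups in the question have id="0".

```python
import xml.etree.ElementTree as ET

# Trimmed-down, hypothetical copy of the XML from the question.
SAMPLE_XML = """<searching>
   <query>query01</query>
   <document id="0">
      <title>lord of the rings.</title>
      <url>http://www.google.com/</url>
   </document>
   <document id="1">
      <title>harry potter.</title>
      <url>http://www.google.com/</url>
   </document>
   <group id="0" size="298" score="145">
      <title>
         <phrase>GROUP A</phrase>
      </title>
      <document refid="0"/>
      <document refid="1"/>
      <document refid="84"/>
   </group>
</searching>"""


def collect_groups(root):
    """Return (titles, groups) built from a <searching> root element."""
    # Map each top-level document id to its title text.
    # findall('document') only matches direct children of the root,
    # so the refid-only entries inside groups are not picked up here.
    titles = {}
    for doc in root.findall('document'):
        titles[doc.get('id')] = doc.findtext('title', default='').strip()

    # Map each group name to a list of (refid, title) pairs.
    # A refid with no matching document in the file gets a title of None.
    groups = {}
    for group in root.findall('group'):
        name = group.findtext('title/phrase', default='').strip()
        refids = [d.get('refid') for d in group.findall('document')]
        groups[name] = [(refid, titles.get(refid)) for refid in refids]
    return titles, groups


root = ET.fromstring(SAMPLE_XML)
titles, groups = collect_groups(root)
for name, docs in groups.items():
    print(name, docs)
```

To read from a file instead, replace ET.fromstring(SAMPLE_XML) with ET.parse('your_file.xml').getroot(). Note that attribute values come back as strings, so the IDs here are '0' and '1' rather than integers.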
