I have the following XML file:
<root>
<article_date>09/09/2013
<article_time>1
<article_name>aaa1</article_name>
<article_link>1aaaaaaa</article_link>
</article_time>
<article_time>0
<article_name>aaa2</article_name>
<article_link>2aaaaaaa</article_link>
</article_time>
<article_time>1
<article_name>aaa3</article_name>
<article_link>3aaaaaaa</article_link>
</article_time>
<article_time>0
<article_name>aaa4</article_name>
<article_link>4aaaaaaa</article_link>
</article_time>
<article_time>1
<article_name>aaa5</article_name>
<article_link>5aaaaaaa</article_link>
</article_time>
</article_date>
</root>
I would like to transform it to the following file:
<root>
<article_date>09/09/2013
<article_time>1
<article_name>aaa1+aaa3+aaa5</article_name>
<article_link>1aaaaaaa+3aaaaaaa+5aaaaaaa</article_link>
</article_time>
<article_time>0
<article_name>aaa2+aaa4</article_name>
<article_link>2aaaaaaa+4aaaaaaa</article_link>
</article_time>
</article_date>
</root>
How can I do it in Python?
My approach to this task is the following: 1) loop through the article_time tags; 2) build a dictionary whose key is either 0 or 1 and whose value is the matching article_time elements; 3) for each entry in this dictionary, find all child nodes (article_name and article_link) and append them.
Based on that, I wrote the following code (P.S. I am currently struggling with adding elements to the dictionary, but I will overcome this issue):
import os
import xml.etree.ElementTree as et

def parse():
    list_of_unique_timestamps = []
    text_to_merge = ""
    # et.parse() does not expand "~", so expand it explicitly
    tree = et.parse(os.path.expanduser("~/Documents/test1.xml"))
    root = tree.getroot()
    for children in root:
        print(children.tag, children.text)
        for child in children:
            print(child.tag, int(child.text))
            if child.text not in list_of_unique_timestamps:
                list_of_unique_timestamps.append(child.text)
    print(list_of_unique_timestamps)
Here's a solution using xml.etree.ElementTree from the Python standard library.
The idea is to gather the items into a defaultdict(list), keyed by each article_time text value:
from collections import defaultdict
import xml.etree.ElementTree as ET
data = """<root>
<article_date>09/09/2013
<article_time>1
<article_name>aaa1</article_name>
<article_link>1aaaaaaa</article_link>
</article_time>
<article_time>0
<article_name>aaa2</article_name>
<article_link>2aaaaaaa</article_link>
</article_time>
<article_time>1
<article_name>aaa3</article_name>
<article_link>3aaaaaaa</article_link>
</article_time>
<article_time>0
<article_name>aaa4</article_name>
<article_link>4aaaaaaa</article_link>
</article_time>
<article_time>1
<article_name>aaa5</article_name>
<article_link>5aaaaaaa</article_link>
</article_time>
</article_date>
</root>
"""
tree = ET.fromstring(data)
root = ET.Element('root')
article_date = ET.SubElement(root, 'article_date')
article_date.text = tree.find('.//article_date').text
# Group (name, link) pairs by the stripped article_time text ('1' or '0')
groups = defaultdict(list)
for article_time in tree.findall('.//article_time'):
    text = article_time.text.strip()
    name = article_time.find('./article_name').text
    link = article_time.find('./article_link').text
    groups[text].append((name, link))
# Build one merged article_time element per distinct time value
for time_value, items in groups.items():
    article_time = ET.SubElement(article_date, 'article_time')
    article_name = ET.SubElement(article_time, 'article_name')
    article_link = ET.SubElement(article_time, 'article_link')
    article_time.text = time_value
    article_name.text = '+'.join(name for (name, _) in items)
    article_link.text = '+'.join(link for (_, link) in items)
print(ET.tostring(root, encoding='unicode'))
prints (prettified):
<root>
<article_date>09/09/2013
<article_time>1
<article_name>aaa1+aaa3+aaa5</article_name>
<article_link>1aaaaaaa+3aaaaaaa+5aaaaaaa</article_link>
</article_time>
<article_time>0
<article_name>aaa2+aaa4</article_name>
<article_link>2aaaaaaa+4aaaaaaa</article_link>
</article_time>
</article_date>
</root>
As you can see, the result is exactly what you were aiming for.
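If you need to reuse the merge, the same grouping idea can be wrapped in a function. This is only a sketch under a few assumptions: merge_articles is a name made up for this example (it is not part of any library), the input follows the same root/article_date/article_time layout as in the question, and it runs on Python 3.7+ so that a plain dict keeps the order in which time values are first seen.

```python
import xml.etree.ElementTree as ET


def merge_articles(xml_text, sep='+'):
    """Merge article_time elements that share the same text value.

    Hypothetical helper (not part of any library); assumes the
    root/article_date/article_time layout shown in the question.
    """
    source = ET.fromstring(xml_text)
    source_date = source.find('article_date')

    root = ET.Element('root')
    article_date = ET.SubElement(root, 'article_date')
    article_date.text = (source_date.text or '').strip()

    # Map each time value to its names and links, in first-seen order.
    groups = {}
    for article_time in source_date.findall('article_time'):
        key = (article_time.text or '').strip()
        names, links = groups.setdefault(key, ([], []))
        names.append(article_time.findtext('article_name', default=''))
        links.append(article_time.findtext('article_link', default=''))

    # Emit one merged article_time element per distinct time value.
    for key, (names, links) in groups.items():
        merged = ET.SubElement(article_date, 'article_time')
        merged.text = key
        ET.SubElement(merged, 'article_name').text = sep.join(names)
        ET.SubElement(merged, 'article_link').text = sep.join(links)
    return root


# Shortened sample input in the same layout as the question's file.
sample = (
    '<root><article_date>09/09/2013'
    '<article_time>1<article_name>aaa1</article_name>'
    '<article_link>1aaaaaaa</article_link></article_time>'
    '<article_time>0<article_name>aaa2</article_name>'
    '<article_link>2aaaaaaa</article_link></article_time>'
    '<article_time>1<article_name>aaa3</article_name>'
    '<article_link>3aaaaaaa</article_link></article_time>'
    '</article_date></root>'
)
merged_root = merge_articles(sample)
print(ET.tostring(merged_root, encoding='unicode'))
```

Using setdefault instead of defaultdict is just a matter of taste here; both collect the names and links per time value before any output element is built.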