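I have an XML file and want to remove elements from it based on a condition. However, for some reason I don't understand, the file's namespace prevents me from using the procedures described in 1, 2, 3, 4 and 5.
My XML looks like this: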
<?xml version='1.0' encoding='UTF-8'?>
<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15 http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15/pagecontent.xsd">
<Page imageFilename="1.png">
<TextRegion custom="a">
<TextLine custom="readingOrder {index:0;}" id="Ar0010001l1">
<TextEquiv>
<Unicode> abc </Unicode>
</TextEquiv>
</TextLine>
<TextLine custom="readingOrder {index:1;}" id="Ad0010100l2">
<TextEquiv>
<Unicode />
</TextEquiv>
</TextLine>
</TextRegion>
</Page>
</PcGts>
My goal is to remove every TextLine node whose "Unicode" tag contains no text. The output would therefore be:
<?xml version='1.0' encoding='UTF-8'?>
<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15 http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15/pagecontent.xsd">
<Page imageFilename="1.png">
<TextRegion custom="a">
<TextLine custom="readingOrder {index:0;}" id="Ar0010001l1">
<TextEquiv>
<Unicode> abc </Unicode>
</TextEquiv>
</TextLine>
</TextRegion>
</Page>
</PcGts>
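I tried some of the suggestions from the links above. But this:
import lxml.etree as ET
data = ET.parse(file)
root = data.getroot()
for x in root.xpath("//Unicode"):
    print(x.text)
finds no tags. Another attempt:
for x in root.xpath("//{http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15}Unicode"):
    print(x.text)
raises "XPathEvalError: Invalid expression".
So what is the simplest way to remove all nodes whose Unicode tag is empty from this XML file (and how do I find them)?
Thanks.
Well, I finally found a solution to the problem.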
import lxml.etree as ET

my_xml = """...xml content..."""
data = ET.XML(my_xml.encode('UTF-8'))

# This loop removes the empty "<Unicode />" elements.
for target in data.xpath("//*[local-name() = 'Unicode'][not(text())]"):
    target.getparent().remove(target)

# This loop removes elements that have neither children nor text, such as
# "<TextEquiv />" once its "<Unicode />" child has been removed.
# list() takes a snapshot so that removals do not disturb the iteration.
for el in list(data.iter()):
    if not (len(el) or ''.join(t.strip() for t in el.itertext())):
        parent = el.getparent()
        if parent is not None:
            parent.remove(el)

# The same pass again: this time it removes the now-empty "<TextLine>" element.
for el in list(data.iter()):
    if not (len(el) or ''.join(t.strip() for t in el.itertext())):
        parent = el.getparent()
        if parent is not None:
            parent.remove(el)

print(ET.tostring(data, xml_declaration=True))
The idea comes from "Remove XML nodes without child nodes using Python".
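For what it's worth, the two attempts in the question fail for different reasons. In XPath 1.0 an unprefixed name such as `Unicode` only matches elements in no namespace, so `//Unicode` finds nothing in a document with a default namespace. The `{uri}name` (Clark) notation is accepted by lxml's `iter()` and `findall()`, but it is not valid XPath, hence the `XPathEvalError`. A possible alternative to the three loops above is to bind the namespace to a prefix and remove the `TextLine` elements directly. The following is only a sketch: the prefix `pc`, the helper name `remove_empty_text_lines`, and the trimmed sample document are choices made here, not part of the original post. Note that it treats a whitespace-only `Unicode` element as empty, which differs slightly from `not(text())`.

```python
import lxml.etree as ET

PAGE_NS = "http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15"

# Sample input adapted from the question (schemaLocation omitted for brevity).
SAMPLE = f"""<?xml version='1.0' encoding='UTF-8'?>
<PcGts xmlns="{PAGE_NS}">
  <Page imageFilename="1.png">
    <TextRegion custom="a">
      <TextLine custom="readingOrder {{index:0;}}" id="Ar0010001l1">
        <TextEquiv><Unicode> abc </Unicode></TextEquiv>
      </TextLine>
      <TextLine custom="readingOrder {{index:1;}}" id="Ad0010100l2">
        <TextEquiv><Unicode /></TextEquiv>
      </TextLine>
    </TextRegion>
  </Page>
</PcGts>"""


def remove_empty_text_lines(root):
    """Remove every TextLine that contains a Unicode element with no
    non-blank text. Returns the number of removed elements.
    (Hypothetical helper written for this example.)"""
    ns = {"pc": PAGE_NS}  # "pc" is an arbitrary prefix chosen here
    # normalize-space(.) = '' is true for empty and whitespace-only elements.
    empty_lines = root.xpath(
        ".//pc:TextLine[.//pc:Unicode[normalize-space(.) = '']]",
        namespaces=ns,
    )
    for line in empty_lines:
        line.getparent().remove(line)
    return len(empty_lines)


root = ET.fromstring(SAMPLE.encode("UTF-8"))
removed = remove_empty_text_lines(root)

# Clark notation works with iter(), even though it does not work in xpath().
remaining_ids = [el.get("id") for el in root.iter("{%s}TextLine" % PAGE_NS)]
print(removed, remaining_ids)
```

Because the whole `TextLine` is removed in one step, no follow-up passes are needed to clean up the emptied `TextEquiv` and `TextLine` parents. A `TextLine` that has no `Unicode` element at all is left untouched by this expression.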