I have a problem with the XPath function of Python's lxml. A minimal example is the following Python code:
from lxml import html, etree
text = """
<p class="goal">
<strong>Goal</strong> <br />
<ul><li>test</li></ul>
</p>
"""
tree = html.fromstring(text)
thesis_goal = tree.xpath('//p[@class="goal"]')[0]
print(etree.tostring(thesis_goal, encoding="unicode"))
Running the code produces
<p class="goal">
<strong>Goal</strong> <br/>
</p>
As you can see, the entire <ul> block is lost. This also means that it is not possible to address the <ul> with an XPath along the lines of //p[@class="goal"]/ul, as the <ul> is not counted as a child of the <p>.
Is this a bug or a feature of lxml, and if it is the latter, how can I get access to the entire contents of the <p>? The thing is embedded in a larger website, and it is not guaranteed that there will even be a <ul> tag (there may be another <p> inside, or anything else, for that matter).
Update: Updated title after answer was received to make finding this question easier for people with the same problem.
ul elements (or more generally flow content) are not allowed inside p elements (which can only contain phrasing content). Therefore lxml.html parses text as:
In [45]: print(html.tostring(tree))
<div><p class="goal">
<strong>Goal</strong> <br>
</p><ul><li>test</li></ul>
</div>
The ul follows the p element instead of being nested inside it. So you could find the ul element using the XPath following axis:
In [47]: print(html.tostring(tree.xpath('//p[@class="goal"]/following::ul')[0]))
<ul><li>test</li></ul>
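Since the question notes that the nested element is not guaranteed to be a ul, here is a minimal sketch that collects whatever the parser moved out of the p, using the standard XPath following-sibling axis. It assumes the parse result shown above, where the displaced elements become siblings of the p under the same parent; the variable names goal and siblings are only illustrative.

```python
from lxml import html

text = """
<p class="goal">
<strong>Goal</strong> <br />
<ul><li>test</li></ul>
</p>
"""

tree = html.fromstring(text)
goal = tree.xpath('//p[@class="goal"]')[0]

# Elements that the HTML parser moved out of the <p> appear as its
# following siblings, whatever their tag is.
siblings = goal.xpath('following-sibling::*')

for element in siblings:
    print(element.tag, html.tostring(element, encoding="unicode").strip())
```

Note that following-sibling::* returns every later sibling under the same parent, so on a real page you would probably need to stop at the next element that clearly belongs to a different section. The following axis used above is broader still: it matches any ul that appears later in the document.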