Python lxml's XPath not finding <ul> in <p> tags


malexmave

I have a problem with the XPath function of Python's lxml. A minimal example is the following Python code:

from lxml import html, etree

text = """
      <p class="goal">
            <strong>Goal</strong> <br />
            <ul><li>test</li></ul>
        </p>
"""

tree = html.fromstring(text)
thesis_goal = tree.xpath('//p[@class="goal"]')[0]
print(etree.tostring(thesis_goal, encoding="unicode"))

Running the code produces

<p class="goal">
            <strong>Goal</strong> <br/>
            </p>

As you can see, the entire <ul> block is lost. This also means that it is not possible to address the <ul> with an XPath along the lines of //p[@class="goal"]/ul, as the <ul> is not counted as a child of the <p>.

Is this a bug or a feature of lxml, and if it is the latter, how can I get access to the entire contents of the <p>? The thing is embedded in a larger website, and it is not guaranteed that there will even be a <ul> tag (there may be another <p> inside, or anything else, for that matter).

Update: Updated title after answer was received to make finding this question easier for people with the same problem.

unutbu

ul elements (or more generally flow content) are not allowed inside p elements (which can only contain phrasing content). Therefore lxml.html closes the p implicitly as soon as it reaches the ul, and parses text as

In [45]: print(html.tostring(tree, encoding="unicode"))
<div><p class="goal">
            <strong>Goal</strong> <br>
            </p><ul><li>test</li></ul>

</div>
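The same structure can be checked without reading the serialized output. This is only a minimal sketch that reuses the snippet from the question; the variable names `goal`, `p_children` and `next_sibling` are mine, not from the original code.

```python
from lxml import html

text = """
<p class="goal">
    <strong>Goal</strong> <br />
    <ul><li>test</li></ul>
</p>
"""

tree = html.fromstring(text)
goal = tree.xpath('//p[@class="goal"]')[0]

# Tags of the <p>'s own children: the <ul> is not among them
p_children = [child.tag for child in goal]

# The element that comes right after the <p> on the same level
next_sibling = goal.getnext()

print(p_children)
print(next_sibling.tag)
```

Iterating over an element yields its direct children, and `getnext()` returns the following sibling, so the two together show where the parser put the ul.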

Since the ul now follows the p element instead of being nested in it, you can find the ul element using the XPath

In [47]: print(html.tostring(tree.xpath('//p[@class="goal"]/following::ul')[0], encoding="unicode"))
<ul><li>test</li></ul>
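One caveat when this is embedded in a larger page: the `following::` axis matches every ul that comes after the p in document order, not just the one that was pushed out of it. If you only want the element directly after the p, the `following-sibling::` axis with a position may be a better fit. The sketch below assumes a second, unrelated ul later in the markup (added by me purely for illustration) to show the difference.

```python
from lxml import html

# The second <ul> is a hypothetical addition, standing in for
# unrelated lists elsewhere on a larger page.
text = """
<p class="goal">
    <strong>Goal</strong> <br />
    <ul><li>test</li></ul>
</p>
<ul><li>unrelated</li></ul>
"""

tree = html.fromstring(text)
goal = tree.xpath('//p[@class="goal"]')[0]

# following::ul matches every <ul> after the <p> in document order
all_following = goal.xpath('following::ul')

# following-sibling::*[1] matches only the element right after the <p>,
# whatever its tag turns out to be
nearest = goal.xpath('following-sibling::*[1]')[0]

print(len(all_following))
print(nearest.tag, nearest.text_content())
```

Using `*` rather than `ul` in the sibling step also covers the case from the question where the displaced element is not a ul. Note that this cannot tell whether the next sibling was originally written inside the p or simply happens to follow it.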


Edited at 2021-07-14
