Python lxml's XPath not finding <ul> in <p> tags

malexmave

I have a problem with the XPath function of pythons lxml. A minimal example is the following python code:

from lxml import html, etree

text = """
      <p class="goal">
            <strong>Goal</strong> <br />
            <ul><li>test</li></ul>
        </p>
"""

tree = html.fromstring(text)
thesis_goal = tree.xpath('//p[@class="goal"]')[0]
print etree.tostring(thesis_goal)

Running the code produces

<p class="goal">
            <strong>Goal</strong> <br/>
            </p>

As you can see, the entire <ul> block is lost. This also means that it is not possible to address the <ul> with an XPath along the lines of //p[@class="goal"]/ul, as the <ul> is not counted as a child of the <p>.

Is this a bug or a feature of lxml, and if it is the latter, how can I get access to the entire contents of the <p>? The thing is embedded in a larger website, and it is not guaranteed that there will even be a <ul> tag (there may be another <p> inside, or anything else, for that matter).

Update: Updated title after answer was received to make finding this question easier for people with the same problem.

unutbu

ul elements (or more generally flow content) are not allowed inside p elements (which can only contain phrasing content). Therefore lxml.html parses text as

In [45]: print(html.tostring(tree))
<div><p class="goal">
            <strong>Goal</strong> <br>
            </p><ul><li>test</li></ul>

</div>

The ul follows the p element. So you could find the ul element using the XPath

In [47]: print(html.tostring(tree.xpath('//p[@class="goal"]/following::ul')[0]))
<ul><li>test</li></ul>

Collected from the Internet

Please contact [email protected] to delete if infringement.

edited at
0

Comments

0 comments
Login to comment

Related

From Dev

Python lxml's XPath not finding <ul> in <p> tags

From Dev

lxml xpath not finding anchor text

From Dev

Python lxml xpath no output

From Dev

Python lxml xpath not working

From Dev

Finding attachement, h/li/ul tags

From Dev

Python, How to use lxml XPath?

From Dev

xpath with lxml for Python to get data

From Dev

xpath to dic python, lxml and xml

From Dev

LXML using XPath and ETXPath in Python

From Dev

Why lxml isn't finding xpath given by Chrome inspector?

From Dev

finding p tag between two div tags

From Dev

Finding tags within comment tags - Python

From Dev

Can I use xpath (in lxml) to find the names of tags not known at the start?

From Dev

lxml xpath - Get all text within span tags

From Dev

Python crawler not finding specific Xpath

From Dev

python lxml loop through all tags

From Dev

Python - lxml to re-order xml tags

From Dev

Python : lxml xpath get two different classes

From Dev

Why does this xpath fail using lxml in python?

From Dev

Python - xpath syntax for a form (lxml.html)

From Dev

Merging xpath generated lists, Python, lxml

From Dev

Python lxml HTML xpath query code not working

From Dev

Why does this xpath fail using lxml in python?

From Dev

problems with xpath in python using lxml on xml file

From Dev

Use unicode as predicate in xpath with lxml and python 2.7

From Dev

parse selective table rows in python with lxml and xpath

From Dev

Python lxml XPath SyntaxError: invalid predicate

From Dev

How can I get the text with xPath between </ul> and <p>?

From Java

Selenium - Python Safari WebDriver - not finding element by xpath