Parsing HTML with regex sometimes doesn't work

debugcn Published at Dev

Erin

I have the following text, which I would like to parse with regex. I want to get everything inside the td with class "postcell". I am using this code, but it returns no matches.

re.finditer('<td class="postcell">(.+?)</td>', doc)

I definitely don't want to use BeautifulSoup.

<td class="postcell">
<div>
<div class="post-text" itemprop="text">
<p>Let us consider</p>
<pre class="lang-py prettyprint"><code>x = ['1', '2', '3', '4', '5']
y = ['a', 'b', 'c', 'd', 'e']
</code></pre>
<p>How do I get the required output <code>z</code>?</p>
<pre class="lang-py prettyprint"><code>z = [('1', 'a') , ('b', '2') , ('c', '3') , ('d', '4') , ('e', '5')]
</code></pre>
</div>
<div class="post-taglist">
<a href="/questions/tagged/python" class="post-tag js-gps-track" title="show questions tagged 'python'" rel="tag">python</a> <a href="/questions/tagged/list" class="post-tag js-gps-track" title="show questions tagged 'list'" rel="tag">list</a>
</div>
<table class="fw">
<tbody><tr>
<td class="vt">
<div class="post-menu"><a href="/q/9853438" title="short permalink to this question" class="short-link" id="link-post-9853438">share</a><span class="lsep">|</span><a href="/posts/9853438/edit" class="suggest-edit-post" title="">improve this question</a></div>
</td>
<td align="right" class="post-signature">
<div class="user-info user-hover">
<div class="user-action-time">
<a href="/posts/9853438/revisions" title="show all edits to this post">edited <span title="2012-03-24 16:42:59Z" class="relativetime">Mar 24 '12 at 16:42</span></a>
</div>
<div class="user-gravatar32">
<a href="/users/35070/phihag"><div class="gravatar-wrapper-32"><img src="https://www.gravatar.com/avatar/6f92354195e8874dbee44d5c8714d506?s=32&amp;d=identicon&amp;r=PG" alt="" width="32" height="32" /></div></a>
</div>
<div class="user-details">
<a href="/users/35070/phihag">phihag</a>
<div class="-flair">
<span class="reputation-score" title="reputation score 132,147" dir="ltr">132k</span><span title="31 gold badges"><span class="badge1"></span><span class="badgecount">31</span></span><span title="252 silver badges"><span class="badge2"></span><span class="badgecount">252</span></span><span title="308 bronze badges"><span class="badge3"></span><span class="badgecount">308</span></span>
</div>
</div>
</div> </td>
<td class="post-signature owner">
<div class="user-info ">
<div class="user-action-time">
        asked <span title="2012-03-24 16:40:17Z" class="relativetime">Mar 24 '12 at 16:40</span>
</div>
<div class="user-gravatar32">
<a href="/users/1168528/karthik-reddi"><div class="gravatar-wrapper-32"><img src="https://www.gravatar.com/avatar/acce3b34402cd7646c175c273dee1616?s=32&amp;d=identicon&amp;r=PG" alt="" width="32" height="32" /></div></a>
</div>
<div class="user-details">
<a href="/users/1168528/karthik-reddi">Karthik Reddi</a>
<div class="-flair">
<span class="reputation-score" title="reputation score " dir="ltr">10</span><span title="2 bronze badges"><span class="badge3"></span><span class="badgecount">2</span></span>
</div>
</div>
</div>
</td>
</tr>
</tbody></table>
</div>
</td>

Trevor Merrifield

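Escaping the / is not the issue: in a Python regex, / has no special meaning, so </td> and <\/td> match the same text. Your pattern returns nothing because . does not match newlines by default, and your td spans many lines. Pass re.DOTALL so that . can cross line breaks:

re.finditer(r'<td class="postcell">(.+?)</td>', doc, re.DOTALL)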
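
A quick way to check what actually changes the result, as a minimal sketch: the `doc` string below is a shortened, hypothetical stand-in for the HTML in the question, not the real page. It compares the pattern with and without `re.DOTALL`, and with and without the escaped slash.

```python
import re

# Hypothetical stand-in for the asker's `doc`: a post cell spanning several lines.
doc = '<td class="postcell">\n<div>\n<p>Let us consider</p>\n</div>\n</td>'

pattern = r'<td class="postcell">(.+?)</td>'
escaped_pattern = r'<td class="postcell">(.+?)<\/td>'

# Without a flag, "." refuses to match "\n", so the multi-line cell is never matched.
no_flag = [m.group(1) for m in re.finditer(pattern, doc)]

# With re.DOTALL, "." also matches "\n" and the cell body is captured.
with_flag = [m.group(1) for m in re.finditer(pattern, doc, re.DOTALL)]

# Escaping the slash changes nothing: "/" is an ordinary character in Python regex.
with_escape = [m.group(1) for m in re.finditer(escaped_pattern, doc, re.DOTALL)]

print(len(no_flag), len(with_flag), with_flag == with_escape)
```

If the unflagged pattern finds matches on some pages but not others, a likely explanation is that those cells happen to sit on a single line.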

Other commenters are right that it's impossible to parse HTML with regex in general. For your case it might be good enough. Just know the limitations: regular expressions are blind to nesting, so if there's a </td> inside one of your post cells, the match will end early. That happens with your sample, which contains a nested table: the non-greedy (.+?) stops at the first </td>, the one that closes <td class="vt">.
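
If the nesting problem bites and BeautifulSoup is off the table, one option is the standard library's `html.parser`. The following is only a sketch under assumptions: the class name `PostCellExtractor` and the `sample` string are made up for illustration, comments and doctype declarations inside the cell are dropped, and the markup is assumed to close its `<td>` tags properly. It counts `<td>` depth so that a nested `</td>` does not end the cell.

```python
import re
from html.parser import HTMLParser


class PostCellExtractor(HTMLParser):
    """Collect the inner HTML of each <td class="postcell">, nested <td>s included."""

    def __init__(self):
        # Keep entities such as &amp; as written instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.depth = 0    # 0 = outside a post cell, otherwise the <td> nesting level
        self.parts = []   # raw pieces of the cell currently being read
        self.cells = []   # finished cells

    def handle_starttag(self, tag, attrs):
        if self.depth:
            if tag == "td":
                self.depth += 1
            self.parts.append(self.get_starttag_text())
        elif tag == "td" and "postcell" in (dict(attrs).get("class") or "").split():
            self.depth = 1

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <img ... /> are copied through unchanged.
        if self.depth:
            self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if not self.depth:
            return
        if tag == "td":
            self.depth -= 1
            if self.depth == 0:
                # This </td> closes the post cell itself.
                self.cells.append("".join(self.parts))
                self.parts = []
                return
        self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        if self.depth:
            self.parts.append(data)

    def handle_entityref(self, name):
        if self.depth:
            self.parts.append(f"&{name};")

    def handle_charref(self, name):
        if self.depth:
            self.parts.append(f"&#{name};")


# Hypothetical sample with a nested <td>, mirroring the structure in the question.
sample = ('<td class="postcell">\n<div><table><tr>'
          '<td class="vt">share</td></tr></table></div>\n</td>')

parser = PostCellExtractor()
parser.feed(sample)
parser.close()
cells = parser.cells

# For comparison: the regex capture is cut off at the nested </td>.
regex_cell = re.search(r'<td class="postcell">(.+?)</td>', sample, re.DOTALL).group(1)
```

Here `cells[0]` holds the whole cell body, while `regex_cell` ends right after `share`, which is the early termination described above.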
