Extracting a table from a web page using a regular expression

Posted by Juicy in Dev

Juicy

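I want to extract the table containing the IP blocks from this site.

Looking at the HTML source, I can clearly see that the region I need is structured as follows: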

[CONTENT BEFORE TABLE]
<table border="1" cellpadding="6" bordercolor="#000000">
[IP ADDRESSES AND OTHER INFO]
</table>
[CONTENT AFTER TABLE]

So I wrote this small snippet:

import urllib2,re
from lxml import html
response = urllib2.urlopen('http://www.nirsoft.net/countryip/za.html')

content = response.read()

print re.match(r"(.*)<table border=\"1\" cellpadding=\"6\" bordercolor=\"#000000\">(.*)</table>(.*)",content)

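The page content is fetched without problems (and is correct). However, the regex match always returns None (the print here is only for debugging).

Given the structure of the page, I don't understand why there is no match. I expected three groups, with the second group being the table content.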

user2555451

By default, . does not match a newline character. You need to pass the dot-all flag to make it do so:

re.match(..., content, re.DOTALL)

Here is a demonstration:

>>> import re
>>> content = '''
... [CONTENT BEFORE TABLE]
... <table border="1" cellpadding="6" bordercolor="#000000">
... [IP ADDRESSES AND OTHER INFO]
... </table>
... [CONTENT AFTER TABLE]
... '''
>>> pat = r"(.*)<table border=\"1\" cellpadding=\"6\" bordercolor=\"#000000\">(.*)</table>(.*)"
>>> re.match(pat, content, re.DOTALL)
<_sre.SRE_Match object at 0x02520520>
>>> re.match(pat, content, re.DOTALL).group(2)
'\n[IP ADDRESSES AND OTHER INFO]\n'
>>>

You can also enable the dot-all flag by using re.S or by placing (?s) at the start of the pattern.
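
As a rough side-by-side sketch of those options, the snippet below runs the same pattern with no flag, with re.DOTALL, with re.S, and with an inline (?s) prefix. The sample text reuses the placeholder structure from the question rather than the real page, and the variable names (no_flag, with_dotall, with_s, with_inline, table_bodies) are made up for illustration.

```python
import re

# Placeholder text mirroring the page layout described in the question
# (not the real page content).
content = '''
[CONTENT BEFORE TABLE]
<table border="1" cellpadding="6" bordercolor="#000000">
[IP ADDRESSES AND OTHER INFO]
</table>
[CONTENT AFTER TABLE]
'''

pat = r'(.*)<table border="1" cellpadding="6" bordercolor="#000000">(.*)</table>(.*)'

# Without a flag, "." cannot cross the newline before the <table> tag.
no_flag = re.match(pat, content)

# Three ways of switching dot-all mode on.
with_dotall = re.match(pat, content, re.DOTALL)
with_s = re.match(pat, content, re.S)            # short name for re.DOTALL
with_inline = re.match("(?s)" + pat, content)    # inline flag at the pattern start

# Group 2 is the text between the opening and closing table tags.
table_bodies = [m.group(2) for m in (with_dotall, with_s, with_inline)]
```

Under these assumptions, no_flag is None while the other three calls each produce a match whose second group holds the table body. Which spelling to use is mostly a matter of taste; the inline form can be convenient when the pattern is stored as a plain string and the flags cannot be passed separately.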
