Python BeautifulSoup: getting variables from a garbage page

麦格

Folks, I'm trying to get some variables out of a badly formatted page.

html =  response.read()
soup = BeautifulSoup(html)
links = soup.findAll('a')

for link in links:
    for x in link.attrs:
        print x

Output:

(u'href', u"javascript:Set_Variables('FIRSTNAME,LASTNAME', \r\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t'123456789123', \r\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t'FOOOOOOO',\r\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t'54',\r\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t'2014',\r\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t'BAZZZZ',\r\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t'BARRRRRRRRRR',\r\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t'07/31/2015',\r\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t'')")
(u'onmouseover', u"javascript: return window.status=''")
(u'href', u"javascript:Set_Variables('FIRSTNAME,LASTNAME', \r\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t'123456789123', \r\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t'FOOOOOOO',\r\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t'54',\r\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t'2014',\r\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t'BAZZZZ',\r\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t'BARRRRRRRRRR',\r\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t'07/31/2015',\r\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t'')")
(u'onmouseover', u"javascript: return window.status=''")

How can I grab FIRSTNAME,LASTNAME, FOOOOOOO, BARRRRR, BAZZZZZ and 123456789123 out of all this mess?

Thanks!

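Martijn Pieters

First of all, you only need to look at the href attribute here.

Take everything between the parentheses, split on whitespace, and strip the commas and quotes: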
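As a rough sketch of that filtering step, assuming the attributes arrive as (name, value) pairs like the ones printed in the question: the helper name `set_variables_hrefs` and the sample data are made up for illustration. With a tag object you can simply read `link['href']`, which is what the answer's own code does.

```python
def set_variables_hrefs(attr_pairs):
    """Keep only the href values that call Set_Variables (hypothetical helper)."""
    return [
        value
        for name, value in attr_pairs
        if name == 'href' and value.startswith('javascript:Set_Variables(')
    ]


# Illustrative data shaped like the question's output, shortened for readability.
pairs = [
    ('href', "javascript:Set_Variables('FIRSTNAME,LASTNAME', '123456789123')"),
    ('onmouseover', "javascript: return window.status=''"),
]
hrefs = set_variables_hrefs(pairs)
print(hrefs)
```

The onmouseover entries are dropped because they carry none of the values asked for.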

args = link['href'].partition('(')[-1].rpartition(')')[0]
args = [v.rstrip(',').strip("'") for v in args.split()]

Demo:

>>> href = u"javascript:Set_Variables('FIRSTNAME,LASTNAME', \r\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t'123456789123', \r\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t'FOOOOOOO',\r\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t'54',\r\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t'2014',\r\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t'BAZZZZ',\r\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t'BARRRRRRRRRR',\r\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t'07/31/2015',\r\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t'')"
>>> href.partition('(')[-1].rpartition(')')[0]
u"'FIRSTNAME,LASTNAME', \r\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t'123456789123', \r\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t'FOOOOOOO',\r\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t'54',\r\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t'2014',\r\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t'BAZZZZ',\r\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t'BARRRRRRRRRR',\r\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t'07/31/2015',\r\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t''"
>>> [v.rstrip(',').strip("'") for v in href.partition('(')[-1].rpartition(')')[0].split()]
[u'FIRSTNAME,LASTNAME', u'123456789123', u'FOOOOOOO', u'54', u'2014', u'BAZZZZ', u'BARRRRRRRRRR', u'07/31/2015', u'']
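Splitting on whitespace works for the data shown, but it would break a value that itself contains a space. A possible alternative, not part of the original answer, is to capture the quoted strings with a regular expression. This is a minimal sketch: the function name `parse_set_variables` and the shortened sample href (including the value 'FOO BAR') are assumptions for illustration, and it assumes the values never contain an escaped single quote.

```python
import re


def parse_set_variables(href):
    """Return the single-quoted arguments of a javascript:Set_Variables(...) href."""
    # Keep only the text between the first '(' and the last ')'.
    args = href.partition('(')[-1].rpartition(')')[0]
    # Capture the contents of each '...' pair; whitespace inside a value is kept.
    return re.findall(r"'([^']*)'", args)


# Shortened sample in the same shape as the question's href (assumed data).
href = ("javascript:Set_Variables('FIRSTNAME,LASTNAME', \r\n\t\t'123456789123', "
        "\r\n\t\t'FOO BAR',\r\n\t\t'07/31/2015',\r\n\t\t'')")
values = parse_set_variables(href)
print(values)
```

As with the split-based version, the trailing empty argument comes back as an empty string, so the positions of the values stay the same.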
