How to scrape an embedded script on a web page in Python

so3

For example, I have the web page http://www.amazon.com/dp/1597805483

I want to use XPath to scrape this sentence: Of all the sports played across the globe, none has more curses and superstitions than baseball, America’s national pastime.

import requests
from lxml import html

url = "http://www.amazon.com/dp/1597805483"
page = requests.get(url)
tree = html.fromstring(page.text)
feature_bullets = tree.xpath('//*[@id="iframeContent"]/div/text()')
print(feature_bullets)

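The code above returns nothing. The reason is that the XPath the browser shows refers to the rendered page, which differs from the page source. But I don't know how to get an XPath that works against the source.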

ec

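A lot of things are involved in constructing the page you are scraping.

As for the description specifically, the underlying HTML is constructed inside a JavaScript function: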

<script type="text/javascript">

    P.when('DynamicIframe').execute(function (DynamicIframe) {
        var BookDescriptionIframe = null,
                bookDescEncodedData = "%3Cdiv%3E%3CB%3EA%20Fantastic%20Anthology%20Combining%20the%20Love%20of%20Science%20Fiction%20with%20Our%20National%20Pastime%3C%2FB%3E%3CBR%3E%3CBR%3EOf%20all%20the%20sports%20played%20across%20the%20globe%2C%20none%20has%20more%20curses%20and%20superstitions%20than%20baseball%2C%20America%26%238217%3Bs%20national%20pastime.%3Cbr%3E%3CBR%3E%3CI%3EField%20of%20Fantasies%3C%2FI%3E%20delves%20right%20into%20that%20superstition%20with%20short%20stories%20written%20by%20several%20key%20authors%20about%20baseball%20and%20the%20supernatural.%20%20Here%20you%27ll%20encounter%20ghostly%20apparitions%20in%20the%20stands%2C%20a%20strangely%20charming%20vampire%20double-play%20combination%2C%20one%20fan%20who%20can%20call%20every%20shot%20and%20another%20who%20can%20see%20the%20past%2C%20a%20sad%20alternate-reality%20for%20the%20game%27s%20most%20famous%20player%2C%20unlikely%20appearances%20on%20the%20field%20by%20famous%20personalities%20from%20Stephen%20Crane%20to%20Fidel%20Castro%2C%20a%20hilariously%20humble%20teenage%20phenom%2C%20and%20much%20more.%20In%20this%20wonderful%20anthology%20are%20stories%20from%20such%20award-winning%20writers%20as%3A%3CBR%3E%3CBR%3EStephen%20King%20and%20Stewart%20O%26%238217%3BNan%3Cbr%3EJack%20Kerouac%3CBR%3EKaren%20Joy%20Fowler%3CBR%3ERod%20Serling%3CBR%3EW.%20P.%20Kinsella%3CBR%3EAnd%20many%20more%21%3CBR%3E%3CBR%3ENever%20has%20a%20book%20combined%20the%20incredible%20with%20great%20baseball%20fiction%20like%20%3CI%3EField%20of%20Fantasies%3C%2FI%3E.%20This%20wide-ranging%20collection%20reaches%20from%20some%20of%20the%20earliest%20classics%20from%20the%20pulp%20era%20and%20baseball%27s%20golden%20age%2C%20all%20the%20way%20to%20material%20appearing%20here%20for%20the%20first%20time%20in%20a%20print%20edition.%20Whether%20you%20love%20the%20game%20or%20just%20great%20fiction%2C%20these%20stories%20will%20appeal%20to%20all%2C%20as%20the%20writers%20in%20this%20anthology%20bring%20great%20storytelling%20of%20the%20strange%20and%20supernatural%20to%20the%20plate%2C%20inning%20after%20inning.%3CBR%3E%3C%2Fdiv%3E",
                bookDescriptionAvailableHeight,
                minBookDescriptionInitialHeight = 112,
                options = {};
    ...

</script>

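The idea here is to get the script tag's text, extract the description value with a regular expression, unquote the percent-encoded HTML, parse it with lxml.html, and get the .text_content():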

import re
from urllib.parse import unquote

from lxml import html
import requests

url = "http://rads.stackoverflow.com/amzn/click/1597805483"
page = requests.get(url, headers={'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.111 Safari/537.36'})
tree = html.fromstring(page.content)

# Find the <script> element whose text defines bookDescEncodedData.
script = tree.xpath('//script[contains(., "bookDescEncodedData")]')[0]
match = re.search(r'bookDescEncodedData = "(.*?)",', script.text)
if match:
    # Undo the percent-encoding, then parse the resulting HTML fragment.
    description_html = html.fromstring(unquote(match.group(1)))
    print(description_html.text_content())

Prints:

A Fantastic Anthology Combining the Love of Science Fiction with Our National Pastime. 
Of all the sports played across the globe, none has more curses and superstitions than baseball, America’s national pastime.Field of Fantasies delves right into that superstition with short stories written by several key authors about baseball and the supernatural.  
Here you'll encounter ghostly apparitions in the stands, a strangely charming vampire double-play combination, one fan who can call every shot and another who can see the past, a sad alternate-reality for the game's most famous player, unlikely appearances on the field by famous personalities from Stephen Crane to Fidel Castro, a hilariously humble teenage phenom, and much more. 
In this wonderful anthology are stories from such award-winning writers as:Stephen King and Stewart O’NanJack KerouacKaren Joy FowlerRod SerlingW. P. KinsellaAnd many more!Never has a book combined the incredible with great baseball fiction like Field of Fantasies. 
This wide-ranging collection reaches from some of the earliest classics from the pulp era and baseball's golden age, all the way to material appearing here for the first time in a print edition. Whether you love the game or just great fiction, these stories will appeal to all, as the writers in this anthology bring great storytelling of the strange and supernatural to the plate, inning after inning.
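
The extraction and decoding steps can also be tried without a network request. The sketch below is not part of the original answer: it applies the same regular expression and `unquote` call to a shortened, hand-made sample, and it uses only the standard library. `SAMPLE_SCRIPT` and `extract_description` are hypothetical names, and the crude tag stripping with `html.unescape` merely stands in for what a real HTML parser such as lxml does.

```python
import html
import re
from urllib.parse import unquote

# Shortened stand-in for the real <script> body; the live value is much longer.
SAMPLE_SCRIPT = '''
    var BookDescriptionIframe = null,
        bookDescEncodedData = "%3Cdiv%3E%3CB%3EA%20Fantastic%20Anthology%3C%2FB%3E%3CBR%3EAmerica%26%238217%3Bs%20national%20pastime.%3C%2Fdiv%3E",
        bookDescriptionAvailableHeight,
'''


def extract_description(script_text):
    """Return the decoded description HTML, or None if the variable is absent."""
    match = re.search(r'bookDescEncodedData = "(.*?)",', script_text)
    if match is None:
        return None
    # unquote() reverses the percent-encoding (%3C -> <, %20 -> space, ...).
    return unquote(match.group(1))


description_html = extract_description(SAMPLE_SCRIPT)
print(description_html)

# Crude tag stripping for illustration only; a real HTML parser is more robust.
plain_text = html.unescape(re.sub(r'<[^>]+>', ' ', description_html))
plain_text = ' '.join(plain_text.split())
print(plain_text)
```

Note that `unquote` only undoes the percent-encoding; HTML entities such as `&#8217;` are still present afterwards and are resolved by the HTML parser (here, by `html.unescape`).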

A similar solution, but using BeautifulSoup:

import re
from urllib.parse import unquote

from bs4 import BeautifulSoup
import requests

url = "http://rads.stackoverflow.com/amzn/click/1597805483"
page = requests.get(url, headers={'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.111 Safari/537.36'})
soup = BeautifulSoup(page.content, 'html.parser')

# x is None for <script> tags without string content, so guard before using "in".
script = soup.find('script', string=lambda x: x and 'bookDescEncodedData' in x)
match = re.search(r'bookDescEncodedData = "(.*?)",', script.string)
if match:
    description_html = BeautifulSoup(unquote(match.group(1)), 'html.parser')
    print(description_html.text)

Alternatively, you can take a higher-level approach and use a real browser with the help of selenium:

from selenium import webdriver
from selenium.webdriver.common.by import By

url = "http://rads.stackoverflow.com/amzn/click/1597805483"

driver = webdriver.Firefox()
driver.get(url)

# The description is rendered inside an iframe, so switch into it first.
iframe = driver.find_element(By.ID, 'bookDesc_iframe')
driver.switch_to.frame(iframe)

print(driver.find_element(By.ID, 'iframeContent').text)

driver.close()

This produces better-formatted output:

A Fantastic Anthology Combining the Love of Science Fiction with Our National Pastime

Of all the sports played across the globe, none has more curses and superstitions than baseball, America’s national pastime.

Field of Fantasies delves right into that superstition with short stories written by several key authors about baseball and the supernatural. Here you'll encounter ghostly apparitions in the stands, a strangely charming vampire double-play combination, one fan who can call every shot and another who can see the past, a sad alternate-reality for the game's most famous player, unlikely appearances on the field by famous personalities from Stephen Crane to Fidel Castro, a hilariously humble teenage phenom, and much more. In this wonderful anthology are stories from such award-winning writers as:

Stephen King and Stewart O’Nan
Jack Kerouac
Karen Joy Fowler
Rod Serling
W. P. Kinsella
And many more!

Never has a book combined the incredible with great baseball fiction like Field of Fantasies. This wide-ranging collection reaches from some of the earliest classics from the pulp era and baseball's golden age, all the way to material appearing here for the first time in a print edition. Whether you love the game or just great fiction, these stories will appeal to all, as the writers in this anthology bring great storytelling of the strange and supernatural to the plate, inning after inning.
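
The lxml and BeautifulSoup output runs the author names together because `<BR>` tags contribute no text of their own. If launching a browser is too heavy, one option is to turn `<br>` into newlines while extracting the text. A minimal standard-library sketch, not from the original answer (`BreakAwareText` and `text_with_breaks` are hypothetical names, and the sample fragment is shortened from the decoded description):

```python
from html.parser import HTMLParser


class BreakAwareText(HTMLParser):
    """Collect text from an HTML fragment, turning <br> tags into newlines."""

    def __init__(self):
        # convert_charrefs=True decodes entities such as &#8217; automatically.
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names, so <BR> and <br> both arrive as "br".
        if tag == "br":
            self.parts.append("\n")

    def handle_data(self, data):
        self.parts.append(data)


def text_with_breaks(fragment):
    parser = BreakAwareText()
    parser.feed(fragment)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.parts).strip()


sample = ("<div>award-winning writers as:<BR><BR>"
          "Stephen King and Stewart O&#8217;Nan<br>Jack Kerouac<BR>Rod Serling<BR></div>")
print(text_with_breaks(sample))
```

The same idea carries over to the decoded description from the earlier snippets: pass the result of `unquote(match.group(1))` to `text_with_breaks` instead of the shortened sample.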
