How to scrape a dynamic web page with Python

Posted by dixhom in Dev

dixhom

[What I want to do]

Scrape the web page below to get used-car data.
http://www.goo-net.com/php/search/summary.php?price_range=&pref_c=08,09,10,11,12,13,14&easysearch_flg=1

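[Problem]

Scraping all of the result pages. The URL above shows only the first 30 items. I can scrape those with the code I wrote below. Links to the other pages are displayed as 1 2 3 ..., but the link addresses appear to be in JavaScript. I searched Google for useful information but could not find any.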

from bs4 import BeautifulSoup
import urllib.request

html = urllib.request.urlopen("http://www.goo-net.com/php/search/summary.php?price_range=&pref_c=08,09,10,11,12,13,14&easysearch_flg=1")

soup = BeautifulSoup(html, "lxml")
total_cars = soup.find(class_="change change_01").find('em').string
tmp = soup.find(class_="change change_01").find_all('span')
car_start, car_end = tmp[0].string, tmp[1].string

# get urls to car detail pages
car_urls = []
heading_inners = soup.find_all(class_="heading_inner")
for heading_inner in heading_inners:
    href = heading_inner.find('h4').find('a').get('href')
    car_urls.append('http://www.goo-net.com' + href)

for url in car_urls:
    html = urllib.request.urlopen(url)
    soup = BeautifulSoup(html, "lxml")
    # title
    print(soup.find(class_='hdBlockTop').find('p', class_='tit').string)
    # price of the car itself
    print(soup.find(class_='price1').string)
    # price of the car including tax
    print(soup.find(class_='price2').string)

    tds = soup.find(class_='subData').find_all('td')
    # year
    print(tds[0].string)
    # distance
    print(tds[1].string)
    # displacement
    print(tds[2].string)
    # inspection
    print(tds[3].string)

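[What I want to know]

How to scrape all of the result pages. I would prefer to use BeautifulSoup4 (Python), but if that is not the appropriate tool, please show me other tools.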

[My environment]

Windows 8.1
Python 3.5
PyDev (Eclipse)
BeautifulSoup4

Any guidance would be appreciated. Thank you.

艾哈迈德·瓦利普

You can use Selenium, as in the example below:

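from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Firefox()
driver.get('http://example.com')
# The find_element_by_* helpers were removed in recent Selenium 4 releases;
# use find_element() with a By locator instead.
element = driver.find_element(By.CLASS_NAME, "yourClassName")  # or By.LINK_TEXT, By.XPATH, etc.
element.click()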
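
To cover every result page rather than only the first 30 items, one option is to let Selenium click through the numbered page links and hand each rendered page to BeautifulSoup. The following is a minimal sketch under stated assumptions, not a tested solution for goo-net.com. The names `scrape_pages`, `scrape_used_cars`, `parse_page` and `go_to_page` are hypothetical. It assumes the pager links can be located by their visible text (1 2 3 ...) and that clicking one replaces the old link in the DOM. The `heading_inner` selector is taken from the question's code.

```python
from urllib.parse import urljoin


def scrape_pages(driver, parse_page, go_to_page, max_pages=5):
    """Collect parse_page() results from up to max_pages consecutive pages.

    driver     -- a Selenium WebDriver (anything exposing .page_source works)
    parse_page -- callable: HTML string -> list of items
    go_to_page -- callable: (driver, page_number) -> True if navigation worked
    """
    results = []
    for page in range(1, max_pages + 1):
        results.extend(parse_page(driver.page_source))
        # Stop at the page limit or when there is no link to the next page.
        if page == max_pages or not go_to_page(driver, page + 1):
            break
    return results


def scrape_used_cars(start_url, max_pages=3):
    """Open Firefox, walk the numbered pager, and return car detail URLs."""
    # Third-party imports live here so scrape_pages() can be tried
    # without a browser or Selenium installed.
    from bs4 import BeautifulSoup
    from selenium import webdriver
    from selenium.common.exceptions import NoSuchElementException
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    def parse_page(html):
        soup = BeautifulSoup(html, "lxml")
        # Same elements the question's code reads: .heading_inner > h4 > a
        links = soup.select(".heading_inner h4 a[href]")
        return [urljoin(start_url, a["href"]) for a in links]

    def go_to_page(drv, page_number):
        # Assumption: the pager link's visible text is the page number.
        try:
            link = drv.find_element(By.LINK_TEXT, str(page_number))
        except NoSuchElementException:
            return False
        link.click()  # the browser runs the JavaScript behind the link
        # Assumption: the old pager link leaves the DOM once the new page loads.
        WebDriverWait(drv, 10).until(EC.staleness_of(link))
        return True

    driver = webdriver.Firefox()
    try:
        driver.get(start_url)
        return scrape_pages(driver, parse_page, go_to_page, max_pages)
    finally:
        driver.quit()
```

Calling `scrape_used_cars(...)` with the search URL from the question should return the detail-page URLs, which can then be fed to the per-car loop in the question. If the site updates the list in place without replacing the pager, the `staleness_of` wait would need a different condition.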
