I am trying to scrape some pages from LinkedIn with BeautifulSoup and I keep getting the error "HTTP Error 999: Request denied". Is there a way to avoid this error? If you look at my code, I have already tried both Mechanize and urllib2, and both give me the same error.
from __future__ import unicode_literals
from bs4 import BeautifulSoup
import csv
import os
import re
import requests
import pandas as pd
import urlparse
import urllib
import urllib2
import codecs
import mechanize
fout5 = codecs.open('data.csv','r', encoding='utf-8', errors='replace')
for y in range(2, 10, 1):
    # rebuild the URL so that page_num in the query string is the current page
    url = "https://www.linkedin.com/job/analytics-%2b-data-jobs-united-kingdom/?sort=relevance&page_num=1"
    params = {'page_num': y}
    url_parts = list(urlparse.urlparse(url))
    query = dict(urlparse.parse_qsl(url_parts[4]))
    query.update(params)
    url_parts[4] = urllib.urlencode(query)
    page_url = urlparse.urlunparse(url_parts)

    # earlier attempt that raised the same HTTP 999 error:
    # f = urllib2.urlopen(page_url)

    op = mechanize.Browser()        # use mechanize's browser
    op.set_handle_robots(False)     # tell the site you are ignoring robots.txt
    response = op.open(page_url)    # this is where "HTTP Error 999" is raised

    # parse the response body, not the URL string
    soup1 = BeautifulSoup(response.read())
    print soup1
Try setting a User-Agent header. Add this line after op.set_handle_robots(False):
op.addheaders = [('User-Agent', "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.71 Safari/537.36")]
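Putting the pieces together, here is a minimal sketch of how that header fits into the mechanize flow from the question; the URL is the one from the question, and leaving out the parser argument to BeautifulSoup simply follows the original code:

import mechanize
from bs4 import BeautifulSoup

br = mechanize.Browser()
br.set_handle_robots(False)   # do not honour robots.txt
br.addheaders = [('User-Agent',
                  "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/39.0.2171.71 Safari/537.36")]

response = br.open("https://www.linkedin.com/job/analytics-%2b-data-jobs-united-kingdom/?sort=relevance&page_num=2")
soup = BeautifulSoup(response.read())
print soup.title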
Edit: if you want to scrape a website, first check whether it exposes an API, or whether there is a library that handles that API for you.
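If you stay with plain scraping rather than an API, the same User-Agent trick can also be tried with the requests library that the question already imports; this is only a sketch, and LinkedIn may still answer automated traffic with 999:

import requests
from bs4 import BeautifulSoup

headers = {
    'User-Agent': ("Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
                   "(KHTML, like Gecko) Chrome/39.0.2171.71 Safari/537.36")
}
url = "https://www.linkedin.com/job/analytics-%2b-data-jobs-united-kingdom/?sort=relevance&page_num=2"
resp = requests.get(url, headers=headers)
print resp.status_code          # 999 means the request was still rejected
soup = BeautifulSoup(resp.text)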