I've written a script in scrapy to fetch the name, phone number, and email from a website. The content I'm after is spread across two different links: the name and phone are on one page, and the email is on another. I'm using yellowpages.com here as an example, and I'm trying to implement the logic so that the email can be parsed even though it sits on each listing's landing page. The requirement is that I can't use meta. However, I did get it working by combining requests and BeautifulSoup with scrapy, meeting the conditions above, but it is really slow.

Working one (with requests and BeautifulSoup):
import scrapy
import requests
from bs4 import BeautifulSoup
from scrapy.crawler import CrawlerProcess

def get_email(target_link):
    res = requests.get(target_link)
    soup = BeautifulSoup(res.text, "lxml")
    email = soup.select_one("a.email-business[href^='mailto:']")
    if email:
        return email.get("href")
    return None

class YellowpagesSpider(scrapy.Spider):
    name = "yellowpages"
    start_urls = ["https://www.yellowpages.com/search?search_terms=Coffee+Shops&geo_location_terms=San+Francisco%2C+CA"]

    def parse(self, response):
        for items in response.css("div.v-card .info"):
            name = items.css("a.business-name > span::text").get()
            phone = items.css("div.phones::text").get()
            email = get_email(response.urljoin(items.css("a.business-name::attr(href)").get()))
            yield {"Name": name, "Phone": phone, "Email": email}

if __name__ == "__main__":
    c = CrawlerProcess({
        'USER_AGENT': 'Mozilla/5.0',
    })
    c.crawl(YellowpagesSpider)
    c.start()
I tried to mimic the concept above without requests and BeautifulSoup, but couldn't make it work:
import scrapy
from scrapy.crawler import CrawlerProcess

class YellowpagesSpider(scrapy.Spider):
    name = "yellowpages"
    start_urls = ["https://www.yellowpages.com/search?search_terms=Coffee+Shops&geo_location_terms=San+Francisco%2C+CA"]

    def parse(self, response):
        for items in response.css("div.v-card .info"):
            name = items.css("a.business-name > span::text").get()
            phone = items.css("div.phones::text").get()
            email_link = response.urljoin(items.css("a.business-name::attr(href)").get())
            # CAN'T APPLY THE LOGIC IN THE FOLLOWING LINE
            email = self.get_email(email_link)
            yield {"Name": name, "Phone": phone, "Email": email}

    def get_email(self, link):
        email = response.css("a.email-business[href^='mailto:']::attr(href)").get()
        return email

if __name__ == "__main__":
    c = CrawlerProcess({
        'USER_AGENT': 'Mozilla/5.0',
    })
    c.crawl(YellowpagesSpider)
    c.start()
How can I make my second script work the way the first one does?
I would use response.meta, but since it needs to be avoided, okay, let's try it another way: check out the library https://pypi.org/project/scrapy-inline-requests/
import scrapy
from inline_requests import inline_requests

class YellowpagesSpider(scrapy.Spider):
    name = "yellowpages"
    start_urls = ["https://www.yellowpages.com/search?search_terms=Coffee+Shops&geo_location_terms=San+Francisco%2C+CA"]

    @inline_requests
    def parse(self, response):
        for items in response.css("div.v-card .info"):
            name = items.css("a.business-name > span::text").get()
            phone = items.css("div.phones::text").get()
            email_url = items.css("a.business-name::attr(href)").get()
            # Yielding the Request here suspends parse() until the detail
            # page is downloaded, so no meta is needed to carry state
            email_resp = yield scrapy.Request(response.urljoin(email_url), meta={'handle_httpstatus_all': True})
            email = email_resp.css("a.email-business[href^='mailto:']::attr(href)").get() if email_resp.status == 200 else None
            yield {"Name": name, "Phone": phone, "Email": email}