I've written a script in scrapy to fetch the name, phone number, and email from a website. The content I'm after is spread across two different links: the name and phone are on one page, and the email is on another. I'm using yellowpages.com here as an example, and I'm trying to implement the logic so that the email can be parsed even though it sits on each listing's landing page. The requirement is that I can't use meta. However, I did get it working by combining requests and BeautifulSoup with scrapy, meeting the conditions above, but it is really slow.

Working one (with requests and BeautifulSoup):
import scrapy
import requests
from bs4 import BeautifulSoup
from scrapy.crawler import CrawlerProcess

def get_email(target_link):
    res = requests.get(target_link)
    soup = BeautifulSoup(res.text, "lxml")
    email = soup.select_one("a.email-business[href^='mailto:']")
    if email:
        return email.get("href")
    return None

class YellowpagesSpider(scrapy.Spider):
    name = "yellowpages"
    start_urls = ["https://www.yellowpages.com/search?search_terms=Coffee+Shops&geo_location_terms=San+Francisco%2C+CA"]

    def parse(self, response):
        for items in response.css("div.v-card .info"):
            name = items.css("a.business-name > span::text").get()
            phone = items.css("div.phones::text").get()
            email = get_email(response.urljoin(items.css("a.business-name::attr(href)").get()))
            yield {"Name": name, "Phone": phone, "Email": email}

if __name__ == "__main__":
    c = CrawlerProcess({
        'USER_AGENT': 'Mozilla/5.0',
    })
    c.crawl(YellowpagesSpider)
    c.start()
I tried to mimic the concept above without requests and BeautifulSoup, but couldn't make it work:
import scrapy
from scrapy.crawler import CrawlerProcess

class YellowpagesSpider(scrapy.Spider):
    name = "yellowpages"
    start_urls = ["https://www.yellowpages.com/search?search_terms=Coffee+Shops&geo_location_terms=San+Francisco%2C+CA"]

    def parse(self, response):
        for items in response.css("div.v-card .info"):
            name = items.css("a.business-name > span::text").get()
            phone = items.css("div.phones::text").get()
            email_link = response.urljoin(items.css("a.business-name::attr(href)").get())
            # CAN'T APPLY THE LOGIC IN THE FOLLOWING LINE
            email = self.get_email(email_link)
            yield {"Name": name, "Phone": phone, "Email": email}

    def get_email(self, link):
        email = response.css("a.email-business[href^='mailto:']::attr(href)").get()
        return email

if __name__ == "__main__":
    c = CrawlerProcess({
        'USER_AGENT': 'Mozilla/5.0',
    })
    c.crawl(YellowpagesSpider)
    c.start()
How can I make my second script work the way the first one does?
I would use response.meta, but since it needs to be avoided, okay, let's try it another way: check out the library https://pypi.org/project/scrapy-inline-requests/
import scrapy
from inline_requests import inline_requests

class YellowpagesSpider(scrapy.Spider):
    name = "yellowpages"
    start_urls = ["https://www.yellowpages.com/search?search_terms=Coffee+Shops&geo_location_terms=San+Francisco%2C+CA"]

    @inline_requests
    def parse(self, response):
        for items in response.css("div.v-card .info"):
            name = items.css("a.business-name > span::text").get()
            phone = items.css("div.phones::text").get()
            email_url = items.css("a.business-name::attr(href)").get()
            # Yielding the Request here suspends parse() until the detail
            # page is downloaded, so no meta is needed to carry state
            email_resp = yield scrapy.Request(response.urljoin(email_url), meta={'handle_httpstatus_all': True})
            email = email_resp.css("a.email-business[href^='mailto:']::attr(href)").get() if email_resp.status == 200 else None
            yield {"Name": name, "Phone": phone, "Email": email}