Extracting the contents of multiple span tags with BeautifulSoup

daOnlyBG

I'm trying to extract the string contents of several span tags. Here is a snippet of the HTML page:

<div class="secondary-attributes">
    <span class="neighborhood-str-list">
        Southeast
    </span>
    <address>
        1234 Python Blvd S<br>Somewhere, NV 98765
    </address>
    <span class="biz-phone">
        (555) 123-4567
    </span>
</div>

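Specifically, I'm trying to extract the phone number that sits between the <span class="biz-phone"> and </span> tags. I tried the following code: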

import requests
from bs4 import BeautifulSoup

res = requests.get(url)
soup = BeautifulSoup(res.text, "html.parser")

phone_number_results = [phone_numbers for phone_numbers in soup.find_all('span','biz-phone')]

The code runs without any syntax errors, but the result isn't quite what I expected:

['<span class="biz-phone">\n        (702) 476-5050\n    </span>',
 '<span class="biz-phone">\n        (702) 253-7296\n    </span>',
 '<span class="biz-phone">\n        (702) 385-7912\n    </span>',
 '<span class="biz-phone">\n        (702) 776-7061\n    </span>',
 '<span class="biz-phone">\n        (702) 221-7296\n    </span>',
 '<span class="biz-phone">\n        (702) 252-7296\n    </span>',
 '<span class="biz-phone">\n        (702) 659-9101\n    </span>',
 '<span class="biz-phone">\n        (702) 355-9445\n    </span>',
 '<span class="biz-phone">\n        (702) 396-3333\n    </span>',
 '<span class="biz-phone">\n        (702) 643-9851\n    </span>',
 '<span class="biz-phone">\n        (702) 222-1441\n    </span>']

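My question has two parts:

Why do the span tags show up in the output when I run the program?
How do I get rid of them? I could edit the strings by hand, but then I feel I wouldn't be making full use of the BeautifulSoup package. Is there a more elegant way?

Note: the page contains many more HTML snippets like the one shown above, so there are more instances of the (555) 123-4567 markup (that is, more phone numbers) to extract, which is why I'm using find_all().

Thanks in advance.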

dmcc

find_all() returns a list of tags (bs4.element.Tag), not strings.
As @furas pointed out, you want to access the text attribute of each tag to extract the text inside it:

phone_number_results = [phone_numbers.text.strip() for phone_numbers in soup.find_all('span', 'biz-phone')]
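
To try this without fetching a page, here is a minimal, self-contained sketch. The SAMPLE_HTML string and the second phone number are made up for illustration (they are not from the real page), and the variable names are my own. It checks that find_all() hands back Tag objects, then pulls the text out in two ways: .text.strip() as above, and get_text(strip=True), which should give the same result when each span holds a single piece of text.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical sample markup modelled on the snippet in the question;
# the second phone number is invented for this example.
SAMPLE_HTML = """
<div class="secondary-attributes">
    <span class="neighborhood-str-list">
        Southeast
    </span>
    <span class="biz-phone">
        (555) 123-4567
    </span>
</div>
<div class="secondary-attributes">
    <span class="biz-phone">
        (555) 987-6543
    </span>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# A string passed as the second argument is matched against the CSS class.
tags = soup.find_all("span", "biz-phone")

# Each result is a Tag object, which is why printing it shows the markup.
all_tags = all(isinstance(tag, Tag) for tag in tags)

# Option 1: take the tag's text and strip the surrounding whitespace.
phone_numbers = [tag.text.strip() for tag in tags]

# Option 2: let get_text() do the stripping.
phone_numbers_alt = [tag.get_text(strip=True) for tag in tags]

print(all_tags)
print(phone_numbers)
print(phone_numbers_alt)
```

Either option avoids manual string editing; which one reads better is mostly a matter of taste.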