How would I select all the titles on this page?
http://bulletin.columbia.edu/columbia-college/departments-instruction/african-american-studies/#coursestext
For example, I'm trying to get every line that looks like this:
AFAS C1001 Introduction to African-American Studies. 3 points.
main_page iterates over all the school's course pages, starting from here, so that I can grab all the titles like the one above:
http://bulletin.columbia.edu/columbia-college/departments-instruction/
for page in main_page:
    sub_abbrev = page.find("div", {"class": "courseblock"})
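I have this code, but I can't figure out exactly how to select all of the ('strong') tags of the first child. I'm using the latest Python and Beautiful Soup 4 for web scraping. LMK if anything else is needed. Thanks.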
Iterate over the elements with the courseblock class, then, for each course, get the element with the courseblocktitle class. A working example using the select() and select_one() methods:
import requests
from bs4 import BeautifulSoup
url = "http://bulletin.columbia.edu/columbia-college/departments-instruction/african-american-studies/#coursestext"
response = requests.get(url)
soup = BeautifulSoup(response.content, "html.parser")
for course in soup.select(".courseblock"):
    title = course.select_one("p.courseblocktitle").get_text(strip=True)
    print(title)
Prints:
AFAS C1001 Introduction to African-American Studies.3 points.
AFAS W3030 African-American Music.3 points.
AFAS C3930 (Section 3) Topics in the Black Experience: Concepts of Race and Racism.4 points.
AFAS C3936 Black Intellectuals Seminar.4 points.
AFAS W4031 Protest Music and Popular Culture.3 points.
AFAS W4032 Image and Identity in Contemporary Advertising.4 points.
AFAS W4035 Criminal Justice and the Carceral State in the 20th Century United States.4 points.
AFAS W4037 (Section 1) Third World Studies.4 points.
AFAS W4039 Afro-Latin America.4 points.
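For comparison with the find() call in the question: find() returns only the first matching element, while find_all() returns every match, so it can play the same role as select() in the loop above. The following is a minimal offline sketch. The inline markup in SAMPLE_HTML and the helper names first_title and all_titles are assumptions made for illustration and are not taken from the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup (an assumption, not the real page source):
# two course blocks, each containing a title paragraph.
SAMPLE_HTML = """
<div class="courseblock"><p class="courseblocktitle"><strong>AFAS C1001 Introduction to African-American Studies.</strong></p></div>
<div class="courseblock"><p class="courseblocktitle"><strong>AFAS W3030 African-American Music.</strong></p></div>
"""


def first_title(html):
    """Return the title of the first course block only, as find() does."""
    soup = BeautifulSoup(html, "html.parser")
    block = soup.find("div", {"class": "courseblock"})  # first match only
    return block.find("p", class_="courseblocktitle").get_text(strip=True)


def all_titles(html):
    """Return the titles of every course block, using find_all()."""
    soup = BeautifulSoup(html, "html.parser")
    return [
        block.find("p", class_="courseblocktitle").get_text(strip=True)
        for block in soup.find_all("div", class_="courseblock")
    ]


if __name__ == "__main__":
    print(first_title(SAMPLE_HTML))
    for title in all_titles(SAMPLE_HTML):
        print(title)
```

Whether to use find_all() or select() here is mostly a matter of taste. select() takes a CSS selector, which tends to be more compact for nested lookups.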
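A good follow-up question from @double_j:
In the OP's example, he has a space between the two parts (before "3 points"). How would you keep that? That's how the data is displayed on the site, even though it's not actually in the source code.
I thought about using the separator parameter of the get_text() method, but that would also add an extra space before the last dot. Instead, I would join the texts of the strong elements via str.join():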
for course in soup.select(".courseblock"):
    title = " ".join(strong.get_text() for strong in course.select("p.courseblocktitle > strong"))
    print(title)
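The difference between the three approaches can be reproduced offline. The sketch below uses hypothetical markup in which the final dot is a separate text node inside the second strong element. That structure is an assumption chosen to show the effect and may not match the real page source. The helper name title_variants is also made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup (an assumption, not the real page source):
# the trailing dot is its own text node inside the second <strong>.
SAMPLE_HTML = (
    '<div class="courseblock"><p class="courseblocktitle">'
    "<strong>AFAS C1001 Introduction to African-American Studies.</strong>"
    "<strong><span>3 points</span>.</strong>"
    "</p></div>"
)


def title_variants(html):
    """Return the title built three ways: stripped, separator, joined."""
    soup = BeautifulSoup(html, "html.parser")
    course = soup.select_one(".courseblock")
    paragraph = course.select_one("p.courseblocktitle")

    # 1. strip=True glues the text nodes together with no space.
    stripped = paragraph.get_text(strip=True)
    # 2. A separator is placed between every text node, including
    #    before the trailing dot.
    separated = paragraph.get_text(" ", strip=True)
    # 3. Joining the text of the direct <strong> children inserts
    #    exactly one space, between the two <strong> elements.
    joined = " ".join(
        strong.get_text()
        for strong in course.select("p.courseblocktitle > strong")
    )
    return stripped, separated, joined


if __name__ == "__main__":
    for variant in title_variants(SAMPLE_HTML):
        print(repr(variant))
```

With this assumed markup, only the joined variant reproduces the spacing shown on the site: one space before "3 points" and none before the final dot.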