我有一些通过Wikipedia上的节目或电影的演员表清单的代码。刮除所有演员的名字并将其存储。我在当前代码中找到<a>
列表中的所有代码并存储其标题标签。目前情况如下:
from bs4 import BeautifulSoup
URL = input()
website_url = requests.get(URL).text
section = soup.find('span', id='Cast').parent
Stars = []
for x in section.find_next('ul').find_all('a'):
title = x.get('title')
print (title)
if title is not None:
Stars.append(title)
else:
continue
尽管这部分起作用,但有两个缺点:
['Harrison Ford', 'Indiana Jones (character)', 'Bullwhip', 'Cate Blanchett', 'Irina Spalko', 'Bob cut', 'Rosa Klebb', 'From Russia with Love (film)', 'Karen Allen', 'Marion Ravenwood', 'Ray Winstone', 'Sallah', 'List of characters in the Indiana Jones series', 'Sexy Beast', 'Hamstring', 'Double agent', 'John Hurt', 'Ben Gunn (Treasure Island)', 'Treasure Island', 'Courier', 'Jim Broadbent', 'Marcus Brody', 'Denholm Elliott', 'Shia LaBeouf', 'List of Indiana Jones characters', 'The Young Indiana Jones Chronicles', 'Frank Darabont', 'The Lost World: Jurassic Park', 'Jeff Nathanson', 'Marlon Brando', 'The Wild One', 'Holes (film)', 'Blackboard Jungle', 'Rebel Without a Cause', 'Switchblade', 'American Graffiti', 'Rotator cuff']
有没有办法让BeautifulSoup刮擦每个单词之后的前两个单词<li>
?甚至是我尝试做的更好的解决方案?
您可以使用css选择器仅获取<a>
a中的第一个<li>
:
for x in section.find_next('ul').select('li > a:nth-of-type(1)'):
例
from bs4 import BeautifulSoup
URL = 'https://en.wikipedia.org/wiki/Indiana_Jones_and_the_Kingdom_of_the_Crystal_Skull#Cast'
website_url = requests.get(URL).text
soup = BeautifulSoup(website_url,'lxml')
section = soup.find('span', id='Cast').parent
Stars = []
for x in section.find_next('ul').select('li > a:nth-of-type(1)'):
Stars.append(x.get('title'))
Stars
输出量
['Harrison Ford',
'Cate Blanchett',
'Karen Allen',
'Ray Winstone',
'John Hurt',
'Jim Broadbent',
'Shia LaBeouf']
本文收集自互联网,转载请注明来源。
如有侵权,请联系[email protected] 删除。
我来说两句