Web scraping an XML page in Python?

Posted by debugcn in Dev

骑单车的人

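I'm confused about how to scrape all the links (only those containing the string "mp3") from a given XML page. The following code returns only an empty list: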

# Import required modules 
from lxml import html 
import requests 
  
# Request the page 
page = requests.get('https://feeds.megaphone.fm/darknetdiaries') 
  
# Parsing the page 
# (We need to use page.content rather than  
# page.text because html.fromstring implicitly 
# expects bytes as input.) 
tree = html.fromstring(page.content)   
  
# Get element using XPath 
buyers = tree.xpath('//enclosure[@url="mp3"]/text()') 
print(buyers)

Am I using @url incorrectly?

The links I'm looking for:

Any help would be greatly appreciated!

对冲猪

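What went wrong?

The XPath below does not work, as you mentioned. Both the @url test and text() are at fault: @url="mp3" matches only when the attribute is exactly the string mp3, and text() selects text inside the element, whereas the link is stored in the url attribute.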

//enclosure[@url="mp3"]/text()
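To see why that expression comes back empty, here is a minimal sketch run against a hypothetical two-item feed. The sample XML and its example.com URLs are made up for illustration and are not taken from the real feed; it also parses with lxml's etree rather than html, which is an assumption of this sketch.

```python
from lxml import etree

# Hypothetical sample feed; the URLs are placeholders, not real episodes.
SAMPLE = b"""<rss version="2.0">
  <channel>
    <item>
      <title>Episode 1</title>
      <enclosure url="https://example.com/ep1.mp3" length="0" type="audio/mpeg"/>
    </item>
    <item>
      <title>Episode 2</title>
      <enclosure url="https://example.com/ep2.mp3?updated=1" length="0" type="audio/mpeg"/>
    </item>
  </channel>
</rss>"""

root = etree.fromstring(SAMPLE)

# Exact equality: no url attribute is literally the string "mp3".
exact_match = root.xpath('//enclosure[@url="mp3"]')

# text(): the enclosure elements are empty, so they have no text nodes.
text_nodes = root.xpath('//enclosure/text()')

# The elements themselves are found once the predicate is dropped.
all_enclosures = root.xpath('//enclosure')

print(exact_match, text_nodes, len(all_enclosures))
```

Both queries return an empty list even though the enclosure elements are present, which matches the empty brackets described in the question.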

Solution

The url attribute of each //enclosure element should contain mp3, and the expression should then return /@url.

Change this line:

buyers = tree.xpath('//enclosure[@url="mp3"]/text()')

to

buyers = tree.xpath('//enclosure[contains(@url,"mp3")]/@url')

Output

['https://www.podtrac.com/pts/redirect.mp3/traffic.megaphone.fm/ADV9231072845.mp3?updated=1610644901',
 'https://www.podtrac.com/pts/redirect.mp3/traffic.megaphone.fm/ADV2643452814.mp3?updated=1609788944',
 'https://www.podtrac.com/pts/redirect.mp3/traffic.megaphone.fm/ADV5381316822.mp3?updated=1607279433',
 'https://www.podtrac.com/pts/redirect.mp3/traffic.megaphone.fm/ADV9145504181.mp3?updated=1607280708',
 'https://www.podtrac.com/pts/redirect.mp3/traffic.megaphone.fm/ADV4345070838.mp3?updated=1606110384',
 'https://www.podtrac.com/pts/redirect.mp3/traffic.megaphone.fm/ADV8112097820.mp3?updated=1604866665',
 'https://www.podtrac.com/pts/redirect.mp3/traffic.megaphone.fm/ADV2164178070.mp3?updated=1603781321',
 'https://www.podtrac.com/pts/redirect.mp3/traffic.megaphone.fm/ADV1107638673.mp3?updated=1610220449',
...]
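The corrected expression can also be wrapped in a small helper and checked offline. This is only a sketch: the function name mp3_links and the sample feed (with example.com URLs) are hypothetical, and it parses the feed as XML with lxml's etree instead of the html module used in the question.

```python
from lxml import etree


def mp3_links(feed_bytes):
    """Return the url attribute of every <enclosure> whose url contains 'mp3'."""
    root = etree.fromstring(feed_bytes)
    matches = root.xpath('//enclosure[contains(@url, "mp3")]/@url')
    # Convert lxml's string results to plain Python strings.
    return [str(url) for url in matches]


# Hypothetical sample feed; the second enclosure is not an mp3 link.
SAMPLE = b"""<rss version="2.0">
  <channel>
    <item><enclosure url="https://example.com/ep1.mp3" type="audio/mpeg"/></item>
    <item><enclosure url="https://example.com/ep2.ogg" type="audio/ogg"/></item>
    <item><enclosure url="https://example.com/ep3.mp3?updated=1" type="audio/mpeg"/></item>
  </channel>
</rss>"""

links = mp3_links(SAMPLE)
print(links)
```

For the real feed, the same helper could presumably be given the bytes from requests.get('https://feeds.megaphone.fm/darknetdiaries').content. Note that contains() matches "mp3" anywhere in the URL, not only at the end.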
