Parsing "a" tags by attribute with Python and BeautifulSoup

Posted by Michael T in Dev

Given this HTML:

    <td align="left">
     <a class="playerLink" href="http://bbroto.baseball.cbssports.com/players/playerpage/2000032">
      Russell, Addison
     </a>
     SS OAK  - Won at $0
     <br>
      <a class="playerLink" href="http://bbroto.baseball.cbssports.com/players/playerpage/556425">
       Vargas, Jason
      </a>
      SP LAA
      <span title="Angels interested in bringing back Jason Vargas">
       <a class="playerLink" href="http://bbroto.baseball.cbssports.com/players/playerpage/556425" subtab="Update">
        <img border="0" height="10" src="http://sports.cbsimg.net/images/news-note-recent.gif" width="10"/>
       </a>
      </span>
      - Dropped
     </br>
    </td>

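I only want to display the blocks that do not have subtab="Update". But I can't figure out how to reference subtab inside a Python loop with BeautifulSoup. Here is what I tried: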

        soup = BeautifulSoup(html)
        pl = soup.findAll('a',{'class': 'playerLink'})
        for a in pl:
            if a.subtab == "Update":
                print "UPDATE"
            else:
                print "Player Name: " + a.text

I also tried referencing the subtype in the findAll call:

        pl = soup.findAll('a',{'class': 'playerLink'}, {'subtype':0})

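Neither approach works. My problem is that the class is "playerLink" in every case, so the subtype is the only way I can tell the tags apart. I'm new to BS, so I'm not very good at working with tags and attributes. The second example might work if I only wanted subtype = Update, but I want every tag where the subtype does not exist.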

a.attrs returns the attributes of an <a> tag as a dictionary. You can check that an <a> tag has no subtab attribute with 'subtab' not in a.attrs:

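    from bs4 import BeautifulSoup, SoupStrainer  # pip install beautifulsoup4

    # html is assumed to hold the markup shown in the question.
    # Parse only <a class="playerLink"> tags; the parser is named explicitly
    # so the result does not depend on which parsers are installed.
    player_links = SoupStrainer('a', 'playerLink')
    soup = BeautifulSoup(html, 'html.parser', parse_only=player_links)
    # Keep the text of every link that has no subtab attribute.
    names = [a.get_text().strip()
             for a in soup.find_all(player_links) if 'subtab' not in a.attrs]
    print(names)
    # -> ['Russell, Addison', 'Vargas, Jason']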
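
For the loop from the question, a minimal sketch of the same check follows. It assumes a trimmed copy of the HTML above (the <img> is dropped), and the helper name label is hypothetical, used only for illustration. As far as I understand it, a.subtab does not read an attribute: attribute-style access on a tag looks for a child tag with that name, so it gives None here, while a.get("subtab") reads the attribute itself.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the HTML from the question (the <img> is omitted).
html = """
<td align="left">
 <a class="playerLink" href="http://bbroto.baseball.cbssports.com/players/playerpage/2000032">
  Russell, Addison
 </a>
 SS OAK - Won at $0
 <br>
 <a class="playerLink" href="http://bbroto.baseball.cbssports.com/players/playerpage/556425">
  Vargas, Jason
 </a>
 SP LAA
 <span title="Angels interested in bringing back Jason Vargas">
  <a class="playerLink" href="http://bbroto.baseball.cbssports.com/players/playerpage/556425" subtab="Update"></a>
 </span>
 - Dropped
</td>
"""

soup = BeautifulSoup(html, "html.parser")


def label(a):
    """Hypothetical helper: describe one playerLink tag."""
    # a.get("subtab") returns the attribute value, or None when it is absent.
    if a.get("subtab") == "Update":
        return "UPDATE"
    return "Player Name: " + a.get_text(strip=True)


lines = [label(a) for a in soup.find_all("a", class_="playerLink")]
print("\n".join(lines))
```

If only the players without an update link are wanted, the same loop can skip tags where a.has_attr("subtab") is true instead of printing "UPDATE".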

I can't find where the documentation mentions it, but it seems that specifying subtab=False also excludes any tag that has a subtab attribute:

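    from bs4 import BeautifulSoup, SoupStrainer  # pip install beautifulsoup4

    # html is assumed to hold the markup shown in the question.
    # subtab=False is meant to match only tags without a subtab attribute.
    player_links = SoupStrainer('a', 'playerLink', subtab=False)
    soup = BeautifulSoup(html, 'html.parser', parse_only=player_links)
    names = [a.get_text().strip()
             for a in soup.find_all(player_links)]
    print(names)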

If the tags you are looking for (player_links) are not nested, you can omit the .find_all(player_links) call:

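    from bs4 import BeautifulSoup, SoupStrainer  # pip install beautifulsoup4

    # html is assumed to hold the markup shown in the question.
    player_links = SoupStrainer('a', 'playerLink', subtab=False)
    soup = BeautifulSoup(html, 'html.parser', parse_only=player_links)
    # Iterating over the soup yields the top-level tags kept by the strainer.
    names = [a.get_text().strip() for a in soup]
    print(names)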
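
If relying on subtab=False feels too implicit, one alternative (not part of the answer above) is to pass a function to find_all and spell the condition out. This is only a sketch: is_plain_player_link is a hypothetical name, and the HTML is a shortened stand-in for the markup in the question, with placeholder href values.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the question's HTML; the href values are placeholders.
html = """
<td align="left">
 <a class="playerLink" href="#russell">Russell, Addison</a>
 SS OAK - Won at $0
 <br>
 <a class="playerLink" href="#vargas">Vargas, Jason</a>
 SP LAA
 <span title="Angels interested in bringing back Jason Vargas">
  <a class="playerLink" href="#vargas" subtab="Update"></a>
 </span>
 - Dropped
</td>
"""


def is_plain_player_link(tag):
    """Hypothetical filter: <a class="playerLink"> with no subtab attribute."""
    return (
        tag.name == "a"
        and "playerLink" in tag.get("class", [])
        and not tag.has_attr("subtab")
    )


soup = BeautifulSoup(html, "html.parser")
# find_all calls the function on every tag and keeps those returning True.
names = [a.get_text(strip=True) for a in soup.find_all(is_plain_player_link)]
print(names)
```

The function form is longer than the keyword filter, but the rule for what counts as a match is visible in one place and does not depend on how a False attribute value is interpreted.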
