How do I get a specific font-size with BeautifulSoup in Python while ignoring other font sizes?


Dalton

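I'm currently scraping a website and need to get certain font sizes: given style="font-size: 140%", I want to get 140%, or preferably just 140, so that I can use it in some calculations, since each element will have a different font size.

More specifically, I want to get the font size from tags like these...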

<div style="font-size: 141%; line-height: 110%"><a href="engenremap-latinrock.html" style="color: #AF7E1C">latin rock</a></div>
<div style="font-size: 139%; line-height: 110%"><a href="engenremap-mexicanindie.html" style="color: #B18230">mexican indie</a></div>
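
As a minimal sketch of the extraction being asked for, the number can be pulled out of an inline style string with a regular expression. The helper name `parse_font_size` is hypothetical, and the pattern assumes the size is a whole number followed by `%` or `px`, as in the snippets in this question.

```python
import re

# Captures the integer after "font-size:" and, if present, a % or px unit.
FONT_SIZE = re.compile(r"font-size:\s*(\d+)\s*(%|px)?")

def parse_font_size(style):
    """Return (value, unit) for the font-size declaration, or None if absent."""
    match = FONT_SIZE.search(style)
    if match is None:
        return None
    return int(match.group(1)), match.group(2)

print(parse_font_size("font-size: 141%; line-height: 110%"))  # (141, '%')
print(parse_font_size("font-size: 20px; line-height: 24px"))  # (20, 'px')
print(parse_font_size("color: #AF7E1C"))                       # None
```

Returning the unit alongside the value is one way to tell the two kinds of element apart: in the markup shown here the genre entries use percentages, while the table cells use pixels.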

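Normally I could do this without any problem. The issue is that the elements I want to scrape have no distinguishing tags, and they are preceded by a bunch of rows that also carry font-size styles, like this...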

<tr valign=top class="datarow firstrow" style="white-space: nowrap"><td align=right class=note  style="font-size: 20px; line-height: 24px">1</td><td  style="font-size: 20px; line-height: 24px"> <a href="spotify:playlist:0DsV1U8e3xXsmsDSaW88XT" class=note target=spotify title="See this playlist in Spotify.">&#x260A;</a></td><td class=note  style="font-size: 20px; line-height: 24px"><a href="?scope=MX&vector=activity" title="Show only schools from Mexico." style="color: #BA890D">Mexico</a></td><td class=note  style="font-size: 20px; line-height: 24px"><a href="?root=Universidad%20Nacional%20Aut%C3%B3noma%20De%20M%C3%A9xico%20%28UNAM%29&scope=all" title="Re-sort the list by similarity to Universidad Nacional Autónoma De México (UNAM)." style="color: #BA890D">Universidad Nacional Autónoma De México (UNAM)</a></td></tr>
<tr valign=top class="datarow " style="white-space: nowrap"><td align=right class=note  style="font-size: 20px; line-height: 24px">2</td><td  style="font-size: 20px; line-height: 24px"> <a href="spotify:playlist:5QAomgXhxwYjg975DWtTTv" class=note target=spotify title="See this playlist in Spotify.">&#x260A;</a></td><td class=note  style="font-size: 20px; line-height: 24px"><a href="?scope=US&vector=activity" title="Show only schools from USA." style="color: #948F04">USA</a></td><td class=note  style="font-size: 20px; line-height: 24px"><a href="?root=Texas%20A%20%26%20M%20University-College%20Station&scope=all" title="Re-sort the list by similarity to Texas A & M University-College Station." style="color: #948F04">Texas A &amp; M University-College Station</a></td></tr>

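Keep in mind that I have already iterated through the links in the previous snippet (they are static and stay in the same place as I iterate), while the tags in the first snippet (the school's genres) change for each link (school). How can I ignore the font sizes of the tr rows and grab the font sizes only from the first HTML snippet? I'm sure there is a simple solution, but I would appreciate any help.

**I have already iterated through the links and collected the corresponding genres for each school; I just need to get the font sizes of those particular genres.**

Here is some of my code for more context...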

import re
import time
import urllib.request

import requests
from bs4 import BeautifulSoup

data = []   # used to sort between country and university <td> tags
links = []  # stores links from clicking on the university name and used to get genres

countries = []          #
universities = []       # indices match for these lists
spotifyLinks = []       #
fontSizes = []          #
genres = [[]]           #
genres_weight = [[]]    #

page = requests.get("http://everynoise.com/everyschool.cgi")    # stores response from Every Noise into page
soup = BeautifulSoup(page.content, 'html.parser')   # used to create data list
soup1 = BeautifulSoup(page.content, 'html.parser')  # used to create links and spotifyLinks lists

soupList = list(soup.find_all('td', class_="note")) # creates list of <td> tags where class="note"

for td in soupList:                     #
    if not td.get_text().isnumeric():   # stores all country and university names in data list
        data.append(td.get_text())      #

for i in range(len(data)):              #
    if i % 2 == 0:                      # separates data list into two individual lists
        countries.append(data[i])       # for country and university names respectively
    else:                               #
        universities.append(data[i])    #

for a in soup1.find_all('a', attrs={'href': re.compile(r"\?root=")}):
    links.append('http://everynoise.com/everyschool.cgi' + a['href'])

for a in soup1.find_all('a', attrs={'href': re.compile(r"spotify:playlist:")}):
    # [17:] strips the leading "spotify:playlist:" prefix (17 characters)
    spotifyLinks.append('https://open.spotify.com/playlist/' + a['href'][17:])

spotifyLinks = spotifyLinks[:-1]

linkSubset = links[0:4] # subset of links for quicker testing
j = 1

for link in linkSubset: # switch out linkSubset with links for full dataset
    time.sleep(1)   # so we don't spam their servers
    schoolGenres = []
    nextPage = urllib.request.urlopen(link)
    bs_obj = BeautifulSoup(nextPage, "html.parser")

    for a in bs_obj.find_all('a', attrs={'href': re.compile(r"^engenremap-")}):
        schoolGenres.append(a.get_text())

    genres.append(schoolGenres)
    print("Scraping...", j)
    j = j + 1

genres = genres[1:] # drops the empty placeholder list created at initialization
distinct_genres = set()

for genre in genres:
    distinct_genres.update(genre)

print("\nDistinct Genres:", distinct_genres)

Edit/Answer: I ended up solving this with a slightly modified version of the accepted answer.

pattern = re.compile(r'font-size: (\d+)')

for a in bs_obj.select('div[style*="font-size"]'):
    genreWeights.append(int(pattern.search(str(a)).group(1)))
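
The snippet above runs the regular expression over the whole serialized tag. A slightly more defensive sketch, under the assumption that the genre entries are always `div` elements with an inline style, reads only the `style` attribute and skips any `div` where no number is found. The sample markup and the function name `div_font_sizes` are made up for illustration.

```python
import re
from bs4 import BeautifulSoup

# Sample markup: one table cell (px) followed by two genre <div>s (%).
html = """
<table><tr><td class="note" style="font-size: 20px; line-height: 24px">1</td></tr></table>
<div style="font-size: 141%; line-height: 110%"><a href="engenremap-latinrock.html">latin rock</a></div>
<div style="font-size: 139%; line-height: 110%"><a href="engenremap-mexicanindie.html">mexican indie</a></div>
"""

pattern = re.compile(r"font-size:\s*(\d+)")

def div_font_sizes(soup):
    """Collect font-size numbers from <div> tags only; <td> cells never match."""
    sizes = []
    for div in soup.select('div[style*="font-size"]'):
        match = pattern.search(div["style"])  # look at the style attribute only
        if match:
            sizes.append(int(match.group(1)))
    return sizes

sizes = div_font_sizes(BeautifulSoup(html, "html.parser"))
print(sizes)  # [141, 139]
```

Because the selector names the `div` tag explicitly, the 20px value on the `td` cell is never considered.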

Mendel

You can search for tags that contain certain text and then extract the font-size value. For example:

import re
from bs4 import BeautifulSoup

txt = """<div style="font-size: 141%; line-height: 110%"><a href="engenremap-latinrock.html" style="color: #AF7E1C">latin rock</a></div>
<div style="font-size: 139%; line-height: 110%"><a href="engenremap-mexicanindie.html" style="color: #B18230">mexican indie</a></div>
<div style="font-size: 139%; line-height: 110%"><a href="engenremap-mexicanindie.html" style="color: #B18230">hello world</a></div>"""

soup = BeautifulSoup(txt, "html.parser")

pattern = re.compile(r'font-size: (\d+)')
# :-soup-contains() replaces the deprecated :contains() in soupsieve 2.1+;
# on older versions use "div:contains(latin, mexican)" instead
for tag in soup.select('div:-soup-contains("latin", "mexican")'):
    font_size = pattern.search(str(tag)).group(1)
    print(font_size)

Output:

141
139
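
The example above matches on the genre words themselves, which have to be known in advance. Since the question's code already finds genre links by an `href` that starts with `engenremap-`, one possible alternative is to start from those links and read the style of the enclosing `div`. This is only a sketch: the function name `genre_weights` and the sample markup are assumptions, and it presumes each genre link sits directly inside its styled `div`.

```python
import re
from bs4 import BeautifulSoup

html = """
<table><tr><td class="note" style="font-size: 20px; line-height: 24px">
<a href="?scope=MX&vector=activity">Mexico</a></td></tr></table>
<div style="font-size: 141%; line-height: 110%"><a href="engenremap-latinrock.html">latin rock</a></div>
<div style="font-size: 139%; line-height: 110%"><a href="engenremap-mexicanindie.html">mexican indie</a></div>
"""

FONT_SIZE = re.compile(r"font-size:\s*(\d+)")

def genre_weights(soup):
    """Map each genre name to the font-size number of its enclosing <div>."""
    weights = {}
    for link in soup.select('a[href^="engenremap-"]'):
        div = link.find_parent("div")
        if div is None:
            continue  # link is not wrapped in a <div>; nothing to read
        match = FONT_SIZE.search(div.get("style", ""))
        if match:
            weights[link.get_text()] = int(match.group(1))
    return weights

weights = genre_weights(BeautifulSoup(html, "html.parser"))
print(weights)  # {'latin rock': 141, 'mexican indie': 139}
```

Keeping the name and the weight together in one dictionary avoids relying on two parallel lists staying in the same order.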


Edited on 2021-04-05
