How do I find a particular word in an HTML page using Beautiful Soup in Python?

Posted by Kanika Singh in Dev

Kanika Singh

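I want to use Beautiful Soup to find how many times a certain word appears in the HTML text of a web page. I tried the findAll function, but it only finds the word within a specific tag; for example, soup.body.findAll finds the word inside the body tag. I want it to search for the word inside all tags in the HTML text. Also, once the word is found, I need to create a list of the words that come before and after it. Can someone help me with how to do this? Thanks.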

Ritave

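According to the latest BeautifulSoup 4 API, you can use the recursive keyword (which is enabled by default) to search for text in the whole tree. The search gives you the matching strings, which you can then process and split into words.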

Here is a complete example:

import bs4
import re

data = '''
<html>
<body>
<div>today is a sunny day</div>
<div>I love when it's sunny outside</div>
Call me sunny
<div>sunny is a cool word sunny</div>
</body>
</html>
'''

searched_word = 'sunny'

soup = bs4.BeautifulSoup(data, 'html.parser')
results = soup.body.find_all(string=re.compile('.*{0}.*'.format(searched_word)), recursive=True)

# len(results) counts the text nodes that contain the word, not every occurrence
print('Found the word "{0}" {1} times\n'.format(searched_word, len(results)))

for content in results:
    words = content.split()
    for index, word in enumerate(words):
        # If the content contains the search word twice or more, this fires for each occurrence
        if word == searched_word:
            print('Whole content: "{0}"'.format(content))
            before = None
            after = None
            # Only look back if it's not the first word
            if index != 0:
                before = words[index-1]
            # Only look ahead if it's not the last word
            if index != len(words)-1:
                after = words[index+1]
            print('\tWord before: "{0}", word after: "{1}"'.format(before, after))

It outputs:

Found the word "sunny" 4 times

Whole content: "today is a sunny day"
    Word before: "a", word after: "day"
Whole content: "I love when it's sunny outside"
    Word before: "it's", word after: "outside"
Whole content: "
Call me sunny
"
    Word before: "me", word after: "None"
Whole content: "sunny is a cool word sunny"
    Word before: "None", word after: "is"
Whole content: "sunny is a cool word sunny"
    Word before: "word", word after: "None"
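
Note that the count above is the number of text nodes containing the word, while the word itself appears five times in the sample. If you want the total number of occurrences plus the two word lists the question asks for, something along these lines may work. This is only a sketch: the helper name word_contexts is made up for illustration, matching is done on whitespace-separated words (so "sunny." with trailing punctuation would not match), and neighbours are only looked up within the same text node.

```python
import re

import bs4

html = '''
<html>
<body>
<div>today is a sunny day</div>
<div>I love when it's sunny outside</div>
Call me sunny
<div>sunny is a cool word sunny</div>
</body>
</html>
'''


def word_contexts(markup, searched_word):
    """Hypothetical helper: return (count, before_words, after_words)."""
    soup = bs4.BeautifulSoup(markup, 'html.parser')
    before_words, after_words = [], []
    count = 0
    # string=True matches every text node, whichever tag it sits in
    for text in soup.find_all(string=True):
        words = text.split()
        for index, word in enumerate(words):
            if word != searched_word:
                continue
            count += 1
            # Record neighbours only when they exist in this text node
            if index > 0:
                before_words.append(words[index - 1])
            if index < len(words) - 1:
                after_words.append(words[index + 1])
    return count, before_words, after_words


count, before_words, after_words = word_contexts(html, 'sunny')
print(count)         # 5
print(before_words)  # ['a', "it's", 'me', 'word']
print(after_words)   # ['day', 'outside', 'is']
```

Calling find_all on the soup object itself, rather than on soup.body, is what makes the search cover the whole document.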

See also the Beautiful Soup documentation for the string keyword argument.
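
As a small illustration of that keyword: besides a compiled regular expression, string also accepts a function that is called with each text node and returns True for the ones to keep. The sketch below assumes a made-up HTML snippet and a hypothetical filter named contains_word; it matches whole words only, so "sunnyside" is skipped.

```python
import bs4

# Made-up sample markup, for illustration only
html = '<body><div>a sunny day</div><p>cloudy</p>stay sunny<span>sunnyside up</span></body>'
soup = bs4.BeautifulSoup(html, 'html.parser')


def contains_word(text, word='sunny'):
    """Hypothetical filter: keep text nodes that contain `word` as a whole word."""
    return word in text.split()


# find_all calls contains_word once per text node in the tree
matches = soup.find_all(string=contains_word)
print([str(m) for m in matches])  # ['a sunny day', 'stay sunny']
```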
