How to replace HTML comments with a custom <comment> element

Posted by user3621633 in Dev

user3621633

I am using BeautifulSoup in Python to batch-convert a large number of HTML files to XML.

A sample HTML file looks like this:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<!-- this is an HTML comment -->
<!-- this is another HTML comment -->
<html xmlns="http://www.w3.org/1999/xhtml">
    <head>
        ...
        <!-- here is a comment inside the head tag -->
    </head>
    <body>
        ...
        <!-- Comment inside body tag -->
        <!-- Another comment inside body tag -->
        <!-- There could be many comments in each file and scattered, not just 1 in the head and three in the body. This is just a sample. -->
    </body>
</html>
<!-- This comment is the last line of the file -->

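I figured out how to find the doctype and replace it with a <doctype>...</doctype> tag, but the comments are giving me a lot of frustration. I want to replace each HTML comment with <comment>...</comment>. In this sample HTML, I was able to replace the first two HTML comments, but nothing inside the html tag gets replaced, and neither does the last comment after the closing html tag.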

Here is my code:

file = open ("sample.html", "r")
soup = BeautifulSoup(file, "xml")

for child in soup.children:

    # This takes care of the first two HTML comments
    if isinstance(child, bs4.Comment):
        child.replace_with("<comment>" + child.strip() + "</comment>")

    # This should find all nested HTML comments and replace.
    # It looks like it works but the changes are not finalized
    if isinstance(child, bs4.Tag):
        re.sub("(<!--)|(&lt;!--)", "<comment>", child.text, flags=re.MULTILINE)
        re.sub("(-->)|(--&gr;)", "</comment>", child.text, flags=re.MULTILINE)

# The HTML comments should have been replaced but nothing changed.
print (soup.prettify(formatter=None))

This is my first time using BeautifulSoup. How can I use BeautifulSoup to find all HTML comments and replace them with <comment> tags?

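Could I convert the document to a byte stream with pickle, apply the regular expressions to the serialized form, and then deserialize it back into a BeautifulSoup object? Would that work, or would it just cause more problems?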

I tried using pickle on the child tag object, but deserialization fails with TypeError: __new__() missing 1 required positional argument: 'name'.

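Then I tried pickling just the tag's text, child.text, but deserialization failed with AttributeError: can't set attribute. Basically, child.text is read-only, which explains why the regular expressions have no effect. So I have no idea how to modify the text.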

Zero Piraeus

You have a few problems:

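First, you cannot modify child.text. It is a read-only property that just calls get_text() behind the scenes, and its result is a brand-new string that is not connected to your document.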

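Second, re.sub() does not modify anything in place. Your line

re.sub("(<!--)|(&lt;!--)", "<comment>", child.text, flags=re.MULTILINE)

would have had to be

child.text = re.sub("(<!--)|(&lt;!--)", "<comment>", child.text, flags=re.MULTILINE)

...but because child.text is read-only (the first problem), that would not work anyway.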

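Third, trying to modify the document by replacing blocks of text in it with regular expressions is the wrong way to use BeautifulSoup. Instead, you need to find nodes and replace them with other nodes.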
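The first two problems can be reproduced on a tiny document. The following is only a minimal sketch: the one-paragraph markup, the variable names, and the choice of "html.parser" are assumptions made for illustration and are not taken from the question.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical one-element document, parsed with the stdlib-backed parser.
soup = BeautifulSoup("<p>old text</p>", "html.parser")
p = soup.p

# re.sub() returns a new string; calling it and dropping the result changes nothing.
re.sub("old", "new", p.text)
after_bare_call = str(soup)

# Keeping the result still only gives us a detached str, because p.text
# is built fresh by get_text() and is not connected to the tree.
detached = re.sub("old", "new", p.text)
after_local_edit = str(soup)

# Assigning back to .text fails, since the property has no setter.
try:
    p.text = detached
    assignment_failed = False
except AttributeError:
    assignment_failed = True

# Going through the node itself (.string is writable) does update the document.
p.string = detached
after_string_set = str(soup)

print(after_bare_call, after_local_edit, assignment_failed, after_string_set)
```

Only the last step, which edits the node rather than a copy of its text, changes what the document serializes to.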

Here is a working solution:

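import bs4

# No parser is named here, so Beautiful Soup picks the best one installed
# (and may warn); pass e.g. "html.parser" as a second argument to pin it.
with open("example.html") as f:
    soup = bs4.BeautifulSoup(f)

# Comment is a NavigableString subclass, so a text filter can match it.
for comment in soup.find_all(text=lambda e: isinstance(e, bs4.Comment)):
    tag = bs4.Tag(name="comment")   # new, empty <comment> element
    tag.string = comment.strip()    # comment text without surrounding whitespace
    comment.replace_with(tag)       # swap the comment node for the new tag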

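This code iterates over the result of a call to find_all(), taking advantage of the fact that we can pass a function as the text argument (named string in Beautiful Soup 4.4.0 and later, with text as the older spelling). In BeautifulSoup, Comment is a subclass of NavigableString, so we search for it as though it were a string, and the lambda ... is just shorthand for, e.g.,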

def is_comment(e):
    return isinstance(e, bs4.Comment)

soup.find_all(text=is_comment)
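As a rough check of this filter, the sketch below runs it on a small made-up snippet and compares it with a loop over soup.children. The markup, the variable names, and "html.parser" are assumptions for illustration; the call uses string=, the newer name for the text= argument. Because .children yields only direct children, this may also explain why the loop in the question reached only the leading comments.

```python
import bs4

def is_comment(e):
    # Comment subclasses NavigableString, so string filters can see it.
    return isinstance(e, bs4.Comment)

# Hypothetical sample: one top-level comment and one nested two levels down.
markup = "<!-- top --><div><p><!-- nested --></p></div>"
soup = bs4.BeautifulSoup(markup, "html.parser")

# .children yields only the direct children of the soup object.
top_level = [node for node in soup.children if is_comment(node)]

# find_all() walks every descendant, so nested comments are found too.
everywhere = soup.find_all(string=is_comment)

print(len(top_level), len(everywhere))
print([c.strip() for c in everywhere])
```

On this snippet the first list holds only the top-level comment, while find_all() returns both comments in document order.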

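Inside the loop, we then create a new Tag with the appropriate name, set its content to the stripped text of the original comment, and replace the comment with the tag we just created.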

Here is the result:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

<comment>this is an HTML comment</comment>
<comment>this is another HTML comment</comment>
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
        ...
        <comment>here is a comment inside the head tag</comment>
</head>
<body>
        ...
        <comment>Comment inside body tag</comment>
<comment>Another comment inside body tag</comment>
<comment>There could be many comments in each file and scattered, not just 1 in the head and three in the body. This is just a sample.</comment>
</body>
</html>
<comment>This comment is the last line of the file</comment>
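For experimenting without a file on disk, the same idea can be wrapped in a small function. This is a sketch under stated assumptions, not the answer's exact code: the helper name comments_to_tags and the sample markup are hypothetical, "html.parser" is chosen so the parser is explicit, and soup.new_tag() is used in place of constructing bs4.Tag directly.

```python
from bs4 import BeautifulSoup, Comment

def comments_to_tags(markup: str) -> str:
    """Return markup with every HTML comment replaced by a <comment> element."""
    soup = BeautifulSoup(markup, "html.parser")
    # find_all() returns a list, so replacing nodes while looping is safe.
    for comment in soup.find_all(string=lambda s: isinstance(s, Comment)):
        tag = soup.new_tag("comment")     # tag created by the soup's own factory
        tag.string = comment.strip()      # drop the padding around the comment text
        comment.replace_with(tag)
    return str(soup)

# Hypothetical sample with comments before, inside, and after an element.
sample = "<!-- top --><p>text<!-- inside --></p><!-- last -->"
result = comments_to_tags(sample)
print(result)
```

Returning a string keeps the helper easy to test; a batch conversion could call it once per file and write the result out.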

Edited on 2021-02-19