Removing relative lines using a regular expression

Posted by DSilvis in Dev

DSilvis

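Using pdftotext, I created a text file that includes the footers from the source PDF. The footers get in the way of other parsing that needs to be done. Each footer has the following format: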

This is important text.

9
Title 2012 and 2013

\fCompany
Important text begins again.

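The Company line is the only line that is not repeated elsewhere in the file. It appears as \x0cCompany\n. I want to search for this line and, based on where \x0cCompany\n occurs, delete it together with the three lines before it (the page number, the title, and the blank line). This is what I have so far: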

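import re

report = open('file.txt').readlines()
data = range(len(report))
name = []

for line_i in data:
    line = report[line_i]

    # '\\x0c' reaches the regex engine as \x0c, the form feed before "Company"
    if re.match('.*\\x0cCompany', line):
        name.append(line_i)  # record the line number of the footer line

print(name)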

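This lets me build a list of the line numbers where this line occurs, but I don't understand how to delete those lines along with the three lines before each of them. It seems I need to write another loop based on this one, but I can't get it to work.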

纳尔

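Instead of looping through and collecting the indices of the lines you want to delete, loop through the lines and append only the ones you want to keep.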
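
As a minimal sketch of that idea, assuming the lines are already in a list: the helper name strip_footers, its before parameter, and the sample data are hypothetical and only there for illustration.

```python
import re

# Same pattern as in the question: a form feed followed by "Company".
FOOTER = re.compile(r'.*\x0cCompany')

def strip_footers(lines, before=3):
    """Return the lines to keep, dropping each footer marker line
    and the `before` lines that precede it."""
    kept = []
    for line in lines:
        if FOOTER.match(line):
            # Footer marker found: discard the preceding lines already kept.
            del kept[max(0, len(kept) - before):]
        else:
            kept.append(line)
    return kept

sample = [
    "This is important text.\n",
    "\n",
    "9\n",
    "Title 2012 and 2013\n",
    "\n",
    "\x0cCompany\n",
    "Important text begins again.\n",
]
print(strip_footers(sample))
```

With the sample above, the page number, the title, the blank line before the marker, and the marker line itself are dropped; the two text lines and the blank line between them remain.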

Iterating over the actual file object, rather than reading the whole thing into a list, is also more efficient:

import re

keeplines = []

with open('file.txt') as b:
    for line in b:
        if re.match('.*\\x0cCompany', line):
            keeplines = keeplines[:-3]  # shave off the three preceding lines
        else:
            keeplines.append(line)

# Rewrite the file only after the input has been read and closed
with open('file.txt', 'w') as out:
    for line in keeplines:
        out.write(line)
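
Since the question title asks about a regular expression, a single substitution over the whole text may also work. This is only a sketch and not the method from the answer above: it assumes the file fits comfortably in memory, and the names FOOTER_BLOCK and remove_footers are made up for illustration.

```python
import re

# Three arbitrary lines followed by the form-feed "Company" line.
# Without re.DOTALL, '.' never matches '\n', so each '.*\n' consumes
# exactly one line; re.MULTILINE lets '^' anchor at any line start.
FOOTER_BLOCK = re.compile(r'^(?:.*\n){3}\x0cCompany\n', re.MULTILINE)

def remove_footers(text):
    """Delete every footer marker line plus the three lines before it."""
    return FOOTER_BLOCK.sub('', text)

text = (
    "This is important text.\n"
    "\n"
    "9\n"
    "Title 2012 and 2013\n"
    "\n"
    "\x0cCompany\n"
    "Important text begins again.\n"
)
print(remove_footers(text))
```

One difference from the loop version: if fewer than three lines precede a marker, this pattern does not match and leaves that marker in place, whereas the loop removes whatever preceding lines exist.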
