Reading and copying specific blocks of text in Python


地烯

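I've seen several similar questions on SO (copying a trigger line, or blocks of a fixed size), but they don't quite fit what I'm trying to do. I have a very large text file (output from Valgrind) that I'd like to cut down to only the parts I need.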

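The file is structured as follows: it consists of blocks of lines, each starting with a header line that contains the string 'in loss record'. I want to trigger only on those header lines that also contain the string 'definitely lost', and then copy all the lines below until another header line is reached (at which point the decision process repeats).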

How can I implement such a select-and-copy script in Python?

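Here is what I've tried so far. It works, but I don't think it's the most efficient (or Pythonic) way to do it, so I'd like to see faster approaches, since the files I'm working with are usually very large. (This method takes 1.8 s on a 290M file.)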

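with open("in_file.txt", "r") as fin, open("out_file.txt", "w") as fout:
    lines = fin.read().split("\n")
    i = 0
    while i < len(lines):
        if "blocks are definitely lost in loss record" in lines[i]:
            # Wanted header: copy it, then copy the block body.
            fout.write(lines[i].rstrip() + "\n")
            i += 1
            while i < len(lines) and "loss record" not in lines[i]:
                fout.write(lines[i].rstrip() + "\n")
                i += 1
            # The inner loop stopped on the next header; do not advance,
            # so that header is examined on the next pass.
        else:
            i += 1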

道格

You could try using a regular expression together with mmap.

Something like:

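import mmap
import re

fn = "in_file.txt"  # assumed input path; fn is not defined in the original

# create a regex that will define each block of text you want here
# (a bytes pattern, because the mmap object is searched as bytes):
pat = re.compile(rb'^([^\n]*?blocks are definitely lost in loss record.*?loss record)', re.S | re.M)
with open(fn, 'rb') as f:
    # map the whole file read-only
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    for i, m in enumerate(pat.finditer(mm)):
        # m is a block that you want.
        print(m.group(1).decode())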

Since you didn't provide sample input, that regex certainly won't work as-is, but you get the idea.

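With mmap, the whole file is treated as a single string, yet it is not necessarily all held in memory, so you can search a large file and pick out blocks from it this way.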
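To make the mmap idea concrete, here is a minimal sketch that copies the wanted blocks to an output file. It is untested against real Valgrind output: the pattern `BLOCK`, the function name `copy_definitely_lost`, and the file paths are illustrative assumptions, not part of the answer above. The pattern uses a lookahead so that the next header line is not consumed, which lets two adjacent "definitely lost" blocks both match.

```python
import mmap
import re

# Assumed pattern: a "definitely lost" header line, followed by everything up to
# (but not including) the next line containing "loss record", or the end of file.
BLOCK = re.compile(
    rb"^[^\n]*blocks are definitely lost in loss record"
    rb".*?(?=^[^\n]*loss record|\Z)",
    re.S | re.M,
)

def copy_definitely_lost(src, dst):
    """Copy every matching block from src to dst; return the number of blocks."""
    count = 0
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        # Read-only mapping of the whole file; note mmap rejects empty files.
        with mmap.mmap(fin.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            for m in BLOCK.finditer(mm):
                fout.write(m.group(0))  # header line plus block body, as bytes
                count += 1
    return count
```

Calling `copy_definitely_lost("in_file.txt", "out_file.txt")` would then write only the selected blocks. Working in bytes throughout avoids decoding the whole file, at the cost of having to decode individual blocks if you need text.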

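If your file fits in memory, you can simply read it in and apply the regex directly:

import re

fn = "in_file.txt"  # assumed input path
pat = re.compile(r'^([^\n]*?blocks are definitely lost in loss record.*?loss record)', re.S | re.M)
with open(fn) as fo:
    for i, m in enumerate(pat.finditer(fo.read())):
        block = m.group(1)
        # deal with each block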

If you want a line-by-line, non-regex approach, read the file one line at a time (assuming it is a text file with \n delimiters):

with open(fn) as fo:  # fn is the path to your file
    for line in fo:
        # deal with each line here
        #
        # DON'T do something like string = fo.read() and
        # then iterate over the lines of the string please...
        # unless you need random access to the lines out of order
        pass
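Applied to the question, the line-by-line approach could look like the sketch below. It keeps a single flag that is re-evaluated at every header line, so the file is streamed and never held in memory as a whole. The names `filter_definitely_lost` and `copy_blocks` and the two marker constants are assumptions made for illustration; the markers are taken from the strings quoted in the question.

```python
HEADER = "in loss record"    # marks every header line (from the question)
WANTED = "definitely lost"   # marks the headers whose blocks should be copied

def filter_definitely_lost(lines):
    """Yield the lines of every block whose header contains WANTED."""
    copying = False
    for line in lines:
        if HEADER in line:
            # A header line: decide once whether to copy this block.
            copying = WANTED in line
        if copying:
            yield line

def copy_blocks(src, dst):
    """Stream src line by line and write only the wanted blocks to dst."""
    with open(src) as fin, open(dst, "w") as fout:
        fout.writelines(filter_definitely_lost(fin))
```

Lines that appear before the first header are skipped, because the flag starts out false. Unlike the code in the question, this sketch does not strip trailing whitespace from each line; add `line.rstrip() + "\n"` if that matters.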
