How can I keep only the unique words in a file, i.e. those that do not match any word in any other file (two or more files)?

尼基尔·切克

#!/bin/sh
# For every ordered pair of distinct files, remove the lines they share from both.
for file1 in directorypath/*
do
    for file2 in directorypath/*
    do
        if [ "$file1" = "$file2" ]; then
            echo "files are same"
        else
            # Lines of file2 that also occur in file1
            # (-F: fixed strings, -x: whole-line match, so substrings do not count)
            grep -F -x -f "$file1" "$file2" > /home/common.txt

            # Drop the common lines from file1
            grep -v -F -x -f /home/common.txt "$file1" > /home/temp.txt
            cat /home/temp.txt > "$file1"

            # Drop the common lines from file2
            grep -v -F -x -f /home/common.txt "$file2" > /home/temp.txt
            cat /home/temp.txt > "$file2"
        fi
    done
done

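This code works for small files. Since I have to process large text files, it takes far too long, even on a server machine. Please help! How can I achieve the same result efficiently? Thanks in advance.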

熵

Try this Python script (it takes the directory as its argument):

import sys
import os

# Keeps a mapping of word => file that contains it
# word => None means that that word exists in multiple files
words = {}

def process_line(file_name, line):
    try:
        other_file = words[line]
        if other_file is None or other_file == file_name:
            return
        words[line] = None
    except KeyError:
        words[line] = file_name

file_dir = sys.argv[1]
for file_name in os.listdir(file_dir):
    with open(os.path.join(file_dir, file_name)) as fd:
        for line in fd:
            # Skip blank lines; every other line is treated as one word
            line = line.strip()
            if not line:
                continue
            process_line(file_name, line)

file_descriptors = {}
# Empty all existing files before writing out the info we have
for file_name in os.listdir(file_dir):
    file_descriptors[file_name] = open(os.path.join(file_dir, file_name), "w")

for word in words:
    file_name = words[word]
    if file_name is None:
        continue
    fd = file_descriptors[file_name]
    fd.write("%s\n" % word)

for fd in file_descriptors.values():
    fd.close()
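
To make the bookkeeping easier to follow, here is a minimal sketch of the same idea on in-memory data instead of real files. The function name `unique_words` and the sample file names are made up for illustration; the rule it applies is the one in `process_line` above.

```python
def unique_words(files):
    """`files` maps a (hypothetical) file name to its list of lines.

    Returns file name -> sorted words found in that file and in no other.
    """
    owner = {}  # word -> owning file name, or None once seen in 2+ files
    for name, lines in files.items():
        for line in lines:
            word = line.strip()
            if not word:
                continue
            if word not in owner:
                owner[word] = name      # first sighting: remember the file
            elif owner[word] not in (None, name):
                owner[word] = None      # seen in a second file: mark as shared

    result = {name: [] for name in files}
    for word, name in owner.items():
        if name is not None:
            result[name].append(word)
    return {name: sorted(words) for name, words in result.items()}


sample = {
    "a.txt": ["apple", "banana", "apple", ""],
    "b.txt": ["banana", "cherry"],
    "c.txt": ["banana", "date"],
}
print(unique_words(sample))
# {'a.txt': ['apple'], 'b.txt': ['cherry'], 'c.txt': ['date']}
```

Note that "banana", which occurs in all three files, is dropped everywhere, and that "apple", repeated inside a.txt, is written out only once.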

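Memory requirements:

You need to be able to hold all the distinct words in memory at once. Assuming there is a lot of duplication between the files, that should be feasible. Otherwise, honestly, I don't see a faster approach than the one you already have.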

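If you end up unable to fit everything you need in memory, take a look at this answer for a possible way to use a disk-backed dict instead of keeping it all in memory. I don't know how much that would hurt performance, or whether it would still run fast enough at that point.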
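
As one possible illustration of a disk-backed dict, the sketch below uses the standard library's `shelve` module. This is an assumption on my part, not necessarily what the linked answer describes, and the helper name `record_word` is hypothetical. It applies the same rule as `process_line`, but works with any mapping, including a shelf stored on disk.

```python
import os
import shelve
import tempfile


def record_word(db, file_name, word):
    """Same rule as process_line, but `db` can be a dict or a shelve.Shelf."""
    if word not in db:
        db[word] = file_name          # first sighting: remember the file
    elif db[word] not in (None, file_name):
        db[word] = None               # seen in a second file: mark as shared


if __name__ == "__main__":
    # Demo in a throwaway directory; a real run would pick a persistent path.
    with tempfile.TemporaryDirectory() as tmp:
        with shelve.open(os.path.join(tmp, "words")) as db:
            record_word(db, "a.txt", "apple")
            record_word(db, "b.txt", "apple")
            record_word(db, "b.txt", "cherry")
            for word in sorted(db.keys()):
                print(word, db[word])
```

Every lookup and update now goes through the on-disk database, so expect this to be considerably slower than a plain dict; it trades speed for a smaller memory footprint.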

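Why is it faster? (In theory, untested)

It gets the job done with a single pass over each file. Your current approach is O(n^2), where n is the number of files.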
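
A quick back-of-the-envelope comparison, as a sketch: the nested loops in the shell script visit every ordered pair of distinct files, while the Python script reads each file once. The function names below are made up for illustration, and the counts ignore file sizes and the cost of each individual grep.

```python
def pairwise_comparisons(n):
    """Ordered pairs (file1, file2) with file1 != file2 visited by the nested loops."""
    return n * (n - 1)


def single_pass_reads(n):
    """Number of files read by the single-pass approach: each file exactly once."""
    return n


for n in (10, 100, 1000):
    print(n, pairwise_comparisons(n), single_pass_reads(n))
# 10 90 10
# 100 9900 100
# 1000 999000 1000
```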
