Read a file and replace words in it based on data from a second file


isa

I have a file like this, containing sentences marked with BOS (Begin Of Sentence) and EOS (End Of Sentence):

BOS 1
1 word \t\t word \t word \t\t word \t 123
1 word \t\t word \t word \t\t word \t 234
1 word \t\t word \t word \t\t word \t 567
EOS 1

BOS 2
2 word \t\t word \t word \t\t word \t 456
2 word \t\t word \t word \t\t word \t 789
EOS 2

And a second file, where the first number on each line is the sentence number:

1, 123, 567
2, 789

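What I want is to read both files and check whether the number at the end of each line of the first file appears in the second file for the same sentence. If it does, I want to change only the fourth word of that line in the first file. So the expected output is: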

BOS 1
1 word \t\t word \t word \t\t NEW_WORD \t 123
1 word \t\t word \t word \t\t word \t 234
1 word \t\t word \t word \t\t NEW_WORD \t 567
EOS 1

BOS 2
2 word \t\t word \t word \t\t word \t 456
2 word \t\t word \t word \t\t NEW_WORD \t 789
EOS 2

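First, I don't know how to read the two files, since they have different numbers of lines. Second, I don't know how to iterate over, say, the lines of the first sentence in the first file while at the same time iterating over the values on the first line of the second file to compare them. This is what I have so far: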

import itertools
import re
import sys

def readText(filename1, filename2):
  data1 = open(filename1).readlines()   # the first file

  data2 = open(filename2).readlines() # the second one

  list2 = [] # a list to store the values of the second file

  for line1, line2 in itertools.izip(data1, data2):
    l1 = line1.split()

    l2 = line2.split(', ')

    find = re.findall(r'.*word\t\d\d\d', line1) # find the fourth word in a line, followed by a number

    for l in l2:
      list2.append(l)

    for match in find:
      m = match.split() # split the lines of the first file

      if (m[0] == list2[0]): # for the same sentence number in the two files 
        result = re.sub(r'(.*)word\t%s' %m[5], r'\1NEW_WORD\t%s' %m[5],line1) 

if len(sys.argv)==3: 
  lines = readText(sys.argv[1], sys.argv[2])
else:
  print("file.py inputfile1 inputfile2")

Thanks for your help!

Nizam Mohamed

For reference, I'll call the first file source.txt, the second file control.txt, and the output result.txt.
Here is the skeleton of the program.

[modify_line(line) if line[0].isdigit() else line for line in source]

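This code passes each line through either unchanged or modified. If a line starts with a digit, it is passed to modify_line, which returns either a modified line or the original line, depending on the line itself and the input taken from control.txt.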
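
A minimal sketch of that pass-through pattern, assuming a stand-in modify_line (here it just upper-cases the line, purely so the effect is visible) and an in-memory list in place of source.txt:

```python
def modify_line(line):
    # Stand-in for the real function: upper-case the line for illustration only.
    return line.upper()

# In-memory replacement for the lines of source.txt (assumed sample data).
source = ["BOS 1\n", "1 word\t123\n", "\n", "EOS 1\n"]

# Lines starting with a digit go through modify_line; all others pass unchanged.
result = [modify_line(line) if line[0].isdigit() else line for line in source]
```

Checking line[0] is safe here because lines read from a file always contain at least the newline character.
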
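To check and modify each line it is given, modify_line needs data from control.txt. That data is a line's starting number and its end numbers, e.g. [1, (123, 567)]. If the starting number matches and one of the end numbers matches, the line is modified. If the starting number does not match, the next starting number is read from the control file; this works because only lines that start with a number are passed to modify_line.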
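
To make the shape of that control data concrete, here is a small sketch that parses control lines into a mapping from starting number to end numbers. parse_control is a hypothetical helper, not part of the code in this answer, and it reads all lines at once rather than one at a time:

```python
def parse_control(lines):
    """Map each sentence number to the set of end numbers listed for it."""
    control = {}
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        # First field is the sentence number, the rest are end numbers.
        sentence_no, *ends = (field.strip() for field in line.split(','))
        control[sentence_no] = set(ends)
    return control

control = parse_control(["1, 123, 567\n", "2, 789\n"])

# A line of sentence 1 ending in 567 should be modified; one ending in 234 should not.
should_modify = '567' in control.get('1', ())
should_skip = '234' in control.get('1', ())
```

With a mapping like this, the order of lines in the control file would no longer matter, at the cost of holding the whole control file in memory.
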
To maintain state, I used a closure here.

import re

def create_line_modification_function(fp, replacement_word):

    def get_line_number_and_end_numbers():
        for line in fp:
            if line.strip():
                line_number, rest = line.split(',', 1)
                line_number = line_number.strip()
                ends = [end.strip() for end in rest.split(',')]
                yield line_number, ends

    generate_line_numbers_and_ends = get_line_number_and_end_numbers()
    # modify_line needs to change this. So this is in a list
    line_number_and_ends = list(next(generate_line_numbers_and_ends, (None, None)))
    # for safety check if we run out of line numbers in the control file
    if line_number_and_ends[0] is None:
        raise ValueError('{} reached EOF'.format(fp.name))
    # for optimization compile once here
    pattern = re.compile(r'(.*)word(.*\d{3}$)')


    def modify_line(line):
        while True:
            # for convenience unpack the list 
            line_number, ends = line_number_and_ends
            fields = line.split()
            # compare whole fields, so that e.g. '1' does not match a line starting with '10'
            if fields and fields[0] == line_number:
                if fields[-1] in ends:
                    return pattern.sub(r'\1{}\2'.format(replacement_word), line)
                return line
            # If we are here the line numbers from control.txt and source.txt don't match.
            # So we have to read next line from control file
            line_number_and_ends[0], line_number_and_ends[1]  = next(generate_line_numbers_and_ends, (None, None))
            if line_number_and_ends[0] is None:
                raise ValueError('{} reached EOF'.format(fp.name))

    return modify_line

if __name__ == '__main__':

    with open('source.txt') as source, open('control.txt') as ctl, open('result.txt', 'w') as target:
        modify_line = create_line_modification_function(ctl, 'NEW_WORD')
        target.writelines(modify_line(line) if line[0].isdigit() else line for line in source)
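
The code above keeps its mutable state in a list so that the inner function can change it. In Python 3 the same idea can be sketched with nonlocal instead. This is only an illustration under assumptions: make_matcher and is_listed are hypothetical names, both files are assumed to be sorted by ascending sentence number, and the sketch returns False rather than raising when the control data runs out:

```python
def make_matcher(control_entries):
    """Return a function that tells whether (sentence_no, end) is listed.

    control_entries: iterable of (sentence_no, ends) pairs in ascending order.
    """
    entries = iter(control_entries)
    current = next(entries, None)

    def is_listed(sentence_no, end):
        nonlocal current  # rebind the enclosing variable instead of mutating a list
        # Skip control entries that belong to earlier sentences.
        while current is not None and int(current[0]) < int(sentence_no):
            current = next(entries, None)
        return (current is not None
                and current[0] == sentence_no
                and end in current[1])

    return is_listed

is_listed = make_matcher([("1", {"123", "567"}), ("3", {"789"})])
checks = [
    is_listed("1", "123"),  # listed for sentence 1
    is_listed("1", "234"),  # not listed
    is_listed("2", "456"),  # sentence 2 has no control entry
    is_listed("3", "789"),  # listed for sentence 3
]
```

Comparing the numbers as integers lets a sentence without a control entry be skipped without consuming the entry for a later sentence.
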


Edited 2021-07-26
