How do I parse values from a text file into lists while filling missing values with None?


Ben Smith

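I have a text file of raw data that I am parsing.
Each line starts with a code that identifies the field.
The values go into lists, then into a pandas DataFrame, and finally into a database.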

For example, a small excerpt with 2 records looks like this:

INS*Y*18*001*AL*A*E**AC**N~
REF*1L*690553677~
DTP*348*D8*20200601~
DTP*349*D8*20200630~
HD*024**FAC*KJ/165/////1M*IND~
INS*Y*18*001*AL*A*E**AC**N~
REF*1L*6905456455~
DTP*348*D8*20200601~
HD*024**FAC*KJ/165/////1M*IND~

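"DTP" marks a date: 348 denotes the start_date and 349 denotes the end_date.
Each group of lines corresponds to one member in the membership data.
- "REF" is the line with the member ID.
- "INS" indicates a new member, that is, a new record in the database.
- Some members, like the second record, have no "DTP*349" end_date line.
  - For these, "" should be appended to the end_date list so that the position is kept but left empty.
While iterating over the lines, I look for lines that start with the code I need, then split the line and take the specified element.
How do I account for a missing field inside the loop, so that every member has an entry at its index position whether or not it has an end_date, and all the lists can go into a pandas DataFrame?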

So far my code looks like this:

membership_type = []
member_id = []
startDate = []
endDate = []
with open(path2 + fileName, "r") as txtfile:
    for line in txtfile:
        # Drop the newline and the trailing "~", then split into fields
        fields = line.strip().rstrip("~").split("*")
        # Member type
        if line.startswith("INS*"):
            membership_type.append(fields[4])
        # Member ID
        if line.startswith("REF*"):
            member_id.append(fields[2])
        # Start Dates
        if line.startswith("DTP*348*"):
            startDate.append(fields[3])
        # End Dates
        '''What goes here?'''

The result should look like this:

print(membership_type)
['AL','AL']
print(member_id)
['690553677','6905456455']
print(startDate)
['20200601','20200601']
print(endDate)
['20200630','']

Every record will have an INS, a REF, and an HD field.

Trenton McKinney

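Use readlines to get all the rows as strings.
- Clean each row of text, then use re.split to split on multiple delimiters, * and / in this case.
- Splitting on / correctly separates the individual items in the string, but it also creates empty strings that need to be removed.
- Use enumerate on the rows.
  - While looping over the whole list of rows you have the current index i, and i plus or minus a number can be used to look at a different row.
  - If the row after DTP 348 is not a DTP row, append None or ''.
    - Filling the gap with None makes the conversion to the pandas datetime format easier.
  - Remember that line is one row of lines, and each line is enumerated with i. The current line is lines[i], and the next line is lines[i + 1].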
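
To make the cleaning step concrete, here is a minimal sketch of it applied to single rows. The helper name `split_row` is hypothetical and only used for illustration; it assumes every segment ends with `~` as in the sample data.

```python
import re


def split_row(raw):
    """Strip the trailing '~', split on '*' or '/', and drop empty fields.

    `split_row` is an illustrative helper, not part of any library.
    """
    fields = re.split(r"[*/]", raw.strip().replace("~", ""))
    return [field for field in fields if field]


print(split_row("HD*024**FAC*KJ/165/////1M*IND~\n"))
# ['HD', '024', 'FAC', 'KJ', '165', '1M', 'IND']
print(split_row("DTP*348*D8*20200601~\n"))
# ['DTP', '348', 'D8', '20200601']
```

Note that dropping empty fields shifts every later field to the left, so positional indexes after a blank refer to the compacted list rather than to the original segment layout. The full solution below applies the same steps to every row of the file.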

import re

membership_type = list()
member_id = list()
start_date = list()
end_date = list()
name = list()
first_name = list()
middle_name = list()
last_name = list()
prefix = list()
suffix = list()
with open('test.txt', "r") as f:
    lines = [re.split(r'\*|/', x.strip().replace('~', '')) for x in f.readlines()]  # clean and split each row
    lines = [[i for i in l if i] for l in lines]  # remove empty fields
    for i, line in enumerate(lines):
        print(line)  # only if you want to see each parsed row
        # Member type
        if line[0] == "INS":
            membership_type.append(line[4])
        # Member ID
        elif line[0] == 'REF':
            member_id.append(line[2])
        # Start Dates
        elif (line[0] == 'DTP') and (line[1] == '348'):
            start_date.append(line[3])
            # the next line should be the end_date; if it is missing or not a DTP line, add None
            if (i + 1 >= len(lines)) or (lines[i + 1][0] != 'DTP'):
                end_date.append(None)
        # End Dates
        elif (line[0] == 'DTP') and (line[1] == '349'):
            end_date.append(line[3])
        # Names
        elif line[0] == 'NM1':
            name.append(' '.join(line[3:]))
            first_name.append(line[3])
            middle_name.append(line[4])
            last_name.append(line[5])
            # Optional fields: fall back to None when the row is too short
            try:
                prefix.append(line[6])
            except IndexError:
                print('No prefix')
                prefix.append(None)

            try:
                suffix.append(line[7])
            except IndexError:
                print('No suffix')
                suffix.append(None)


print(membership_type)
print(member_id)
print(start_date)
print(end_date)
print(name)
print(first_name)
print(middle_name)
print(last_name)

['AL', 'AL']
['690553677', '6905456455']
['20200601', '20200601']
['20200630', None]
['SMITH JOHN PAUL MR JR', 'IMA MEAN TURD MR SR']
['SMITH', 'IMA']
['JOHN', 'MEAN']
['PAUL', 'TURD']
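
The look-ahead on `lines[i + 1]` works when the end date always directly follows the start date. As an alternative sketch (not the method used above), each member can be collected into a dict whose fields default to None, so any missing segment leaves a None in place regardless of line order. This assumes every record begins with an INS line, as in the question's sample; `parse_members` and the dict keys are hypothetical names chosen for illustration.

```python
def parse_members(rows):
    """Group segments into one dict per member; absent fields stay None.

    Assumption: an INS segment opens each member record.
    """
    records = []
    current = None
    for raw in rows:
        fields = raw.strip().rstrip("~").split("*")
        code = fields[0]
        if code == "INS":
            # Start a new record with every optional field defaulted to None
            current = {"membership_type": fields[4], "member_id": None,
                       "start_date": None, "end_date": None}
            records.append(current)
        elif current is None:
            continue  # ignore segments that appear before the first INS
        elif code == "REF":
            current["member_id"] = fields[2]
        elif code == "DTP" and fields[1] == "348":
            current["start_date"] = fields[3]
        elif code == "DTP" and fields[1] == "349":
            current["end_date"] = fields[3]
    return records


sample = [
    "INS*Y*18*001*AL*A*E**AC**N~",
    "REF*1L*690553677~",
    "DTP*348*D8*20200601~",
    "DTP*349*D8*20200630~",
    "HD*024**FAC*KJ/165/////1M*IND~",
    "INS*Y*18*001*AL*A*E**AC**N~",
    "REF*1L*6905456455~",
    "DTP*348*D8*20200601~",
    "HD*024**FAC*KJ/165/////1M*IND~",
]
records = parse_members(sample)
print([r["end_date"] for r in records])  # ['20200630', None]
```

A list of dicts like `records` can be passed directly to `pd.DataFrame`, which avoids keeping several parallel lists aligned by hand.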

Load into pandas

import pandas as pd

data = {'start_date': start_date , 'end_date': end_date, 'member_id': member_id, 'membership_type': membership_type,
        'name': name, 'first_name': first_name, 'middle_name': middle_name, 'last_name': last_name}

df = pd.DataFrame(data)

# convert datetime columns
df.start_date = pd.to_datetime(df.start_date)
df.end_date = pd.to_datetime(df.end_date)

# display df
  start_date   end_date   member_id membership_type                   name first_name middle_name last_name
0 2020-06-01 2020-06-30   690553677              AL  SMITH JOHN PAUL MR JR      SMITH        JOHN      PAUL
1 2020-06-01        NaT  6905456455              AL    IMA MEAN TURD MR SR        IMA        MEAN      TURD
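
As a small self-contained check of the None handling, the sketch below converts the two date columns with an explicit `format`, which is an optional addition here rather than something the code above requires. The values are hard-coded from the sample output.

```python
import pandas as pd

# Values copied from the parsed sample; None marks the missing end date
df = pd.DataFrame({"start_date": ["20200601", "20200601"],
                   "end_date": ["20200630", None]})

# An explicit format states the expected YYYYMMDD layout; None becomes NaT
df["start_date"] = pd.to_datetime(df["start_date"], format="%Y%m%d")
df["end_date"] = pd.to_datetime(df["end_date"], format="%Y%m%d")

print(df["end_date"].isna().tolist())  # [False, True]
```

The NaT entries can then be filtered with `isna()` or left as missing values when the frame is written out.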

Contents of `test.txt`

NM1*IL*1*SMITH*JOHN*PAUL*MR*JR~
INS*Y*18*001*AL*A*E**AC**N~
REF*1L*690553677~
DTP*348*D8*20200601~
DTP*349*D8*20200630~
HD*024**FAC*KJ/165/////1M*IND~
NM1*IL*1*IMA*MEAN*TURD*MR*SR~
INS*Y*18*001*AL*A*E**AC**N~
REF*1L*6905456455~
DTP*348*D8*20200601~
HD*024**FAC*KJ/165/////1M*IND~
