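I recently obtained data from my local gym and am trying to normalize it so that I can create "gym sign-up" objects, each containing everyone who signed up for a given session.
The text file looks like this: https://pastebin.com/YcnSJiA7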
Sep 30th '20 at 9:00AM Until Sep 30th '20 at 10:00AM
JD John Doe
AW Alice Wonderland
IM Iron Man
Sep 30th '20 at 8:00AM Until Sep 30th '20 at 9:00AM
JD John Doe
AW Alice Wonderland
IM Iron Man
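I have been able to use pandas to split the sign-ups into the columns [initials, name], but I don't know how to detect when a line is a time slot rather than a person signing up.
So after the program runs, each row should contain the columns [initials, name, time slot].
The easiest format for me to work with would be this: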
JD John Doe Sep 30th '20 at 9:00AM Until Sep 30th '20 at 10:00AM
AW Alice Wonderland Sep 30th '20 at 9:00AM Until Sep 30th '20 at 10:00AM
IM Iron Man Sep 30th '20 at 9:00AM Until Sep 30th '20 at 10:00AM
JD John Doe Sep 30th '20 at 8:00AM Until Sep 30th '20 at 9:00AM
AW Alice Wonderland Sep 30th '20 at 8:00AM Until Sep 30th '20 at 9:00AM
IM Iron Man Sep 30th '20 at 8:00AM Until Sep 30th '20 at 9:00AM
I tried iterating over the lines: once a time-slot line appears, append it to every following line until a new time slot appears.
def testSort():
    with open("1-weak-gym.txt") as fp:
        id = []
        totalSheet = []
        timeSlot = []
        lastLine = []
        for ln in fp:
            if ln.startswith("Sep"):  ## this is a time slot
                timeSlot.clear()
                timeSlot.append(ln[0:])  ## save that time slot as the lastDate variable
            else:
                if (timeSlot):
                    totalSheet.append(timeSlot)  ## append the time slot
                    totalSheet.append(ln[0:])  ## append the name line
                else:
                    print('Hello eror')
        print(totalSheet, file=open("newOuput.txt","a"))
You can try this approach (provided the header lines reliably end with a time pattern):
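import re

# Matches a 12-hour clock time such as 9:00AM or 10:00am.
TIME_RE = re.compile(r'(1[0-2]|0?[1-9]):[0-5][0-9][AaPp][Mm]')

def is_time_format(s):
    return bool(TIME_RE.fullmatch(s))

with open("1-weak-gym.txt") as fp:
    new_lines = []
    extra_info = ''
    for line in fp:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        last_bit = line.split(' ')[-1]
        if is_time_format(last_bit):
            # Header line: remember the time slot for the names below it.
            extra_info = line
        else:
            new_lines.append(line + '\t' + extra_info + '\n')

# Write the result and close the file properly.
with open("newOutput", 'w') as out:
    out.writelines(new_lines)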
You will then get a file in the desired format, with each sign-up line followed by a tab and its time slot.
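If you would rather end up with separate [initials, name, time slot] columns than a text file, the same idea can be sketched as a function that returns tuples. This is only a sketch under a few assumptions: the names `SLOT_RE` and `parse_signups` are made up for illustration, every header is assumed to look like "<start time> Until <end time>", and the initials are assumed to be the first space-separated token on a sign-up line.

```python
import re

# Assumed header shape: "<start> Until <end>", each side ending in a 12-hour time.
SLOT_RE = re.compile(
    r".+\d{1,2}:\d{2}[AP]M\s+Until\s+.+\d{1,2}:\d{2}[AP]M", re.IGNORECASE
)

def parse_signups(lines):
    """Return (initials, name, time_slot) tuples from an iterable of lines."""
    rows = []
    current_slot = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if SLOT_RE.fullmatch(line):
            current_slot = line  # header: applies to the names that follow
            continue
        if current_slot is None:
            continue  # a name before any header has no slot; skipped here
        # Assumption: the first token is the initials, the rest is the name.
        initials, _, name = line.partition(" ")
        rows.append((initials, name, current_slot))
    return rows

sample = [
    "Sep 30th '20 at 9:00AM Until Sep 30th '20 at 10:00AM",
    "JD John Doe",
    "AW Alice Wonderland",
    "Sep 30th '20 at 8:00AM Until Sep 30th '20 at 9:00AM",
    "IM Iron Man",
]
rows = parse_signups(sample)
```

Because the time slot is kept as a plain string and copied into each tuple, a later header cannot change rows that were already collected. The resulting list can be passed to `pandas.DataFrame(rows, columns=[...])` if you want to continue in pandas.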