我有一个excel文件列表,其名称格式类似。我需要将其名称中的信息用作pandas Dataframe中的列。我对正则表达式不太熟悉,但是我使用了Google和堆栈溢出来弄清楚该怎么做。但是,在一些极端情况下,我需要帮助弄清楚。
这是我拥有的前40个文件的名称的列表,它可以帮助证明我面临的挑战:
Maker Month Wise Data of VADAKARA RTO - KL18 , Kerala (2020).xlsx
Maker Month Wise Data of KATHUA RTO - JK8 , Jammu & Kashmir (2020).xlsx
Maker Month Wise Data of KANCHEEPURAM RTO - TN21 , Tamil Nadu (2020).xlsx
Maker Month Wise Data of KANJIRAPPALLY SRTO - KL34 , Kerala (2020).xlsx
Maker Month Wise Data of PATHANKOT SDM - PB35 , Punjab (2020).xlsx
Maker Month Wise Data of Chiplun Chiplun Track - MH202 , Maharashtra (2020).xlsx
Maker Month Wise Data of ZUNHEBOTO DTO - NL6 , Nagaland (2020).xlsx
Maker Month Wise Data of MAJITHA SDM - PB81 , Punjab (2020).xlsx
Maker Month Wise Data of Adinath Fitness Center - RJ260 , Rajasthan (2020).xlsx
Maker Month Wise Data of MUMBAI (EAST) - MH3 , Maharashtra (2020).xlsx
Maker Month Wise Data of CHIDAMBARAM RTO - TN544 , Tamil Nadu (2020).xlsx
Maker Month Wise Data of PUDUCHERRY - PY1 , Puducherry (2020).xlsx
Maker Month Wise Data of RANIPET RTO - TN73 , Tamil Nadu (2020).xlsx
Maker Month Wise Data of RTA, HISAR - HR39 , Haryana (2020).xlsx
Maker Month Wise Data of AIZAWL RURAL DTO - MZ9 , Mizoram (2020).xlsx
Maker Month Wise Data of ANANDPUR SAHIB SDM - PB16 , Punjab (2020).xlsx
Maker Month Wise Data of PEN (RAIGAD) - MH6 , Maharashtra (2020).xlsx
Maker Month Wise Data of PEHOWA - HR41 , Haryana (2020).xlsx
Maker Month Wise Data of AKOLA - MH30 , Maharashtra (2020).xlsx
Maker Month Wise Data of CANACONA RTO - GA10 , Goa (2020).xlsx
Maker Month Wise Data of Hooghly RTO - WB15 , West Bengal (2020).xlsx
Maker Month Wise Data of DEVIKULAM SRTO - KL68 , Kerala (2020).xlsx
Maker Month Wise Data of KUTTANADU SRTO - KL66 , Kerala (2020).xlsx
Maker Month Wise Data of CHENNAI (NORTH-EAST) RTO - TN3 , Tamil Nadu (2020).xlsx
Maker Month Wise Data of RLA SHILLAI - HP85 , Himachal Pradesh (2020).xlsx
Maker Month Wise Data of Baloda Bazar DTO - CG22 , Chhattisgarh (2020).xlsx
Maker Month Wise Data of TC OFFICE - STA OFFICE - KL99 , Kerala (2020).xlsx
Maker Month Wise Data of NANDURBAR - MH39 , Maharashtra (2020).xlsx
Maker Month Wise Data of KHETRI DTO - RJ53 , Rajasthan (2020).xlsx
Maker Month Wise Data of AHMEDGARH SDM - PB82 , Punjab (2020).xlsx
Maker Month Wise Data of Alipurduar RTO - WB69 , West Bengal (2020).xlsx
Maker Month Wise Data of RLA GOHAR - HP32 , Himachal Pradesh (2020).xlsx
Maker Month Wise Data of KOLHAPUR - MH9 , Maharashtra (2020).xlsx
Maker Month Wise Data of SILVASSA - DD1 , UT of DNH and DD (2020).xlsx
Maker Month Wise Data of MANNARGHAT SRTO - KL50 , Kerala (2020).xlsx
Maker Month Wise Data of SRIVILLIPUTHUR RTO - TN605 , Tamil Nadu (2020).xlsx
Maker Month Wise Data of ZONAL OFFICE, SOUTH WEST DELHI,DWARKA - DL9 , Delhi (2020).xlsx
Maker Month Wise Data of BUDGAM ARTO - JK4 , Jammu & Kashmir (2020).xlsx
Maker Month Wise Data of Kolar RTO - KA7 , Karnataka (2020).xlsx
Maker Month Wise Data of Singtam, East Sikkim - SK8 , Sikkim (2020).xlsx
这是使用正则表达式从这些文件名中提取信息的代码段:
# Add RTO column - WORKS
rto = re.search('\s\sof\s(.*)\s\-', file_name)
df['RTO'] = rto.group(1)
# Add registration number column - NEEDS TO BE CORRECTED - See match 27
registration_number = re.search('\s\-(.*)\s\,', file_name)
df['Registration Number'] = registration_number.group(1)
# Add state column - NEEDS TO BE CORRECTED - See match 14, 34, 37
state = re.search('\,\s(.*)\s\(', file_name)
df['State'] = state.group(1)
# Add year column - NEEDS TO BE CORRECTED - See match 10, 17, 24,
year = re.search('\((.*)\)', file_name)
df['Year'] = year.group(1)
RTO的正则表达式似乎可以正常工作,但是对于注册号,州和年份,某些边缘情况需要固定。我已经在代码注释中突出显示了正则表达式错误行。请让我知道我是否可以提供其他任何输入。
我会在解决此问题方面获得任何帮助,我们将不胜感激!
我认为您可以稍微修改一下以前的解决方案,以
pattern = r'\s+of\s+(.*?)\s+-\s+(.*?)\s+,\s+(.*?)\s+\((\d{4})\)'
df[['RTO', 'Registration Number', 'State','Year']] = df['Maker'].str.extract(pattern, expand=True)
请参阅regex演示。
如果注册号只能包含大写字母和数字,则可以替换(.*?)
为([A-Z0-9]+)
并使用\s+of\s+(.*?)\s+-\s+([A-Z0-9]*)\s+,\s+(.*?)\s+\((\d{4})\)
。
详细资料:
\s+
-一个或多个空格of
- 一个字 of
\s+
一个或多个惠特空间(.*?)
-第1组:除换行符以外的任何零个或多个字符,请尽可能少\s+,\s+
-包含1个以上空格的逗号(.*?)
-第2组:除换行符以外的任何零个或多个字符,请尽可能少\s+,\s+
-包含1个以上空格的逗号(.*?)
-第3组:除换行符以外的任何零个或多个字符,尽可能少(或者,如果[A-Z0-9]+
使用的话,则任何一个或多个大写ASCII字母或数字)\s+
-1+空格\(
-一个(
字符(\d{4})
-第3组:四位数\)
-一个)
字符。本文收集自互联网,转载请注明来源。
如有侵权,请联系[email protected] 删除。
我来说两句