How to efficiently extract month-year column names from a DataFrame header in pandas?

balandongiv

The objective is to extract the column names of df that fall under the month-year category while omitting the others. The code below is one way to achieve this:

import pandas as pd

df = pd.DataFrame([['PP1', 'LN', 'T1', 'C11', 'C21', 'C31', 'C32']])
df.columns = ['dummy1', 'dummy2', 'Jan-20', 'Feb-20', 'Jan 2021', 'Feb 2080', 'Dec 1993']
extract_header_name = list(df.columns.values)
lookup_list = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
# Keep every header that contains a month abbreviation; results come out grouped by month
month_year_list = [i for e in lookup_list for i in extract_header_name if e in i]

Output

['Jan-20', 'Jan 2021', 'Feb-20', 'Feb 2080', 'Dec 1993']

However, I wonder whether there is a more efficient approach, or a built-in pandas method, that achieves a similar result?

jezrael

Use str.contains with the values joined by |, which is the regex "or" (so the pattern means Jan or Feb or ...), and filter df.columns by boolean indexing:

month_year_list = df.columns[df.columns.str.contains('|'.join(lookup_list))].tolist()
print (month_year_list)
['Jan-20', 'Feb-20', 'Jan 2021', 'Feb 2080', 'Dec 1993']
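
One thing to keep in mind is that the joined pattern matches a month abbreviation anywhere in the header, so a name such as 'Market' would also be selected. Here is a minimal sketch of a stricter, anchored pattern. The headers 'Market' and 'Decimal' are hypothetical, added only to show the difference, and the pattern assumes the two header styles from the question (Mon-YY and Mon YYYY):

```python
import re
import pandas as pd

lookup_list = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
               'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']

# Hypothetical headers: 'Market' and 'Decimal' contain 'Mar' / 'Dec' but are not dates
cols = pd.Index(['dummy1', 'Market', 'Decimal', 'Jan-20', 'Feb 2080'])

# Unanchored: a month abbreviation anywhere in the name counts as a match
loose = cols[cols.str.contains('|'.join(lookup_list))].tolist()

# Anchored: month abbreviation, then '-' or a space, then exactly 2 or 4 digits
months = '|'.join(map(re.escape, lookup_list))
pattern = r'^(?:' + months + r')[- ]\d{2}(?:\d{2})?$'
strict = cols[cols.str.contains(pattern)].tolist()

print(loose)   # ['Market', 'Decimal', 'Jan-20', 'Feb 2080']
print(strict)  # ['Jan-20', 'Feb 2080']
```

The group is non-capturing, (?:...), so str.contains does not warn about match groups.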

Or use str.startswith on df.columns, converting the list to a tuple:

month_year_list = df.columns[df.columns.str.startswith(tuple(lookup_list))].tolist()
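
For comparison, the same prefix test can be sketched in plain Python, since the built-in str.startswith also accepts a tuple of prefixes. The names by_prefix and month_year_df are only illustrative. Unlike the list comprehension in the question, this keeps the headers in their original column order, and the resulting list can be used to select the matching columns:

```python
import pandas as pd

lookup_list = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
               'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']

df = pd.DataFrame(
    [['PP1', 'LN', 'T1', 'C11', 'C21', 'C31', 'C32']],
    columns=['dummy1', 'dummy2', 'Jan-20', 'Feb-20',
             'Jan 2021', 'Feb 2080', 'Dec 1993'],
)

# Built-in str.startswith accepts a tuple and returns True if any prefix matches
prefixes = tuple(lookup_list)
by_prefix = [c for c in df.columns if c.startswith(prefixes)]

# Use the header list to keep only the month-year columns of df
month_year_df = df[by_prefix]

print(by_prefix)
# ['Jan-20', 'Feb-20', 'Jan 2021', 'Feb 2080', 'Dec 1993']
print(month_year_df.iloc[0].tolist())
# ['T1', 'C11', 'C21', 'C31', 'C32']
```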

Another idea, if the headers use only these two date formats:

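s = df.columns.to_series()
# Try each format in turn; headers that do not match become NaT
s1 = pd.to_datetime(s, format='%b-%y', errors='coerce')
s2 = pd.to_datetime(s, format='%b %Y', errors='coerce')
# Keep the headers that parsed under at least one of the two formats
month_year_list = df.columns[s1.fillna(s2).notna()].tolist()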
print (month_year_list)
['Jan-20', 'Feb-20', 'Jan 2021', 'Feb 2080', 'Dec 1993']
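
Since the headers are parsed into real datetimes here, the year itself can also be read off each header. Below is a rough sketch under the same two-format assumption; the names parsed and years are only illustrative:

```python
import pandas as pd

cols = pd.Index(['dummy1', 'dummy2', 'Jan-20', 'Feb-20',
                 'Jan 2021', 'Feb 2080', 'Dec 1993'])

s = cols.to_series()
s1 = pd.to_datetime(s, format='%b-%y', errors='coerce')  # e.g. 'Jan-20'
s2 = pd.to_datetime(s, format='%b %Y', errors='coerce')  # e.g. 'Jan 2021'

# Combine both attempts and drop the headers that matched neither format
parsed = s1.fillna(s2).dropna()

# Map each month-year header to its year
years = parsed.dt.year.to_dict()
print(years)
# {'Jan-20': 2020, 'Feb-20': 2020, 'Jan 2021': 2021, 'Feb 2080': 2080, 'Dec 1993': 1993}
```

Note that the two-digit year in 'Jan-20' is read as 2020 by %y.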
