Select columns with name matching str with wildcard for t-test (Python)


Flora

I have the following DataFrame:

   Apple f2 m  Apple f2 t  Apple f3 m  Apple f3 t
0           3           4           5           3
1          12           7           4           7
2           5           9           7           5
3           3           3           4           8
4           7           1           2           6

I would like to select the columns whose names match 'Apple f* m' and run a t-test against the columns whose names match 'Apple f* t'.

I have tried

ttest_ind(df.loc[:, df.columns.str.contains('Apple f* m')], df.loc[:, df.columns.str.contains('Apple f* t')])

However, it doesn't recognise my wildcard as a wildcard.

Thank you if you can help me solve this problem or guide me towards a solution.

Anton vBR

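For future reference: pandas.Series.str.contains has its regex parameter set to True by default, so the pattern is interpreted as a regular expression. In a regular expression, * is not a standalone wildcard; it repeats the preceding token, so 'f*' matches zero or more 'f' characters rather than "f followed by anything".

To match zero or more of any character, use .* instead (ref. Alan Moore):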

.* just means "0 or more of any character"

It's broken down into two parts:

. - a "dot" indicates any character
* - means "0 or more instances of the preceding regex token"
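To see the difference on the column names from the question, here is a minimal sketch. The variable names (cols, glob_style, regex_style) are only illustrative, and the patterns assume the names follow the 'Apple f<something> m' layout shown above.

```python
import pandas as pd

# Column names taken from the question.
cols = pd.Index(["Apple f2 m", "Apple f2 t", "Apple f3 m", "Apple f3 t"])

# 'f*' repeats the letter 'f' zero or more times, so the digit after 'f'
# is never consumed and nothing matches.
glob_style = cols.str.contains("Apple f* m")

# 'f.*' is 'f' followed by zero or more of any character.
regex_style = cols.str.contains("Apple f.* m")

print(glob_style.tolist())   # [False, False, False, False]
print(regex_style.tolist())  # [True, False, True, False]
```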

Here is a link to regex101 where you can test regex expressions:

https://regex101.com/r/QNjkch/1

Finally, we can simplify your code. Consider this simple example:

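import pandas as pd

df = pd.DataFrame(columns=["a1a", "a2a", "a1b"])

# Boolean mask: True for column names containing "a", any characters, then "a"
mask = df.columns.str.contains('a.*a')

df.loc[:, mask]   # selects the matching columns: a1a, a2a
df.loc[:, ~mask]  # selects the non-matching columns (~ inverts the mask): a1b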
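Applied to the data in the question, the same idea might look like the sketch below. It assumes that ttest_ind refers to scipy.stats.ttest_ind and that a column-wise comparison is wanted (f2 m against f2 t, f3 m against f3 t); the names m_cols and t_cols are illustrative. If the goal is instead to pool all "m" values against all "t" values, the arrays would need to be flattened first.

```python
import pandas as pd
from scipy.stats import ttest_ind

# Data copied from the question.
df = pd.DataFrame({
    "Apple f2 m": [3, 12, 5, 3, 7],
    "Apple f2 t": [4, 7, 9, 3, 1],
    "Apple f3 m": [5, 4, 7, 4, 2],
    "Apple f3 t": [3, 7, 5, 8, 6],
})

# Select the two groups of columns with regex patterns ('.*' = any characters).
m_cols = df.loc[:, df.columns.str.contains(r"Apple f.* m")]
t_cols = df.loc[:, df.columns.str.contains(r"Apple f.* t")]

# With axis=0 the test runs down the rows, giving one result per column pair.
# This relies on both selections returning columns in the same order.
result = ttest_ind(m_cols.to_numpy(), t_cols.to_numpy(), axis=0)

print(list(m_cols.columns), list(t_cols.columns))
print(result.statistic, result.pvalue)
```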
