Can Pandas DataFrame efficiently calculate PMI (Pointwise Mutual Information)?

jfive Published at Dev

jfive

I've looked around and, surprisingly, haven't found a framework or existing code that makes it easy to calculate Pointwise Mutual Information (Wiki PMI), even though libraries like Scikit-learn offer a metric for overall Mutual Information (by histogram). This is in the context of Python and Pandas!

My problem:

I have a DataFrame with one [x, y] example per row and wish to calculate a PMI value for each row, as per the formula (or a simpler one):

PMI(x, y) = log( p(x,y) / (p(x) * p(y)) )

So far my approach is:

def pmi_func(df, x, y):
    df['freq_x'] = df.groupby(x).transform('count')
    df['freq_y'] = df.groupby(y).transform('count')
    df['freq_x_y'] = df.groupby([x, y]).transform('count')
    df['pmi'] = np.log( df['freq_x_y'] / (df['freq_x'] * df['freq_y']) )

Would this give a valid and/or efficient computation?

Sample I/O:

x  y  PMI
0  0  0.176
0  0  0.176
0  1  0

Zero

I would change three things.

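import numpy as np

def pmi(dff, x, y):
    df = dff.copy()  # work on a copy so the caller's frame is untouched
    # per-row counts of the x value, the y value, and the (x, y) pair
    df['f_x'] = df.groupby(x)[x].transform('count')
    df['f_y'] = df.groupby(y)[y].transform('count')
    df['f_xy'] = df.groupby([x, y])[x].transform('count')
    # N * f_xy / (f_x * f_y) equals p(x,y) / (p(x) * p(y))
    df['pmi'] = np.log(len(df.index) * df['f_xy'] / (df['f_x'] * df['f_y']))
    return df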
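
To see why the function selects a single column before calling transform, here is a minimal sketch on a made-up four-row frame (the data and the column names "x" and "y" are assumptions for illustration only). Without the selection, transform returns a DataFrame with one column per non-grouping column; with it, the result is a single Series that can be assigned to one new column.

```python
import pandas as pd

# Hypothetical toy data, not taken from the question.
toy = pd.DataFrame({"x": ["a", "a", "b", "b"], "y": ["u", "u", "u", "v"]})

# No column selected: counts are produced for every non-grouping column.
wide = toy.groupby("x").transform("count")

# One column selected: a single Series of per-row group counts.
narrow = toy.groupby("x")["x"].transform("count")

print(type(wide).__name__, list(wide.columns))
print(type(narrow).__name__, narrow.tolist())
```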

Use df.groupby(x)[x].transform('count') and df.groupby(y)[y].transform('count') so that only the count is returned, as a single column.
Use np.log(len(df.index) * df['f_xy'] / (df['f_x'] * df['f_y'])) so that the ratio is computed from probabilities rather than raw counts.
Work on a copy of the DataFrame rather than modifying the input DataFrame.
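
On the efficiency part of the question: the function above returns one PMI value per input row, so repeated pairs are stored repeatedly. If one value per distinct (x, y) pair is enough, a possible alternative is to count pairs once with groupby(...).size() and look up the marginal counts with value_counts(). The sketch below is only one way to do this; the helper name pmi_per_pair and the toy data are hypothetical, and it uses the natural log like the function above. It also cross-checks the count-based expression against probabilities computed directly from the definition.

```python
import numpy as np
import pandas as pd

def pmi_per_pair(df, x, y):
    """Return one row per distinct (x, y) pair with its count and PMI (natural log)."""
    n = len(df)
    # Joint counts: one row per distinct pair instead of one per input row.
    joint = df.groupby([x, y]).size().rename("f_xy").reset_index()
    # Marginal counts, indexed by value so they can be looked up with map().
    f_x = df[x].value_counts()
    f_y = df[y].value_counts()
    # n * f_xy / (f_x * f_y) is p(x,y) / (p(x) * p(y)) written in counts.
    joint["pmi"] = np.log(n * joint["f_xy"] / (joint[x].map(f_x) * joint[y].map(f_y)))
    return joint

# Hypothetical toy data, not taken from the question.
toy = pd.DataFrame({"x": ["a", "a", "b", "b"], "y": ["u", "u", "u", "v"]})
table = pmi_per_pair(toy, "x", "y")

# Cross-check against probabilities computed directly from the definition.
p_x = toy["x"].value_counts(normalize=True)
p_y = toy["y"].value_counts(normalize=True)
p_xy = table["f_xy"] / len(toy)
expected = np.log(p_xy / (table["x"].map(p_x) * table["y"].map(p_y)))
assert np.allclose(table["pmi"], expected)

print(table)
```

The per-pair table can be merged back onto the original frame on the x and y columns if a per-row value is needed after all.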
