Can Pandas DataFrame efficiently calculate PMI (Pointwise Mutual Information)?

jfive

I've looked around and, surprisingly, haven't found a framework or existing code that makes it easy to calculate Pointwise Mutual Information (Wiki PMI), even though libraries like Scikit-learn offer a metric for overall Mutual Information (by histogram). This is in the context of Python and Pandas!

My problem:

I have a DataFrame with one [x, y] example per row and wish to calculate a PMI value for each row as per the formula (or a simpler one):

PMI(x, y) = log( p(x,y) / (p(x) * p(y)) )

So far my approach is:

def pmi_func(df, x, y):
    df['freq_x'] = df.groupby(x).transform('count')
    df['freq_y'] = df.groupby(y).transform('count')
    df['freq_x_y'] = df.groupby([x, y]).transform('count')
    df['pmi'] = np.log( df['freq_x_y'] / (df['freq_x'] * df['freq_y']) )

Would this give a valid and/or efficient computation?

Sample I/O:

x  y  PMI
0  0  0.176
0  0  0.176
0  1  0
Zero

I would change three things:

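import numpy as np

def pmi(dff, x, y):
    # Work on a copy so the caller's DataFrame is not modified.
    df = dff.copy()
    # Marginal and joint counts, broadcast back onto every row.
    df['f_x'] = df.groupby(x)[x].transform('count')
    df['f_y'] = df.groupby(y)[y].transform('count')
    df['f_xy'] = df.groupby([x, y])[x].transform('count')
    # log(p(x,y) / (p(x) * p(y))) = log(N * f_xy / (f_x * f_y)), N = number of rows
    df['pmi'] = np.log(len(df.index) * df['f_xy'] / (df['f_x'] * df['f_y']))
    return df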
  1. df.groupby(x)[x].transform('count') and df.groupby(y)[y].transform('count') should be used so that only the count is returned, as a single Series.
  2. np.log(len(df.index) * df['f_xy'] / (df['f_x'] * df['f_y'])) should be used so that the formula works on probabilities rather than raw counts.
  3. Work on a copy of the DataFrame rather than modifying the input DataFrame.
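
To see why the row count appears in point 2: with N rows, p(x) = f_x / N, p(y) = f_y / N and p(x,y) = f_xy / N, so p(x,y) / (p(x) * p(y)) simplifies to N * f_xy / (f_x * f_y). The sketch below repeats the function so it runs on its own and checks it by hand on a small made-up DataFrame; the column names and the sample values are assumptions for illustration only, not data from the question.

```python
import numpy as np
import pandas as pd


def pmi(dff, x, y):
    """Per-row PMI from group counts, using the natural log."""
    df = dff.copy()  # leave the caller's DataFrame untouched
    n = len(df.index)
    df['f_x'] = df.groupby(x)[x].transform('count')
    df['f_y'] = df.groupby(y)[y].transform('count')
    df['f_xy'] = df.groupby([x, y])[x].transform('count')
    df['pmi'] = np.log(n * df['f_xy'] / (df['f_x'] * df['f_y']))
    return df


# Hypothetical toy data: four (x, y) observations.
sample = pd.DataFrame({'x': [0, 0, 1, 1], 'y': [0, 0, 0, 1]})
result = pmi(sample, 'x', 'y')

# Hand check for the pair (0, 0): f_xy = 2, f_x = 2, f_y = 3, N = 4,
# so p(x,y) = 2/4, p(x) = 2/4, p(y) = 3/4.
by_hand = np.log((2 / 4) / ((2 / 4) * (3 / 4)))
matches = bool(np.isclose(result['pmi'].iloc[0], by_hand))

print(result)
print("first row matches hand computation:", matches)
```

np.log is the natural logarithm; if PMI in bits is wanted, np.log2 can be swapped in. Every row with the same (x, y) pair gets the same value, so result.drop_duplicates(['x', 'y']) gives one row per pair.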
