Split Pandas Dataframe into separate pieces based on column values

edesz

I am looking to perform some Inner Joins in Pandas, using Python 2.7. Here is the dataset that I am working with:

import pandas as pd
import numpy as np

columns = ['s_id', 'c_id', 'c_col1']
index = np.arange(46) # array of numbers for the number of samples
df = pd.DataFrame(columns=columns, index = index)

df.s_id[:15] = 144
df.s_id[15:27] = 105
df.s_id[27:46] = 52

df.c_id[:5] = 1
df.c_id[5:10] = 2
df.c_id[10:15] = 3
df.c_id[15:19] = 1
df.c_id[19:27] = 2
df.c_id[27:34] = 1
df.c_id[34:39] = 2
df.c_id[39:46] = 3

df.c_col1[:5] = ['H', 'C', 'N', 'O', 'S']
df.c_col1[5:10] = ['C', 'O','S','K','Ca']
df.c_col1[10:15] = ['H', 'O','F','Ne','Si']
df.c_col1[15:19] = ['C', 'O', 'F', 'Zn']
df.c_col1[19:27] = ['N', 'O','F','Fe','Zn','Gd','Hg','Pb']
df.c_col1[27:34] = ['H', 'He', 'Li', 'B', 'N','Al','Si']
df.c_col1[34:39] = ['N', 'F','Ne','Na','P']
df.c_col1[39:46] = ['C', 'N','O','F','K','Ca', 'Fe']

Here is the dataframe:

   s_id c_id c_col1
0   144    1      H
1   144    1      C
2   144    1      N
3   144    1      O <--
4   144    1      S
5   144    2      C
6   144    2      O <--
7   144    2      S
8   144    2      K
9   144    2     Ca
10  144    3      H
11  144    3      O <--
12  144    3      F
13  144    3     Ne
14  144    3     Si
15  105    1      C
16  105    1      O
17  105    1      F
18  105    1     Zn
19  105    2      N
20  105    2      O
21  105    2      F
22  105    2     Fe
23  105    2     Zn
24  105    2     Gd
25  105    2     Hg
26  105    2     Pb
27   52    1      H
28   52    1     He
29   52    1     Li
30   52    1      B
31   52    1      N
32   52    1     Al
33   52    1     Si
34   52    2      N
35   52    2      F
36   52    2     Ne
37   52    2     Na
38   52    2      P
39   52    3      C
40   52    3      N
41   52    3      O
42   52    3      F
43   52    3      K
44   52    3     Ca
45   52    3     Fe

I need to do the following in Pandas:

  1. Within a given s_id, produce separate dataframes for each c_id value, e.g. for s_id = 144 there will be 3 dataframes, while for s_id = 105 there will be 2 dataframes.
  2. Inner Join the separate dataframes produced in step 1 on the elements column (c_col1). This is a little difficult to explain, so here is the dataframe I would like to get from this step:

    index s_id c_id c_col1

    0   144    1      O
    1   144    2      O
    2   144    3      O
    3   105    1      O
    4   105    2      F
    5    52    1      N
    6    52    2      N
    7    52    3      N
    

    As you can see, what I am looking for in step 2 is the following: within each s_id, I want those c_col1 values that occur for all of the c_id values. E.g. in the case of s_id = 144, only O (oxygen) occurs for c_id = 1, 2 and 3; I have pointed to these entries with "<--" in the raw data. So the dataframe should show O 3 times in the c_col1 column, with the corresponding c_id entries being 1, 2 and 3.

Conditions:

  1. The number of unique c_ids is not known ahead of time, i.e. for one particular s_id I do not know whether there will be 1, 2 and 3 or just 1 and 2. This means that if 1, 2 and 3 occur there will be two Inner Joins, while if only 1 and 2 occur there will be only one Inner Join.

How can this be done with Pandas?

Woody Pride

Producing the separate dataframes is easy enough. How would you want to store them? One way would be a nested dict where the outer keys are the s_id values, the inner keys are the c_id values, and the inner values are the data. You can build that with a fairly long but straightforward dict comprehension:

DF_dict = {s_id : 
          {c_id : df[(df.s_id == s_id) & (df.c_id == c_id)] for c_id in df[df.s_id == s_id]['c_id'].unique()} 
          for s_id in df.s_id.unique()}

Then for example:

In [12]: DF_dict[52][2]
Out[12]:
   s_id c_id c_col1
34   52    2      N
35   52    2      F
36   52    2     Ne
37   52    2     Na
38   52    2      P

I do not fully understand part two of your question. Do you want to join the data within each s_id? Could you show what the expected output would be? If you want to do something within each s_id you might be better off exploring groupby options. Perhaps someone else understands what you want, but if you can clarify I might be able to show a better option that skips the first part of the question...
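For reference, the kind of groupby option I have in mind would build the same lookup in one pass; a minimal sketch (the name DF_dict2 is just for illustration):

# Equivalent to the nested dict above, but keyed by (s_id, c_id) tuples.
DF_dict2 = {key: chunk for key, chunk in df.groupby(['s_id', 'c_id'])}
# e.g. DF_dict2[(52, 2)] holds the same rows as DF_dict[52][2]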

##################EDIT

It seems to me that you should go straight to problem 2 if problem 1 is simply a step you believe is necessary to get to a problem 2 solution; in fact it is entirely unnecessary. To solve your second problem you need to group the data by s_id and transform each group according to your requirements. To sum up your requirements as I see them, the rule is as follows: for each group of data grouped by s_id, return only those c_col1 values that occur for every value of c_id.

You might write a function like this:

def c_id_overlap(df):
    common_vals = [] # container for values of c_col1 that are in every c_id subgroup
    c_ids = df.c_id.unique() #get unique values of c_id
    c_col1_values = set(df.c_col1) # get a set of c_col1 values
    #create nested list of values. Each inner list contains the c_col1 values for each c_id
    nested_c_col_vals = [list(df[df.c_id == ID]['c_col1'].unique()) for ID in c_ids]
    #Iterate through the c_col1_values and see if they are in every nested list
    for val in c_col1_values:
        if all([True if val in elem else False for elem in nested_c_col_vals]):
            common_vals.append(val)
    #return a slice of the dataframe that only contains values of c_col1 that are in every
    #c_id
    return df[df.c_col1.isin(common_vals)]

and then pass it to apply on data grouped by s_id:

df.groupby('s_id', as_index = False).apply(c_id_overlap)

which gives me the following output:

     s_id c_id c_col1
0 31   52    1      N
  34   52    2      N
  40   52    3      N
1 16  105    1      O
  17  105    1      F
  18  105    1     Zn
  20  105    2      O
  21  105    2      F
  23  105    2     Zn
2 3   144    1      O
  6   144    2      O
  11  144    3      O

Which seems to be what you are looking for.
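Note that apply keeps a group level on the index (the 0, 1, 2 on the left). If you want a flat index like the one in your expected output, something along these lines should do it (a sketch):

result = df.groupby('s_id', as_index=False).apply(c_id_overlap)
result = result.reset_index(drop=True)  # drop the group/original index levels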

###########EDIT: Additional Explanation:

So apply passes each chunk of grouped data to the function, and the pieces are glued back together once this has been done for each group of data.

So think about the group where s_id == 105. The first line of the function creates an empty list, common_vals, which will contain those elements of the periodic table that appear in every subgroup of the data (i.e. for each of the values of c_id).

The second line gets the unique values of c_id, in this case [1, 2], and stores them in an array called c_ids.

The third line creates a set of the values of c_col1 which in this case produces:

 {'C', 'F', 'Fe', 'Gd', 'Hg', 'N', 'O', 'Pb', 'Zn'}

The fourth line creates a nested list structure, nested_c_col_vals, where each inner list holds the unique c_col1 values for one of the entries in the c_ids array. In this case it looks like this:

[['C', 'O', 'F', 'Zn'], ['N', 'O', 'F', 'Fe', 'Zn', 'Gd', 'Hg', 'Pb']]
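Incidentally, the loop that follows is just computing the intersection of these inner lists; as a set operation the same thing would look like this (a sketch with the lists written out by hand):

inner = [['C', 'O', 'F', 'Zn'], ['N', 'O', 'F', 'Fe', 'Zn', 'Gd', 'Hg', 'Pb']]
set(inner[0]).intersection(*inner[1:])  # {'O', 'F', 'Zn'}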

Now each of the elements in the c_col1_values set is iterated over, and for each of those elements the program determines whether that element appears in every inner list of the nested_c_col_vals object. The built-in all function determines whether every item in the sequence between the brackets is True, or more precisely whether it is truthy (non-zero, non-empty). So:

In [10]: all([True, True, True])
Out[10]: True

In [11]: all([True, True, True, False])
Out[11]: False

In [12]: all([True, True, True, 1])
Out[12]: True

In [13]: all([True, True, True, 0])
Out[13]: False

In [14]: all([True, 1, True, 0])
Out[14]: False 

So in this case, let's say 'C' is the first element iterated over. The list comprehension inside the all() brackets says: look inside each inner list and see if the element is there; if it is, the entry is True, and if it is not, it is False. So in this case this resolves to:

all([True, False])

which is of course False. Now when the element is 'Zn' the result of this operation is

all([True, True])

which resolves to True. Therefore 'Zn' is appended to the common_vals list.

Once the process is complete the values inside common_vals are:

['O', 'F', 'Zn']

The return statement simply slices the data chunk according to whether the values of c_col1 are in the list common_vals, as per above.
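To see that slice in isolation you could run it on the s_id == 105 chunk by hand (a small sketch using the common_vals found above):

chunk = df[df.s_id == 105]
chunk[chunk.c_col1.isin(['O', 'F', 'Zn'])]
# keeps rows 16, 17, 18, 20, 21 and 23 -- the O, F and Zn entries for both c_ids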

This is then repeated for each of the remaining groups and the data are glued back together.
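Finally, if you would rather not write the helper function at all, the same rule can be expressed with two grouped counts. This is only a sketch, and it assumes your pandas version supports transform with a callable:

# An element qualifies when the number of distinct c_ids it appears with inside
# its s_id equals the total number of distinct c_ids for that s_id.
n_cids = df.groupby('s_id')['c_id'].transform(lambda s: s.nunique())
n_cids_with_elem = df.groupby(['s_id', 'c_col1'])['c_id'].transform(lambda s: s.nunique())
result = df[n_cids_with_elem == n_cids]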

Hope this helps
