How do I flatten a PySpark DataFrame by one array column?

Philipp_Kats

I have a spark dataframe like this:

+------+--------+--------------+--------------------+
|   dbn|    boro|total_students|                sBus|
+------+--------+--------------+--------------------+
|17K548|Brooklyn|           399|[B41, B43, B44-SB...|
|09X543|   Bronx|           378|[Bx13, Bx15, Bx17...|
|09X327|   Bronx|           543|[Bx1, Bx11, Bx13,...|
+------+--------+--------------+--------------------+

How do I flatten it so that each row is copied once for each element in sBus, and sBus becomes a normal string column?

The result should look like this:

+------+--------+--------------+--------------------+
|   dbn|    boro|total_students|                sBus|
+------+--------+--------------+--------------------+
|17K548|Brooklyn|           399| B41                |
|17K548|Brooklyn|           399| B43                |
|17K548|Brooklyn|           399| B44-SB             |
+------+--------+--------------+--------------------+

and so on...

Galen Long

I can't think of a way to do this without turning it into an RDD.

# convert df to rdd
rdd = df.rdd

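def extract(row, key):
    """Takes a Row and a key, returns a tuple of (dict w/o key, row[key])."""
    _dict = row.asDict()
    _list = _dict.pop(key)
    return (_dict, _list)


def add_to_dict(_dict, key, value):
    """Returns a copy of _dict with key set to value."""
    # copy first: flatMapValues hands the same key dict to every flattened value
    new_dict = dict(_dict)
    new_dict[key] = value
    return new_dict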


# preserve rest of values in key, put list to flatten in value
rdd = rdd.map(lambda x: extract(x, 'sBus'))
# make a row for each item in value
rdd = rdd.flatMapValues(lambda x: x)
# add flattened value back into dictionary
rdd = rdd.map(lambda x: add_to_dict(x[0], 'sBus', x[1]))
# convert back to dataframe
df = sqlContext.createDataFrame(rdd)

df.show()

The tricky part is keeping the other columns together with the newly flattened values. I do this by mapping each row to a tuple of (dict of other columns, list to flatten) and then calling flatMapValues. This will split each element of the value list into a separate row, but keep the keys attached, i.e.

(key, ['A', 'B', 'C'])

becomes

(key, 'A')
(key, 'B')
(key, 'C')

Then I put the flattened value back into the dictionary of other columns and convert the result back to a DataFrame.
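To make the three steps easier to follow, here is a minimal plain-Python sketch of the same idea that runs without Spark. The sample rows are shortened, illustrative stand-ins for the question's data (the real bus lists are truncated in the output above), and `flat_map_values` and `flatten` are hypothetical helper names that only mimic what `flatMapValues` does with an identity function. It is not Spark's implementation.

```python
# Plain-Python sketch of the extract -> flatMapValues -> add-back steps.
# The rows below are illustrative sample data, not the real dataset.
rows = [
    {"dbn": "17K548", "boro": "Brooklyn", "total_students": 399,
     "sBus": ["B41", "B43", "B44-SB"]},
    {"dbn": "09X543", "boro": "Bronx", "total_students": 378,
     "sBus": ["Bx13", "Bx15"]},
]


def extract(row, key):
    """Split a row dict into (dict without key, value stored under key)."""
    rest = dict(row)
    values = rest.pop(key)
    return rest, values


def flat_map_values(pairs):
    """Mimic flatMapValues with the identity function:
    emit one (key, element) pair per element of each value list."""
    for k, values in pairs:
        for v in values:
            yield k, v


def flatten(rows, key):
    """Return one output dict per element of row[key]."""
    pairs = (extract(row, key) for row in rows)
    # Build a fresh dict per output row so the rows do not share state.
    return [dict(rest, **{key: value}) for rest, value in flat_map_values(pairs)]


flat = flatten(rows, "sBus")
for row in flat:
    print(row)
```

Each input row yields as many output rows as its list has elements, and the other columns are repeated unchanged, which is the shape the question asks for.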
