Is it possible to groupBy a Spark DataFrame when not all values are present in the column?

antonioACR1

For example, if I have the following dataframe

val tempDF=Seq(("a",2),("b",1),("a",3)).toDF("letter","value")

scala> tempDF.show()
+------+-----+
|letter|value|
+------+-----+
|     a|    2|
|     b|    1|
|     a|    3|
+------+-----+

and I want to perform a groupBy operation on the column letter, knowing that there could be another letter, c, that is not present in that column. Normally I would have

tempDF.groupBy("letter").sum()

scala> tempDF.groupBy("letter").sum().show()
+------+----------+
|letter|sum(value)|
+------+----------+
|     a|         5|
|     b|         1|
+------+----------+

but I would like something like this:

+------+----------+
|letter|sum(value)|
+------+----------+
|     a|         5|
|     b|         1|
|     c|         0|
+------+----------+

Is it possible to do this without somehow adding the letter c to the dataframe? I could have many dataframes in a list, and I don't know which letters (if any) are missing from each one; I only know the full list of letters that should appear in each.

AbhishekN

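If you already know all possible values, create a separate (universal) Dataset that contains every letter with 'value' set to 0. Then combine it with any tempDF (for example with a union, since both have the same columns) to add the missing letters, and do the groupBy on the final dataset.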
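A minimal sketch of that idea, written for spark-shell or a Scala script. The names allLetters, zerosDF, viaUnion and viaJoin are made up for illustration, and reading "join" as a union of zero-valued rows is my interpretation of the answer; the second variant (group first, then left-join from the full key list) is a swapped-in alternative, not something the answer states.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{coalesce, lit, sum}

// In spark-shell, getOrCreate() simply returns the existing session.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("groupby-missing-keys")
  .getOrCreate()
import spark.implicits._

val tempDF = Seq(("a", 2), ("b", 1), ("a", 3)).toDF("letter", "value")

// Assumption: the full list of expected letters is known up front.
val allLetters = Seq("a", "b", "c")

// "Universal" dataset: one row per expected letter, with value 0.
val zerosDF = allLetters.map(l => (l, 0)).toDF("letter", "value")

// Variant 1: append the zero rows, then group. The zeros do not change
// existing sums, but they guarantee a group for every expected letter.
val viaUnion = tempDF.union(zerosDF).groupBy("letter").sum("value")

// Variant 2: group first, then left-join from the full key list and
// replace the nulls of missing letters with 0.
val grouped = tempDF.groupBy("letter").agg(sum("value").as("total"))
val viaJoin = allLetters.toDF("letter")
  .join(grouped, Seq("letter"), "left")
  .select($"letter", coalesce($"total", lit(0L)).as("sum(value)"))
```

Both results contain a with 5, b with 1 and c with 0; row order is not guaranteed, so add orderBy("letter") before show() if it matters. For a list of dataframes that share this schema, the same step can be mapped over the list, e.g. dfs.map(df => df.union(zerosDF).groupBy("letter").sum("value")), where dfs is a hypothetical Seq[DataFrame].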

Edited at 2021-08-12
