Given the following DataFrame:
df = sc.parallelize([
('2017-05-21', 'a'),
('2017-05-21', 'c'),
('2017-05-22', 'b'),
('2017-05-22', 'c'),
('2017-05-23', 'a'),
('2017-05-23', 'b'),
('2017-05-23', 'c'),
('2017-05-23', 'c'),
]).toDF(['date', 'foo'])
I would like to get the daily percentage of rows where foo == 'a':
+----------+----------+
| date|percentage|
+----------+----------+
|2017-05-21| 0.5|
|2017-05-22| 0.0|
|2017-05-23| 0.25|
+----------+----------+
This is what I came up with:
from pyspark.sql import functions as func
from pyspark.sql.functions import col

(df.withColumn('foo_a', df.foo == 'a')
   .groupby('date')
   .agg((func.sum(col('foo_a').cast('integer')) / func.count('*')).alias('percentage'))
   .sort('date'))
This works, but I feel like there should be an easier way. Specifically, is there an aggregate function for counting the occurrences of a certain value?
mean / avg combined with when:
from pyspark.sql.functions import avg, col, when
df.groupBy("date").agg(avg(when(col("foo") == "a", 1).otherwise(0)))
or cast:
df.groupBy("date").agg(avg((col("foo") == "a").cast("integer")))
is all you need.
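As for an aggregate function that counts occurrences of a specific value: there is no dedicated one, but count combined with when covers it. when without an otherwise clause yields NULL for non-matching rows, and count only counts non-null values, so a sketch of a per-day occurrence count looks like:

from pyspark.sql.functions import count, when, col

df.groupBy("date").agg(count(when(col("foo") == "a", True)).alias("count_a"))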