A simpler way to calculate grouped percentages in a Spark dataframe?

Sebastian Dziadzio

Given the following dataframe:

df = sc.parallelize([
    ('2017-05-21', 'a'),
    ('2017-05-21', 'c'),
    ('2017-05-22', 'b'),
    ('2017-05-22', 'c'),
    ('2017-05-23', 'a'),
    ('2017-05-23', 'b'),
    ('2017-05-23', 'c'),
    ('2017-05-23', 'c'),
]).toDF(['date', 'foo'])

I would like to get the daily percentage of rows where foo == 'a':

+----------+----------+
|      date|percentage|
+----------+----------+
|2017-05-21|       0.5|
|2017-05-22|       0.0|
|2017-05-23|      0.25|
+----------+----------+

This is what I came up with:

from pyspark.sql import functions as func
from pyspark.sql.functions import col

(df.withColumn('foo_a', df.foo == 'a')
   .groupby('date')
   .agg((func.sum(col('foo_a').cast('integer')) / func.count('*')).alias('percentage'))
   .sort('date'))

This works, but I feel like there should be an easier way. Specifically, is there an aggregate function for counting the occurrences of a certain value?

zero323

mean / avg combined with when:

from pyspark.sql.functions import avg, col, when

df.groupBy("date").agg(avg(when(col("foo") == "a", 1).otherwise(0)))

or cast:

df.groupBy("date").agg(avg((col("foo") == "a").cast("integer")))

is all you need.
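To see why averaging works here, the following is a minimal plain-Python sketch (no Spark involved) of the same idea: turn each row into a 0/1 indicator and take the mean per date, which is the same as dividing the number of matches by the group size. The helper name daily_fraction is hypothetical and only for illustration; the rows are copied from the question.

```python
from collections import defaultdict

# Same rows as the question's dataframe, as plain (date, foo) tuples.
rows = [
    ('2017-05-21', 'a'),
    ('2017-05-21', 'c'),
    ('2017-05-22', 'b'),
    ('2017-05-22', 'c'),
    ('2017-05-23', 'a'),
    ('2017-05-23', 'b'),
    ('2017-05-23', 'c'),
    ('2017-05-23', 'c'),
]


def daily_fraction(rows, target):
    """Hypothetical helper: mean of a 0/1 indicator per date."""
    indicators = defaultdict(list)
    for date, foo in rows:
        # 1 when the row matches, 0 otherwise (the role of when/otherwise or cast).
        indicators[date].append(1 if foo == target else 0)
    # The mean of the indicators equals matches / group size.
    return {date: sum(flags) / len(flags)
            for date, flags in sorted(indicators.items())}


result = daily_fraction(rows, 'a')
print(result)
```

The printed values should match the table in the question (0.5, 0.0 and 0.25 for the three dates), which is what avg over the when / cast expression computes within each group.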
