Given the following DataFrame:
df = sc.parallelize([
('2017-05-21', 'a'),
('2017-05-21', 'c'),
('2017-05-22', 'b'),
('2017-05-22', 'c'),
('2017-05-23', 'a'),
('2017-05-23', 'b'),
('2017-05-23', 'c'),
('2017-05-23', 'c'),
]).toDF(['date', 'foo'])
I would like to get the daily percentage of rows where foo == 'a':
+----------+----------+
| date|percentage|
+----------+----------+
|2017-05-21| 0.5|
|2017-05-22| 0.0|
|2017-05-23| 0.25|
+----------+----------+
This is what I came up with:
from pyspark.sql import functions as func
from pyspark.sql.functions import col

(df.withColumn('foo_a', df.foo == 'a')
   .groupby('date')
   .agg((func.sum(col('foo_a').cast('integer')) / func.count('*')).alias('percentage'))
   .sort('date'))
This works, but I feel like there should be an easier way. Specifically, is there an aggregate function for counting the occurrences of a certain value?
mean / avg combined with when:
from pyspark.sql.functions import avg, col, when
df.groupBy("date").agg(avg(when(col("foo") == "a", 1).otherwise(0)))
or cast:
df.groupBy("date").agg(avg((col("foo") == "a").cast("integer")))
is all you need.
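As for an aggregate function that counts occurrences of a specific value: there is no dedicated one, but count combined with when covers it. when without an otherwise clause yields NULL for non-matching rows, and count only counts non-null values, so a sketch of a per-day occurrence count looks like:

from pyspark.sql.functions import count, when, col

df.groupBy("date").agg(count(when(col("foo") == "a", True)).alias("count_a"))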