Use Spark to group by consecutive same values of one column, taking Max or Min value of another column for each group


madprogrammer

Suppose I have the following data frame:

+-------------------+------+------------+
|               Date|   Val|   Condition|
+-------------------+------+------------+
|2020-10-02 10:00:00|211.39|         Max|
|2020-10-02 10:10:00|210.94|         Min|
|2020-10-02 10:30:00|209.21|         Max|
|2020-10-02 11:20:00|207.48|         Min|
|2020-10-02 11:50:00|207.22|         Min| <- take only this row because it's less than 207.48
|2020-10-02 12:10:00|207.58|         Max|
|2020-10-02 12:40:00|207.45|         Min|
|2020-10-02 13:10:00|207.45|         Min| <- take either row because they are equal
|2020-10-02 13:40:00| 208.7|         Max| <- take only this row because it's greater than 208.31
|2020-10-02 14:10:00|208.31|         Max| 
|2020-10-02 14:20:00|208.16|         Min|
|2020-10-02 14:30:00| 208.3|         Max|
|2020-10-02 14:50:00|208.25|         Min|
|2020-10-02 15:10:00| 208.7|         Max|
|2020-10-02 15:30:00|208.08|         Min|
|2020-10-02 16:00:00| 208.0|         Min| <- take only this row because it's less than 208.08
|2020-10-02 16:30:00|208.35|         Max|
|2020-10-02 16:40:00|208.26|         Min|
|2020-10-02 16:50:00|208.27|         Max|
|2020-10-02 17:30:00|208.06|         Min|
+-------------------+------+------------+

How can I group it by consecutive values of Condition, taking the max or min value of Val for each group (see the comments in the data frame above)? The resulting data frame should look something like this:

+-------------------+------+------------+
|               Date|   Val|   Condition|
+-------------------+------+------------+
|2020-10-02 10:00:00|211.39|         Max|
|2020-10-02 10:10:00|210.94|         Min|
|2020-10-02 10:30:00|209.21|         Max|
|2020-10-02 11:50:00|207.22|         Min|
|2020-10-02 12:10:00|207.58|         Max|
|2020-10-02 12:40:00|207.45|         Min|
|2020-10-02 13:40:00| 208.7|         Max|
|2020-10-02 14:20:00|208.16|         Min|
|2020-10-02 14:30:00| 208.3|         Max|
|2020-10-02 14:50:00|208.25|         Min|
|2020-10-02 15:10:00| 208.7|         Max|
|2020-10-02 16:00:00| 208.0|         Min|
|2020-10-02 16:30:00|208.35|         Max|
|2020-10-02 16:40:00|208.26|         Min|
|2020-10-02 16:50:00|208.27|         Max|
|2020-10-02 17:30:00|208.06|         Min|
+-------------------+------+------------+

The goal is:

for each group of more than one consecutive row with the same Condition (Max or Min),
take only one row from that group. Which row to take is determined by the value of Condition: the row with the maximum value of Val for Max, or the minimum value of Val for Min.
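To make the rule concrete, here is a minimal sketch of it on a plain Scala list rather than a Spark data frame. The `Reading` case class, the `collapseRuns` helper, and the object name are hypothetical names introduced only for this illustration; it assumes the rows are already sorted by Date.

```scala
import scala.annotation.tailrec

object CollapseRunsSketch {
  // Hypothetical in-memory stand-in for one row of the data frame.
  final case class Reading(date: String, value: Double, condition: String)

  // Walk the list run by run: a run is a maximal stretch of consecutive
  // rows with the same condition. Keep the smallest value of a "Min" run
  // and the largest value of a "Max" run.
  @tailrec
  def collapseRuns(rows: List[Reading], acc: List[Reading] = Nil): List[Reading] =
    rows match {
      case Nil => acc.reverse
      case head :: _ =>
        val (run, rest) = rows.span(_.condition == head.condition)
        val kept =
          if (head.condition == "Min") run.minBy(_.value)
          else run.maxBy(_.value)
        collapseRuns(rest, kept :: acc)
    }

  def main(args: Array[String]): Unit = {
    // A few rows copied from the data frame above.
    val sample = List(
      Reading("2020-10-02 11:20:00", 207.48, "Min"),
      Reading("2020-10-02 11:50:00", 207.22, "Min"),
      Reading("2020-10-02 12:10:00", 207.58, "Max"),
      Reading("2020-10-02 12:40:00", 207.45, "Min"),
      Reading("2020-10-02 13:10:00", 207.45, "Min")
    )
    collapseRuns(sample).foreach(println)
  }
}
```

Runs of a single row pass through unchanged, and a run of equal values keeps one of its rows, which matches the "take either row" case.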

Sathiyan S

Try this,

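import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import spark.implicits._ // assumes a SparkSession named spark; enables the $"..." syntax

// Order all rows by Date (no partitionBy, so every row goes through one partition)
val wind = Window.orderBy("Date")
// val1: if the next row has the same Condition, min/max over this row and the next;
// otherwise, if the previous row matches, min/max over the previous row and this one
val df1 = df.withColumn("val1",
  when($"Condition" === lead($"Condition", 1).over(wind),
    when($"Condition" === "Min", min($"Val").over(wind.rowsBetween(0, 1)))
      .otherwise(max($"Val").over(wind.rowsBetween(0, 1))))
  .when($"Condition" === lag($"Condition", 1).over(wind),
    when($"Condition" === "Min", min($"Val").over(wind.rowsBetween(-1, 0)))
      .otherwise(max($"Val").over(wind.rowsBetween(-1, 0))))
  .otherwise($"Val"))

// rn = 2 for a row that matches the previous row but not the next one; those rows are dropped
val df2 = df1.withColumn("rn", when($"Condition" === lead($"Condition", 1).over(wind), 1)
    .when($"Condition" === lag($"Condition", 1).over(wind), 2)
    .otherwise(1))
  .withColumn("Val", $"val1").filter($"rn" === 1).drop("rn", "val1")

df2.show(false)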

+-------------------+------+---------+
|Date               |Val   |Condition|
+-------------------+------+---------+
|2020-10-02 10:00:00|211.39|Max      |
|2020-10-02 10:10:00|210.94|Min      |
|2020-10-02 10:30:00|209.21|Max      |
|2020-10-02 11:20:00|207.22|Min      |
|2020-10-02 12:10:00|207.58|Max      |
|2020-10-02 12:40:00|207.45|Min      |
|2020-10-02 13:40:00|208.7 |Max      |
|2020-10-02 14:20:00|208.16|Min      |
|2020-10-02 14:30:00|208.3 |Max      |
|2020-10-02 14:50:00|208.25|Min      |
|2020-10-02 15:10:00|208.7 |Max      |
|2020-10-02 15:30:00|208.0 |Min      |
|2020-10-02 16:30:00|208.35|Max      |
|2020-10-02 16:40:00|208.26|Min      |
|2020-10-02 16:50:00|208.27|Max      |
|2020-10-02 17:30:00|208.06|Min      |
+-------------------+------+---------+

Let me know if it helps you.
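Note that the lead/lag comparison above only looks one row in each direction, and the kept row carries the Date of the first row of the pair (11:20 together with 207.22 in the output above). For runs of any length, and to keep the Date of the row that actually holds the min or max, one option is to number the runs first and then rank the rows inside each run. The following is only a sketch under some assumptions: Val is a numeric column, Date sorts correctly as given, and the names `ConsecutiveGroups`, `collapseRuns`, `is_new_run`, `run_id` and `rn` are made up for the example.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

object ConsecutiveGroups {
  // Keep one row per run of consecutive equal Condition values:
  // the lowest Val of a "Min" run, the highest Val of a "Max" run.
  def collapseRuns(df: DataFrame): DataFrame = {
    // No partitionBy, so Spark pulls all rows into a single partition here.
    val byDate = Window.orderBy("Date")

    // 0 when Condition equals the previous row's Condition, otherwise 1
    // (the first row has no previous row, so it also gets 1).
    val isNewRun =
      when(lag(col("Condition"), 1).over(byDate) === col("Condition"), 0).otherwise(1)

    // Running total of the run starts gives every run its own id.
    val runningTotal =
      byDate.rowsBetween(Window.unboundedPreceding, Window.currentRow)

    // Sort key inside a run: Val for "Min" runs, -Val for "Max" runs,
    // so the wanted row always comes first. Date breaks ties.
    val sortKey = when(col("Condition") === "Min", col("Val")).otherwise(-col("Val"))
    val withinRun = Window.partitionBy("run_id").orderBy(sortKey, col("Date"))

    df.withColumn("is_new_run", isNewRun)
      .withColumn("run_id", sum(col("is_new_run")).over(runningTotal))
      .withColumn("rn", row_number().over(withinRun))
      .filter(col("rn") === 1)
      .drop("is_new_run", "run_id", "rn")
      .orderBy("Date")
  }

  def main(args: Array[String]): Unit = {
    // Local session only for trying the sketch out.
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("consecutive-groups")
      .getOrCreate()
    import spark.implicits._

    // A few rows copied from the question.
    val df = Seq(
      ("2020-10-02 11:20:00", 207.48, "Min"),
      ("2020-10-02 11:50:00", 207.22, "Min"),
      ("2020-10-02 12:10:00", 207.58, "Max"),
      ("2020-10-02 12:40:00", 207.45, "Min"),
      ("2020-10-02 13:10:00", 207.45, "Min"),
      ("2020-10-02 13:40:00", 208.7, "Max"),
      ("2020-10-02 14:10:00", 208.31, "Max"),
      ("2020-10-02 14:20:00", 208.16, "Min")
    ).toDF("Date", "Val", "Condition")

    collapseRuns(df).show(false)
    spark.stop()
  }
}
```

Because the window has no partitionBy, this (like the code above) processes everything in one partition, so it is probably only suitable for data of moderate size unless there is a natural key to partition by.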
