Suppose I have the following data frame:
+-------------------+------+------------+
| Date| Val| Condition|
+-------------------+------+------------+
|2020-10-02 10:00:00|211.39| Max|
|2020-10-02 10:10:00|210.94| Min|
|2020-10-02 10:30:00|209.21| Max|
|2020-10-02 11:20:00|207.48| Min|
|2020-10-02 11:50:00|207.22| Min| <- take only this row because it's less than 207.48
|2020-10-02 12:10:00|207.58| Max|
|2020-10-02 12:40:00|207.45| Min|
|2020-10-02 13:10:00|207.45| Min| <- take either row because they are equal
|2020-10-02 13:40:00| 208.7| Max| <- take only this row because it's greater than 208.31
|2020-10-02 14:10:00|208.31| Max|
|2020-10-02 14:20:00|208.16| Min|
|2020-10-02 14:30:00| 208.3| Max|
|2020-10-02 14:50:00|208.25| Min|
|2020-10-02 15:10:00| 208.7| Max|
|2020-10-02 15:30:00|208.08| Min|
|2020-10-02 16:00:00| 208.0| Min| <- take only this row because it's less than 208.08
|2020-10-02 16:30:00|208.35| Max|
|2020-10-02 16:40:00|208.26| Min|
|2020-10-02 16:50:00|208.27| Max|
|2020-10-02 17:30:00|208.06| Min|
+-------------------+------+------------+
How can I group it by consecutive values of Condition, taking the max or min value of Val for each group (see the comments in the data frame above)? The resulting data frame should look something like the one below.
+-------------------+------+------------+
| Date| Val| Condition|
+-------------------+------+------------+
|2020-10-02 10:00:00|211.39| Max|
|2020-10-02 10:10:00|210.94| Min|
|2020-10-02 10:30:00|209.21| Max|
|2020-10-02 11:50:00|207.22| Min|
|2020-10-02 12:10:00|207.58| Max|
|2020-10-02 12:40:00|207.45| Min|
|2020-10-02 13:40:00| 208.7| Max|
|2020-10-02 14:20:00|208.16| Min|
|2020-10-02 14:30:00| 208.3| Max|
|2020-10-02 14:50:00|208.25| Min|
|2020-10-02 15:10:00| 208.7| Max|
|2020-10-02 16:00:00| 208.0| Min|
|2020-10-02 16:30:00|208.35| Max|
|2020-10-02 16:40:00|208.26| Min|
|2020-10-02 16:50:00|208.27| Max|
|2020-10-02 17:30:00|208.06| Min|
+-------------------+------+------------+
Try this:
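import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import spark.implicits._ // assumes a SparkSession named `spark`

// No partitionBy: Spark moves all rows to a single partition for this window.
val wind = Window.orderBy("Date")
// val1: min/max of Val over this row and its neighbour when both share a Condition.
val df1 = df.withColumn("val1", when($"Condition" === lead($"Condition", 1).over(wind),
    when($"Condition" === "Min", min($"Val").over(wind.rowsBetween(0, 1))).otherwise(max($"Val").over(wind.rowsBetween(0, 1))))
  .when($"Condition" === lag($"Condition", 1).over(wind),
    when($"Condition" === "Min", min($"Val").over(wind.rowsBetween(-1, 0))).otherwise(max($"Val").over(wind.rowsBetween(-1, 0))))
  .otherwise($"Val"))
// rn = 2 marks the second row of a pair. Keeping rn = 1 retains the first row's Date
// with the pair's min/max Val. Runs longer than two rows are not fully collapsed.
val df2 = df1.withColumn("rn", when($"Condition" === lead($"Condition", 1).over(wind), 1)
  .when($"Condition" === lag($"Condition", 1).over(wind), 2)
  .otherwise(1)).withColumn("Val", $"val1").filter($"rn" === 1).drop("rn", "val1")
df2.show(false)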
+-------------------+------+---------+
|Date |Val |Condition|
+-------------------+------+---------+
|2020-10-02 10:00:00|211.39|Max |
|2020-10-02 10:10:00|210.94|Min |
|2020-10-02 10:30:00|209.21|Max |
|2020-10-02 11:20:00|207.22|Min |
|2020-10-02 12:10:00|207.58|Max |
|2020-10-02 12:40:00|207.45|Min |
|2020-10-02 13:40:00|208.7 |Max |
|2020-10-02 14:20:00|208.16|Min |
|2020-10-02 14:30:00|208.3 |Max |
|2020-10-02 14:50:00|208.25|Min |
|2020-10-02 15:10:00|208.7 |Max |
|2020-10-02 15:30:00|208.0 |Min |
|2020-10-02 16:30:00|208.35|Max |
|2020-10-02 16:40:00|208.26|Min |
|2020-10-02 16:50:00|208.27|Max |
|2020-10-02 17:30:00|208.06|Min |
+-------------------+------+---------+
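Note that in the output above the Min rows for 207.22 and 208.0 carry the Date of the first row of their pair (11:20:00 and 15:30:00), not the Date of the row that actually holds the value, and the lead/lag comparison only looks one row ahead or behind. If you need the exact row, or runs longer than two, here is a rough sketch of an alternative: flag the start of each run, turn the flags into a run id with a running sum, then keep the best row per run with row_number. The names ConsecutiveRuns, collapseRuns, new_run, run_id and rn are made up for illustration, and the sketch assumes Date sorts chronologically (ISO-formatted strings or timestamps).

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

object ConsecutiveRuns {
  // Keeps one row per run of equal consecutive Condition values:
  // the highest Val for "Max" runs, the lowest Val for any other Condition.
  def collapseRuns(df: DataFrame): DataFrame = {
    // Global ordering by Date; without partitionBy, Spark uses a single partition.
    val byDate = Window.orderBy("Date")
    val upToHere = byDate.rowsBetween(Window.unboundedPreceding, Window.currentRow)

    // 0 when Condition repeats the previous row, 1 otherwise (including the first row).
    val newRun = when(lag(col("Condition"), 1).over(byDate) === col("Condition"), 0).otherwise(1)

    // Sorting ascending on this key puts the wanted row first within each run.
    val sortKey = when(col("Condition") === "Max", negate(col("Val"))).otherwise(col("Val"))
    val withinRun = Window.partitionBy("run_id").orderBy(sortKey, col("Date"))

    df.withColumn("new_run", newRun)
      .withColumn("run_id", sum(col("new_run")).over(upToHere)) // running count of run starts
      .withColumn("rn", row_number().over(withinRun))           // ties: earliest Date wins
      .filter(col("rn") === 1)
      .drop("new_run", "run_id", "rn")
      .orderBy("Date")
  }

  def main(args: Array[String]): Unit = {
    // Local session and a few rows from the question, for illustration only.
    val spark = SparkSession.builder().master("local[*]").appName("runs").getOrCreate()
    import spark.implicits._

    val df = Seq(
      ("2020-10-02 10:30:00", 209.21, "Max"),
      ("2020-10-02 11:20:00", 207.48, "Min"),
      ("2020-10-02 11:50:00", 207.22, "Min"),
      ("2020-10-02 12:10:00", 207.58, "Max")
    ).toDF("Date", "Val", "Condition")

    collapseRuns(df).show(false)
    spark.stop()
  }
}
```

On these four rows the two Min rows should collapse into the 11:50:00 row with 207.22, which matches the expected result in the question. The trade-off is two extra window passes, and the unpartitioned window still pulls all rows into one partition, so this suits small to medium data.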
Let me know if it helps you.