Use Spark to group by consecutive same values of one column, taking Max or Min value of another column for each group

madprogrammer

Suppose I have the following data frame:

+-------------------+------+------------+
|               Date|   Val|   Condition|
+-------------------+------+------------+
|2020-10-02 10:00:00|211.39|         Max|
|2020-10-02 10:10:00|210.94|         Min|
|2020-10-02 10:30:00|209.21|         Max|
|2020-10-02 11:20:00|207.48|         Min|
|2020-10-02 11:50:00|207.22|         Min| <- take only this row because it's less than 207.48
|2020-10-02 12:10:00|207.58|         Max|
|2020-10-02 12:40:00|207.45|         Min|
|2020-10-02 13:10:00|207.45|         Min| <- take either row because they are equal
|2020-10-02 13:40:00| 208.7|         Max| <- take only this row because it's greater than 208.31
|2020-10-02 14:10:00|208.31|         Max| 
|2020-10-02 14:20:00|208.16|         Min|
|2020-10-02 14:30:00| 208.3|         Max|
|2020-10-02 14:50:00|208.25|         Min|
|2020-10-02 15:10:00| 208.7|         Max|
|2020-10-02 15:30:00|208.08|         Min|
|2020-10-02 16:00:00| 208.0|         Min| <- take only this row because it's less than 208.08
|2020-10-02 16:30:00|208.35|         Max|
|2020-10-02 16:40:00|208.26|         Min|
|2020-10-02 16:50:00|208.27|         Max|
|2020-10-02 17:30:00|208.06|         Min|
+-------------------+------+------------+

How can I group the rows by consecutive values of Condition and keep, for each group, the row with the maximum or minimum Val? The result should look like the data frame below (see the comments in the data frame above).

+-------------------+------+------------+
|               Date|   Val|   Condition|
+-------------------+------+------------+
|2020-10-02 10:00:00|211.39|         Max|
|2020-10-02 10:10:00|210.94|         Min|
|2020-10-02 10:30:00|209.21|         Max|
|2020-10-02 11:50:00|207.22|         Min|
|2020-10-02 12:10:00|207.58|         Max|
|2020-10-02 12:40:00|207.45|         Min|
|2020-10-02 13:40:00| 208.7|         Max|
|2020-10-02 14:20:00|208.16|         Min|
|2020-10-02 14:30:00| 208.3|         Max|
|2020-10-02 14:50:00|208.25|         Min|
|2020-10-02 15:10:00| 208.7|         Max|
|2020-10-02 16:00:00| 208.0|         Min|
|2020-10-02 16:30:00|208.35|         Max|
|2020-10-02 16:40:00|208.26|         Min|
|2020-10-02 16:50:00|208.27|         Max|
|2020-10-02 17:30:00|208.06|         Min|
+-------------------+------+------------+

The goal is:

  • for each run of more than one consecutive row with the same Condition (Max or Min),
  • keep only one row from that run: the row with the largest Val when Condition is Max, or the smallest Val when Condition is Min.

Sathiyan S

Try this,

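import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import spark.implicits._  // for the $"..." syntax; spark is the active SparkSession

val wind = Window.orderBy("Date")

// val1: min/max of Val over this row and an adjacent row with the same Condition
val df1 = df.withColumn("val1", when($"Condition" === lead($"Condition", 1).over(wind),
    when($"Condition" === "Min", min($"Val").over(wind.rowsBetween(0, 1))).otherwise(max($"Val").over(wind.rowsBetween(0, 1))))
  .when($"Condition" === lag($"Condition", 1).over(wind),
    when($"Condition" === "Min", min($"Val").over(wind.rowsBetween(-1, 0))).otherwise(max($"Val").over(wind.rowsBetween(-1, 0))))
  .otherwise($"Val"))

// rn = 2 marks the last row of a run (same Condition as the previous row, not the next); it is dropped
val df2 = df1.withColumn("rn", when($"Condition" === lead($"Condition", 1).over(wind), 1)
    .when($"Condition" === lag($"Condition", 1).over(wind), 2)
    .otherwise(1)).withColumn("Val", $"val1").filter($"rn" === 1).drop("rn", "val1")

df2.show(false)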

+-------------------+------+---------+
|Date               |Val   |Condition|
+-------------------+------+---------+
|2020-10-02 10:00:00|211.39|Max      |
|2020-10-02 10:10:00|210.94|Min      |
|2020-10-02 10:30:00|209.21|Max      |
|2020-10-02 11:20:00|207.22|Min      |
|2020-10-02 12:10:00|207.58|Max      |
|2020-10-02 12:40:00|207.45|Min      |
|2020-10-02 13:40:00|208.7 |Max      |
|2020-10-02 14:20:00|208.16|Min      |
|2020-10-02 14:30:00|208.3 |Max      |
|2020-10-02 14:50:00|208.25|Min      |
|2020-10-02 15:10:00|208.7 |Max      |
|2020-10-02 15:30:00|208.0 |Min      |
|2020-10-02 16:30:00|208.35|Max      |
|2020-10-02 16:40:00|208.26|Min      |
|2020-10-02 16:50:00|208.27|Max      |
|2020-10-02 17:30:00|208.06|Min      |
+-------------------+------+---------+
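
Note that the snippet above only looks one row ahead and one row back, so it covers runs of exactly two rows, and it keeps the Date of the first row of a pair (11:20 instead of 11:50 in the output above). If longer runs are possible, or the original Date has to be preserved, something along these lines might work. It is only a sketch: the names ConsecutiveRuns, pickExtremes, newRun, run and rn are made up for illustration, Date is assumed to sort correctly (a timestamp or a yyyy-MM-dd HH:mm:ss string), and any Condition other than "Min" is treated as "Max".

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, lag, row_number, sum, when}

object ConsecutiveRuns {

  // Keep one row per run of consecutive equal Condition values.
  def pickExtremes(df: DataFrame): DataFrame = {
    // No partitionBy here, so Spark sorts all rows in a single partition.
    val byDate = Window.orderBy("Date")

    // 1 on the first row of each run, 0 on rows that continue the previous run.
    val flagged = df.withColumn(
      "newRun",
      when(lag(col("Condition"), 1).over(byDate) === col("Condition"), 0).otherwise(1))

    // The running sum of the flags gives every run its own id.
    val withRun = flagged.withColumn("run", sum(col("newRun")).over(byDate))

    // Sort each run so the wanted row comes first: ascending Val for Min,
    // descending Val (via negation) for Max; ties go to the earliest Date.
    val sortKey = when(col("Condition") === "Min", col("Val")).otherwise(-col("Val"))
    val byRun = Window.partitionBy("run").orderBy(sortKey, col("Date"))

    withRun
      .withColumn("rn", row_number().over(byRun))
      .filter(col("rn") === 1)
      .drop("newRun", "run", "rn")
      .orderBy("Date")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("runs").getOrCreate()
    import spark.implicits._

    // A few rows copied from the question's data frame.
    val df = Seq(
      ("2020-10-02 11:20:00", 207.48, "Min"),
      ("2020-10-02 11:50:00", 207.22, "Min"),
      ("2020-10-02 12:10:00", 207.58, "Max"),
      ("2020-10-02 12:40:00", 207.45, "Min"),
      ("2020-10-02 13:10:00", 207.45, "Min"),
      ("2020-10-02 13:40:00", 208.7, "Max"),
      ("2020-10-02 14:10:00", 208.31, "Max")
    ).toDF("Date", "Val", "Condition")

    pickExtremes(df).show(false)
    spark.stop()
  }
}
```

On these seven rows it should keep the 11:50, 12:10, 12:40 and 13:40 rows, each with its own Date and Val. The two window steps are kept in separate withColumn calls because Spark does not accept a window function nested inside another one.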

Let me know if it helps you.
