I have the following DataFrame:
+----+-----+---+
| val|count| id|
+----+-----+---+
|   a|   10| m1|
|   b|   20| m1|
|null|   30| m1|
|   b|   30| m2|
|   c|   40| m2|
|null|   50| m2|
+----+-----+---+
It was created with:
import spark.implicits._ // needed for toDF; spark-shell imports it automatically
val df1 = Seq(
("a","10","m1"),
("b","20","m1"),
(null,"30","m1"),
("b","30","m2"),
("c","40","m2"),
(null,"50","m2")
).toDF("val","count","id")
I am trying to rank the rows using row_number() with a window function, as shown below.
df1.withColumn("rannk_num", row_number() over Window.partitionBy("id").orderBy("count")).show
+----+-----+---+---------+
| val|count| id|rannk_num|
+----+-----+---+---------+
|   a|   10| m1|        1|
|   b|   20| m1|        2|
|null|   30| m1|        3|
|   b|   30| m2|        1|
|   c|   40| m2|        2|
|null|   50| m2|        3|
+----+-----+---+---------+
However, I need to leave the records whose val column is null out of the ranking.
Expected output:
+----+-----+---+---------+
| val|count| id|rannk_num|
+----+-----+---+---------+
|   a|   10| m1|        1|
|   b|   20| m1|        2|
|null|   30| m1|     NULL|
|   b|   30| m2|        1|
|   c|   40| m2|        2|
|null|   50| m2|     NULL|
+----+-----+---+---------+
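I would like to know whether this can be achieved with minimal changes. The val and count columns can also hold any number ('n') of distinct values.
Rank only the rows whose val is not null, give the rows with a null val a null row number, and union the two results back together.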
val df1=Seq(
("a","10","m1"),
("b","20","m1"),
(null,"30","m1"),
("b","30","m2"),
("c","40","m2"),
(null,"50","m2")
).toDF("val","count","id")
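import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{lit, row_number}

// Rank only the rows with a non-null val, then append the null rows with a null rank
df1.filter("val is not null").withColumn(
"rannk_num", row_number() over Window.partitionBy("id").orderBy("count")
).union(
df1.filter("val is null").withColumn("rannk_num", lit(null))
).show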
+----+-----+---+---------+
| val|count| id|rannk_num|
+----+-----+---+---------+
|   a|   10| m1|        1|
|   b|   20| m1|        2|
|   b|   30| m2|        1|
|   c|   40| m2|        2|
|null|   30| m1|     null|
|null|   50| m2|     null|
+----+-----+---+---------+
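If you would rather avoid the filter and union, a single-pass variant might also work. The sketch below is not from the answer above: it assumes you can add `col("val").isNull` as a second partition key, so that null rows are numbered in their own group and never use up a rank, and then blanks their rank with `when`. The local SparkSession and the names `w` and `ranked` are only there for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number, when}

// Assumption: a local session for illustration; in spark-shell `spark` already exists
val spark = SparkSession.builder().master("local[*]").appName("rank-non-null").getOrCreate()
import spark.implicits._

val df1 = Seq(
  ("a", "10", "m1"),
  ("b", "20", "m1"),
  (null, "30", "m1"),
  ("b", "30", "m2"),
  ("c", "40", "m2"),
  (null, "50", "m2")
).toDF("val", "count", "id")

// Partition by id AND by whether val is null, so null rows get their own
// numbering and cannot shift the ranks of the non-null rows
val w = Window.partitionBy(col("id"), col("val").isNull).orderBy("count")

// when(...) without otherwise(...) yields null for the rows where val is null
val ranked = df1.withColumn(
  "rannk_num",
  when(col("val").isNotNull, row_number().over(w))
)

ranked.orderBy("id", "count").show()
```

This keeps the original row order concerns out of the picture, since nothing is unioned. Note that count is a string column in this example, so orderBy("count") sorts lexicographically; if your real values differ in number of digits, ordering by col("count").cast("int") is probably what you want.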