I have a DataFrame like this:
import spark.implicits._
val df1 = List(
  ("id1", Array(0, 2)),
  ("id1", Array(2, 1)),
  ("id2", Array(0, 3))
).toDF("id", "value")
+---+------+
| id| value|
+---+------+
|id1|[0, 2]|
|id1|[2, 1]|
|id2|[0, 3]|
+---+------+
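I want to groupBy id and max-pool the value arrays in each group, i.e. take the element-wise maximum. For id1, the maximum is Array(2,2). The result I want is: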
import spark.implicits._
val res = List(
  ("id1", Array(2, 2)),
  ("id2", Array(0, 3))
).toDF("id", "value")
+---+------+
| id| value|
+---+------+
|id1|[2, 2]|
|id2|[0, 3]|
+---+------+
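One approach is to convert the DataFrame to an RDD and merge the arrays per key with reduceByKey:
import spark.implicits._

val df1 = List(
  ("id1", Array(0, 2, 3)),
  ("id1", Array(2, 1, 4)),
  ("id2", Array(0, 7, 3))
).toDF("id", "value")

val df2rdd = df1.rdd
  // Convert each Row to an (id, values) pair
  .map(x => (x(0).toString, x.getSeq[Int](1)))
  // Merge two arrays of the same id, keeping the larger element at each position
  .reduceByKey((x, y) => {
    val arrlength = x.length
    var i = 0
    val resarr = scala.collection.mutable.ArrayBuffer[Int]()
    while (i < arrlength) {
      if (x(i) >= y(i)) {
        resarr.append(x(i))
      } else {
        resarr.append(y(i))
      }
      i += 1
    }
    // toSeq keeps the result type aligned with the Seq[Int] inputs
    resarr.toSeq
  }).toDF("id", "newvalue")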
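The merge step inside reduceByKey can probably be written more compactly with zip and math.max. The following is only a minimal sketch on plain Scala collections, with no Spark session required. The helper name maxPool and the local rows list are assumptions made for illustration and are not part of the original answer. It also assumes that all arrays for a given id have the same length.

```scala
// Hypothetical helper: element-wise maximum of two equal-length sequences.
def maxPool(x: Seq[Int], y: Seq[Int]): Seq[Int] = {
  require(x.length == y.length, "arrays must have the same length")
  x.zip(y).map { case (a, b) => math.max(a, b) }
}

// Local stand-in for the rows of df1 (illustrative data from the answer above).
val rows = List(
  ("id1", Seq(0, 2, 3)),
  ("id1", Seq(2, 1, 4)),
  ("id2", Seq(0, 7, 3))
)

// Group by id and fold each group's arrays with maxPool,
// mirroring what reduceByKey does per key on the RDD.
val pooled: Map[String, Seq[Int]] =
  rows.groupBy(_._1).map { case (id, group) =>
    id -> group.map(_._2).reduce((a, b) => maxPool(a, b))
  }
```

In the Spark version, the same function could be passed as .reduceByKey(maxPool) in place of the while loop, since the pair RDD there also holds (String, Seq[Int]) values.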