How do I get value_counts for a Spark row?

Posted by debugcn in Dev

国家情报局

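I have a Spark DataFrame with 3 columns that store 3 different predictions. I want to know the count of each output value so that I can pick the value that occurs most often as the final output.

I can do this easily in pandas by calling my lambda function on each row to get the value_counts, as shown below. Here I have converted the Spark df to a pandas df, but I need to be able to perform a similar operation directly on the Spark df.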

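from pyspark.sql import Row, SparkSession

spark = SparkSession.builder.getOrCreate()  # reuse the active session if one exists

# One row holding three predictions plus some metadata
r=[Row(run_1=1, run_2=2, run_3=1, name='test run', id=1)]
df1=spark.createDataFrame(r)
df1.show()
df2=df1.toPandas()  # collects all rows to the driver
r=df2.iloc[0]
# value_counts() sorts by frequency, most common value first
val_counts=r[['run_1','run_2','run_3']].value_counts()
print(val_counts)
top_val=val_counts.index[0]
top_val_cnt=val_counts.values[0]
print('Majority output = %s, occured %s out of 3 times'%(top_val,top_val_cnt))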

The output tells me that the value 1 occurs most often, twice in this case:

+---+--------+-----+-----+-----+
| id|    name|run_1|run_2|run_3|
+---+--------+-----+-----+-----+
|  1|test run|    1|    2|    1|
+---+--------+-----+-----+-----+

1    2
2    1
Name: 0, dtype: int64

Majority output = 1, occured 2 out of 3 times

I am trying to write a udf that takes each row of df1 and returns top_val and top_val_cnt. Is there a way to achieve this with a Spark df?

面包

The Python code should be similar to this Scala version; perhaps it will help you.

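  import org.apache.spark.sql.functions.array
  import spark.implicits._  // needed for toDF, the '* column syntax, and the tuple encoder

  val df1 = Seq((1, 1, 1, 2), (1, 2, 3, 3), (2, 2, 2, 2)).toDF()
  df1.show()
  // Pack all columns into one array column, then count how often each value occurs per row
  df1.select(array('*)).map(s=>{
    val list = s.getList(0)
    (list.toString(),list.toArray.groupBy(i => i).mapValues(_.size).toList.toString())
  }).show(false)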

Output:

+---+---+---+---+
| _1| _2| _3| _4|
+---+---+---+---+
|  1|  1|  1|  2|
|  1|  2|  3|  3|
|  2|  2|  2|  2|
+---+---+---+---+

+------------+-------------------------+
|_1          |_2                       |
+------------+-------------------------+
|[1, 1, 1, 2]|List((2,1), (1,3))       |
|[1, 2, 3, 3]|List((2,1), (1,1), (3,2))|
|[2, 2, 2, 2]|List((2,4))              |
+------------+-------------------------+
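For PySpark, the same per-row counting might be sketched as follows. This is only a rough sketch, not a tested solution for any particular Spark version: the helper names majority and add_majority_columns are hypothetical, the column names run_1, run_2 and run_3 are taken from the question, and the output column types (LongType for the value, IntegerType for the count) are assumptions about the data.

```python
from collections import Counter


def majority(values):
    """Return (top_value, count) for the most frequent value among a row's predictions."""
    # most_common(1) yields [(value, count)]; on a tie the value seen first wins.
    top_val, top_cnt = Counter(values).most_common(1)[0]
    return top_val, top_cnt


def add_majority_columns(df, cols=("run_1", "run_2", "run_3")):
    """Append top_val and top_val_cnt columns to a Spark DataFrame (hypothetical helper)."""
    # Imported lazily so the pure-Python helper above can be used without PySpark.
    from pyspark.sql import functions as F
    from pyspark.sql.types import IntegerType, LongType, StructField, StructType

    # Assumed result types: predictions are integers, the count is a small int.
    schema = StructType([
        StructField("top_val", LongType()),
        StructField("top_val_cnt", IntegerType()),
    ])
    # The udf receives one argument per prediction column and returns a struct.
    majority_udf = F.udf(lambda *vals: majority(vals), schema)
    voted = df.withColumn("vote", majority_udf(*[F.col(c) for c in cols]))
    # Flatten the struct into two ordinary columns.
    return voted.select("*", "vote.top_val", "vote.top_val_cnt").drop("vote")
```

With the question's df1, calling add_majority_columns(df1).show() would be expected to add top_val and top_val_cnt next to the existing columns. A Python udf runs row by row in a Python worker, so on large data it may be slower than an approach built only from Spark's built-in column functions.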
