PySpark - get the number of elements at which two arrays have the same value

Posted by andy in Dev

Andy

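I'm learning Spark and have run into a problem I can't get past. What I want is the number of positions at which two arrays hold the same value. I can get this with a Python UDF, but I'd like to know whether there is a way to do it using only Spark functions.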

df_bits = sqlContext.createDataFrame([[[0, 1, 1, 0, 0],
                                       [1, 1, 1, 0, 1],
                                     ]], ['bits1', 'bits2'])
# some_magic is a placeholder for the column expression I am looking for
df_bits_with_result = df_bits.select('bits1', 'bits2', some_magic('bits1', 'bits2'))
df_bits_with_result.show()


+---------------+---------------+------------------------+
|bits1          |bits2          |some_magic(bits1, bits2)|
+---------------+---------------+------------------------+
|[0, 1, 1, 0, 0]|[1, 1, 1, 0, 1]|3                       |
+---------------+---------------+------------------------+

Why 3? Because bits1[1] == bits2[1], bits1[2] == bits2[2], and bits1[3] == bits2[3].
I tried playing with rdd.reduce but had no luck.

如

Perhaps this is helpful:

Spark >= 2.4

Use aggregate and zip_with

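    import org.apache.spark.sql.functions.expr

    // Two integer arrays of equal length; the trailing null shows how nulls are handled
    val df = spark.sql("select array(0, 1, 1, 0, 0, null) as bits1, array(1, 1, 1, 0, 1, null) as bits2")
    df.show(false)
    df.printSchema()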

    /**
      * +----------------+----------------+
      * |bits1           |bits2           |
      * +----------------+----------------+
      * |[0, 1, 1, 0, 0,]|[1, 1, 1, 0, 1,]|
      * +----------------+----------------+
      *
      * root
      * |-- bits1: array (nullable = false)
      * |    |-- element: integer (containsNull = true)
      * |-- bits2: array (nullable = false)
      * |    |-- element: integer (containsNull = true)
      */

    df.withColumn("x", expr("aggregate(zip_with(bits1, bits2, (x, y) -> if(x=y, 1, 0)), 0, (acc, x) -> acc + x)"))
      .show(false)

    /**
      * +----------------+----------------+---+
      * |bits1           |bits2           |x  |
      * +----------------+----------------+---+
      * |[0, 1, 1, 0, 0,]|[1, 1, 1, 0, 1,]|3  |
      * +----------------+----------------+---+
      */
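Since the question is about PySpark, here is a rough sketch of how the same expression might be used from Python. The SQL string is the one from the answer above; the names `MATCH_COUNT_SQL`, `count_matches`, `add_match_count`, and the output column `same_count` are made up for illustration. `count_matches` is only a plain-Python model of what the expression computes, handy for checking expected values without a cluster. It assumes arrays of equal length.

```python
from functools import reduce

# The SQL expression from the answer above, unchanged.
# zip_with marks each position with 1 (equal) or 0, aggregate sums the marks.
MATCH_COUNT_SQL = (
    "aggregate("
    "zip_with(bits1, bits2, (x, y) -> if(x = y, 1, 0)), "
    "0, (acc, x) -> acc + x)"
)


def count_matches(bits1, bits2):
    """Plain-Python model of MATCH_COUNT_SQL (hypothetical helper).

    A null (None) on either side never counts as a match, mirroring
    SQL where `null = null` is not true.
    """
    flags = [
        1 if (x is not None and y is not None and x == y) else 0
        for x, y in zip(bits1, bits2)
    ]
    return reduce(lambda acc, x: acc + x, flags, 0)


def add_match_count(df, out_col="same_count"):
    """Add the per-row match count to a DataFrame with bits1/bits2 columns.

    Requires Spark >= 2.4 for the aggregate and zip_with SQL functions.
    """
    # Imported here so count_matches can be used without Spark installed.
    from pyspark.sql.functions import expr

    return df.withColumn(out_col, expr(MATCH_COUNT_SQL))
```

With the `df_bits` DataFrame from the question, `add_match_count(df_bits).show()` should add a `same_count` column holding 3 for the sample row, with no Python UDF involved.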
