How to query a dictionary-format column in a PySpark DataFrame

Posted by debugcn in Dev

Allen

I have the following DataFrame:

  >>> df.printSchema()
  root
   |-- I: string (nullable = true)
   |-- F: string (nullable = true)
   |-- D: string (nullable = true)
   |-- T: string (nullable = true)
   |-- S: string (nullable = true)
   |-- P: string (nullable = true)

Column F holds a dictionary-formatted string:

   {"P1":"1:0.01","P2":"3:0.03,4:0.04","P3":"3:0.03,4:0.04",...}

I need to read column F as follows and create two new columns, P and N, with one row per key/value pair:

   P1 => "1:0.01"
   P2 => "3:0.03,4:0.04"
   and so on

 +--------+--------+-----------------+-----+------+--------+----+
 | I      |  P     | N               |  D  | T    | S      | P  |
 +--------+--------+-----------------+-----+------+--------+----+
 | i1     |  p1    | 1:0.01          |  d1 | t1   | s1     | p1 |
 |--------|--------|-----------------|-----|------|--------|----|
 | i1     |  p2    | 3:0.03,4:0.04   |  d1 | t1   | s1     | p1 |
 |--------|--------|-----------------|-----|------|--------|----|
 | i1     |  p3    | 3:0.03,4:0.04   |  d1 | t1   | s1     | p1 |
 |--------|--------|-----------------|-----|------|--------|----|
 | i2     |  ...   | ....            |  d2 | t2   | s2     | p2 |
 +--------+--------+-----------------+-----+------+--------+----+

Any suggestions on how to do this in PySpark?

Allen

In the end I solved it like this:

 from pyspark.sql.functions import col, explode, split, udf
 from pyspark.sql.types import StringType

 # Replace the commas that separate entries ('","') with ';' so they can be
 # told apart from the commas inside each value, then drop the braces.
 def _comma_replacement(val):
     if val:
         val = val.replace('","', '";"').replace('{', '').replace('}', '')
     return val

 replacing = udf(_comma_replacement, StringType())
 new_df = df.withColumn("F", replacing(col("F")))
 # split() already returns array<string>: one element per "key":"value" entry.
 new_df = new_df.withColumn("F", split(col("F"), ";"))
 # One output row per entry.
 exploded_df = new_df.withColumn("F", explode("F"))
 # Split each entry into [key, value]; the outer quotes are still attached.
 df_sep = exploded_df.withColumn("F", split(col("F"), '":"'))
 # Note: withColumn("P", ...) overwrites the P column that df already has.
 dff = df_sep.withColumn("P", df_sep["F"].getItem(0))
 dff_new = dff.withColumn("N", dff["F"].getItem(1))
 dff_new = dff_new.drop("F")

With another UDF, I then removed the extra characters left over from the string manipulation.

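The answer above works the same way. The key idea is to distinguish the commas that separate entries from the commas inside each value. To do that, I suggest calling _comma_replacement(val) from a UDF. The answer above applies the same idea with regexp_replace, which Spark can optimize better than a Python UDF.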
