I want to transform this PySpark DataFrame:
df = spark.createDataFrame([
    ("A", 1),
    ("A", 2),
    ("A", 3),
    ("B", 1),
    ("B", 2),
    ("B", 4),
    ("B", 5)
], ["name", "connect"])
df.show()
+----+-------+
|name|connect|
+----+-------+
| A| 1|
| A| 2|
| A| 3|
| B| 1|
| B| 2|
| B| 4|
| B| 5|
+----+-------+
into the following format:
df_out = spark.createDataFrame([
    ("A", "A", 3),
    ("B", "B", 4),
    ("A", "B", 2)
], ["name1", "name2", "n_connect"])
df_out.show()
+-----+-----+---------+
|name1|name2|n_connect|
+-----+-----+---------+
| A| A| 3|
| B| B| 4|
| A| B| 2|
+-----+-----+---------+
That is, I want to know how many "connects" each name has, and how many shared "connects" exist between each pair of names. Is there any standard functionality in Spark that would let me do this?
You can do a self-join on connect, merge the mirrored combinations (i.e. A->B and B->A), and then compute countDistinct over connect for each combination. Below, sort_array(array(d1.name, d2.name)) is used so that each unordered pair of names groups together:
from pyspark.sql.functions import countDistinct

# Self-join on "connect", then sort each name pair so that
# A->B and B->A collapse into the same group.
df_new = df.alias("d1").join(df.alias("d2"), "connect") \
    .selectExpr("sort_array(array(d1.name, d2.name)) as names", "d1.connect") \
    .groupby("names") \
    .agg(countDistinct("connect").alias("n_connect"))

df_new.show()
+------+---------+
| names|n_connect|
+------+---------+
|[A, A]| 3|
|[B, B]| 4|
|[A, B]| 2|
+------+---------+
df_new.selectExpr("names[0] as name1", "names[1] as name2", "n_connect").show()
+-----+-----+---------+
|name1|name2|n_connect|
+-----+-----+---------+
| A| A| 3|
| B| B| 4|
| A| B| 2|
+-----+-----+---------+
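As an alternative sketch (not part of the answer above): you could also drop the mirrored pairs at join time with an explicit condition d1.name <= d2.name, assuming the names compare lexicographically, and skip the array step entirely:

from pyspark.sql.functions import col, countDistinct

# Keep only one orientation of each pair at join time (name1 <= name2),
# so A->B and B->A never both appear.
df_alt = df.alias("d1").join(
        df.alias("d2"),
        (col("d1.connect") == col("d2.connect")) & (col("d1.name") <= col("d2.name"))
    ) \
    .select(
        col("d1.name").alias("name1"),
        col("d2.name").alias("name2"),
        col("d1.connect").alias("connect")
    ) \
    .groupby("name1", "name2") \
    .agg(countDistinct("connect").alias("n_connect"))

df_alt.show()

This yields the name1/name2/n_connect layout directly, at the cost of a slightly more verbose join condition.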
You can do something similar with pandas:
pdf = df.toPandas()

# Self-merge on "connect", sort each name pair into a tuple so that
# mirrored pairs group together, then count distinct connects per pair.
pdf.merge(pdf, on="connect") \
    .assign(names=lambda x: [tuple(sorted(z)) for z in zip(x.name_x, x.name_y)]) \
    .groupby('names')["connect"].nunique()
#Out[*]:
#names
#(A, A) 3
#(A, B) 2
#(B, B) 4
Per @anky's suggestion, sort the names with np.sort():
import numpy as np

names = ["name_x", "name_y"]
pdf1 = pdf.merge(pdf, on="connect")
# Sort the two name columns row-wise so mirrored pairs match.
pdf1[names] = np.sort(pdf1[names], axis=1)
pdf1.groupby(names)["connect"].nunique().reset_index()
# name_x name_y connect
#0 A A 3
#1 A B 2
#2 B B 4
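If you need this pandas result back as a Spark DataFrame with df_out's column names, a minimal sketch would be the following (df_out_from_pandas is just an illustrative name; spark is the same SparkSession as in the question):

res = pdf1.groupby(names)["connect"].nunique().reset_index()
res.columns = ["name1", "name2", "n_connect"]  # rename to match df_out's schema
df_out_from_pandas = spark.createDataFrame(res)  # illustrative name, not from the answer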