val originalDF = Seq((1,"gaurav","jaipur",550,70000),(2,"sunil","noida",600,80000),(3,"rishi","ahmedabad",510,65000)).toDF("id","name","city","credit_score","credit_limit")
val changedDF= Seq((1,"gaurav","jaipur",550,70000),(2,"sunil","noida",650,90000),(4,"Joshua","cochin",612,85000)).toDF("id","name","city","creditscore","credit_limit")
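The two DataFrames above have the same table structure, and I want to find the ids whose values have changed in the second DataFrame (changedDF). I tried using Spark's except() function, but it gives me two rows. id is the common column between the two DataFrames.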
changedDF.except(originalDF).show
+---+------+------+-----------+------------+
| id| name| city|creditscore|credit_limit|
+---+------+------+-----------+------------+
| 4|Joshua|cochin| 612| 85000|
| 2| sunil| noida| 650| 90000|
+---+------+------+-----------+------------+
What I want instead is only the ids that are common to both DataFrames and have some change:
+---+------+------+-----------+------------+
| id| name| city|creditscore|credit_limit|
+---+------+------+-----------+------------+
| 2| sunil| noida| 650| 90000|
+---+------+------+-----------+------------+
Is there a way to find only the common ids whose data has changed? Can anyone suggest an approach I could follow to achieve this?
You can do an inner join of the two DataFrames, which restricts the result to the ids they have in common.
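import org.apache.spark.sql.functions.col
// The inner join keeps only ids present in both DataFrames;
// except then drops the rows that are identical in changedDF.
originalDF.alias("a").join(changedDF.alias("b"), col("a.id") === col("b.id"), "inner")
  .select("a.*")
  .except(changedDF)
  .show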
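This returns the row for the common id whose data changed, shown with its values from originalDF (select "b.*" and call .except(originalDF) instead to get the updated values from changedDF):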
+---+-----+-----+------------+------------+
| id| name| city|credit_score|credit_limit|
+---+-----+-----+------------+------------+
| 2|sunil|noida| 600| 80000|
+---+-----+-----+------------+------------+
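If the goal is the output shown in the question, with the new values from changedDF, one possible variant is sketched below. It is only a sketch under stated assumptions: it is written as a standalone script that builds its own local SparkSession (in spark-shell those lines can be dropped), the names commonDF and changedCommonDF are made up for illustration, and it relies on except comparing columns by position, which is why the differing creditscore / credit_score names do not matter here.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session for a standalone run; spark-shell already provides `spark`.
val spark = SparkSession.builder().master("local[*]").appName("changed-ids").getOrCreate()
import spark.implicits._

val originalDF = Seq((1,"gaurav","jaipur",550,70000),(2,"sunil","noida",600,80000),(3,"rishi","ahmedabad",510,65000)).toDF("id","name","city","credit_score","credit_limit")
val changedDF = Seq((1,"gaurav","jaipur",550,70000),(2,"sunil","noida",650,90000),(4,"Joshua","cochin",612,85000)).toDF("id","name","city","creditscore","credit_limit")

// Keep only the changedDF rows whose id also exists in originalDF.
// A left semi join returns columns from the left side only.
val commonDF = changedDF.join(originalDF.select("id"), Seq("id"), "left_semi")

// Drop the rows that are identical in originalDF; what remains are
// the common ids whose values differ, with the values from changedDF.
val changedCommonDF = commonDF.except(originalDF)
changedCommonDF.show()
```

The semi join removes id 4 (not present in originalDF), and except removes id 1 (unchanged), so only the row for id 2 should remain.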