I have a Spark job that finishes by writing a DataFrame to an internal table with a given name using saveAsTable.
The DataFrame is built in several steps, one of which uses the beta distribution from scipy, imported via `from scipy.stats import beta`. The job runs on Google Cloud with 20 worker nodes, but it fails with the following error complaining about the scipy package:
Caused by: org.apache.spark.SparkException:
Job aborted due to stage failure:
Task 14 in stage 7.0 failed 4 times, most recent failure:
Lost task 14.3 in stage 7.0 (TID 518, name-w-3.c.somenames.internal,
executor 23): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/worker.py", line 364, in main
func, profiler, deserializer, serializer = read_command(pickleSer, infile)
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/worker.py", line 69, in read_command
command = serializer._read_with_length(file)
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 172, in _read_with_length
return self.loads(obj)
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 583, in loads
return pickle.loads(obj)
ImportError: No module named scipy.stats._continuous_distns
Any ideas or solutions?
I also tried passing libraries to the Spark job:
"spark.driver.extraLibraryPath" : "/usr/lib/spark/python/lib/pyspark.zip",
"spark.driver.extraClassPath" :"/usr/lib/spark/python/lib/pyspark.zip"
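For reference, a minimal sketch of the kind of step that triggers this error (the function name and parameters are placeholders, not my actual code). The key point is that the function references scipy, so when Spark pickles it and ships it to an executor, scipy must be importable on that executor as well:

```python
# Sketch (hypothetical names): a transform that calls scipy.stats.beta.
# The function body runs on the executors, so the ImportError above is
# raised there, not on the driver, if scipy is missing on a worker node.
from scipy.stats import beta

def beta_pdf(x, a=2.0, b=2.0):
    # Evaluated on the executor; requires scipy on every worker node.
    return float(beta.pdf(x, a, b))

# On the cluster this would be wrapped as a UDF, e.g.:
# from pyspark.sql.functions import udf
# from pyspark.sql.types import DoubleType
# beta_pdf_udf = udf(beta_pdf, DoubleType())
```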
Is the library installed on all nodes in the cluster? You can simply run:
pip install --user scipy
I do this with a bootstrap action in AWS EMR; there should be a similar mechanism on Google Cloud.
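On Google Cloud Dataproc the equivalent of an EMR bootstrap action is an initialization action. A hedged sketch (the cluster name and region are placeholders; the `pip-install.sh` action and `PIP_PACKAGES` metadata key come from Google's published initialization-actions repository, so verify against the current Dataproc docs):

```shell
# Install scipy on every node at cluster-creation time via an
# initialization action (hypothetical cluster/region names).
gcloud dataproc clusters create my-cluster \
    --region=us-central1 \
    --initialization-actions=gs://goog-dataproc-initialization-actions-us-central1/python/pip-install.sh \
    --metadata='PIP_PACKAGES=scipy'
```

Packages installed this way are present on all workers before any Spark task runs, which avoids the executor-side ImportError entirely.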