I'm trying to get TensorFlow working on my Spark cluster so that it runs in parallel. As a first step, I tried to use this demo as-is.
The demo works fine without Spark, but with Spark I get the following error:
16/08/02 10:44:16 INFO DAGScheduler: Job 0 failed: collect at /home/hdfs/tfspark.py:294, took 1.151383 s
Traceback (most recent call last):
File "/home/hdfs/tfspark.py", line 294, in <module>
local_labelled_images = labelled_images.collect()
File "/usr/hdp/2.4.2.0-258/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 771, in collect
File "/usr/hdp/2.4.2.0-258/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", line 813, in __call__
File "/usr/hdp/2.4.2.0-258/spark/python/lib/py4j-0.9-src.zip/py4j/protocol.py", line 308, in get_return_value
py4j.protocol.Py4JJavaError16/08/02 10:44:17 INFO BlockManagerInfo: Removed broadcast_2_piece0 on localhost:45020 in memory (size: 6.4 KB, free: 419.5 MB)
16/08/02 10:44:17 INFO ContextCleaner: Cleaned accumulator 2
: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/usr/hdp/2.4.2.0-258/spark/python/lib/pyspark.zip/pyspark/worker.py", line 98, in main
command = pickleSer._read_with_length(infile)
File "/usr/hdp/2.4.2.0-258/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 164, in _read_with_length
return self.loads(obj)
File "/usr/hdp/2.4.2.0-258/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 422, in loads
return pickle.loads(obj)
File "/usr/lib/python2.7/site-packages/six.py", line 118, in __getattr__
_module = self._resolve()
File "/usr/lib/python2.7/site-packages/six.py", line 115, in _resolve
return _import_module(self.mod)
File "/usr/lib/python2.7/site-packages/six.py", line 118, in __getattr__
_module = self._resolve()
File "/usr/lib/python2.7/site-packages/six.py", line 115, in _resolve
return _import_module(self.mod)
File "/usr/lib/python2.7/site-packages/six.py", line 118, in __getattr__
_module = self._resolve()
.
.
.
RuntimeError: maximum recursion depth exceeded
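Reading the stack trace bottom-up: the worker process fails while unpickling the shipped task (pickle.loads), landing in six's lazy-import machinery, whose __getattr__ / _resolve pair keeps re-entering itself until the recursion limit is hit. Here is a minimal sketch of the pattern that seems to trigger it, assuming a SparkContext sc; the function and URL are illustrative, not the demo's actual code:

from pyspark import SparkContext
from six.moves import urllib  # six's lazily-resolved compatibility module

sc = SparkContext(appName="six-recursion-repro")

def fetch(url):
    # The closure references six.moves.urllib, so PySpark pickles it for the
    # workers; unpickling it there re-enters six's __getattr__/_resolve loop.
    return urllib.request.urlopen(url).read()

rdd = sc.parallelize(["http://example.com"]).map(fetch)
rdd.collect()  # RuntimeError: maximum recursion depth exceeded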
I get the same error whether I run it with pyspark or with spark-submit directly.
I tried increasing the recursion limit to 50000 (even though it is probably not the root cause), but it didn't help.
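For reference, that attempt was roughly the following sketch; note that sys.setrecursionlimit on the driver does not reach the worker processes, where the unpickling in the stack trace above actually happens:

import sys
sys.setrecursionlimit(50000)  # driver-side only; did not help

Pushing the same call into the mapped function would raise the limit on the workers as well, but since the six lookup appears to loop rather than merely run deep, no finite limit would help.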
Since the error comes from the six package, I thought Python 3 might fix it, but I haven't tried that yet because it would probably require changes to the production environment (better avoided if possible).
Is Python 3 supposed to work better with pyspark? (I know it works well with TensorFlow.)
Any ideas on how to make this work with Python 2?
I'm running TensorFlow 0.9.0 and Spark 1.6.1 on a Hortonworks cluster on RHEL 7.2 with Python 2.7.5.
Thanks.
Tried it with Python 3.5 and got the same exception, so apparently upgrading to Python 3 won't fix it.
I finally realized that the root cause is the six module itself: it has some compatibility problem with Spark and breaks whenever it is loaded.
So to fix the problem, I searched the demo for every usage of the six package and replaced it with the equivalent Python 2 module (for example, six.moves.urllib.response became urllib2). After removing all occurrences of six, the demo runs perfectly on Spark.
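A sketch of the kind of substitution this involved; the exact call sites depend on the demo, and urlopen here is just an illustration:

# Before, through six's compatibility layer:
#   from six.moves import urllib
#   data = urllib.request.urlopen(url).read()

# After, using the Python 2 standard library directly:
import urllib2
data = urllib2.urlopen(url).read()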