I have a cluster on which I execute wholeTextFiles, which should pull about a million text files summing up to approximately 10GB total. I have one NameNode and two DataNodes with 30GB of RAM each, 4 cores each. The data is stored in HDFS.
I don't set any special parameters and the job takes 5 hours just to read the data. Is that expected? Are there any parameters that should speed up the read (Spark configuration, partitioning, number of executors)?
I'm just starting out and I've never had the need to optimize a job before.
EDIT: Additionally, can someone explain exactly how the wholeTextFiles function works? (Not how to use it, but how it is implemented.) I'm very interested in understanding the partition parameter, etc.
EDIT 2: benchmark assessment
So I tried repartition after the wholeTextFiles call, but the problem is the same, because the initial read still uses the pre-defined number of partitions, so there is no performance improvement. Once the data is loaded, the cluster performs really well. I get the following warning message from wholeTextFiles when processing the data (200k files):
15/01/19 03:52:48 WARN scheduler.TaskSetManager: Stage 0 contains a task of very large size (15795 KB). The maximum recommended task size is 100 KB.
Could that be a reason for the bad performance? How do I work around it?
Additionally, when doing a saveAsTextFile, my speed according to the Ambari console is 19 MB/s. When doing a read with wholeTextFiles, I am at 300 KB/s.
It seems that by increasing the number of partitions in wholeTextFiles(path, minPartitions), I am getting better performance. But still only 8 tasks are running at the same time (my number of CPUs). I'm benchmarking to find the limit...
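For reference, here is a minimal sketch of the kind of call I'm benchmarking in spark-shell (the path and the value 64 are made-up examples):

    // sc is provided by spark-shell; the path and 64 are example values
    val files = sc.wholeTextFiles("hdfs:///data/small-files", minPartitions = 64)

    // each record is (file path, full file contents)
    println(files.partitions.size)  // how many read partitions were actually created

    // repartitioning afterwards only re-splits data that has already been read;
    // it does not parallelize the initial read itself
    val spread = files.repartition(64)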
To summarize my recommendations from the comments:
You are running Spark with the default settings: 2 executors (--num-executors) with 1 thread each (--executor-cores) and 512m of RAM (--executor-memory), giving you only 2 threads with 512MB RAM each, which is really small for real-world tasks.

So my recommendation is:
1. Start Spark with --num-executors 4 --executor-memory 12g --executor-cores 4, which would give you more parallelism: 16 threads in this particular case, which means 16 tasks running in parallel (see the example spark-submit invocation below).
2. Use sc.wholeTextFiles to read the files once and then dump them into a compressed sequence file (for instance, with Snappy block-level compression); here's an example of how this can be done: http://0x0fff.com/spark-hdfs-integration/. This will greatly reduce the time needed to read them on the next iteration (a minimal sketch follows after this list).
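For example, assuming you submit the job through spark-submit on YARN (the class and jar names below are placeholders), the flags would be passed like this:

    # 4 executors x 4 cores = 16 parallel tasks, 12g of RAM per executor
    spark-submit \
      --master yarn \
      --num-executors 4 \
      --executor-cores 4 \
      --executor-memory 12g \
      --class com.example.ReadFiles \
      my-job.jar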
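And here is a minimal sketch of the pack-once, read-many idea in spark-shell (paths are placeholders; see the link above for a fuller example):

    import org.apache.spark.SparkContext._            // implicit Writable conversions
    import org.apache.hadoop.io.compress.SnappyCodec

    // one slow pass over the ~1M small files: each record is (path, contents)
    val files = sc.wholeTextFiles("hdfs:///data/small-files", 32)

    // pack into block-compressed sequence files; later jobs read a few
    // large, splittable files instead of a million tiny ones
    files.saveAsSequenceFile("hdfs:///data/packed", Some(classOf[SnappyCodec]))

    // subsequent iterations read the packed copy instead
    val packed = sc.sequenceFile[String, String]("hdfs:///data/packed")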