I have a cluster on which I execute wholeTextFiles, which should pull about a million text files summing up to approximately 10GB total. I have one NameNode and two DataNodes with 30GB of RAM each, 4 cores each. The data is stored in HDFS.
I don't set any special parameters and the job takes 5 hours just to read the data. Is that expected? Are there any parameters that should speed up the read (Spark configuration, partitioning, number of executors)?
I'm just starting out and I've never had the need to optimize a job before.
EDIT: Additionally, can someone explain exactly how the wholeTextFiles function works? (Not how to use it, but how it is implemented.) I'm very interested in understanding the partition parameter, etc.
EDIT 2: benchmark assessment
So I tried repartition after the wholeTextFiles call, but the problem is the same, because the initial read still uses the pre-defined number of partitions, so there is no performance improvement. Once the data is loaded, the cluster performs really well. I get the following warning message from wholeTextFiles when processing the data (200k files):
15/01/19 03:52:48 WARN scheduler.TaskSetManager: Stage 0 contains a task of very large size (15795 KB). The maximum recommended task size is 100 KB.
Could that be a reason for the bad performance? How do I work around it?
Additionally, when doing a saveAsTextFile, my speed according to the Ambari console is 19 MB/s. When doing a read with wholeTextFiles, I am at 300 KB/s.
It seems that by increasing the number of partitions in wholeTextFiles(path, minPartitions), I am getting better performance. But still only 8 tasks are running at the same time (my number of CPUs). I'm benchmarking to find the limit...
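For reference, here is a minimal sketch of the kind of call I'm benchmarking in spark-shell (the path and the value 64 are made-up examples):

    // sc is provided by spark-shell; the path and 64 are example values
    val files = sc.wholeTextFiles("hdfs:///data/small-files", minPartitions = 64)

    // each record is (file path, full file contents)
    println(files.partitions.size)  // how many read partitions were actually created

    // repartitioning afterwards only re-splits data that has already been read;
    // it does not parallelize the initial read itself
    val spread = files.repartition(64)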
To summarize my recommendations from the comments:
You are running Spark with the default settings: 2 executors (--num-executors) with 1 thread each (--executor-cores) and 512m of RAM (--executor-memory), giving you only 2 threads with 512MB RAM each, which is really small for real-world tasks.

So my recommendation is:
1. Start Spark with --num-executors 4 --executor-memory 12g --executor-cores 4, which would give you more parallelism: 16 threads in this particular case, which means 16 tasks running in parallel (see the example spark-submit invocation below).
2. Use sc.wholeTextFiles to read the files once and then dump them into a compressed sequence file (for instance, with Snappy block-level compression); here's an example of how this can be done: http://0x0fff.com/spark-hdfs-integration/. This will greatly reduce the time needed to read them on the next iteration (a minimal sketch follows after this list).
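For example, assuming you submit the job through spark-submit on YARN (the class and jar names below are placeholders), the flags would be passed like this:

    # 4 executors x 4 cores = 16 parallel tasks, 12g of RAM per executor
    spark-submit \
      --master yarn \
      --num-executors 4 \
      --executor-cores 4 \
      --executor-memory 12g \
      --class com.example.ReadFiles \
      my-job.jar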
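And here is a minimal sketch of the pack-once, read-many idea in spark-shell (paths are placeholders; see the link above for a fuller example):

    import org.apache.spark.SparkContext._            // implicit Writable conversions
    import org.apache.hadoop.io.compress.SnappyCodec

    // one slow pass over the ~1M small files: each record is (path, contents)
    val files = sc.wholeTextFiles("hdfs:///data/small-files", 32)

    // pack into block-compressed sequence files; later jobs read a few
    // large, splittable files instead of a million tiny ones
    files.saveAsSequenceFile("hdfs:///data/packed", Some(classOf[SnappyCodec]))

    // subsequent iterations read the packed copy instead
    val packed = sc.sequenceFile[String, String]("hdfs:///data/packed")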