Spark: sc.wholeTextFiles takes a long time to execute

Stephane

I have a cluster on which I run wholeTextFiles, which should pull in about a million text files totaling approximately 10GB. I have one NameNode and two DataNodes, with 30GB of RAM and 4 cores each. The data is stored in HDFS.

I don't set any special parameters, and the job takes 5 hours just to read the data. Is that expected? Are there any parameters that would speed up the read (Spark configuration, partitioning, number of executors)?

I'm just starting out and I've never needed to optimize a job before.

EDIT: Additionally, can someone explain exactly how the wholeTextFiles function works (not how to use it, but how it is implemented)? I'm very interested in understanding the partition parameter, etc.

EDIT 2: benchmark assessment

So I tried repartition after wholeTextFiles. The problem is the same, because the initial read still uses the predefined number of partitions, so there is no performance improvement. Once the data is loaded, the cluster performs really well. I get the following warning message when processing the data (for 200k files) in the wholeTextFiles stage:

15/01/19 03:52:48 WARN scheduler.TaskSetManager: Stage 0 contains a task of very large size (15795 KB). The maximum recommended task size is 100 KB.

Could that be a reason for the bad performance? How do I mitigate it?

Additionally, when doing a saveAsTextFile, my speed according to the Ambari console is 19MB/s. When doing a read with wholeTextFiles, I am at 300kb/s.

It seems that by increasing the number of partitions in wholeTextFiles(path, partitions), I get better performance. But still only 8 tasks run at the same time (my number of CPUs). I'm benchmarking to find the limit.

0x0FFF

To summarize my recommendations from the comments:

  1. HDFS is not a good fit for storing many small files. First of all, the NameNode stores metadata in memory, so the number of files and blocks you can have is limited (~100m blocks is the maximum for a typical server). Next, each time you read a file, you first query the NameNode for block locations and then connect to the DataNode storing the file. The overhead of these connections and responses is really huge.
  2. Default settings should always be reviewed. By default, Spark starts on YARN with 2 executors (--num-executors) with 1 thread each (--executor-cores) and 512m of RAM (--executor-memory), giving you only 2 threads with 512MB of RAM each, which is really small for real-world tasks.
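To put rough numbers on these two points, here is a back-of-envelope estimate in Python built only from the figures quoted in the question (1 million files, 10GB, 5 hours) and the 2 default threads mentioned above. It assumes 1GB = 10**9 bytes and that both threads were busy for the whole run, so treat the result as an order-of-magnitude illustration rather than a measurement.

```python
# Back-of-envelope estimate from the figures quoted in the question.
# Assumptions (not measured): 1 GB = 10**9 bytes, and the 2 default
# executor threads were reading files for the entire 5 hours.
num_files = 1_000_000
total_bytes = 10 * 10**9
elapsed_s = 5 * 3600
threads = 2

avg_file_bytes = total_bytes / num_files            # average size of one file
files_per_s = num_files / elapsed_s                 # files read per second, cluster-wide
throughput_mb_s = total_bytes / elapsed_s / 10**6   # effective read throughput
ms_per_file_per_thread = threads / files_per_s * 1000  # wall time one thread spends per file

print(f"average file size:        {avg_file_bytes / 1000:.0f} KB")
print(f"files read per second:    {files_per_s:.1f}")
print(f"effective throughput:     {throughput_mb_s:.2f} MB/s")
print(f"time per file per thread: {ms_per_file_per_thread:.0f} ms")
```

Under these assumptions each thread spends about 36 ms on a file of roughly 10 KB, which is consistent with per-file round trips, not data volume, dominating the read time.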

So my recommendation is:

  1. Start Spark with --num-executors 4 --executor-memory 12g --executor-cores 4, which would give you more parallelism - 16 threads in this particular case, which means 16 tasks running in parallel.
  2. Use sc.wholeTextFiles to read the files and then dump them into a compressed sequence file (for instance, with Snappy block-level compression). Here's an example of how this can be done: http://0x0fff.com/spark-hdfs-integration/. This will greatly reduce the time needed to read them on the next iteration.
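As a rough illustration of the second recommendation, here is a minimal PySpark sketch. It is not the code from the linked post. The function names, the `src` and `dst` arguments, and the default of 64 partitions are hypothetical. It assumes `sc` is a live SparkContext and that the Snappy codec is available on the cluster. It does not set the compression type (record vs. block), which is left to the Hadoop configuration.

```python
# Sketch only: pack many small text files into a compressed SequenceFile.
# `src`, `dst`, and `min_partitions` are hypothetical arguments; `sc` is
# expected to be an existing SparkContext supplied by the caller.
SNAPPY_CODEC = "org.apache.hadoop.io.compress.SnappyCodec"


def pack_small_files(sc, src, dst, min_partitions=64):
    """Read (path, content) pairs and save them as a Snappy-compressed SequenceFile."""
    # wholeTextFiles yields one (file path, file content) pair per input file.
    pairs = sc.wholeTextFiles(src, minPartitions=min_partitions)
    # The pairs are written as key/value records of a SequenceFile.
    pairs.saveAsSequenceFile(dst, compressionCodecClass=SNAPPY_CODEC)


def read_packed(sc, dst):
    """Later runs read the packed data back as (path, content) pairs."""
    return sc.sequenceFile(dst)
```

With a running context this could be called once as `pack_small_files(sc, "hdfs:///data/small", "hdfs:///data/packed")` (both paths are placeholders), after which subsequent jobs would start from `read_packed(sc, "hdfs:///data/packed")` instead of re-reading the small files.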
