GraphX EdgeRDD count taking a long time to compute

SanS

I am running standalone Spark and have the code below, which builds an EdgeRDD. The graph edges are loaded from a text file, and there are around 67 million records.

import org.apache.spark.graphx.{Edge, EdgeRDD}
import org.apache.spark.rdd.RDD

val edges: RDD[Edge[Int]] = edge_file.map { line =>
  val x = line.split("\\s+")
  Edge(x(0).toLong, x(1).toLong, x(2).toInt)
}
val edges1: EdgeRDD[Int] = EdgeRDD.fromEdges(edges)

println(edges1.count)

The issue is that just counting them gets stuck during RDD creation. My machine has 24 GB of RAM. What would be the optimal settings for executors and the driver? Or do I need to set any additional configuration in spark-env.sh? I am running Spark 1.4.0.

spark-1.4.0-bin-hadoop2.6/bin/spark-submit --executor-memory 10g --driver-memory 10g --class "GraphParser" --master local[6] target/scala-2.10/simple-project_2.10-1.0.jar 100

Here is the output:

    15/06/17 02:32:42 INFO SparkContext: Starting job: reduce at EdgeRDDImpl.scala:89
    15/06/17 02:32:42 INFO DAGScheduler: Got job 1 (reduce at EdgeRDDImpl.scala:89) with 6 output partitions (allowLocal=false)
    15/06/17 02:32:42 INFO DAGScheduler: Final stage: ResultStage 1(reduce at EdgeRDDImpl.scala:89)
    15/06/17 02:32:42 INFO DAGScheduler: Parents of final stage: List()
    15/06/17 02:32:42 INFO DAGScheduler: Missing parents: List()
    15/06/17 02:32:42 INFO DAGScheduler: Submitting ResultStage 1 (MapPartitionsRDD[11] at map at EdgeRDDImpl.scala:89), which has no missing parents
    15/06/17 02:32:42 INFO MemoryStore: ensureFreeSpace(2904) called with curMem=507670, maxMem=8890959790
    15/06/17 02:32:42 INFO MemoryStore: Block broadcast_3 stored as values in memory (estimated size 2.8 KB, free 8.3 GB)
    15/06/17 02:32:42 INFO MemoryStore: ensureFreeSpace(1766) called with curMem=510574, maxMem=8890959790
    15/06/17 02:32:42 INFO MemoryStore: Block broadcast_3_piece0 stored as bytes in memory (estimated size 1766.0 B, free 8.3 GB)
    15/06/17 02:32:42 INFO BlockManagerInfo: Added broadcast_3_piece0 in memory on localhost:55287 (size: 1766.0 B, free: 8.3 GB)
    15/06/17 02:32:42 INFO SparkContext: Created broadcast 3 from broadcast at DAGScheduler.scala:874
    15/06/17 02:32:42 INFO DAGScheduler: Submitting 6 missing tasks from ResultStage 1 (MapPartitionsRDD[11] at map at EdgeRDDImpl.scala:89)
    15/06/17 02:32:42 INFO TaskSchedulerImpl: Adding task set 1.0 with 6 tasks
    15/06/17 02:32:42 INFO FairSchedulableBuilder: Added task set TaskSet_1 tasks to pool default
    15/06/17 02:32:47 WARN TaskSetManager: Stage 1 contains a task of very large size (140947 KB). The maximum recommended task size is 100 KB.
    15/06/17 02:32:47 INFO TaskSetManager: Starting task 0.0 in stage 1.0 (TID 2, localhost, PROCESS_LOCAL, 144329897 bytes)
    15/06/17 02:32:53 INFO TaskSetManager: Starting task 1.0 in stage 1.0 (TID 3, localhost, PROCESS_LOCAL, 145670467 bytes)
    15/06/17 02:32:58 INFO TaskSetManager: Starting task 2.0 in stage 1.0 (TID 4, localhost, PROCESS_LOCAL, 145674593 bytes)
    15/06/17 02:33:03 INFO TaskSetManager: Starting task 3.0 in stage 1.0 (TID 5, localhost, PROCESS_LOCAL, 145687533 bytes)
    15/06/17 02:33:08 INFO TaskSetManager: Starting task 4.0 in stage 1.0 (TID 6, localhost, PROCESS_LOCAL, 145694247 bytes)
    15/06/17 02:33:12 INFO TaskSetManager: Starting task 5.0 in stage 1.0 (TID 7, localhost, PROCESS_LOCAL, 145686985 bytes)
    15/06/17 02:33:12 INFO Executor: Running task 0.0 in stage 1.0 (TID 2)
    15/06/17 02:33:12 INFO Executor: Running task 2.0 in stage 1.0 (TID 4)
    15/06/17 02:33:12 INFO Executor: Running task 1.0 in stage 1.0 (TID 3)
    15/06/17 02:33:12 INFO Executor: Running task 5.0 in stage 1.0 (TID 7)
    15/06/17 02:33:12 INFO Executor: Running task 4.0 in stage 1.0 (TID 6)
    15/06/17 02:33:12 INFO Executor: Running task 3.0 in stage 1.0 (TID 5)
SanS

After going through the log, I figured out that my task size is too large and it takes time to schedule. Spark itself warns about this:

 15/06/17 02:32:47 WARN TaskSetManager: Stage 1 contains a task of very large size (140947 KB). The maximum recommended task size is 100 KB.
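
For reference, a quick way to see how the RDD is split before triggering any job is to look at the partition array on the driver (a minimal sketch, assuming the edges RDD from the question):

val numPartitions = edges.partitions.length  // driver-side only, runs no tasks
println(s"edges is split across $numPartitions partitions")

With only 6 partitions over roughly 67 million edges, each task ships on the order of 140 MB, which is exactly what the warning reports.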

That led me to partition the data using code like the one below.

// parallelize the edge list into 200 partitions so each task stays small
val graphDocs = EdgeRDD.fromEdges(sc.parallelize(docList, 200))

That fixed the issue. I got the results back in 45 seconds. Hope this will be useful to someone.
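
As a side note, task sizes this large usually mean the edge data was built with sc.parallelize from a collection held on the driver (like docList above), so each task carries its slice of the data inside the serialized task. If the edges live in a text file, another option is to let Spark read the file in many small splits, which keeps the tasks tiny. This is only a sketch under that assumption; the path and the partition count of 200 are placeholders:

import org.apache.spark.graphx.{Edge, EdgeRDD}
import org.apache.spark.rdd.RDD

// Ask for at least 200 input splits; each task then reads its own slice
// of the file instead of having the data serialized into the task.
val lines = sc.textFile("edges.txt", minPartitions = 200)
val fileEdges: RDD[Edge[Int]] = lines.map { line =>
  val x = line.split("\\s+")
  Edge(x(0).toLong, x(1).toLong, x(2).toInt)
}
val fileEdgeRDD: EdgeRDD[Int] = EdgeRDD.fromEdges(fileEdges)
println(fileEdgeRDD.count)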
