Container killed while performing a flush on a worker after finishing a superstep, bringing the whole application down - Giraph


I am running a Giraph application on EMR.

I am using a cluster of 1 master and 10 slaves, all m3.2xlarge machines.

The application is basically a BFS over the Spanish version of Wikipedia (I adapted the Wikipedia data so it fits Giraph).

I run the application the following way:

/home/hadoop/bin/yarn jar /home/hadoop/giraph/giraph.jar ar.edu.info.unlp.tesina.lectura.grafo.algoritmos.masivos.BusquedaDeCaminosNavegacionalesWikiquotesMasivo /tmp/vertices.txt 4 -@- 1 ar.edu.info.unlp.tesina.lectura.grafo.BusquedaDeCaminosNavegacionalesWikiquote -vif ar.edu.info.unlp.tesina.vertice.estructuras.IdTextWithComplexValueInputFormat -vip /user/hduser/input/grafo-wikipedia.txt -vof ar.edu.info.unlp.tesina.vertice.estructuras.IdTextWithComplexValueOutputFormat -op /user/hduser/output/caminosNavegacionales -w 10 -yh 11500 -ca giraph.metrics.enable=true,giraph.useOutOfCoreMessages=true,giraph.isStaticGraph=true,giraph.numInputThreads=4,giraph.numOutputThreads=4
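
For reference, here is my reading of the options being passed (worth double-checking against the Giraph documentation):

-vif / -vip   vertex input format class and vertex input path
-vof / -op    vertex output format class and output path
-w 10         number of workers, one per slave
-yh 11500     heap per YARN container, in MB (this becomes -Xmx11500M, as the logs below show)
-ca ...       custom configuration: metrics, out-of-core messages, static graph, and 4 input/output threads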

I can run the application successfully with 3 supersteps, but when it has to do 4 supersteps the application fails: one container is killed, and the rest die along with it.

Searching in the Giraph application master logs, it says the following:

16/08/15 03:44:32 INFO impl.ContainerManagementProtocolProxy: Opening proxy : ip-172-31-0-147.sa-east-1.compute.internal:9103
16/08/15 03:44:32 INFO impl.NMClientAsyncImpl: Processing Event EventType: START_CONTAINER for Container container_1471231949464_0001_01_000005
16/08/15 03:44:32 INFO impl.ContainerManagementProtocolProxy: Opening proxy : ip-172-31-0-145.sa-east-1.compute.internal:9103
16/08/15 03:44:32 INFO impl.NMClientAsyncImpl: Processing Event EventType: QUERY_CONTAINER for Container container_1471231949464_0001_01_000009
16/08/15 03:44:32 INFO impl.NMClientAsyncImpl: Processing Event EventType: QUERY_CONTAINER for Container container_1471231949464_0001_01_000011
16/08/15 03:44:32 INFO impl.NMClientAsyncImpl: Processing Event EventType: QUERY_CONTAINER for Container container_1471231949464_0001_01_000004
16/08/15 03:44:32 INFO impl.NMClientAsyncImpl: Processing Event EventType: QUERY_CONTAINER for Container container_1471231949464_0001_01_000010
16/08/15 03:44:32 INFO impl.NMClientAsyncImpl: Processing Event EventType: QUERY_CONTAINER for Container container_1471231949464_0001_01_000006
16/08/15 03:44:32 INFO impl.NMClientAsyncImpl: Processing Event EventType: QUERY_CONTAINER for Container container_1471231949464_0001_01_000007
16/08/15 03:44:32 INFO impl.NMClientAsyncImpl: Processing Event EventType: QUERY_CONTAINER for Container container_1471231949464_0001_01_000008
16/08/15 03:44:32 INFO impl.NMClientAsyncImpl: Processing Event EventType: QUERY_CONTAINER for Container container_1471231949464_0001_01_000005
16/08/15 03:44:32 INFO impl.NMClientAsyncImpl: Processing Event EventType: QUERY_CONTAINER for Container container_1471231949464_0001_01_000002
16/08/15 03:44:32 INFO impl.NMClientAsyncImpl: Processing Event EventType: QUERY_CONTAINER for Container container_1471231949464_0001_01_000012
16/08/15 03:44:32 INFO impl.NMClientAsyncImpl: Processing Event EventType: QUERY_CONTAINER for Container container_1471231949464_0001_01_000003
16/08/15 03:46:53 INFO yarn.GiraphApplicationMaster: Got response from RM for container ask, completedCnt=1
16/08/15 03:46:53 INFO yarn.GiraphApplicationMaster: Got container status for containerID=container_1471231949464_0001_01_000008, state=COMPLETE, exitStatus=143, diagnostics=Container [pid=4455,containerID=container_1471231949464_0001_01_000008] is running beyond physical memory limits. Current usage: 11.4 GB of 11.3 GB physical memory used; 12.6 GB of 56.3 GB virtual memory used. Killing container.
Dump of the process-tree for container_1471231949464_0001_01_000008 :
        |- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE
        |- 4459 4455 4455 4455 (java) 13568 5567 13419675648 2982187 java -Xmx11500M -Xms11500M -cp .:${CLASSPATH}:./*:$HADOOP_CONF_DIR:$HADOOP_COMMON_HOME/share/hadoop/common/*:$HADOOP_COMMON_HOME/share/hadoop/common/lib/*:$HADOOP_HDFS_HOME/share/hadoop/hdfs/*:$HADOOP_HDFS_HOME/share/hadoop/hdfs/lib/*:$HADOOP_YARN_HOME/share/hadoop/yarn/*:$HADOOP_YARN_HOME/share/hadoop/yarn/lib/*:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/lib/*:/usr/share/aws/emr/auxlib/*:$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/*:$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/lib/*:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/lib/*:/usr/share/aws/emr/auxlib/*:./*:/home/hadoop/conf:/home/hadoop/share/hadoop/common/*:/home/hadoop/share/hadoop/common/lib/*:/home/hadoop/share/hadoop/hdfs/*:/home/hadoop/share/hadoop/hdfs/lib/*:/home/hadoop/share/hadoop/yarn/*:/home/hadoop/share/hadoop/yarn/lib/*:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/lib/*:/usr/share/aws/emr/auxlib/*:/home/hadoop/share/hadoop/mapreduce/*:/home/hadoop/share/hadoop/mapreduce/lib/*:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/lib/*:/usr/share/aws/emr/auxlib/*:./*:/home/hadoop/conf:/home/hadoop/share/hadoop/common/*:/home/hadoop/share/hadoop/common/lib/*:/home/hadoop/share/hadoop/hdfs/*:/home/hadoop/share/hadoop/hdfs/lib/*:/home/hadoop/share/hadoop/yarn/*:/home/hadoop/share/hadoop/yarn/lib/*:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/lib/*:/usr/share/aws/emr/auxlib/*:/home/hadoop/share/hadoop/mapreduce/*:/home/hadoop/share/hadoop/mapreduce/lib/*:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/lib/*:/usr/share/aws/emr/auxlib/*:./*:/home/hadoop/conf:/home/hadoop/share/hadoop/common/*:/home/hadoop/share/hadoop/common/lib/*:/home/hadoop/share/hadoop/hdfs/*:/home/hadoop/share/hadoop/hdfs/lib/*:/home/hadoop/share/hadoop/yarn/*:/home/hadoop/share/hadoop/yarn/lib/*:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/lib/*:/usr/share/aws/emr/auxlib/*:/home/hadoop/share/hadoop/mapreduce/*:/home/hadoop/share/hadoop/mapreduce/lib/*:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/lib/*:/usr/share/aws/emr/auxlib/* org.apache.giraph.yarn.GiraphYarnTask 1471231949464 1 8 1 
        |- 4455 2706 4455 4455 (bash) 0 0 115875840 807 /bin/bash -c java -Xmx11500M -Xms11500M -cp .:${CLASSPATH}:./*:$HADOOP_CONF_DIR:$HADOOP_COMMON_HOME/share/hadoop/common/*:$HADOOP_COMMON_HOME/share/hadoop/common/lib/*:$HADOOP_HDFS_HOME/share/hadoop/hdfs/*:$HADOOP_HDFS_HOME/share/hadoop/hdfs/lib/*:$HADOOP_YARN_HOME/share/hadoop/yarn/*:$HADOOP_YARN_HOME/share/hadoop/yarn/lib/*:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/lib/*:/usr/share/aws/emr/auxlib/*:$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/*:$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/lib/*:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/lib/*:/usr/share/aws/emr/auxlib/*:./*:/home/hadoop/conf:/home/hadoop/share/hadoop/common/*:/home/hadoop/share/hadoop/common/lib/*:/home/hadoop/share/hadoop/hdfs/*:/home/hadoop/share/hadoop/hdfs/lib/*:/home/hadoop/share/hadoop/yarn/*:/home/hadoop/share/hadoop/yarn/lib/*:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/lib/*:/usr/share/aws/emr/auxlib/*:/home/hadoop/share/hadoop/mapreduce/*:/home/hadoop/share/hadoop/mapreduce/lib/*:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/lib/*:/usr/share/aws/emr/auxlib/* org.apache.giraph.yarn.GiraphYarnTask 1471231949464 1 8 1 1>/mnt/var/log/hadoop/userlogs/application_1471231949464_0001/container_1471231949464_0001_01_000008/task-8-stdout.log 2>/mnt/var/log/hadoop/userlogs/application_1471231949464_0001/container_1471231949464_0001_01_000008/task-8-stderr.log  

Container killed on request. Exit code is 143
Container exited with a non-zero exit code 143

16/08/15 03:46:53 INFO yarn.GiraphApplicationMaster: After completion of one conatiner. current status is: completedCount :1 containersToLaunch :11 successfulCount :0 failedCount :1
16/08/15 03:46:55 INFO yarn.GiraphApplicationMaster: Got response from RM for container ask, completedCnt=7
16/08/15 03:46:55 INFO yarn.GiraphApplicationMaster: Got container status for containerID=container_1471231949464_0001_01_000002, state=COMPLETE, exitStatus=1, diagnostics=Exception from container-launch: org.apache.hadoop.util.Shell$ExitCodeException: 
org.apache.hadoop.util.Shell$ExitCodeException: 
        at org.apache.hadoop.util.Shell.runCommand(Shell.java:501)
        at org.apache.hadoop.util.Shell.run(Shell.java:418)
        at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:655)
        at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:200)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:300)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:81)
        at java.util.concurrent.FutureTask.run(FutureTask.java:262)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)


Container exited with a non-zero exit code 1

16/08/15 03:46:55 INFO yarn.GiraphApplicationMaster: Got container status for containerID=container_1471231949464_0001_01_000012, state=COMPLETE, exitStatus=1, diagnostics=Exception from container-launch: org.apache.hadoop.util.Shell$ExitCodeException: 
org.apache.hadoop.util.Shell$ExitCodeException: 
        at org.apache.hadoop.util.Shell.runCommand(Shell.java:501)
        at org.apache.hadoop.util.Shell.run(Shell.java:418)
        at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:655)
        at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:200)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:300)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:81)
        at java.util.concurrent.FutureTask.run(FutureTask.java:262)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)


Container exited with a non-zero exit code 1

16/08/15 03:46:55 INFO yarn.GiraphApplicationMaster: Got container status for containerID=container_1471231949464_0001_01_000006, state=COMPLETE, exitStatus=1, diagnostics=Exception from container-launch: org.apache.hadoop.util.Shell$ExitCodeException: 
org.apache.hadoop.util.Shell$ExitCodeException: 
        at org.apache.hadoop.util.Shell.runCommand(Shell.java:501)
        at org.apache.hadoop.util.Shell.run(Shell.java:418)
        at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:655)
        at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:200)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:300)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:81)
        at java.util.concurrent.FutureTask.run(FutureTask.java:262)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)


Container exited with a non-zero exit code 1

16/08/15 03:46:55 INFO yarn.GiraphApplicationMaster: Got container status for containerID=container_1471231949464_0001_01_000007, state=COMPLETE, exitStatus=1, diagnostics=Exception from container-launch: org.apache.hadoop.util.Shell$ExitCodeException: 
org.apache.hadoop.util.Shell$ExitCodeException: 
        at org.apache.hadoop.util.Shell.runCommand(Shell.java:501)
        at org.apache.hadoop.util.Shell.run(Shell.java:418)
        at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:655)
        at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:200)

So it seems container 8 had a memory problem, but these are the last log lines of container 8 (the container that gets killed):

16/08/15 03:46:52 INFO graph.ComputeCallable: call: Computation took 23.90834 secs for 10 partitions on superstep 3.  Flushing started
16/08/15 03:46:52 INFO worker.BspServiceWorker: finishSuperstep: Waiting on all requests, superstep 3 Memory (free/total/max) = 4516.47M / 10619.50M / 10619.50M
16/08/15 03:46:52 INFO netty.NettyClient: logInfoAboutOpenRequests: Waiting interval of 15000 msecs, 1307 open requests, waiting for it to be <= 0, MBytes/sec received = 0.0029, MBytesReceived = 0.0678, ave received req MBytes = 0, secs waited = 23.332
MBytes/sec sent = 143.2912, MBytesSent = 3343.4141, ave sent req MBytes = 0.4999, secs waited = 23.332
16/08/15 03:46:52 INFO netty.NettyClient: logInfoAboutOpenRequests: 548 requests for taskId=10, 504 requests for taskId=0, 251 requests for taskId=5, 1 requests for taskId=4, 1 requests for taskId=7, 1 requests for taskId=8, 

So, if I understand this right, the container had 4516.47M free before flushing; while flushing it consumed all of that and was killed by YARN's physical-memory check (per the "running beyond physical memory limits" diagnostics above) when it needed more? One detail in the NettyClient lines above seems relevant: at kill time there were still 1307 open send requests at roughly 0.5 MB each on average, i.e. on the order of 650 MB that could still be sitting in (mostly off-heap) send buffers.
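
To see where the memory actually goes, something like the following could be logged from a worker around the flush. MemoryProbe is a hypothetical helper of mine, not Giraph code; it contrasts what -Xmx caps (the heap) with what YARN's physical-memory check also counts (direct, off-heap buffers):

import java.lang.management.BufferPoolMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.util.List;

// Hypothetical helper: prints heap usage next to direct-buffer usage,
// so the YARN "physical memory" number can be compared against -Xmx.
public final class MemoryProbe {
    public static void main(String[] args) {
        MemoryMXBean mem = ManagementFactory.getMemoryMXBean();
        long heapUsed = mem.getHeapMemoryUsage().getUsed() >> 20;
        long heapMax = mem.getHeapMemoryUsage().getMax() >> 20;
        System.out.printf("heap: %d / %d MB%n", heapUsed, heapMax);
        // Direct buffers do not count against -Xmx, but the OS (and
        // therefore the NodeManager's memory check) still sees them.
        List<BufferPoolMXBean> pools =
            ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class);
        for (BufferPoolMXBean pool : pools) {
            System.out.printf("%s buffers: %d MB%n",
                pool.getName(), pool.getMemoryUsed() >> 20);
        }
    }
}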

What I don't understand is why the flush needs so much memory. Isn't it basically saving the results to disk for the next superstep? In theory it should need no memory at all.
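
My current guess (an assumption on my part, not confirmed from the Giraph source) is that "Flushing started" does not mean writing to disk at all: finishSuperstep waits for the open Netty send requests to drain, and the serialized message bytes sit largely in direct, off-heap buffers. A minimal demonstration of why such memory escapes -Xmx:

import java.nio.ByteBuffer;

// Demonstration only: direct buffers are allocated outside the Java
// heap, so -Xmx does not bound them, yet the process RSS grows.
// Run with: java -Xmx64m -XX:MaxDirectMemorySize=512m DirectBufferDemo
public class DirectBufferDemo {
    public static void main(String[] args) {
        ByteBuffer[] buffers = new ByteBuffer[500];
        for (int i = 0; i < buffers.length; i++) {
            buffers[i] = ByteBuffer.allocateDirect(1 << 20); // 1 MB each
        }
        System.out.println("500 MB allocated outside the heap");
    }
}

The heap stays far below 64 MB here, yet the process grows by about 500 MB of memory that the NodeManager's physical-memory check counts, which matches the "11.4 GB of 11.3 GB physical memory used" diagnostics above.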

Answer:

It seems the flush process itself can consume memory. Adding more memory to each container was the only solution I could find.
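
Two knobs worth trying, with illustrative values I have not tested on this dataset. Either run fewer, larger workers so each container has more headroom, or throttle the number of in-flight Netty requests so the flush cannot buffer an unbounded amount of off-heap data (giraph.waitForRequestsConfirmation and giraph.maxNumberOfOpenRequests are standard Giraph options):

# fewer, larger workers: 5 x 20000 MB instead of 10 x 11500 MB
... -w 5 -yh 20000 ...

# cap the number of open requests during the flush
... -ca giraph.waitForRequestsConfirmation=true,giraph.maxNumberOfOpenRequests=500 ...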
