Kafka topic partitions to Spark streaming

Srdjan Nikitovic

I have some use cases I would like clarified, concerning how Kafka topic partitioning maps to Spark Streaming resource utilization.

I use Spark standalone mode, so the only settings I have are "total number of executors" and "executor memory". As far as I know, and according to the documentation, the way to introduce parallelism into Spark Streaming is to use a partitioned Kafka topic: when I use the spark-kafka direct stream integration, the RDD will have the same number of partitions as Kafka.
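For reference, a minimal sketch of that setup, assuming the spark-streaming-kafka 0.8 direct-stream API; the broker address, topic name, and config values are placeholders. Each micro-batch RDD comes back with one partition per Kafka partition:

```scala
import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

// The two standalone-mode knobs mentioned above; values are placeholders.
val conf = new SparkConf()
  .setAppName("kafka-direct-demo")
  .set("spark.cores.max", "2")        // total cores across all executors
  .set("spark.executor.memory", "2g")

val ssc = new StreamingContext(conf, Seconds(10))

val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")
val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, Set("my-topic"))

stream.foreachRDD { rdd =>
  // 1:1 mapping: this equals the topic's Kafka partition count.
  println(s"partitions in this batch: ${rdd.partitions.length}")
}

ssc.start()
ssc.awaitTermination()
```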

So if I have 1 partition in the topic and 1 executor core, that core will read from Kafka sequentially.

What happens if I have:

  • 2 partitions in the topic and only 1 executor core? Will that core read from one partition first and then from the second, so that there is no benefit to partitioning the topic?

  • 2 partitions in the topic and 2 cores? Will one executor core then read from one partition, and the second core from the other?

  • 1 Kafka partition and 2 executor cores?

Thank you.

AngerClown

The basic rule is that you can scale up to the number of Kafka partitions. If you set spark.executor.cores greater than the number of partitions, some of the threads will be idle. If it is less than the number of partitions, Spark will have threads read from one partition and then the other. So:

  1. 2 partitions, 1 core: reads from one partition and then the other. (I am not sure how Spark decides how much to read from each before switching.)

  2. 2 partitions, 2 cores: parallel execution, one core per partition.

  3. 1 partition, 2 cores: one thread is idle.

For case #1, note that having more partitions than executor cores is fine, since it allows you to scale out later without having to re-partition. The trick is to make sure that the number of partitions is evenly divisible by the number of cores. Spark has to process all the partitions before passing data on to the next step in the pipeline, so "remainder" partitions can slow down processing. For example, with 5 partitions and 4 threads, processing takes the time of 2 waves: 4 partitions at once, then one thread running the 5th partition by itself.
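A quick way to see the remainder effect (a hypothetical helper for illustration, not a Spark API): a batch takes roughly as many "waves" as ceil(partitions / cores):

```scala
// Hypothetical helper illustrating the wave arithmetic; not a Spark API.
def waves(numPartitions: Int, numCores: Int): Int =
  math.ceil(numPartitions.toDouble / numCores).toInt

waves(4, 4) // 1: all four partitions processed in parallel
waves(5, 4) // 2: four at once, then the fifth alone on one thread
waves(8, 4) // 2: two full waves, no idle cores in either
```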

Also note that you may see better processing throughput if you keep the number of partitions per RDD the same throughout the pipeline, by explicitly setting the number of data partitions in functions like reduceByKey().
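Continuing the sketch above, here is one way that could look; the partition count of 4 is an assumption standing in for the topic's actual Kafka partition count. The second argument to reduceByKey() sets the number of result partitions, rather than leaving it to spark.default.parallelism:

```scala
val numKafkaPartitions = 4 // assumed to match the topic's partition count

stream.foreachRDD { rdd =>
  val counts = rdd
    .map { case (_, value) => (value, 1) }
    .reduceByKey(_ + _, numKafkaPartitions) // keep partition count stable
  counts.collect().foreach(println) // collect to the driver just for the demo
}
```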
