Kafka consumer groups suddenly stopped balancing messages among instances

debugcn 投稿 Dev

Lisandro

We have a microservice architecture communicated by Kafka on Confluent where each service is set in its own consumer group in order to balance message delivery between the multiple instances.

For example:

SERVICE_A_INSTANCE_1 (CONSUMER_GROUP_A)
SERVICE_A_INSTANCE_2 (CONSUMER_GROUP_A)
SERVICE_A_INSTANCE_3 (CONSUMER_GROUP_A)

SERVICE_B_INSTANCE_1 (CONSUMER_GROUP_B)
SERVICE_B_INSTANCE_2 (CONSUMER_GROUP_B)

When a message is emitted it should only be consumed by one instance of each consumer group.

This worked fine until two days ago. All of the sudden, each message is being delivered to all the instances, so each message is processed multiple times. Basically, the consumer-group stopped working and messages are not being distributed.

Important points:

We use Kafka paas in Confluent on GCP.
We tested this in a different environment and everything worked as expected
No changes have been made on our consumers
No changes have been made on our part to the cluster (we cant know if Confluent changed something)

We suspect it might be a problem on Confluent or an update that is not compatible with our current configuration. Kafka 2.2.0 was recently released and it has some changes to consumer groups behavior.

We are currently working on migrating to AWS MSK to see if the issue prevails.

Any ideas on what could be causing this?

Lisandro

TL;DR: We solved the issue by moving away from Confluent into our own Kafka cluster on GCP.

I will answer my own question since its been a while and we have already solved this. Also, my insights might help others make more informed decisions on where to deploy their Kafka infrastructure.

Unfortunately we could not get to the bottom of the problem with Confluent. It is most likely something on their side because we simply migrated to our own self managed instances on GCP and everything went back to normal.

Some important clarifications before my final thoughts and warnings about using Confluent as a managed Kafka service:

We think this is related to something that affected Node.js in particular. We tested external libraries in languages other than Node and the behavior was as expected. When testing on multiple of the most popular Node libraries the problem persisted.
We did not have premium support with Confluent.
I cannot confirm that this issue is not our fault.

With all of those points in mind, our conclusion is that for companies that decide on using a managed service with Confluent, its best to calculate costs with premium support included. Otherwise, Kafka turns into a completely closed blackbox, making it impossible to diagnose issues. In my personal opinion, the dependency on the Confluent team during a problem is so large that not having them ready to help when needed renders the service non-production ready.

この記事はインターネットから収集されたものであり、転載の際にはソースを示してください。

侵害の場合は、連絡してください[email protected]

編集2021-06-10

コメントを追加

サインイン

分類Dev

Related 関連記事

記事