We have a microservice architecture in which the services communicate through Kafka on Confluent. Each service has its own consumer group so that message delivery is balanced across its instances.
For example:
SERVICE_A_INSTANCE_1 (CONSUMER_GROUP_A)
SERVICE_A_INSTANCE_2 (CONSUMER_GROUP_A)
SERVICE_A_INSTANCE_3 (CONSUMER_GROUP_A)
SERVICE_B_INSTANCE_1 (CONSUMER_GROUP_B)
SERVICE_B_INSTANCE_2 (CONSUMER_GROUP_B)
When a message is emitted it should only be consumed by one instance of each consumer group.
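To make the expected behaviour concrete, here is a toy model of it in plain Python. It is only a sketch: the round-robin assignment below is a simplification rather than Kafka's actual assignor, and the partition count of 6 is an assumption made for illustration. The point it shows is that each partition has exactly one owner inside a group, so a message lands on one instance per group.

```python
def assign_partitions(partitions, members):
    """Toy round-robin assignment: each partition gets exactly one owner in the group."""
    ordered = sorted(members)
    assignment = {member: [] for member in ordered}
    for index, partition in enumerate(sorted(partitions)):
        assignment[ordered[index % len(ordered)]].append(partition)
    return assignment


def deliver(partition, groups, partitions):
    """Return, per consumer group, the single instance that owns `partition`."""
    receivers = {}
    for group_id, members in groups.items():
        for member, owned in assign_partitions(partitions, members).items():
            if partition in owned:
                receivers[group_id] = member
    return receivers


GROUPS = {
    "CONSUMER_GROUP_A": [
        "SERVICE_A_INSTANCE_1",
        "SERVICE_A_INSTANCE_2",
        "SERVICE_A_INSTANCE_3",
    ],
    "CONSUMER_GROUP_B": ["SERVICE_B_INSTANCE_1", "SERVICE_B_INSTANCE_2"],
}
PARTITIONS = list(range(6))  # assumed partition count, not taken from our cluster

if __name__ == "__main__":
    for p in PARTITIONS:
        # One receiver per group, never all instances of a group.
        print(p, deliver(p, GROUPS, PARTITIONS))
```

In this model a message is seen by two instances in total, one from each group, which is the behaviour we had until the problem started.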
This worked fine until two days ago. All of a sudden, each message is being delivered to every instance, so each message is processed multiple times. In effect, the consumer groups stopped working and messages are no longer distributed among the instances.
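One way to confirm the symptom from the service side is to have every instance log the group, topic, partition and offset of each record it processes, and then look for offsets handled by more than one instance of the same group. The sketch below assumes such a log exists; the record layout, the topic name `orders` and the sample entries are hypothetical and not taken from our system.

```python
from collections import defaultdict


def find_duplicate_deliveries(records):
    """Map (group, topic, partition, offset) to the instances that processed it,
    keeping only offsets processed by more than one instance of the same group.

    `records` is an iterable of (group_id, instance, topic, partition, offset).
    """
    seen = defaultdict(set)
    for group_id, instance, topic, partition, offset in records:
        seen[(group_id, topic, partition, offset)].add(instance)
    return {key: sorted(instances) for key, instances in seen.items() if len(instances) > 1}


# Hypothetical processing log gathered from the instances.
SAMPLE_LOG = [
    ("CONSUMER_GROUP_A", "SERVICE_A_INSTANCE_1", "orders", 0, 41),
    ("CONSUMER_GROUP_A", "SERVICE_A_INSTANCE_2", "orders", 0, 41),
    ("CONSUMER_GROUP_A", "SERVICE_A_INSTANCE_3", "orders", 0, 41),
    ("CONSUMER_GROUP_B", "SERVICE_B_INSTANCE_1", "orders", 0, 41),
    ("CONSUMER_GROUP_A", "SERVICE_A_INSTANCE_1", "orders", 1, 7),
]

if __name__ == "__main__":
    for key, instances in find_duplicate_deliveries(SAMPLE_LOG).items():
        print(key, "processed by", instances)
```

An occasional hit can be legitimate, since a rebalance may cause an uncommitted offset to be redelivered to another instance. What we are seeing is different: every offset shows up on every instance of the group.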
Important points:
We suspect it might be a problem on Confluent's side, or an update that is not compatible with our current configuration. Kafka 2.2.0 was recently released and it includes some changes to consumer group behavior.
We are currently working on migrating to AWS MSK to see if the issue persists.
Any ideas on what could be causing this?
TL;DR: We solved the issue by moving away from Confluent into our own Kafka cluster on GCP.
I will answer my own question since it's been a while and we have already solved this. Also, my insights might help others make more informed decisions on where to deploy their Kafka infrastructure.
Unfortunately, we could not get to the bottom of the problem with Confluent. It was most likely something on their side, because we simply migrated to our own self-managed instances on GCP and everything went back to normal.
Some important clarifications before my final thoughts and warnings about using Confluent as a managed Kafka service:
With all of those points in mind, our conclusion is that companies that decide to use Confluent as a managed service should calculate costs with premium support included. Otherwise, Kafka turns into a completely closed black box, making it impossible to diagnose issues. In my personal opinion, the dependency on the Confluent team during a problem is so large that not having them ready to help when needed renders the service non-production-ready.