Conf42: Cloud Native 2022

...

Migrating to Multi Cluster Managed Kafka with 0 Downtime

Natan Silnitsky
Senior Software Engineer @ Wix




As Wix's Kafka usage grew to 1.5B messages per day, >10K topics, and >100K leader partitions serving 2,000 microservices, we decided to migrate from a self-hosted cluster per data center to a managed cloud service (Confluent Cloud) with a multi-cluster setup.

This talk is about how we migrated successfully with 0 downtime while serving full traffic, and the lessons we learned along the way.

These lessons include:

1. Automation, automation, automation: at this scale, the entire process has to be fully automated.
2. Prefer a gradual approach: for example, migrate topics in small chunks rather than all at once. This reduces risk if things go wrong.
3. First migrate test topics with relayed real traffic, so the data is real but production is not affected.
4. Clean up first: avoid migrating unused topics or topics with too many unnecessary partitions.
5. Adapt to Confluent Cloud APIs, e.g. for lag monitoring.
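As a rough illustration of the "clean up first" and "gradual approach" lessons, the following Python sketch filters out unused topics and splits the rest into small migration batches. It is not Wix's actual tooling: the `Topic` record, the `messages_last_30d` usage metric, and the function names are all hypothetical, chosen only to show the shape of such a plan.

```python
from dataclasses import dataclass
from typing import Iterator, List


@dataclass(frozen=True)
class Topic:
    """Hypothetical topic record; field names are assumptions for this sketch."""
    name: str
    partitions: int
    messages_last_30d: int  # assumed usage metric, not a real Kafka field


def drop_unused(topics: List[Topic]) -> List[Topic]:
    """Clean up first: keep only topics that saw any traffic recently."""
    return [t for t in topics if t.messages_last_30d > 0]


def batches(topics: List[Topic], size: int) -> Iterator[List[Topic]]:
    """Gradual approach: yield topics in consecutive chunks of at most `size`."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(topics), size):
        yield topics[start:start + size]


def plan_migration(topics: List[Topic], batch_size: int) -> List[List[Topic]]:
    """Combine both steps into an ordered list of migration batches."""
    return list(batches(drop_unused(topics), batch_size))
```

Each batch could then be migrated and verified on its own before the next one starts, so a failure affects only a small set of topics.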
