As Wix's Kafka usage grew to 1.5B messages per day, more than 10K topics, and more than 100K leader partitions serving 2,000 microservices, we decided to migrate from a self-managed cluster per data center to a managed cloud service (Confluent Cloud) with a multi-cluster setup.
This talk is about how we migrated successfully with zero downtime and under full traffic, and the lessons we learned along the way.
These lessons include:
1. Automation, Automation, Automation: at this scale, the whole process has to be completely automated.
2. Prefer a gradual approach: e.g. migrate topics in small chunks rather than all at once. This reduces the risk if things go wrong.
3. First migrate test topics with relayed real traffic: the data is real, but production is not affected.
4. Clean up first: avoid migrating unused topics or topics with too many unnecessary partitions.
5. Adapt to Confluent Cloud APIs: e.g. for lag monitoring.
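The cleanup and gradual-migration lessons above can be sketched in a few lines of Python. This is only an illustration, not the tooling used at Wix: the `Topic` record, the `messages_last_30d` usage metric, and the function names are all hypothetical, and the actual migration of each chunk is left out.

```python
from dataclasses import dataclass
from typing import Iterator, List


@dataclass(frozen=True)
class Topic:
    name: str
    partitions: int
    messages_last_30d: int  # hypothetical usage metric, assumed to be available


def drop_unused(topics: List[Topic]) -> List[Topic]:
    """Cleanup first: keep only topics that carried traffic recently."""
    return [t for t in topics if t.messages_last_30d > 0]


def in_chunks(topics: List[Topic], size: int) -> Iterator[List[Topic]]:
    """Gradual approach: yield the topics in small batches of at most `size`."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(topics), size):
        yield topics[start:start + size]


def plan_migration(topics: List[Topic], chunk_size: int) -> List[List[Topic]]:
    """Combine both steps into an ordered list of migration batches."""
    return list(in_chunks(drop_unused(topics), chunk_size))
```

Each batch returned by `plan_migration` would then be migrated and verified before the next one starts, so that a failure affects only a small set of topics.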