ChannelLife India - Industry insider news for technology resellers

Flipkart wins CNCF award for Kubernetes chaos testing

Thu, 25th Jun 2026
Joseph Gabriel Lagonsin, News Editor

Flipkart has won the CNCF End User Case Study Contest for its work on Kubernetes and chaos engineering, with the award recognising its central reliability engineering team and use of LitmusChaos.

The recognition focuses on a custom multi-tenant chaos engineering platform built to test and strengthen the resilience of Flipkart's sprawling technology estate. The system runs fault-injection exercises across microservices and legacy virtual machine workloads before periods of heavy consumer demand.

Flipkart, one of India's largest eCommerce groups, runs hundreds of tightly linked microservices across Kubernetes clusters and virtual machines. That complexity led the company to make outage testing more routine, rather than relying on ad hoc responses when systems failed.

After reviewing several tools, the engineering team chose LitmusChaos, an open-source project under the Cloud Native Computing Foundation, to orchestrate testing in its Kubernetes environment. Engineers then extended it with a hybrid multi-tenant design, a DaemonSet-based model for parallel fault injection, a Script Runner fault for dynamic target selection, and an internal hybrid extension for workloads still running on virtual machines.

Those additions were designed to solve practical issues that emerged at scale. One involved scheduling bottlenecks tied to helper pods during core operations, which the team addressed by shifting to a persistent node-level DaemonSet that runs concurrent injections through parallel shell sessions.
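The idea of a long-lived node agent running several injections at once through parallel shell sessions can be illustrated with a minimal Python sketch. It is not Flipkart's or LitmusChaos's code: the names `run_injection` and `run_parallel`, the target labels and the `echo` commands are hypothetical placeholders, and a real agent would run actual fault scripts and handle timeouts and cleanup.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass
class InjectionResult:
    target: str
    returncode: int
    output: str


def run_injection(target: str, command: str, timeout_s: float = 30.0) -> InjectionResult:
    """Run one fault command in its own shell session and capture the outcome."""
    proc = subprocess.run(
        command, shell=True, capture_output=True, text=True, timeout=timeout_s
    )
    return InjectionResult(target, proc.returncode, proc.stdout.strip())


def run_parallel(injections: dict[str, str], max_sessions: int = 8) -> list[InjectionResult]:
    """Run all injections concurrently, one shell session per target.

    Results come back in the same order as the input mapping.
    """
    with ThreadPoolExecutor(max_workers=max_sessions) as pool:
        futures = [pool.submit(run_injection, t, c) for t, c in injections.items()]
        return [f.result() for f in futures]


if __name__ == "__main__":
    # Placeholder commands (assumption): stand-ins for real fault scripts.
    demo = {"pod-a": "echo inject pod-a", "pod-b": "echo inject pod-b"}
    for result in run_parallel(demo):
        print(result)
```

Because the worker pool already exists when an experiment starts, no new helper process has to be scheduled per fault, which is the kind of bottleneck the persistent DaemonSet was meant to remove.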

Scale testing

According to CNCF, Flipkart now carries out about 90% of its infrastructure chaos experiments in staging environments running on Kubernetes. The goal is to expose weaknesses before major sales periods, including India's festive shopping peaks, when transaction volumes can surge and failures in one service can cascade through the wider system.

The work helped reduce over-provisioning bottlenecks in clusters and assess whether observability systems provided enough visibility into failures. It also shifted internal operating practices from reactive incident handling to rehearsed procedures tied to updated incident runbooks.

One team used the platform's Script Runner fault to test leader-election behaviour in high-availability database environments. This allowed engineers to simulate failures more precisely and assess whether systems behaved as intended when a primary node became unavailable.
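A toy Python sketch shows what dynamic target selection means in a leader-election test: the script finds whichever node is primary at run time, fails it, and checks that exactly one replacement takes over. Everything here is an assumption made for illustration, including the `Node` type, the node names and the simplistic "first healthy node by name" election rule; it does not reproduce the Script Runner fault or any real database's failover logic.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    role: str          # "primary" or "replica"
    healthy: bool = True


def select_primary(nodes: list[Node]) -> Node:
    """Dynamic target selection: find whichever node is primary right now."""
    primaries = [n for n in nodes if n.role == "primary" and n.healthy]
    if len(primaries) != 1:
        raise RuntimeError(f"expected one healthy primary, found {len(primaries)}")
    return primaries[0]


def inject_primary_failure(nodes: list[Node]) -> Node:
    """Mark the current primary as failed and return it."""
    target = select_primary(nodes)
    target.healthy = False
    return target


def elect_new_primary(nodes: list[Node]) -> Node:
    """Toy election (assumption): demote the failed primary, then promote
    the first healthy node in name order."""
    for n in nodes:
        if n.role == "primary" and not n.healthy:
            n.role = "replica"
    candidates = sorted((n for n in nodes if n.healthy), key=lambda n: n.name)
    if not candidates:
        raise RuntimeError("no healthy node available for promotion")
    candidates[0].role = "primary"
    return candidates[0]


if __name__ == "__main__":
    cluster = [Node("db-0", "primary"), Node("db-1", "replica"), Node("db-2", "replica")]
    failed = inject_primary_failure(cluster)
    promoted = elect_new_primary(cluster)
    # The experiment passes only if exactly one healthy primary exists afterwards.
    assert select_primary(cluster) is promoted
    print(f"failed {failed.name}, promoted {promoted.name}")
```

The final check mirrors the question such an experiment asks: after the primary disappears, does the system converge on a single new leader?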

Open-source contribution

Alongside the internal project, Flipkart contributed five fixes and enhancements to the upstream LitmusChaos project. CNCF said the changes addressed issues including database index fixes for project-scoped probe uniqueness, repairs to duplicate-name validation during tag edits, and workflow configuration fixes for custom image registries.

Those contributions reflected the judging emphasis on practical use and broader community benefit. The award highlights not only the adoption of open-source infrastructure, but also participation in maintaining and improving it.

"Resilience is table stakes for running microservices at scale," said Chris Aniszczyk, Chief Technology Officer, CNCF.

"Flipkart's systematic practice with Kubernetes and LitmusChaos demonstrates how a vendor-neutral approach eliminates the guesswork of fault injection and hardens the open-source foundation. Their five upstream contributions are the real win for community collaboration," said Aniszczyk.

For Flipkart, the project is part of a broader effort to formalise reliability engineering across a large digital platform serving hundreds of millions of consumers. Chaos engineering, once seen by some operations teams as disruptive or risky, is now being embedded as a standard process for preparing systems for failure.

Its central reliability engineering team also built a subscriber model in a central namespace to balance shared efficiency with tenant isolation across internal users. The architecture was intended to let multiple teams run tests within a common framework without creating cross-team interference.

The platform also reflects a hybrid reality common in large enterprises, where not all workloads have moved to containers. By extending the framework to cover both Kubernetes and virtual machine environments, Flipkart aimed to avoid limiting resilience testing to newer applications alone.

"Winning the CNCF End User Case Study contest validates our team's commitment to treating system outages as a standard, systematic procedure," said Aditya Sridasyam, Software Development Engineer 2, Flipkart.

"By leveraging the extensibility of vendor-neutral Kubernetes and LitmusChaos, as well as engineering our own custom hybrid platform, we've successfully hardened our massive microservices estate ahead of high-traffic festive sales, and we are proud to contribute our work back to the open-source community," said Sridasyam.