There are few businesses more seasonal than tax preparation. Whether it is CPAs poring over their clients' records, DIY-ers using online tax software or government agencies processing millions of returns, most of the year's activity is compressed into a few weeks between late January and mid-April.
Note the word "online" here, since as Intuit's chart summarizing TurboTax sales illustrates, the process of using software to automate and streamline the preparation of individual tax returns has become almost exclusively an online service. Such highly fluctuating workloads are an ideal use of cloud infrastructure, something Intuit realized early on. According to Pratik Wadher, Intuit's VP of Product Development, having completed its migration to AWS, the company closed its last data center earlier this year.
Although cloud infrastructure is more scalable than a fleet of on-premises servers, it still requires planning and effort to ensure that enough resources are available when needed. Indeed, Wadher says that even AWS EC2 auto-scaling isn't fast enough to keep up with load spikes during peak hours of TurboTax usage. For Intuit, a better application deployment option pairs cloud infrastructure with Kubernetes container clusters. 2020 marked the first tax season when the majority of the modules and services making up TurboTax ran on Intuit's Kubernetes platform. The company's route to Kubernetes was mostly smooth and offers instructive lessons for other large organizations considering a similar strategy.
Moving from VMs to containers - easier with a services-oriented application design
Intuit is a diverse software company with a mix of consumer, professional, online and desktop products. However, it is best known for the TurboTax franchise, which delivers about 41% of its total revenue and 54% of online sales. Indeed, with 48 million users this year, TurboTax accounted for 30% of all US tax returns. With such a sizeable user base, choosing TurboTax as the first product to run on Kubernetes infrastructure was simultaneously the most logical and the riskiest choice for Intuit: logical since it promised the most financial and operational upside if successful; risky because any hiccups would cause the most monetary and reputational damage.
A recent blog post by an Intuit engineering team describes its experience migrating from EC2 VMs to container clusters, citing five objectives it set when the work began in 2019:
- Increase the pace of product development and release.
- Consolidate development teams on a single infrastructure platform while preserving environmental isolation for each product group.
- Increase resource utilization, particularly during peak times to lower costs without compromising performance or reliability.
- Provide a unified distribution mechanism for reusable services and software components across products.
- Exploit a platform with a robust open source developer ecosystem contributing new features while offering Intuit developers an opportunity to participate in container- and DevOps-related open source projects.
Unlike many organizations that start small with Kubernetes, Intuit jumped into the technology with both feet, choosing a complicated product that requires sizable compute resources. According to Intuit's engineering team, TurboTax and its dependencies consist of 400 micro-services with 40 planned for Kubernetes. Handling its 40-50 million customers requires 26 Kubernetes clusters distributed between two AWS regions, each using three Availability Zones (AZs). Overall, TurboTax uses about 1,000 Kubernetes nodes that must scale from 5K to 300K transactions per second (TPS) within two hours.
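To put that scaling requirement in perspective, here is a minimal sketch of the arithmetic. The TPS figures and the two-hour window come from Intuit's numbers above; the linear and exponential ramp shapes are assumptions made purely for illustration, and the function names are hypothetical.

```python
import math

# Figures quoted in the article.
BASELINE_TPS = 5_000
PEAK_TPS = 300_000
RAMP_MINUTES = 120  # "within two hours"


def scale_factor(baseline: float, peak: float) -> float:
    """How many times larger the peak load is than the baseline."""
    return peak / baseline


def linear_ramp_per_minute(baseline: float, peak: float, minutes: float) -> float:
    """TPS added each minute, ASSUMING the load climbs in a straight line."""
    return (peak - baseline) / minutes


def doubling_time_minutes(baseline: float, peak: float, minutes: float) -> float:
    """Minutes per doubling, ASSUMING the load grows exponentially instead."""
    return minutes / math.log2(peak / baseline)


if __name__ == "__main__":
    print(f"scale factor: {scale_factor(BASELINE_TPS, PEAK_TPS):.0f}x")
    print(f"linear ramp: {linear_ramp_per_minute(BASELINE_TPS, PEAK_TPS, RAMP_MINUTES):.0f} TPS/min")
    print(f"doubling time: {doubling_time_minutes(BASELINE_TPS, PEAK_TPS, RAMP_MINUTES):.1f} min")
```

Under these assumptions the load grows 60-fold, which works out to roughly 2,460 additional TPS every minute on a linear ramp, or a doubling about every 20 minutes on an exponential one.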
Spreading clusters and the Kubernetes control plane among AZs is the standard way to provide redundancy within an AWS region. Although Intuit doesn't offer the details, the AWS EKS managed Kubernetes service automatically provisions control nodes in multiple AZs and handles rerouting around failed nodes. AWS also documents three ways of creating multi-zone worker node clusters for EKS using auto scaling groups and a load balancer (ELB, ALB). Intuit does note that it uses the AWS Application Load Balancer (ALB) to distribute client requests to nodes within each cluster.
For DR, Wadher says Intuit uses a mix of active-active and active-passive designs for regional redundancy where services can scale up capacity in either region. The goal is at least four 9s availability (about 4 minutes of downtime per month) for all micro-services.
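The downtime figure follows directly from the availability target. As a minimal sketch, assuming a 30-day month (the function name is illustrative, not part of any Intuit tooling):

```python
def downtime_budget_minutes(availability: float, period_days: float) -> float:
    """Minutes of allowed downtime in a period for a given availability target."""
    minutes_in_period = period_days * 24 * 60
    return (1.0 - availability) * minutes_in_period


if __name__ == "__main__":
    four_nines = 0.9999
    print(f"per 30-day month: {downtime_budget_minutes(four_nines, 30):.2f} min")
    print(f"per 365-day year: {downtime_budget_minutes(four_nines, 365):.2f} min")
```

Four 9s leaves a budget of about 4.3 minutes in a 30-day month, or a little under 53 minutes across a full year.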
Pre-production clusters are sliced by namespaces for QA (typically used for unit tests, build pipelines, etc.) and E2E (typically used for end-to-end product tests). Production clusters are sliced by namespaces for staging and production.
One factor that significantly simplified Intuit's Kubernetes migration was an application design that was already decomposed into stateless, API-driven microservices. The downside is that it required integrating the stateless worker nodes with various backend data services. Fortunately, these were already deployed in a separate AWS account and available via APIs. As Intuit's diagram illustrates, connecting the two required peering the two VPCs and enabling cross-account access via AWS IAM, which was already used for account authentication.
As the Intuit engineering team explains in its blog post: "Much of the data layer was already being accessed via API through NAT Gateway, though a few services did have additional resource dependencies on other AWS services for datastores, memory queues, and cache. Rather than migrate these AWS services to the AWS account that housed the Kubernetes cluster, we enabled cross-AWS account access via IAM access controls, and set up VPC peering where necessary."
Production deployment reveals five solvable problems
During three months of pre-production testing using several DR scenarios (including some that took out an entire AWS region), Intuit noted five significant problems with system scalability and performance. It documented these in a second engineering blog post detailing the technical issues and its solution to each. These primarily stemmed from default configuration parameters being too restrictive for an application the size of TurboTax. Specifically, the problems and their fixes were:
- Kubernetes' DNS service was unable to service requests fast enough under heavy load; solved by increasing the cache size and number of control nodes running the kube-dns service.
- AWS Auto Scaling service couldn't provision nodes fast enough; solved by adding a few spare nodes to each cluster during peak tax season.
- Nodes weren't correctly booting up due to external dependencies; solved by creating a custom system image (AMI) for cluster nodes.
- Issues with the Kubernetes interface to AWS IAM during some operations; solved with a configuration change.
- Network storms when the AWS ALB performed a health check on cluster nodes; solved via several configuration changes.
- Lost or delayed entries when aggregating log data; solved by running a log collection agent in each Kubernetes pod as a sidecar function.
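The spare-node fix in the list above is essentially a headroom calculation: enough capacity has to be on standby to absorb the load that arrives while the autoscaler is still provisioning. The sketch below is a rough, back-of-the-envelope illustration only. The 5-minute provisioning lag is a hypothetical value, the per-node throughput assumes the 300K TPS peak is spread evenly over roughly 1,000 nodes, and the linear ramp is an assumption; none of these come from Intuit.

```python
import math

# From the article: 5K -> 300K TPS in two hours, ~1,000 nodes, 26 clusters.
RAMP_TPS_PER_MIN = (300_000 - 5_000) / 120   # ASSUMES a linear ramp
TPS_PER_NODE = 300_000 / 1_000               # ASSUMES even load across nodes at peak
CLUSTERS = 26
PROVISION_LAG_MIN = 5                        # HYPOTHETICAL time to bring a new node online


def spare_nodes(ramp_tps_per_min: float, provision_lag_min: float, tps_per_node: float) -> int:
    """Nodes needed on standby to cover load that arrives during the provisioning lag."""
    return math.ceil(ramp_tps_per_min * provision_lag_min / tps_per_node)


def spare_nodes_per_cluster(total_spares: int, clusters: int) -> int:
    """Spread the fleet-wide spare count evenly over clusters, rounding up."""
    return math.ceil(total_spares / clusters)


if __name__ == "__main__":
    total = spare_nodes(RAMP_TPS_PER_MIN, PROVISION_LAG_MIN, TPS_PER_NODE)
    print(f"fleet-wide spares: {total}")
    print(f"spares per cluster: {spare_nodes_per_cluster(total, CLUSTERS)}")
```

With these illustrative inputs the estimate is 41 spare nodes across the fleet, or two per cluster, which is at least the same order of magnitude as the "few spare nodes" Intuit describes. A longer provisioning lag or a steeper ramp would raise the number proportionally.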
Satisfied with the fixes, Intuit deployed and successfully operated the production Kubernetes clusters in time for tax season in January. As this chart illustrates, from a baseline of about 1,100 pods, Intuit's infrastructure scaled up to 2,500 in early April. Similarly, its baseline node count of 600 grew 50% to 900 by the end of tax season.
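A quick calculation on those figures shows that pods grew considerably faster than nodes. The sketch below uses only the pod and node counts quoted above; the helper name is illustrative.

```python
def pct_growth(before: float, after: float) -> float:
    """Percentage increase from `before` to `after`."""
    return (after - before) / before * 100


if __name__ == "__main__":
    # Baseline and peak counts quoted in the article.
    pods_base, pods_peak = 1_100, 2_500
    nodes_base, nodes_peak = 600, 900

    print(f"pod growth:  {pct_growth(pods_base, pods_peak):.0f}%")
    print(f"node growth: {pct_growth(nodes_base, nodes_peak):.0f}%")
    print(f"pods per node: {pods_base / nodes_base:.2f} -> {pods_peak / nodes_peak:.2f}")
```

Pods grew by roughly 127% while nodes grew by 50%, so the average number of pods per node rose from about 1.8 to about 2.8 at peak.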
Intuit demonstrated the feasibility and benefits of teaching an old software dog new tricks by repurposing its hallmark application on cloud-based Kubernetes infrastructure. Wadher said that although Intuit didn't realize cost benefits in year one, the instrumentation and dashboards it built to monitor resource usage and spending have made managing its costs much easier and more accurate. Furthermore, integrating the Kubernetes infrastructure with its Argo-based CI/CD workflow has significantly shortened the release cycle for new software from 1-3 months down to 3 weeks. Indeed, Wadher is confident that the Kubernetes-based infrastructure and process will enable Intuit to achieve daily software releases within three years.
Intuit is typical of companies that have evolved from selling installed software to managed services. Critical to this evolution is exploiting new design paradigms like micro-services, automation tools like Argo and Jenkins, software packaging and management technology like Kubernetes, and the abundant features and capacity available from cloud operators like AWS. As Intuit migrates the rest of its portfolio to containers, keeping up with the exponential growth in online revenue for something like QuickBooks wouldn't be possible without the expertise gained during Intuit's initial Kubernetes experiment during the 2020 tax season.