How Ocado migrated its on-premise robotic control system to AWS on Christmas Eve
Project Tempest will see thousands of Ocado bots running in AWS cloud environments
Online grocery pioneer Ocado, which opened its first fulfilment center 20 years ago and which is well-known for its use of automation, says it has taken the next step in optimization by pushing a key operational control system into the cloud.
diginomica has closely tracked Ocado’s use of robotics. What is less well known is the company’s increasingly serious commitment to cloud and, in particular, the AWS stack. The firm utilizes over 80 AWS services to scale the Ocado smart platform to its grocery partners around the world, and according to its Chief Technology Officer, James Donkin, it now promotes a cloud-first approach to all partners and internal engineers.
Speaking at the recent London AWS Summit, Donkin said that the brand has now deployed 12 automated warehouses, is launching another six this year, and has 50 planned globally over the next few years. He added:
Our partnership with AWS has played a key role in our ability to scale our technology to global markets, and to achieve efficiencies that set the bar in the industry. Having realized the benefits of standardizing on native services, we're freeing up our engineers to focus on adding value to the customer offering and leaving the undifferentiated heavy lifting to AWS. That now includes migrating the orchestration of our fleet of bots.
At the heart of the Ocado model is its warehouses, which make heavy use of automation, robotics, and control systems. Thousands of bots need to collaborate seamlessly on 3D grids, rapidly moving grocery items to assemble customer orders.
These bots move at speeds of up to four meters per second, passing within five millimeters of each other, but are not autonomous. They are instead orchestrated by a custom control system, written by Ocado, using 4G communications technology.
This ‘air traffic control system’ communicates with each bot 10 times each second, with ultra-low latency to ensure the most efficient and seamless collaboration between the fleet. The aim is peak performance, with the highest throughput to each site, at the lowest cost.
This control system receives positional updates from each bot and needs to combine these with knowledge about customer orders and available stock in order to build a meticulous plan.
Each plan is time-stamped, which allows the control system to direct each bot to a precise position at any given time in the future and to adapt the plan in real-time if any individual bot is unable to complete its task.
That means the system must have the right flight path plan for any bot, at any moment, based on its specific location. High-fidelity physics models are used to predict where each bot will be at each future moment.
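As a rough illustration of what a time-stamped plan makes possible, the sketch below looks up where a bot should be at a given time and flags a bot that has drifted from its plan. It is not Ocado's implementation: the `Waypoint` type, the function names, and the tolerance are hypothetical, and simple linear interpolation stands in for the high-fidelity physics models the article describes.

```python
from bisect import bisect_right
from dataclasses import dataclass


@dataclass(frozen=True)
class Waypoint:
    """Hypothetical plan entry: where a bot should be at time t."""
    t: float  # seconds since the plan started
    x: float  # metres along the grid
    y: float  # metres across the grid


def planned_position(plan, t):
    """Return the planned (x, y) at time t.

    Assumes `plan` is sorted by time. Linear interpolation is a
    stand-in for a real physics model of bot motion.
    """
    if not plan:
        raise ValueError("plan must contain at least one waypoint")
    times = [w.t for w in plan]
    i = bisect_right(times, t)
    if i == 0:
        # Before the plan starts: hold the first position.
        return (plan[0].x, plan[0].y)
    if i == len(plan):
        # After the plan ends: hold the last position.
        return (plan[-1].x, plan[-1].y)
    a, b = plan[i - 1], plan[i]
    frac = (t - a.t) / (b.t - a.t)
    return (a.x + frac * (b.x - a.x), a.y + frac * (b.y - a.y))


def deviates(plan, t, reported, tolerance_m):
    """True if a reported (x, y) is further than tolerance_m from plan."""
    px, py = planned_position(plan, t)
    dx, dy = reported[0] - px, reported[1] - py
    return (dx * dx + dy * dy) ** 0.5 > tolerance_m


# Example: a bot drives 4 m along x in one second, then 4 m along y.
example_plan = [Waypoint(0.0, 0.0, 0.0), Waypoint(1.0, 4.0, 0.0), Waypoint(2.0, 4.0, 4.0)]
halfway = planned_position(example_plan, 0.5)
needs_replan = deviates(example_plan, 0.5, (2.0, 0.3), tolerance_m=0.2)
```

In a sketch like this, a `True` result from `deviates` is the point at which a controller would adapt the plan for the affected bot and any neighbours whose paths depend on it.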
To make that all work, a messaging delay of no more than 50 milliseconds is allowed; any delay means larger margins of error, which has a knock-on effect on the efficiency and throughput of the site overall.
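To see why the delay budget matters, a back-of-the-envelope calculation helps. The sketch below uses only the figures quoted above (four meters per second, 10 messages per second, a 50 millisecond limit, five millimeter clearances). The function name is hypothetical, and treating a bot as moving at constant top speed is a simplifying assumption.

```python
# Figures quoted in the article.
MAX_SPEED_M_PER_S = 4.0       # top bot speed
MAX_MESSAGE_DELAY_S = 0.050   # allowed messaging delay
UPDATES_PER_SECOND = 10       # control messages per bot per second
CLEARANCE_MM = 5.0            # closest passing distance between bots


def distance_travelled_mm(speed_m_per_s, seconds):
    """Distance covered at constant speed, in millimetres (simplified model)."""
    return speed_m_per_s * seconds * 1000.0


# How far a bot at top speed moves while a message is in flight.
travel_during_delay_mm = distance_travelled_mm(MAX_SPEED_M_PER_S, MAX_MESSAGE_DELAY_S)

# How far it moves between two consecutive control messages.
travel_between_updates_mm = distance_travelled_mm(MAX_SPEED_M_PER_S, 1.0 / UPDATES_PER_SECOND)

# How many times larger that in-flight distance is than the clearance.
delay_to_clearance_ratio = travel_during_delay_mm / CLEARANCE_MM

print(f"Travel during max delay: {travel_during_delay_mm:.0f} mm")
print(f"Travel between updates:  {travel_between_updates_mm:.0f} mm")
print(f"Delay travel vs clearance: {delay_to_clearance_ratio:.0f}x")
```

Under these assumptions a bot at full speed covers roughly 200 millimeters during the maximum permitted delay, many times the five millimeter clearance, which is why the plan has to be predicted ahead of time rather than corrected reactively.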
Historically, all that critical orchestration was hosted on-premise at each site. But in 2020 the process of moving the entire system to the cloud was initiated, via what Ocado calls ‘Project Tempest.’
Moving one of our most critical systems to the cloud was a huge decision for us, and we had to know with a high degree of confidence that the low latency and high predictability we'd achieved on-site, could be replicated or exceeded in the cloud. Compromising would mean compromising the throughput of our sites, and therefore the profitability of our retail partners - nothing less.
Alex Harvey, the firm’s Chief of Advanced Technology, said the on-prem approach had relied on local servers connected by very high-speed data networks to ensure the predictability required. He stated:
This approach allowed us to reduce the error budgets for the messaging delay and messaging variation, to be able to guarantee what we wanted. However, this also came with the cost of tooling, systems to coordinate deployments, and the instrumentation, not only for the application itself, but for all the redundancy.
Essentially, it was a full private cloud with all the overhead that comes with that. And all of that complexity.
Harvey said the door to ending such private cloud use was the launch of AWS Outposts.
Outposts pushes AWS infrastructure and services to either on-premise or edge locations for a consistent hybrid experience for workloads or devices requiring low latency access to on-premises systems, local data processing, data residency, and application migration.
Whilst it isn’t actively being used, Outposts is an insurance policy for Ocado that contributed to the company’s decision to move to AWS.
Ocado was an early user of AWS cloud, and in 12 months migrated all instances of its orchestration application to run on the AWS environment. The move has meant, Harvey added, that it now has close to 8,000 bots in the cloud and is confident that will soon be tens of thousands.
Hybrid architecture, delivered globally
Key to that growth will be AWS. Harvey said:
Three years ago, we took the decision to rewrite and re-platform all our end-to-end systems, leveraging all the AWS services that we could - from deployment, alerting, monitoring, running an ecommerce solution, logging database - literally everything.
That allows us to focus on the platform, because we haven't had to bother about dealing with the scalability of servers, deployment tooling, low-level logging tools, or even database support.
Bot orchestration is the most recent set of applications we have now moved up into the cloud, which we hadn’t done before, as it’s very different from those web applications. It’s highly multi-threaded and parallelized, to take advantage of current modern compute architectures.
Harvey said the bot air traffic control is a hugely complex computational challenge. The warehouse can’t have separate threads computing plans for different bots, for instance, because a single integrated plan for all bots is required. These constraints generate a specific type of computational cycle time and clock speed problem. He added:
We had no immediate plan to move into the cloud. But with Outposts, we were able to get an insurance policy that no matter where we build our warehouses, [we] will always be able to get the ultra-low latency we need and can get increased performance.
So confident was the team that AWS cloud could help that the migration was first piloted at the firm’s largest fulfilment center in Erith, southeast London. Donkin added:
We wanted to prove we could run at scale, as everything there happens at 10 times the scale at any other grid. To test, we did the first migration on Christmas Eve after all the customer orders were out, so we had about 48 hours to recover if we needed it.
Next steps for Ocado and cloud, concluded Harvey, will be to continue moving as much onto AWS as possible. He said:
Project Tempest was to remove our on-premise server infrastructure as much as we could. Managing a fleet of servers in a data center requires an awful lot of management software, redundancy software, software for provisioning, recovery, updating and security patches.
So, the real benefit of Project Tempest was being able to take all the applications that run on servers on the Ocado facilities and move those applications up into the cloud running on AWS.
Ocado still has embedded, edge-based and robotic control systems, but a heterogeneous compute architecture offers not just the ability to get high level machine learning and the intelligence embedded, with ultra-efficiency, but also to operate very, very fast feedback sensing on robotic pickup.
Blurring the boundaries between the cloud and on-prem systems is part of the next exciting phase we are really looking forward to working with AWS very closely on.