Main content

Clockwork – discovering wasted bandwidth between the nanoseconds

George Lawton, September 1, 2023
Summary:
The impact of synchronized time is increasingly important. Here's why!


If time is money, then what is the value of a nanosecond (billionth of a second)? Well, if you are building a large network of distributed applications, it could mean a ten percent improvement in performance or a ten percent reduction in cost for the same workload. It could also mean orders of magnitude fewer errors in transaction processing systems and databases.

At least that is according to Balaji Prabhakar, VMware Founders Professor of Computer Science at Stanford University, whose research team helped pioneer more efficient approaches for synchronizing clocks in distributed systems. He later co-founded and is CEO of TickTock, which became Clockwork, to commercialize the new technology. He also previously co-founded Urban Engines, which developed algorithms for congestion tracking and was acquired by Google in 2016. He has been working on algorithms to improve network performance for decades.

The company initially focused on improving the fairness of order placement in financial exchanges. It has since started building out a suite of tools to synchronize cloud applications and enterprise networking infrastructure more broadly. Accurate clocks help networks and applications improve consistency, event ordering, and the scheduling of tasks and resources with more precise timing.

This is a big advantage over the quartz clocks underpinning most computer and network timing, which can drift enough to confound time-stamping in networks and transaction processing. Traditional network-based synchronization helps reduce this drift but suffers from path noise created by fluctuations in switching times, asymmetries in path lengths, and timestamp noise.

Prabhakar says some customers are interested in cost conservation and want to right-size deployments and switch off virtual machines they no longer need. He notes:

So, if they save 10% or more, and we charge them just 2%, the remaining is just pure savings.

Others want a more performant infrastructure. Clockwork did one case study that found they could get seventy VMs to do the work of a hundred by running apps and infrastructure more efficiently. 

Bringing time back into system design

It is important to point out that there are two levels of improvement in the new approach. The protocol can achieve ten-nanosecond accuracy with direct access to networking hardware. In cloud scenarios mediated by virtual machines, it can achieve accuracy of a few microseconds. That is still good enough to satisfy the new European MiFID II requirements for high-frequency trading and many other use cases. It also helps that the clock sync agent requires less than one percent of a single CPU core and less than 0.04% of the slowest cloud link's capacity, while saving 10% of bandwidth.

Perhaps the most important thing to consider is the impact it could have on the trend toward clockless design in distributed systems. Clockless designs help scale up new application and database architectures but make basic operations like consistency, event ordering, and snapshotting difficult. 

The more accurate clock-sync technology is already showing promise in improving tracing tools, mitigating network congestion, and improving the performance of distributed databases like CockroachDB. Over the last couple of years, Clockwork has been building out supporting infrastructure around the new protocol, called HUYGENS, to improve cloud congestion control, create digital twins of virtual machine placement, and speed up distributed databases by ten to a hundred times. The protocol is named after Christiaan Huygens, who invented the pendulum clock in the 1600s; it remained the most accurate timekeeper until the commercialization of quartz clocks in the late 1960s.

Why time synchronization is important

The impact of synchronized time is increasingly important as the world transitions from dedicated networks and compute to various forms of statistical multiplexing. Networks have been transitioning away from dedicated connections built on circuit switching and asynchronous transfer mode (ATM), which delivered guaranteed performance for each user but wasted unused bandwidth. The industry has instead migrated to TCP/IP and wide-area Ethernet, which do a better job of sharing unused bandwidth but can get clogged, causing delays when the load gets too high.

A similar thing has been happening with compute. Legacy enterprise systems built on dedicated hardware guarantee high performance. However, these struggle to reallocate compute across multiple applications with varying usage requirements or scale out across multiple servers. The move towards virtual machines, cloud architectures, and now containers helps enterprises gain the same economies for compute that TCP/IP brought to networking. 

However, problems with statistical multiplexing arise when too many users or apps push the system to the edge of its performance. Packets get lost and transactions don't get processed, increasing delays and adding overhead as services retry to make up for lost time. More precise time synchronization helps networks, apps, and micro-services run close to peak load and back off gracefully when required, without wasting resources on packet retries or duplicate transaction processing.

Referring to the transition from dedicated compute and networks to modern approaches, Prabhakar says: 

The trade-off cost us. In communication, we went from deterministic transit times to best-effort service. And computing went from centralized control of dedicated resources to highly variable runtimes and making us coordinate through consensus protocols.

Different approaches to synchronizing time

For historical context, the synchronization of mechanical clocks played an important role in improving efficiency and reducing railroad accidents starting in the 1840s. More recently, innovations in clocks built using quartz, rubidium, and cesium helped pave the way for more reliable and precise timekeeping. These led to more reliable networks, operations, and automation, and played an essential role in the global positioning system (GPS) for accurate location tracking.

However, the inexpensive clocks built into standard computer and networking equipment tend to drift over time. In the early 1980s, computer scientists developed the network time protocol (NTP) to achieve millisecond (thousandth of a second) accuracy. Although the protocol's timestamps support roughly 200-picosecond (trillionths of a second) resolution, it loses accuracy owing to varying delays in packet networks, called packet noise.
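NTP's basic offset estimate can be sketched with its standard four-timestamp formula (the numbers below are made up for illustration). The formula is only exact when the two directions of the path take equal time, so any asymmetry shows up directly as offset error, which is exactly the packet noise described above.

```python
# The classic NTP four-timestamp exchange (per RFC 5905): the client
# records t0 (request sent) and t3 (reply received) on its own clock;
# the server records t1 (request received) and t2 (reply sent) on its.
def ntp_offset_delay(t0, t1, t2, t3):
    """Return (clock offset, round-trip delay) in seconds."""
    offset = ((t1 - t0) + (t2 - t3)) / 2.0
    delay = (t3 - t0) - (t2 - t1)
    return offset, delay

# Made-up numbers: server clock 5 ms ahead, symmetric 20 ms one-way
# path, 1 ms of server processing time.
offset, delay = ntp_offset_delay(100.000, 100.025, 100.026, 100.041)
# offset ≈ 0.005 s, delay ≈ 0.040 s; any asymmetry between the two
# directions of the path goes straight into the offset as error.
```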

One widely used NTP implementation, chrony, combines advanced filtering and tracking algorithms to maintain tighter synchronization. Most cloud providers now recommend and support chrony, with optimized configuration files for VMs.
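As one example, Amazon's documentation points chrony at its link-local Time Sync Service address; a minimal configuration fragment might look like the following (illustrative only — check your provider's documentation for the exact recommended file):

```
# /etc/chrony/chrony.conf (illustrative fragment)
# Amazon's link-local Time Sync Service endpoint; other clouds publish
# their own internal NTP endpoints.
server 169.254.169.123 prefer iburst
# Step the clock for large initial offsets, then slew gradually.
makestep 1.0 3
# Record the measured drift rate across restarts.
driftfile /var/lib/chrony/drift
```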

Various other techniques, such as precision time protocol (PTP), data center time protocol (DTP), and pulse per second (PPS), achieve tens of nanosecond accuracy but require expensive hardware upgrades. They also sometimes require precisely measured cables in a data center between a mother clock on a central server and daughter clocks on distributed servers. 

Clockwork's HUYGENS innovated on NTP with a pure software approach that can be enhanced by existing networking hardware. It uses coded probe transmissions that help identify and reject bad data caused by queuing delays, random jitter, and network card timestamp noise. It then processes the data using support vector machines to estimate one-way propagation times, achieving clock synchronization within 100 nanoseconds. Prior techniques relied on round-trip measurements, which suffer when the two directions of a path differ.
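Clockwork has not published HUYGENS's internals in full, but the core filtering intuition can be sketched with a toy example: queuing only ever adds delay, so the lower envelope of probe delays tracks the true propagation-plus-drift line, and a simple least-squares fit (standing in here for the SVM) recovers the drift rate. The numbers and the filtering scheme below are illustrative assumptions, not Clockwork's actual algorithm.

```python
import random

def estimate_drift(probes, window=20):
    """probes: (send_time, apparent_delay) pairs, where apparent_delay
    is the receiver timestamp minus the sender timestamp.  Queuing only
    ever ADDS delay, so the minimum sample in each window lies closest
    to the true propagation + drift line; a least-squares fit through
    those floor points separates propagation (intercept) from relative
    clock drift (slope)."""
    floor = [min(probes[i:i + window], key=lambda p: p[1])
             for i in range(0, len(probes), window)]
    n = len(floor)
    sx = sum(p[0] for p in floor)
    sy = sum(p[1] for p in floor)
    sxx = sum(p[0] ** 2 for p in floor)
    sxy = sum(p[0] * p[1] for p in floor)
    return (n * sxy - sx * sy) / (n * sxx - sx ** 2)  # drift (s per s)

# Simulated probes: 50 us propagation, 8 ppm relative drift, and bursty
# queuing delay added to roughly half the probes.
random.seed(1)
prop, drift = 50e-6, 8e-6
probes = [(t, prop + drift * t + random.choice([0.0, random.uniform(0, 5e-4)]))
          for t in (i * 0.25 for i in range(400))]
estimated = estimate_drift(probes)  # lands very close to 8e-6
```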

Another substantial difference is that HUYGENS trades timing data across a mesh instead of using NTP's client-server approach. The agent on each machine periodically exchanges small packets with five to ten other machines to determine the clock drift of each server or virtual machine in the mesh. The agent then generates a multiplier for slowing or speeding up the clock as prescribed by the corrections.
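The correction multiplier itself is easy to illustrate. In this toy sketch (not Clockwork's actual code), a clock's drift is estimated from two offset measurements against a reference taken some interval apart, and the reciprocal multiplier cancels it:

```python
def rate_multiplier(offset1, offset2, interval):
    """Given two clock offsets (seconds) measured `interval` seconds
    apart, return the multiplier that slows a fast clock or speeds up
    a slow one so its effective rate matches the reference."""
    drift = (offset2 - offset1) / interval  # seconds gained per second
    return 1.0 / (1.0 + drift)

# A clock running 8 ppm fast gains 80 microseconds over 10 seconds.
mult = rate_multiplier(0.0, 80e-6, 10.0)       # slightly below 1.0
corrected_rate = (1.0 + 8e-6) * mult           # effective rate ≈ 1.0
```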

Ideally, all computers would use the most advanced clocks available, but these are expensive and practical only for special applications. Instead, most modern clocks count the electrical vibrations of quartz crystals that resonate at 32,768 cycles per second (32.768 kHz). These are inexpensive and roughly 100 times more accurate than mechanical approaches, but they can drift 6-10 microseconds every second unless temperature-stabilized with more expensive hardware.
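Some quick arithmetic shows why that drift matters at human scales. Taking the top of the quoted range:

```python
# Back-of-the-envelope: a quartz clock drifting 10 microseconds every
# second (10 ppm, the top of the quoted range) accumulates visibly fast.
drift = 10e-6                 # seconds of error per elapsed second
per_day = drift * 86_400      # seconds of error per day
per_month = per_day * 30      # seconds of error per 30-day month
# per_day ≈ 0.86 s, per_month ≈ 26 s
```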

Atomic clocks monitor the cadence of atoms oscillating between energy states. These clocks are so precise that in 1967, the second was redefined as 9,192,631,770 oscillations of a cesium atom. Rubidium, a cheaper secondary standard, ticks at about 6.8 billion hertz. Current atomic clocks drift a second every hundred million years, although in practice they must be replaced every seven years. The most accurate timekeepers today, still confined to labs, use strontium atoms that tick hundreds of trillions of times per second. These drift only a second in 15 billion years and are used for precise gravity, motion, and magnetic-field measurements.

It's important to note that the lack of precision in quartz arises from the lack of temperature controls. Prabhakar says:

If these [quartz] clocks were temperature controlled, you can get down to the parts per billion. So, it'll be some small number of nanoseconds per second. Now, those kinds of clocks and network interface cards could easily be in the few hundreds of dollars to possibly up to $1,000 on their own. And the next level is rubidium clocks, which are three to five grand, and then cesium. As you add these costs to the raw cost of a server, you're piling up the costs across a large data center. So, it'd be nice if we could do it without having to resort to that. And that's more or less what we do.

New insights into virtual infrastructure

Understanding virtual infrastructure is a dark art since most cloud providers don't disclose physical placement. In theory, at least, each VM and networking connection is similar. In practice, it is not so simple. Clockwork has been developing a suite of tools to analyze and optimize cloud infrastructure using the new protocol. One research project last year explored the nuances of VM colocation.

A simple analysis might suggest that two VMs running on the same server would have a better connection to each other, since packets could flow over the faster internal bus. But Clockwork's research across the Google, Amazon, and Microsoft clouds revealed this is not necessarily the case. The fundamental issue is that the virtual networking service built into the hypervisors running these VMs creates a bottleneck. Sometimes the hypervisor even routes what should be local traffic between co-located VMs through acceleration services on the much slower external network rather than over the much faster internal bus.

The problem is compounded when enterprises attempt to co-locate multiple VMs running similar apps. For example, a business might have multiple instances of a front-end or business-logic app all connected to a back-end database. But performance slows significantly during peak traffic when they all try to access the back-end server at once. In one instance, Clockwork found that four co-located VMs saw only a quarter of the expected bandwidth because of this competition. The fundamental problem, they surmised, was that cloud providers were oversubscribing bandwidth in the belief that each VM would need peak networking at different times.

Although the technology could improve many aspects of distributed networking, Clockwork is focusing on the cloud for now because that is the biggest consolidated market. Prabhakar says:

Cloud is a nice place to sell because it's a place, and it's very big. I'm sure we could improve enterprise LANs and hotel Wi-Fi. But we started with the more consolidated, high-end crowd first and will then go from there.

My take

I never really thought much about time synchronization until I heard about Clockwork a month ago. A few years ago, I was elated that Microsoft started using NTP to automatically tune my computer clock, which always seemed to drift a few minutes per month. 

It seems like any protocol or tool that can automatically identify and reduce wasted bandwidth and computer resources could have a long shelf life and provide incredible value. The only concern is that HUYGENS is currently a proprietary protocol, which may limit its broader adoption as opposed to NTP, which became an Internet standard. 

It is possible that Google, which bought Prabhakar's prior company and helped develop the technology, may ultimately buy Clockwork and restrict the technology to Google's cloud. That would be a loss for the industry as a whole but a competitive differentiator for Google's growing cloud ambitions. It could also go the other way, with the technology released as an open standard, as Google has done with many other innovations.
