The new Apple goggles, the Vision Pro, made a tremendous splash over the last couple of weeks as a new augmented reality thingy that is not quite connected to the Metaverse. I believe the most transformative aspect isn’t the immersive graphics but the new eye tracker that lets you click and tap with your gaze. This promises to be as seminal as the mouse and touch UI were in their day.
Many words have been spilled on how cool the graphics are or the fact that Apple is finally jumping into extended reality. That’s not surprising because the graphics look cool, as do the ads. In ten years, when we look back on the significance of this moment, the virtual reality aspects may get short shrift compared to the most significant innovation, which is the introduction of the first Visual Operating System.
This could transform the way we do office work. It could streamline how we work with digital twins in industrial and enterprise applications by turning our eyes into active instruments of control rather than just passive recipients.
The evolution toward visual interaction
Bloomberg previously reported that Apple had trademarked xrOS, perhaps as a prelude to an extended reality operating system. It speaks volumes that Apple went with visionOS instead. This could be as transformative as the graphical user interface, ushered in by the mouse, and later the touch user interface, ushered in by the iPhone. Apple may not have invented these innovations, but it refined and scaled them.
The visual user interface may transform how we work with computers, eventually playing a pivotal role in freeing us from hunching over displays, contorting our necks, and tightening our shoulders to click, scroll, tap, and type our way through our daily grind.
The key frustration with modern computing is the gap between where we look and how we act on it. First, you look, and then you hover the mouse and try to hit the same character, pixel, or field. Similarly, touch interfaces mostly work great until you spend five seconds trying to precisely position the cursor to cut two characters out of a misspelled word.
The step from simply displaying graphics to visually interacting with them is no simple feat. While you may click a mouse once or twice in a second, the eyes dart and vibrate at a rapid clip, in ways that are not as straightforward as we might imagine when gazing at a still scene or a large screen.
The process of seeing is an active skill in which you decide to pay attention to some object in your field of view. In Timaeus, Plato described the eyes as tentacles reaching out to grasp things in the room. Greater skill in gently caressing objects with these tentacles increases our precision in perceiving fine details and textures.
Saccade movements are quick bursts from one side of the room to the other. They help to direct attention to other objects quickly. They also cleanse the palate, making it easier to focus at different depths. This is the main kind of movement involved in reading.
There are also tiny movements called ocular microtremors that keep our eyes in constant vibration. This constant motion sweeps the rods and cones across objects and makes it possible to see details smaller than the resolution of a single cone.
The eye muscles are the fastest and most precise in the human body, and they create the visual experience through a complex dance across many frequencies. At the highest frequencies, ocular microtremors driven from the brain stem vibrate the eyes at 60-100 hertz, moving the eye only about 150-200 nm but keeping the rods and cones supplied with fresh information.
One level up, saccadic movements at about 20 hertz and smooth pursuit movements at about 1 hertz help guide the eye's position toward visual stimuli. The eyes are also coordinated together in the convergence movements required for 3D comprehension.
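The layered frequencies above can be sketched as a toy signal: a slow smooth-pursuit drift, faster saccadic jitter, and a high-frequency microtremor summed together. Only the frequencies come from the text; the amplitudes and the sinusoidal form are illustrative assumptions, not a physiological model.

```python
import math

def eye_position(t: float) -> float:
    """Toy horizontal gaze angle in degrees at time t (seconds)."""
    pursuit = 5.0 * math.sin(2 * math.pi * 1 * t)    # ~1 Hz smooth pursuit
    saccade = 0.5 * math.sin(2 * math.pi * 20 * t)   # ~20 Hz saccadic jitter
    tremor = 0.001 * math.sin(2 * math.pi * 80 * t)  # 60-100 Hz microtremor band
    return pursuit + saccade + tremor

# Sample at 1 kHz for one second of simulated gaze
samples = [eye_position(i / 1000.0) for i in range(1000)]
print(min(samples), max(samples))
```

The point of the sketch is the separation of scales: an eye tracker that wants a stable gaze point has to filter out the tremor and jitter components while still responding quickly to the real saccades.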
We also use different kinds of processing pipelines to acquire information about what we look at. We can acquire information sequentially, such as by counting, linearly by reading, or visually by tuning into the lines, textures, and depth of things.
History in science and marketing
Researchers and vendors have spent years developing ways to quantify what we look at. Early work, called electro-oculography, tracked minute electrical changes as our eyes move, since the eye maintains a slight standing electrical potential between the cornea in the front and the retina in the back.
More recent systems measure changes in the angle of infrared light bouncing off the cornea, an approach that underpins most of the progress in the field over the last several years. Companies like Tobii claim accuracy as fine as 0.6 degrees for head-worn kit and 0.3 degrees for screen-based equipment, along with a precision of 0.01 degrees.
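To put those degree figures in context, here is a back-of-the-envelope conversion from visual angle to on-screen distance. The viewing distance (60 cm) and display density (96 DPI) are my assumptions for a typical desktop setup, not Tobii's numbers.

```python
import math

def angle_to_pixels(angle_deg: float, distance_cm: float = 60.0,
                    dpi: float = 96.0) -> float:
    """On-screen distance, in pixels, subtended by a given visual angle."""
    distance_in = distance_cm / 2.54  # viewing distance in inches
    span_in = 2 * distance_in * math.tan(math.radians(angle_deg) / 2)
    return span_in * dpi

# A 0.6-degree accuracy error at 60 cm spans roughly 24 pixels on a 96 DPI
# screen, while 0.01-degree precision is well under a pixel.
print(round(angle_to_pixels(0.6), 1))
print(round(angle_to_pixels(0.01), 2))
```

That gap is the whole usability story: two dozen pixels of error is enough to miss a text cursor or a small button, which is why gaze alone has historically needed a mouse for fine-tuning.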
Tobii and its competitors, like iMotions, Gazepoint, and Eyeware, all make gear marketed to scientific researchers trying to understand vision. Marketers also use it to improve their ad campaigns by creating heat maps of how we view a web page or ad. Tobii has taken this a step further by offering gamers tools to improve targeting and by showing fans what streamers look at while playing video games on Twitch.
In theory, at least, they can even control a mouse, sort of. I tried Tobii’s Eye Mouse feature a few years ago, and it was more like how a chicken makes assisted hops into the trees without truly flying. You could sort of jump to a region but still need the mouse to fine-tune things. I am sure it has improved, but they are still not highlighting mouse control like the other stuff. Apple’s foray into a visual UI might change that.
Over the years, both Apple and Facebook/Meta have acquired smaller companies building eye-tracking tech. Apple bought SMI in 2017, while Facebook acquired The Eye Tribe in 2016. The tech could prove valuable for foveated rendering, in which the small area you focus on is rendered in much greater detail than the periphery. This matters because the visual sensors called cones are packed about 200 times as densely in that central area as in the rest of the retina. But Apple appears to be upping the ante by adding the ability to navigate rather than just perceive.
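A minimal sketch of the foveated-rendering idea: shading detail falls off with angular distance from the gaze point. The tier boundaries and scale factors below are invented for illustration; they are not Apple's actual values.

```python
def render_scale(pixel_angle_deg: float) -> float:
    """Shading-resolution scale factor for a pixel, given its angular
    distance from the current gaze point (degrees).

    Tiers and factors are illustrative, not from any shipping renderer."""
    if pixel_angle_deg < 2.0:      # foveal region: full resolution
        return 1.0
    elif pixel_angle_deg < 10.0:   # parafoveal: a quarter of the shading work
        return 0.5
    else:                          # periphery: a sixteenth of the shading work
        return 0.25

# Most of the frame sits in the periphery, so most pixels get cheap shading.
print([render_scale(a) for a in (0.5, 5.0, 30.0)])
```

Since a 0.5 linear scale means a quarter as many shaded pixels, even this crude two-tier falloff saves most of the GPU work while leaving the area you actually look at untouched.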
Although I believe the Visual User Interface will radically change how we interact with computers, the first generation or two will probably be buggy. Early reports have people commenting about the challenges of trying to click on stuff with their eyes. These are often followed by an allowance that this is just an alpha product, and Apple is still fixing the bugs.
I predict you will not be able to cut and paste text with the same precision and speed as with a mouse today. But this prediction comes from someone who was similarly dissatisfied with the first laggy iPhone. It took a few iterations to catch up with and eventually exceed the experience of the hard keyboard on the market-leading BlackBerry, which is now just a software company.
It’s also important to note that early research into people trying to do office work on Meta’s Oculus headsets reported notable drops in productivity. One German study identified significant ergonomic issues among participants performing desk-based tasks: several suffered motion sickness and had to drop out, while others experienced reduced performance, frustration, and eye strain.
Apple has immense resources, and cameras for tracking eyes, AI for processing visual cues, and new dedicated silicon for visual processing will only get cheaper, faster, and more precise. Their foray into using the eyes to control computers and not just look at them will also galvanize the app marketplace for new vision training games, new reading experiences, and improved productivity.