All the Face-Tracking Tech Behind Apple's Animoji

The face-tracking tech Apple debuted with the iPhone X has been in the works for decades.

A couple of years ago, Apple went on a shopping spree. It snatched up PrimeSense, maker of some of the best 3-D sensors on the market, as well as Perceptio, Metaio, and Faceshift, companies that developed image recognition, augmented reality, and motion capture technology, respectively.

It’s not unusual for Cupertino to buy other companies’ technology in order to bolster its own. But at the time, it was hard to know exactly what Apple planned to do with its haul. It wasn’t until last month, at the company’s annual talent show, that the culmination of years of acquisitions and research began to make sense: Apple was building the iPhone X.

Perhaps the most important feature in the new flagship phone is its face-tracking technology, which allows you to unlock the phone with your face or to lend your expressions to a dozen or so emoji with Animoji. Apple thinks the iPhone X represents the future of mobile tech, and for many, that’s true. But trace consumer technology’s most impressive accomplishments back to their origins and, more often than not, you’ll end up in a drab research lab full of graduate students. In the case of Animoji, that research took place nearly a decade ago at a pair of Europe’s most prestigious technical schools.

Set in Motion

In the mid-2000s, motion capture was still a laborious process. Creating the nuanced expressions for the characters in Avatar, for example, required the actors to wear painted dots on their faces and attach plastic balls to their bodies. These dots, called markers, allowed optical systems to track and measure face and body movements and construct approximations of how they changed. "Markers help because they simplify the computation of correspondences," says Mark Pauly, a co-founder of Faceshift and head of the Computer Graphics and Geometry Laboratory at EPFL, a school in Lausanne, Switzerland.
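To see why correspondences matter, consider a minimal, hypothetical Python sketch (not any studio's actual pipeline): once each marker in one frame is already matched to the same marker in the next, recovering the rigid motion between frames reduces to a closed-form least-squares fit known as the Kabsch algorithm. Without markers, establishing that matching is the hard part.

```python
import numpy as np

def rigid_motion(markers_a, markers_b):
    """Rotation R and translation t that map markers_a onto markers_b (Kabsch)."""
    ca, cb = markers_a.mean(axis=0), markers_b.mean(axis=0)
    H = (markers_a - ca).T @ (markers_b - cb)            # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))               # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = cb - R @ ca
    return R, t

# Toy check: rotate a cloud of "markers" by a known angle and recover the motion.
rng = np.random.default_rng(0)
markers = rng.normal(size=(30, 3))
theta = 0.3
R_true = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                   [np.sin(theta),  np.cos(theta), 0.0],
                   [0.0,            0.0,           1.0]])
moved = markers @ R_true.T + np.array([0.1, -0.2, 0.05])
R_est, t_est = rigid_motion(markers, moved)
print(np.allclose(R_est, R_true, atol=1e-6))             # True
```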

Marker technology worked well, but it required significant overhead—a studio, motion capture suits, and of course actors willing to wear all those dots. “Whatever you wanted to create took a lot of money and time,” says Hao Li, director of USC’s Vision and Graphics Lab, who was getting his PhD in Pauly's lab at the time. “We wanted to make it easier.” So Pauly and Li, along with fellow researchers including Thibaut Weise, Brian Amberg, and Sofien Bouaziz (all now at Apple), began exploring how to replace markers and mo-cap suits with algorithms that could track facial expressions using footage captured by a depth-sensing camera. Their goal? To create dynamic digital avatars that could mimic human expression in real time.

There was an issue, though: Algorithmic facial tracking is notoriously difficult to pull off. Li calls the human face "one of the holy grails in computer graphics" because it's so hard to get right. Unlike a static object, the face is constantly deforming; there are no simple rules for a computer to follow.

For a machine to understand facial movement, it needs to understand the many ways a face can look. “The algorithms have to be robust to various lighting changes, occlusions, different extreme head rotations, and standard variations in face appearance across races and different ages,” says Dino Paic, director of sales and marketing at Visage Technologies, a company whose face-tracking software is used by auto and financial clients.

By the mid-2000s, 3-D depth-sensing cameras were already sophisticated enough to piece together the landmarks of a face. The bigger challenge was teaching a computer to make sense of that data. “The problem is that even if you can sense all the points, they have absolutely no meaning to the computer,” Li says.

To solve that, Li and his team treated the face like a geometry problem. They trained their algorithms on a set of faces and expressions, which let them build statistical 3-D models describing, in general terms, what a face looks like across different populations and in different environments. With that computational model in hand, the algorithm can more easily match itself to a face’s 3-D point cloud and drive an illustrated avatar that mirrors facial expressions in real time.
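A minimal sketch of that idea, in Python with made-up numbers rather than any real face model, might look like the following: the face is represented as a mean shape plus a weighted sum of learned "blendshape" offsets, and fitting the model to a frame of 3-D points becomes a small least-squares problem over those weights (assuming, for simplicity, that each observed point is already matched to a model vertex).

```python
import numpy as np

N_VERTS = 1000        # vertices in the template face mesh (toy number)
N_SHAPES = 40         # learned identity/expression components (toy number)

rng = np.random.default_rng(0)
mean_face = rng.normal(size=(N_VERTS, 3))               # average face shape
blendshapes = rng.normal(size=(N_SHAPES, N_VERTS, 3))   # learned deformation basis

def reconstruct(weights):
    """Deform the mean face by a weighted sum of blendshape offsets."""
    return mean_face + np.tensordot(weights, blendshapes, axes=1)

def fit_weights(observed_points, lam=0.1):
    """Solve for the blendshape weights that best explain one frame of 3-D points.
    A small ridge term keeps the solution close to plausible faces."""
    A = blendshapes.reshape(N_SHAPES, -1).T              # (3*N_VERTS, N_SHAPES)
    b = (observed_points - mean_face).reshape(-1)        # residual from the mean face
    lhs = A.T @ A + lam * np.eye(N_SHAPES)
    rhs = A.T @ b
    return np.linalg.solve(lhs, rhs)

# Simulate one frame of depth data: a face built from known weights plus sensor noise.
true_weights = rng.normal(scale=0.5, size=N_SHAPES)
frame = reconstruct(true_weights) + rng.normal(scale=0.01, size=(N_VERTS, 3))
estimated = fit_weights(frame)
print("max weight error:", np.abs(estimated - true_weights).max())
```

The hard parts the researchers actually faced—finding those point-to-vertex correspondences on a constantly deforming face, staying robust to lighting and pose, and doing it all at camera frame rate—are exactly what this toy version leaves out.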

Face Value

So far, visual effects companies have mostly used this technology to streamline their production processes. But the mainstream will soon experience it through features like Apple's Animoji and Intel’s Pocket Avatars, which uses facial recognition software to turn your face into a digital avatar.

Li says face-mimicking emoji are only the beginning. He now runs Pinscreen, a startup looking to automate the creation of photorealistic computer graphics, where he and his team are working on technology that would let algorithms build a hyper-realistic 3-D avatar from a single source photo.

After last fall’s presidential election, Pinscreen demoed its capabilities with a series of GIFs featuring a dancing Donald Trump. The renderings weren’t the most sophisticated (Trump’s face still had the doughy roughness of CGI), but they were a clear stepping stone to a future where, conceivably, anyone could make a lifelike avatar say and do whatever they please. Pinscreen’s technology is still in beta, but the implications of it reaching a wider audience are both exciting and potentially sinister.

And there’s the tension: As this technology improves, so does the potential for manipulation. Today, there’s still a clear visual divide between what’s real and what’s fake. But some day, perhaps very soon, it might be a whole lot harder to tell the difference.