
AI researchers translate language into physical movement



Carnegie Mellon University AI researchers have created an AI agent that can translate words into physical movement. Called Joint Language-to-Pose, or JL2P, the approach combines natural language with 3D pose models. The joint pose-forecasting embedding is trained end to end with curriculum learning, an approach that starts with short, easy pose sequences before moving on to longer, harder ones.
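
Broadly, a joint language-to-pose model embeds a sentence and decodes a sequence of poses from that embedding. The following is a minimal PyTorch sketch of that idea, not the authors' exact architecture: the GRU encoder and decoder, the layer sizes, and the 21-joint pose vector are assumptions for illustration.

```python
# Minimal sketch of a language-to-pose model (not the authors' exact JL2P architecture).
# Assumptions: a GRU sentence encoder produces an embedding, and a GRU pose decoder
# autoregressively rolls out poses, each a flat vector of 3D joint coordinates.
import torch
import torch.nn as nn

class SentenceEncoder(nn.Module):
    def __init__(self, vocab_size, embed_dim=300, hidden_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.gru = nn.GRU(embed_dim, hidden_dim, batch_first=True)

    def forward(self, token_ids):               # token_ids: (batch, words)
        _, h = self.gru(self.embed(token_ids))  # h: (1, batch, hidden)
        return h.squeeze(0)                     # sentence embedding: (batch, hidden)

class PoseDecoder(nn.Module):
    def __init__(self, pose_dim, hidden_dim=256):
        super().__init__()
        self.gru = nn.GRU(pose_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, pose_dim)

    def forward(self, embedding, seed_pose, n_frames):
        # Roll out n_frames poses, conditioning the decoder state on the embedding.
        h = embedding.unsqueeze(0)               # (1, batch, hidden)
        pose, frames = seed_pose, []
        for _ in range(n_frames):
            out, h = self.gru(pose.unsqueeze(1), h)
            pose = self.out(out.squeeze(1))      # next pose: (batch, pose_dim)
            frames.append(pose)
        return torch.stack(frames, dim=1)        # (batch, n_frames, pose_dim)

# Usage: embed a tokenized sentence, then decode a short pose sequence from it.
encoder, decoder = SentenceEncoder(vocab_size=5000), PoseDecoder(pose_dim=63)
tokens = torch.randint(0, 5000, (4, 8))          # 4 sentences of ~8 words each
seed = torch.zeros(4, 63)                        # 21 joints x 3 coordinates (assumed)
poses = decoder(encoder(tokens), seed, n_frames=2)
```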

JL2P animations are limited to stick figures today, but the ability to translate words into human-like movement can someday help humanoid robots do physical tasks in the real world or assist creatives in animating virtual characters for things like video games or movies.

JL2P is in line with previous works that turn words into imagery — like Microsoft’s ObjGAN, which sketches images and storyboards from captions, Disney’s AI that uses words in a script to create storyboards, and Nvidia’s GauGAN, which lets users paint landscapes using paintbrushes labeled with words like “trees,” “mountain,” or “sky.”


JL2P can generate animations that walk or run, play musical instruments (like a guitar or violin), follow directional instructions (left or right), and vary their speed (fast or slow). The work, originally detailed in a July 2 paper on arXiv.org, will be presented by coauthor and CMU Language Technologies Institute researcher Chaitanya Ahuja on September 19 at the International Conference on 3D Vision in Quebec.

“We first optimize the model to predict 2 time steps conditioned on the complete sentence,” the paper reads. “This easy task helps the model learn very short pose sequences, like leg motions for walking, hand motions for waving, and torso motions for bending. Once the loss on the validation set starts increasing, we move on to the next stage in the curriculum. The model is now given twice the [number] of poses for prediction.”
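
Read as a training loop, that schedule starts with a two-frame prediction horizon and doubles it whenever validation loss stops improving. The sketch below illustrates only the schedule; train_stage and val_loss are hypothetical placeholders for the real training and validation steps, and the 64-frame cap is an assumption.

```python
# Sketch of the curriculum schedule quoted above (placeholders, not the real code).
import random

def train_stage(n_poses):
    """Placeholder: train the model to predict n_poses frames per sentence."""
    pass

def val_loss(n_poses):
    """Placeholder: return a simulated validation loss at the current horizon."""
    return random.random()

horizon, max_horizon = 2, 64          # start with 2 time steps; 64 is an assumed cap
best = float("inf")
while horizon <= max_horizon:
    train_stage(horizon)              # one training pass at the current horizon
    loss = val_loss(horizon)
    if loss < best:
        best = loss                   # validation loss still falling: keep this stage
    else:
        horizon *= 2                  # loss started rising: double the horizon
        best = float("inf")
```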

JL2P claims a 9% improvement in human motion modeling over state-of-the-art AI proposed by SRI International researchers in 2018.

JL2P is trained on the KIT Motion-Language Dataset. Introduced in 2016 by the High Performance Humanoid Technologies lab in Germany, the data set pairs 11 hours of recorded human movement with more than 6,200 English sentences of roughly eight words each.
