The Conversational IO

Within my circles of analyst colleagues, industry executives, and venture capitalists, the idea of next-generation computer interfaces comes up frequently, and conversational UI is a main theme. You are going to hear quite a bit about this topic, so I thought it would be useful to establish a big-picture foundation.

I’ve been thinking about computer interaction models for the past year and have concluded that the easiest way to simplify how we interact with computers is to bring it down to workflows. Every interaction we have with a computer comes down to a task or set of tasks. Prior to smartphones, our workflows were defined by the mouse and keyboard; they were our only input mechanisms for interacting with a computer. Smartphones brought touch as an input mechanism, and now voice is being added. Gestures have existed in pockets of experiences like video gaming but remain a much less common computer interface than typing, touching, pointing (finger/mouse), and speaking.

Distilling our computer interaction models down helps us better frame how different input and output mechanisms vary based on things like situation, context, and physical location. For example, voice is a slam dunk inside the home for workflows like turning on lights or adjusting the thermostat, specifically because, more often than not, the object you want to interact with has no screen or you are not close enough to a screen to touch it. Saying, “Turn the AC to 65 degrees” from any location in the home is an easier and more efficient workflow than walking to the thermostat or pulling out your smartphone and opening an app to adjust it. Similarly, in a car voice is ideal because your hands are tied up and, for safety reasons, you shouldn’t spend a lot of time fidgeting with a screen to play music, look up directions, or find nearby points of interest. Voice interfaces add value in contexts where there was previously either no way to interact with a computer or the existing process was less efficient than voice.

However, voice is not and will likely never be the primary computer interface. It will be one of many which extend new capabilities and efficiencies. But all our computer interaction models will need to work harmoniously together to give us the widest range of workflow possibilities. This brings us to the conversational element.

What makes describing this computer interaction as a conversation interesting is that it is natural. Humans are used to this type of communication, whether voice or text. I’d offer that the way humans use technology is already largely conversational: a healthy portion of our time on all devices is spent in text message or email conversations. So why not add this element at the computer interaction level? Deeper engagement in searching, commerce, and automation, and even new workflows which don’t exist yet, are likely to come from this interaction model.

When we really drill down to the underlying meaning of the conversational interface, what surfaces is the common theme of intent and context. The belief is that advancements in machine learning, deep learning, and artificial intelligence overall will deepen our interactions with computers through their ability to truly understand us: not just understand what we say, but know our likes, dislikes, preferences, and details as intimate as we allow them to know, in order to be more helpful to us.

It has been said that “A computer should never ask a question it should know the answer to.” Currently, computers have no real context on us, so they continually ask for information which they conceivably should already know. This is ultimately what the entire concept seeks to solve.

Viv, a new voice startup from folks who had a role in creating Apple’s Siri, demonstrated the power of voice when context and third-party APIs are integrated into such a platform. An example I found particularly interesting was a voice transaction for paying someone back. You could say, “Pay John back $15” and, with the API tied to Venmo in this case, the entire process of paying a friend back was automated and completed by voice. You can watch the whole demo of the Viv launch here for a deeper look at the concept.
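To make the idea concrete, here is a minimal sketch of what a conversational platform might do internally: map an utterance to an intent plus parameters, which can then be handed to a third-party API. This is a hypothetical illustration only; the function name, regex, and intent schema are my own assumptions, not Viv’s or Venmo’s actual implementation.

```python
import re

def parse_payment(utterance):
    """Extract a hypothetical 'pay back' intent with recipient and amount.

    A real platform would use trained language models rather than a regex;
    this just shows the shape of the intent/parameter output.
    """
    m = re.match(r"[Pp]ay (\w+) back \$(\d+(?:\.\d{2})?)", utterance)
    if not m:
        return None  # utterance does not match this intent
    return {
        "intent": "pay_back",
        "recipient": m.group(1),
        "amount": float(m.group(2)),
    }

print(parse_payment("Pay John back $15"))
# {'intent': 'pay_back', 'recipient': 'John', 'amount': 15.0}
```

The structured result is what gets forwarded to the payments API; the voice layer’s job is only to fill in the blanks of that structure.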

All of this is setting the stage for the next few months when, at both Google’s I/O and Apple’s WWDC, I expect voice interactions/APIs for Siri and Google Now to be highlighted in some capacity. While we are still extremely early, the groundwork for this new interaction layer is being built right now.

What has changed in the past few years is humans’ willingness to engage in speaking with computers. I expect these types of technologies to be adopted quickly and add significant value to how we interface with computers and more easily automate workflows in the future.

Published by

Ben Bajarin

Ben Bajarin is a Principal Analyst and the head of primary research at Creative Strategies, Inc., an industry analysis, market intelligence and research firm located in Silicon Valley. His primary focus is consumer technology and market trend research, and he is responsible for studying over 30 countries. Full Bio

5 thoughts on “The Conversational IO”

  1. One thing that I feel gets left out of discussion of conversational interfaces, and personal assistants in general, is what humans actually use smartphones for.

    My understanding is that smartphones are primarily used for communications. Using smartphones to tell a device to do something, look up a map, or even to do a payment is tertiary (after games and entertainment).

    So the question is, how will a voice (only) interface help us communicate better? The answer clearly is it won’t. A picture is worth a thousand words.

    I believe conversational UI will be important, but only in an auxiliary way. It may be important for IoT, but I doubt it will be a major part of the computers we carry around all day (smartphones).

    1. Shortly after I purchased my Apple Watch and got acclimated to it, I decided to simplify the amount of gear I carry. My briefcase/backpack now contains an iPad Pro, a rMBP and my iPhone 6 Plus. Thanks to Bluetooth and Verizon WiFi, I seldom touch my iPhone, except to charge it or access financial information that requires TouchID.

      In the car, I use Siri to select playlists and play podcasts as well as Beats 1 and other Apple Music synthetic radio stations. On the go, I now make most of my calls from my Apple Watch over Siri; conference calls that require PIN and/or bridge numbers are still a pain, but I can often use my iPad with FaceTime for those.

      At home, lighting and climate controls are done with the limited, but capable HomeKit devices I have…really wish I could do more.

      Enhanced dictation on my rMBP is exquisite, even better than DragonAnywhere on my iPad. Between the two, I do most of my email. Standard iPad dictation is good enough for the rest. I’ve also found that Siri on Apple Watch seems to function far better than on my iPhone.

      I think the Watch and voice are the front runners for the upcoming user interfaces. I expect Apple will publish a Siri API, add the ability to detect who is speaking and distinguish multiple overlapping voices, and release Siri devices that can be scattered throughout the home/work space.

      There have been weeks when my Apple Watch and iPad have been the only computing devices I’ve interacted with, nearly all of it by voice.

      It works, but is nowhere near seamless. That’s something Apple is good at, and I think it’s coming soon.

      Naofumi, I disagree. I’m older than most, but based on battery usage, my phone spends most of its time as a personal hot spot, pocket email viewer and camera. Other than conference calls, I rarely do voice anymore. Text works so much better for most things. If I had LTE on my watch and iPad, I wouldn’t carry a phone anymore.

      1. I think you misunderstood my position. By saying that smartphones are predominantly being used for communication, I am by no means restricting the scope to voice calls. I am saying that exchanging messages between humans is the predominant use of smartphones, which will include text, email, Instagram, Facebook, snapchat, etc.

        A lot of that information is transmitted as 2D images or text, and will be harder to consume using a linear conversational UI.

        This is the thing. Conversation is linear and synchronous. Images/pages are 2D and asynchronous. The amount of data that can be transmitted and the convenience of that transmission are miles apart. Conversation, on the other hand, is transactional, which is still hard to do with images/pages.

        Steve Jobs said that handwriting was the slowest input method ever invented. Similarly, conversation, in many situations, can be the slowest way to get information (other times it can be the fastest). The key is to understand when and where interactivity is important and when data bandwidth is important.

        For the current usage patterns of smartphones, interactivity is not as important as data bandwidth/asynchronicity. You confirm that by saying

        my phone spends most of its time as a personal hot spot, pocket email viewer and camera.

        1. With that definition, I lean more toward agreeing with you. However, you shouldn’t downplay the ease of voice communication. I know a lot of people who can’t write in cursive form, some who have penmanship that would make my kindergarten teacher cry.

          As grammar becomes worse and there’s more language blend, I can envision communicating by voice with transparent translation done in the cloud.

          1. I am not downplaying the importance of voice communication. I am just saying that there are quite a few situations where it is not nearly as good as written text or a graphic.

            Of course, there are also many situations where voice communication is superior. These are interactive sessions.

            The discussion then becomes: is verbal conversation the key, or is interactivity the key? I would say that interactivity is the “job-to-be-done” and verbal conversation is merely one of the means to do it. Conversely, if we could learn to design 2D touch interfaces that were as interactive and as contextual as the conversations we have now, then the use for voice conversations as a computer interface is likely to disappear.

            One of the conversational “bots” already in widespread use is the UNIX (or Windows) command line. You give it commands in a certain syntax and the computer responds, and you can give answers to subsequent questions the computer asks. Interestingly, UNIX shells like BASH make heavy use of auto-complete based on history (context) and the state of the computer. This enables you to interact with the computer much faster than you could if you had to spell out each element.
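            As a toy illustration of that history-based completion, here is a hypothetical Python sketch (not how BASH actually implements it) of matching a partial command against past commands, preferring the most recent:

            ```python
            def complete(prefix, history):
                """Return history entries starting with prefix, most recent first, deduplicated."""
                matches = []
                for cmd in reversed(history):  # newest entries first
                    if cmd.startswith(prefix) and cmd not in matches:
                        matches.append(cmd)
                return matches

            history = ["git status", "ls -la", "git stash", "git status"]
            print(complete("git st", history))
            # ['git status', 'git stash']
            ```

            The shell’s real completion also consults the file system and installed commands, but the principle is the same: prior context stands in for typing.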

            Conversation enhanced by a 2D interface is likely to be much more effective than a 1D voice communication scheme alone.
