
Will You Welcome A Jibo Into Your Home?

In my last post, I mentioned Dr. Cynthia Breazeal’s founding of her new company as an example of the emergence of robots designed around the social framework we apply as we interact in the world. This week, she unveiled Jibo, a personal robot for your home.

It is no secret that we tend to respond to social stimuli even when they are generated by technology. While there has been some debate in the robotics community about how far you can take this before falling into the so-called “uncanny valley”, where too much anthropomorphic design may make a robot seem creepy, there is little doubt that creating some social context with robots is almost unavoidable.

I’ve previously cited the classic 1944 study by Drs. Heider and Simmel. Subjects asked to describe an animation of moving geometric shapes typically used social terms to describe the interaction, even though no such context was provided. More recently, the research of my good friend, the late Dr. Cliff Nass of Stanford, and his colleagues consistently demonstrated that people apply many aspects of how they relate to each other to technology as well, even when they consciously know that there is no person or “life” in the technology. This evidence is so strong that some question the ethical implications of designing technology to exploit that aspect of human psychology.

The research of Dr. Breazeal and her students has extensively explored that dimension at MIT’s Media Lab with a number of robots, so it is no surprise to learn that this would also be the focus of her new company. With the unveiling of Jibo, we can see the marks and refinements of several projects, including the NSF-funded Dragonbot, Guy Hoffman’s interactive desk lamp, and Cory Kidd’s Autom, as well as Kismet, Leonardo, Nexi, and Huggable.

Still, I was fascinated by her recent announcement. While many of Cynthia’s previous robots have had a face, typically with expressive eyes, Jibo is a much more subtle design. Many in the press have already noted its obvious similarity to Pixar’s famous Luxo lamp, which appears in the introduction of the studio’s animated films. There, the lamp’s animation suggests a wide range of emotions that can readily be identified even though it doesn’t have a “face”.

It is also an interesting contrast to the design of Baxter, the robot created by Cynthia’s former faculty advisor, Dr. Rod Brooks, which does include a “face”. The two robots demonstrate that while both share a common philosophy about the importance of social cues in a robot’s interactive design, they favor somewhat different implementations. While Baxter has animated eyes and eyebrows, Jibo only teases at that notion, preferring, like Pixar’s lamp, to use body motion and sound to evoke emotion and social behaviors.

When SoftBank/Aldebaran announced Pepper in June, I tweeted that if you like what you saw in Pepper’s design, you would like our robot as well. I can say somewhat the same thing for Jibo. Yet there are also important differences between both these robots and what Hoaloha is developing.

Let me talk about some of the similarities and differences in design between Jibo and the robot Hoaloha is developing. I’ve already mentioned the general social orientation of the designs; that is, the robots interact and respond somewhat like a person, using voice and motion to express social behaviors. To some extent this can also be found in Apple’s smartphone assistant, Siri, or Microsoft’s equivalent, Cortana. While neither of the smartphone interfaces provides a “face” or “body”, they do support a conversational paradigm based on speech input; that is, to interact, you “talk” to them and they speak back to you. As I have said in the past, this is a very challenging approach because speech recognition is not as reliable a user interface as a mouse, keyboard, touch, or even gestures. While speech is a very natural form of interaction, it is a complex process that requires not only listening for the distinct signatures of words across a wide variety of speakers, but also using context to turn recognition into understanding.

Simply matching phonemes to words is never enough, as many words sound alike. Further, conversational language is full of disfluencies and “noise” words like “uh” and “you know”. So even when all the spoken words are correctly recognized, getting to meaning often requires resolving ambiguity. Speech communication also requires applying and maintaining context. Pronouns are a good example: we are typically able to keep track of whom a pronoun refers to across multiple turns of a dialogue.
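To make the ambiguity problem concrete, here is a toy sketch (not any product’s actual pipeline; the candidate phrases and frequency counts are invented for illustration) of how context can pick between transcriptions that sound identical:

```python
# Toy illustration: "right" and "write" match the same phonemes, so a
# recognizer can only choose between them using context. Here that
# context is a tiny table of hypothetical word-pair frequencies.
BIGRAM_COUNTS = {
    ("turn", "right"): 50,
    ("turn", "write"): 0,
    ("please", "write"): 30,
    ("please", "right"): 1,
}

def score(words):
    """Sum the corpus counts of each adjacent word pair."""
    return sum(BIGRAM_COUNTS.get(pair, 0) for pair in zip(words, words[1:]))

def disambiguate(candidates):
    """Pick the candidate transcription with the highest context score."""
    return max(candidates, key=score)

# "turn right" and "turn write" are acoustically identical; context decides.
best = disambiguate([["turn", "right"], ["turn", "write"]])
```

Real systems use far richer language models, of course, but the principle is the same: recognition alone cannot separate homophones, so some model of likely context has to.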

It is also obvious that in human conversation, listeners process what they hear as it is being spoken and are active participants in the dialogue, generating social cues that the speaker perceives. Conventional speech technologies often do not process spoken audio until the speaker finishes. While this latency is small, it can detract from the sense of naturalness of a conversation.

It is impressive that Cynthia implies in her announcement videos that Jibo’s user interface will be driven primarily by speech input. This is unlike Cory Kidd’s approach of providing only touch as input for Autom, his weight-loss coach robot, but it is very much in line with her research on Kismet, Leonardo, and Nexi. It may also be reflected in her hiring of Roberto Pieraccini, an ex-AT&T and ex-IBM speech expert, and in bringing on former Nuance executive Steve Chambers as Executive Chairman.

But speech as an interface is not only challenging on the input side; it is also hard on the output side. There is a natural variability in how we respond using speech, even in the same circumstances from day to day. For example, when you greet a friend or business associate you see regularly, you likely start with a common exchange of pleasantries; yet there will be some differences in the tone of your voice or the words you select, perhaps based on your mood. When this variability is not present and speech patterns seem too repetitive, the dialogue becomes wooden and unappealing; this is one reason people dislike voice-based phone interfaces that repeat themselves and limit what you can say at any one time. Nor is this true only of speech-based applications: few people like to speak with a human customer service agent who appears to be operating from a strict script. We expect speech to be natural, and reflective not only of the words we say, but of how we say them.

However, the reality is that there are technical limitations to using speech technology. Despite the hype that sometimes surrounds artificial intelligence, no technology so far comes anywhere close to matching how effortlessly we use language to communicate. Typically this requires limiting what the user can say at any one time, or other clever tricks, such as what Siri does when “she” doesn’t have a pre-defined response: converting the user’s request into a Wolfram Alpha or Web search query. Getting to a more natural level involves bringing more context into the process, which might include the structure of language, visual information, or information shared between the speaker and the listener. Speakers and listeners also often exchange important but subtle visual cues as they engage in a dialogue. It is a kind of dance that requires both common ground and the ability to use more than words to effectively signal each other during the exchange.
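The “clever trick” of falling back to a search query can be sketched in a few lines. This is a deliberately minimal illustration, not Siri’s or Jibo’s actual design; the intents and responses below are invented:

```python
# A minimal sketch of graceful degradation in a speech assistant:
# match the request against a small set of pre-defined intents, and
# when nothing matches, hand the raw text off as a general search
# query rather than failing outright.
INTENTS = {
    "what time is it": "It is 3:00 PM.",   # hypothetical canned responses
    "set a timer": "Timer set.",
}

def respond(utterance):
    text = utterance.lower().strip()
    for phrase, reply in INTENTS.items():
        if phrase in text:
            return reply
    # No pre-defined response: degrade to a search, much as Siri hands
    # unknown requests to Wolfram Alpha or the Web.
    return f"Here is what I found on the web for: {utterance!r}"
```

The fallback keeps the dialogue from dead-ending, but it also shows the limitation: everything outside the pre-defined set gets the same generic treatment, which is exactly why such interfaces feel constrained.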

But if users will not be able to talk about just anything with Jibo (or any robot), how will they know what they can say? For example, in Jibo’s introductory video, how does the guy coming home know that he can ask Jibo to increase his take-out order? Such interplay will likely require some form of priming. Perhaps, as we do with our robot, Jibo’s screen will be used to provide visual cues about possible things to say. That may be challenging because of the size of its screen and the need to make such cues readable from a distance, limiting how much can be presented. The current Jibo videos don’t provide enough information to know, so we will have to wait to see how cleverly Jibo’s team addresses this.

That said, Jibo’s diminutive size and youthful personality may help. We naturally adjust how we speak to children because we know they don’t share the knowledge or experience adults have, and we are more accepting of the errors they make. This tendency can also make us more forgiving of the inevitable mistakes that a robot will make. However, it takes a careful balance to maintain trust that the robot has sufficient competence.

Creating the illusion of a youthful personality is a design aspect we share with Jibo’s creators. In one of her interview videos, Cynthia talks about how they wanted to create the illusion of a character like Frodo from The Lord of the Rings: a young, inexperienced person, eager to learn new things, but without an ego that would make it a controlling personality. We agree with the importance of defining such an appropriate persona, though we would likely select Sam, Frodo’s loyal companion, as our model, as we believe the “sidekick” role better fits our goals.

In addition, our design is oriented toward creating reciprocity in the relationship between the user and the robot. This means that the robot does not only provide support for the user, but has aspects where the user can express care for the robot. This dynamic can have an important psychological effect, because we experience positive benefits when we do things for others. It is also reflected in the relationships we have with our pets and, of course, in the positive relationships we have with other people. Jibo may be designed this way as well. Cynthia speaks of Jibo being a companion, a perspective we agree is an important ingredient in the design of a personal robot, and one that may also be intended in the design of SoftBank’s Pepper.

Unlike Jibo, the Hoaloha robot is being designed to be autonomously mobile, though I can understand why Cynthia may not have opted for that with Jibo (yet). It potentially saves her a huge investment in hardware engineering, reduces the cost of components, and simplifies safety and interaction design. At an introductory price of about $500, Jibo is much more affordably priced for consumers than Aibo or Nao (or Pepper) were. The stationary design also means that, with Jibo typically plugged into an outlet for power, Cynthia’s team needs to worry less about a power budget.

Reducing costs and simplifying the design to keep an affordable price and to decrease time-to-market makes sense. It is why Hoaloha has opted not to include conventional robot arms on our initial robots. But like Jibo, we do include a “head” that can move in three directions, because we share the belief in the importance of social interaction.

Yet for our targeted audience, we did not believe we could serve them adequately with a stationary design. Such a choice not only trades off some of the potential impact of the robot as an embodied presence, but also its effectiveness for scenarios that are important to us. For example, if the robot needs to remind you of something important, it is no more effective than a desktop PC or your smartphone if it is in another room. Similarly, for our scenarios, transporting objects on behalf of the user is an important feature.

While it would not be difficult to mount Jibo on a mobile platform (and I am certain some creative individuals will do that), that would not be sufficient. Navigating within a human space requires more than moving safely and purposefully; movement must be intrinsically integrated with the user experience. Thus, Hoaloha could not have made that same decision for our robot. For us (and our market), being able to bring experiences “to the user” is an important distinction and value.

Finally, we share Jibo’s design goals of being both connected and a platform for other applications and services. Wireless support enables our robot not only to integrate with other devices and the coming “Internet of Things”, but also to leverage cloud-based services. And by offering an API (Application Programming Interface), we can extend the abilities of our robot beyond the limits of our own ability and creativity. However, these aspects by themselves do not guarantee success. A robot’s user experience requires consistency across applications and services, as well as appropriate isolation from each other and from core system functions, so that software errors or malicious programs cannot adversely affect the overall operation of the robot. We may not get an opportunity to peek at Jibo’s architecture here until the developer edition ships in the latter half of next year. I would not be surprised if it isn’t done yet. Even for us, it is one of the greatest areas of investment we are making in our design.

In summary, as exciting as Jibo’s announcement is, there remain many important details yet to be revealed. One thing is very certain: Jibo will need to nail its value proposition (the user-perceived benefits for the cost of the device) and its user experience to succeed. Opening the platform to third-party developers is important, but if Jibo’s core suite, supporting architecture, and, most importantly, user experience are not strong, Jibo may just end up like Aibo: a well-engineered gadget.

With respect to this, it will be very interesting to see how well Jibo delivers on its CUI, or Conversational User Interface. So far the closest approximations include Apple’s Siri, Google Now, and Microsoft’s Cortana. While these interfaces provide a more natural interface for queries, their dialogues tend to be one-sided and short; true conversations have multiple dialogue turns. Building an engaging conversational interface requires creating a satisfying interaction throughout the day, as well as day after day. It is a paradigm shift as significant as the GUI was to computing in the mid-’80s.

There is no question that “personal robots” will need to push beyond the interfaces of the devices we commonly use today. It will require better matching to the way we naturally interact. Done right, it should enable personal robots to be natural extensions of who we are: able assistants and companions.

We congratulate Cynthia and the rest of the Jibo team on a promising launch, and look forward to how things develop. While their defined market  is somewhat different from ours, we share many of the same fundamental design principles with regards to social interaction and its importance for personal robots. The positive response to Jibo so far confirms our design choices as well.