Dialogue Systems


Dialogue systems (also known as voice interaction systems or voice-based user interfaces, VUIs) enable spoken communication between humans and machines, typically complementing traditional human-machine interaction modalities such as visual output (screen, head-up display) and haptic input (scroll wheels, buttons, etc.) with spoken language input and output.


The main components of a dialogue system are (1) input modules for speech (ASR, Automatic Speech Recognition) and other modalities, (2) natural language interpretation, (3) a dialogue manager which takes interpreted user input and interacts with databases and services to provide appropriate and helpful responses, (4) natural language generation which renders system output in linguistic form, and (5) output modules for speech (TTS, Text To Speech) and other modalities.
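The five components above can be sketched as a simple processing pipeline. The following is a minimal, illustrative sketch only; every module here is a hypothetical stub (a real system would plug in an actual ASR engine, NLU model, back-end services, and TTS engine), and the function names and intents are invented for the example.

```python
# Minimal sketch of a dialogue system pipeline (components 1-5).
# All implementations are hypothetical placeholder stubs.

def asr(audio: str) -> str:
    """(1) Input module: speech recognition. Stub passes text through."""
    return audio

def interpret(utterance: str) -> dict:
    """(2) Natural language interpretation: map text to an intent."""
    if "weather" in utterance.lower():
        return {"intent": "get_weather"}
    return {"intent": "unknown"}

def dialogue_manager(interpretation: dict) -> str:
    """(3) Dialogue manager: consult back-end services (omitted here)
    and decide on the next system dialogue act."""
    if interpretation["intent"] == "get_weather":
        return "inform_weather"
    return "request_clarification"

def generate(dialogue_act: str) -> str:
    """(4) Natural language generation: render the act as text."""
    templates = {
        "inform_weather": "It is sunny today.",
        "request_clarification": "Sorry, could you rephrase that?",
    }
    return templates[dialogue_act]

def tts(text: str) -> str:
    """(5) Output module: text-to-speech. Stub returns the text."""
    return text

def turn(user_audio: str) -> str:
    """Run one user turn through the full pipeline."""
    return tts(generate(dialogue_manager(interpret(asr(user_audio)))))
```

For example, `turn("What's the weather like?")` flows through all five stages and produces the canned weather answer, while an unrecognized utterance yields a clarification request.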

Compared to other interaction methods, dialogue systems/VUIs have several potential advantages[1]:

  • Intuitive and efficient: they exploit innate language skills and thus enable intuitive and efficient usage
  • Ubiquitous: they are everywhere; they can run on, or be accessible from, mobile devices, and are accessible to the user even from a distance
  • Enjoyable: they increase user-friendliness while efficiently satisfying the user's needs
  • Hands-free, eyes-free: they are the ideal solution while engaged with other tasks; entering complex information is otherwise usually awkward and requires the user's hands and eyes

Relevance to SAM

SAM, being a project about a consumer-level application, needs an intuitive and modern interface. Dialogue systems are a type of interface that enables end-users to simplify their interactions with their applications and get the most out of them.

SAM will use sophisticated dialogue system techniques in order to allow the consumption of media content with the minimum amount of distraction. Interacting with two screens already leaves little room for additional cognitive activity, so the dialogue interface to be implemented will be designed to demand as little further attention as possible. Additionally, the user will be able to maximize the efficiency and brevity of interactions with the system, for example by avoiding typing text or by bypassing menu hierarchies.

Thus, dialogue systems will be an integral aspect of SAM both functionally and conceptually.

State of the Art Analysis

Commercial systems


The state of the art in dialogue systems is arguably best represented by Siri, supplied with the Apple iPhone 4S and later models. Siri offers speech-based dialogue interaction with several apps and services, using high quality server-based speech recognition tolerant for variation in how user utterances are formulated, intelligent back-end integration with services, and relatively sophisticated dialogue management mechanisms such as context-dependent interpretation of user utterances. For example, a request for a taxi will be interpreted as a request for a taxi from the current location of the user to the user's home.

While an impressive step forward for commercial speech-based interfaces, almost all interactions with Siri require the user to look at the screen at some point. For example, after the user asks for a restaurant nearby, the system presents a list of restaurants on the screen. However, the list is not read out using speech, so the user must look at the screen and tap one of the alternatives to proceed in the dialogue. This is not so much a design flaw as a consequence of the fact that Siri is designed for a user who can use voice, eyes and hands to interact with the system.

Furthermore, Siri lacks some basic dialogue behaviours that humans frequently depend on when interacting with other humans. As an example, if the user interrupts a task with another task (as one might well need to do, e.g. to ask for a gas station while in the process of selecting what music to listen to), the first task is forgotten and any progress made there is lost, forcing the user to start the first task over. In addition, if exposed to some information that does not explicitly state the associated task (e.g. "7 o'clock"), Siri tends to jump to conclusions about what the user wants the system to do with this information, rather than asking the user for clarification. In effect, this forces the user to be more explicit when formulating utterances for Siri, which may result in a more distracting interaction.

Android-based systems

The Google Now/Google Search system distributed with Android was originally limited to voice search but has recently been extended with the possibility of making calls and accessing other non-search features. It is in many respects similar to Siri, but is less competent in terms of dialogue behaviours. Spoken interaction is used in conjunction with GUI/haptic interaction, and most interactions follow the pattern of an initial voice command followed by GUI/haptic interaction.

Samsung S Voice distributed with Samsung handsets is another variation on the same theme, with the screen displaying the spoken dialogue so far, until an external application (such as the music player) is launched and regular GUI/haptic interaction takes over.

Hound is a recent Android-based mobile VUI (although an iOS version has been announced), whose primary benefits over e.g. Siri are (1) faster interaction and (2) the ability to handle more complex queries. At the time of writing, Hound is only available in beta and only in the United States. Regarding dialogue behaviours (beyond interaction speed and query complexity) and multimodality, Hound appears comparable to other systems on the market.


Another recent mobile-based VUI, Microsoft's Cortana, is widely regarded as a response to Apple's Siri, with which it shares many properties. With regards to dialogue behaviours, it is less competent than Siri but more competent than Google Now. In terms of multimodality, it offers similar solutions to other systems on the market.

Problems with existing commercial systems

One thing that all the systems mentioned above have in common is that their interaction models are based on the assumption that the user is able to use voice, eyes and hands to interact with the system. In a second-screen scenario, this is clearly undesirable insofar as the user prefers to keep their visual attention on the first screen. A system that could interact with the user without requiring visual attention, using only their voice and (optionally) hands, would be preferable.

Research systems

A wide range of research systems have been built, with implementations ranging from very basic systems to full-fledged applications. Many different types of dialogues and architectures have been explored, from simple state-based systems for information collection to advanced negotiation systems based on general mechanisms such as planning, plan recognition and inference. Over the last decade, research has moved away from the symbol-processing methods of classical AI and concentrated on developing statistical methods for dialogue management, but success has so far been limited; notably, no existing commercial system uses statistical dialogue management. Another active area of research has been speech-based multimodal interaction, often in an in-vehicle setting.
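The simple state-based information-collection systems mentioned above typically follow a slot-filling pattern: the dialogue state is just the set of unfilled information slots, and the system prompts for them in turn. The sketch below is an illustrative toy, not any particular research system; the slot names, prompts, and taxi-booking task are invented for the example.

```python
# Sketch of a simple state-based dialogue manager for information
# collection (slot filling). Slots and prompts are illustrative only.

class SlotFillingDialogue:
    def __init__(self, slots):
        # Dialogue state: which slots are still unfilled.
        self.values = {slot: None for slot in slots}

    def next_prompt(self):
        """Prompt for the first unfilled slot; finish when all are filled."""
        for slot, value in self.values.items():
            if value is None:
                return f"Please provide the {slot}."
        return "Done. Booking a taxi from {origin} to {destination}.".format(
            **self.values
        )

    def fill(self, slot, value):
        """Record the value the user supplied for a slot."""
        self.values[slot] = value

# One possible run of the dialogue:
dialogue = SlotFillingDialogue(["origin", "destination"])
prompt = dialogue.next_prompt()   # asks for the origin first
dialogue.fill("origin", "Home")
dialogue.fill("destination", "Airport")
final = dialogue.next_prompt()    # all slots filled: task completes
```

This kind of design is easy to build and predictable, which is one reason state-based approaches remain common in deployed systems, but it cannot by itself handle task interruptions or context-dependent interpretation of the kind discussed for Siri above.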


Notable standards in the area of dialogue systems include VoiceXML (for dial-up services and form-filling dialogues), SCXML (an emerging standard for dialogue processing) and HTML5 (for web-based multimodal interaction).


  1. M. H. Cohen, J. P. Giangola, and J. Balogh, Voice user interface design. Addison-Wesley Professional, 2004.