The Affective Reasoning Project

Clark Elliott, December, 1994

Much of the technology to be used on the project has already been developed in prototypes. In previous work we have shown that the extensive multimedia interface can run concurrently with the emotion reasoning engine while still maintaining sub-second response time for the intelligent agents on a 66 MHz IBM-compatible platform. These multimedia-supported, emotion-related extensions are discussed below [Elliott1994b, Elliott94e].

To date we have used the Verbex Listen for Windows continuous speech recognition package to drive a very simple LISP natural language interface. This has allowed us to correctly recognize, on the first pass, 190 of 198 emotion words (e.g., distressed, chagrined, enraged) drawn from the work of Ortony et al. [Ortony et al. 1987], in conjunction with emotion intensity modifiers (e.g., extremely, somewhat, mildly) and a limited number of domain emotion-antecedent cues (e.g., the names of agents and simple phrases such as "did that" and "wanted to"). In addition, the computer has been able to reliably detect, for identical phrases, seven categories of emotion inflection that the user intended to communicate. Lastly, we have been able to communicate {\em between} agents through strictly aural channels by transmitting English messages through the text-to-speech module and receiving them through the speech recognition package. In this way the user is able to listen in on the communications between agents in a social environment. This aural channel between agents has carried affective information both in the emotion words themselves and in the emotion inflection of the speech [Elliott1994a].
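
As a rough illustration of the kind of lexical matching this interface performs, the following Common Lisp sketch pairs a recognized emotion word with an optional intensity modifier. The word lists, weights, and function names are illustrative only, not those of the actual system.

\begin{verbatim}
;;; Minimal sketch of emotion-word / intensity-modifier matching.
;;; Word lists and weights are illustrative placeholders.

(defparameter *emotion-words*
  '(("distressed" . distress) ("enraged" . anger) ("pleased" . joy)))

(defparameter *intensity-modifiers*
  '(("extremely" . 0.9) ("somewhat" . 0.5) ("mildly" . 0.2)))

(defun appraise-utterance (tokens)
  "Return (EMOTION . INTENSITY) for the first emotion word in TOKENS,
scaled by a preceding intensity modifier when one is present."
  (loop for (prev token) on (cons nil tokens)
        for emotion = (cdr (assoc token *emotion-words* :test #'string-equal))
        when emotion
          return (cons emotion
                       (or (cdr (assoc prev *intensity-modifiers*
                                       :test #'string-equal))
                           0.5))))

;; Example: (appraise-utterance '("i" "am" "extremely" "distressed"))
;;          => (DISTRESS . 0.9)
\end{verbatim}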

Because our agents can appraise, in real time, the rich affective content of this lexicon and, to some degree, of the speech inflection, we believe we will be able to make significant advances in modeling the affective state of the user. A key component of this is that the agents themselves maintain a rich internal representation of the emotional states of other agents, and this representation extends to the user as well.

The music module has allowed us to achieve what we believe to be a significantly increased level of {\em engagement} on the part of users. To achieve this we use a professional-quality Proteus 300-voice synthesizer, driven by a LISP interface to a large database of recorded MIDI files. In this way (although indexing is still in the preliminary stages) we are able to access any {\em millisecond} of the approximately 200-hour music database and, in most cases, have the music selection loaded and playing in less than one second. Additionally, once a selection has been loaded into an active memory queue, we are able to switch to it in about a millisecond. This means that the music response from our agents is virtually instantaneous, and that this response is entirely under the control of the intelligent affective reasoning engine.
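
The following sketch suggests the shape of such a cue interface; the structure, function names, and the stand-in playback call are illustrative, not the module's actual code.

\begin{verbatim}
;;; Sketch of a music-cue interface of the kind described above.
;;; QUEUE-CUE and SWITCH-TO-CUE are hypothetical names; the real
;;; module drives the synthesizer through MIDI playback.

(defstruct cue
  file            ; pathname of the recorded MIDI file
  offset-ms       ; millisecond offset into the selection
  loaded-p)       ; true once the selection is resident in the queue

(defvar *active-cues* (make-hash-table :test #'eq)
  "Cues already loaded into the active memory queue, keyed by emotion.")

(defun queue-cue (emotion file offset-ms)
  "Pre-load a selection so that a later switch is near-instant."
  (setf (gethash emotion *active-cues*)
        (make-cue :file file :offset-ms offset-ms :loaded-p t)))

(defun switch-to-cue (emotion)
  "Begin playback of the pre-loaded cue for EMOTION, if any."
  (let ((cue (gethash emotion *active-cues*)))
    (when cue
      ;; In the real system this would hand the cue to the MIDI player.
      (format t "~&Playing ~a from ~d ms~%" (cue-file cue) (cue-offset-ms cue))
      cue)))
\end{verbatim}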

Using the Proteus synthesizer(s) and high-quality playback equipment, we are able to play back full orchestral scores in a close approximation of acoustically recorded music. For some instrumentation (such as certain chamber scorings) the result is at times difficult to distinguish from real instruments. It is important to note that the recorded {\em music} is not limited in its expressivity by the MIDI medium; the clinical quality of much recorded MIDI music stems from low-quality input, not from the mechanism itself.

Using this mechanism, agents might, for example, respond angrily with the full scoring of Beethoven's Fifth Symphony, with compassion through a light scoring of Brahms chamber music, or with fear using a contemporary work for percussion instruments. In this way, all of the activities of the agents, and the "stories" the agents are telling, are accompanied by the sort of musical enhancement present in the movies (cf. [Elliott1995a]).
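
The selection itself can be pictured as a simple table from emotion category to recorded cue, as in this illustrative sketch (the file names are placeholders, not entries in the project's index):

\begin{verbatim}
;;; Illustrative mapping from emotion category to a musical cue,
;;; echoing the examples above; entries are placeholders.

(defparameter *emotion-music-table*
  '((anger      . "beethoven-5th-i.mid")
    (compassion . "brahms-chamber.mid")
    (fear       . "percussion-work.mid")))

(defun cue-for-emotion (emotion)
  "Return the MIDI selection associated with EMOTION, or NIL."
  (cdr (assoc emotion *emotion-music-table*)))
\end{verbatim}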

The facial graphics engine currently supports about sixty different schematic emotion faces. These are morphed from one to another in real time, so that approximately 3,000 different morphs are available to the system. When the agents speak, their mouths move in concert with the rest of the morphing process. We also have program control over the speed and granularity of the morph, and over the location, size, and color of the faces, and we can, to some degree, animate their movement across the screen from one position to another. In this way, as an agent's emotion state changes (possibly while it is speaking), the agent's face reflects the cognitive nature, and intensity, of the emotion: it may slowly grow content, or rapidly become intensely remorseful.
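
A request to the facial graphics engine can be pictured roughly as follows; the slot and function names are illustrative, not the engine's actual interface.

\begin{verbatim}
;;; Sketch of the kind of morph request described above; names and
;;; defaults are illustrative only.

(defstruct morph-request
  from-face        ; schematic face currently displayed
  to-face          ; schematic face for the new emotion state
  (steps 20)       ; granularity of the morph
  (speed 1.0)      ; relative speed of the transition
  position         ; (x . y) screen location
  size             ; face size in pixels
  color)           ; face color

(defun make-emotion-morph (from-face to-face &key (steps 20) (speed 1.0)
                                                  position size color)
  "Build a request to morph FROM-FACE into TO-FACE; a rapid, coarse morph
might express sudden rage, a slow fine-grained one growing contentment."
  (make-morph-request :from-face from-face :to-face to-face
                      :steps steps :speed speed
                      :position position :size size :color color))

;; Example: (make-emotion-morph 'neutral 'intense-remorse :speed 3.0)
\end{verbatim}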

The text-to-speech component allows us to construct utterances dynamically under LISP program control and then speak them to the user. We currently have Windows message-passing access to two speech packages -- Monolog for Windows, which gives us a reasonable degree of emotion inflection, and TextAssist (DECtalk), which gives us a wide range of characters. In this way we are able not only to express text aurally, but also to {\em inflect} the text to convey emotional state. That is, when an agent is, for example, afraid, it sounds plausibly afraid; when it is sad, it sounds plausibly sad.
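
The following sketch illustrates the idea of assembling an inflected utterance under program control; the inflection markup and function name are invented for illustration and do not reflect the actual Monolog or TextAssist message formats.

\begin{verbatim}
;;; Sketch of assembling an inflected utterance under program control.
;;; The markup and SPEAK-INFLECTED are illustrative; the real packages
;;; are driven through their own Windows message-passing interfaces.

(defparameter *inflection-prefixes*
  '((fear    . "[tone: tremulous] ")
    (sadness . "[tone: slow, low] ")
    (joy     . "[tone: bright] ")))

(defun speak-inflected (text emotion)
  "Return the string that would be handed to the text-to-speech package,
with an emotion-dependent inflection prefix."
  (concatenate 'string
               (or (cdr (assoc emotion *inflection-prefixes*)) "")
               text))

;; Example: (speak-inflected "I think someone is following us." 'fear)
\end{verbatim}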

The emotion reasoning engine tracks twenty-four categories of emotion, each of which may be instantiated with many different {\em instances}. For example, in the anger category we have annoyance, anger, rage, apoplexy, and so forth. For each emotion we track subsets of twenty-five different emotion intensity variables [Elliott and Siegle1993]. Emotion states are manifested through approximately 450 different action channels, leading to about 1,000 different action nodes [Ortony et al. 1988, Elliott1992].
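
The category/instance/intensity bookkeeping can be pictured schematically as follows; the slot names and sample intensity variables are illustrative only.

\begin{verbatim}
;;; Schematic sketch of category / instance / intensity bookkeeping;
;;; slot names and sample entries are illustrative only.

(defstruct emotion-instance
  category         ; one of the twenty-four categories, e.g. ANGER
  label            ; instance within the category, e.g. ANNOYANCE or RAGE
  intensity-vars)  ; alist of the intensity variables that apply

(defun make-anger (label &rest intensity-vars)
  "Construct an instance in the ANGER category."
  (make-emotion-instance :category 'anger
                         :label label
                         :intensity-vars intensity-vars))

;; Example:
;; (make-anger 'rage '(blameworthiness . 0.9) '(unexpectedness . 0.7))
\end{verbatim}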

Agents may also have moods, which affect both their behavior and their appraisals of the antecedents of emotion in emotion-eliciting situations. Likewise, agents have dynamic relationships with other agents, which affect both the emotions they may have about the fortunes of those other agents (e.g., pity rather than gloating at another's misfortune) and their own emotions more generally (e.g., enemies may tend to be held more accountable for actions seen as blameworthy) [Elliott and Ortony1992].
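
Schematically, the pity-versus-gloating example can be captured by a rule of the following kind; the relationship and fortune values are simplified for illustration.

\begin{verbatim}
;;; Sketch of relationship-dependent appraisal of another agent's
;;; fortunes, following the pity / gloating example above; the
;;; relationship values and rule are simplified for illustration.

(defun fortunes-of-other-emotion (relationship fortune)
  "Return the emotion category an agent might have about another agent's
FORTUNE (:good or :bad), given the RELATIONSHIP (:friend or :enemy)."
  (cond ((and (eq relationship :friend) (eq fortune :bad))  'pity)
        ((and (eq relationship :enemy)  (eq fortune :bad))  'gloating)
        ((and (eq relationship :friend) (eq fortune :good)) 'happy-for)
        ((and (eq relationship :enemy)  (eq fortune :good)) 'resentment)))

;; Example: (fortunes-of-other-emotion :enemy :bad) => GLOATING
\end{verbatim}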

Lastly, agents have rich, idiosyncratic personalities, which affect both the way they see the world (and consequently the emotion-antecedent appraisals they make) and the way they manifest their emotional states in the system. Like people, one agent may be gratified that some event has taken place while another is remorseful (and in fact a single agent may have conflicting emotions about the same event). In this way, agents have their own highly idiosyncratic, but consistent, ways of interacting with their environment, including the user [Elliott1993].

Taken as a whole, these many facets of our affectively intelligent agents, each implemented with sub-second response time, yield engaging, highly responsive computer entities with which the user can carry on a social dialog.

BIBLIOGRAPHY: