Consistent with ``virtual actor'' mode, the study discussed here does not actually make use of the ``intelligent'' components of the agents per se. Nonetheless, it makes a specific case for the usefulness of such agents as they develop, and lends some credence to the theory underlying the agents' development. That is, in this case we showed that static, pre-programmed social/emotional content can be effectively communicated by the presentations these agents have at their (real-time) disposal. Since our larger body of work establishes relatively robust coverage of the emotion categories used in this study, and since these categories can be directly manipulated by our autonomous agents, we hope the conclusion drawn will be that communication from ``emotionally intelligent'' computer agents (whatever form they ultimately take) to human users is both practical and plausible.
The study involved 141 subjects who met for two sessions each, yielding approximately 14,000 analyzed responses. The subjects were urban undergraduate students of mixed racial and ethnic backgrounds, primarily upperclassmen; about half were evening students who tended to be over twenty-five years of age. Three different sets of subjects met. The studies were undertaken as part of a course, but students participated as subjects before any theoretical material was presented. The subjects were given tasks in which they were instructed to match a list of emotion scenarios with a set of videotape presentations in one-to-one correspondence. The lists ranged in length from four to twelve items. The presentations were approximately five seconds long, with about twenty seconds between them (and approximately twelve seconds between second presentations). The presentations were of the ``talking-head'' type (either computer or human), expressing facial emotion content with inflected speech (and, in some of the computer cases, music).
For example, in one set, twelve presentations of the ambiguous sentence, ``I picked up Catapia in Timbuktu,'' were shown to subjects. These had to be matched against twelve scenario descriptions such as, (a) Jack is proud of the Catapia he got in Timbuktu because it is quite a collector's prize; (b) Jack is gloating because his horse, Catapia, just won the Kentucky Derby and his arch rival Archie could have bought Catapia himself last year in Timbuktu; and (c) Jack hopes that the Catapia stock he picked up in Timbuktu is going to be worth a fortune when the news about the oil fields hits; [etc., (d) -- (l)].
Five minutes of instructions were given before the first session. These included verbal instructions, and a simple two-part practice session with videotape talking-head computer presentations. Furthermore, written instructions were given at the top of each printed answer sheet, of the general form: ``When the video begins, write the number of the video episode next to the sentence that best describes the emotion [Naomi] is expressing. (played twice)'' The computer video display used an MS-Windows window with the name of the speaking character appearing in the title bar.
Confidence factors were additionally recorded for much of the material, where subjects rated each of their responses from ``1'' (not confident) to ``5'' (highly confident).
The human actor was coached on the subtleties of the different emotion categories, and on what would help to distinguish them. Three to eight takes were made of each interpretation for each scenario. The most expressive take was chosen during editing and a final tape compiled.
The computer was simply given the emotion category and the text, and it automatically selected the face, music, and spoken inflection appropriate to that category. Face morphing, speech generation, and music retrieval and synthesis were all done in real time. Actual music selection was up to the program, based on pre-existing categories. The computer presentations were further broken down into face-only, face and inflection, and face-inflection-music sub-categories in the study.
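The selection step described above can be sketched as a simple table lookup driven entirely by the emotion category. This is an illustrative sketch only: the paper does not describe its internal data structures, so every name here (the category labels, the table entries, the `Presentation` fields) is hypothetical.

```python
# Hypothetical sketch of category-driven presentation assembly.
# The real system morphs faces, generates speech, and retrieves/synthesizes
# music in real time; here those outputs are stood in for by string tags.
from dataclasses import dataclass

# Hypothetical mapping from emotion category to presentation parameters.
PRESENTATION_TABLE = {
    "pride":    {"face": "smile_raised_chin", "inflection": "rising_emphatic",
                 "music_tag": "triumphant"},
    "gloating": {"face": "smirk",             "inflection": "drawled_stress",
                 "music_tag": "sly"},
    "hope":     {"face": "brows_raised",      "inflection": "upward_contour",
                 "music_tag": "anticipatory"},
}

@dataclass
class Presentation:
    face: str
    inflection: str
    music_tag: str
    text: str

def build_presentation(category: str, text: str,
                       use_music: bool = True) -> Presentation:
    """Given only an emotion category and the text to speak, select the
    face, speech inflection, and (optionally) music for that category."""
    params = PRESENTATION_TABLE[category]
    return Presentation(
        face=params["face"],
        inflection=params["inflection"],
        music_tag=params["music_tag"] if use_music else "",
        text=text,
    )

p = build_presentation("gloating", "I picked up Catapia in Timbuktu.")
```

Dropping the `use_music` flag (or, analogously, the inflection) yields the face-only and face-plus-inflection sub-categories used in the study.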
The ratio of time invested between the human-actor version and the computer version was approximately 30:1.
Overall, subjects did significantly better at correctly matching videotapes of computer-generated presentations with the intended emotion scenarios (70%) than they did with videotapes of a human actor attempting to convey the same scenarios.
Among those participants matching computer-generated presentations to given emotions, there were no differences in correct matches between presentation types (face = 69%, face plus intonation = 71%, face plus intonation plus music = 70%). However, an overwhelming majority of these same participants felt that music was very helpful in making a correct match (75%), and another 8% felt that it was somewhat helpful. Less than 3% felt the music was unhelpful or distracting. One group was asked to rate their confidence after each match. An analysis of their confidence ratings indicated that participants were significantly more confident of matches with displays including music (F(2, 1638) = 19.37, P < .001). This could be problematic if music inspired confidence but, in fact, impaired matching ability. A simple look at the proportion of correct matches across the five confidence levels shows that this is not the case. On a scale where ``1'' means low confidence and ``5'' means high confidence, these participants correctly matched 41% of the time when their confidence was ``1'', 56% of the time when it was ``2'', 58% of the time when it was ``3'', 64% of the time when it was ``4'', and 76% of the time when it was ``5''.
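The confidence analysis above is a one-way ANOVA across the three presentation types. The following sketch shows the shape of such a test in pure Python; the ratings are synthetic stand-ins (the study's actual data are not reproduced here), so the numbers it produces illustrate the method only, not the reported F(2, 1638) = 19.37 result.

```python
# One-way ANOVA from first principles: F = (between-group mean square) /
# (within-group mean square). Larger F means group means differ more than
# within-group noise would predict.
def one_way_anova(*groups):
    k = len(groups)                          # number of groups
    n = sum(len(g) for g in groups)          # total observations
    grand = sum(sum(g) for g in groups) / n  # grand mean
    # Between-group sum of squares: each group mean vs. the grand mean.
    ssb = sum(len(g) * (sum(g) / len(g) - grand) ** 2 for g in groups)
    # Within-group sum of squares: each observation vs. its group mean.
    ssw = sum(sum((x - sum(g) / len(g)) ** 2 for x in g) for g in groups)
    msb = ssb / (k - 1)
    msw = ssw / (n - k)
    return msb / msw, (k - 1, n - k)         # F statistic and (df1, df2)

# Synthetic confidence ratings (1-5) for the three presentation types.
face_only             = [3, 2, 4, 3, 3, 2, 4, 3]
face_intonation       = [3, 4, 3, 3, 4, 2, 4, 3]
face_intonation_music = [4, 5, 4, 3, 5, 4, 4, 5]

f_stat, df = one_way_anova(face_only, face_intonation,
                           face_intonation_music)
```

With these synthetic ratings the music condition's higher mean drives a large F, mirroring the direction (not the magnitude) of the reported effect.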
Inflection has not been stressed in either the study or analysis, because the techniques we can support in this area are not very sophisticated. Our best guess, based on experience over time, is that rudimentary emotion inflection in generated speech enhances the believability of characters.
Other results based partly on the coding of long-hand responses are not presented as part of this short paper.