The presentation studies tend to show that (1) computers can convey social information beyond that encoded in text and object representations; (2) this information can be delivered without consuming bandwidth in the traditional text communication channel (that is, the content measured in the studies was explicitly not that encoded in the text); (3) this information can be encoded and delivered in real time; and (4) the computer performs reasonably well on social communication tasks that are difficult for humans.
The preliminary work with music tends to show that subjects rate music as having a significant effect in guiding their social perception, but that this effect is not well understood (or, possibly, that the musical triggers for this effect are not well understood). We feel there is strong potential in this area.
Furthermore, the studies suggest the following: (1) The underlying emotion theory is a plausible categorization system, to the extent that subjects were able to discriminate the twenty-one emotion categories used in the study. (2) Despite being inexpensive and commonly available, this platform is viable for studying emotion interaction between humans and computers. (3) The low-bandwidth model we used (i.e., less than 14K bps), which shows great promise as a web-based data collection and delivery mechanism, nonetheless provides sufficiently rich channels for real-time multimodal communication of social/emotion content. (4) Potentially useful information can be conveyed about this complex, ubiquitous, and yet lightly studied component of human (and human-computer) interaction. (5) For selected, pre-programmed emotion content in ``social'' scenarios, multimedia, multimodal computer presentations can achieve highly significant reductions in time investment over human actors in a real-time environment, without reduction of the effective content.
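To make the low-bandwidth point concrete, a back-of-envelope sketch shows how several simultaneous channels can fit comfortably under a 14K bps budget. The per-second payload figures below are purely illustrative assumptions, not measured values from the study:

```python
# Sketch: the <14K bps budget leaves ample room for multiple
# real-time channels. All payload sizes below are hypothetical,
# chosen only to illustrate the order of magnitude.
BUDGET_BPS = 14_000  # stated channel budget, bits per second

# Hypothetical per-second payloads for a multimodal presentation:
channels = {
    "text (20 chars/s, 8 bits each)": 20 * 8,
    "MIDI-style music events (30 events/s, 3 bytes each)": 30 * 3 * 8,
    "facial-expression parameters (10 params at 10 Hz, 1 byte each)": 10 * 10 * 8,
    "gesture/posture codes (5 codes/s, 2 bytes each)": 5 * 2 * 8,
}

total = sum(channels.values())
for name, bps in channels.items():
    print(f"{name}: {bps} bps")
print(f"total: {total} bps (budget {BUDGET_BPS} bps, "
      f"headroom {BUDGET_BPS - total} bps)")
```

Under these assumed encodings the combined channels use well under two thousand bits per second, suggesting that the budget itself is not the limiting factor for real-time multimodal delivery.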
While our results showed that the computer actually did better at this restricted task than did the human actor, we are cautious about drawing general conclusions. Questions arise: (1) How good was the actor? (2) How does one measure ``goodness'' in an actor? (3) How appropriate was the actor for this medium, this audience, and this task? (4) How much do professional lighting, editing, and sound mixing of the human-actor presentations affect the identification task? It would be possible to control for these factors by, e.g., measuring the effectiveness of different actors on these specific tasks, seeking funding for [expensive!] professional studio time, and so forth. Were this pursued, we might be able to make some claims about the computer being ``better'' at conveying social/emotion content in some situations than humans. However, this was not our goal. We used the human actor simply to illustrate that, as designed for the study, correct identification of the broad range of interpretations was a difficult task, and that a seventy percent identification rate was admirable.
We can also address some of these questions from a common-sense perspective. Our professional actor had spent long years honing such skills and was specifically coached on how to discriminate the different emotion categories (e.g., was told to use technique ``A'' instead of ``B'', both of which were valid, for expressing a specific interpretation because it would be less likely to be confused with another interpretation to be presented later). Presumably this professional would be at least as good at these tasks as a ``typical'' person from the population. (Anecdotally, the actor was quite good, showing an impressive range of expressiveness and flexibility in addressing the task.)
It is important to note, also, that the sentences were entirely ambiguous: long-hand ad hoc interpretations collected from subjects before the presentations showed no patterns of interpretation whatsoever. A seventy percent correct interpretation rate, with no content clues, is rather high, considering that in practice the communication of such content, completely divorced from cues, will be rare.
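The claim that seventy percent is ``rather high'' can be grounded arithmetically: with twenty-one categories, uniform guessing succeeds less than five percent of the time. The sketch below computes that chance baseline and an exact binomial tail probability; the trial count is a hypothetical placeholder, not the study's actual sample size:

```python
# Sketch: why 70% correct identification is far above chance for
# twenty-one categories. Category count and observed rate come from
# the text; the number of trials is a hypothetical assumption.
from math import comb

categories = 21
p_chance = 1 / categories  # ~0.048 under uniform random guessing

def binom_tail(n, k, p):
    """P(X >= k) for X ~ Binomial(n, p): the probability of doing
    at least this well by guessing alone."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

n_trials = 20   # hypothetical number of presentations per subject
k_correct = 14  # 70% of 20

p_value = binom_tail(n_trials, k_correct, p_chance)
print(f"chance rate: {p_chance:.3f}")
print(f"P(>= 70% correct by guessing): {p_value:.2e}")
```

Even for this modest assumed trial count, the tail probability is vanishingly small, so the 70% rate cannot plausibly be attributed to guessing.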
Additionally, we suggest that, in general, one-time real-life emotion assessment of the sort required here might well be correct less than seventy percent of the time. People use additional cues to disambiguate situations; they ask questions that help them clarify their interpretations; they observe emotion in a continuous social context (and thus continually revise previous interpretations); and they simply get it wrong much of the time.
Lastly, we specifically made NO attempt to give any feedback about the correctness of interpretations during the course of the study. It is quite possible that, given such feedback, subjects would learn the specific emotion presentations used by our interactive computer agents, thus raising the identification rate significantly.