A Facial Animation Markup Language (FAML) for the Scripting of a Talking Head


by




Quoc Hung Huynh

09525748




Project Supervisor: Andrew Marriott





Submitted to

the Department of Computer Science

in fulfillment of the requirements for the degree of




Bachelor of Science (Computer Science) (Honours)



at


School of Computing

Curtin University of Technology

Perth, Western Australia



November 24, 2000


© Curtin University of Technology 2000








Abstract



The FAQBot forms the focus of our project. The FAQBot Talking Head animation combines a text-to-speech (TTS) system, an MPEG-4 based facial animation engine (FAE) and artificial intelligence (AI) to produce a 3D talking head that answers users' requests. The aim of this project is to implement a Facial Animation Markup Language (FAML) that enables control of the animated Talking Head, including its facial expressions, gestures and emotions, through the input text stream.


Our focus of research encompasses the domains of human psychology and cognitive sciences, computer graphics, computer vision and human-machine interaction to identify the factors that contribute to non-verbal communication of facial gestures, expressions and emotions in humans. From this we derived our subset of FAML tags to mimic identified non-verbal behaviours. The FAML tags form the tools needed to realistically animate the Talking Head.


The research utilises the MPEG-4 facial animation coding standard. The subset of FAML tags specifies the movement of Facial Animation Parameters (FAPs) as defined by the MPEG-4 specification. The FAPs are used to display the facial expressions denoted by the FAML tags. The FAML tags are implemented to work in conjunction with the personality of the Talking Head, allowing smooth and continuous animation. The timing of gestures is synchronized to the audio clock, which is defined by the timing of the Talking Head's synthesized speech. This dissertation describes the development of the tools and techniques required to control the text driven animation of the Talking Head through the use of FAML tags.






















Acknowledgements



I wish to acknowledge the people who have supported me and made significant contributions to my Honours work. First and foremost, I wish to thank my project supervisor, Andrew Marriott, for initially giving me the chance to work on a project that appealed to me and yet challenged the depth of my skills. I would also like to thank him for his guidance, encouragement and support above and beyond my project.


I am indebted to Trang Ly for her support, forgiveness and encouragement, for allowing me to use her as a facial model, and for making the late nights and weekends spent at university more enjoyable.


Special thanks to John Stallo, with whom I worked on this project to truly bring the Talking Head to life. John was not only an academic colleague but also a friend, who allowed me to discuss problems with him and contributed meaningful ideas to the work. Lastly, I would like to extend my gratitude to the previous students, too numerous to name, who have contributed to the body of work that is the FAQBot, for I have truly stood on their shoulders.






















Quoc Hung Huynh

11 November 2000





Contents




Chapter 1

Introduction

Chapter 2

Problem Description

2.1 Problem Statement

2.2 Subproblems

2.3 Significance of Study

Chapter 3

Literature Review

3.1 Talking Heads

3.2 Face Modelling

3.2.1 Interpolation

3.2.2 Parametric

3.2.3 Physical

3.3 Nonverbal Behaviour and Communication

3.3.1 Facial Displays

3.3.2 Facial expression

3.3.3 Emotion

3.3.4 Head Movement

3.3.5 Eye behaviour

3.4 Implemented Visual Text-to-Speech (VTTS) systems

3.5 MPEG-4

3.5.1 Facial Animation in MPEG-4

3.6 Markup Languages

3.7 Virtual Characters

3.8 Summary of literature review

Chapter 4

Research Methodology

4.1 Hypothesis

4.2 Delimitations and Assumptions

4.3 Limitations

4.4 Design and Demonstration

4.5 Evaluation

Chapter 5

Implementation

5.1 Background

5.1.1 Input text stream

5.1.2 Collaboration

5.1.3 The TTS Module

5.1.4 PST Personality Module

5.2 Overview

5.2.1 Festival Text-To-Speech Synthesiser Word Expansion

5.3 Synchronisation

5.3.1 Timing

5.3.2 Frames

5.3.3 FAML Synchronisation

5.4 Personality and Gesture conflict resolution

5.4.1 Blinking

5.4.2 Eyebrow

5.4.3 Gestures and Head Movements

5.4.4 Expressions

5.4.5 Emotion

5.5 Generic Tag specifications

5.6 Tag animation

5.7 Gesture FAML Tags

5.7.1 Head

5.7.2 Eyes

5.7.3 Brows

5.8 Expression FAML Tags

5.9 Emotion FAML Tags

5.10 Virtual Characters

5.10.1 News presenter

5.10.2 Sales assistant

5.10.3 Narrator / Storyteller

5.11 Producing realistic animation

5.11.1 Realistic Head Turns

5.11.2 Realistic Eye Movements

Chapter 6

Results and Analysis

6.1 The Experiment

6.2 Evaluation of Results

6.2.1 Profile of Users

6.2.2 Results of Phase Two

6.2.3 Results of Phase Three

6.3 Summary of Results

Chapter 7

Conclusions

7.1 Future work

Bibliography

Appendix A

Appendix B

Appendix C

Appendix D

Appendix E

Appendix F

Appendix G









































List of Figures



Figure 1 Facial expressions

Figure 2 Typical input text document for the FAQBot application

Figure 3 Straight filtering of unknown tags in the input text

Figure 4 Filtered input text document preserving utterance structure

Figure 5 FAQBot animation modules

Figure 6 FAML module overview

Figure 7 Festival expansion list

Figure 8 Utterance timing file

Figure 9 Phoneme data for the word "Here's"

Figure 10 Phoneme duration breakdown

Figure 11 Complete duration information for the utterance "Here's"

Figure 12 Timing file for utterance "Here's the latest news"

Figure 13 Calculated word timing for the utterance "Here's the latest news"

Figure 14 Start time values of each word as offset from time 0 of the audio stream

Figure 15 Frame synchronisation of word in the animation sequence

Figure 16 Example of FAML tags in text

Figure 17 FAML tags synchronization

Figure 18 A generic FAML tag

Figure 19 Breakdown of smile tag animation

Figure 20 The amplitude of a generic FAML tag over its duration in the animation sequence

Figure 21 The amplitude of a "smile" FAML tag over its duration in the animation sequence

Figure 22 Comparative Storyteller results from demonstration 1 Vs demonstration 2

Figure 23 Comparative News presenter results from demonstration 1 Vs demonstration 2

Figure 24 Comparative Sales assistant results for demonstration 1 Vs demonstration 2




















List of Tables



Table 1 Some communicative facial displays categorized by Chovil (1991)

Table 2 Facial Animation Parameter (FAP) Groups

Table 3 Primary Facial expressions as defined for FAP 2

Table 4 McNemar's and Stuart Maxwell's p values for Storyteller character

Table 5 McNemar's and Stuart Maxwell's p values for News presenter character

Table 6 McNemar's and Stuart Maxwell's p values for Sales assistant character

Table 7 TTS vs TTS and FAML results































Chapter 1



Introduction






Facial animation is now attracting more attention than ever before in its 25 years of development. Imaginative applications of animated graphics can be found in sophisticated human-computer interfaces, interactive games, multimedia titles, virtual reality and in an extensive variety of computer-generated animations. Supporting technologies include synthesized speech and artificial intelligence. The goal is to synthesize realistic Talking Heads, representing the dynamic facial likeness of humans.


One particular application of a computer generated Talking Head is the FAQBot application (Beard et al., 1999), developed jointly by Curtin University of Technology and the University of Genoa, which uses an animated Talking Head as its interface. The FAQBot is designed to answer users' frequently asked questions on predefined topics. The FAQBot combines an MPEG-4 based facial animation engine (FAE), a text-to-speech (TTS) system and artificial intelligence (AI).


The FAQBot application is still in development and forms the basis for this project. The FAQBot has evolved from a simple Talking Head animation, with only animated lip movement, to a Talking Head that can display user-defined personalities. However, the FAQBot Talking Head is still lacking in terms of animation control. Because the animation of the FAQBot Talking Head is text driven, control of the animation needs to reside within the input text. It is the task of this project to implement a markup system for the text input to the FAQBot Talking Head application to control the animation of the facial gestures, expressions and emotions of the Talking Head.


Animating the face by specifying every action manually is a very tedious task and often does not yield the desired results. In order to improve facial animation systems, an understanding of non-verbal communication and non-verbal behaviour is an important priority. It is suggested that integrating non-verbal behaviour such as facial gestures, expressions and emotions with the accompanying speech will increase the realism of the Talking Head animation (Pelachaud et al., 1996).


Facial expression changes continuously in humans, and many of these changes are synchronized to the spoken discourse. When people speak, their faces are rarely still. They not only use their lips to talk, but also raise their eyebrows, move and blink their eyes, or nod and turn their heads. Facial expression is linked to the content of speech, for instance scrunching one's nose when talking about something unpleasant. It is also inherently linked to emotion, personality, and other behavioural variables (Ekman, 1979).


The goals and contributions of the research are described in chapters 2 and 4. A discussion of the relevant literature and background follows in chapter 3, encompassing the domains of psychology, cognitive sciences, computer graphics, computer vision and human-machine interaction. Chapter 5 describes the implementation of the facial animation markup. Chapter 6 details the experiments and analyses the data acquired from them. Finally, chapter 7 presents the conclusions of this research and suggestions for future work.



































Chapter 2



Problem Description






The following sections formally outline the problems investigated in this research. We further discuss the significant aspects concerning our project.


2.1 Problem Statement


The aim of this research is to design and implement a Facial Animation Markup Language (FAML) to control the facial gestures, expressions and emotions in the Talking Head animation for the FAQBot application. The FAML is to be used in the input text stream to "drive" the facial animation of the Talking Head. The FAML will enable the animator to mark up the input text, specifying the type, intensity and duration of facial gestures, expressions and emotions. The facial displays will be synchronized to the speech such that the timing of each facial display coincides with its position in the input text. Facial displays encompass the facial expressions, movements, gestures and emotions displayed by the face.


2.2 Subproblems


Non-verbal communication


We consider the work to be an issue of multi-modal communication, particularly the non-verbal mode. It is imperative that the FAML is able to animate the aspects of non-verbal communication that relate to the content and structure of the spoken text, as well as the underlying behavioural aspects of human physiology present during communication. The FAML must therefore provide the functionality that allows the animator to exhibit these non-verbal displays on the Talking Head.


Animation control


The current animation of the Talking Head is probabilistic in nature, and no exact method of control can be utilised to animate it. The Talking Head is able to portray a personality, but is unable to link the personality to the text or speech. The FAML provides control over the animation and directs the facial displays for the Talking Head animation.


The FAML is to work in conjunction with the underlying personality of the Talking Head and as such a method of conflict resolution between the two animation processes is required to ensure the continuous and smooth animation of the Talking Head.



Mutually exclusive personalities


Currently, only one personality can be portrayed for each animation sequence of the Talking Head. No mechanisms exist to allow the personalities to change during the animation. The FAML, although not able to specify a personality, is capable of changing the facial expression of the Talking Head during the animation such that a friendly personality can display sad facial expressions.


2.3 Significance of Study


In the current implementation of the FAQBot application, there is no mechanism to control the animation of facial expressions, gestures and emotions for the Talking Head animation. Recent developments have included a personality for the Talking Head, allowing probabilistic movements and expressions to be displayed in the Talking Head animation through personality defined parameters. However, there still does not exist a mechanism to allow consistent animation of a character or persona in the Talking Head animation.


The implementation of the FAML allows high-level control of the Talking Head through the input text stream, enabling specified gestures, expressions and emotions to be scripted into the Talking Head animation, synchronized to the speech. The FAML tags can be used in conjunction with the underlying personality of the Talking Head to convey a consistent persona or character for the Talking Head animation.


The FAML enables the Talking Head animation of the FAQBot application to be further utilised in scripting characters such as virtual storytellers, virtual news presenters and virtual sales assistants.


An important aspect of this proposed research is that the work is based on the recently standardized MPEG-4 specification. MPEG-4 enables the animation of 3D Talking Heads using very low bandwidth, enabling smooth facial animation in multimedia and web based applications.







Chapter 3



Literature Review







The literature review begins with an introduction to the domain of Talking Heads, which are synthetic computer modelled human faces that can speak and move, and in particular the seminal FAQBot application. It then describes aspects of facial modelling, the types of models and the animation techniques used to breathe life into them. The Talking Head and its animation feature prominently in our work, and as such the techniques that are used to derive and manipulate the face model are of significant interest.


As part of the delimitations of this body of work, the modelling of the relationship between gesture and speech is beyond the scope of the project. It is however relevant to address the level of symbiosis between the two communication channels and this will be discussed through aspects of multi-modal communication.


Animation control techniques are discussed as methods for animating and simulating a Talking Head. Particular attention is paid to the MPEG-4 specifications to which our project is delimited (see section 4.2). A section of this literature review is dedicated to the exploration of MPEG-4, its specifications and facial animation coding system.


This research involves the implementation of a facial animation markup language to script facial expressions, gestures and emotions. As such, we touch upon the types of non-verbal communication and their function in humans to narrow our focus on which expressions and gestures we choose to model. The emotional and linguistic aspects of gestures and how well they relate semantically will also be addressed. We also discuss markup languages in detail, and how they can be used to structure and organize input data for our Talking Head.


Lastly we move on to virtual characters and how they can be constructed and animated using FAML tags to produce "believable characters" that convey the illusion of life.


3.1 Talking Heads


Synthetic Talking Heads are a rapidly developing research area that is still in its infancy. The area continues to attract attention for its application potential: a Talking Head can be used to synthesise an intelligent desktop agent, a virtual friend, a virtual salesperson, a virtual teacher, virtual presenters and even virtual actors (Noh and Neumann, 2000) (Binsted, 1999) (Parke and Waters, 1996).


The research and implementation of a Talking Head encompasses many disciplines, including facial animation, speech synthesis and multi-modal communication.


The FAQBot (Beard et al., 1999) is an example of the application of a Talking Head. The FAQBot application forms the interface to a Frequently Asked Questions (FAQ) database. The FAQBot application accepts user input based on the FAQ topic and the underlying AI (Artificial Intelligence) matches the input to an answer. The Talking Head then communicates this answer both visually and audibly. In its original state the FAQBot application was very static in nature and the Talking Head did not convey much movement. Recent developments made by Shepherdson (2000) have incorporated personality traits into the Talking Head, improving the realism of the Talking Head animation.


Binsted (1999) relates the application of a Talking Head to a soccer game commentator known as Rocco. Rocco is designed as a system for analysing simulation league games and generating multimedia presentations of the games. Its output is a combination of spoken natural language utterances, gestures and facial expressions. Although Rocco is still in its infancy, Binsted (1999) has designed it to be as believable as possible, mimicking the consistency between expression and action, as well as the modalities of expression.


All of the above applications have required a large amount of research involving aspects of facial animation, speech synthesis, non-verbal and multi-modal communication. The animation of a synthetic face is a very important aspect of this research as it forms the visual modality for communication. We discuss facial models and animation in the following section.


3.2 Face Modelling


A face is an independent communication channel that conveys both emotional and conversational signals, encoded as facial expressions (Nagao and Takeuchi, 1994).


As the technology of computer graphics and animation has advanced, so too has the realism and performance of facial modelling and animation. Recent progress in computational power and facial animation has opened the door to powerful tools for the design, implementation and exploration of virtual environments (Badler, 1995) (Parke and Waters, 1996). Facial models are now more complex than ever, capable of modelling greater dimensions and subtlety in the human face, even to the extent of wrinkle modelling as described by Pelachaud and Prevost (1995).


A number of approaches have been applied to the animation of synthetic faces. The following sections present three common facial modelling and animation techniques: interpolation, parametric and physical.


3.2.1 Interpolation


In early systems, modelling was done by digitizing the face (or part of the face) with different expressions. Each expression model was stored in an expression database. The animation was obtained by interpolating between two expressions. This method was very simple, but also an arduously time-consuming one. Even though simplistic, the system was still capable of generating expressive animations as outlined by Benoit et al. (1999).


Interpolation generalizes to polygonal surfaces by applying the scheme to each vertex defining the surface. Intermediate forms of the surface are achieved by interpolating each vertex between its two extreme positions (Parke and Waters, 1996). As noted by Shepherdson (2000), a basic assumption underlying the interpolation of facial surfaces is that a single facial topology can be used for each surface.
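
The vertex-wise scheme described above can be sketched in a few lines of Python. This is only an illustrative sketch, not code from the FAQBot or from the cited systems: it assumes straight linear interpolation, and the function names and the two tiny example "meshes" are hypothetical. Both expression models must share one topology, that is, the same number of vertices in the same order.

```python
from typing import List, Tuple

Vertex = Tuple[float, float, float]


def interpolate_vertex(a: Vertex, b: Vertex, t: float) -> Vertex:
    """Linearly interpolate one vertex between its two extreme positions."""
    return tuple(pa + t * (pb - pa) for pa, pb in zip(a, b))


def interpolate_surface(start: List[Vertex], end: List[Vertex], t: float) -> List[Vertex]:
    """Return the intermediate surface for blend factor t in [0, 1].

    Assumes both surfaces use a single shared topology: vertex i of
    `start` corresponds to vertex i of `end`.
    """
    if len(start) != len(end):
        raise ValueError("both expression models must share one topology")
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must lie between 0 and 1")
    return [interpolate_vertex(a, b, t) for a, b in zip(start, end)]


# Hypothetical two-vertex "expressions", for illustration only.
neutral = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
smile = [(0.0, 2.0, 0.0), (1.0, 4.0, 0.0)]
halfway = interpolate_surface(neutral, smile, 0.5)
```

Calling `interpolate_surface` with a blend factor that rises from 0 to 1 over successive frames yields the sequence of in-between surfaces from the first expression to the second.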


Interpolation is similar to the cel-based animation of cartoon characters, where key frames (key cels) are produced and intermediate cels are drawn to animate the cartoon from one key frame to the next (Pelachaud and Prevost, 1995). The key frame technique requires a complete, point-by-point specification at each key frame, but does not require the physical and structural formation of the model.


Key frame interpolation derived from simple interpolation is still widely used for implementing and controlling facial animation. This approach was first demonstrated by Parke (1972) to produce viable facial animation.


While this interpolation can be quite successful for limited applications, such as creating stimuli for perceptual experiments, such a system lacks the flexibility of animating the face to represent realism and consistency, since there is no way to control different facial features independently of each other (Beskow, 1996).



3.2.2 Parametric



An alternative method, developed by Parke (1982), uses a parameterized three-dimensional facial model. Here, the facial model is produced through a set of parameters. Generally, the parameters can be divided into two main groups, expression and conformation parameters, as initially outlined by Parke (1990). Expression parameters can be used to specify expressions such as brow actions, mouth shape or head direction. Conformation parameters control the overall topology of the face, allowing local or global control, and relate to the actual parameters acting upon the topology of the face (including the position and size of features such as the eyes, nose and mouth). The animation is obtained by changing the set of parameter values and by interpolating between key frames (Pearce et al., 1986) (Cohen and Massaro, 1993) (Guiard-Marigny et al., 1994).
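
As a rough illustration of animating by changing parameter values and interpolating between key frames, the sketch below samples a set of expression parameters at an arbitrary time. It is a minimal sketch under stated assumptions: the parameter names (`brow_raise`, `jaw_open`) are hypothetical and are not MPEG-4 FAP names, the interpolation is piecewise linear, and key frames are assumed to be sorted by strictly increasing time with identical parameter names in every frame.

```python
from bisect import bisect_right
from typing import Dict, List, Tuple

# A key frame pairs a time in seconds with a full set of parameter values.
KeyFrame = Tuple[float, Dict[str, float]]


def sample_parameters(keyframes: List[KeyFrame], time: float) -> Dict[str, float]:
    """Return the parameter values at `time` by piecewise-linear interpolation.

    Times before the first key frame or after the last one are clamped
    to the nearest key frame.
    """
    times = [t for t, _ in keyframes]
    if time <= times[0]:
        return dict(keyframes[0][1])
    if time >= times[-1]:
        return dict(keyframes[-1][1])
    i = bisect_right(times, time)      # index of the first key frame after `time`
    t0, p0 = keyframes[i - 1]
    t1, p1 = keyframes[i]
    w = (time - t0) / (t1 - t0)        # blend weight within this segment
    return {name: p0[name] + w * (p1[name] - p0[name]) for name in p0}


# Hypothetical expression parameters: raise the brows and open the jaw, then relax.
track = [
    (0.0, {"brow_raise": 0.0, "jaw_open": 0.0}),
    (1.0, {"brow_raise": 1.0, "jaw_open": 0.5}),
    (2.0, {"brow_raise": 0.0, "jaw_open": 0.0}),
]
mid_rise = sample_parameters(track, 0.5)
```

Only the handful of parameter values per key frame needs to be stored, rather than every vertex of the face, which is the storage advantage the parametric approach has over vertex interpolation.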


In the context of this project, the parameterization technique is utilised by MPEG-4, which provides both the conformation and expression parameters of the parametric model. The MPEG-4 Facial Animation Parameters (FAPs) correspond to the expression parameters, whilst the Facial Definition Parameters (FDPs) correspond to the conformation parameters. MPEG-4, FAPs and FDPs are discussed in section 3.5.


The main concerns of the parameterization technique are to define the physical properties of an element and to determine the appropriate parameters of those properties. Since only the parameters of the face are required, this approach has the advantage of being simple and efficient: it requires little data storage and provides precise control of parameters to reproduce exact lip shapes during speech. MPEG-4 utilises a parameterized method of facial animation for its efficiency, simplicity and low bandwidth requirements (Ambrosini et al., 1998) (Lavagetto and Pockaj, 1999).


However, one major difficulty with parametric models, as Parke (1991) illustrated, is developing a complete set of parameters that can describe any facial expression and any facial conformation. Furthermore, parametric models neither model movement propagation nor simulate muscle movement, since this would require modelling the underlying facial anatomy.


The next evolution in face modelling was the development of physically based, muscle-controlled face models that relate the movement of the face to the underlying muscles.


3.2.3 Physical



Physically based models attempt to model the shape and dynamic changes of the face by modelling the underlying properties of facial tissue and muscle action (Parke and Waters, 1996) (Terzopoulos and Waters, 1990, 1993) (Pelachaud and Prevost, 1995).


Platt and Badler (1981) created the first model to simulate muscle actions. Waters (1987) was the first to include forces, with direction and magnitude, in his model. Later, Terzopoulos and Waters (1990, 1993) integrated various layers of skin. Using this technique, greater realism and subtle facial movements were created. These models provide the ability to manipulate facial expression based on the underlying muscles and facial tissue. Waters showed that a deformation that simulates the actions of the muscles underlying the face looks more natural, as muscle movement propagation is intrinsic to the model (Waters, 1987).


Structural models


Platt's model (Platt and Badler, 1981) consisted of an object decomposed into hierarchically structured regions. The face is decomposed further into subregions, where each subregion corresponds to one muscle or group of muscles in the face. Each muscle can be simulated by specifying the precise locations of its attachment to the surface structure. Under the action of the muscle, these regions can show the propagation of movement along the surface of the subregions.


Muscle-Based models


Muscle-based models, or abstract-muscle models, mimic at a simple level the actions of primary muscle groups in the face. There are two distinct advantages for these models: (1) they are independent of particular facial geometry and (2) they map directly into muscle-based coding systems.


Ekman and Friesen (1978) developed the Facial Action Coding System (FACS) to describe facial expressions. FACS is derived from an analysis of the anatomical basis of facial movement. Each facial movement is the result of muscle action. An action unit (AU) is the basic element of FACS. Each AU defines the direct effect of a muscle as well as the eventual secondary propagation of movement in relation to the surface of the face.


Procedural model


This method is based on empirical data and not on biomechanical studies. Unlike muscle-based models, there is no propagation of movement. It allows hierarchical definitions of movement in the face, defining low level actions that can be combined to form facial expressions and/or lip shapes for speech.


The face model and how it is animated relate directly to the constraints under which it can be manipulated, and how they are managed. An understanding of the techniques used for face modelling and animation provides insight into the evolution of facial animation and how it relates to our project.



3.3 Nonverbal Behaviour and Communication



Communication is a dynamic process with many interacting components. Nonverbal cues may provide clarity, meaning or contradiction for a spoken utterance. Nonverbal cues can also influence how we perceive others and how we ourselves are perceived. Familiar faces may make us more likely to start a relationship and continue it (Chovil, 1991). A large number of studies have been conducted to aid the understanding of nonverbal communication and its role in human interaction (Ekman, 1992) (Chovil, 1991) (Harper et al., 1978). Nonverbal communication is an important means of conveying meaning and information at the verbal, semantic and emotional level.


Ellyson and Dovidio (1985) define the term nonverbal behaviour as that not part of formal, verbal language, referring to facial expressions, body, gaze and hand movements significant through the discourse of social interaction. Malandro (1989) elaborated on the work of Ellyson and Dovidio (1985) and defined nonverbal communication as the process by which nonverbal behaviours are used, either independently or in combination with verbal behaviours.


Miller (1981) has identified the primary uses of nonverbal behaviour in human communication as:


  1. Expressing emotion: Non-verbal signals are powerful. They primarily express inner feelings and evoke immediate action or response.

  2. Conveying interpersonal attitudes: Non-verbal messages are likely to be more genuine. Non-verbal behaviours are not as easily controlled as spoken words, with the exception of some facial expressions and tone of voice.

  3. Non-verbal signals can express feelings too disturbing to state. These are feelings of superiority or dislike, or feelings that etiquette or rules may prevent from being stated verbally. There is also the advantage of being able to change one's mind since a commitment has not been made out loud.

  4. Words have limitations. It is easier to explain the shape of something or give directions using hand gestures or head nods.

  5. Accompanying speech for the purpose of managing turn-taking, feedback and attention.



Miller (1981) suggests that only 7% of a message is sent through words with the remaining 93% sent through facial expressions (55%) and vocal intonation (38%).


He further explains why humans use non-verbal communication to such a degree:



  1. Non-verbal signals are powerful. They primarily express inner feelings and evoke immediate action or response.

  2. Non-verbal messages are likely to be more genuine. Non-verbal behaviours are not as easily controlled as spoken words, with the exception of some facial expressions and tone of voice.


Nonverbal cues are also symbols whose meaning is open to interpretation. In general, nonverbal symbols perform five activities of nonverbal behaviour, as suggested by Ellyson and Dovidio (1985).



The non-verbal signals and expressions arising from the non-verbal behaviours exhibited during communication form the basis of the subset of FAML tags that are to be implemented. Facial expressions, head kinesics and eye behaviour all contribute to the realism of human behaviour. Miller (1981) further highlights the importance of non-verbal communication and alludes to its uses during communication.

An important component of nonverbal communication is facial expression, movement and action. These facial components of nonverbal communication are described in the following section.

3.3.1 Facial Displays


There are three main views on facial expression and facial displays and how they relate to communication. The "emotional view" correlates the movement of the face with the emotional state of the person. In essence, emotions are central to the display of facial movements and expressions (Ekman and Rosenberg, 1997). Contrary to this, the "behavioural ecology view" does not treat facial displays as expressions of emotion, but rather as social signals of intent, which have meaning only in the social context (Chovil, 1991) (Fridlund, 1994). Recently, facial expression has also been considered as an emotional activator in the "brain plasticity view" (Zajonc, 1994) (Ekman and Davidson, 1994) (Camras, 1992) (Lisetti and Schiano, 2000).


Emotional View: Expressions of Emotion


The emotional view suggests that there are essentially only two types of facial actions. The first are the reflex actions that indicate ongoing emotion and display them with facial expressions of emotions. The second are instrumental facial actions that show emotion that is not occurring, and reflect everyday social interactivity, such as a smile of politeness.


The emotional view has proposed a subset of universal emotions that are accompanied by facial displays. Ekman and Friesen (1975) identified six basic universal emotions: surprise, fear, anger, disgust, sadness, and happiness. These basic emotions are discussed in further detail in section 3.3.3.



Behavioural Ecology View: Signals of Intent


Facial expression can also be considered as part of a multi-modal form of communication, with the face being only one independent element conveying conversational signals. Birdwhistell (1970) noted that although the human face is capable of as many as 250,000 expressions, fewer than 100 sets of expressions constitute distinct and meaningful symbols. The table below lists communicative displays whose categorization is based mostly on Chovil (1991):


Syntactic Display

  1. Exclamation mark : Eyebrow raising
  2. Question mark : Eyebrow raising or lowering
  3. Emphasiser : Eyebrow raising or lowering
  4. Underliner : Longer eyebrow raising
  5. Punctuation : Eyebrow movement
  6. End of an utterance : Eyebrow raising
  7. Beginning of a story : Eyebrow raising
  8. Story continuation : Avoid eye contact
  9. End of a story : Eye contact

Speaker Display

  10. Thinking / Remembering : Eyebrow raising and lowering, closing the eyes, pulling back one mouth side
  11. Facial Shrug: "I don't know" : Eyebrow flashes, mouth corners pulled down, mouth corners pulled back
  12. Interactive: "You know?" : Eyebrow raising
  13. Metacommunicative: indication of sarcasm or joke : Eyebrow raising and looking up and off
  14. "Yes" : Eyebrow actions
  15. "No" : Eyebrow actions
  16. "Not" : Eyebrow actions
  17. "But" : Eyebrow actions

Listener Comment Display

  18. Backchannel:
  19. Indication of attendance : Eyebrow raising, mouth corners pulled down
  20. Indication of loudness : Eyebrows drawn together
  21. Understanding levels:
  22. Confident : Eyebrow raising, head nod
  23. Moderately confident : Eyebrow raising
  24. Not confident : Eyebrow lowering
  25. "Yes" : Eyebrow raising
  26. Evaluation of utterances:
  27. Agreement : Eyebrow raising
  28. Request for more information : Eyebrow raising
  29. Incredulity : Longer eyebrow raise

Table 1 Some communicative facial displays categorized by Chovil (1991)


These communicative signals were implemented in a human-computer interface system by Nagao and Takeuchi (1994) with successful results indicating that facial displays help conversation in the case of initial contact.


In the behavioural view there are no fundamental emotions or fundamental expressions. This view does not treat facial displays as "expressions" of discrete or internal emotional states. Facial displays are considered as a "signification of intent", evolving in response to stimulus. Facial displays have meanings specific only to their context of occurrence, and are only used to serve the user's social motives in that context. These motives do not necessarily have any relation to emotion, and a range of emotions can occur in one social motive. Facial displays therefore depend upon the intent of the user, the behaviour of the listener, and the context of the interaction, and not on inner feelings as the emotional view suggests (Lisetti and Schiano, 2000).


Brain Plasticity: Emotional Activators and Regulators


Based on breakthroughs in the neuroscience of the human brain, facial actions have recently been considered as emotional activators and regulators. Research suggests that facial actions such as muscle movements can in fact generate emotion, as opposed to merely expressing it (Ekman, 1993). Research conducted by Ekman and Davidson (1994) suggests that by voluntarily smiling, it is possible to generate a happy emotion within an individual. In this sense facial movements and expressions are used to activate and regulate emotion. They suggest that facial movements could help change the emotional state of a person.


The question of whether facial activity is a necessary part of emotion is of particular concern to the project. Understanding the link between facial expression and emotion helps to identify the subset of non-verbal facial behaviours that are used during communication. It also supports the implementation of a better model of expression to produce emotion, and gives insight into the types of expression required for an emotional display. This in turn improves the ability to create more realistic and believable characters that exhibit the illusion of life.


3.3.2 Facial expression


The context of this project will be placed within the behavioural ecology view that facial expressions and displays are used as a form of multi-modal communication centering on the human face. This is the computationally simplest method of viewing non-verbal facial expressions and displays. As such, facial expressions do not necessarily correspond to any particular emotion. Some facial expressions are used to accentuate words in an utterance. The raising of the eyebrow can be used to punctuate a discourse and not be a signal of surprise. Ekman (1982) characterized facial expressions into the following groups.


  1. Emblems: correspond to the meanings of well known but culturally dependent movements. They can be used to replace verbal expressions such as a nod for "yes" and a shake for "no". Essentially emblems are a way to iconically accentuate what is being said.


  2. Emotional emblems: are made to convey signals about an emotion that is being referenced. A person uses emotional emblems to refer to an emotion. For example, when you talk about something disgusting you wrinkle your nose, however you do not actually feel disgust at the time.


  3. Conversational signals: are made to punctuate speech, or to emphasize it. Raising the eyebrows may be used to punctuate the end of an utterance.


  4. Punctuators: are movements over pauses. Certain head movements occur over pauses.


  5. Regulators: are movements that help the interaction between speaker and listener. They control turn taking between speakers in a conversation.


  6. Manipulators: correspond to the biological needs of the face, for instance blinking the eyes to keep them moist.


  7. Affect displays: are facial expressions of mood.



The following features were identified by Pelachaud et al. (1994) as relevant in modelling the human face. The relevance of these features comes from their role in facial conformation, movement, and communication.


Nose : Nose movement usually conveys an emotion of disgust. Furthermore, nostril movements are observed during deep respiration and inspiration. The size and shape of the nose varies among people with different origins. Nose shape contributes significantly to identification.


Eyebrows : Eyebrow movement is vital, both in verbal and non-verbal communication. They are predominantly visible in emotions such as "surprise", "fear", and "anger".


Eyes : Eyes are a crucial source of expressive information. When looking at a picture of a person, people tend to devote the greatest attention to the eyes. The eye movement may reveal the "interest" or "attention" of a person. The shape, size, and color of the eyes provide cues in recognizing individuals.


Ears : A face without ears looks like a mask. Ears have an intricate structure and shape. Modelling the detailed shape of ears may not be necessary, depending on the application. However, the simplification of ear shape changes the appearance of a complete face. Ear movement is extremely rare in humans.


Mouth : The mouth is a highly articulate facial zone. Lips articulate elaborately during speech. Modelling of lip motions should be able to open the mouth, stretch the lips, protrude the lips etc., to produce the phonemes and basic emotional expressions.


Cheeks : Cheek movement is visible in many emotional states. Generally, cheek movements supplement other movements that may include the mouth or lower part of the eyes. Actions such as the puffing and sucking of cheeks may provide emphasis for certain emotions. They reveal characteristic movements during sucking or whistling.


Chin : The movement of the chin is mainly associated with jaw motion. However, the chin is distinctively deformed to indicate "disgust" and "anger" with the lips tightened. The shape of the chin also plays an important role when conforming facial models to individuals.


Neck : The neck permits the movement of the entire head, such as nodding, turning, rolling etc. As the neck moves, it can change its width or it may elongate.


In the context of this project, the eyes, eyebrows, mouth, chin, nose, cheeks and ears form the basis of the facial features in the Talking Head, and as such should be covered by the subset of FAML tags that animate the Talking Head. The neck, however, as stated in the delimitations, is not independent of the head and is therefore unable to move independently. All other facial features have been modelled accurately.


As stated by Pelachaud et al. (1991), all categories of facial expressions as outlined by Ekman (1979) need to be included and integrated to obtain a more complete facial animation. In the context of our project and the FAML, we need to ensure that, for the effect of realism, we provide a set of FAML tags that covers a subset of the identified categories of facial expression, giving the author the tools to create believable characters. Facial expressions occur continuously during speech, both complementing and reinforcing the information delivered in speech.


Temporal characteristics of facial actions


Facial expression can be defined as time-dependent changes in facial movement and can be described by the following three temporal parameters:


  1. Onset duration : How long the facial display takes to appear.

  2. Apex duration : How long the expression remains in the apex position.

  3. Offset duration : How long the expression takes to disappear.


Facial displays of expression and emotion differ in the aforementioned parameters. For example, the expression of sadness has a slow offset, whilst the expression of happiness has a short onset. Although these parameters are vital for the believable animation of expression and emotion, the literature offers little data on definitive values for onset, apex and offset durations (Essa, 1994) (Yacoob and Davis, 1994) (Bartlett et al., 1999). Pelachaud et al. (1996) use three parameters to specify a facial expression. Kalra (1993) used four parameters: attack (onset), decay, sustain (apex), and release (offset).


In the context of this project, we utilise the three parameters of onset, apex and offset for the temporal characteristics of all facial expressions, gestures and emotions. The three parameters provide adequate realism in facial expressions, as indicated by the literature. The extra parameter of decay suggested by Kalra (1993) did not provide a sufficient increase in realism to warrant a fourth parameter for modelling temporal changes of expression.
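The three temporal parameters can be illustrated with a minimal sketch of an intensity envelope. The function name `envelope`, the `peak` argument and the use of linear ramps during onset and offset are assumptions made purely for illustration; the text above does not prescribe a particular interpolation.

```python
def envelope(t, onset, apex, offset, peak=1.0):
    """Return the intensity of an expression t seconds after it starts.

    onset, apex and offset are durations in seconds. Linear ramps are an
    illustrative assumption, not a value taken from the literature.
    """
    total = onset + apex + offset
    if t < 0 or t > total:
        return 0.0                      # expression not active
    if t < onset:
        return peak * t / onset         # rising towards the apex
    if t <= onset + apex:
        return peak                     # held at the apex
    return peak * (total - t) / offset  # falling back to neutral


# Sample an expression with a 0.5 s onset, 1.0 s apex and 0.5 s offset.
for t in (0.25, 1.0, 1.75, 3.0):
    print(t, envelope(t, 0.5, 1.0, 0.5))
```

A different ramp shape (for example an ease-in, ease-out curve) could be substituted without changing the three-parameter interface.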


Synchronism


A person conveys his thoughts with words and facial expressions. For example, actions such as smiling, raising of the eyebrow and wrinkling of the nose often occur with speech. Facial expressions accompany the flow of speech and are synchronised at the verbal level, punctuating accented segments and pauses.


An important aspect of communication is the link between gesture and speech and their tendency to occur in synchrony (Condon and Ogston, 1971). Synchrony implies that changes in speech and changes in body movement, such as those of the head and face, appear at the same time. For example, when a word begins to be articulated, eye blinks, head movement, head turning and brow movements can occur and finish at the end of the word.


Synchrony among body and facial motions occurs at all levels of speech, including the phoneme, the syllable, the word, the intonational phrase and the utterance (Cassell et al., 1994c). Speech has to be synchronised with lip movement, but this also includes facial expressions and gaze. A delay in the synchronisation process is easily perceived by the viewer and can appear unnatural and disturbing (Malandro, 1989).


Timely responses are crucial to successful conversation, since a delay in reactions can imply a specific meaning or make the utterance unnecessarily ambiguous (Nagao and Takeuchi, 1994). Systems that use an automated interaction of both audio and visual channels (Pelachaud et al., 1996) (Nagao and Takeuchi, 1994) (Ostermann et al., 1998) (Cassell et al., 1994b) use the audio channel as the synchronous clock. The audio module sends a signal to the visual module, ensuring that the audio and visual representations are synchronised to support the communicative process.


Synchronism for this project is implemented at the word level. All gestures, facial expressions and movements are linked to the start time of words in the utterance. The audio channel is used as the clock that denotes the start times and durations of words in the utterance. The literature supports the use of the audio channel as the synchronising clock between gesture, expression and speech. In the context of this project, the Text-To-Speech module signals the FAML module, ensuring that the audio and visual representations of speech and of facial gestures, expressions and emotions are synchronised (Ostermann et al., 1999).
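Word-level synchronisation of this kind can be sketched as follows. The `WordTiming` record and the `schedule_gestures` function are hypothetical names introduced for illustration, and the timing values are invented; in practice the word start times and durations would be supplied by the Text-To-Speech module.

```python
from dataclasses import dataclass


@dataclass
class WordTiming:
    text: str         # the spoken word
    start_ms: int     # start time on the audio clock (assumed unit: ms)
    duration_ms: int  # duration of the word on the audio clock


def schedule_gestures(timings, gestures):
    """Attach each gesture to the start time of the word it is linked to.

    gestures is a list of (word_index, gesture_name) pairs. The result is
    a list of (start_ms, gesture_name) events ordered by the audio clock.
    """
    events = [(timings[index].start_ms, name) for index, name in gestures]
    return sorted(events)


timings = [WordTiming("Hello", 0, 400), WordTiming("there", 400, 350)]
print(schedule_gestures(timings, [(1, "nod"), (0, "eyebrow_raise")]))
```

Because every event carries a time taken from the audio clock, the visual module needs no clock of its own to stay aligned with the speech.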


Gestures occur in parallel with speech, although in the case of hesitations, pauses or syntactically complex speech, it is the gestures that appear first (McNeill, 1992). At the most local level, individual gestures and words are synchronised in time so that the "stroke", the most energetic part of the gesture, occurs either with or just before the phonologically most prominent syllable of the accompanying speech segment (McNeill, 1992).


Multi-modal communication: The link between gesture and speech


Evidence presented by Kendon (1994) suggests that there is a close relationship between speech and spontaneous gestures during conversation. McNeill (1992) suggests that 75 percent of speech is accompanied by gestures, although the proportion varies. In general, gesture types occur in all languages. For instance, many hesitation gestures occur at the beginning of speech and correlate with the avoidance of gaze (the head turns away from the viewer), as if to help the speaker to concentrate on what is going to be said.


Communication is still possible without gesture. Information appears to be communicated about as effectively in the absence of gestures (Williams, 1977), for example on the telephone. However, it has been shown that when speech is ambiguous or obscure, listeners tend to rely on gestures to fill the gaps in their comprehension.


It is noted that gesture and speech do not always manifest the same information. What they convey is nevertheless consistent in two respects: semantically, in that speech and gesture give a consistent view of an overall meaning to be conveyed, and pragmatically, in that speech and gesture mark information about this meaning as advancing the purpose of conversation in a consistent way. For example, gestures may depict the way in which an action was carried out when this aspect of meaning is not depicted in speech (Cassell and Stone, 1999).


McNeill (1992) stated that in terms of a computational implementation model, gesture and speech must arise from a common conceptual source, and that gesture plays an intrinsic role in communicative intent. In the implementation of the model two aspects must be clearly defined. Firstly, one single underlying conceptual source must serve as the representation that gives rise to the form of both speech and gesture. Secondly, communicative intent must be specified.


According to McNeill (1992), gesture and speech arise together from an underlying representation that has both visual and linguistic aspects, and so the relationship between gesture and speech is essential to the production of meaning and comprehension.


We have sought to ensure that the chosen subset of FAML tags is sufficiently comprehensive to allow the animator to mimic the relationship between gesture and speech. As indicated by the literature, this is an important aspect of multi-modal communication, as gestures and facial expression supporting the speech aid in comprehension. So too, the FAML tags will enable the animation of the Talking Head to support the synthesised speech.




3.3.3 Emotion


When people speak, there is almost always emotional information communicated with speech. This emotional information is conveyed through multiple communication channels, including emotional qualities of the voice and visible facial expression.


Producing emotional responses requires both the ability to generate facial expressions and a model for synthesizing appropriate emotion in a dynamic environment. Three main areas of the face are involved in visible expression: the upper part of the face, with the brows and forehead; the eyes; and the lower part of the face, with the mouth (Parke and Waters, 1996). An emotion is defined as the evolution of the human face over time: it is a sequence of expressions with various durations and intensities (Ekman, 1978).


Events can often elicit multiple emotions whose effects blend together. For example a person can be both surprised and frightened. Such emotion can appear concurrently or in rapid succession.


Emotions can sometimes be confused with other aspects of expression, such as reflex and mood. A reflex, such as being startled, is a brief event that, unlike an emotional response, cannot be completely inhibited. Mood, by contrast, stretches over a longer period of time than an emotion, and refers more to a person's tendency towards a particular emotional display. An emotion has a limited duration, from half a second to four seconds as suggested by Ekman (1982), and the facial muscles cannot hold the expressions for minutes or hours (Ekman, 1982).


Each specific emotion has an average overall duration. However, the actual duration varies with context. For example, a smile of politeness may last a few seconds, but it may last longer with euphoria. Emotions adhere to the same temporal characteristics as described previously. When the overall duration of the emotion is lengthened, the temporal stages of onset, apex and offset expand proportionally.
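The proportional expansion of the temporal stages amounts to a simple rescaling, sketched below. The function name `scale_stages` and the example durations are illustrative assumptions rather than values from the literature.

```python
def scale_stages(onset, apex, offset, new_total):
    """Stretch or shrink the three stages so they sum to new_total.

    Each stage keeps its share of the overall duration, so the shape of
    the expression over time is preserved.
    """
    total = onset + apex + offset
    if total <= 0:
        raise ValueError("the stages must have a positive overall duration")
    factor = new_total / total
    return onset * factor, apex * factor, offset * factor


# A 2 second emotion (0.5 s onset, 1.0 s apex, 0.5 s offset) stretched
# to 4 seconds.
print(scale_stages(0.5, 1.0, 0.5, 4.0))
```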


Ekman and Friesen (1978) found six emotions to have universal facial expressions: sadness, anger, joy, fear, disgust and surprise. Most existing facial animation systems use this set of emotions (Pelachaud et al., 1996) (Nagao and Takeuchi, 1994) (Cassell et al., 1994b), as does the MPEG-4 specification to which this project is delimited.


Sadness


Sadness has many intensities and variations, including open-mouth crying, closed-mouth crying, suppressed sadness, nearly crying and miserable. In simple sadness the inner portions of the eyebrows are bent upwards and the corners of the mouth bend slightly downwards (Parke and Waters, 1996) (Fleming and Dobbs, 1999).


Anger


Anger can be aroused by frustration, physical threat, or psychological harm. In simple anger, the inner corners of the eyebrows are pulled downward and together. The lower edge of the eyebrow is at the same level as the upper eyelid. The mouth is closed with the upper lip slightly compressed or squared off. Variations of anger include shouting rage, rage, and sternness (Parke and Waters, 1996) (Fleming and Dobbs, 1999).


Joy


In simple joy, the eyebrows are relaxed. The upper eyelid is lowered slightly and the lower eyelid is straight, being pushed up by the upper cheek. The mouth is wide with the corners pulled back towards the ears. Variations of joy include uproarious laughter, laughter, sly smile, open smile, false smile and false laughter (Parke and Waters, 1996) (Fleming and Dobbs, 1999).


Fear


Fear arises from persons or situations that seem dangerous. Fear can range from worry to terror. In fear the eyebrows are raised and pulled together. The inner portions of the eyebrows are bent upwards. The eyes are alert. The mouth might be slightly dropped open and stretched horizontally (Parke and Waters, 1996) (Fleming and Dobbs, 1999).


Disgust


Disgust is a reaction to something that is unpleasant or distasteful. Disgust ranges from disdain to physical repulsion. In disgust the eyebrows are relaxed. The eyelids are relaxed or closed. The upper lip is raised in a sneer, often asymmetrical. For physical repulsion the eyebrows are lowered, especially at the inner corners. The eyes may be mostly shut in a squint (Parke and Waters, 1996) (Fleming and Dobbs, 1999).


Surprise


Surprise is a reaction to a sudden, unexpected event. In surprise the eyebrows are raised straight up as high as possible. The upper eyelids are opened as wide as possible with the lower eyelids relaxed. The mouth is dropped open without muscle tension to form an oval shape (Parke and Waters, 1996) (Fleming and Dobbs, 1999).


Emotions constitute the primary motivational system of humans. The description of emotion constitutes various components, such as physical responses, autonomic nervous system and brain responses, verbal responses (vocalisations), memories, feelings and facial expressions (Ekman, 1982). For a believable interactive application, there needs to be a connection of facial expression generation with a process that produces believable behaviours given the inputs to the system.


We have chosen to implement these six universal emotions stipulated by Ekman (1978) as they enable the connection of facial expression with the portrayal of believable behaviours, enabling the animator through the use of FAML tags to create realism and believability in the character animation.


3.3.4 Head Movement


Movements of the head and facial expressions can be characterized by their placement with respect to the linguistic utterance and their significance in transmitting information. Head movements can be categorised into three distinct sections: head turning, head nodding and head orientation (Shepherdson, 2000).


Head turns denote the direction of gaze and are accompanied by a change of head position. Head nodding is an example of an emblem, a form of non-verbal communication that can be directly related to a verbal phrase. A head nod could show agreement ("yes"), whilst a headshake would show disagreement ("no"). Head orientation can be used to impart personality traits, such as the lowering of the head to show submissive non-verbal behaviour (Shepherdson, 2000).


Head movement can also coincide with hesitation and pauses within speech. Hadar et al. (1983) examined the relationship between head movement and speech. They established a link between the temporal aspects of head movement and the prosodic nature of speech. Head movement was classified into three categories based on its temporal aspects: (1) slow movements, occurring at 0.2 to 1.8 Hz; (2) ordinary movements, at 1.8 to 3.7 Hz; and (3) rapid movements, at 3.7 to 7.0 Hz. Hadar et al. (1983) concluded that primary accents are marked by rapid movements, while ordinary movements followed by stillness denote terminal points or the end of conversation. Rapid movement may also occur during marked repetition of syllables or words and during short speech pauses.
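The three frequency bands reported by Hadar et al. (1983) can be expressed as a small classifier. The function name `classify_head_movement` is hypothetical, and the handling of the shared boundary values (1.8 Hz and 3.7 Hz are assigned to the faster band) is an assumption, since the bands as quoted share their end points.

```python
def classify_head_movement(freq_hz):
    """Map a head movement frequency in Hz to one of the three bands.

    Returns None for frequencies outside the 0.2 to 7.0 Hz range.
    Boundary values are assigned to the faster band (an assumption).
    """
    if 0.2 <= freq_hz < 1.8:
        return "slow"
    if 1.8 <= freq_hz < 3.7:
        return "ordinary"
    if 3.7 <= freq_hz <= 7.0:
        return "rapid"
    return None


for freq in (1.0, 2.5, 5.0, 10.0):
    print(freq, classify_head_movement(freq))
```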


3.3.5 Eye behaviour


Visual behaviour of the eyes is an important feature whose main functions are to help regulate the flow of conversation, to signal the search for feedback during interactions, to express emotion or to influence another person's behaviour (Walker and Trimboli, 1993) (Webbink, 1986). Eye contact is an important non-verbal method of establishing relationships and communicating with others. People are very sensitive to eye behaviour and are able to perceive the slightest change in eye direction.


As discussed by Argyle and Cook (1976), eye movement can be defined by the direction of gaze, the point or points of fixation, the percentage of eye contact over gaze avoidance, and the duration of eye contact. A common metric for eye behaviour is "interest". Eyes tend to fixate on objects of interest for longer periods of time. When a person is exasperated, trying to solve a problem, or trying to remember something, the eyes will look up (Parke and Waters, 1996).


Eye movement


The eyes are in a state of continual motion, usually with rapid changes in fixation. When looking at another person, research conducted by Argyle and Cook (1976) found that the viewer concentrates upon the eyes of the other person 58 percent of the time, and then their mouth 13 percent of the time. The remaining regions of the face are attributed only one percent of the time.


The actual change of focus of the eyes, close focus versus distance focus, constitutes about 6 millimeters in iris displacement (Benoit et al., 1999); however, this is easily perceived by humans. Pupils are closer to each other during close focus than during distant focus, giving rise to the term "cross-eyed". This highlights the importance of synchronising the movements of the left and right eyes.


Eye-head coordination


Argyle (1975) found through experimentation that when people break eye contact to avoid gazing at one another, they usually move their heads to look away. A change in the direction of gaze is frequently accompanied by head movement (Argyle and Cook, 1976) (Bizzi, 1974). For example, a sad person has a tendency to look down as well as to lower the head. In the case of a predictive event, an event that preludes another event (Bizzi, 1974), the head generally moves before the eyes, which eventually follow with rapid eye movements. If a person lowers the head first while maintaining gaze and then casts the eyes downward, this preludes the expression of sadness in the person. This is the only case of a predictive event discussed by Bizzi (1974). In general, however, the eyes lead head movement (Parke and Waters, 1996) (Argyle, 1975) (White, 1986) (Maestri, 1996).


Blinking of the eyes


Blinking forms an important aspect of synthetic facial animation. Blinking is the rapid closure and opening of the lower and upper eyelids, a process that occurs simultaneously in the left and right eyes. The eyes blink frequently, serving not only to accentuate speech, but also to satisfy the biological need to lubricate the eyes. In general, there is at least one blink per utterance (Parke and Waters, 1996) (Pelachaud and Prevost, 1995).


It is important to note that the structure of the eye blink is synchronised to the articulation of speech: the eye might close over one syllable and open on another, and blinks can also occur on stressed vowels (Condon and Ogston, 1971).


Blinking can be categorized by the following parameters, as outlined by Pelachaud and Prevost (1995):



As discussed by Parke and Waters (1996), observations of face-to-face communication show that there is synchrony between the speaker's voice and the speaker's eye blinks. The speaker's eye blinks tend to follow pauses in the speech, with experimental results showing that this occurs about 75 percent of the time.


Blink occurrence is also emotionally dependent. During fear, tension, anger, excitement and lying, the amount of blinking increases while it decreases during periods of concentration. Blinks also occur on any shift of eye direction as they call attention to change, as well as allowing the animator to make the expression stronger. The eyes are the most important part of an expression and must be animated with care. Any jitter or false movement on an in between destroys both communication and believability (Thomas and Johnston, 1981).


The discussion of eye and head movement highlights the importance of the eyes and head as forms of non-verbal communication and behaviour. It is clear that, with regard to this project, the ability to control the movement of the eyes is essential for added realism. The FAML tags provide a subset of both head and eye movements to allow further realism in the scripting of the Talking Head animation.


3.4 Implemented Visual Text-to-Speech (VTTS) systems


From observations of the literature there are a number of systems that integrate a Talking Head with speech, facial expression and gestures, each with varying degrees of realism and effectiveness.


Morphing Systems


Ezzat and Poggio (1997) implemented a VTTS system that pre-stored images of all the visemes, the visual representations of phonemes, to allow the animation of lip movement during speech. The intermediate frames between visemes were animated using a morphing technique. The system used optical flow methods borrowed from the computer vision literature to compute realistic transitions from each viseme to every other viseme. A text-to-speech (TTS) synthesiser was exploited to generate phonemes/visemes and timing information to determine which visemes to use and the rate of morphing. Using this technique Ezzat and Poggio (1997) were able to synchronise the visual speech stream with the audio speech stream, and hence give the impression of a video-realistic talking face. It can however be noted that Ezzat and Poggio (1997) only morphed viseme transitions and not any other facial gesture or feature. Eyebrow movement, blinking and nodding of the head were omitted.


Cosatto and Graf (1998) also used the method adopted by Ezzat and Poggio (1997) for facial animation, but implemented a new technique that was capable of extracting facial parts such as the mouth and eyebrows into a compact library independently of each other. Then, using these face models and a TTS, new video sequences are "warped" or "morphed" between different views. Because the facial features are controlled independently, each facial feature can be warped independently of the others. This technique can provide photo-realistic animation of a Talking Head; however, the difficulty is in finding precise specifications of the displacements of many points in order to guarantee results that mimic real faces. Moreover, the computation of such displacements is quite expensive and could not be used in real-time animation with the current level of technology. Both Cosatto and Graf (1998) and Ezzat and Poggio (1997) implemented systems animating a Talking Head, but the computational expense was far too great for a real-time application. Both techniques suffered from any type of misalignment between visemes, which greatly degraded the performance of the facial animation.


Although the morphing approach does produce sufficient results with regard to realism, as indicated by the literature, the morphing technique cannot be applied to this project, as the animation system is based on a parametric head model and utilises the MPEG-4 facial animation coding system. MPEG-4 will be discussed in further detail in section 3.5.


Parametric systems


Cassell et al. (1994a) developed a rules-based model for the interaction between intonations and gesture, and implemented these rules in a conversation simulation system with two Talking Heads. Although the modelling of the interrelationships between speech and gesture is beyond the scope of this project, the implementation and synchronisation of the gestures is relevant. The implementation of the gestures was carried out by a group of Parallel Transition Networks (PaT-Nets), finite state machines, several of which ran in tandem. The PaT-Nets govern the production of the gesture and integration of the gesture into the facial animation. An AT&T Bell laboratories TTS synthesiser was used to produce the actual speech wave and phoneme timings. The phoneme timings, duration outputs and speech waves from the synthesis were merged together by rule with the abstract intonational and gestural notations. The detailed timing information allowed the synchronisation of the gestural animations with the speech.


The approach taken by Pelachaud et al. (1994), although similar to that of Cassell et al. (1994a), used the Facial Action Coding System (FACS) notation created by P. Ekman and W. Friesen (1978) to describe visible facial expressions. FACS describes temporary changes in facial appearance, how a feature is affected by its location, and the intensity of changes. An Action Unit (AU) corresponds to an action produced by one muscle or a group of muscles. The facial model presented by Pelachaud et al. (1994) integrated both the FACS and the AUs to realistically animate the Talking Head. Expressions and facial gestures were broken into the corresponding AU or groups of AUs and these were in turn animated using the FACS. Synchronisation was implemented in the same manner as in Cassell et al. (1994b) and used timing information from the phoneme and TTS synthesiser.


The model implemented by Pelachaud et al. (1994) and Cassell et al. (1994a) provides an animation control system based on rules rather than tags. The system utilises a semantic model of the input text and, based on behavioural rules of non-verbal communication, links the facial gestures, expressions and emotions to the Talking Head animation. This project pursues the same linkage, but uses tags to link the facial gestures and expressions of non-verbal communication to the speech and the Talking Head animation. The rules used in both the Pelachaud et al. (1994) and Cassell et al. (1994a) systems are based on FACS, which is also very similar to the MPEG-4 coding system discussed in section 3.5. The knowledge gained from the rules of non-verbal communication provides a good indication of the subset of FAML tags that are required to truly mimic non-verbal behaviour in humans.


The FAQBot (Beard et al. 1999) was implemented in conjunction with Curtin University and the University of Genoa. The FAQBot was to provide a humane interface to a frequently asked question (FAQ) database. The FAQBot was implemented using the MPEG-4 coding standard. The MPEG-4 specification has already defined and standardized the animation of a synthetic face. The facial animation engine (FAE) implemented for the FAQBot application is similar to FACS as implemented by Pelachaud et al. (1994), and the higher order expressions, such as smile, can similarly be thought of as a collection of AUs. In its initial implementation there were no facial gestures, simply a Talking Head with lip-synchronised speech. Further work has integrated a personality module to give the Talking Head behavioural parameters and facial gestures. However, there is still no mechanism to synchronise specific gestures to the words or phonemes in the spoken speech. The FAQBot application is the foundational work for this project.


Ostermann et al. (1998) investigated the integration of Talking Heads and text-to-speech synthesisers for a visual TTS. The VTTS synthesiser allows defining facial expression as bookmarks in the text and is used to animate the Talking Head when it is talking. The bookmark itself names the expression, its amplitude and the duration during which the amplitude has to be reached by the face. Ostermann et al. (1998) in their research used MPEG-4 as the animation system. Their research has outlined a method of animating an MPEG-4 facial animation driven by the input text. The bookmarks provide a mechanism to link the gesture or facial expression to their position in the synthesised speech as well as providing syntax for the bookmarks.


Ostermann's implementation of the bookmark mechanism for the MPEG-4 animation of the Talking Head forms a primary knowledge base for the implementation of the FAML tag system. The tag specification, as delimited by the scope of this project, is in fact derived from the work of Ostermann. Ostermann et al. (1998) describe the process by which the bookmarks alter the flow of the animation and how they are co-articulated to ensure continuous and flowing animation.


3.5 MPEG-4



As indicated in the delimitations section of this project, MPEG-4 forms the definition and animation control system. MPEG-4 was developed by the Moving Picture Experts Group (MPEG) and has been standardized by the International Organization for Standardization (ISO) (MPEG 1999). MPEG-4 enables the integration of face animation with multimedia communications and presentation. With regard to this project, MPEG-4 forms a crucial part of the architecture that drives the face model, including the facial expression from the text input.


MPEG-4 separates the animation into two bit-streams, the face animation bit-stream and the audio bit-stream (Ostermann, 1998). With regard to this project we will only be concerned with the facial animation bit-stream, and how it can be manipulated for the FAML tags.





3.5.1 Facial Animation in MPEG-4


The MPEG-4 standard allows sending parameters that calibrate and animate synthetic faces. These models themselves are not standardized by MPEG-4. Standardization only occurs for these parameters (MPEG 1999):





These parameters define what a synthetic face is in terms of MPEG-4, what its components are, and how it is represented and manipulated. These parameters form the core of the foundation that makes up the Talking Head and its animation in the project and as such will be discussed in greater detail.


MPEG-4 defines three sets of parameters for both the animation and the calibration of a synthetic face (MPEG 1999) (Ostermann, 1998) (Ambrosini et al., 1998) (Lavagetto and Pockaj, 1999):


Facial Animation Parameters (FAPs): The FAPs represent the complete set of basic facial actions, and therefore allow the representation of most natural facial expressions. The parameter set contains two high level parameters, the viseme and the expression. The viseme parameter allows the rendering of visemes, the visual representation of the lips for specific phonemes, without the need to express them in terms of other parameters. Similarly, the expression parameter allows the definition of movements of the face. The FAPs are responsible for representing the animation of the face.


Facial Definition Parameters (FDPs): The facial animation must have a generic face model capable of interpreting FAPs. FDPs are responsible for defining the appearance of the face. These parameters can modify the shape and topology of the face model. FDPs can be used to personalize the general face model to a particular face.


FAP Interpolation Table (FIT) : The FIT provides the rules of interpolation for the FAPs. The FAPs are used to specify the expressions and the interpolation is done using the FIT.


In the context of this project, the FAML tags relate directly to the FAPs provided by MPEG-4. In essence, a FAML tag that involves the animation of the eyebrow incorporates the eyebrow MPEG-4 FAPs. FAPs form the basis of the FAML tags, and allow the animation of faces, reproducing movements, facial expressions, emotions and visual speech. FAPs are based on the minimal actions in the face and relate closely to the underlying facial muscles (Ostermann 1998).


MPEG-4 contains 68 FAPs, categorized into ten groups related to different parts of the face. The two FAPs in group 1, FAP 1 and FAP 2, are high-level parameters associated with visemes and expressions respectively (see table 2). Expression and viseme FAPs relate to complex actions and are typically associated with a set of lower level FAPs. Low-level FAPs are associated with movements of key feature points on the face, as well as rotational movements of the head and eyes.





Group                                             Number of FAPs
1: visemes and expressions                        2
2: jaw, chin, inner lowerlip, cornerlip, midlip   16
3: eyeballs, pupils, eyelids                      12
4: eyebrow                                        8
5: cheeks                                         4
6: tongue                                         5
7: head rotation                                  3
8: outer lip positions                            10
9: nose                                           4
10: ears                                          4

Table 2 Facial Animation Parameter (FAP) Groups
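The grouping in table 2 can be captured as a small lookup structure. The sketch below is illustrative only: it assumes that FAPs are numbered consecutively in group order (so group 1 holds FAPs 1 and 2, group 2 starts at FAP 3, and so on), and the names FAP_GROUPS and group_of are hypothetical rather than part of the FAQBot application.

```python
from bisect import bisect_left
from itertools import accumulate

# Group descriptions and sizes copied from table 2.
FAP_GROUPS = [
    ("visemes and expressions", 2),
    ("jaw, chin, inner lowerlip, cornerlip, midlip", 16),
    ("eyeballs, pupils, eyelids", 12),
    ("eyebrow", 8),
    ("cheeks", 4),
    ("tongue", 5),
    ("head rotation", 3),
    ("outer lip positions", 10),
    ("nose", 4),
    ("ears", 4),
]

# Highest FAP number in each group, ASSUMING consecutive numbering by group.
UPPER_BOUNDS = list(accumulate(size for _, size in FAP_GROUPS))


def group_of(fap_number: int) -> int:
    """Return the group number (1..10) that a FAP number (1..68) falls in."""
    if not 1 <= fap_number <= UPPER_BOUNDS[-1]:
        raise ValueError(f"FAP number out of range: {fap_number}")
    return bisect_left(UPPER_BOUNDS, fap_number) + 1


print(sum(size for _, size in FAP_GROUPS))  # total number of FAPs in table 2
print(group_of(2), FAP_GROUPS[group_of(2) - 1][0])
```

The group sizes sum to the 68 FAPs stated in the text, which is a useful consistency check on the table.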


High level expression FAPs (FAP 2) are defined in table 3 and their associated expressions can be seen in figure 1. These high level expressions correspond to the emotions defined by Ekman and Friesen (1975) discussed previously.















No.  Expression  Description
1    Joy         The eyebrows are relaxed. The mouth is open and the mouth corners pulled back toward the ears.
2    Sadness     In simple sadness the inner portions of the eyebrows are bent upwards and the corners of the mouth bend slightly downwards.
3    Anger       The lower edge of the eyebrow is at the same level as the upper eyelid. The mouth is closed with the upper lip slightly compressed or squared off.
4    Fear        The inner portions of the eyebrows are bent upwards. The eyes are alert. The mouth might be slightly dropped open and stretched horizontally.
5    Disgust     The eyebrows and eyelids are relaxed. The upper lip is raised and curled, often asymmetrically.
6    Surprise    The eyebrows are raised. The upper eyelids are wide open, the lower relaxed. The jaw is opened.


Table 3 Primary Facial expressions as defined for FAP 2













Figure 1 Facial expressions


MPEG-4 specifies a face model in its neutral state, and a number of feature points on the neutral face as reference points. A neutral face is defined as follows: gaze is in the direction of the Z axis, all facial muscles are relaxed, eyelids are tangent to the iris, lips are in contact, the line of the lips is horizontal and at the same height as the lip corners, the mouth is closed and the upper teeth touch the lower ones (Ostermann, 1998) (MPEG, 1999) (Lavagetto and Pockaj, 1999).


Animation is controlled by deforming the face model corresponding to a particular facial action, specified by some FAP values, at each frame, generating an animation sequence. The FAP value for a particular FAP indicates the magnitude of the corresponding motion, for example a large movement of the eyebrow, or a little movement of the corner lip (Ostermann, 1998).


The amount of the displacement described by a FAP is expressed in specific measurement units, called Facial Animation Parameter Units (FAPU), which represent fractions of key facial distances. Rotations are instead described as fractions of a radian (Lavagetto and Pockaj, 1999).
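As a rough illustration of how a FAP value expressed in FAPU could map to a displacement on a particular face, consider the sketch below. The 1/1024 fraction, the function names and the sample distance are assumptions made for the example; they are not taken from the FAQBot implementation.

```python
# Hypothetical sketch: converting a FAP value in FAPU to model units.
# ASSUMPTION: one FAPU is 1/1024 of a key facial distance. The sample
# distance below is invented for the example.

FAPU_FRACTION = 1.0 / 1024.0  # assumed fraction of the key distance


def fapu(key_distance: float) -> float:
    """Size of one FAPU for a face with the given key distance."""
    return key_distance * FAPU_FRACTION


def displacement(fap_value: int, key_distance: float) -> float:
    """Displacement from the neutral position, in model units."""
    return fap_value * fapu(key_distance)


# A face whose mouth-nose separation is 51.2 model units (invented value).
mouth_nose_separation = 51.2
print(displacement(100, mouth_nose_separation))
```

Because the unit is a fraction of a distance measured on the face itself, the same FAP value produces a proportionally larger displacement on a larger face, which is what lets one FAP stream drive differently sized models.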


FAP values of zero correspond to their neutral states. All FAP values are thus expressed in terms of the magnitude of displacement from their neutral states. In an attempt to lower the bit rate of the animation, not all FAP values are represented in the bit-stream (Ostermann, 1998). A masking scheme is utilised to indicate which FAP values are used. A FAP value is switched "on" with a "1" and switched "off" with a "0". The corresponding FAP values are then attached to the end of the FAP bit-stream. It is the FAPs and their corresponding values that drive the MPEG-4 synthetic face animation.
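The masking idea can be sketched as follows. This is a deliberately simplified model, not the actual MPEG-4 bit-stream syntax: the mask is held as a string of "1" and "0" characters, the values as a plain list, and the function names and the FAP numbers used in the example are hypothetical.

```python
# Simplified sketch of FAP masking; NOT the real MPEG-4 bit-stream syntax.
from typing import Dict, List, Tuple

NUM_FAPS = 68


def encode_frame(faps: Dict[int, int]) -> Tuple[str, List[int]]:
    """Return (mask, values) for one frame.

    `faps` maps FAP numbers (1..68) to values. The mask holds one
    character per FAP: "1" if the FAP is used, "0" otherwise. The values
    of the used FAPs follow in ascending FAP order.
    """
    for number in faps:
        if not 1 <= number <= NUM_FAPS:
            raise ValueError(f"FAP number out of range: {number}")
    mask = "".join("1" if n in faps else "0" for n in range(1, NUM_FAPS + 1))
    values = [faps[n] for n in sorted(faps)]
    return mask, values


def decode_frame(mask: str, values: List[int]) -> Dict[int, int]:
    """Rebuild the FAP dictionary; FAPs that are masked off are omitted."""
    used = [i + 1 for i, bit in enumerate(mask) if bit == "1"]
    if len(used) != len(values):
        raise ValueError("mask and value count disagree")
    return dict(zip(used, values))


# Two FAPs switched on (numbers and values chosen arbitrarily).
mask, values = encode_frame({49: -40, 48: 120})
print(mask.count("1"), values)
```

Only the values of the FAPs that are switched on are carried, so a frame that moves a few features costs far less than one that lists all 68 values.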


3.6 Markup Languages



As discussed in the delimitations of the project, the FAML tag structure and specification as outlined by the Fifth framework consortium (see appendix A) have been stipulated for interoperability with other consortium projects. As such the design and structure of FAML tags are to be compliant to the Fifth framework specifications.


The Fifth framework tag specifications are based on work conducted by Ostermann et al. (1998). The tags were used in a Visual Text to Speech (VTTS) system, that enabled the integration of a Talking Head with a Text to Speech (TTS) synthesiser. The tags or bookmarks as referred to by Ostermann et al. (1998) enable the animation of the Talking Head through the input text stream.


Research conducted by Binsted (1999) on a virtual commentator for soccer commentary (Rocco) implemented a VTTS system similar to that of Ostermann et al. (1998); however, the markup used for animation was based on the Standard Generalized Markup Language (SGML). SGML is a meta-language that can be used to develop markup languages. The SGML based markup tags used by Binsted (1999) facilitated the integration of other markup text systems in the virtual commentator (Binsted, 1999).


The system developed by Binsted (1999) used three differing forms of text markup. The SGML based Global Document Annotation (GDA) (Nagao and Hasida, 1998) is used for indicating part of speech (POS), syntactic, semantic and pragmatic structure in text. The SGML based SABLE text-to-speech markup system (SABLE, 1998), developed by an international consortium of speech researchers, was used for the markup of intonation in the text. The SGML based FACSML, developed by Binsted (1999) for the markup of FACS, is used to animate the face.


The SGML based markup language provided a simple means of parsing the input text document and integrating the different markup systems into one conforming text document for input into Rocco, the soccer commentary system. This ability to integrate different markup systems, as provided by an SGML based markup language, is lacking in the VTTS system proposed by Ostermann (Ostermann et al. 1998).


However, SGML was not widely adopted as a meta language for structuring documents due to its complexity. A subset of SGML known as XML, addresses this complexity by providing the functionality of SGML in a smaller and simpler form. XML is targeted as a simpler solution to document structuring, creation, management, exchange and display (XML FAQ, 2000).


XML, or eXtensible Markup Language, is like SGML a meta-language: a language used to describe other markup languages. The Web Developers Virtual Library (WDVL 2000) defines XML as "a human readable, machine understandable, general syntax applicable to a wide range of applications (data bases, e-commerce, Java, web development, searching)". It further explains that "custom tags enable the transmission, validation and interpretation of data between applications and organizations". In this context XML allows the development of domain specific markup languages. XML can be used to design a markup language specific to our needs, the markup of facial displays for the animation of a Talking Head.



XML offers its users many advantages discussed by St. Laurent (1998), including:


making it possible for rapid development, and ease of use.


"extensible" tag sets that can be used for multiple applications. XML itself is also extensible, being extended with several additional standards that add styles, linking, and referencing ability to the core XML set of capabilities.


wide variety of tools. Due to the tight structure of XML, parsers can be quickly developed, and XML supports a number of key standards for character encoding, allowing it to be used in any number of different applications.


web.



The highly structured nature of XML allows ease of parsing and document validation, with regard to the syntactic use of tags as well as their semantic use. XML documents can conform to a grammar expressed in EBNF, which eases validation.


Stallo (2000) successfully implemented a TTS markup system based on the XML specifications. In the context of this project, and with the collaboration of Stallo (2000), the input document used in the project is XML compliant (see section 5.1.1) for reasons of simplicity, extensibility, interoperability and openness. However, due to the delimitations described previously, the FAML tags are not XML compliant. The need for XML compliance of the FAML tags will be discussed in future work, section 6.4.


As indicated by the literature, the FAML tag system needs to be structured, simple, extensible and robust. XML and SGML are meta-languages that can be used for the implementation of markup languages; however, these meta-languages cannot be used here due to the delimitations of this project (see section 4.3).


3.7 Virtual Characters


Virtual characters relate directly to our hypotheses (see section 4.1), which state that FAML tags can be used to script a Talking Head to portray believable characters. A clear definition of believable in terms of character animation is required to ensure that: 1) the characters behave believably in terms of humans and 2) that the subset of FAML tags is extensive enough to provide this believability in the virtual characters. It is the intention of this project to portray three virtual characters, namely a Storyteller, News presenter and Sales assistant.


Believability


In this context believability does not mean an honest or reliable character, but one that provides the illusion of life, and thus permits the audience's suspension of disbelief. The idea of believability has been studied over multiple disciplines such as literature, theater, film, radio drama and other multimedia (Bates, 1994).


Disney animations and animators have been leaders in creating believable character animation, albeit in cel animation. However, recently, with the production of Toy Story, Toy Story 2 and A Bug's Life, Disney has moved into the digital domain and is creating movie-length animations of virtual characters. The techniques of cel animation can be utilised in the animation of digital characters in much the same way to create believable virtual characters.


Chuck Jones, describing animation at Warner Brothers (Jones, 1989), said:


"Believability. That is what we are striving for - belief in the life of the characters. That, after all, is the dictionary definition and meaning of the word 'animation': to invoke life."


According to Thomas and Johnston (1981), appropriately timed and clearly expressed emotion is a central requirement for believable characters. Properly portraying the emotional reactions of a character requires the animator to remember several key points (Bates, 1994) (Thomas and Johnston 1981):


  1. The emotional state of the character must be clearly defined. The animator needs to know the state of the character at each instant or frame, so that the viewer can attribute a definite emotional status to the character.

  2. Accentuate the emotion. Use time wisely to establish the emotion, to convey it to viewers, and let them savor the situation. Viewers often cannot grasp the emotion immediately.


Emotion is one of the primary means to achieve believability or the "illusion of life". Bates (1994) concluded that the "illusion of life" refers to a conveyance of a strong subjective sense of realism. For believable characters using the FAML tags, the animator needs to be aware of the key points made by Thomas and Johnston (1981) if the animation is to appear believable.


Virtual Actors


In today's society, people make presentations to inform, teach, motivate and persuade others. Therefore, if a virtual character has the same skills as its human counterpart, it can perform the same function.


Research conducted by Noma and Badler (1997) states that to make a virtual human presenter usable, it should satisfy the following requirements:


  1. Natural motion with presentation skills. To be credible with users the virtual presenter should be as natural as possible. In addition, presentation skills, such as non-verbal communication should be embedded into the presenter system.


  2. Real-time motion generation synchronised with speech. For real time systems the gestures and movement must be synchronised to the speech for believability and credibility.


  3. Proper inputs for representing presentation scenario. The forms of inputs to the virtual presenter should enable designers to structure it in such a way without detailed description.



In context of this project, to achieve realism and believability when portraying a virtual News presenter, the animator must be able to follow the requirements specified previously by Noma and Badler (1997).


Many researchers emphasize the impact of eye contact with the audience during presentations (Becker and Becker, 1994) (Bergin, 1995) (Kupsh and Graves, 1993) (Leech, 1993). In the case of public presentations by real humans, presenters vary their directed eye contact. However in the case of a virtual presenter, eye gaze and viewpoint should be focused on the TV camera. This enables the presenter to talk to every person in the audience directly (Noma and Badler, 1997).


Although the system proposed by Noma and Badler (1997) lacked facial expression and gesture commands it does outline the function and the actions of a virtual presenter and how this relates to the credibility and believability in the audience.


Consistency between expression and action, and also between modalities of expression, contributes to a character's believability. Believability, in turn, contributes to the expected predictive value of the character's perceived personality (Binsted, 1999).


Successful character animation implies much more than just animating facial displays. The primary goal of animation is to communicate. The fundamental test of a good believable character is how well it communicates. It must communicate the intended message clearly, creatively and logically. Effective animation elicits a response from the viewer. It informs, captivates, and entertains (Parke and Waters, 1996).


The overall goal of animating the face is to give the illusion that the character's poses and expressions are motivated by the character instead of topically manipulated by the animator. Making informed decisions on what the face needs to express and understanding natural facial composition are fundamental to the making of a believable facial performance. The techniques discussed by Disney animators can be used in the FAML tags to improve the believability of the facial gestures, expressions and emotions exhibited by the Talking Head. Techniques such as blinking when changing head orientation can either be built into the FAML tags themselves or applied by the animator to create a believable performance of the virtual characters.


3.8 Summary of literature review



In the literature review we discussed five research domains: Talking Heads, facial modelling and animation, non-verbal communication, markup languages and character animation. The project utilises facial modelling and animation techniques to produce smooth and realistic animation of the Talking Head. The literature on non-verbal communication enables us to narrow our focus to the gestures, expressions and emotions that can be mimicked by a subset of FAML tags to add further realism to the Talking Head. The facial gestures, expressions and emotions that we are concerned with involve the movement of the head, eyes and eyebrows. The literature on meta-languages enables us to recognize the need for simplicity and extensibility in markup languages. Furthermore, the MPEG-4 specification provides us with sufficient control and functionality for the animation of the Talking Head. The techniques indicated by the literature review of character animation allow insight into creating characters that are both realistic and believable.


















Chapter 4



Research Methodology






4.1 Hypothesis


Through the inclusion and implementation of the FAML in the FAQBot application, our research hypothesizes that:


  1. Realistic animation of a Talking Head can be simulated using the FAML tags to "direct" facial gestures and expression, mimicking true non-verbal human communication.


  2. FAML tags can be used to create believable virtual characters, where believable is used in the sense that the user suspends his/her disbelief and interacts with the Talking Head as a real person.


4.2 Delimitations and Assumptions



The main aim of the project is to design and implement a FAML to direct the facial animation for a specific persona or task and therefore does not include:


Modelling


The FAML is not intended to model the complex relationship between gesture and speech, and hence this relationship has not been re-created in the project. The FAML tags are only designed to provide a mechanism allowing non-verbal behaviours and gestures to be added to the facial animation, synchronized to synthesized speech.


Speech


This project does not manipulate or alter the audio stream in any way and hence is not involved with any production or control of the audio stream used for the facial animation. The visemes (visual representation of the phonemes) created by the process of synthesized speech are used as is from the Text-To-Speech module and are not altered in any way. The manipulation and control of lip movement based upon the audio stream is not within the scope of this project.


Automation


The project does not attempt to automate the process of inserting FAML tags into the input text stream. The FAML tags are scripted into the input text stream by the programmer manually. The process to automate this task would involve the analysis of an input stream sentence and understanding the linguistic semantics behind the words and structure. This is not within the scope of this project.


Wrinkle and skin modelling


As suggested by Parke and Waters (1996), the human face is composed of many features. As part of a physical modelling system, the movement of skin and the production of wrinkles is an important factor in terms of adding realism. A smile causes wrinkles at the outer corners of the eyes; a frown causes wrinkles on the forehead. A skin-wrinkling model is beyond the scope of this project and beyond the capabilities of the Talking Head animation for the FAQBot application.


Animation control


This project is to be designed and implemented using the MPEG-4 facial animation coding standard. MPEG-4 has standardized the synthesis and animation of a Talking Head. MPEG-4 is used in the FAQBot application and as such the project is delimited in using MPEG-4 as the facial animation system.


4.3 Limitations




At present there does not exist a mechanism within the FAQBot application to define an object model translation, allowing an object to translate forward or back dynamically, giving the impression of the object moving closer or further away within a scene. During normal conversation, gestures such as moving closer or further away provide a mechanism for conversational turn taking (Pelachaud et al., 1996). The current implementation of the FAQBot application and the FAE does not allow such object model translation, and hence it cannot be implemented within this project.



In its current implementation the head model defined by the FAQBot application and the FAE does not allow any independent movement of the head and neck. The neck is in fact part of the head model and therefore cannot move independently. This is a major limitation to the realism that can be portrayed through facial gestures, expressions and head movements. When the pitch of the head is altered to give the impression of the head looking up, the neck moves in sync and as such gives the undesired impression of the head leaning back. Any expression, gesture or head movement that would, in the real world, be independent of the neck cannot be accurately simulated in the FAE and FAQBot application.



The FAQBot application is a project in co-development between the School of Computing and the Fifth Framework organization based in Genoa, Italy. The tag specification for the facial markup is constrained by the specification outlined by the Fifth Framework. The constraints are placed on the facial tags so that there are as few incompatibility problems as possible with further work developed by the Fifth Framework. The facial markup specification can be seen in Appendix A.



4.4 Design and Demonstration



The first step in the implementation of the FAML was to allow the author access to some of the very low-level functionality that MPEG-4 already provides. Low-level access refers to the low-level FAPs (see section 3.5). These FAPs control simple aspects of the Talking Head such as pitch, yaw and roll. Once these FAPs were identified they were implemented in the FAML to allow the author to script the time, intensity and duration of low-level FAPs into the text stream. We then developed higher order expressions and gestures such as confused, dazed, emphasis and nod, with the intention of providing a richer and more realistic set of non-verbal gestures and expressions to animate the Talking Head.


4.5 Evaluation



The FAML was formally evaluated using an evaluation research methodology. The process incorporated a questionnaire based on a series of demonstrations (refer to Appendix F for a copy of the questionnaire). The questionnaire provided numerical data that could be further analyzed to determine the effectiveness of the FAML tags.





Chapter 5



Implementation







In previous sections, discussion has involved facial expression and emotion. However MPEG-4 does not provide any mechanism to specify persistent gestures and expressions. This section will detail the FAML and how it has been implemented within the FAQBot application.


As mentioned previously, the FAQBot application has been an ongoing development under the School of Computing, Curtin University of Technology. In its initial implementation the animation of the Talking Head for the FAQBot application was rudimentary at best: a static rendered head, with only the lips being animated. However, with continued improvements, the Talking Head and its animation have improved vastly to include random head movements based on personality traits. The next step in the evolution of the Talking Head animation is the ability to script actions, expressions and gestures into the dynamic animation, which is the basis of the FAML and this project.


5.1 Background

5.1.1 Input text stream


As described previously the FAQBot application and Talking Head animation is solely dependent upon the input text stream. The input text contains the words to be spoken as well as the markup tags for both speech and facial animation, used by the TTS and FAML modules respectively. An example of the input text stream is shown in figure 2.





<?xml version="1.0"?>

<!DOCTYPE sml SYSTEM "./sml-v01.dtd">


<sml>

<p>

<neutral>

<pitch range="+150%">

Future <smile 2 5 5000/> computers will invite us <l_roll 2 4 1200/><nod 2 3 1200/> to communicate through a mix <r_roll 2 6 1000/><nod 2 3 1000/><hl 2 4 1000/> of speech, gestures and gaze.

</pitch>

</neutral>

</p>


<p>

<neutral>

<speaker gender="male" name="us1">

Current <l_roll 2 5 1000/><nod 2 5 1000/> <emph affect="b" level="moderate"> technology </emph> has <smile 4 2 1000/> progressed to the stage.

</speaker>

</neutral>

</p>

</sml>



















Figure 2 Typical input text document for the FAQBot application


As seen in figure 2, the input text document is highly structured and is based upon the XML specification for the markup of documents, described previously. XML was adopted as the standard used for the markup of all input text to the FAQBot application. The FAML tags do not use the XML specification, as outlined in the limitations on the FAML tag structure (see section 4.3). XML compliant tags have their attribute names included in the tag, as well as their attribute values (e.g. <XML_TAG attribute_name = "attribute_value"/> ). This however is not the case with FAML tags: only attribute values are included in the FAML tag.
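The difference between the two tag styles can be made concrete with a small sketch that picks out the value-only tags from a line of input. The regular expression and the names FamlTag and find_faml_tags are hypothetical and written for this illustration; the sketch does not assign meanings to the individual values and is not the parser used in the FAML module.

```python
import re
from typing import List, NamedTuple


class FamlTag(NamedTuple):
    name: str
    values: List[int]


# Matches self-closing tags whose body is a name followed by bare integers,
# e.g. <smile 2 5 5000/>. XML-style tags with name="value" pairs do not match.
FAML_TAG = re.compile(r"<([A-Za-z_]\w*)((?:\s+-?\d+)+)\s*/>")


def find_faml_tags(text: str) -> List[FamlTag]:
    """Return the value-only tags found in `text`, in document order."""
    return [
        FamlTag(m.group(1), [int(v) for v in m.group(2).split()])
        for m in FAML_TAG.finditer(text)
    ]


# A line taken from the second utterance in figure 2.
line = ('Current <l_roll 2 5 1000/><nod 2 5 1000/> '
        '<emph affect="b" level="moderate"> technology </emph> '
        'has <smile 4 2 1000/> progressed to the stage.')
for tag in find_faml_tags(line):
    print(tag.name, tag.values)
```

The XML compliant emph tag is skipped because its body consists of name and value pairs, while the three value-only tags are returned with their integer values in order.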


In figure 2 there are two utterances that will be separately synthesised. The first utterance is "Future computers will invite us to communicate through a mix of speech, gestures and gaze." and the second utterance is "Current technology has progressed to the stage." Each utterance is marked up separately with its own TTS tags and FAML tags.
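A minimal sketch of recovering the spoken text of each utterance is given below. It assumes that each p element in the input document delimits one utterance, which is how figure 2 is laid out, and the function name is hypothetical. It illustrates the idea only and does not reproduce the processing performed by the TTS module.

```python
import re
from typing import List

TAG = re.compile(r"<[^>]*>")                      # any tag, XML or FAML style
PARAGRAPH = re.compile(r"<p>(.*?)</p>", re.DOTALL)


def utterances(document: str) -> List[str]:
    """Return the plain text of each <p> block with all tags removed."""
    result = []
    for block in PARAGRAPH.findall(document):
        text = TAG.sub(" ", block)
        result.append(" ".join(text.split()))  # collapse runs of whitespace
    return result


# Shortened version of the document in figure 2.
document = """<sml>
<p>
Future <smile 2 5 5000/> computers will invite us <l_roll 2 4 1200/> to communicate.
</p>
<p>
Current <nod 2 5 1000/> technology has <smile 4 2 1000/> progressed to the stage.
</p>
</sml>"""

for text in utterances(document):
    print(text)
```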


The bold tags denote the standardized XML tags that identify the document as an XML document. The underlined tags are TTS tags and denote the markup of the synthesised speech. Italicised tags are FAML tags and express the markup of the facial expression, gestures and emotion for the Talking Head animation. Standard tags represent the actual synthesised text.


The TTS tags and their function are beyond the scope of this project but further detail can be obtained from Stallo (2000). It is sufficient to say at this stage that TTS tags are XML compliant and are structured according to XML standardized specifications.


The FAML tags co-exist with the TTS tags, and both are incorporated into a single document used as the text input to the FAQBot application. The FAML module filters the input text document and produces a FAML structured input text format that is suitable for parsing, analysis and interpretation by the FAML module. This process, known as tag filtering, and its significance will be discussed in further detail in the next section.


5.1.2 Collaboration


The ability to synchronise facial expressions, gestures and emotions to the synthesised speech has required a large investment in the inter-process communication between the TTS module and the FAML module, with both being responsible for animating the Talking Head. Issues such as name space conflicts, API, tag filtering and information exchange were all addressed in the project to allow the two modules to work together.


The TTS module will be described in greater detail in the next section, however it is crucial to understand that the TTS drives the animation of the Talking Head. The text is responsible for animating the Talking Head and as such the PST module and FAML module, processes responsible for adding animation, are called from the TTS module itself.


FAML API


The TTS module is responsible for the lip synchronisation as well as the timing information used by the FAML module to synchronise gestures. For the successful implementation of the FAML module within the FAQBot application a clear and concise form of communication interface between the FAML module and the TTS module is essential and this is provided by the FAML API.


The FAML API is used to integrate the FAML module seamlessly with the FAQBot application, and provides the mechanism that allows the TTS module to execute the FAML functionality. Collaboration with the author of the TTS module, Stallo (2000), allowed the design of a FAML API that was both functional and powerful for the FAML module and easy to use for the TTS module.


Name Space Resolution


The TTS tags and the FAML tags coexist in the same text input. This gives rise to the problem of name space. Collaboration was used to ensure that tags for the TTS module did not clash with tags for the FAML module. An example would be the "emph" tag. The emph tag was used in the TTS module to define a word that was to be emphasized in speech. The "emph" tag was also used in the FAML module as an emphasis gesture, a nodding of the head and lowering of the eyebrows.


Collaboration with Stallo (2000) produced two different tag-naming conventions that the FAML and TTS modules were to adopt. This separated the name space for tags and ensured that there would be no ambiguity as to which tag belonged to which module.


The XML specification has recognized the problem of name space and provides a namespace specification (XML_namespace 2000), although this is still under development. The current draft addresses the problem by associating each namespace prefix with a URI (a URL or URN).


Tag Filtering


Both modules, the TTS and FAML, are only concerned with their own tags. As such the input text stream is filtered to remove unknown tags.


Due to the structure of the input text document and the implementation of synthesizing multiple utterances within the document, straight filtering of unknown tags removes the ability of the FAML module to recognise separate utterances. The straight filtering of unknown tags is adequate for single utterances, but with the implementation of multiple utterances as seen in figure 2, straight filtering fails. By removing all unknown tags, the TTS tag delimiters used to identify separate utterances are lost. This can be seen in figure 3.



Future <smile 2 5 5000/> computers will invite us <l_roll 2 4 1200/><nod 2 3 1200/> to communicate through a mix <r_roll 2 6 1000/><nod 2 3 1000/><hl 2 4 1000/> of speech, gestures and gaze.

Current <l_roll 2 5 1000/><nod 2 5 1000/> technology has <smile 4 2 1000/> progressed to the stage.









Figure 3 Straight filtering of unknown tags in the input text


With regard to the FAML module, the filtered input above would appear as one utterance and would be animated as such. Each utterance, however, represents a separate animation sequence and requires its own set of data to be passed from the TTS module to the FAML module.


The FAML module needs to be aware of the separate utterances to be synthesised and as such needs to be aware of the TTS tags that delimit these utterances. The TTS module delimits separate utterances through TTS emotive tags. These include tags such as <neutral>, <happy>, <sad> and <angry>. Each separate synthesised utterance was contained within these TTS emotive tags. This is highlighted in figure 2.


To preserve multiple utterances, the TTS emotive tags are not filtered out but are replaced with new FAML tags that denote the separation of utterances. This keeps the utterance structure of the document intact while still allowing unknown or unwanted tags to be filtered. Figure 4 below shows the input text document seen in figure 2 once it has been processed by the filtering procedure.








<s>

Future <smile 2 5 5000/> computers will invite us <roll_left 2 4 1200/><nod 2 3 1200/> to communicate through a mix <roll_right 2 6 1000/><nod 2 3 1000/><head_left 2 4 1000/> of speech, gestures and gaze.

</s>


<s>

Current <roll_left 2 5 1000/><nod 2 5 1000/> technology has <smile 4 2 1000/> progressed to the stage.

</s>













Figure 4 Filtered input text document preserving utterance structure


As seen from figure 4, the <s> and </s> tags are new FAML tags that are used to delimit the separate utterances within the document; these have replaced the TTS tag <neutral> used by the original input text document to separate utterances. Without access to the emotive TTS tags, the input text could not be filtered while maintaining the correct sequence and structure of utterances.
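The filtering step described above can be illustrated with a short sketch. It is not the FAML module's implementation: the function name, the regular expression and the set of FAML tag names (taken from figure 4) are assumptions made for illustration, and the emotive tag names are those listed in the text.

```python
import re

# Emotive TTS tags named in the text; they delimit utterances.
EMOTIVE_TTS_TAGS = {"neutral", "happy", "sad", "angry"}
# Assumed subset of FAML tag names, taken from figure 4 for illustration.
FAML_TAGS = {"smile", "nod", "roll_left", "roll_right", "head_left"}

# Matches <name ...>, </name> and <name .../>; declarations such as <?xml ?> are not handled.
TAG_RE = re.compile(r"<(/?)(\w+)([^<>]*)>")


def filter_tags(text):
    """Keep FAML tags, map emotive TTS tags to <s>/</s>, drop all other tags."""
    def replace(match):
        closing, name, _rest = match.groups()
        if name in EMOTIVE_TTS_TAGS:
            return "</s>" if closing else "<s>"
        if name in FAML_TAGS:
            return match.group(0)  # FAML tag: keep unchanged, attributes included
        return ""                  # unknown tag: remove the markup, keep the text
    return TAG_RE.sub(replace, text)
```

Because the emotive tags are rewritten rather than deleted, each utterance in the output remains individually addressable by the later timing and synchronisation steps.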


Embedded TTS tags


There is a special case in which the FAML module performs a different function based on a TTS tag. This tag is called the embed tag and is not a tag used for the actual markup of the text document or the animation of the Talking Head. Rather it is a tag associated with the document itself, and provides further functionality to the input text document that can be further developed in future.


The TTS embed tag allows another document to be embedded within the input text document for the FAQBot application. This embedded document is typically another XML compliant document, but is not restricted to this form. The functionality of the TTS embed tag is explored further by Stallo (2000) and is beyond the scope of the current project.


With regard to the FAML module, the TTS embed tag halts the synthesis of the original document and synthesises the new document defined by the attributes of the TTS embed tag. The synthesis and animation of the new document are no different, except that the state of the original document is stored before the embedded document is parsed.


The current timing information, text input, FAML tags, utterance and associated data of the current input text document are stored, and a new set of timing information is passed from the TTS module to the FAML module. New calculations, analysis and interpretation of the input text are performed and a new Talking Head animation sequence is produced. Once the synthesis of the embedded document is completed, the state of the original document is restored and processing of the original text document continues as usual.
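The store-and-restore behaviour described above resembles a stack of document states. The sketch below shows one way this could be arranged; the class names and fields are hypothetical and do not reflect the actual FAML data structures.

```python
from dataclasses import dataclass, field


@dataclass
class DocumentState:
    """Hypothetical bundle of the per-document data the text says is stored."""
    text: str
    utterance_index: int = 0
    timing: list = field(default_factory=list)


class FamlSession:
    """Keeps the active document state and a stack of suspended ones."""

    def __init__(self, text):
        self.current = DocumentState(text)
        self._saved = []

    def begin_embedded(self, embedded_text):
        # Store the state of the original document, then start afresh.
        self._saved.append(self.current)
        self.current = DocumentState(embedded_text)

    def end_embedded(self):
        # Restore the original document so that processing continues as usual.
        self.current = self._saved.pop()
```

Using a stack rather than a single saved slot is a design choice of this sketch; it would also cover an embedded document that itself contains an embed tag, a case the text does not discuss.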


5.1.3 The TTS Module



The TTS module forms an important part of the Talking Head animation. The TTS module is responsible for providing the visemes, the visual representations of the phonemes that will be animated for the lips in the Talking Head animation, as well as producing the audio waveform used for speech. As stated previously the animation of the lips is not within the scope of the FAML project and is controlled entirely by the TTS module.


Festival (Festival, 2000) is the TTS (Text-to-Speech) synthesiser implemented in the TTS module for the FAQBot application, and it performs the essential task of providing the FAML module with timing information. The TTS module generates a phoneme duration file for each utterance that is synthesized. Through an API call to the FAML module, it provides access to the phonemes and their timed durations. Synchronisation and timing will be discussed in greater detail in section 5.3. Together, the TTS module and Festival are a "black box" as viewed by the FAML module, and information is shared between the TTS module and the FAML module through the established FAML API.


It is important to note however that Festival does not actually produce the synthesised audio file. This is produced by MBROLA (MBROLA, 2000). MBROLA takes a list of phonemes as input from Festival TTS, together with prosodic information (duration of phonemes and a piecewise linear description of pitch), and produces a speech audio file. This is the voice for the Talking Head.


5.1.4 PST Personality Module


The PST module is responsible for the underlying head movements and emotions that appear based on the personality of the Talking Head (Shepherdson, 2000). The personality is defined by a PST file, which outlines the types of movements, the range of movements, the types of emotions and their respective intensities, as well as how often they occur during the animation. The PST module inserts movements and emotions into the Talking Head animation based on user pre-defined probabilities of particular emotions or movements. Due to the probabilistic behaviour of the PST module, no two animation sequences using the same input text document will be exactly the same. This further adds to the realism that the Talking Head is able to convey in the animation.


It is understood that the PST module provides the core animation of the Talking Head, which the FAML process sets out to improve upon. Both the PST module and the FAML module are used to animate the Talking Head. This can give rise to conflicts between the PST and FAML modules. The process of conflict resolution addresses this problem and is discussed in further detail in section 5.4.


It is important to note that the FAML module does not alter the personality animation of the Talking Head, rather it provides the ability to script actions to the Talking Head.


5.2 Overview



The FAML module involves many different processes that enable the smooth and realistic animation of the Talking Head. The FAML module needs to identify what tags are used in the text stream, when they are initiated and how they will integrate together with other FAML tags as well as blend with the underlying PST personality module.










Figure 5 FAQBot animation modules


As seen in figure 5, the PST module exists within the FAML module itself. The FAML module is responsible for coordinating the author's scripted facial expressions and gestures with the personality of the Talking Head, and as such encapsulates the PST module to resolve conflicts between gestures as well as coordinating the smooth integration of the author's tags.


A detailed breakdown of the processes involved in animating the Talking Head can be seen in figure 6. Notice from figure 6 that the TTS is responsible for the timing information of the audio stream. This timing information is essential as it provides the ability for the FAML module to synchronise tags to the spoken text.











[Figure 6 diagram. Recoverable labels: TTS and FAML Markup Text Stream; Audio timing information; Festival word expansion list; MPEG-4 Animation File; MPEG-4 Animation File including PST personality information; FAML MPEG-4 Animation]




Figure 6 FAML module overview



FAML tag identification and synchronisation


FAML tag identification and synchronisation involves parsing the input markup text file, stripping away unnecessary tags such as the TTS tags and any other XML tags. It also determines what tags are used and their timing information, based on their location in the text stream.




FAML and PST conflict resolution


Conflict resolution is the process by which the personality traits of the Talking Head are coordinated so as not to interfere with the FAML tags of gesture, expression and emotion. Conflict resolution ensures the Talking Head facial animation is blended appropriately together.


PST Personality module


The PST module written by Shepherdson (2000), provides the underlying personality of the Talking Head. The personality is defined by a PST file and outlines the various personality traits and their intensities expressed by the Talking Head.


FAML Tag animation compositor


The tag animator and compositor is responsible for including the author's scripted gestures into the animation. The process animates the Talking Head as defined by the constraints of the tag attributes and the programmed behaviour of the tag itself.


The FAML tag filter parses the input text stream, removing XML and TTS markup tags. The FAML and PST conflict resolution process is responsible for ensuring that the authored FAML tags do not interfere with the PST module gestures and emotions. The PST module is called from within the FAML module, and adds the predefined personality gestures and emotions to the Talking Head. Finally the FAML tag animation compositor process combines the authored gestures, expressions and emotions, with the underlying personality. This produces smooth facial gestures, expressions and animation, either preferentially superimposed upon the personality, or expressed as a combination of both modalities.


5.2.1 Festival Text-To-Speech Synthesiser Word Expansion


The Festival expansion list seen in figure 7 contains the expansions for abbreviations and cardinal numbers. Festival is capable of distinguishing differences in the pronunciation of numbers, dates, times and symbolic characters. Take for example this sentence:


On May 5 1985, 1985 people moved to Livingston.


Notice that the context of the number "1985" affects the way it is pronounced. The first occurrence is pronounced as a year, "nineteen eighty five", while the second is pronounced as a quantity, "one thousand nine hundred and eighty five". Numbers may also be pronounced as ordinals: the "5" in the sentence above should be "fifth" rather than "five". Festival is even capable of resolving abbreviations such as "Dr.", pronounced "Doctor" or "Drive" depending on the context in which it appears in the sentence.


This is known as homograph ambiguity, where the pronunciation of certain words cannot simply be found from their orthographic form alone.


Festival provides the functionality to pronounce the words of a sentence correctly. For instance, Festival would correctly pronounce the above sentence as "On May fifth nineteen eighty five one thousand nine hundred and eighty five people moved to Livingston". However, this differs from the input text stream, and the timing information provided by Festival no longer corresponds to the input text utterance but to the converted utterance. If the words in the text stream differ from the words that are spoken, there is no mechanism to allow the FAML tags to be synchronised to the synthetic speech.


Festival provides a Scheme-based interface to the object components that synthesise the text into speech. Scheme is a functional language and a dialect of LISP. To determine which words are expanded and the words they expand to, a new Scheme function was written to access Festival and build up a list of words and their equivalent expansions for each utterance. This list is parsed by the FAML module, which matches the words in the input text stream with their appropriate expansions. The expanded input stream is then used to synchronise the FAML tags to the synthesised speech. Figure 7 shows the Scheme expansion list that is used to update the text stream. Notice that the FAML tags are preserved during the replacement and maintain their location and attributes in the text stream. The Festival Scheme code can be seen in Appendix B.



On May 5 1985, <FAML tag/> 1985 people moved <FAML tag/> to Livingston.



Identified word expansion list using Scheme:


(("5" , ("fifth"))

(("1985" , (("nineteen") ("eighty") ("five")))

(("1985" , (("one") ("thousand") ("nine") ("hundred") ("and") ("eighty") ("five" )))



Replacement of words and their expansion:


On May fifth nineteen eighty five, <FAML tag/> one thousand nine hundred and eighty five people moved <FAML tag/> to Livingston.
















Figure 7 Festival expansion list
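The replacement step shown in figure 7 can be sketched as follows. This is an illustration under stated assumptions rather than the code used in the project: the expansion list is assumed to arrive as an ordered list of (word, replacement words) pairs, one per expanded occurrence, and the function and variable names are hypothetical.

```python
import re

# A token is either a whole tag (which may contain spaces) or a run of non-space text.
TOKEN_RE = re.compile(r"<[^<>]*>|\S+")


def apply_expansions(text, expansions):
    """Replace expanded words in order of occurrence, leaving tags untouched."""
    pending = list(expansions)
    out = []
    for token in TOKEN_RE.findall(text):
        if token.startswith("<"):
            out.append(token)              # FAML tag: keep position and attributes
            continue
        core = token.rstrip(".,;:!?")      # separate trailing punctuation
        trailing = token[len(core):]
        if pending and core == pending[0][0]:
            _, words = pending.pop(0)
            out.append(" ".join(words) + trailing)
        else:
            out.append(token)
    return " ".join(out)
```

Consuming the list strictly in order is what allows the two occurrences of "1985" to receive different expansions. Whitespace is normalised to single spaces in this sketch.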

5.3 Synchronisation


Synchronisation is the most crucial element of the FAML module. It forms the linchpin that allows the FAML module to manipulate and control the flow of the animation. Without the ability to link the synthesised speech to the facial expressions, gestures and emotions defined by the FAML tags, the FAML module fails to attain its objectives. This section details the implementation of synchronisation through a simple example, and then moves on to a more complex one.

5.3.1 Timing


Synchronisation is executed at the word level of an utterance. That is, for each word in the utterance a start time and an end time are calculated relative to the beginning of the Talking Head animation. The timing information allows the FAML module to know exactly at what time any particular word is spoken in the audio stream. The audio stream is the audio WAV file that is synthesised for the text.


As stated previously, the timing information is passed from the TTS module through the FAML API and is structured to provide the phoneme durations of each word that will be synthesised in the utterance. Figure 8 shows a typical timing file as provided by the TTS module. The timing file displayed reflects the timing data for the utterance "Here's".

_ 100

_ 420

>Here's

h 55

i@ 161

z 47

_ 100








Figure 8 Utterance timing file

Notice the underscores ("_") at the beginning and end of the timing file. These represent pauses and their duration in the synthesised speech and as such are also included in synchronisation calculations. In this example the first two pauses have a combined duration of 520 (ms). This indicates that there are 520 milliseconds of silence before the first word is uttered, measured from the start of the animation at time 0.


>Here's

h 55

i@ 161

z 47





Figure 9 Phoneme data for the word "Here's"

From figure 9, the word "Here's" is broken down into three phonemes "h", "i@" and "z". As seen each phoneme is followed by its duration in milliseconds (ms). Figure 10 further highlights the timing of the word "Here's".

>Here's


"h" duration = 55 (ms)

"i@" duration = 161 (ms)

"z" duration = 47 (ms)






Figure 10 Phoneme duration breakdown




Summation of the duration values of the phonemes determines the length of time taken for the pronunciation of the word "Here's". In this instance, the word "Here's" has a duration of 55 + 161 + 47 = 263 (ms). Therefore, for the complete utterance "Here's" the timing values can be represented as in figure 11.

"_" duration = 100 (ms)

"_" duration = 420 (ms)


>Here's duration = 263 (ms)


"_" duration = 100 (ms)






Figure 11 Complete duration information for the utterance "Here's"


We can determine from the timing information that the word "Here's" starts at 520 ms and ends at 783 ms, measured from the start of the audio stream.


The process described above is used to calculate timing of words based on the beginning of the audio stream. A more complex utterance "Here's the latest news" can be seen in figure 12 and illustrates how timing of the other words in the utterance is calculated.

_ 100

_ 420


>Here's

h 55

i@ 161

z 47


>the

dh 29

@ 40


>latest

l 69

ei 136

t 71

i 66

s 85

t 66


>news

n 72

y 45

uu 217

z 146


















Figure 12 Timing file for utterance "Here's the latest news"

From figure 12 we can see that each word has been broken down into its phonemes and that each phoneme has its duration associated with it. The next step is to sum the phoneme durations for each word, as shown in figure 13.








_ 100 (ms)

_ 420 (ms)

>Here's 263 (ms)

>the 69 (ms)

>latest 493 (ms)

>news 480 (ms)








Figure 13 Calculated word timing for the utterance "Here's the latest news"




Segment            _     _     Here's   the   latest   news
Duration (ms)      100   420   263      69    493      480
Start time (ms)    0     100   520      783   852      1345





Figure 14 Start time values of each word as offset from time 0 of the audio stream


Figure 14 is a representation of the start times for all the words in the utterance. The start time information is accurate to the nearest millisecond. The timing information calculated provides the exact time in the audio file that any word in the utterance is spoken.
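The calculation walked through in figures 12 to 14 can be sketched as a small parser. The file layout follows the figures (a "_" line for a pause, a ">" line introducing a word, then one phoneme and its duration per line); the function name and the returned tuple layout are assumptions made for illustration.

```python
FIGURE_12 = """
_ 100
_ 420
>Here's
h 55
i@ 161
z 47
>the
dh 29
@ 40
>latest
l 69
ei 136
t 71
i 66
s 85
t 66
>news
n 72
y 45
uu 217
z 146
"""


def word_timings(timing_text):
    """Return (label, start_ms, end_ms) for every pause and word, in order."""
    segments = []      # [label, total duration in ms]
    in_word = False
    for line in timing_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):           # a new word begins
            segments.append([line[1:], 0])
            in_word = True
            continue
        symbol, duration = line.split()
        if symbol == "_":                  # a pause is a segment of its own
            segments.append(["_", int(duration)])
            in_word = False
        elif in_word:                      # phoneme: add to the current word
            segments[-1][1] += int(duration)
        else:
            raise ValueError("phoneme outside a word: " + line)

    timings, clock = [], 0
    for label, duration in segments:       # running sum gives the start times
        timings.append((label, clock, clock + duration))
        clock += duration
    return timings
```

Calling `word_timings(FIGURE_12)` yields the start times listed in figure 14, with each word's end time equal to the start time of the segment that follows it.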


5.3.2 Frames



The timing of each word can be related directly to the animation. The Talking Head is rendered at 25 frames per second, which is adequate for smooth animation, so each second of audio produces 25 frames of rendered animation and each frame lasts 40 milliseconds.


A simple equation can be used to determine from the start time of any word, the exact frame it will appear in the animation.



Frame Number (n) = start time (ms) / number of ms for 1 frame


Frame Number (n) = start time (ms) / 40


For instance, for the word "Here's" the start time is 520 milliseconds and the duration is 263 milliseconds, giving an end time of 783 milliseconds. The result of the division is truncated to a whole frame, so the visual representation of the lips (visemes) will be animated for the word from frame 13 (520 / 40 = 13) to frame 19 (783 / 40 = 19.575). Similarly, the word "latest", with a start time of 852 milliseconds and an end time of 1345 milliseconds, would be animated from frame 21 to frame 33. This synchronisation of words to the frames in the Talking Head animation sequence is represented in figure 15.



















Figure 15 Frame synchronisation of word in the animation sequence
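The frame equation can be written as a one-line function. Integer (truncating) division is an assumption here; it is chosen because it reproduces the frame numbers quoted in the text.

```python
FRAMES_PER_SECOND = 25
FRAME_MS = 1000 // FRAMES_PER_SECOND   # 40 ms per frame


def frame_for(time_ms):
    """Frame in which a given time (ms from the start of the audio) falls."""
    return time_ms // FRAME_MS          # truncation is an assumption of this sketch
```

With this definition the words "Here's" and "latest" map to frames 13 to 19 and 21 to 33 respectively, matching the worked example above.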

5.3.3 FAML Synchronisation


So far the discussion has only involved the synchronisation of words to the animation sequence; the synchronisation of FAML tags to the animation sequence will now be addressed.


The timing of FAML tags within the Talking Head animation is based upon the timing of words in the input text stream. Figure 16 shows the same input text stream as seen in figure 12 marked up with 2 FAML tags.




Here's <smile 2 5 1000/> the latest <nod 2 6 800/> news.






Figure 16 Example of FAML tags in text


The FAML tags, their attributes and specifications will be discussed in further detail within section 5.6. At this stage, however, it is more important to note the location of the FAML tag than the tag itself.


The FAML tag start time, the time at which the scripted tag will appear in the Talking Head animation, is based upon the start time of the word directly after it in the input text stream. In figure 15 the word "Here's" starts at 520 ms, or frame 13, and the word "the" starts at 783 ms, or frame 19.


As can be seen in figure 14, the tag <smile 2 5 1000/> has its start time associated with the next word in the input text stream, in this case 783 ms, the start time of the word "the". The Talking Head will therefore begin the animation for the FAML tag <smile 2 5 1000/> at 783 ms into the animation sequence, at frame 19, which is the same time that the lips are animated for the spoken word "the".


The next tag, the <nod 2 6 800/> FAML tag will have its start time associated with the start time for the word "news", in this instance a start time of 1345 (ms) or frame 33 in the Talking Head animation. This is further highlighted in figure 17.



FAML tag             Start time   Start frame
<smile 2 5 1000/>    783 ms       19
<nod 2 6 800/>       1345 ms      33


Figure 17 FAML tags synchronization
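The rule that a tag starts with the word that follows it can be sketched as below. The function name and the shape of its inputs are hypothetical; the sketch assumes the word start times are already known (for example from the timing file) and that the words in the marked-up text correspond one-to-one with the timed words.

```python
import re

# A token is either a whole tag or a run of non-space text (a word).
TOKEN_RE = re.compile(r"<[^<>]*>|\S+")


def schedule_tags(marked_text, word_starts_ms, frame_ms=40):
    """Return (tag, start_ms, start_frame) for each tag that precedes a word."""
    schedule, pending, word_index = [], [], 0
    for token in TOKEN_RE.findall(marked_text):
        if token.startswith("<"):
            pending.append(token)          # wait for the next word
        else:
            start = word_starts_ms[word_index]
            schedule.extend((tag, start, start // frame_ms) for tag in pending)
            pending = []
            word_index += 1
    # Tags after the final word stay unscheduled; the text does not cover that case.
    return schedule
```

Several tags placed before the same word all receive that word's start time, which is consistent with the stacked tags shown in figure 4.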










5.4 Personality and Gesture conflict resolution


As described previously, the work done by Shepherdson (2000) produced facial animation and simple gestures probabilistically distributed across an utterance and animation. When gestures are scripted into the animation sequence, a mechanism needs to be in place to ensure that gestures produced by the FAML module do not conflict with the randomized movements and emotions produced by the PST personality module. Conflicting gestures can produce disjointed and jerky animation that would detract from the realism of the Talking Head and decrease the effectiveness of both the personality and FAML modules. Conflict resolution between FAML-produced animation and the underlying personality animation is therefore essential to produce a realistically animated Talking Head.


Gesture conflict resolution does not attempt to alter the randomized gestures implemented for a particular personality. Rather, it provides a mechanism to overwrite the changes made by the personality module with new gestures and facial expressions scripted by the author of the input text. Conflict control attempts to incorporate the authored gestures, expressions and movements seamlessly into the Talking Head application without the animation being jerky or disjointed.


Two approaches were proposed for conflict control. One approach was to notify the PST module of a range of frames that would be unavailable for animation, so that these frames would not be a part of the PST animation itself. The PST module would only be able to animate the remaining unrestricted frames. This approach, although ideal, required a large investment in integrating such a system into the PST module. Due to the probabilistic nature of the PST animation, integrating such a system would border upon re-writing the PST module itself, a task beyond the scope of this project.


An alternative approach, and the one adopted here, involves overwriting the values produced by the PST module before, during and after the instance of the scripted tag animation. This ensures that all PST animations for that particular conflict are completed before the scripted tag animation is implemented and that no PST animations are animated during the scripted tag animation itself. This approach required far less integration in the PST module and still resolved the animation conflicts, producing realistic and flowing animation.


Five main areas on the Talking Head were identified as requiring some form of conflict control. These included blinking, eyebrows, gesture and head movements, expressions and emotions.


5.4.1 Blinking


Blinking, provided through the personality module and the PST personality file (Shepherdson 2000), can be randomly implemented for the FAQBot Talking Head application. However, as discussed by Pelachaud et al. (1995), blinking regularly occurs as a punctuator at the end of sentences. The FAML tag set includes a blink tag to enable the author to animate a blink at any instant in the spoken text, for instance at the end of a sentence, as discussed by Pelachaud and Prevost (1995).


Blinking produced by the personality module (Shepherdson, 2000) and a FAML scripted blink may occur at the same instant, or one may interrupt the other, producing unrealistic animation in which the Talking Head appears to start to blink and then re-start. This is illustrated in Movie 1 found on the compilation CD. As can be seen, while the animation is in mid-blink, the eyelids flick back to their initial positions and then re-start blinking, producing unrealistic and disjointed animation. As highlighted by the literature review (section 3.3.5), the eyes are the most important part of an expression, and any form of jitter or false movement can destroy both communication and believability (Thomas and Johnston, 1991).


The "overwriting" approach discussed previously was implemented such that the FAML module would determine all instances of blinking in the animation sequence. This blinking information would then be used to force the personality module not to animate any blinking sequence before the scripted FAML tag that could be interrupted by it, and not to animate any blinking during the scripted sequence that would interrupt the scripted blink itself. This produced the desired result: the PST module is restricted from animating a blinking sequence over a range of animation frames, allowing the scripted blink to be animated successfully.


This can be seen in Movie 2 found on the compilation CD, where the blinking conflict control has been turned on. Notice that the blink occurs as scripted by the author without being interrupted by a random blink originating from the PST module. In a sense the FAML module "clears" a range of animation frames of blinking, so that a scripted blink can be animated without interruption from the PST personality module.
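The "clearing" of frames can be illustrated by treating each blink as an interval of frames and discarding any personality blink that overlaps a scripted one. The interval representation and the function name are assumptions of this sketch, not the PST or FAML data format.

```python
def clear_conflicting_blinks(pst_blinks, scripted_blinks):
    """Drop every PST blink that overlaps a scripted blink.

    Each blink is a (start_frame, end_frame) pair with the end frame exclusive.
    """
    def overlaps(a, b):
        return a[0] < b[1] and b[0] < a[1]

    return [
        blink for blink in pst_blinks
        if not any(overlaps(blink, scripted) for scripted in scripted_blinks)
    ]
```

A personality blink that begins shortly before the scripted blink and would still be in progress when it starts is removed as well, which corresponds to the "before" case described above.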


5.4.2 Eyebrow


Eyebrows suffer from the same problem as blinking. The PST module provides random movements and intensities of eyebrow movement as defined by a PST file. However, there are instances of expression in which random movement of the eyebrows is not required. Some expressions require the eyebrows to remain static. An example of this is a confused expression. In this state we want the inner eyebrow points to move upwards and the middle and outer eyebrow points to remain neutral, portraying a confused look. Other random movements would raise the middle or outer eyebrow values and interfere with the confused expression. This can be seen in Movie 3 on the compilation CD. Notice that even in the confused state, the eyebrows continue to move quite intensely.


Eyebrow conflict control is based upon the same premise as blinking conflict control. This involves notifying the PST module that a certain range of frames is not to have its eyebrow values altered. The FAML module will alter these frames as authored. Movie 4 on the compilation CD has eyebrow conflict control switched on and, as can be seen, in the confused state the eyebrows remain static from the start to the end frames of the tag animation.


5.4.3 Gestures and Head Movements


The PST module also provides realistic head movements and gestures such as nodding and head turning. These are randomly integrated into the Talking Head animation using the PST personality file as described previously. The conflict control mechanism for gestures and head movements differs from that for blinking and eyebrow movements. Authored gestures are blended into the animation rather than overwriting the PST module over a certain range of frames during the animation. Blending consists of the summation of the animation values from both the PST and FAML modules for individual frames in the animation sequence. This blending approach allows the movements of the Talking Head to be continuous and flowing, while animating the desired scripted head movement seamlessly into the Talking Head animation.


The overwriting of animation values is only used if the limiting extent is reached when PST animation values are blended with the FAML animation values. The FAML module defines an upper limit for certain movement values and ensures that a blended animation value does not exceed these set limits. For example, if the Talking Head is already randomly looking left and the author required the Talking Head to further look left, then the two values responsible for the Talking Head looking left are seamlessly blended together. However we need to ensure that the blended values do not undermine the realism of the animation. It is undesirable to have the blended values look so far left such that the head turns left and keeps on turning left, reversing its point of view. This type of conflict resolution mechanism is required to ensure realism in the animation, as both the PST and FAML modules work independently of each other to animate the Talking Head animation sequence.
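Blending with a limit can be sketched as a per-frame sum that is clamped to a maximum extent. The numeric values and the symmetric limit are illustrative assumptions; the actual limits used by the FAML module are not given in the text.

```python
def blend(pst_values, faml_values, limit):
    """Sum PST and FAML values frame by frame, clamped to [-limit, +limit]."""
    return [
        max(-limit, min(limit, pst + faml))
        for pst, faml in zip(pst_values, faml_values)
    ]
```

Within the limit the two movements simply add, so the scripted movement rides on top of the personality movement. The clamp only takes effect when the combined value would carry the head past its permitted extent.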


5.4.4 Expressions



The FAML tags enable the markup of the Talking Head animation with facial expressions. These facial expressions differ from emotions: facial expressions form part of the portrayal of emotion for the Talking Head. Conflict resolution for FAML expressions deals directly with the high-level emotions provided by MPEG-4, discussed previously, and the personality traits of the Talking Head defined by the PST module. When the Talking Head is scripted to display the expression "confused", the personality expressions provided by the PST module would undesirably alter the scripted confused expression. Blending the FAML expression with the personality is not appropriate either, as this may result in some unknown, peculiar or contrary expression. An overwriting mechanism similar to the conflict resolution for eyebrows and blinking nullifies the effect of the personality expression during the animation of the scripted expression tag. This gives the author the ability to script a confused expression into the Talking Head animation, even though the personality trait is that of a friendly personality.






5.4.5 Emotion



As specified by the PST module, the Talking Head can portray different high-level emotions. High-level emotions are considered in terms of the high order emotions defined by the MPEG-4 specification. A friendly personality defined by the PST information file would randomly exhibit the high-level "joy" expression defined by MPEG-4. The FAML module also gives the author the ability to use these high order expressions synchronised to the text. However, a problem exists in blending the random emotions and authored FAML emotion tags. Due to the random nature of the PST module, knowing exactly how the Talking Head will behave, and when emotions will be used, is problematic. Conflict resolution between scripted FAML emotions and random PST emotions uses a "bleeding" mechanism. The FAML module only takes over the animation of an emotion if the authored FAML emotion tag is greater in intensity, so as to "bleed" through any emotion preset by the PST module. The "bleeding" technique provides a seamless way to move from one emotion to another and ensures that the animation of the emotions flows smoothly between instances without sacrificing realism.


The overwriting technique described previously for blinking and eyebrow conflict resolution was inadequate in this instance due to the high-level nature of the emotion tags. Nullifying or negating the personality's high-level emotion animation values produced disjointed and irregular animation. It became apparent that overwriting would destroy the flow within the underlying personality of the Talking Head.
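The selection rule behind the "bleeding" mechanism can be sketched as below. This is a simplified illustration under our own assumptions: emotions are represented as hypothetical (name, intensity) pairs, and only the decision of which emotion drives the animation is shown, not the smooth transition between them.

```python
def resolve_emotion(pst_emotion, faml_emotion):
    """Pick the emotion that drives the animation at a given instant.

    Each argument is a (name, intensity) tuple, or None when that module
    has no emotion active. The tuple representation is an assumption
    made for illustration.
    """
    if faml_emotion is None:
        return pst_emotion
    # The scripted emotion "bleeds" through only when it is stronger
    # than the emotion currently preset by the PST module.
    if pst_emotion is None or faml_emotion[1] > pst_emotion[1]:
        return faml_emotion
    return pst_emotion
```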


5.5 Generic Tag specifications


FAML tags are based upon the facial markup specifications as outlined by the Fifth framework consortium. The consortium foresaw the need for a facial markup of the text stream and provided a specification for such a markup. Figure 18 shows an example of a suggested facial animation markup tag.



<anger 2 5 1000/>





Figure 18 A generic FAML tag


As can be seen from figure 18, a FAML tag is opened by an angle bracket "<" and closed by a slash and angle bracket "/>". The type of tag is denoted by its name; in this instance the tag type (name) is "anger". The tag type denotes its function and indicates the type of gesture, expression or emotion that the tag will perform on the Talking Head animation. Tag names have been selected to reflect their function, for ease of use when scripting the animation text input document.


The FAML tag in figure 18 has three attributes: "bit set", "amplitude" and "duration". In figure 18, "bit set" is equal to 2, "amplitude" is equal to 5 and "duration" is equal to 1000.


These three attributes are associated with all FAML tags:


Bit-set


The bit-set attribute of the tag is an MPEG-4 specification and denotes the bit that is set in the MPEG-4 facial animation bit stream. A bit-set value of 2 will denote that the tag is a high order tag, manipulating the high order emotions defined previously. A bit-set of 1 denotes that the tag is dealing with low order bit values and will be manipulating the low order or level bits of the MPEG-4 animation. The low level values control individual FAP points on the Talking Head model, whilst the high order bits control sets of FAP points.


Amplitude


Each tag has an amplitude value associated with it. This value ranges from 1 to 10 and represents a percentage of the maximum intensity of that particular facial gesture, expression or emotion. The amplitude value corresponds to the intensity with which the expression, gesture or emotion will be portrayed in the Talking Head animation. A value of 1 denotes 10 percent of the maximum intensity allowed, whilst a value of 10 denotes 100 percent of the maximum intensity for that particular tag.


Duration


Every FAML tag has a duration value associated with it. The duration value represents the time span in milliseconds that the tag expression, gesture or emotion will persist in the Talking Head animation.
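As an illustration of the generic tag format, the following sketch parses a tag such as the one in figure 18 into its name and three attributes. The class name, function names, regular expression and validation rules are our assumptions for this example and are not taken from the FAML module.

```python
import re
from typing import NamedTuple


class FamlTag(NamedTuple):
    name: str
    bit_set: int       # 1 = low order (individual FAPs), 2 = high order
    amplitude: int     # 1..10, each step is 10 percent of the maximum
    duration_ms: int   # time span in milliseconds


# Hypothetical pattern: a name followed by three positional integers.
TAG_PATTERN = re.compile(r"<\s*([A-Za-z_]+)\s+(\d+)\s+(\d+)\s+(\d+)\s*/>")


def parse_tag(text):
    """Parse one generic FAML tag, for example '<anger 2 5 1000/>'."""
    match = TAG_PATTERN.fullmatch(text.strip())
    if match is None:
        raise ValueError(f"not a FAML tag: {text!r}")
    name = match.group(1)
    bit_set, amplitude, duration_ms = (int(g) for g in match.groups()[1:])
    if bit_set not in (1, 2):
        raise ValueError("bit-set must be 1 or 2")
    if not 1 <= amplitude <= 10:
        raise ValueError("amplitude must be between 1 and 10")
    return FamlTag(name, bit_set, amplitude, duration_ms)


def intensity_fraction(tag):
    """Fraction of the maximum intensity denoted by the amplitude."""
    return tag.amplitude / 10
```

Parsing the tag of figure 18 in this way gives the name "anger", a bit-set of 2, an amplitude of 5 (half of the maximum intensity) and a duration of 1000 ms.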



5.6 Tag animation


Tag animation requires a start time and a duration. The start time and start frame of each tag can be calculated through the process of synchronisation, as described previously in section 5.3. A duration of 1000 ms denotes 1 second of animation, or exactly 25 frames. Therefore a tag with a start frame of 10 and a duration of 1000 ms will persist in the Talking Head animation from frame 10 to frame 35.


This is further highlighted in figure 19. Tag animation denotes the animation of a particular tag over its user defined duration value, synchronised and integrated into the Talking Head animation sequence.








<smile 2 5 1000/> start frame = 10


[Figure 19 diagram: tag type = smile; bit-set = 2; amplitude = 5; duration = 1000 ms = 25 frames. On the animation sequence timeline, the smile tag spans frame 10 to frame 35.]


Figure 19 Breakdown of smile tag animation
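The frame arithmetic of figure 19 can be reproduced with a short sketch. The frame rate of 25 frames per second is implied by the text (1000 ms = 25 frames); the function name and the rounding of fractional frame counts are our assumptions.

```python
FRAMES_PER_SECOND = 25  # implied by the text: 1000 ms = 25 frames


def frame_span(start_frame, duration_ms, fps=FRAMES_PER_SECOND):
    """Return (first frame, last frame) covered by a tag.

    Rounding to the nearest whole frame is an assumption; the text only
    gives an example in which the duration divides evenly.
    """
    n_frames = round(duration_ms * fps / 1000)
    return start_frame, start_frame + n_frames
```

For the smile tag above, a start frame of 10 and a duration of 1000 ms give the span from frame 10 to frame 35.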


FAML tags are animated in three distinct phases over their duration in the Talking Head animation sequence. Figure 20 shows the amplitude of a tag over its duration. Each of the phases is described in relation to figure 20.


  1. Growth: From t0 to t1, the tag amplitude increases from its minimum value to its maximum value. The maximum value is obtained through the amplitude attribute of the FAML tag and is a percentage of the maximum allowed amplitude value for that particular FAML tag. Time t0 to t1 constitutes 25 percent of the total duration of the FAML tag and represents 25 percent of the number of frames for the tag's duration. From t0 to t1 the animation sequence changes to reflect the increasing amplitude of the FAML tag's expression, emotion or gesture.


  2. Delay: From t1 to t2, the tag amplitude remains constant at the maximum amplitude value defined by its attribute. Time t1 to t2 constitutes 50 percent of the total duration of the FAML tag and represents 50 percent of the number of frames used to animate the FAML tag's expression, emotion or gesture over its duration.


  3. Decay: From t2 to t3, the tag amplitude decreases from the maximum amplitude value defined by its attribute to its minimum value. Time t2 to t3 constitutes 25 percent of the total duration of the FAML tag and represents 25 percent of the number of frames used to animate the FAML tag's expression, emotion or gesture over its duration. From t2 to t3 the animation sequence changes to reflect the decreasing amplitude of the FAML tag's expression, emotion or gesture.











[Figure 20 plot: amplitude (vertical axis) against time (horizontal axis).]


Figure 20 The amplitude of a generic FAML tag over its duration in the animation sequence


For the example <smile 2 5 1000/> FAML tag described by figure 19, the animation of the tag will appear in the Talking Head animation sequence as shown by figure 21.








[Figure 21 plot: amplitude (vertical axis) against time (horizontal axis), with the tag spanning frame 10 to frame 35 of the complete animation sequence.]


Figure 21 The amplitude of a "smile" FAML tag over its duration in the animation sequence


All FAML tags are animated using these growth, delay and decay phases. However, there are instances where the phases are reversed, such that the animation decays to a minimum value, is delayed at that minimum value and then grows back to its original value. Other tags may have multiple instances of these three phases, notably the nod tag, which raises the head using the growth, delay and decay phases and then lowers the head using the decay, delay and growth phases.
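The three phases can be sketched as a simple amplitude envelope. The 25/50/25 percent split comes from the description above; the linear ramps, the minimum value of zero and the function names are our assumptions, since the exact shape of the growth and decay curves is not stated here.

```python
def envelope(t, peak):
    """Amplitude at normalised time t, where 0 is t0 and 1 is t3.

    Growth covers the first 25 percent of the duration, delay the next
    50 percent and decay the final 25 percent. Linear ramps and a
    minimum of zero are assumptions made for illustration.
    """
    if t < 0.0 or t > 1.0:
        return 0.0
    if t < 0.25:                      # growth: t0 to t1
        return peak * (t / 0.25)
    if t <= 0.75:                     # delay: t1 to t2
        return peak
    return peak * ((1.0 - t) / 0.25)  # decay: t2 to t3


def tag_amplitudes(start_frame, n_frames, peak):
    """Map each frame of a tag to its amplitude under the envelope."""
    return {
        start_frame + i: envelope(i / n_frames, peak)
        for i in range(n_frames + 1)
    }
```

Calling `tag_amplitudes(10, 25, 0.5)` for the smile example gives zero amplitude at frames 10 and 35 and the full authored amplitude through the middle of the tag. A reversed tag could be sketched by subtracting the envelope from a resting value instead of adding it.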


5.7 Gesture FAML Tags


5.7.1 Head


The animation of head movement can be broken down into three main parts: pitch, yaw and roll.


The pitch affects the elevation and depression of the head in the vertical field. The yaw affects the rotational angle of the head in the horizontal field, and the roll affects the axial angle. The combination of these three factors allows full directional movement for the animation of the Talking Head.


There are eight main tags that control and animate the direction and orientation of the Talking Head. These consist of the look directional tags and the head directional tags. All tags follow the generic specification of FAML tags:



  1. <look_left /> : Turns both the eyes and head to look left of the author.


  2. <look_right /> : Turns both the eyes and head to look right of the author.


  3. <look_up /> : Turns both the eyes and head to look up.


  4. <look_down /> : Turns both the eyes and head to look down.


  5. <head_left /> : Only the head turns left, the eyes remain looking forward.


  6. <head_right /> : Only the head turns right, the eyes remain looking forward.


  7. <head_up /> : Only the head turns upward, the eyes remain looking forward.


  8. <head_down /> : Only the head turns downward, the eyes remain looking forward.



It is noted that the eyes and head move at the same rate during the animation of the looking tags. Combinations of the above directional tags allow the head a full range of orientation in the animation. A combination of the <look_left /> and <look_up /> tags will enable the head to look to the top left in the animation sequence, whilst the combination of <look_right /> and <look_down /> will enable the head to look to the bottom right.


Nod [ <nod /> ]


The nod tag, as the name suggests, animates a nod into the animation of the Talking Head. The nod tag animation is broken into two sections: the head raise and then the head lower. Observations of peers have shown that there is a raise of the head before the nod is initiated. The nod tag mimics this: 10 percent of the duration of the nod tag is allocated to the head raise, with an amplitude of 10 percent of the authored amplitude value, and the other 90 percent is allocated to the head lower. The nod tag can typically be used to gesture "yes" or "agreement". Only the vertical angle of the head is altered during the tag animation; the eye gaze remains focused forward.
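The split of the nod into its two sections can be written out as a small sketch. The function name and the returned structure are hypothetical, and we assume that the lowering section uses the full authored amplitude, which the description implies but does not state.

```python
def nod_segments(duration_ms, amplitude):
    """Split a nod into (section, duration in ms, amplitude) triples.

    The raise takes 10 percent of the duration at 10 percent of the
    authored amplitude. Using the full authored amplitude for the
    lowering section is an assumption.
    """
    raise_duration = duration_ms / 10
    return [
        ("raise", raise_duration, amplitude / 10),
        ("lower", duration_ms - raise_duration, amplitude),
    ]
```

For a nod with a duration of 1000 ms and an amplitude of 5, this gives a 100 ms raise at amplitude 0.5 followed by a 900 ms lowering at amplitude 5.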




Emphasis [ <emph/> ]


The emphasis tag is very similar in animation to the nod tag. The difference is that the emphasis tag incorporates a lowering of the eyebrows into the nod itself, as described by Pelachaud and Prevost (1995). This serves to further emphasize or accentuate words in the spoken text. The emphasis tag has raise and lower stages similar to those of the nod tag animation. It is noted, however, that the eyebrows are lowered at the same rate as the nod. If a different intensity of eyebrow lowering is needed, the emphasis tag can be used in conjunction with the eyebrow down tag to produce an emphasis animation with a greater or more subtle lowering of the eyebrows.


Disagree [ <disagree /> ]


The disagree tag animates a shake of the head. The tag animates two shakes, where a single shake is considered to be a head movement from the left to the right. This tag can be used as a facial gesture for "no" or "disagree". The tag only affects the horizontal displacement of the head; no other facial features are affected. The animation involves moving first to the left and then to the right, repeating this movement, and then returning to the central plane.


Winking [ <left_wink/> <right_wink/> ]


As the name suggests, winking animates a wink of either the left eye or the right eye, as specified by the author. The wink is not just the blinking of one eye: the head pitch, roll and yaw are affected, as well as the outer eyebrow and cheek. The combination of these animated features adds to the realism of the wink itself. These features have been mimicked from peer observation.


Roll [<left_roll/> <right_roll/> ]


The roll tag animates the roll of the Talking Head in the axial plane. Roll, although subtle in normal movement, is essential for realism. This tag allows the author to script roll movement in the Talking Head, typically in conjunction with other tags such as nodding and head movements, to add further realism to the Talking Head.


5.7.2 Eyes


Blinking [ <blink/> ]


The blink tag animates a blink of both eyes in the Talking Head animation. The blink tag only affects the upper and lower eyelid facial features of the head. Altering the amplitude value changes the amount of eye closure in the animation. An amplitude value of 5 denotes 50 percent of the maximum amplitude for the blink tag, and as such the animation would show only a half blink, in which only half of the eyeball is covered.


Double Blink [ <double_blink/> ]


Not all blinks in humans are singular. Peer observation has shown that double blinking is quite common and can precede changes in emotion or denote sympathetic output as described by research in human physiology (Miller, 1981). Under conditions of stress, the autonomic nervous system is activated, producing an immediate, widespread response that has been called the "fight or flight" response. The overall effect is to prepare the individual for imminent danger and can be exhibited as nervousness or agitation in an individual.


Eye gaze [ <gaze_left/> <gaze_right/> <gaze_up/> <gaze_down/> ]


Sometimes independent movement of the eyes from the head is desirable. The gaze directional tags allow four directions of eye movement, in the vertical and horizontal planes. As with the head directional tags, the gaze tags can be combined to provide a full range of eye gaze, even positions that are not humanly possible. The eyes cannot, however, be animated independently of each other, which is not humanly possible either; the gaze tags animate both eyes simultaneously.


1. <gaze_left /> : The eyeballs are animated to look left, in relation to the author's left.


2. <gaze_right /> : The eyeballs are animated to look right, in relation to the author's right.


3. <gaze_up /> : The eyeballs are animated to look in the upward direction.


4. <gaze_down /> : The eyeballs are animated to look in the downward direction.


5.7.3 Brows


Eyebrow movement [ <brow_up/> <brow_down/> <brow_squeeze/> ]


Eyebrow movement is categorized into three sections:


  1. <brow_up/> : vertical eyebrow movement upwards.


  2. <brow_down/> : vertical eyebrow movement downwards.


  3. <brow_squeeze/> : squeezing of the eyebrows together.


Although the PST module animates eyebrow movement, the eyebrow movement tag enables the author to script certain eyebrow movements to accentuate words or phrases. MPEG-4 separates the eyebrow into 3 regions, inner, middle and outer. The eyebrow tags affect all three regions of the eyebrow to animate movement. At present individual sections cannot be moved independently.


5.8 Expression FAML Tags


Smile [ <smile/> ]


The smile tag, as the name suggests, animates the expression of a smile into the Talking Head animation. The mouth is widened and the corners pulled back towards the ears. The larger the amplitude value for the smile tag, the greater the intensity of the smile. However, a value that is too large produces a rather "cheesy" looking grin and can look disconcerting or phony. This can be used to the animator's advantage if a mischievous grin or masking smile is required. The smile tag is generally used to start sentences and is used quite often when accentuating positive or cheerful words in the spoken text (Pelachaud and Prevost, 1995).


Facial Shrug [ <shrug/> ]


The facial shrug tag animation mimics the facial expression "I don't know". Peer observation shows that a facial shrug consists of the head tilting back, the corners of the mouth being pulled downward and the inner eyebrows being tilted upwards and squeezed together. This is further supported by Flemming and Dobbs (1999).


Confused [ <confused/> ]


As the name suggests, the confused tag animates the expression of confusion onto the Talking Head animation. The animation involves moving the eyebrows upwards, with the inner eyebrow having greater movement, and drawing the outer points of the mouth closer together (Flemming and Dobbs, 1999).


Dazed [ <dazed/> ]


The distinguishing features of the dazed facial expression are a slight raising of the eyebrows and eyes that are open slightly wider than normal and looking forward. The lips are slightly pulled down and outwards (Flemming and Dobbs, 1999).


5.9 Emotion FAML Tags


As described previously in the literature review section, MPEG-4 has six high-level emotions that can be used to integrate emotion into the Talking Head animation. Correspondingly, there are six FAML emotion tags that can be used to script emotion into the Talking Head animation. The six emotions that can be animated are joy <joy />, sadness <sadness />, anger <anger />, fear <fear />, disgust <disgust /> and surprise <surprise />.


FAML emotion tags can be placed in sequence to produce a seamless flow from one emotion to the other. Emotion tags can also be blended together at the same instance to produce different expressions and emotions entirely, as desired.


5.10 Virtual Characters


With the implementation of the FAML module and the successful integration in the FAQBot Facial Animation Engine (FAE), the set of FAML tags can be used to script the Talking Head animation to mimic the persona or behaviour of an author scripted character.


The scripted character using the FAML module is capable of displaying facial gestures, expression and emotions synchronised to the spoken text. Three virtual characters, a News presenter, Sales assistant and Storyteller were implemented to test the project hypotheses, and will be explored in detail.


5.10.1 News presenter


The News presenter mimics the persona of a virtual character reading factual and informative news of local or world events in a clear and unbiased manner. The goal of the News presenter is to inform the viewer of the day's events, while leaving the interpretation of the events to the viewer.


The News presenter utilises a large number of head gestures and movements to convey meaning, but rarely shows any emotion other than a friendly appearance. Typically the head gestures consist of nodding and emphasis gestures to accentuate words in the spoken text. Words that are important to the comprehension of the news article presented are typically highlighted or emphasized in this manner.


The Talking Head typically utilises the nod and emphasis FAML tags that allow the Talking Head to nod or emphasize words in the spoken text. Movie 5 from the compilation CD provides the animation of the Talking Head using the FAML tags to markup the Talking Head as a News presenter character. The text input and markup is provided in Appendix C.


5.10.2 Sales assistant


The Sales assistant mimics the persona of a character promoting a product for sale, or a character that helps in the advertisement of a product. The task of the Sales assistant is to encourage the viewer to purchase a product or aid the viewer in purchasing.


The Sales assistant is involved in customer relations and as such has a very friendly and open persona. Large open smiles and head movements accentuate the character's need to maintain the customer's attention and sell products. The friendly nature and happy appearance is designed to appeal to the viewer. The Sales assistant character can be seen on Movie 6 of the compilation CD and the input marked up text document can be seen in Appendix D.


5.10.3 Narrator / Storyteller


The Narrator or Storyteller character is similar to the News presenter virtual character but is much more emotive. Whilst the News presenter's function is to provide factual information to the viewer in a clear and concise manner, the Narrator functions to entertain and stir emotion in the viewer, enticing the viewer into the story or passage that is being read. The Narrator needs to be more expressive in facial gestures and expressions, following the semantic and intonational content of the text. If the text is sad or sombre in nature, then this is also reflected in the facial emotion, expression and gestures of the character. As the nature of the text changes, such as from sad to happy, the emotional content in the face needs to change also, synchronised with the spoken text. A Narrator character that does not exhibit expressive facial expression or emotion fails to entertain the viewer and fails to portray the character. An animated Narrator character using the FAML tags can be seen in Movie 7 on the compilation CD. The input text document can be seen in Appendix E.


5.11 Producing realistic animation


Smooth, continuous animation is essential for realism. As mentioned by Shepherdson (2000), humans are rarely static and exhibit behaviour that is dynamic, temporal and unpredictable. By providing seamless animation that flows from one expression to the next, from one movement to another, and from one emotional display to the other, we come closer to mimicking true life-like animation.




5.11.1 Realistic Head Turns


Through observations of nature, we can see that objects do not move in a linear motion. White (1986) makes the astute observation that "everything that moves in life, moves in arcs" (p.38). In real life, people do not move in a linear fashion but flow from one movement to another in arcs; one simply has to look at any human motion to see this effect. Machines and robots have been attributed linear movement and as such are deemed to be unrealistic, not life-like and robotic in nature.


With regards to the Talking Head, the issue of flowing movement is addressed such that movement produced by the Talking Head animation mimics the natural arcs seen in nature.


A "Fairing" approach used by Marriott (1992) uses a non-linear interpolation technique between points or key-frames in the animation sequence. Key-frame animation, as described by Burtnyk and Wein (1971), is a technique used in computer animation that involves the automatic generation of intermediate frames based on a set of key frames supplied by the animator. Straightforward animation using simple linear interpolation algorithms produces undesired mechanical movement and a lack of smoothness in the animation. Key frames become clearly visible in the animation because of sudden changes in the direction of motion. Discontinuities in the speed of animation can also be seen with linear interpolation methods.


"Fairing", an adaptation of an interpolation algorithm proposed by Kochanek and Bartels (1984), utilises a cubic interpolating spline for animating the frames between key-frames. The cubic interpolating spline minimizes the mechanical look of movement caused by changes in the direction and speed of motion. The PST module also uses "fairing" for the interpolation of movement for the Talking Head.


Kochanek and Bartels' cubic interpolating algorithm (1984) is also able to take into account acceleration and deceleration effects within the key frame animation through control parameters for the cubic spline. Humans vary their acceleration and deceleration of motion. The human head accelerates to perform a turning movement and decelerates before the turn is completed. The acceleration and deceleration of motion using the "fairing" approach produces a much smoother and more life-like animation of the Talking Head.
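To make the idea of a cubic interpolating spline with control parameters concrete, the sketch below evaluates one segment of a Kochanek-Bartels spline between two key values. It follows the commonly published tension, continuity and bias formulation with uniformly spaced keys, and is not necessarily the exact variant used by the "fairing" implementation; the function names are our own.

```python
def kb_tangents(p0, p1, p2, p3, tension=0.0, continuity=0.0, bias=0.0):
    """Return (outgoing tangent at p1, incoming tangent at p2)."""
    t, c, b = tension, continuity, bias
    d1 = ((1 - t) * (1 + b) * (1 + c) / 2) * (p1 - p0) \
        + ((1 - t) * (1 - b) * (1 - c) / 2) * (p2 - p1)
    d2 = ((1 - t) * (1 + b) * (1 - c) / 2) * (p2 - p1) \
        + ((1 - t) * (1 - b) * (1 + c) / 2) * (p3 - p2)
    return d1, d2


def kb_interpolate(p0, p1, p2, p3, s, **params):
    """Interpolate between key values p1 and p2 at s in [0, 1].

    p0 and p3 are the neighbouring key values. Uniform key spacing is
    assumed. The segment is a cubic Hermite curve whose end tangents
    come from the tension, continuity and bias parameters.
    """
    d1, d2 = kb_tangents(p0, p1, p2, p3, **params)
    s2, s3 = s * s, s * s * s
    h00 = 2 * s3 - 3 * s2 + 1   # weight of p1
    h10 = s3 - 2 * s2 + s       # weight of tangent at p1
    h01 = -2 * s3 + 3 * s2      # weight of p2
    h11 = s3 - s2               # weight of tangent at p2
    return h00 * p1 + h10 * d1 + h01 * p2 + h11 * d2
```

As an example with hypothetical key values, a head turn from 0 to 10 units that starts and ends at rest (`kb_interpolate(0, 0, 10, 10, 0.25)`) has covered about 2.03 units after a quarter of the segment, compared with 2.5 for linear interpolation. The head therefore starts slowly, speeds up and then slows again before the turn is complete.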


5.11.2 Realistic Eye Movements


Similarly the movement of the eye is not linear and utilises the same "fairing" technique as described previously. This allows the animation of the eyes to flow in arcs and behave in the same continuous flow of animation as head movements.


However, as discussed by White (1986) and Maestri (1996), people tend to look in the direction in which their head is turning. In a sense, the eyes lead the head movement. The implementation of the look FAML tags has incorporated this leading eye movement to add the realism that White and Maestri observed.


It is however important to note that leading eye movement has not been implemented or incorporated into the PST module and as such, eye movement animated by the PST module does not lead head movement.














































Chapter 6



Results and Analysis







Synthetic Talking Heads, as stated previously, are still in their infancy and as such very few evaluation procedures have been considered (Benoit et al., 1999). One evaluation technique compares the system to previous systems that use plain text, audio and still pictures. Another compares different sets of conditions, such as single modality versus multi-modality conditions (Nagao and Takeuchi, 1994).


We considered the second approach of comparing modality. We demonstrate a multi-modal system, but compare the results when we alter one modality, in this context the visual modality. By analysing the results we hope to establish the effectiveness of the changes and how they can support or reject our hypotheses.


As indicated previously, the testing and evaluation process involved a questionnaire-based approach, implemented through a series of demonstrations (refer to Appendix F for a copy of the questionnaire). The demonstrations were arranged in three distinct sections. Each section was designed to provide data to test the project hypotheses.


6.1 The Experiment


The sections of the questionnaire are as follows:



The purpose of section one was to acquire some demographic data as well as to establish the cultural backgrounds of the participants. As indicated previously in the literature review, emotions, expressions and gestures are culturally dependent. From this data it was hoped to gain some insight into cultural aspects that could have affected the participants' evaluation of the project.



The purpose of section two was to test our two project hypotheses:


  1. Realistic animation can be simulated using FAML tags to "direct" the facial gestures, movements and expressions, mimicking true non-verbal human-to-human communication.


  2. The successful use of FAML tags can be used to create believable virtual characters, where believable is used in the sense that the user suspends his/her disbelief and interacts with the Talking Head as a real person.


Section 2 consisted of two demonstrations of each of the three selected virtual characters: a Storyteller, a News presenter and a Sales assistant. The respondents to the questionnaire were shown two short animations for each character. The first demonstration did not include FAML tags. The second demonstration likewise contained no TTS markup but did include FAML tags. In this sense only the visual modality of the Talking Head was changed between demonstrations.


The Talking Head animation consisted of a friendly personality as described by Shepherdson (2000) with no emotive speech markup. It is important to note that throughout all demonstrations in section 2, no speech markup (Stallo, 2000) was used and the friendly personality remained constant. The only variable that changed during the demonstrations was the inclusion or exclusion of FAML tags. This was done to reduce the variables that could influence the participants' decisions when filling out the questionnaire.


After the first demonstration (without FAML tags), users were asked to comment on the Talking Head animation in terms of its facial gestures, expressions and emotions, by rating questions on a scale of one to five. The questions centered on how "realistic", "boring" and "expressive" they found the Talking Head animation. A rating scale of five was used as it provided an adequate range of discrete response values, without being too extensive as to confuse respondents (Sparks et al., 2000).


The second demonstration for the character was then shown, this time with the inclusion of FAML tags. Respondents were then asked once again to comment on the "expressiveness" of the Talking Head, how "life-like" it performed and so forth. In addition, respondents were asked which demonstration of the character they found more believable in portraying the specified character: the first demonstration (without FAML tags) or the second demonstration (with FAML tags). It is important to note that participants were not told of the markup changes, or what they could expect from each demonstration.




In section three of the questionnaire, respondents were asked to comment on the Talking Head animation reading a narrative excerpt from "Alice in Wonderland" (Carroll, 1946). Section three consisted of two demonstrations. The first demonstration contained speech markup produced using the TTS tags from Stallo (2000), but no facial markup from the FAML tags. The second demonstration contained both TTS and FAML tags. Users were presented with both demonstrations one after the other and then asked to comment on them in the questionnaire.




For each phase of the demonstration all users were asked to comment on their choices. This was intended to provide insight into the reasoning behind the respondents' choices.


6.2 Evaluation of Results

6.2.1 Profile of Users


The questionnaire was conducted on the 26th of October 2000 at Curtin University of Technology, during a Systems Program and Design 251 lecture. We demonstrated the Talking Head to 35 participants. The following gives an outline of the results for section one, the participants' background.


Nationality


From the questionnaires we found that 71 percent of participants were of Australian nationality, and the other 29 percent were of Asian nationality, ranging from countries such as Vietnam, Malaysia, Singapore and Hong Kong.

Language


The questionnaire indicated that 63 percent of the respondents spoke English as their first language, 23 percent Chinese, 3 percent Dutch, 3 percent Filipino, 3 percent Russian and 3 percent Vietnamese. A further 3 percent of participants did not state their first language.


Previous exposure to Talking Heads


The users were also profiled on any Talking Head systems they had seen previously. Of the 35 users surveyed, 66 percent had seen some type of Talking Head system before. Twenty percent had seen the current FAITH Talking Head, while 9 percent said they had seen a Talking Head on a TV program called "Download". "Download" is a children's game show program. The "Download" Talking Head system is performance driven and utilises a two dimensional Talking Head. A further 9 percent of users identified "Max Headroom" as a Talking Head system they had seen previously. "Max Headroom" is a real human head that was video processed to have the appearance of a computer-generated character. Originally developed as a character in a children's movie, "Max Headroom" is unique in the sense that his animation purposely causes him to stutter and behave in a jerky manner.


From the profile data collected from the participants, it was established that the majority were English speaking, whilst two thirds of participants had seen some type of Talking Head system before. Users that had previous exposure could compare the performance of the project to what they had seen previously, whilst other participants could not.

6.2.2 Results of Phase Two


Storyteller


The Storyteller demonstration consisted of the Talking Head narrating a story. Participants were shown the Talking Head and asked to rate on a scale of one to five how life-like, believable, realistic and interesting the Talking Head appeared. Participants were also asked to comment on how well the Talking Head appeared to be communicating to them, and how expressive they thought the Talking Head appeared.



















Figure 22 Comparative Storyteller results from demonstration 1 Vs demonstration 2


From figure 22 one can see the percentage increases in results from the first demonstration (without FAML tags) of the Storyteller character to the second demonstration (with FAML tags). At a glance there is an overall increase in the rating values for demonstration 2 over demonstration 1. The areas of interest (more life-like, more believable, more realistic, more interesting, more communicative and more expressive) all had substantial increases of greater than 50 percent, the only exception being realism, with an increase of only 45 percent. Realism, as stated on the questionnaire, was taken in the context of how physically real the Talking Head appeared to the user. Some respondents may have taken this to mean the specific physical model of the face and its features, as opposed to the overall realism of the Talking Head. Since the model did not change between demonstrations, this could explain why there was only a 45 percent increase when respondents were questioned on realism.


Of relevance is the percentage of significant increases. A significant increase is defined as a difference in rating of two or more. For example, if a respondent gave a rating of 1 for the first demonstration but a rating of 4 for the second demonstration, this is counted as a significant increase. All areas of interest had significant increases in rating values. Of particular note is the expressiveness result, with a significant increase of 27 percent, indicating that more than a quarter of respondents thought there was a significant increase in the expressiveness of the Talking Head when animated using the FAML tags.
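The classification described above can be sketched in a few lines of Python. This is only an illustration: the function names and the example ratings are hypothetical, and the sketch assumes that "increase" and "significant increase" are reported as separate categories, which the questionnaire analysis may not have done in exactly this way.

```python
from collections import Counter


def classify_change(before: int, after: int, threshold: int = 2) -> str:
    """Label one respondent's pair of ratings on the one-to-five scale."""
    diff = after - before
    if diff >= threshold:
        return "significant increase"
    if diff > 0:
        return "increase"
    if diff <= -threshold:
        return "significant decrease"
    if diff < 0:
        return "decrease"
    return "no change"


def summarise(pairs):
    """Return the percentage of respondents falling in each category."""
    counts = Counter(classify_change(before, after) for before, after in pairs)
    return {label: 100.0 * n / len(pairs) for label, n in counts.items()}


# Hypothetical (demonstration 1, demonstration 2) ratings, not the survey data.
example = [(1, 4), (2, 3), (3, 3), (4, 3), (2, 4)]
print(summarise(example))
```

A rating that moves from 1 to 4 is labelled a significant increase, whereas a move from 2 to 3 is labelled only an increase.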


All areas except life-like had some decreases in values, indicating that the second demonstration, which included FAML tags, was not wholly perceived as improving the Talking Head animation and the Storyteller character.


As indicated previously, the physical realism of the Talking Head did not change between demonstrations; however, the decrease in rating for the question of realism can be explained by the comments made by the participants. Comments included:



The majority of the eyebrow movement can be attributed to the personality module. The number and intensities of eyebrow movement could be decreased in light of the comments made about the extensive use of eyebrow movements. A realistic texture map of the Talking Head could be applied to increase realism. However, it was a conscious decision not to implement texture mapping for this demonstration so as to focus the participants' attention only on the aspects of facial expression, gesture and movement that were relevant to the project hypotheses.


To test the significance of the results alluded to by the previous discussion, a statistical analysis of the data was conducted. This included a descriptive analysis of the results to determine mean, standard deviation and median values. Due to the discrete nature of the collected data, McNemar's test (McNemar, 1947; Sheskin, 2000; Somes, 1983) and the Stuart Maxwell test (Stuart, 1955; Maxwell, 1970; Everitt, 1977) were used to test the significance of the matched ordinal data values for the before and after demonstrations.


The null hypothesis states that the proportion of ordinal values in demonstration 2 (with FAML tags) is equal to the proportion of ordinal values in demonstration 1 (without FAML tags). This corresponds to no significant increase in rating values for any of the questions answered.


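$H_0: \pi_2 - \pi_1 = 0$, where $\pi_1$ and $\pi_2$ denote the proportions of ordinal values in demonstration 1 and demonstration 2 respectively.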


The alternate hypothesis states that the proportion of ordinal values in demonstration 2 is greater than the proportion of ordinal values in demonstration 1. This corresponds to the implementation and application of the FAML tags increasing the realism, believability, interest, life-likeness, communicativeness and expressiveness of the Talking Head animation.


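$H_1: \pi_2 - \pi_1 > 0$, with $\pi_1$ and $\pi_2$ the proportions of ordinal values in demonstration 1 and demonstration 2.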


| Question | McNemar's χ² | McNemar's p value (4 dp) | Stuart Maxwell χ² | Stuart Maxwell p value (4 dp) |
|---|---|---|---|---|
| Life-Like | 22.000 | 0.0000 | 19.577 | 0.0006 |
| Believability | 15.211 | 0.0001 | 14.923 | 0.0049 |
| Realistic | 8.895 | 0.0029 | 8.811 | 0.0460 |
| Interesting | 13.762 | 0.0002 | 12.872 | 0.0119 |
| Communicative | 16.667 | 0.0000 | 16.297 | 0.0026 |
| Expressive | 18.615 | 0.0000 | 18.094 | 0.0012 |


Table 4 McNemar's and Stuart Maxwell's p values for Storyteller character


The statistical analysis of the Storyteller dataset using McNemar's and Stuart Maxwell's statistic can be seen in Appendix G.


All tests were conducted using McNemar's and Stuart Maxwell's homogeneity tests at a 95 percent confidence level. McNemar's test examines before and after marginal homogeneity with respect to each individual category. The Stuart Maxwell test examines marginal homogeneity for all categories simultaneously.
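As an illustration of the per-question test, the following Python sketch computes the McNemar statistic without continuity correction, $\chi^2 = (b - c)^2 / (b + c)$, where $b$ is the number of respondents whose rating increased and $c$ the number whose rating decreased. The function name and the counts are hypothetical; the actual counts are those of Appendix G, and the use of the uncorrected form is an assumption on our part.

```python
import math


def mcnemar(increases: int, decreases: int):
    """Uncorrected McNemar chi-square (1 degree of freedom) and its p value."""
    discordant = increases + decreases
    if discordant == 0:
        raise ValueError("no discordant pairs: the statistic is undefined")
    chi2 = (increases - decreases) ** 2 / discordant
    # Survival function of a chi-square variable with 1 degree of freedom:
    # P(X > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# Hypothetical counts, chosen only for illustration.
chi2, p = mcnemar(increases=18, decreases=1)
print(f"chi2 = {chi2:.3f}, p = {p:.4f}")
```

With 18 increases and 1 decrease the statistic is $17^2 / 19 \approx 15.211$. Respondents whose rating did not change do not enter the statistic at all, which is why the test is suited to before and after comparisons of this kind.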


As seen from the results in table 4, all p values are less than 0.05 (5%), the threshold corresponding to a 95 percent confidence level. At this threshold there is less than a 5 percent chance of committing a Type I error, that is, rejecting the null hypothesis when it is in fact true. Therefore both McNemar's and Stuart Maxwell's tests produced p values that were significant. In the case of the Storyteller experiment, we can reject the null hypothesis and accept the alternate hypothesis that the FAML tags increased the rating values for all questions in the Storyteller Talking Head questionnaire, at a confidence level of 95 percent. The acceptance of the alternate hypothesis supports our project hypotheses that realistic animation of a Talking Head can be simulated using the FAML tags to "direct" facial gestures and expression, mimicking true non-verbal human communication. In addition, the successful use of FAML tags can create believable virtual characters, where believable is used in the sense that the user suspends their disbelief and interacts with the Talking Head as a real person.


Furthermore participants were asked which demonstration they believed better portrayed the character. Eighty-six percent said they preferred the second demonstration, whilst only 6 percent said they liked the first demonstration and 6 percent said neither demonstration was better. These values further support our hypotheses.


News presenter


The News presenter demonstration was conducted in the same manner as the Storyteller demonstration described previously. The News presenter character read a news article to the participants, without FAML tags in demonstration 1 and with FAML tags in demonstration 2. The participants were then asked the same questions as in the Storyteller demonstration, namely how life-like, believable, realistic, interesting, communicative and expressive they found the Talking Head.


From figure 23 it can be seen that across the range of questions all had increases in response rating of greater than 55 percent from demonstration 1 to demonstration 2.


All questions fared better than in the Storyteller demonstration. This may be attributed to the fact that participants were by then familiar with the questionnaire process and questions, and gave better ratings because they were more aware of the features of the demonstration.



















Figure 23 Comparative News presenter results from demonstration 1 Vs demonstration 2


Once again there were significant increases across all questions, with the expressiveness question rating the best with 32 percent of the values being significant increases in rating. The percentage of significant increases remained constant for other questions with little variation.


It is important to note that the percentages of decreases and significant decreases in rating values were greater than those seen in the Storyteller demonstration. This was particularly so for the questions relating to believability, realism, communication and expressiveness.


For participants whose ratings decreased, the main concern was the eyebrow movements seen in the News presenter demonstration. Comments included:



As stated previously, the eyebrow movement can be decreased by specifying its values in the personality module PST file. The friendly personality formed the basis for all demonstrations used in the questionnaire, to minimize the change in variables between demonstrations. It is clear that for serious news articles a more serious expression and less smiling is appropriate, and the literature supports this. For future implementations of the News presenter character, the personality would be content specific; however, for our demonstration and collection of results this was deemed inappropriate, as it would introduce another unwanted variable into the analysis and confound the results.


With regard to head nodding, observation of real news presenting (TV_NEWS 2000) makes it clear that head nodding is an integral gesture used by all news presenters when reading articles. The nodding coincides with accentuated speech.


As stated previously, McNemar's and Stuart Maxwell's tests for homogeneity were used to test significance. The null and alternate hypotheses used in the Storyteller experiment are also used for the News presenter experiment.
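For completeness, the Stuart Maxwell statistic can be sketched as follows. The sketch assumes a square table whose entry in row i and column j counts the respondents who gave rating i in demonstration 1 and rating j in demonstration 2; the function names are ours and the example table is hypothetical. One category is dropped before the covariance matrix is inverted, so a five-point scale yields a chi-square statistic with four degrees of freedom, provided every rating category is actually used.

```python
import math


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination."""
    n = len(b)
    aug = [[float(v) for v in a[i]] + [float(b[i])] for i in range(n)]
    for c in range(n):
        pivot = max(range(c, n), key=lambda r: abs(aug[r][c]))
        if abs(aug[pivot][c]) < 1e-12:
            raise ValueError("singular matrix: a rating category may be unused")
        aug[c], aug[pivot] = aug[pivot], aug[c]
        for r in range(c + 1, n):
            factor = aug[r][c] / aug[c][c]
            for j in range(c, n + 1):
                aug[r][j] -= factor * aug[c][j]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][j] * x[j] for j in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x


def stuart_maxwell(table):
    """Return (chi-square, degrees of freedom) for a k x k paired table."""
    k = len(table)
    row = [sum(table[i]) for i in range(k)]
    col = [sum(table[i][j] for i in range(k)) for j in range(k)]
    m = k - 1  # drop the last category; the full matrix would be singular
    d = [row[i] - col[i] for i in range(m)]
    s = [[row[i] + col[i] - 2 * table[i][i] if i == j
          else -(table[i][j] + table[j][i])
          for j in range(m)] for i in range(m)]
    x = solve(s, d)
    return sum(di * xi for di, xi in zip(d, x)), m


def chi2_sf_even_df(x, df):
    """P(X > x) for a chi-square variable with an even number of df."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("closed form shown here needs a positive even df")
    half = x / 2.0
    terms = sum(half ** j / math.factorial(j) for j in range(df // 2))
    return math.exp(-half) * terms


# Hypothetical 3 x 3 table of paired ratings, not the survey data.
chi2, df = stuart_maxwell([[4, 6, 2], [1, 5, 7], [0, 2, 8]])
print(f"chi2 = {chi2:.3f} on {df} degrees of freedom")
```

For a two-category table the statistic reduces to the uncorrected McNemar value, which gives a simple check on the implementation. Sparse tables in which some rating is never given make the matrix singular, and the sketch raises an error in that case rather than guessing.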



| Question | McNemar's χ² | McNemar's p value (4 dp) | Stuart Maxwell χ² | Stuart Maxwell p value (4 dp) |
|---|---|---|---|---|
| Life-Like | 18.615 | 0.0000 | 18.333 | 0.0010 |
| Believability | 13.500 | 0.0002 | 11.562 | 0.0021 |
| Realistic | 13.500 | 0.0002 | 11.323 | 0.0230 |
| Interesting | 14.440 | 0.0001 | 15.183 | 0.0043 |
| Communicative | 15.385 | 0.0001 | 13.585 | 0.0090 |
| Expressive | 18.241 | 0.0000 | 16.827 | 0.0020 |


Table 5 McNemar's and Stuart Maxwell's p values for News presenter character


The statistical analysis of the News presenter dataset using McNemar's and Stuart Maxwell's statistic can be seen in Appendix G.


At the 95 percent confidence level all p values are less than 0.05 (5%). Therefore we reject the null hypothesis and accept the alternate hypothesis that the FAML tags increased the rating values for all questions in the News presenter Talking Head questionnaire, at a confidence level of 95 percent.


From the data collected on which demonstration of the News presenter character the participants preferred, 90 percent indicated that they preferred the second News presenter demonstration (with FAML tags) as opposed to the first demonstration (without FAML tags) further supporting our hypothesis results. Only 3 percent indicated that they preferred neither and 7 percent indicated that they preferred the first demonstration.


Sales assistant


The Sales assistant demonstration was conducted in the same manner as both the Storyteller and News presenter demonstrations described previously. The Sales assistant in this demonstration is working for Domino's Pizzas, and in the demonstration exists within the context of a web page. The Talking Head Sales assistant is providing information on types of pizzas and informing the user on how to select and make purchases online.






















Figure 24 Comparative Sales assistant results for demonstration 1 Vs demonstration 2


From figure 24 it can be seen that across the range of all questions there are significantly more increases when compared to Storyteller results (Figure 22) and News presenter results (Figure 23). All percentage increases are greater than 65 percent with all questions having at least 35 percent of increases being significant. Compared to previous demonstrations of the Talking Head, the Sales assistant demonstration achieved a greater percentage of increased responses from all participants, for all questions when the FAML tags were used to animate the gestures, movements and expressions.


As highlighted in the discussion on the News presenter, these increases in rating values could be attributed to the familiarity that participants gained from having seen the two previous Storyteller and News presenter demonstrations.


Another aspect that may explain the higher percentage increases is the text itself. The spoken text presented in the previous demonstrations (Storyteller and News presenter) was listener orientated, in the sense that the Talking Head was simply talking to the viewer or telling the viewer information. The Sales assistant differs in that the Talking Head relates directly to the user by asking them to "look up at the menu bar" and "down to click on the submit button". The Storyteller and News presenter lacked this interaction between the user and the Talking Head. This interactivity could explain the high percentages of increased ratings for the questions about communication, expressiveness and believability.


Figure 24 also shows that there are still decreases in values from demonstration 1 to demonstration 2. The majority of the percentage decrease relates to only one participant. However, he did not support his ranking choices with any comments, so it is difficult to infer the reasons for his poor ranking of the second demonstration (with FAML tags).


For the Sales assistant experiment, the null hypothesis and alternate hypothesis were identical to those of the Storyteller and News presenter demonstrations.



| Question | McNemar's χ² | McNemar's p value (4 dp) | Stuart Maxwell χ² | Stuart Maxwell p value (4 dp) |
|---|---|---|---|---|
| Life-Like | 24.143 | 0.0000 | 21.007 | 0.0003 |
| Believability | 22.154 | 0.0000 | 22.292 | 0.0002 |
| Realistic | 19.593 | 0.0000 | 19.391 | 0.0007 |
| Interesting | 20.167 | 0.0000 | 18.437 | 0.0010 |
| Communicative | 22.154 | 0.0000 | 20.468 | 0.0004 |
| Expressive | 23.148 | 0.0000 | 22.217 | 0.0002 |


Table 6 McNemar's and Stuart Maxwell p values for Sales assistant character


The statistical analysis of the Sales assistant dataset using McNemar's and Stuart Maxwell's statistic can be seen in Appendix G.


As with the statistical analysis of the previous demonstrations, at the 95 percent confidence level all p values are less than 0.05 (5%). Therefore we reject the null hypothesis and accept the alternate hypothesis that the FAML tags increased the rating values for all questions in the Sales assistant Talking Head questionnaire, at a confidence level of 95 percent.


Of the 35 participants in the questionnaire, 91 percent thought that the second demonstration (with FAML tags) was better than the first (without FAML tags). Only 3 percent preferred the first demonstration, while 6 percent thought neither Sales assistant was preferable to the other. Participants who thought the first demonstration was better made comments such as:




As stated previously, the eyebrow movement can be minimized in future. The mouth, as discussed under delimitations, is not addressed by this project. The visemes used to animate it are currently under the control of the Text-To-Speech system.


The statistical analysis of all three characters has shown significant increases between rating values of realism, believability, interesting, life-like, communication and expressiveness of the Talking Head animation. This supports our hypothesis that realistic animation can be simulated using FAML tags to "direct" the facial gestures, movements and expressions, mimicking true non-verbal human-to-human communication that enable the Talking Head animation to portray believable characters.


6.2.3 Results of Phase Three



This section of the questionnaire was implemented to test the effectiveness of the emotive speech (TTS tags) in conjunction with the FAML tags. Respondents were presented with two demonstrations, the first with only TTS speech markup provided by the TTS tag system implemented by Stallo (2000) and the second with both TTS speech markup and FAML tags. Although not incorporated in the hypothesis of the project itself, it was included into the questionnaire to provide insight into the multi-modal aspect of the Talking Head. Emotive speech and synchronised facial gestures and expressions, as discussed by the literature review could better model the human communication process.


Questions such as "which portrayed the best Storyteller character", "which demonstration was more human like", "which was more expressive" and "which was more interesting" were asked and their corresponding sample percentages can be seen in table 7.



| Response | Best Storyteller | More natural / human like | More expressive | More interesting |
|---|---|---|---|---|
| Demo 1 | 9% | 14% | 6% | 9% |
| Demo 2 | 89% | 69% | 86% | 83% |
| Neither | 3% | 17% | 9% | 9% |


Table 7 TTS vs TTS and FAML results


As seen from table 7, the majority of the respondents chose demonstration 2 (with TTS and FAML tags) as the better of the two. However, when asked which was more "natural / human like", only 69 percent of respondents chose the second demonstration and 17 percent said neither was more "natural / human like". This is at least 14 percentage points lower than the proportion choosing demonstration 2 for any other question. This result could be attributed to the fact that "natural / human like" could have been interpreted as referring only to the physical aspects of facial features, such as the eyes, nose and mouth, and not to the overall performance of the Talking Head with regard to movement, gesture and expression. The physical aspects of the Talking Head did not change from the first demonstration to the second.


Most of the comments that related to the "natural / human like" aspects of the Talking Head where respondents preferred demonstration 1 or neither demonstration included:



The first comment indicates that it is most likely an observation of the physical aspects of the Talking Head, such as its hair, eyes, nose or mouth. The top lips are related to the visemes and are beyond the scope of this project, as too is the shape of the smile.


The comment on "too unexpressive" in the first demonstration and "too expressive" in the second demonstration indicates the need for the animator to alter the intensity of the tagged expressions. The FAML tag provides intensity values that can be used to tone down expressions if desired. As the animator gains more skill in the effects of the tags, the Talking Head animation will gain further realism and believability.


There were no comments on the emotive speech, from which one may conclude either that the participants did not notice the emotive speech, or that the emotive speech was so good that it did not require comment. Further analysis of the combined effect of the TTS and FAML tags is needed. This is discussed further in section 7.1, Future work.


Most of the comments on why participants chose demonstration 2 as being better were positive, with comments such as:




These comments support our hypotheses and highlight the improvement in Talking Head animation made through the use of FAML tags for animation markup.


6.3 Summary of Results


Through the graphical analysis and statistical analysis of the results obtained through the questionnaire, we have supported our hypotheses.


  1. Realistic animation of a Talking Head can be simulated using the FAML tags to "direct" facial gestures and expression, mimicking true non-verbal human communication.


  2. The successful use of FAML tags can create believable virtual characters, where "believable" is used in the sense that the user suspends his/her disbelief and interacts with the Talking Head as a real person.


All three virtual characters implemented (the Storyteller, News presenter and Sales assistant) were shown through McNemar's and Stuart Maxwell's tests to be a statistically significant improvement over a Talking Head not using FAML tags, in terms of being more expressive, better able to communicate, and looking and behaving more realistically through human movements and expressions. The higher results for demonstrations using FAML tags were attributed to the FAML tags implemented.


Section three, although not part of the project hypothesis testing, did highlight some positive feedback on the system and the need for the animator to improve the skilled use of the FAML tags. The FAML tags are tools used to aid the animator in scripting expressive and emotive characters that can nonverbally behave like real humans. It is up to the animators themselves to truly invoke the illusion of life.















Chapter 7



Conclusions





The FAQBot forms the focus of our project. The FAQBot Talking Head animation combines a TTS system, an MPEG-4 based FAE and an AI to produce a 3D talking head answering users' requests. As stated, the aim of this project is to implement a FAML to enable the control of the animated Talking Head to include facial expressions, gestures and emotions through the input text stream.


Our initial review of the literature encompassed the domains of human psychology and cognitive sciences, computer graphics, computer vision and human-machine interaction, to identify the factors that contribute to non-verbal communication of facial gestures, expressions and emotions in humans. From this we derived our subset of FAML tags to mimic the identified non-verbal behaviours. The FAML tags form the tools needed to realistically animate the Talking Head.


The subset of FAML tags derived from the study of non-verbal communication was grouped into five categories:


1) Facial expressions of emotion

2) Facial expressions

3) Eye behaviour

4) Brow gestures

5) Head movement


MPEG-4 was used for the animation of the Talking Head. The subset of FAML tags specified the movement of FAPs as defined by the MPEG-4 specification. The FAPs were used to display the facial expressions denoted by the FAML tags. The FAML tags were implemented to work in conjunction with the personality of the Talking Head, allowing smooth and continuous animation. Timing of gestures was synchronized to the audio clock, defined by the timing in the Talking Head synthesized speech.


With the inclusion of the FAML tags, we hypothesized that:


    1. Realistic animation of a Talking Head can be simulated using the FAML tags to "direct" facial gestures and expression, mimicking true non-verbal human communication.


    2. The successful use of FAML tags can create believable virtual characters. "Believable" is used in the sense that the user suspends their disbelief and interacts with the Talking Head as a real person.


To test the hypotheses we conducted a questionnaire to collect respondent data based on a series of Talking Head demonstrations. Three virtual characters were implemented using FAML tags to control the animation. All three characters (the Storyteller, News presenter and Sales assistant) were shown through McNemar's and Stuart Maxwell's tests to be a statistically significant improvement over a Talking Head not using FAML tags. This was determined in terms of being more expressive, better able to communicate, and looking and behaving more realistically through human movements and non-verbal behaviours. The higher results for demonstrations using FAML tags were attributed to the FAML tags implemented.


7.1 Future work


XML Compliancy


As discussed in the literature review, XML is an up and coming meta-language that has been successfully used to design markup languages for domain specific tasks. Stallo (2000) has shown that a text-to-speech markup can be successfully implemented in XML to allow the structuring of text into emotive sequences.


The investigation of XML for this project highlighted XML's simplicity, the high ordering of structure, extensibility and interoperability. However as defined in the delimitations (section 4.2), the FAML tags for facial animation used in this project were required to comply with the specification of the Fifth Framework consortium.


The FAML tag structure is not XML compliant and does not meet any standardized specifications. The structure of the FAML tags specified by the Fifth Framework consortium is based on the research of Ostermann et al. (1998). The current implementation of the FAML tags does not structure the input text stream into sequences of facial animation in the same manner that the TTS tags do for speech.


XML provides all the mechanisms that allow the flexibility, simplicity, interoperability and structuring of text that need to be enabled for FAML tags. In future, FAML tags need to become XML compliant to meet the demands placed on the animation of the Talking Head and the scripting of expression and movement.


The FAML module has foreseen the need for such compliance and as such FAML tags can be converted into XML compliant tags and used in XML compliant documents. However the FAML module has not implemented the XML system itself.


Extra FAML tags and tag style sheets


The FAML tags provided by the FAML module are just a subset of the numerous facial gestures, expressions and movements a real human head can produce. Future work includes adding extra FAML tags, to allow a greater range of tools to script and animate the Talking Head.


However, when applying FAML tags to character animation, not all facial gestures, expressions and emotions are performed in the same manner. A nod for a news presenter character may be different to a nod in a virtual lecturer character. A style sheet approach similar to that used in HTML pages, can be applied to automate the process of creating different facial animation styles for different characters.
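The style sheet idea can be illustrated with a small Python sketch in which per-character overrides are layered on top of default tag parameters, much as a cascading style sheet overrides default presentation. Every tag name, character name and parameter below is hypothetical and chosen only for illustration; none is taken from the FAML specification.

```python
# Default behaviour for each tag (hypothetical names and values).
DEFAULT_STYLE = {
    "nod": {"intensity": 50, "repeats": 1},
    "raise_brow": {"intensity": 40, "repeats": 1},
}

# Per-character overrides; only the parameters that differ are listed.
CHARACTER_STYLES = {
    "news_presenter": {"nod": {"intensity": 30}},
    "virtual_lecturer": {"nod": {"intensity": 70, "repeats": 2}},
}


def resolve_style(character: str, tag: str) -> dict:
    """Merge a character's overrides over the default parameters of a tag."""
    if tag not in DEFAULT_STYLE:
        raise KeyError(f"unknown tag: {tag}")
    params = dict(DEFAULT_STYLE[tag])
    params.update(CHARACTER_STYLES.get(character, {}).get(tag, {}))
    return params


print(resolve_style("news_presenter", "nod"))
print(resolve_style("virtual_lecturer", "nod"))
```

Under this arrangement the marked-up text stays the same for every character, and only the style table changes when a different character performs it.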


Automatic FAML tag markup


As mentioned in the delimitations, the automatic markup of text is beyond the scope of the project. However, the next logical step is to automate the process of markup, to include both the TTS emotive speech markup as well as the FAML gesture markup. This would enable the Talking Head to receive a stream of text, automatically include the markup of both speech and gesture and produce a Talking Head animation without the need for human animation or intervention. This would render the whole animation process completely automated.


Independent neck movement


As stated in the limitations (section 4.3) for this project, the neck for the current Talking Head is fixed and as such there is no independent neck movement. Future work will need to address this severe limitation as it has an adverse effect on the realism able to be portrayed in the Talking Head.


Head model translations


As indicated by the literature, the head moves towards and away from the listener to indicate conversational turn taking. Head model translation is not supported in this project; however, future work can include the addition of dynamic head model translation for the Talking Head animation.






Tag onset, apex and offset variances


In the current implementation of the FAML tags, the onset, apex and offset proportions for the temporal changes in facial gestures, expressions and emotions are fixed at 25 percent for onset, 50 percent for apex and 25 percent for offset. Future work can include a mechanism that enables the onset, apex and offset stages of tags to vary. This would enable the same FAML tag to produce temporally different facial expressions, gestures or emotions.
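A variable envelope of this kind could be sketched as below. The function and its parameters are hypothetical, and the linear ramps during onset and offset are an assumption made for simplicity; the actual interpolation used by the FAML module may differ.

```python
def envelope(t: float, onset: float = 0.25, apex: float = 0.50,
             offset: float = 0.25) -> float:
    """Fraction of peak intensity at normalised time t in [0, 1].

    The three stages are given as fractions of the tag duration and
    must sum to one. The defaults are the fixed 25/50/25 split.
    """
    if abs(onset + apex + offset - 1.0) > 1e-9:
        raise ValueError("onset, apex and offset must sum to 1")
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must lie in [0, 1]")
    if t < onset:
        return t / onset           # ramp up towards the apex
    if t <= onset + apex:
        return 1.0                 # hold at full intensity
    return (1.0 - t) / offset      # ramp back down to neutral


# Sample the default envelope and a quicker-onset, slower-offset variant.
for t in (0.0, 0.125, 0.5, 0.875, 1.0):
    print(t, envelope(t), envelope(t, onset=0.1, apex=0.3, offset=0.6))
```

Passing different stage fractions to the same function is enough to make one tag rise sharply and fade slowly, or the reverse, without changing its peak intensity.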




































Bibliography



Abrantes, G.A. and Pereira, F. (1999) MPEG-4 facial animation technology, Survey implementation and results. IEEE Transactions on Circuits and Systems for Video Technology, 9(2), 290-305.


Ambrosini, L., Costa, M., Lavagetto, F. and Pockaj, R. (1998) 3D head model calibration based MPEG-4 parameters. The 6th SPACS - IEEE International Workshop on Intelligent Signal Processing and Communication Systems. Melbourne, Australia.


Argyle, M. (1975) Bodily Communication. Methuen and Co. Ltd, London.


Argyle, M. and Cook, M. (1976) Gaze and Mutual gaze. Cambridge University Press.


Badler, N. I. (1995) A workshop on standards for facial animation. Computer Graphics, 66-67.


Bartlett, M.S., Hager, J.C., Ekman, P. and Sejnowski, T.J (1999). Measuring facial expressions by computer image analysis. Psychophysiology, 36:253-263.


Bates, J. (1994) The Role of emotion in believable Agents. Communications of the ACM, 37, 122-125.


Beard, S., Crossman, B., Cechner, P., and Marriott, A. (1999) FAQBot. PanSydney Area Workshop Visual Information Processing Held in University of Sydney, 40-43.


Becker, D. and Becker, P. D. (1994) Powerful Presentation Skills. Irwin.


Benoit, C., Pelachaud, C., and Suhm, B. (1999) Multimodal speech systems. In Gibbons, D. Moore, R. and Winski, R. Eds., Handbook of Standards and Resources in Spoken Language Systems, Volume supplement. Berlin: Mouton de Gruyter. to appear, http://werner.ira.uka.de/~bsuhm/.


Bergin, F. (1995) Successful Presentations. Director Books.


Beskow, J. (1996) Talking heads - communication, articulation and animation. In Proceedings Fonetik 96, Swedish Phonetics Conference. 53-56


Binsted, K. (1999) Character design for soccer commentary. In Minoru Asada and Hiroaki Kitano, editors, RoboCup-98: Robot Soccer World Cup II. Springer Verlag, Berlin.


Birdwhistell, R. L. (1970) Kinesics and Context: Essays on Body Motion and Communication. Philadelphia, PA: University of Pennsylvania Press.


Bizzi, E. (1974) The coordination of eye-head movement. Scientific American, 531(4).


Brody, M. and Kent, S. (1993) Power Presentations. Wiley.


Burtnyk, N. and Wein, M. (1971) Computer Generated Key Frame Animation. Journal of SMPTE 80, 149-153


Camras, L. (1992) Early development of emotional expression. In K.T. Strongman (ed), International Review of Studies on Emotion, Volume 1. New York: Wiley, 16-36.


Carroll, L. (1946). Alice's Adventures in Wonderland, Random House, New York.


Cassell, J., Steedman, M., Badler, N. I., Pelachaud, C., Stone., Douville, B., Prevost, S., and Achorn, B. (1994a) Modelling the interaction between speech and gestures. In Proceeding of 16th annual Conference of the Cognitive Science Society. 119-124


Cassell, J., Pelachaud, C., Badler, N.I., Steedman, M., Achorn, B., Beckett, T., Douville, B., Prevost, S. and Stone, M. (1994b) Animated conversation: rule-based generation of facial display, gesture and spoken intonation for multiple conversational agents. Computer Graphics (SIGGRAPH '94 Proceedings), 28(4): 413-420.


Cassell, J., Pelachaud, C., Badler, N., Steedman, M., Achorn, B., Becket, T., Douville, B., Prevost, S., and Stone, M. (1994c). Animated conversation: Rule-based generation of facial expression, gesture and spoken intonation for multiple conversational agents. In Proceedings of ACM SIGGRAPH '94.


Cassell, J. and McNeill, D. (1992) Gesture and the poetics of prose. Poetics Today, 12:375-404.


Cassell, J. and Stone, M. (1999) Living Hand to Mouth: Psychological Theories about Speech and Gesture in Interactive Dialogue Systems. AAAI Fall Symposium on Narrative Intelligence.


Chovil, N. (1991) Discourse orientated facial displays in conversation. Research on Language and Social Interaction, 25:163-194.


Cohen, M. M. and Massaro, D. W. (1993) Modelling coarticulation in synthetic speech. Models and Techniques in Computer Animation, Tokyo, Springer-Verlag.


Condon, W. S. and Ogston, W. D. (1971) Speech and body motion synchrony of the speaker-hearer. In The Perception of Language. Horton and Jenkins.


Cosatto, E. and Graf, H.P. (1998) Sample-based synthesis of photo-realistic Talking Heads. Computer Animation, Philadelphia, Pennsylvania, 103-110.


Ekman, P. (1979) About brows: emotional and conversational signals. Human ethology: claims of a new discipline: contributions to the colloquium, Cambridge University Press, Cambridge, England; New York. 169-248,2


Ekman, P. (1982) Emotion in the human face. Cambridge University Press.


Ekman, P. (1992) Facial expression of emotion: New findings, new questions. American Psychologist Society 3(1): 34-38.


Ekman, P. (1993) Facial expression and emotion. American Psychologist 48: 384-392.


Ekman, P. and Davidson, R. (1994) The Nature of Emotion: Fundamental Questions. New York: Oxford University Press.


Ekman, P. and Friesen, W. (1975) Unmasking the Face: A Guide to Recognizing Emotions from Facial Expressions. Englewood Cliffs, NJ: Prentice Hall.


Ekman, P. and Friesen, W. (1978) Facial Action Coding System, Consulting Psychologists Press, Inc.


Ekman, P. and Rosenberg, E. (1997) What the Face Reveals: Basic and Applied Studies of Spontaneous Expression Using the Facial Action Coding System. New York: Oxford University Press.


Ellyson, S. L. and Dovidio, J. F. (1985) Power Dominance and nonverbal behavior: Basic concepts and issues. In S. L. Ellyson and J. F. Dovidio, editors, Power, Dominance, and Nonverbal Behavior, chapter 1, pages 1-27. Springer-Verlag, New York.


Emmett, A. (1985) Digital portfolio: Tony de peltrie. Computer Graphics World, 8(10):72-77


Essa, I. A. (1994) Analysis, Interpretation, and synthesis of facial expressions. PhD thesis, MIT, Media Laboratory, Cambridge, MA, 1994.


Everitt, B.S. (1977) The analysis of contingency tables. London: Chapman & Hall.


Ezzat, T. and Poggio, T. (1997) Videorealistic Talking Faces: A Morphing Approach. In Proceedings of the AVSP'97 Workshop, 184-188.


Festival (2000) The Festival Speech Synthesis System. [Online] Available http://www.cstr.ed.ac.uk/projects/festival/ 25th November 2000


Flemming, B. and Dobbs, D. (1999) Animating Facial features and expressions. Charles River Media Inc. USA


Fridlund, A.J. (1994) Human Facial Expression: An Evolutionary View. San Diego: Academic Press.


Guiard-Marigny, T., Adjoudani, A., and Benoit., C. (1994) A 3-D model of the lips for visual speech synthesis. In Proc. Of the 2nd ESCA/IEEE workshop on Speech Synthesis, 49-52.


Hadar, U., Steiner, T.J., Grant, E.C. and Clifford Rose, F. (1983) Head movement correlates of juncture and stress at sentence level. Language and Speech, 26(2):117-129.


Harper, R. G., Wiens, A. N. and Matarazzo, J. D. (1978) Nonverbal Communication: The State of the Art. J. Wiley and Sons, New York.


Jones, C. (1989) Chuck Amuck: the Life and Times of an Animated Cartoonist. Farrar, Straus & Giroux, New York.


Kalra, P. (1993) An Interactive Multimodal Facial Animation System. PhD thesis, Swiss Federal Institute of Technology, Lausanne, Switzerland.


Kendon, A. (1967) Some functions of gaze direction in social interaction. Acta Psychologica, 26:22-63


Kendon, A. (1994) Do gestures communicate? A review. Research on Language and Social Interaction, 27(3):175-200.


Kochanek, D. H. U. and Bartels, R. H. (1984) Interpolating splines with local tension, continuity, and bias control. Computer Graphics, 18(3), 33-41


Kupsh, J. and Graves, P. R. (1993) How to Create High-Impact Business Presentations. NTC Business Books.


Kushner, M. (1996) Successful Presentations for Dummies. IDG Books Worldwide.


Leech, T. (1993) How to Prepare, Stage, and Deliver Winning Presentations. 2nd ed., AMACOM.


Lisetti, C. and Schiano, D. J. (2000) Automatic facial expression interpretation: Where human interaction, artificial intelligence and cognitive science intersect. Pragmatics and Cognition (Special Issue on Facial Information Processing and Multidisciplinary Perspective), 8(1):185-235.


Lavagetto, F. and Pockaj, R. (1999) The facial animation engine: Towards a high-level interface for the design of MPEG-4 compliant animated faces. IEEE Transactions on Circuits and Systems for Video Technology. 9(2).


Nagao, K. and Hasida, K. (1998) Automatic text summarization based on the global document annotation. Technical report, Sony Computer Science Laboratory.


Nagao, K. and Takeuchi, A. (1994) Speech dialogue with facial displays: multimodal human computer conversation. IEEE Association of Computational Linguistics (ACL-94), 102-109.


Noh, J. and Neumann, U. (2000) Talking Faces. Proceedings of the IEEE International Conference on Multimedia and Exposition.


Noma, T. and Badler, N. I. (1997) A virtual human presenter. In Proceedings of the IJCAI Workshop on Animated Interface Agents: Making Them Intelligent, 45-51.


Maestri, G. (1996) Digital Character Animation. New Riders Publishing, Indianapolis, Indiana.


Malandro, L. A., Barker, L. L., and Barker, D. A. (1989) Nonverbal Communication. Random House, New York, 2nd edition.


Marriott, A. (1992) Keyframe interpolation at Curtin. Technical Report 4, School of Computing, Curtin University of Technology, Western Australia.


Mauch, J. E. and Birch, J. W. (1983) Guide to the Successful Thesis and Dissertation, chapter 4, 70-73. Marcel Dekker, New York.


Maxwell, A.E. (1970) Comparing the classification of subjects by two independent judges. British Journal of Psychiatry,116, 651-655.


MBROLA (2000) The MBROLA Project : Towards a freely available multi-lingual speech synthesizer. [Online] Available http://tcts.fpms.ac.be/synthesis/mbrola.html 25th November 2000.


McNemar, Q. (1947) Note on the sampling error of the difference between correlated proportions or percentages. Psychometrika, 12, 153-157.


McNeill, D. (1992) Hand and Mind: What Gestures Reveal about Thought. University of Chicago Press.


Miller, Patrick W. (1981) Non-verbal Communication, Washington, DC: National Education Association.


Moubaraki, L., Ohya, J., and Kishino, F. (1995). Realistic 3D facial animation in virtual space teleconferencing. Proceedings of the 4th IEEE International Workshop on Robot and human communication, Ro-MAN'95, pages 253-258.


MPEG (1999) Overview of the MPEG-4 standard. ISO/IEC JTC1/SC29/WG11N2725. Seoul, South Korea.


Ostermann, J. (1998) Animation of Synthetic Faces in MPEG-4. Computer Animation, 49-51.


Ostermann, J., Beutnagel, M., Fischer, A., and Wang, Y. (1998) Integration of Talking Heads and text-to-speech synthesizers for visual TTS. In Proceedings of the International Conference on Speech and Language Processing, 143-147.



Parke, F.I. (1972) Computer generated animation of faces, Master's thesis, University of Utah, Salt Lake City, UT


Parke, F.I. (1982) Parametrized models for facial animation, IEEE Computer Graphics, 2(9):61-68


Parke, F.I. (1990) Parametrized facial animation revisited. State of the Art in Facial Animation, 26:44-61. ACM Siggraph'90 Course Notes.


Parke, F.I. (1991) Control parametrization for facial animation, Computer Animation '91, 45-58.


Parke, F.I. and Waters, K. (1996) Computer Facial Animation. A K Peters, Wellesley, Massachusetts.


Pearce, A., Wyvill, B., and Hill, D. R. (1986) Speech and expression: A computer solution to face animation. Graphics and Vision Interface '86, 136-140.


Pelachaud, C., Badler, N. I. and Steedman, M. (1991) Linguistic issues in facial animation. In N. M. Thalmann and D. Thalmann (Eds.) Computer Animation '91. Tokyo: Springer-Verlag.


Pelachaud, C., Badler, N. and Steedman, M. (1996) Generating Facial Expressions for Speech. Cognitive Science, 20(1).


Pelachaud, C., Badler, N. and Viaud, M-L. (1994) Final Report to NSF of the standards for facial animation workshop. Technical report, University of Pennsylvania.


Pelachaud, C. and Prevost, S. (1995) Talking heads: Physical, linguistic and cognitive issues in facial animation. Course Notes for Computer Graphics International '95.


Platt, S.M. and Badler, N. I. (1981) Animating Facial Expressions. Computer Graphics, 15(3), 245-252.


SABLE (1998) Draft Specification for sable version 1.0. Technical report, The Sable Consortium.


Sheperdson, R. H. (2000) The personality of a Talking Head, Computer science Honors dissertation, School of Computing, Curtin University of Technology.


Sheskin, D. J. (2000) Handbook of parametric and nonparametric statistical procedures (second edition). Boca Raton: Chapman & Hall.


Slater, M., Pertaub, D.-P., and Steed, A. (1999) Public speaking in virtual reality: Facing an audience of avatars. IEEE Computer Graphics and Applications, 6-9.


Sparks, R., Donnelly, J. and Best, J. (1999) Extracting Useful Information from Survey Data [Online] Available http://www.cmis.csiro.au/statline/1999/feb99.htm 24th November 2000


Stallo, J. (2000) Simulating emotional speech for a Talking Head, Computer science Honors dissertation, School of Computing, Curtin University of Technology. (Yet to be published)


St.Laurent, S. (1998) Why XML? [Online] Available http://www.simonstl.com/aritcles/whyxml.htm 17th November 2000


Somes, G. (1983) McNemar test. Encyclopedia of statistical sciences, vol. 5, S. Kotz & N. Johnson, eds., 361-363. New York: Wiley.


Stuart, A. (1955) A test for homogeneity of the marginal distributions in a two-way classification. Biometrika, 42, 412-416.


Terzopoulos, D. and Waters, K. (1990) Physically-based facial modelling, analysis, and animation. Journal of Visualization and Computer Animation, 1(2):73-90.


Terzopoulos, D. and Waters, K. (1993) Analysis and synthesis of facial image sequences using physical and anatomical models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 15(6):569-579.


Thomas, F. and Johnston, O. (1981) Disney Animation: The illusion of life. Abbeville Press, New York.


TV_NEWS (2000) 12 hours channel 7 news, 10 hours channel 9 news, 5 hours ABC news.


Walker, M. B. and Trimboli, C. (1983) The expressive function of the eye flash. Journal of Nonverbal Behaviour, 8(1):3-13.


Waters, K. (1987) A muscle model for animating 3D facial expression. Computer Graphics, 21: 17-24.


WDVL (2000) Web Developers Virtual Library; XML : Extensible Markup Language [Online] Available http://wdvl.internet.com/Authoring/Languages/XML/ 19th November 2000.


Webbink, P. (1986) The Power of the Eyes. Springer Publishing Company.


White, T. (1986) The Animators Workbook: Step-by-Step Techniques of Drawn Animation. Watson-Guptill Publications, Broadway, New York.


Williams, E. (1977) Experimental comparisons of face-to-face and mediated communication: A review. Psychological Bulletin, 84:963-976.


XML FAQ (2000) Frequently Asked Questions about the Extensible Markup Language : Originally maintained on behalf of the World Wide Web Consortium's XML Special Interest Group [Online] Available http://www.ucc.ie/xml/#FAQ-GENERAL 17th November 2000.


XML_namespace (1999) Namespaces in XML : World Wide Web Consortium [Online] Available http://www.w3.org/TR/1999/REC-xml-names-19990114/ 24th November 2000.


Yacoob, Y. and Davis, L. (1994) Computing spatio-temporal representations of human faces. In Computer Vision and Pattern Recognition Conference, IEEE Computer Society, 70-75.


Zajonc, R. (1994) Emotional expression and temperature modulation. In S. Van Goosen, N. Van de Poll, and J. Sergeant (eds), Emotions: Essays on Emotion Theory. Hillsdale, NJ: Lawrence Erlbaum, 3-27.









Appendix A





FAML tag structure as defined by the Fifth Framework consortium: Interface Project.

API description

For the API description and an example implementation, the reader is referred to APPENDIX C.


The Phoneme/Bookmark to FAP Converter (DIST, UNIGE)


This module converts the FAP bookmarks generated by the DM (bookmark table) and the phonemes generated by the TTS into either high level FAPs (for visemes) or low level FAPs, which the FBA Encoder then encodes into an FBA bitstream.

The Dialogue Manager outputs phonemes in the SAPI format and creates the bookmark table as defined in ISO/IEC 14496-3 Subpart 6 and ISO/IEC 14496-2 Annex C (to establish the correspondence between each DM bookmark and the TTS \mkr tag).

The timed phonemes and the bookmarks are passed to the Phoneme/Bookmark to FAP converter by calling the setPhoneme and setBookmark functions respectively.

The list of high level expressions could be extended in future beyond the 6 standard expressions defined in MPEG-4 to accommodate a wider variety of facial expressions.

The PBtoFAP returns the FAPs on a frame-by-frame basis each time getFAPframe is called. It also returns the status of its internal data (isEmpty).

The resetConvertor function can be used to clear any previous frame data that may be resident in the PBtoFAP memory buffer.


API description

struct tBookmark {
    int fap_number;
    int expression_select1;
    int expression_intensity1;
    int expression_select2;
    int expression_intensity2;
    int llfapvalue;          // valid only when fap_number > 2
    // as defined in document ISO/IEC 14496-2
    int transition_time;
    int time_curve;
};


class PBtoFAP {
public:
    PBtoFAP(int frame_rate = 25);   // Constructor
    ~PBtoFAP();                     // Destructor

    // Reset the convertor
    void resetConvertor();

    //
    // Set parameters
    //

    // Set a phoneme given by the TTS (API phoneme expression).
    // The duration is given by the next phoneme.
    void setPhoneme(QWORD currentTime, DWORD Phoneme);

    // Set a bookmark
    void setBookmark(QWORD currentTime, tBookmark Bookmark);

    //
    // Get FAPs
    //

    // isEmpty is set to true once all FAPs have been retrieved (the internal
    // buffer is empty). In this case, the returned FAPs hold the neutral expression.
    FAPs *getFAPframe(bool *isEmpty);
};
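The call sequence described above (buffer timed phonemes and bookmarks, then retrieve frames until isEmpty is reported) can be illustrated with a small Python model. This is only a sketch of the calling protocol, not the consortium's converter: the class name ToyPBtoFAP, the helper drain, the millisecond time unit and the dictionary used for each frame are assumptions made for illustration, and no real FAP values are computed.

```python
from bisect import bisect_right


class ToyPBtoFAP:
    """Hypothetical stand-in that mimics only the PBtoFAP call protocol."""

    def __init__(self, frame_rate=25):
        self.frame_ms = 1000.0 / frame_rate  # 40 ms per frame at 25 fps
        self.reset_convertor()

    def reset_convertor(self):
        """Clear any buffered phonemes, bookmarks and frame position."""
        self.phonemes = []     # (time_ms, phoneme_id), in arrival order
        self.bookmarks = []    # (time_ms, bookmark)
        self.cursor_ms = None  # start time of the next frame to emit

    def set_phoneme(self, time_ms, phoneme):
        self.phonemes.append((time_ms, phoneme))
        if self.cursor_ms is None:
            self.cursor_ms = float(time_ms)

    def set_bookmark(self, time_ms, bookmark):
        self.bookmarks.append((time_ms, bookmark))

    def get_fap_frame(self):
        """Return (frame, is_empty); frame is None once the buffer is drained."""
        # A phoneme's duration is given by the next phoneme, so the last
        # buffered phoneme only marks where the buffered speech ends.
        if len(self.phonemes) < 2 or self.cursor_ms >= self.phonemes[-1][0]:
            return None, True
        times = [t for t, _ in self.phonemes]
        active = self.phonemes[bisect_right(times, self.cursor_ms) - 1][1]
        end_ms = self.cursor_ms + self.frame_ms
        marks = [b for t, b in self.bookmarks if self.cursor_ms <= t < end_ms]
        frame = {"time_ms": self.cursor_ms, "phoneme": active, "bookmarks": marks}
        self.cursor_ms = end_ms
        return frame, False


def drain(converter):
    """Call get_fap_frame repeatedly until the converter reports it is empty."""
    frames = []
    while True:
        frame, is_empty = converter.get_fap_frame()
        if is_empty:
            return frames
        frames.append(frame)
```

In this model a caller would feed phonemes and bookmarks as the TTS produces them, call drain (or poll get_fap_frame once per rendered frame), and call reset_convertor before starting a new utterance.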



Extension of bookmarks

Initially, the Dialogue Manager uses the high level expression FAP (FAP n°2) to pass expressions in bookmarks:

<FAP 2 fields T C> where

fields = expression_select1, expression_intensity1, expression_select2 and expression_intensity2.

Expression_select is defined according to ISO/IEC 14496-2 Annex C, Table C-3 (expression select).
In future, however, it may be necessary to support a wider variety of high level expressions. In that case the Dialogue Manager will use an expression_select number greater than 6, and the Phoneme/Bookmark to FAP converter will convert it to low level FAPs.

In any case, the Phoneme/Bookmark to FAP converter will send a FAP stream containing high and/or low level FAPs (fully MPEG-4 compatible).
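As an illustration of the bookmark layout, the following Python sketch parses a bookmark string into the fields of tBookmark and flags expression_select values above 6, which the text says would need conversion to low level FAPs. It assumes that T and C map onto transition_time and time_curve, and that a low level bookmark (fap_number greater than 2) carries a single value followed by T and C; the names Bookmark, parse_bookmark and needs_low_level_conversion are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Bookmark:
    """Mirror of the tBookmark fields listed in the API description."""
    fap_number: int
    expression_select1: int = 0
    expression_intensity1: int = 0
    expression_select2: int = 0
    expression_intensity2: int = 0
    llfapvalue: int = 0  # meaningful only when fap_number > 2
    transition_time: int = 0
    time_curve: int = 0


def parse_bookmark(text):
    """Parse '<FAP n fields T C>' into a Bookmark (assumed field order)."""
    body = text.strip()
    if not (body.startswith("<FAP") and body.endswith(">")):
        raise ValueError("not a FAP bookmark: %r" % text)
    numbers = [int(token) for token in body[4:-1].split()]
    if not numbers:
        raise ValueError("bookmark has no FAP number: %r" % text)
    fap_number, fields = numbers[0], numbers[1:]
    if fap_number == 2:
        if len(fields) != 6:
            raise ValueError("FAP 2 expects four expression fields plus T and C")
        select1, intensity1, select2, intensity2, t, c = fields
        return Bookmark(2, select1, intensity1, select2, intensity2, 0, t, c)
    if fap_number > 2:
        if len(fields) != 3:
            raise ValueError("a low level FAP expects one value plus T and C")
        value, t, c = fields
        return Bookmark(fap_number, llfapvalue=value, transition_time=t, time_curve=c)
    raise ValueError("FAP number %d is not handled in this sketch" % fap_number)


def needs_low_level_conversion(bookmark):
    """True when an extended expression (select number above 6) is requested."""
    selects = (bookmark.expression_select1, bookmark.expression_select2)
    return bookmark.fap_number == 2 and max(selects) > 6
```

For example, under these assumptions parse_bookmark("<FAP 2 1 40 0 0 500 1>") yields a bookmark with expression_select1 = 1, expression_intensity1 = 40, transition_time = 500 and time_curve = 1, for which needs_low_level_conversion returns False.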


The FBA Encoder (WIN)


The FBA Encoder is integrated within the TTS module. As the TTS module produces speech, a corresponding MPEG-4 FBA bitstream containing FBA actions (visemes, expressions, gestures) is encoded. The FBA Encoder will take the form of a library and will be replaceable.

API description

// Object Functionality
#define FBA_NONE (0)
#define FBA_FACE (1)
#define FBA_BODY (2)
#define FBA_ALL (FBA_FACE|FBA_BODY)

// Frame types
#define FBA_FRAME_NONE (0)
#define FBA_FRAME_KEYFRAME (1)
#define FBA_FRAME_INTRA (2)
#define FBA_FRAME_PREDICTED (3)

// Parameters
#define FBA_FRAME_RATE (0)
#define FBA_QUANTIZATION_FACTOR (1)
#define FBA_COMPRESSION_RATIO (2)
#define FBA_FRAME_TYPE (3)
#define FBA_NUMBER_OF_FRAMES (4)
#define FBA_BITRATE (5)
#define FBA_QUALITY_SPATIAL (6)
#define FBA_QUALITY_TEMPORAL (7)
#define FBA_KEYFRAME_DISTANCE (8)

class FBAEncoder
{
public:
    // Constructor. Sets the Object Functionality.
    FBAEncoder(int objectFunctionality);

    // Set and get the encoder parameters
    int setParam(int fbaObjectMask, int paramId, float value);
    int getParam(int fbaObjectMask, int paramId, float value);

    // Encode a frame of FAPs and/or BAPs.
    // The resulting bitstream is written to outBuffer.
    // Returns the size in bytes of the returned bitstream,
    // or -1 on failure; the size is also returned in the size parameter.
    int encodeFrame(FAPs *fap, BAPs *bap,
                    unsigned char *outBuffer, unsigned char *size);
};






Appendix B






Scheme Festival word expansion list source







;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
;;
;; CURTIN UNIVERSITY OF TECHNOLOGY FAML - Honours Project
;; Quoc Hung Huynh
;; (09525748)
;; Supervisor : Andrew Marriott
;; Copyright (c) 2000
;; All Rights Reserved.
;;
;; 5th October 2000
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;


;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
;;
;; The scheme module receives a token tree parameter and returns a
;; simplified list of the contents of the token tree. The returned
;; list contains the words in the utterance as well as their equivalent
;; expanded word versions. For example, the url www.hotmail.com is
;; expanded to form the words (w) (w) (w) (dot) (hotmail) (dot) (com).
;; This information is essential to maintain the synchronisation of
;; the facial markup with the synthesised speech.
;;
;; Each word in the utterance has a time calculated from the phoneme
;; durations contained within the word itself. Therefore, to maintain
;; synchronisation between the location of the tag and the input
;; utterance, each expanded word needs to have its associated timings.
;;
;; This module produces a list that can be parsed to determine which
;; words have been expanded and their equivalent expansions.
;;
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;


(require_module 'parser)


;; GST_simple module returns each atom contained within the list
(define (GST_simple token_tree)
  (cond
   ((not token_tree) nil)
   ((car)
    ;; terminal node
    (list
     (car)
     (car (car token_tree))))
   (t
    (cons
     (car (car token_tree))
     (mapcar GST_simple (cdr token_tree))))))


;; Calling module, mapcar function processes each element in the list.
;; Elements can be either atoms or lists
(define (GST_simplify_tree trees)
  (mapcar GST_simple trees))


;; Print to file
(define (GST_print simple_tree)
  (format file "%l\n" simple_tree))


;; Mainline
(define (GST_process_scheme utt1)
  (set! tree (utt.relation_tree utt1 "Token"))
  (set! simple_tree (GST_simplify_tree tree))
  (set! file (fopen "c:\/Users\/facial_animation\/faith\/server\/Gst\/Scheme\/GST_scheme.txt" "w" ))
  (mapcar GST_print simple_tree)
  (fclose file))
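The idea behind the module, reducing each token tree to the token name followed by its expanded words, can be sketched in Python as follows. This is an illustrative analogue rather than a translation of the Scheme source: it assumes each tree node is a (name, children) pair, which is a simplification of the structure Festival returns, and the function names are hypothetical.

```python
def simplify(node):
    """Reduce a (name, children) node to [name, child, child, ...] recursively."""
    name, children = node
    return [name] + [simplify(child) for child in children]


def expansions(token_trees):
    """Pair each token with the list of words it was expanded into."""
    return [(name, [word for word, _ in children]) for name, children in token_trees]


def is_expanded(token, words):
    """A token counts as expanded when its words differ from the token itself."""
    return words != [token]


# The URL example from the module header, plus an ordinary word.
trees = [
    ("www.hotmail.com",
     [("w", []), ("w", []), ("w", []), ("dot", []),
      ("hotmail", []), ("dot", []), ("com", [])]),
    ("is", [("is", [])]),
]

for token, words in expansions(trees):
    print(token, words, is_expanded(token, words))
```

A tag placed before "www.hotmail.com" in the input text would then have to be aligned with the first of the seven expanded words, since the word timings are calculated for the expanded words.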

















































Appendix C







News presenter text markup












<?xml version="1.0"?>

<!DOCTYPE sml SYSTEM "./sml-v01.dtd">


<sml>


<p>


<neutral>


Detectives investigating the brutal <emph_GST 2 4 1000/> <r_roll 2 3 1000/> murder of Sarah Payne, have received <hr 2 4 800/> <nod 2 3 800/> <r_roll 2 3 800/> two hundred fresh calls from <hl 2 5 900/> <l_roll 2 6 900/> <emph_GST 2 7 900/> the public.


This comes after <2b 2 9 700/> an appeal to locate a lorry driver, who <hr 2 4 800/> <r_roll 2 5 800/> <nod 2 4 800/> might have seen Sarah's <hl 2 4 800/> <l_roll 2 5 800/> <nod 2 4 800/> killer.


</neutral>

</p>


</sml>
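The gesture tags in this markup carry three bare numbers rather than named attributes, so a generic XML parser would be expected to reject them. A minimal Python sketch of how such tags might be separated from the spoken text is given below. It is an assumption about one possible pre-processing step, not the project's parser; the names TAG and split_markup are hypothetical, and the three numbers are kept as uninterpreted positional values.

```python
import re

# A tag name followed by exactly three integers, e.g. <emph_GST 2 4 1000/>.
TAG = re.compile(r"<(\w+)\s+(\d+)\s+(\d+)\s+(\d+)\s*/>")


def split_markup(text):
    """Return (plain_text, tags).

    Each tag is (name, (n1, n2, n3), word_index), where word_index is the
    position in the plain text of the word that the tag precedes.
    """
    tags = []
    words = []
    position = 0
    for match in TAG.finditer(text):
        words.extend(text[position:match.start()].split())
        values = tuple(int(v) for v in match.group(2, 3, 4))
        tags.append((match.group(1), values, len(words)))
        position = match.end()
    words.extend(text[position:].split())
    return " ".join(words), tags


sample = "have received <hr 2 4 800/> <nod 2 3 800/> <r_roll 2 3 800/> two hundred fresh calls"
plain, tags = split_markup(sample)
print(plain)
for tag in tags:
    print(tag)
```

Recording the index of the following word keeps each tag attached to a position in the word sequence, which is the information needed later to look up that word's timing in the synthesised speech.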


Appendix D







Sales Assistant text Markup











<?xml version="1.0"?>

<!DOCTYPE sml SYSTEM "./sml-v01.dtd">


<sml>


<p>

<neutral>


Hi, <smile 2 6 3000/> and <l_roll 2 4 800/><emph_GST 2 3 800/> thankyou for choosing Dominoes Pizzas online.


From the <lu 2 5 3000/> top menu bar you can choose all 7 of our <smile 2 5 4000/> famous pizzas, ranging from <smile 2 4 4000/> the <2b 2 9 700/>succulent seafood to the <smile 2 4 2000/> <r_roll 2 3 800/><emph_GST 2 3 800/> tantalizing supreme. Our pizzas come in <l_roll 2 3 1000/><emph_GST 2 3 1000/> 4 different sizes, small, medium, large and the new <surprise 2 3 2000/> extravaganza. Which you can select from the <lr 2 5 1900/><lu 2 5 1900/> top left of your screen.


On the <ll 2 5 4000/> right of the screen you can see the extra toppings to add to your pizza, just <l_roll 2 3 2000/><emph_GST 2 4 2000/> click on the topping and drag it onto the pizza.


Once your <smile 2 5 4000/> satisfied with your choice, just click on the order button located at the <ld 2 6 2500/> <ll 2 6 2500/> bottom right of your screen.


Its just that easy.


Once <smile 2 5 3000/> again, Thankyou for choosing dominoes pizzas.



</neutral>

</p>



</sml>









Appendix E







Storyteller text Markup











<?xml version="1.0"?>

<!DOCTYPE sml SYSTEM "./sml-v01.dtd">


<sml>


<p>

<neutral>


The enormous dragon, ambled slowly to his feet. He gave a <blink 2 10 500/> blink, then a <r_roll 2 8 500/> <right_wink 2 10 500/> wink and reared <r_roll 2 5 2000/><lu 2 7 2000/> his ugly head towards the sky.


At first the villagers were confused <confused 2 8 4000/> by what they saw. But then the people began to stare with a look of <surprise 2 5 2000/> surprise and then awe. This quickly turned to <fear 2 5 3500/> fear as the people realized what was actually about to happen.


With a cheeky grin<smile 2 6 3000/> the dragon, lept <lu 2 10 2000/> high into the air and then looking down <ld 2 6 4000/> onto the village. breathed a ball of fire from his ravenous mouth.


</neutral>

</p>





</sml>












Appendix F





Questionnaire used in experiment





Curtin University of Technology

School of Computing


FAML Talking Head Questionnaire



Thank you very much for taking the time to fill in this questionnaire. The Talking Head application that you are about to see is an ongoing joint project between the School of Computing and the Fifth Framework consortium based in Genoa, Italy. The project hopes to create a realistic talking head that is able to truly mimic human-to-human interaction.


You are about to take part in a demonstration that will hopefully provide important feedback to help improve upon the current work. Any further insight you are able to provide to the project is greatly appreciated.


You do NOT have to take part in the questionnaire.


All details you provide will be treated in accordance with all protocols of confidentiality and anonymity.


Section 1 Background Details




1.1 Age ____


1.2 Gender


1.3 Nationality : __________________


1.4 In which country have you lived the longest? ___________________________


1.5 Is English your first language? :


1.6 Have you seen an animated talking head before? [ ] Yes [ ] No

If so, can you remember where? ___________________________________________________


YOU WILL NOW BE SHOWN A SHORT VIDEO

PLEASE DO NOT TURN THE PAGE UNTIL TOLD TO DO SO. THANK YOU








Section 2 Realism




DEMONSTRATION 1


2.1 On a scale of 1 to 5, do you think the talking head is? (Please fill in the appropriate circle)


robot-like ○ ○ ○ ○ ○ life-like

not-believable ○ ○ ○ ○ ○ believable [1]

fake ○ ○ ○ ○ ○ realistic [2]

boring ○ ○ ○ ○ ○ interesting


[1] Believable is used in the context of how believably you think the talking head portrays a real person.

[2] Realistic is used in the context of how physically real you think the talking head is, in terms of a real person.


2.2 Considering ONLY the visual [3] actions of the animated head, how WELL do you think the animated head is COMMUNICATING to you?


poor ○ ○ ○ ○ ○ very good


[3] Visual actions only include the facial expressions, facial gestures and head movements.


Why? (Comments on your choice)

____________________________________________________________________________________

____________________________________________________________________________________

____________________________________________________________________________________



2.3 Considering ONLY the visual [4] actions of the animated talking head, how EXPRESSIVE do you think it is? (Please fill in the appropriate circle)


not expressive ○ ○ ○ ○ ○ very expressive


[4] Visual actions only include the facial expressions, facial gestures and head movements.


Why? (Comments on your choice)

____________________________________________________________________________________

____________________________________________________________________________________

____________________________________________________________________________________








YOU WILL NOW BE SHOWN A SHORT VIDEO

PLEASE DO NOT TURN THE PAGE UNTIL TOLD TO DO SO. THANK YOU






DEMONSTRATION 2


2.4 On a scale of 1 to 5, do you think the talking head is? (Please fill in the appropriate circle)


robot-like ○ ○ ○ ○ ○ life-like

not-believable ○ ○ ○ ○ ○ believable [1]

fake ○ ○ ○ ○ ○ realistic [2]

boring ○ ○ ○ ○ ○ interesting


[1] Believable is used in the context of how believably you think the talking head portrays a real person.

[2] Realistic is used in the context of how physically real you think the talking head is, in terms of a real person.


2.5 Considering ONLY the visual [3] actions of the animated head, how WELL do you think the animated head is COMMUNICATING to you?


poor ○ ○ ○ ○ ○ very good


[3] Visual actions only include the facial expressions, facial gestures and head movements.


Why? (Comments on your choice)

____________________________________________________________________________________

____________________________________________________________________________________


2.6 Considering ONLY the visual [4] actions of the animated talking head, how EXPRESSIVE do you think it is? (Please fill in the appropriate circle)


not expressive ○ ○ ○ ○ ○ very expressive


[4] Visual actions only include the facial expressions, facial gestures and head movements.


Why? (Comments on your choice)

____________________________________________________________________________________

____________________________________________________________________________________



2.7 Which Talking head do you think was more believable in terms of portraying a "Story telling" [5] character?


[5] A Story telling character narrates a story or reads one out loud.

Why? (Comments on your choice)

____________________________________________________________________________________

____________________________________________________________________________________





YOU WILL NOW BE SHOWN A SHORT VIDEO

PLEASE DO NOT TURN THE PAGE UNTIL TOLD TO DO SO. THANK YOU

DEMONSTRATION 3


3.1 On a scale of 1 to 5, do you think the talking head is? (Please fill in the appropriate circle)


robot-like ○ ○ ○ ○ ○ life-like

not-believable ○ ○ ○ ○ ○ believable [1]

fake ○ ○ ○ ○ ○ realistic [2]

boring ○ ○ ○ ○ ○ interesting


[1] Believable is used in the context of how believably you think the talking head portrays a real person.

[2] Realistic is used in the context of how physically real you think the talking head is, in terms of a real person.


3.2 Considering ONLY the visual [3] actions of the animated head, how WELL do you think the animated head is COMMUNICATING to you?


poor ○ ○ ○ ○ ○ very good


[3] Visual actions only include the facial expressions, facial gestures and head movements.


Why? (Comments on your choice)

____________________________________________________________________________________

____________________________________________________________________________________


3.3 Considering ONLY the visual [4] actions of the animated talking head, how EXPRESSIVE do you think it is? (Please fill in the appropriate circle)


not expressive ○ ○ ○ ○ ○ very expressive


[4] Visual actions only include the facial expressions, facial gestures and head movements.


Why? (Comments on your choice)

____________________________________________________________________________________

____________________________________________________________________________________


















YOU WILL NOW BE SHOWN A SHORT VIDEO

PLEASE DO NOT TURN THE PAGE UNTIL TOLD TO DO SO. THANK YOU




DEMONSTRATION 4


3.4 On a scale of 1 to 5, do you think the talking head is? (Please fill in the appropriate circle)


robot-like ○ ○ ○ ○ ○ life-like

not-believable ○ ○ ○ ○ ○ believable [1]

fake ○ ○ ○ ○ ○ realistic [2]

boring ○ ○ ○ ○ ○ interesting


[1] Believable is used in the context of how believably you think the talking head portrays a real person.

[2] Realistic is used in the context of how physically real you think the talking head is, in terms of a real person.


3.5 Considering ONLY the visual [3] actions of the animated head, how WELL do you think the animated head is COMMUNICATING to you?


poor ○ ○ ○ ○ ○ very good


[3] Visual actions only include the facial expressions, facial gestures and head movements.


Why? (Comments on your choice)

____________________________________________________________________________________

____________________________________________________________________________________



3.6 Considering ONLY the visual [4] actions of the animated talking head, how EXPRESSIVE do you think it is? (Please fill in the appropriate circle)


not expressive ○ ○ ○ ○ ○ very expressive


[4] Visual actions only include the facial expressions, facial gestures and head movements.


Why? (Comments on your choice)

____________________________________________________________________________________

____________________________________________________________________________________



3.7 Which Talking head do you think was more believable in terms of portraying a "News Presenter" character?


Why? (Comments on your choice)

____________________________________________________________________________________

____________________________________________________________________________________




YOU WILL NOW BE SHOWN A SHORT VIDEO

PLEASE DO NOT TURN THE PAGE UNTIL TOLD TO DO SO. THANK YOU




DEMONSTRATION 5


3.8 On a scale of 1 to 5, do you think the talking head is? (Please fill in the appropriate circle)


robot-like ○ ○ ○ ○ ○ life-like

not-believable ○ ○ ○ ○ ○ believable [1]

fake ○ ○ ○ ○ ○ realistic [2]

boring ○ ○ ○ ○ ○ interesting


[1] Believable is used in the context of how believably you think the talking head portrays a real person.

[2] Realistic is used in the context of how physically real you think the talking head is, in terms of a real person.


3.9 Considering ONLY the visual [3] actions of the animated head, how WELL do you think the animated head is COMMUNICATING to you?


poor ○ ○ ○ ○ ○ very good


[3] Visual actions only include the facial expressions, facial gestures and head movements.


Why? (Comments on your choice)

____________________________________________________________________________________

____________________________________________________________________________________


3.10 Considering ONLY the visual [4] actions of the animated talking head, how EXPRESSIVE do you think it is? (Please fill in the appropriate circle)


not expressive ○ ○ ○ ○ ○ very expressive


[4] Visual actions only include the facial expressions, facial gestures and head movements.


Why? (Comments on your choice)

____________________________________________________________________________________

____________________________________________________________________________________


















YOU WILL NOW BE SHOWN A SHORT VIDEO

PLEASE DO NOT TURN THE PAGE UNTIL TOLD TO DO SO. THANK YOU






DEMONSTRATION 6


3.11 On a scale of 1 to 5, do you think the talking head is? (Please fill in the appropriate circle)


robot-like ○ ○ ○ ○ ○ life-like

not-believable ○ ○ ○ ○ ○ believable [1]

fake ○ ○ ○ ○ ○ realistic [2]

boring ○ ○ ○ ○ ○ interesting


[1] Believable is used in the context of how believably you think the talking head portrays a real person.

[2] Realistic is used in the context of how physically real you think the talking head is, in terms of a real person.


3.12 Considering ONLY the visual [3] actions of the animated head, how WELL do you think the animated head is COMMUNICATING to you?


poor ○ ○ ○ ○ ○ very good


[3] Visual actions only include the facial expressions, facial gestures and head movements.


Why? (Comments on your choice)

____________________________________________________________________________________

____________________________________________________________________________________

____________________________________________________________________________________



3.13 Considering ONLY the visual [4] actions of the animated talking head, how EXPRESSIVE do you think it is? (Please fill in the appropriate circle)


not expressive ○ ○ ○ ○ ○ very expressive


[4] Visual actions only include the facial expressions, facial gestures and head movements.


Why? (Comments on your choice)

____________________________________________________________________________________

____________________________________________________________________________________

____________________________________________________________________________________



3.14 Which Talking head do you think was more believable in terms of portraying a "Sales Assistant" character?


Why? (Comments on your choice)

____________________________________________________________________________________

____________________________________________________________________________________



YOU WILL NOW BE SHOWN A SHORT VIDEO

PLEASE DO NOT TURN THE PAGE UNTIL TOLD TO DO SO. THANK YOU


Section 4



DEMONSTRATION 7 & 8



4.1 Which talking head do you think portrayed the best "Story teller" character?


Why? (Comments on your choice)

____________________________________________________________________________________

____________________________________________________________________________________

____________________________________________________________________________________

4.2 Which talking head do you think was more natural / human-like?


Why? (Comments on your choice)

____________________________________________________________________________________

____________________________________________________________________________________

____________________________________________________________________________________

4.3 Which talking head do you think was more expressive?


Why? (Comments on your choice)

____________________________________________________________________________________

____________________________________________________________________________________

____________________________________________________________________________________



4.4 Which talking head do you find more interesting?


Why? (Comments on your choice)

____________________________________________________________________________________

____________________________________________________________________________________

____________________________________________________________________________________


End of Questionnaire

THANK YOU VERY MUCH FOR YOUR HELP



Appendix G






Statistical Analysis of Storyteller, News presenter and Sales assistant character demonstrations.



Life-Like Story Teller


5 categories

AFTER is row variable

BEFORE is column variable

ordered categories

                 BEFORE
AFTER      1    2    3    4    5
  1        0    0    0    0    0
  2        0    4    0    0    0
  3        1    5    3    0    0
  4        0    2   12    6    0
  5        1    0    0    1    0

Total number of cases: 35

Tests of Individual Proportions

---------------------------------------------------------------------
             Frequency           Proportion
                                 (Base Rate)
Level    ----------------    ----------------    Chi-
 (k)     AFTER    BEFORE     AFTER    BEFORE     squared(a)      p
---------------------------------------------------------------------
  1         0        2       0.000    0.057      exact test   0.5000
  2         4       11       0.114    0.314      exact test   0.0156
  3         9       15       0.257    0.429       2.000       0.1573
  4        20        7       0.571    0.200      11.267       0.0008*
  5         2        0       0.057    0.000      exact test   0.5000
---------------------------------------------------------------------
(a) or exact test
* p < Bonferroni-adjusted significance criterion of 0.013.
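The per-level results above can be reproduced with a McNemar test applied to each category in turn. The following is a minimal sketch, not the program that produced the printout: the function name `category_test` and the cut-off `EXACT_BELOW` are assumptions. The printout only shows that the exact test was used with 9 or fewer discordant cases and chi-squared with 15 or more, so the value 10 is a guess that fits these data.

```python
import math

# AFTER (rows) x BEFORE (columns) counts for "Life-Like Story Teller",
# copied from the table above.
TABLE = [
    [0, 0, 0, 0, 0],
    [0, 4, 0, 0, 0],
    [1, 5, 3, 0, 0],
    [0, 2, 12, 6, 0],
    [1, 0, 0, 1, 0],
]

# Assumed cut-off (hypothetical): use the exact binomial test when fewer
# than this many cases changed into or out of the category.
EXACT_BELOW = 10


def category_test(table, k):
    """McNemar test that category k (0-based) is equally frequent AFTER and BEFORE.

    Returns (chi_squared or None, p); chi_squared is None for the exact test.
    """
    both = table[k][k]
    after_only = sum(table[k]) - both                  # rated k AFTER only
    before_only = sum(row[k] for row in table) - both  # rated k BEFORE only
    n = after_only + before_only
    if n < EXACT_BELOW:
        # Two-sided exact binomial test with success probability 0.5.
        smaller = min(after_only, before_only)
        tail = sum(math.comb(n, i) for i in range(smaller + 1)) / 2 ** n
        return None, min(1.0, 2 * tail)
    chi2 = (after_only - before_only) ** 2 / n
    # Chi-squared survival function for 1 degree of freedom.
    return chi2, math.erfc(math.sqrt(chi2 / 2))


RESULTS = [category_test(TABLE, k) for k in range(len(TABLE))]
for level, (chi2, p) in enumerate(RESULTS, start=1):
    label = "exact test" if chi2 is None else f"{chi2:.3f}"
    print(f"{level}  {label:>10}  {p:.4f}")
```

Each p value is then compared with the Bonferroni-adjusted criterion of 0.013 quoted in the table footnote.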

Stuart-Maxwell Test of Overall Marginal Homogeneity

-------------------------------------------------------

Chi-squared = 19.557 df = 4 p = 0.0006
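The Stuart-Maxwell statistic can be recomputed from the AFTER by BEFORE table. The sketch below assumes the standard form of the test (drop one category, then compute d' S^-1 d from the marginal differences d and their covariance matrix S). The helper names `solve` and `stuart_maxwell` are hypothetical, and the closed-form p value is valid for 4 degrees of freedom only.

```python
import math

# AFTER (rows) x BEFORE (columns) counts for "Life-Like Story Teller".
TABLE = [
    [0, 0, 0, 0, 0],
    [0, 4, 0, 0, 0],
    [1, 5, 3, 0, 0],
    [0, 2, 12, 6, 0],
    [1, 0, 0, 1, 0],
]


def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        # Partial pivoting: bring the largest remaining entry to the diagonal.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def stuart_maxwell(table):
    """Return (chi_squared, df) for the test of marginal homogeneity."""
    k = len(table)
    rows = [sum(r) for r in table]
    cols = [sum(r[j] for r in table) for j in range(k)]
    # The k marginal differences sum to zero, so the last one is dropped.
    d = [float(rows[i] - cols[i]) for i in range(k - 1)]
    s = [
        [
            float(rows[i] + cols[i] - 2 * table[i][i]) if i == j
            else -float(table[i][j] + table[j][i])
            for j in range(k - 1)
        ]
        for i in range(k - 1)
    ]
    x = solve(s, d)
    return sum(di * xi for di, xi in zip(d, x)), k - 1


CHI2, DF = stuart_maxwell(TABLE)
# Chi-squared survival function; this closed form holds for df = 4 only.
P = math.exp(-CHI2 / 2) * (1 + CHI2 / 2)
print(f"Chi-squared = {CHI2:.3f}  df = {DF}  p = {P:.4f}")
```

Pure-Python elimination is used here only to keep the example free of external libraries; any linear-algebra routine would serve equally well.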

[Bar chart: Marginal Distributions of Categories for AFTER (**) and BEFORE (==). x-axis: category number or level (1 to 5); y-axis: proportion of cases (0 to 0.571).]

***TESTS OF EQUAL THRESHOLDS***

Four-fold tables tested

0 0 2 33

4 0 9 22

13 0 15 7

33 0 2 0

Tests of Individual Thresholds

---------------------------------------------------------------------
         Proportion of cases
            below level k          Threshold(a)
Level    ----------------    ----------------    Chi-
 (k)     AFTER    BEFORE     AFTER    BEFORE     squared(b)      p
---------------------------------------------------------------------
  2      0.000    0.057      0.000   -1.579      exact test   0.5000
  3      0.114    0.371     -1.204   -0.328      exact test   0.0039*
  4      0.371    0.800     -0.328    0.842      15.000       0.0001*
  5      0.943    1.000      1.579    0.000      exact test   0.5000
---------------------------------------------------------------------
(a) for probit model
(b) or exact test
* p < Bonferroni-adjusted significance criterion of 0.013.
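The "proportion of cases below level k" and probit threshold columns follow directly from the marginal frequencies. A minimal sketch, assuming the threshold is the inverse standard normal CDF of the cumulative proportion; the function name `thresholds` is hypothetical. Where the proportion is exactly 0 or 1 the probit is undefined, so the sketch returns `None` in the places where the printout shows 0.000.

```python
from statistics import NormalDist

# Marginal frequencies per category (levels 1 to 5), from the
# "Tests of Individual Proportions" table for "Life-Like Story Teller".
AFTER = [0, 4, 9, 20, 2]
BEFORE = [2, 11, 15, 7, 0]


def thresholds(freqs):
    """Return (level, proportion below level, probit threshold) for levels 2..K."""
    total = sum(freqs)
    below = 0
    result = []
    for level in range(2, len(freqs) + 1):
        below += freqs[level - 2]  # add the category just beneath this level
        proportion = below / total
        # The inverse normal CDF is undefined at 0 and 1.
        z = NormalDist().inv_cdf(proportion) if 0 < proportion < 1 else None
        result.append((level, proportion, z))
    return result


AFTER_THRESHOLDS = thresholds(AFTER)
BEFORE_THRESHOLDS = thresholds(BEFORE)
for (level, pa, za), (_, pb, zb) in zip(AFTER_THRESHOLDS, BEFORE_THRESHOLDS):
    print(level, f"{pa:.3f}", f"{pb:.3f}", za, zb)
```

A lower threshold AFTER than BEFORE at the same level means fewer ratings fell below that level once the second demonstration was shown.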


McNemar Test of Overall Bias

or Directional Change

--------------------------------------------

Cases where AFTER level is higher: 22

Cases where BEFORE level is higher: 0


Chi-squared = 22.000 df = 1 p < 0.0001
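The overall bias test uses only the off-diagonal cells of the AFTER by BEFORE table. As a rough sketch (variable names are illustrative), cases below the diagonal were rated higher AFTER, cases above it were rated higher BEFORE, and the McNemar statistic is the squared difference of the two counts divided by their sum.

```python
import math

# AFTER (rows) x BEFORE (columns) counts for "Life-Like Story Teller".
TABLE = [
    [0, 0, 0, 0, 0],
    [0, 4, 0, 0, 0],
    [1, 5, 3, 0, 0],
    [0, 2, 12, 6, 0],
    [1, 0, 0, 1, 0],
]

size = len(TABLE)
# Row index greater than column index: the AFTER rating is the higher one.
after_higher = sum(TABLE[i][j] for i in range(size) for j in range(size) if i > j)
# Row index less than column index: the BEFORE rating is the higher one.
before_higher = sum(TABLE[i][j] for i in range(size) for j in range(size) if i < j)

chi2 = (after_higher - before_higher) ** 2 / (after_higher + before_higher)
# Chi-squared survival function for 1 degree of freedom.
p = math.erfc(math.sqrt(chi2 / 2))

print(f"AFTER higher: {after_higher}, BEFORE higher: {before_higher}")
print(f"Chi-squared = {chi2:.3f}  df = 1  p = {p:.6f}")
```

The 13 cases on the diagonal, where the rating did not change, do not enter the statistic.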










Believable Story Teller

5 categories

AFTER is row variable

BEFORE is column variable

ordered categories

                 BEFORE
AFTER      1    2    3    4    5
  1        0    0    0    0    0
  2        0    3    1    0    0
  3        1    5    6    0    0
  4        0    1    8    7    0
  5        1    0    2    0    0

Total number of cases: 35

***TESTS OF MARGINAL HOMOGENEITY***

Four-fold tables tested

0 0 2 33

3 1 6 25

6 6 11 12

7 9 0 19

0 3 0 32
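Each four-fold table above collapses the 5 x 5 table around one category. The sketch below assumes the four numbers are ordered as: rated k both times, rated k AFTER only, rated k BEFORE only, rated k neither time. That ordering is an inference, not something the printout states, but it reproduces all five rows listed above. The function name `fourfold` is hypothetical.

```python
# AFTER (rows) x BEFORE (columns) counts for "Believable Story Teller".
TABLE = [
    [0, 0, 0, 0, 0],
    [0, 3, 1, 0, 0],
    [1, 5, 6, 0, 0],
    [0, 1, 8, 7, 0],
    [1, 0, 2, 0, 0],
]


def fourfold(table, k):
    """Collapse the table to a 2 x 2 table for category k (0-based).

    Returns (both, after_only, before_only, neither); this ordering is assumed.
    """
    total = sum(map(sum, table))
    both = table[k][k]
    after_only = sum(table[k]) - both
    before_only = sum(row[k] for row in table) - both
    neither = total - both - after_only - before_only
    return (both, after_only, before_only, neither)


FOURFOLD = [fourfold(TABLE, k) for k in range(len(TABLE))]
for cells in FOURFOLD:
    print(*cells)
```

Only the two middle cells of each row, the discordant cases, drive the individual proportion tests that follow.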

Tests of Individual Proportions

---------------------------------------------------------------------
             Frequency           Proportion
                                 (Base Rate)
Level    ----------------    ----------------    Chi-
 (k)     AFTER    BEFORE     AFTER    BEFORE     squared(a)      p
---------------------------------------------------------------------
  1         0        2       0.000    0.057      exact test   0.5000
  2         4        9       0.114    0.257      exact test   0.1250
  3        12       17       0.343    0.486       1.471       0.2253
  4        16        7       0.457    0.200      exact test   0.0039*
  5         3        0       0.086    0.000      exact test   0.2500
---------------------------------------------------------------------
(a) or exact test
* p < Bonferroni-adjusted significance criterion of 0.013.

Stuart-Maxwell Test of Overall Marginal Homogeneity

-------------------------------------------------------

Chi-squared = 14.923 df = 4 p = 0.0049







[Bar chart: Marginal Distributions of Categories for AFTER (**) and BEFORE (==). x-axis: category number or level (1 to 5); y-axis: proportion of cases (0 to 0.486).]

***TESTS OF EQUAL THRESHOLDS***

Four-fold tables tested

0 0 2 33

3 1 8 23

16 0 12 7