Working Draft 13 March 2001

This version:

http://www.interface.computing.edu.au/documents/VHML/2001/WD-VHML-20010313

Latest version:

http://www.interface.computing.edu.au/documents/VHML

Previous version:

http://www.interface.computing.edu.au/documents/VHML

Editors:

Andrew Marriott

Simon Beard

John Stallo

Quoc Huynh


Copyright ©2001 Curtin University of Technology, InterFace. All Rights Reserved. W3C liability, trademark, document use and software licensing rules apply.


Status of this Document

This section describes the status of this document at the time of its publication. Other documents may supersede this document. The latest status of this document series is maintained at the Curtin InterFace Website.


This is the 13 March 2001 Working Draft of the "Virtual Human Markup Language Specification".


This working draft relies on several other standards - the various sub-languages of VHML use and extend these standards.

Abstract

This document describes a Virtual Human Markup Language. The language is designed to accommodate the various aspects of Human-Computer Interaction with regards to Facial Animation, Body Animation, Dialogue Manager interaction, Text to Speech production, Emotional Representation plus Hyper and Multi Media information. [Input here: am I missing any required sub-system?]


It will use / build on existing (de facto) standards such as those specified by the W3C Voice Browser Activity, and will describe new languages to accommodate functionality that is not catered for.


The language will be XML/XSL based and will consist of the following sub-systems:

Emotion Markup Language (EML)

Facial Animation Markup Language (FAML)

Body Animation Markup Language (BAML)

Speech Markup Language (SML)

Dialogue Manager Markup Language (DMML)

Hyper Text Markup Language (HTML)


The language will use XML Namespaces for inheritance of existing standards.


Although general in nature, the intent of this language is to facilitate the natural and realistic interaction of a Talking Head or Talking Human with a user via a Web page or application. One specific intended use can be found in the deliverables of the Interface project (http://www.ist-interface.org/).



Figure 1 The user->Dialogue Manager->user data flow



Table of Contents

Status of this Document
Abstract
Terminology and Design Concepts
Rendering Processes
Document Generation, Applications and Contexts
The Language Structure
Virtual Human Markup Language (VHML)
Root Element
vhml
Miscellaneous Elements
embed
Emotion Markup Language (EML)
Emotions
Emotion Default Attributes
Notes:
anger
joy == happy
neutral
sadness
fear
disgust
surprise
dazed
confused
bored
Other Virtual Human Emotional Responses
Notes:
agree
disagree
emphasis
smile
shrug
Emotional Markup Language Examples
Facial Animation Markup Language (FAML)
FAML Default Attributes
Direction/Orientation
Notes
anger
joy == happy
neutral
sadness
fear
disgust
surprise
confused
bored
look_left
look_right
look_up
look_down
head_left
head_right
head_up
head_down
eyes_left
eyes_right
eyes_up
eyes_down
head_left_roll
head_right_roll
EyeBrows
Notes:
eyebrow_up
eyebrow_down
eyebrow_squeeze
Blinks/Winks
Notes
blink
double_blink
left_wink
right_wink
Hyper Text Markup Language (HTML)
Body Animation Markup Language (BAML)
anger
joy == happy
neutral
sadness
fear
disgust
surprise
confused
bored
Dialogue Manager Markup Language (DMML)
Dialogue Manager Response
List of DMML elements:
Recognised variable names
Speech Markup Language (SML)
Speech Markup Language Default Attributes
xml:lang
anger
joy == happy
neutral
sadness
fear
disgust
surprise
confused
bored
p == paragraph
s == sentence
say-as
phoneme
voice
emphasis
break
prosody
audio
mark
emphasise_syllable == emphasize_syllable
pause
pitch
Conformance
Conforming Virtual Human Markup Document Fragments
Conforming Stand-Alone Virtual Human Markup Language Documents
Conforming Virtual Human Markup Language Processors
The Rendering
References
Acknowledgements


Terminology and Design Concepts

The design and standardization process has adopted the approach of the Speech Synthesis Markup Requirements for Voice Markup Languages published December 23, 1999 by the W3C Voice Browser Working Group.


The following items were the key design criteria.

Rendering Processes

A rendering system that supports the Virtual Human Markup Language will be responsible for rendering a document as visual and spoken output and for using the information contained in the markup to render the document as intended by the author.


Document creation: A text document provided as input to the system may be produced automatically, by human authoring, or through a combination of these forms. The Virtual Human Markup Language defines the form of the document.


Document processing: The following are the nine major processing steps undertaken by a VHML system to convert marked-up text input into automatically generated output. The markup language is designed to be sufficiently rich so as to allow control over each of the steps described below so that the document author (human or machine) can control or direct the final rendered output of the Virtual Human.


  1. XML Parse: An XML parser is used to extract the document tree and content from the incoming text document. The structure, tags and attributes obtained in this step influence each of the following steps.

  2. Culling of un-needed VHML tags: For example, at this stage any tags that produce audio may be removed when the final rendering device/environment does not support audio; similarly for other tags. It should be noted that since the timing synchronisation is based upon vocal production, the spoken text may need to be processed regardless of the output device's capabilities.

  3. Structure analysis: The structure of a document influences the way in which a document should be read. For example, there are common speaking and acting patterns associated with paragraphs and sentences.

- Markup support: Various elements defined in the VHML markup language explicitly indicate document structures that affect the visual and spoken output.

- Non-markup behavior: In documents and parts of documents where these elements are not used, the VHML system is responsible for inferring the structure by automated analysis of the text, often using punctuation and other language-specific data. [How good could we make this?]

  4. Text normalization: All written languages have special constructs that require a conversion of the written form (orthographic form) into the spoken form. Text normalization is an automated process of the TTS system that performs this conversion. For example, for English, when "$200" appears in a document it may be spoken as "two hundred dollars". Similarly, "1/2" may be spoken as "half", "January second", "February first", "one of two" and so on.

- Markup support: The "say-as" element can be used in the input document to explicitly indicate the presence and type of these constructs and to resolve ambiguities. The set of constructs that can be marked includes dates, times, numbers, acronyms, currency amounts and more. The set covers many of the common constructs that require special treatment across a wide number of languages but is not and cannot be a complete set.

- Non-markup behavior: For text content that is not marked with the say-as element the TTS system is expected to make a reasonable effort to automatically locate and convert these constructs to a speakable form. Because of inherent ambiguities (such as the "1/2" example above) and because of the wide range of possible constructs in any language, this process may introduce errors in the speech output and may cause different systems to render the same document differently. [What is the BAP equivalent of this text normalisation?]

  5. Text-to-phoneme conversion: Once the system has determined the set of words to be spoken it must convert those words to a string of phonemes. A phoneme is the basic unit of sound in a language. Each language (and sometimes each national or dialect variant of a language) has a specific phoneme set: e.g. most US English dialects have around 45 phonemes. In many languages this conversion is ambiguous since the same written word may have many spoken forms. For example, in English, "read" may be spoken as "reed" (I will read the book) or "red" (I have read the book).

    Another issue is the handling of words with non-standard spellings or pronunciations. For example, an English TTS system will often have trouble determining how to speak some non-English-origin names; e.g. "Tlalpachicatl" which has a Mexican/Aztec origin.

- Markup support: The "phoneme" element allows a phonemic sequence to be provided for any word or word sequence. This provides the content creator with explicit control over pronunciations. The "say-as" element may also be used to indicate that text is a proper name that may allow a TTS system to apply special rules to determine a pronunciation.

- Non-markup behavior: In the absence of a "phoneme" element the TTS system must apply automated capabilities to determine pronunciations. This is typically achieved by looking up words in a pronunciation dictionary and applying rules to determine other pronunciations. Most TTS systems are expert at performing text-to-phoneme conversions so most words of most documents can be handled automatically.

  6. Prosody analysis: Prosody is the set of features of speech output that includes the pitch (also called intonation or melody), the timing (or rhythm), the pausing, the speaking rate, the emphasis on words and many other features. Producing human-like prosody is important for making speech sound natural and for correctly conveying the meaning of spoken language.

- Markup support: The "emphasis" element, "break" element and "prosody" element may all be used by document creators to guide the TTS system in generating appropriate prosodic features in the speech output.

- Non-markup behavior: In the absence of these elements, TTS systems are expert (but not perfect) in automatically generating suitable prosody. This is achieved through analysis of the document structure, sentence syntax, and other information that can be inferred from the text input.

  7. Waveform production: The phonemes and prosodic information are used by the TTS system in the production of the audio waveform. There are many approaches to this processing step so there may be considerable platform-specific variation.

  8. Facial and Body Animation production: Timing information will be used to synchronise the spoken text with facial gestures and expressions as well as with body movements and gestures.

  9. Rendering the multiple streams (Audio, Graphics, Hyper and Multi Media) onto the output device(s). [XSL Transformation - here or in the earlier stage?]

[Need info about the FAP and BAP production in here]

Document Generation, Applications and Contexts

There are many classes of document creator that will produce marked-up documents to be spoken by a VHML system. Not all document creators (including human and machine) have access to information that can be used in all of the elements or in each of the processing steps described in the previous section. The following are some of the common cases.

It is important that any XML elements or tags that are part of VHML use existing tags specified in existing (de facto) or developing standards (for example HTML or SSML). This will aid in minimising learning curves for new developers as well as maximising opportunities for the migration of legacy data.


The Language Structure


Figure 2 The VHML Language Structure



VHML uses the languages shown in Figure 2 to facilitate the direction of a Virtual Human interacting with a user via a Web page or stand-alone application. In response to a user enquiry, the Virtual Human will have to react in a realistic and humane way using appropriate words, voice, facial and body gestures. For example, a Virtual Human that has to give some bad news to the user - "I'm sorry Dave, I can't find that file you want." - may speak in a sad way, with a sorry face and with a bowed body stance. In a similar way, a different message may be delivered with a happy voice, a smiley face and with a lively body.
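
For illustration only, such a response might be marked up roughly as follows, using elements defined later in this document (the attribute values are arbitrary):

<vhml>
<p>
<sadness intensity="70">
<head_down duration="2000"/>
I'm sorry Dave, <pause length="short"/> I can't find that file you want.
</sadness>
</p>
</vhml>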


The following sections detail the individual XML based languages which make this possible through VHML.

Virtual Human Markup Language (VHML)

Root Element

The Virtual Human Markup Language is an XML application. The root element is vhml. See the section on Conformance.

<?xml version="1.0"?>

<vhml>

... the body ...

</vhml>


vhml

Description:

Root element that encapsulates all other vhml elements.

Attributes: none.

Properties: root node, can only occur once.

Example:

<vhml>

<p>

<happy>

The vhml element encapsulates all other elements

</happy>

</p>

</vhml>


Notes: Should we allow <viewset> and <view> a la <frame> and <frameset>? This would allow multiple rendered scenes plus a Virtual Human with an HTML page for hyper information.

Miscellaneous Elements

embed

Description:

Gives the ability to embed foreign file types within a VHML document such as sound files, MML files etc., and for them to be processed appropriately.

Attributes:

Name

Description

Values

type

Specifies the type of file that is being embedded. (Required)

audio - embedded file is an audio file.

mml - an mml file is embedded.

[What values should we have here?]

src

Gives the path to the embedded file. (Required)

A character string.

Properties: empty.

Example:

<embed type="mml" src="songs/aaf.mml"/>


Emotion Markup Language (EML)

Emotions

The following elements will affect the emotion shown by the Virtual Human. These elements will affect the voice, face and body.


Emotion Default Attributes

Each element has at least 3 attributes associated with it:


Name

Description

Values

Default

intensity

This value ranges from 0 to 100 and represents a percentage value of the maximum intensity of that particular facial gesture, expression or emotion.

0 - 100

100


duration

The duration value represents the time span in seconds or milliseconds that the element expression, gesture or emotion will persist in the Virtual Human animation.

A numeric value representing time (conforms to the Times attribute from the CSS specification).

Until closing element


mark

This attribute can be used to set an arbitrary mark at a given place in the text, so that, for example, an engine can report back to the calling application that it has reached the given location.

Character-string identifier for this tag.

No default - optional attribute
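
As an illustration of these default attributes (the values shown are arbitrary), an emotion element might be written as:

<sadness intensity="60" duration="4s" mark="bad_news">
I am afraid the picnic has been cancelled.
</sadness>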


Notes:


EML emotion elements can be placed in sequence to produce a seamless flow from one emotion to the other. Emotion elements can also be blended together at the same instant to produce different expressions and emotions entirely, as desired.
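
As a sketch of such sequencing (how blending would be expressed is still an open question, noted below), two emotion elements may simply follow one another:

<happy>
It was a wonderful day,
</happy>
<sadness>
until the rain started.
</sadness>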


[How would we do this? Contribution attributes which are combined to produce 100% emotion? No contribution value means 100% of that emotion?]


OTHER EMOTIONS?????


Should the TAG names be nouns (sadness, anger) or verbs (sad, angry)?

Should we also allow subjective durations - short, medium, long - similar to the pause element?

anger

Description:

Simulates the effect of anger on the rendering (i.e. generates a Virtual Human that looks and sounds angry).

Attributes: Default EML Attributes.

Properties: Can contain other non-emotion elements.

Example:

<anger>

I would not give you the time of day

</anger>



joy == happy

Description:

Simulates the effect of happiness on the rendering (i.e. generates a Virtual Human that looks and sounds joyful).

Attributes: Default EML Attributes.

Properties: Can contain other non-emotion elements.

Example:

<joy>

I have some wonderful news for you.

</joy>


neutral

Description:

Gives a neutral intonation to the Virtual Human's appearance and sound.

Attributes: Default EML Attributes.

Properties: Can contain other non-emotion elements.

Example:

<neutral>

I can sometimes sound non-commital like this.

</neutral>



sadness

Description:

Simulates the effect of sadness on the rendering (i.e. generates a Virtual human that looks and sounds sad).

Attributes: Default EML Attributes.

Properties: Can contain other non-emotion elements.

Example:

<sadness>

Honesty is hardly ever heard.

</sadness>


fear

Description:

Simulates the effect of fear on the rendering (i.e. generates a Virtual Human that looks and sounds afraid).

Attributes: Default EML Attributes.

Properties: Can contain other non-emotion elements.

Example:

<fear>

I am afraid of flying.

</fear>


disgust

Description:

Simulates the effect of disgust on the rendering (i.e. generates a Virtual Human that looks and sounds disgusted).

Attributes: Default EML Attributes.

Properties: Can contain other non-emotion elements.

Example:

<disgust>

How could you eat Roquefort cheese!

</disgust>


surprise

Description:

Simulates the effect of surprise on the rendering (i.e. generates a Virtual Human that looks and sounds surprised).

Attributes: Default EML Attributes.

Properties: Can contain other non-emotion elements.

Example:

<surprise>

I did not expect to find that in my lasagne!

</surprise>



dazed

Description:

Simulates the effect of being dazed on the rendering (i.e. generates a Virtual Human that looks and sounds dazed).

Attributes: Default EML Attributes.

Properties: Can contain other non-emotion elements.

Example:

<dazed>

Did you get the number of that truck?

</dazed>


confused

Description:

Simulates the effect of confusion on the rendering (i.e. generates a Virtual Human that looks and sounds confused).

Attributes: Default EML Attributes.

Properties: Can contain other non-emotion elements.

Example:

<confused>

If this is Tuesday, then this must be Linköping.

</confused>


bored

Description:

Simulates the effect of boredom on the rendering (i.e. generates a Virtual Human that looks and sounds bored).

Attributes: Default EML Attributes.

Properties: Can contain other non-emotion elements.

Example:

<bored>

Writing specifications is real fun.

</bored>



Other Virtual Human Emotional Responses

The following elements will accommodate other well known human emotional reactions. These will affect the voice, face and body of the Virtual Human.


[Should these be EML?]


Notes:

1: The timing is such that the action is performed at the place where the element is (i.e. it depends on what has been spoken/acted out before this element is met). This must take into account Text Normalisation differences between what the text is and what is actually spoken.


A <smile intensity="50" duration="5000"/>

little dog goes into

<head_left_roll intensity="40" duration="1200"/> <agree intensity="30" duration="1200"/>

a saloon in the Wild West, and

<head_right_roll intensity="60" duration="1000"/> <agree intensity="30" duration="1000"/>

<head_left intensity="40" duration="1000"/> beckons to the bartender.


2: These elements also have intensity and duration attributes as for the EML elements. The duration must be specified.



agree

Description:

The agree element animates a nod of the Virtual Human. The agree element animation is broken into two sections: the head raise and then the head lower.

Observations have shown that there is a raise of the head before the nod is initiated. The agree element mimics this and 10 percent of the duration for the agree element is allocated for the head raise, with an intensity of 10 percent of the authored intensity value; the other 90 percent is allocated to the head lower.

The agree element can typically be used to gesture "yes" or "agreement". Only the vertical angle of the head is altered during the element animation, the eye gaze is still focused forward.

[Body animation for this element?]

[Should % be an attribute?]

Attributes: Default EML Attributes.

duration must have a value.

Properties: none (Atomic element).

Example:

That's certainly <agree duration="1000"/>right Olly.


disagree

Description:

The disagree element animates a shake of the head. The element animates two shakes; a single shake is considered to be a head movement from the left to the right.

The disagree element can be used as a facial gesture for "no" or "disagree".

The element only affects the horizontal displacement of the head and no other facial features are affected.

Animation involves moving first to the left, then right, repeated and then returning to the central plane.

[Body animation for this element?]

[Other attributes? - # of shakes, left or right first?]

Attributes: Default EML Attributes.

duration must have a value.

Properties: none (Atomic element).

Example:

I <disagree duration="2000"/> will not have that smelly cheese on my spaghetti



emphasis

Description:

The emphasis element is very similar in animation to the agree element. The difference is that the emphasis element incorporates a lowering of the eyebrows into the nod itself, as described by Pelachaud and Prevost (1995). This serves to further emphasize or accentuate words in the spoken text.

The emphasis element similarly has raise and lower stages as found in the agree element animation. Note, however, that the eyebrows are lowered at the same rate as the nod; if a different intensity of eyebrow lowering is needed, the emphasis element can be used in conjunction with the eyebrow_down element to produce an emphasis animation with a greater or more subtle lowering of the eyebrows.

[Body animation for this element?]

Attributes: Default EML Attributes.

duration must have a value.

Properties: none (Atomic element).

Example:

I <emphasis duration="500"/> will not buy this record, it is scratched.
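
Where a stronger lowering of the eyebrows is wanted, the two elements might be combined as the description above suggests (a sketch only; attribute values are illustrative):

I <emphasis duration="500"/><eyebrow_down intensity="80" duration="500"/> will not buy this record, it is scratched.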


smile

Description:

The smile element, as the name suggests, animates the expression of a smile in the Talking Head animation.

The mouth is widened and the corners pulled back towards the ears. The larger the intensity value for the smile element, the greater the intensity of the smile. However, a value that is too large produces a rather "cheesy" looking grin and can look disconcerting or phony. This, however, can be used to the animator's advantage if a mischievous grin or masking smile is required.

The smile element is generally used to start sentences and is used quite often when accentuating positive or cheerful words in the spoken text (Pelachaud and Prevost, 1995).

Attributes: Default EML Attributes.

duration must have a value.

Properties: none (Atomic element).

Example:

<smile duration="5000"/> Potatoes must be almost as good as chocolate to eat!
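
As noted above, a deliberately high intensity gives an exaggerated grin (a sketch; the values are illustrative):

<smile intensity="90" duration="3000"/> Trust me, nothing can possibly go wrong.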



shrug

Description:

The shrug element animation mimics the facial and body expression "I don't know".

A facial shrug consists of the head tilting back, the corners of the mouth pulled downward and the inner eyebrow tilted upwards and squeezed together.

A body shrug consists of [INFO needed here please.]

Attributes: Default EML Attributes.

duration must have a value.

Properties: none (Atomic element).

Example:

<shrug duration="5000"/>I neither know nor care!


Emotional Markup Language Examples


<?xml version="1.0"?>

<!DOCTYPE vhml SYSTEM "./vhml-v01.dtd">


<vhml>


<p>

<angry>Don't tell me what to do</angry>


<happy>I have some wonderful news for you</happy>


<neutral>I am saying this in a neutral voice</neutral>


<sad>I can not come to your party tomorrow</sad>

</p>

</vhml>


Facial Animation Markup Language (FAML)

FAML Default Attributes

Each element has at least 3 attributes associated with it:


Name

Description

Values

Default

intensity

This value ranges from 0 to 100 and represents a percentage value of the maximum intensity of that particular facial gesture, expression or emotion.

0 - 100

100


duration

The duration value represents the time span in milliseconds that the element expression, gesture or emotion will persist in the Virtual Human animation.

A numeric value representing time in milliseconds.

Must be specified


mark

This attribute can be used to set an arbitrary mark at a given place in the text, so that, for example, an engine can report back to the calling application that it has reached the given location.

Character-string identifier for this tag.

No default - optional attribute


Direction/Orientation

The following elements affect the direction or orientation of the head and the eyes (directions are with respect to the Talking Head).


The animation of the head movement can be broken down into three main parts: pitch, yaw and roll.


The pitch affects the elevation and depression of the head in the vertical field. The yaw affects the rotational angle of the head in the horizontal field and roll affects the axial angle. The combination of these three factors allows full directional movement for the animation of the Talking Head.


Notes

1: There are 12 main elements that control and animate the direction and orientation of the Talking Head. [Should we have independent eye/head movement?]


2: It is noted that the eyes and head move at the same rate during the animation of the looking elements.


3: All combinations of the above directional elements allow the head to have full range of orientation. A combination of the <look_left/> and <look_up/> elements will enable the head to look to the top left in the animation sequence, whilst <look_right/> <look_down/> will enable the head to look to the bottom right.


4: The eyes_xxx directional elements allow four independent directions for eye movement. This entails movement in the vertical and horizontal planes. As with the head directional elements, the elements can be combined to provide a full range of eye gaze, even directions not humanly possible. It is however noted that the eyes cannot be animated independently of each other. [Is this a problem???? We could use the which attribute of eyebrow_up]
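
A sketch of note 3 above, combining directional elements to look to the top left (the durations are illustrative):

<look_left duration="1000"/><look_up duration="1000"/>Look, up there in the rafters!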


anger

Description:

Inherited from EML.


joy == happy

Description:

Inherited from EML.

neutral

Description:

Inherited from EML.

sadness

Description:

Inherited from EML.

fear

Description:

Inherited from EML.

disgust

Description:

Inherited from EML.

surprise

Description:

Inherited from EML.

dazed

Description:

Inherited from EML.

confused

Description:

Inherited from EML.

bored

Description:

Inherited from EML.


look_left

Description:

Turns both the eyes and head to look left.

Attributes: Default FAML Attributes.

duration must have a value.

Properties: none (Atomic element).

Example:

<look_left duration="1000"/>Cheese to the left of me!


look_right

Description: Turns both the eyes and head to look right.

Attributes: Default FAML Attributes.

duration must have a value.

Properties: none (Atomic element).

Example:

<look_right duration="800"/>Cheese to the right of me!


look_up

Description:

Turns both the eyes and head to look up.

Attributes: Default FAML Attributes.

duration must have a value.

Properties: none (Atomic element).

Example:

<look_up duration="5000"/>Dear God, is there no escaping this smelly cheese?


look_down

Description:

Turns both the eyes and head to look down.

Attributes: Default FAML Attributes.

duration must have a value.

Properties: none (Atomic element).

Example:

<look_down duration="1000"/>Perhaps it is just my feet!


head_left

Description:

Only the head turns left, the eyes remain looking forward.

Attributes: Default FAML Attributes.

duration must have a value.

Properties: none (Atomic element).

Example:

<head_left duration="2000" intensity="30"/>What, no potatoes?


head_right

Description:

Only the head turns right, the eyes remain looking forward.

Attributes: Default FAML Attributes.

duration must have a value.

Properties: none (Atomic element).

Example:

<head_right duration="100"/>Where is the chocolate?


head_up

Description:

Only the head turns upward, the eyes remain looking forward.

Attributes: Default FAML Attributes.

duration must have a value.

Properties: none (Atomic element).

Example:

<head_up intensity="100" duration="1000"/>You are an insolent swine!


head_down

Description:

Only the head turns downward, the eyes remain looking forward.

Attributes: Default FAML Attributes.

duration must have a value.

Properties: none (Atomic element).

Example:

<head_down duration="2500"/>Are you happy now?




eyes_left

Description:

Only the eyes turn left, the head remains looking forward.

Attributes: Default FAML Attributes.

duration must have a value.

Properties: none (Atomic element).

Example:

<eyes_left duration="1000"/>There is the door, please use it.


eyes_right

Description:

Only the eyes turn right, the head remains looking forward.

Attributes: Default FAML Attributes.

duration must have a value.

Properties: none (Atomic element).

Example:

<eyes_right duration="1000"/>Stand still laddie!


eyes_up

Description:

Only the eyes turn upward, the head remains looking forward.

Attributes: Default FAML Attributes.

duration must have a value.

Properties: none (Atomic element).

Example:

<eyes_up intensity="75" duration="1000"/>Not that turnip!


eyes_down

Description:

Only the eyes turn downward, the head remains looking forward.

Attributes: Default FAML Attributes.

duration must have a value.

Properties: none (Atomic element).

Example:

<eyes_down duration="1000"/>Sorry seems to be the hardest word.




head_left_roll

Description:

The head_left_roll element animates a roll of the Talking Head to the left in the axial plane. Roll, although subtle in normal movement, is essential for realism.

This element allows the author to script roll movement in the Talking Head, typically in conjunction with other elements, such as nodding and head movements, to add further realism to the Talking Head.

Attributes: Default FAML Attributes.

duration must have a value.

Properties: none (Atomic element).

Example:

<head_left_roll duration="1000"/>Way over yonder.


head_right_roll

Description:

The head_right_roll element animates a roll of the Talking Head to the right in the axial plane. Roll, although subtle in normal movement, is essential for realism.

This element allows the author to script roll movement in the Talking Head, typically in conjunction with other elements, such as nodding and head movements, to add further realism to the Talking Head.

Attributes: Default FAML Attributes.

duration must have a value.

Properties: none (Atomic element).

Example:

<head_right_roll duration="800"/>What a strange sight!


EyeBrows


Notes:

1: The eyebrow movement elements enable the author to script certain eyebrow movements to accentuate words or phrases. MPEG-4 separates the eyebrow into 3 regions: inner, middle and outer. The eyebrow elements affect all three regions of the eyebrow to animate movement.


[individual sections to be moved independently???]

[Should we mention MPEG-4?]


eyebrow_up

Description :

vertical eyebrow movement upwards.

Attributes: Default FAML Attributes.

duration must have a value.


Name

Description

Values

Default

which

which eyebrow to move

both

right

left

both



Properties: none (Atomic element).

Example:

<eyebrow_up which="left" duration="1000"/> Fascinating Captain.


eyebrow_down

Description:

vertical eyebrow movement downwards.

Attributes: Default FAML Attributes.

duration must have a value.


Name

Description

Values

Default

which

which eyebrow to move

both

right

left

both



Properties: none (Atomic element).

Example:

<eyebrow_down duration="1000"/>I am not happy with you!


eyebrow_squeeze

Description:

Squeezes the eyebrows together.

Attributes: Default FAML Attributes.

duration must have a value.

Properties: none (Atomic element).

Example:

<eyebrow_squeeze duration="1000"/>Oooh, that's difficult.



Blinks/Winks


Notes


blink

Description:

The blink element animates a blink of both eyes in the Talking Head animation.

The blink element only affects the upper and lower eyelid facial features of the head. By altering the intensity value, the amount of eye closure in the animation is affected. An intensity value of 50 denotes 50 percent of the maximum amplitude for the blink element, and as such the animation would only reflect half blinking, where only half of the eyeball is covered.

Attributes: Default FAML Attributes.

duration must have a value.

[Attributes for left/right start time?]

Properties: none (Atomic element).

Example:

He gave a <blink intensity="10" duration="500"/> blink, then a <right_wink duration="500"/> wink and laughed.


double_blink

Description:

Not all blinks in humans are singular. Observation has shown that double blinking is quite common and can precede changes in emotion or denote sympathetic output.

Attributes: Default FAML Attributes.

duration must have a value.

[Attributes for left/right start time?]

Properties: none (Atomic element).

Example:

<double_blink duration="20"/>What a surprise!!



left_wink

Description:

Animates a wink of the left eye. The wink is not just the blinking of one eye: the head pitch, roll and yaw are affected, as well as the outer eyebrow and cheek. The combination of these animated features adds to the realism of the wink itself.

Attributes: Default FAML Attributes.

duration must have a value.

Properties: none (Atomic element).

Example:

Nudge, nudge, <left_wink duration="500"/> wink,

<left_wink duration="2000"/>wink.


right_wink

Description:

Animates a wink of the right eye. The wink is not just the blinking of one eye: the head pitch, roll and yaw are affected, as well as the outer eyebrow and cheek. The combination of these animated features adds to the realism of the wink itself.

Attributes: Default FAML Attributes.

duration must have a value.

Properties: none (Atomic element).

Example:

Nudge, nudge, <left_wink duration="500"/> wink,

<right_wink duration="2000"/>wink.


Hyper Text Markup Language (HTML)


[Should we translate HTML into the ACSS as shown or only allow a minimum subset of well formed HTML?]


H1, H2, H3,

H4, H5, H6 { voice-family: paul, male; stress: 20; richness: 90 }

H1 { pitch: x-low; pitch-range: 90 }

H2 { pitch: x-low; pitch-range: 80 }

H3 { pitch: low; pitch-range: 70 }

H4 { pitch: medium; pitch-range: 60 }

H5 { pitch: medium; pitch-range: 50 }

H6 { pitch: medium; pitch-range: 40 }

LI, DT, DD { pitch: medium; richness: 60 }

DT { stress: 80 }

PRE, CODE, TT { pitch: medium; pitch-range: 0; stress: 0; richness: 80 }

EM { pitch: medium; pitch-range: 60; stress: 60; richness: 50 }

STRONG { pitch: medium; pitch-range: 60; stress: 90; richness: 90 }

DFN { pitch: high; pitch-range: 60; stress: 60 }

S, STRIKE { richness: 0 }

I { pitch: medium; pitch-range: 60; stress: 60; richness: 50 }

B { pitch: medium; pitch-range: 60; stress: 90; richness: 90 }

U { richness: 0 }

A:link { voice-family: harry, male }

A:visited { voice-family: betty, female }

A:active { voice-family: betty, female; pitch-range: 80; pitch: x-high }


Body Animation Markup Language (BAML)


[Input here please, what extra markup is needed?]

1: Movement

2: Stance

3: Uses EML

4: Gestures


anger

Description:

Inherited from EML.


joy == happy

Description:

Inherited from EML.

neutral

Description:

Inherited from EML.

sadness

Description:

Inherited from EML.

fear

Description:

Inherited from EML.

disgust

Description:

Inherited from EML.

surprise

Description:

Inherited from EML.

dazed

Description:

Inherited from EML.

confused

Description:

Inherited from EML.

bored

Description:

Inherited from EML.


Dialogue Manager Markup Language (DMML)

Dialogue Manager Response

This language covers the Dialogue Manager's response only, not the pattern matching or the overall Knowledge base format.


[Since this work has already begun we need to talk about a preferred subset of AIML that can be used for the DMML.]


Therefore, the AIML tags,

<alice></alice> root element of Alice

<category></category> categorization of an Alice topic.

<pattern></pattern> the user input pattern.

<template>XXXX</template> the marking of the DM's response


are not part of DMML.


The XXXX in the above is covered by DMML. For example, in the Alice fragment:

<template>

My name is <getvar name="botname"/>.

What is your name?

</template>

the DMML would handle the plain text "My name is ", the XML element "<getvar name="botname"/>" and the trailing text ". What is your name?".


List of DMML elements:

<star/> indicates the input text fragment matching the pattern '*' or '_'.


<that></that> If the previous bot reply matches the THAT pattern, this event is fired.

<that/> = <that><star/></that>

<justbeforethat> </justbeforethat>

<justthat> </justthat>


<person2> X </person2> change X from 1st to 2nd person

<person2/> = <person2><star/></person2>

<person> X </person> exchange 1st and 3rd person

<person/> = <person><star/></person>

<srai> X </srai> calls the pattern matches recursively on X.

<sr/> = <srai><star/></srai>


<random> <li>X1</li><li>X2</li> </random> Say one of X1 or X2 randomly

<system>X</system> tag to run the shell command X

<think> X </think> evaluates the AIML expression X, but "nullifies" or hides the result from the client reply.


<gossip> X </gossip> Save X as gossip.


<getvar name = "Name Of Variable" default="Default if no variable found"/>

and

<setvar name = "Name Of Variable"> Set it to this </setvar>
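
Putting several of these elements together, the DMML content of a template might look like the following sketch (the template wrapper itself is AIML rather than DMML, the fallback text is arbitrary, and DMname is one of the recognised variable names listed below):

<template>
<random>
<li>My name is <getvar name="DMname" default="the Dialogue Manager"/>.</li>
<li>I would rather not say.</li>
</random>
</template>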


Recognised variable names

The recognised variable names are:


Preferred    Legacy equivalent name (deprecated)    Atomic tag


DMbirthplace botbirthplace <birthplace/>

DMbirthday botbirthday <birthday/>

DMmaster botmaster <botmaster/>

DMboyfriend botboyfriend <boyfriend/>

DMband botband <favorite_band/>

DMbook botbook <favorite_book/>

DMcolor botcolor <favorite_color/>

DMfood botfood <favorite_food/>

DMmovie botmovie <favorite_movie/>

DMsong botsong <favorite_song/>

DMfun botfun <for_fun/>

DMfriends botfriends <friends/>

DMgender botgender <gender/>

DMgirlfriend botgirlfriend <girlfriend/>

DMmusic botmusic <kind_music/>

DMlooks botlooks <look_like/>

DMname botname <name/>

DMsize botsize <getsize/>


question <question/>


name <getname/>

topic <gettopic/>


age <get_age/>

gender <get_gender/>

has <get_has/>

he <get_he/>

ip <get_ip/>

it <get_it/>

location <get_location/>

she <get_she/>

they <get_they/>

we <get_we/>


dialogueManagerName

dialogueManagerwhoami


dialogueManagerGender

dialogueManagerHisHer

dialogueManagerHimHer

dialogueManagerHeShe

dialogueManagerMaster

dialogueManagerBirthPlace

dialogueManagerBirthDay

dialogueManagerAge

dialogueManagerDescription


dialogueManagerFavouriteColour

dialogueManagerFavouriteSport

dialogueManagerFavouriteFood

dialogueManagerFavouritePainter

dialogueManagerFavouriteArtist

dialogueManagerFavouriteBook

dialogueManagerFavouriteMovie

dialogueManagerFavouriteMusic

dialogueManagerFavouriteSong

dialogueManagerFavouriteAlbum


dialogueManagerPurpose

dialogueManagerHomeURL

Speech Markup Language (SML)

The following list is a description of each of SML's elements. As with any XML element, all SML elements are case sensitive; therefore, all SML elements must appear in lower case, otherwise they will be ignored.


Speech Markup Language Default Attributes


Name

Description

Values

Default

mark

This attribute can be used to set an arbitrary mark at a given place in the text, so that, for example, an engine can report back to the calling application that it has reached the given location.

Character-string identifier for this tag.

No default - optional attribute



xml:lang

Description:

Following the XML convention, languages are indicated by an xml:lang attribute on the enclosing element with the value following RFC 1766 to define language codes. Language information is inherited down the document hierarchy, i.e. it has to be given only once if the whole document is in one language, and language information nests, i.e. inner attributes overwrite outer attributes.

Example:

<vhml xml:lang="en-US">

<paragraph>I don't speak Japanese.</paragraph>

<paragraph xml:lang="ja">

Nihongo-ga wakarimasen.

</paragraph>

</vhml>

Notes:

1: The speech output platform determines behavior in the case that a document requires speech output in a language not supported by the speech output platform. This is currently one of only two allowed exceptions to the conformance criteria.


2: There may be variation across conformant platforms in the implementation of xml:lang for different markup elements. A document author should be aware that intra-sentential language changes may not be supported on all platforms.


3: A language change often necessitates a change in the voice. Where the platform does not have the same voice in both the enclosing and enclosed languages it should select a new voice with the inherited voice attributes. Any change in voice will reset the prosodic attributes to the default values for the new voice of the enclosed text. Where the xml:lang value is the same as the inherited value there is no need for any changes in the voice or prosody.


4: All elements should process their contents specific to the enclosing language. For instance, the phoneme, emphasis and break element should each be rendered in a manner that is appropriate to the current language.


5: Unsupported languages on a conforming platform could be handled by specifying nothing and relying on platform behavior, issuing an event to the host environment, or by providing substitute text in the Markup Language.


[Should this be for all markups? Body Language as well?]


anger

Description:

Inherited from EML.


joy == happy

Description:

Inherited from EML.

neutral

Description:

Inherited from EML.

sadness

Description:

Inherited from EML.

fear

Description:

Inherited from EML.

disgust

Description:

Inherited from EML.

surprise

Description:

Inherited from EML.

dazed

Description:

Inherited from EML.

confused

Description:

Inherited from EML.

bored

Description:

Inherited from EML.


p == paragraph

Description:

Element used to divide text into paragraphs. Can only occur directly within a vhml element. The p element wraps emotion elements.

Attributes: none.

Properties: Can contain all other elements, except itself and vhml.

Example:

<p>

<sad>Today it's been raining all day,</sad>

<happy>

But they're calling for sunny skies tomorrow.

</happy>

</p>


Notes:

1: For brevity, the markup supports <p> as an exact equivalent of <paragraph>. (Note: XML requires that the opening and closing elements be identical so <p> text </paragraph> is not legal.).


2: The use of paragraph elements is optional. Where text occurs without an enclosing paragraph element the speech output system should attempt to determine the structure using language-specific knowledge of the format of plain text.


s == sentence

Description:

Element used to divide text into sentences. Can only occur directly within a vhml or p element.

Attributes: none.

Properties: Can contain all other elements, except itself and vhml.

Example:

<p>

<sentence>Today it's been raining,</sentence>

<happy>

But they're calling for sunny skies tomorrow.

</happy>

</p>


Notes:

1: For brevity, the markup also supports <s> as exact equivalent of <sentence>. (Note: XML requires that the opening and closing elements be identical so <s> text </sentence> is not legal.). Also note that <s> means "strike-out" in HTML 4.0 and earlier, and in XHTML-1.0-Transitional but not in XHTML-1.0-Strict.


2: The use of the sentence element is optional. Where text occurs without an enclosing sentence element the speech output system should attempt to determine the structure using language-specific knowledge of the format of plain text.



say-as

Description:

The say-as element indicates the type of text construct contained within the element. This information is used to help specify the pronunciation of the contained text. Defining a comprehensive set of text format types is difficult because of the variety of languages that must be considered and because of the innate flexibility of written languages.

Attributes:

The say-as element has been specified with a reasonable set of format types. Text substitution may be utilized for unsupported constructs.

The type attribute is a required attribute that indicates the contained text construct. The format is a text type optionally followed by a colon and a format. The base set of type values, divided according to broad functionality, is as follows:


Pronunciation Types


<say-as type="acronym"> USA </say-as>

<!-- U. S. A. -->

Numerical Types


Rocky <say-as type="number"> XIII </say-as>

<!-- Rocky thirteen -->

Pope John the <say-as type="number:ordinal"> VI </say-as>

<!-- Pope John the sixth -->

Deliver to <say-as type="number:digits"> 123 </say-as> Brookwood.

<!-- Deliver to one two three Brookwood-->


Time, Date and Measure Types

"dmy", "mdy", "ymd" (day, month , year), (month, day, year), (year, month, day)

"ym", "my", "md" (year, month), (month, year), (month, day)

"y", "m", "d" (year), (month), (day).

"hms", "hm", "h" (hours, minutes, seconds), (hours, minutes), (hours).

"hms", "hm", "ms", "h", "m", "s" (hours, minutes, seconds), (hours, minutes), (minutes, seconds), (hours), (minutes), (seconds).


<say-as type="date:ymd"> 2000/1/20 </say-as>

<!-- January 20th two thousand -->

Proposals are due in <say-as type="date:my"> 5/2001 </say-as>

<!-- Proposals are due in May two thousand and one -->

The total is <say-as type="currency"> $20.45</say-as>

<!-- The total is twenty dollars and forty-five cents -->


When multi-field quantities are specified ("dmy", "my", etc.), it is assumed that the fields are separated by a single, non-alphanumeric character.

Address, Name, Net Types


<say-as type="net:email"> road.runner@acme.com </say-as>



<say-as sub="World Wide Web Consortium"> W3C </say-as>

<!-- World Wide Web Consortium -->


Notes:

1: The conversion of the various types of text and text markup to spoken forms is language and platform-dependent. For example, <say-as type="date:ymd"> 2000/1/20 </say-as> may be read as "January twentieth two thousand" or as "the twentieth of January two thousand" and so on. The markup examples above are provided for usage illustration purposes only.


2: It is assumed that pronunciations generated by the use of explicit text markup always take precedence over pronunciations produced by a lexicon.


phoneme

Description:

The phoneme element provides a phonetic pronunciation for the contained text. The phoneme element may be empty. However, it is recommended that the element contain human-readable text that can be used for non-spoken rendering of the document. For example, the content may be displayed visually for users with hearing impairments.

Attributes:

The alphabet attribute is an optional attribute that specifies the phonetic alphabet.

The ph attribute is a required attribute that specifies the phoneme string.


Example:

<phoneme alphabet="ipa" ph="t&#x252;m&#x251;to&#x28A;"> tomato </phoneme>

<!-- This is an example of IPA using character entities -->

Notes:

1: Characters composing many of the IPA phonemes are known to display improperly on most platforms. Additional IPA limitations include the fact that IPA is difficult to understand even when using ASCII equivalents, IPA is missing symbols required for many of the world's languages, and IPA editors and fonts containing IPA characters are not widely available.


2: Entity definitions may be used for repeated pronunciations. For example:


<!ENTITY uk_tomato "t&#x252;m&#x251;to&#x28A;">

... you say <phoneme ph="&uk_tomato;"> tomato </phoneme>

I say...


3: In addition to an exhaustive set of vowel and consonant symbols, IPA supports a syllable delimiter, numerous diacritics, stress symbols, lexical tone symbols, intonational markers and more.


voice

Description:

The voice element is a production element that requests a change in speaking voice.

Attributes:


Examples:

<voice gender="female" category="child">

Mary had a little lamb,

</voice>

<!-- now request a different female child's voice -->

<voice gender="female" category="child" variant="2"> Its fleece was white as snow.

</voice>

<!-- platform-specific voice selection -->

<voice name="Mike">

I want to be like Mike.

</voice>

Notes:

1: When there is not a voice available that exactly matches the attributes specified in the document, the voice selection algorithm may be platform-specific.


2: Voice attributes are inherited down the tree including to within elements that change the language.


<voice gender="female">

Any female voice here.

<voice category="child">

A female child voice here.

<paragraph xml:lang="ja">

<!-- A female child voice in Japanese. -->

</paragraph>

</voice>

</voice>


3: A change in voice resets the prosodic parameters since different voices have different natural pitch and speaking rates. Volume is the only exception. It may be possible to preserve prosodic parameters across a voice change by employing a style sheet. Characteristics specified as "+" or "-" voice attributes with respect to absolute voice attributes would not be preserved.


4: The xml:lang attribute may be used specially to request usage of a voice with a specific dialect or other variant of the enclosing language.


<voice xml:lang="en-cockney">Try a Cockney voice

(London area).</voice>

<voice xml:lang="en-brooklyn">Try one New York

accent.</voice>



emphasis

Description:

The emphasis element requests that the contained text be spoken with emphasis (also referred to as prominence or stress). The synthesizer determines how to render emphasis since the nature of emphasis differs between languages, dialects or even voices.

See also emphasise_syllable

Attributes:

Examples:

That is a <emphasis> big </emphasis> car!

That is a <emphasis level="strong"> huge </emphasis>

bank account!


break

Description:

The break element is an empty element that controls the pausing or other prosodic boundaries between words. The use of the break element between any pair of words is optional. If the element is not defined, the speech synthesizer is expected to automatically determine a break based on the linguistic context. In practice, the break element is most often used to override the typical automatic behavior of a speech synthesizer.

See also pause element.

Attributes:


Examples:


Take a deep breath <break/> then continue.


Press 1 or wait for the tone. <break time="3s"/>

I didn't hear you!

Notes:

1: Using the size attribute is generally preferable to the time attribute within normal speech. This is because the speech synthesizer will modify the properties of the break according to the speaking rate, voice and possibly other factors. As an example, a fixed 250ms pause (placed with the time attribute) sounds much longer in fast speech than in slow speech.
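
A sketch of the size attribute (the set of descriptive values it accepts is still to be specified in this draft; "large" is assumed here purely for illustration):

Take a deep breath <break size="large"/> then continue.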


prosody

Description:

The prosody element permits control of the pitch, speaking rate and volume of the speech output.

See also pitch element.

Attributes:


Relative values

Relative changes for any of the attributes above are specified as floating-point values: "+10", "-5.5", "+15.2%", "-8.0%". For the pitch and range attributes, relative changes in semitones are permitted: "+5st", "-2st". Since speech synthesizers are not able to apply arbitrary prosodic values, conforming speech synthesis processors may set platform-specific limits on the values. This is the second of only two exceptions allowed in the conformance criteria for a VHML processor.


The price of XYZ is <prosody rate="-10%">

<say-as type="currency">$45</say-as></prosody>

Pitch contour

The pitch contour is defined as a set of targets at specified intervals in the speech output. The algorithm for interpolating between the targets is platform-specific. In each pair of the form (interval,target), the first value is a percentage of the period of the contained text and the second value is the value of the pitch attribute (absolute, relative, relative semitone, or descriptive values are all permitted). Interval values outside 0% to 100% are ignored. If a value is not defined for 0% or 100% then the nearest pitch target is copied.


<prosody contour="(0%,+20)(10%,+30%)(40%,+10)">

good morning

</prosody>


Notes:

1: The descriptive values ("high", "medium" etc.) may be specific to the platform, to user preferences or to the current language and voice. As such, it is generally preferable to use the descriptive values or the relative changes over absolute values.


2: The default value of all prosodic attributes is no change. For example, omitting the rate attribute means that the rate is the same within the element as outside.


3: The duration attribute takes precedence over the rate attribute. The contour attribute takes precedence over the pitch and range attributes.


4: All prosodic attribute values are indicative: if a speech synthesizer is unable to accurately render a document as specified it will make a best effort (e.g. when asked to set the pitch to 1 MHz, or the speaking rate to 1,000,000 words per minute).


audio

Description:

The audio element supports the insertion of recorded audio files and the insertion of other audio formats in conjunction with synthesized speech output. The audio element may be empty. If the audio element is not empty then the contents should be the marked-up text to be spoken if the audio document is not available. The contents may also be used when rendering the document to non-audible output and for accessibility.

Attributes:

The required attribute is src, which is the URI of a document with an appropriate mime-type.

Examples:

<!-- Empty element -->

Please say your name after the tone. <audio src="beep.wav"/>

<!-- Container element with alternative text -->

<audio src="prompt.au">What city do you want to fly from?</audio>

Notes:

1: The audio element is not intended to be a complete mechanism for synchronizing synthetic speech output with other audio output or other output media (video etc.). Instead the audio element is intended to support the common case of embedding audio files in voice output.


2: The alternative text may contain markup. The alternative text may be used when the audio file is not available, when rendering the document as non-audio output, or when the speech synthesizer does not support inclusion of audio files.


mark

Description:

A mark element is an empty element that places a marker into the output stream for asynchronous notification. When audio output of the TTS document reaches the mark, the speech synthesizer issues an event that includes the required name attribute of the element. The platform defines the destination of the event. The mark element does not affect the speech output process.

Attributes:

The required attribute is name, which is a character string.

Examples:

Go from <mark name="here"/> here, to <mark name="there"/> there!

Notes:

1: When supported by the implementation, requests can be made to pause and resume at document locations specified by the mark values.


2: The mark name is not required to be unique within a document.


emphasise_syllable == emphasize_syllable

Description:

Emphasizes a syllable within a word.

Attributes:

Name

Description

Values

target

Specifies which phoneme in contained text will be the target phoneme. If target is not specified, default target will be the first phoneme found within the contained text.

A character string representing a phoneme symbol. Uses the MRPA phoneme set.

level

The strength of the emphasis. (Default level is weak).

weakest, weak, moderate, strong.

affect

Specifies if the element is to affect the contained text's phoneme pitch values, or duration values, or both. (Default is pitch only).

p - affect pitch only.

d - affect duration only.

b - affect both pitch and duration.


Properties: Cannot contain other elements.

Example:

I have told you <emphasise_syllable affect="b" level="moderate">so</emphasise_syllable> many times.



pause

Description:

Inserts a pause in the utterance.

Attributes:

Name

Description

Values

length

Specifies the length of the pause using a descriptive value.

short, medium, long.

msec

Specifies the length of the pause in milliseconds.

A positive number.

smooth

Specifies if the last phonemes before this pause need to be lengthened slightly.

yes, no (default = yes)


Properties: empty.

Example:

I'll take a deep breath <pause length="long"/> and try it again.
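
An exact pause length might instead be given with the msec attribute (a sketch; the value is arbitrary):

I'll take a deep breath <pause msec="800" smooth="no"/> and try it again.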


pitch

Description:

Element that changes pitch properties of contained text.

Attributes:

Name

Description

Values

middle

Increases/decreases pitch average of contained text by N%

(+/-)N%, highest, high, medium, low, lowest.

range

Increases/decreases pitch range of contained text by N%.

(+/-)N%


Properties: Can contain other non-emotion elements.

Example:

'Not I', <pitch middle="-20%">said the dog</pitch>



Conformance

This section is Normative.

The keywords "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this conformance section are to be interpreted as described in RFC 2119


Conforming Virtual Human Markup Document Fragments

A Virtual Human markup document fragment is a Conforming XML Document Fragment if it adheres to the specification described in this document including the DTD (see Document Type Definition) and also:

The Virtual Human Markup Language or these conformance criteria provide no designated size limits on any aspect of Virtual Human markup documents. There are no maximum values on the number of elements, the amount of character data, or the number of characters in attribute values.


Conforming Stand-Alone Virtual Human Markup Language Documents

A file is a Conforming Stand-Alone Virtual Human Markup Language Document if:

Conforming Virtual Human Markup Language Processors

A Virtual Human Markup Language processor is a program that can parse and process Virtual Human Markup Language documents.


In a Conforming Virtual Human Markup Language Processor, the XML parser must be able to parse and process all XML constructs defined within XML 1.0 and XML Namespaces.


A Conforming Virtual Human Markup Language Processor must correctly understand and apply the command logic defined for each markup element as described by this document. Exceptions to this requirement are allowed when an xml:lang attribute is utilized to specify a language not present on a given platform, and when a non-enumerated attribute value is specified that is out-of-range for the platform. The response of the Conforming Virtual Human Markup Language Processor in both cases would be platform-dependent.


A Conforming Virtual Human Markup Language Processor should inform its hosting environment if it encounters an element, element attribute, or syntactic combination of elements or attributes that it is unable to support. A Conforming Virtual Human Markup Language Processor should also inform its hosting environment if it encounters an illegal Virtual Human document or unknown XML entity reference.


The Rendering

FAP / BAP / TTS Renderer


Rendering Web Page Direct - XML


Rendering Web Page via engines
References

Normative.

Java Speech API Markup Language

http://java.sun.com/products/java-media/speech/forDevelopers/JSML/index.html JSML is an XML specification for controlling text-to-speech engines. Implementations are available from IBM, Lernout & Hauspie and in the Festival speech synthesis platform and in other implementations of the Java Speech API.

S. Bradner, "Key words for use in RFCs to Indicate Requirement Levels", RFC 2119, Harvard University, March 1997

Informative.

SABLE

http://www.research.att.com/~rws/Sable.v1_0.htm SABLE is a markup language for controlling text to speech engines. It has evolved out of work on combining three existing text to speech languages: SSML, STML and JSML. Implementations are available for the Bell Labs synthesizer and in the Festival speech synthesizer. The following are two of the papers written about SABLE and its applications:

Spoken Text Markup Language

(http://www.cstr.ed.ac.uk/publications/1997/Sproat_1997_a.ps) STML is an SGML language for controlling text to speech engines developed jointly by Bell Laboratories and by the Centre for Speech Technology Research, Edinburgh University.

Microsoft Speech API Control Codes

(http://www.microsoft.com/iit/) SAPI defines a set of inline control codes for manipulating speech output by SAPI speech synthesizers.

VoiceXML Prompts

(http://www.voicexml.com/) The Voice XML specification for dialog systems development includes a set of prompt elements for generating speech synthesis and other audio output that are very similar to elements of JSML and SABLE.



Pelachaud, C. and Prevost, S. (1995) Talking heads: Physical, linguistic and cognitive issues in facial animation. Course Notes for Computer Graphics International '95.

Acknowledgements

This document was ripped off from various sources as a Working Draft.

