Text-To-Speech (TTS)
Aculab Cloud supports Amazon Polly and Cepstral Text To Speech (TTS) engines.
- Selecting a voice in the REST API
- Selecting a voice in the UAS API
- Using a Polly voice
- Using a Cepstral voice
- Reserved characters
- Text length
- Common SSML tags
- Charging
Selecting a voice in the REST API
In the REST API Play action, the text_to_say property supports Speech Synthesis Markup Language (SSML), allowing
you to change the way your text is spoken. However, SSML cannot be used to select the voice that TTS uses
to say your text. The voice defaults to the one configured in your service. You can choose a different voice
by setting tts_voice to a Selector from the voice tables below.
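As a rough illustration, the two properties named above could be assembled in Python as follows. This is only a sketch: the helper name `text_item` is hypothetical, and how the resulting object is nested inside a complete Play action is not shown here and should be taken from the REST API reference.

```python
import json


def text_item(text, voice_selector=None):
    """Build the text_to_say / tts_voice pair for a Play action.

    `text_item` is a hypothetical helper; only the two property names
    come from the documentation above.
    """
    item = {"text_to_say": text}
    # Omitting tts_voice leaves the voice configured in the service in effect.
    if voice_selector is not None:
        item["tts_voice"] = voice_selector
    return item


if __name__ == "__main__":
    item = text_item("I have something to say.",
                     "English US Female Polly Kimberly")
    print(json.dumps(item, indent=2))
```

Leaving `voice_selector` unset keeps the service default, which matches the behaviour described above.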
For example, to set English US Female Polly Kimberly use the following setting for tts_voice:
"tts_voice" : "English US Female Polly Kimberly"
Selecting a voice in the UAS API
In the UAS API, the Say methods support Speech Synthesis Markup Language (SSML), allowing you to change the way
your text is spoken, for example by choosing a voice with the voice tag. You can
also choose the TTS engine via the optional acu-engine tag which, if
provided, must be outermost in the string. If you don't provide these tags, your account's Default TTS voice will be used.
For example, to set English US Female Polly Kimberly use the following SSML:
channel.FilePlayer.Say("<acu-engine name='Polly'><voice name='Kimberly'>I have something to say.</voice></acu-engine>");
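The nesting rule above (acu-engine outermost, voice inside it) can be captured in a small helper. The following is a minimal Python sketch, not part of the UAS API: `build_say_text` is a hypothetical name, and it assumes the text is plain text with no SSML of its own, since it escapes the XML special characters.

```python
from xml.sax.saxutils import escape, quoteattr


def build_say_text(text, voice, engine=None):
    """Wrap plain text in a voice tag, with acu-engine outermost if given.

    Hypothetical helper for illustration. `text` is treated as plain
    text, so &, < and > in it are escaped.
    """
    ssml = f"<voice name={quoteattr(voice)}>{escape(text)}</voice>"
    if engine is not None:
        # acu-engine, when present, must be the outermost tag.
        ssml = f"<acu-engine name={quoteattr(engine)}>{ssml}</acu-engine>"
    return ssml


if __name__ == "__main__":
    print(build_say_text("I have something to say.", "Kimberly", "Polly"))
```

The returned string is what would be passed to a Say method such as the one shown above.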
Using a Polly voice
Note: The preset default for your account will usually be a Standard Polly voice.
We support both Standard and Neural Polly voices. Standard voices synthesize lifelike natural speech that is suitable for many applications. Neural voices are enhanced through the use of deep learning technologies to deliver even more natural sounding speech.
Pricing information for Standard and Neural voices is available on the Pricing page of your Cloud Console.
Polly's website has a demo which allows you to select a voice and immediately hear how different text will
sound - see Polly demos.
Polly TTS supports a subset of SSML, which can optionally be embedded within the text you supply to the say
function. For a summary of the SSML tags which may be used, see Common SSML tags
below. For more detailed information, go to the
W3C SSML 1.1 recommendation.
We support the following Polly voices:
Standard
Name | Selector | Audio Clip |
---|
Name | Selector | Audio Clip |
---|
Using a Cepstral voice
Cepstral's website has a demo which allows you to select a voice and immediately hear how different
text will sound - see Cepstral demos.
Cepstral TTS supports a subset of the Speech Synthesis Markup Language (SSML), which can optionally be
embedded within the text you supply to the say function. For a summary of the SSML tags which may be used,
see Common SSML tags below. For more detailed information, go to
Cepstral SSML FAQ
and scroll down to the 'Common Usage Examples'. With reference to that page, please bear in mind the following:
We support the following Cepstral voices:
Name | Selector |
---|---|
We don't support:
- Inserting recorded audio files (our APIs' play functions already allow file replay)
- Applying Cepstral special effects
- Inserting bookmarks
Reserved characters
Some characters are reserved so, if the text you need to say contains any of these, replace them as shown:
Reserved Character | Replace With |
---|---|
< | &lt; |
> | &gt; |
& | &amp; |
| | |
^ |
For example, "Bill & Ben played in the garden" would become "Bill &amp; Ben played in the garden".
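The replacement of &, < and > with the standard XML entities (&amp;, &lt; and &gt;) can be automated. The sketch below uses Python's standard `html.escape`; the function name `escape_reserved` is hypothetical, and it deliberately handles only those three characters, not any other reserved character.

```python
from html import escape


def escape_reserved(text):
    """Replace &, < and > with their XML entities.

    Hypothetical helper. quote=False leaves quotation marks untouched,
    so only the three characters above are rewritten.
    """
    return escape(text, quote=False)


if __name__ == "__main__":
    print(escape_reserved("Bill & Ben played in the garden"))
```

Apply this to the plain text only, before any SSML tags are added around it; otherwise the tags themselves would be escaped.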
Text length
The maximum length of the text to be converted is 1500 characters. As the length of the text increases, the generation time for the associated audio also increases. If the text is not a repeated phrase (repeated phrases may be cached), there will be a longer delay before the audio is played.
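One way to stay within the limit is to split long text into several shorter pieces and say them in turn. The following Python sketch is an illustration only: `split_for_tts` is a hypothetical helper, it assumes plain text (splitting text that contains SSML could cut a tag in half), and it does not settle whether markup counts towards the 1500 characters.

```python
import re

MAX_TTS_CHARS = 1500  # limit stated in the documentation above


def split_for_tts(text, limit=MAX_TTS_CHARS):
    """Split plain text into chunks of at most `limit` characters.

    Chunks end at sentence boundaries where possible; a single sentence
    longer than `limit` is cut at the limit.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        # Hard-cut any sentence that cannot fit in one chunk.
        while len(sentence) > limit:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:limit])
            sentence = sentence[limit:]
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= limit:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks


if __name__ == "__main__":
    for chunk in split_for_tts("A short sentence. " * 200):
        print(len(chunk))
```

Shorter chunks also reduce the delay before the first audio is played, since each conversion has less text to generate.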
Common SSML tags
Polly and Cepstral both support a subset of SSML. Details of common tags can be found below. It is highly recommended that you test your application before deploying it with a different TTS engine.

break
Inserts a break or pause in the speech.
Optional arguments are time and strength.
- time sets an absolute value for the pause. For example, <break time="3s"/> and <break time="3ms"/> set the break time to three seconds and three milliseconds respectively. A break may last up to 10 seconds.
- strength sets the relative value of the pause. The options are none, x-weak, weak, medium, strong and x-strong.
Examples:
This is a <break /> sentence break.
This is a <break time="2s"/> two second break.
This is a dramatic <break strength="x-strong"/> break.

voice
Changes the voice used. The name parameter is required and specifies the voice to use. The supported voices for each TTS engine are listed above.
Note: This SSML tag is supported in the UAS API only. For the REST API, please use the tts_voice setting.
Note: Polly does not support using more than one voice in a request. The first voice tag sets the voice used for all the text.
Examples:
<acu-engine name='Polly'><voice name='Amy'>I'm using Amy instead of the default voice.</voice></acu-engine>

prosody
Changes the pitch, speed and volume of a segment of speech.
Common optional parameters are pitch, rate and volume.
- pitch sets the pitch of speech. The options are x-low, low, medium, high, x-high, a relative change (measured in Hz), e.g. +50Hz, or a percentage change, e.g. +50%.
- rate sets the rate of speech. The options are x-slow, slow, medium, fast, x-fast, or a percentage change, e.g. +50%.
- volume sets the volume of speech. The options are silent, x-soft, soft, medium, loud, x-loud, or a percentage change, e.g. +50%.
Examples:
<prosody rate="x-fast">I'm using a very fast rate.</prosody>
This is normal volume. <prosody volume="soft">This is a soft volume.</prosody>
I can talk very <prosody rate="slow" pitch="low">deeply and slowly.</prosody>
Today's date is the <prosody rate="-50%">15th April, 2012.</prosody>

emphasis
Reads the enclosed text with emphasis.
The level parameter is required. The options are reduced, moderate and strong.
Examples:
This is a <emphasis level="strong">level of emphasis</emphasis>, which can be used to highlight important information.
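Because a mistyped or unclosed tag is easy to miss, it can help to check that marked-up text is at least well-formed XML before sending it. The Python sketch below does that with the standard library; `is_well_formed` is a hypothetical helper, the `<root>` wrapper exists only for the local parse, and the check says nothing about whether a given engine supports a particular tag or attribute.

```python
import xml.etree.ElementTree as ET


def is_well_formed(ssml):
    """Return True if the SSML fragment parses as well-formed XML.

    Hypothetical local check: the fragment is wrapped in a throwaway
    <root> element so that mixed text and tags can be parsed.
    """
    try:
        ET.fromstring(f"<root>{ssml}</root>")
    except ET.ParseError:
        return False
    return True


if __name__ == "__main__":
    print(is_well_formed('This is a <break time="2s"/> two second break.'))
    print(is_well_formed('<prosody rate="slow">This tag is never closed.'))
```

An unescaped & in the text also fails this check, which makes it a quick way to spot reserved characters that still need replacing.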
Charging
Our TTS is charged per conversion, per minute with 15 second granularity. So, for example:
- A play action that plays for 12 seconds will be charged for 15 seconds.
- A get input action that plays a prompt of 5 seconds and then plays "I'm sorry I didn't catch what you said" which lasts 6 seconds and the 5 second prompt again will be charged for 30 seconds (5+6+5=16, rounded up to 2 periods of 15 seconds).
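The rounding in these examples can be reproduced with a few lines of Python. This is a sketch of the arithmetic only, assuming (as in the second example) that all audio played within one action is summed before rounding up to the next 15 seconds; `charged_seconds` is a hypothetical name, not part of any Aculab API.

```python
import math

GRANULARITY_SECONDS = 15  # charging granularity stated above


def charged_seconds(*played_seconds):
    """Sum the played durations and round up to a whole 15 second period."""
    total = sum(played_seconds)
    return math.ceil(total / GRANULARITY_SECONDS) * GRANULARITY_SECONDS


if __name__ == "__main__":
    print(charged_seconds(12))       # single 12 second play
    print(charged_seconds(5, 6, 5))  # prompt, retry message, prompt again
```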
You can obtain detailed charge information for a specific call using the Application Status web service. You can obtain detailed charge information for calls over a period of time using the Managing Reports web services.