Voice Over Agent Guide

Everything you need to produce professional voice-overs with the Voice Over Agent. This guide covers voice selection, delivery control, pronunciation fine-tuning, and troubleshooting — so your scripts sound exactly how you intend.

Supported languages	70+
Available voices	8 curated + thousands via Voice Library
Output format	MP3 (high-quality, 44.1 kHz)
Speed range	0.75x – 1.5x

Quick Start

1. Choose a voice from the dropdown; each has a unique tone and style.
2. Select your desired speed (1.0x is natural; 1.1x is the default).
3. Paste or type your script in the text area, or click Upload .txt to load a file.
4. Optionally insert Voice Inflections (audio tags) at the cursor position.
5. Click Generate Voice Over; the MP3 downloads automatically.

Voices & Settings

Available Voices

Voice	Style
Liam Callahan	Narrative, American Male
Sarah	Mature & Confident, American Female
Lily	Velvety Actress, British Female
Roger	Laid-Back & Casual, American Male
Brian	Deep & Resonant, American Male
River	Relaxed & Neutral, American
Alice	Clear Educator, British Female
Charlie	Deep & Confident, Australian Male

Choosing the Right Voice

• Corporate narration: Liam Callahan, Brian, or Alice — authoritative and clear
• Conversational ads: Roger or River — relaxed, approachable
• Dramatic / storytelling: Lily or Sarah — expressive range
• Explainer / educational: Alice or Charlie — steady pacing

Speed Guide

Speed	Best For
0.75x	Slow, deliberate delivery — emphasis-heavy content, dramatic reads
0.9x	Slightly relaxed — educational, explainer videos
1.0x	Natural speaking pace — general narration
1.1x (default)	Slightly upbeat — promos, social media ads
1.25x	Energetic — fast-paced ads, teasers
1.5x	Rapid — legal disclaimers, fine-print reads

Stability Presets

The Stability setting is the most important control in the Voice Over Agent. It determines how closely the generated voice follows the original reference audio.

Preset	Behavior	Best For
Creative	More emotional and expressive, but prone to hallucinations	Dramatic reads, storytelling, character voices
Natural	Closest to the original voice recording — balanced and neutral	General narration, promos, most use cases (default)
Robust	Highly stable but less responsive to directional prompts — consistent output	Corporate, legal, long-form where consistency matters

How Stability Affects Audio Tags

For maximum expressiveness with audio tags (like [excited] or [whispers]), use Creative or Natural. Robust reduces responsiveness to these directional prompts.

Neutral Voices + Stability

Neutral voices (like River or Alice) tend to be more stable across languages and styles, providing reliable baseline performance. Pairing a neutral voice with Natural stability gives you a dependable foundation without sacrificing moderate expressiveness.

• Neutral voice + Creative — good range of expression with fewer hallucinations than emotive voices
• Neutral voice + Natural — safest all-round choice
• Emotive voice + Robust — tames an expressive voice for consistent output

Delivery Control

Pauses

The voice engine reads punctuation as natural pauses. Use these in your script to control rhythm:

Technique	Effect
`,` (comma)	Brief pause
`.` (period)	Full stop — natural sentence break
`...` (ellipsis)	Weighted, dramatic pause
`—` or `--`	Short, abrupt pause (thought break)
Line break	Separates ideas with a breath

Emphasis

Use CAPITALIZATION to stress individual words:

We don't just grow crops. We grow FUTURES.

Emotion

Emotion comes from text context, not settings. Add narrative cues to guide the voice's tone:

(excitedly) This changes everything for South African farmers!

(with quiet confidence) ProAgri has been at the forefront... for decades.

Tip: generate with cues, then re-generate without them if you prefer a subtler read. The cues “prime” the voice even when removed from surrounding text.

Break Times & Pauses

Use the Insert Pause buttons below the script area to add timed breaks, or type them manually. The voice engine interprets these as natural pauses of varying length.

Duration	What to Type	Use Case
~0.5s	`[pause]`	Brief breath between clauses
~1.0s	`[long pause]`	Between sentences or ideas
~1.5s	`...`	Dramatic pause, building tension
~2.0s	`... ...`	Scene transition, topic change
~3.0s	`... ... ...`	Long break between sections

Other Pause Techniques

• Comma , — brief, natural pause
• Period . — full sentence stop
• Dash — or -- — abrupt thought break
• Line break — separates ideas with a breath

Example: Using Pauses for Impact

We don't just grow crops.

[long pause]

We grow FUTURES.

... ...

ProAgri -- where agriculture meets innovation.
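
To see how much silence a script's pause markers add up to, the durations in the table above can be summed with a short script. This is a rough sketch only: the function name is made up for illustration, the durations are the approximate values from the table, and plain punctuation (commas, periods, dashes) is not counted.

```python
import re

# Approximate durations (seconds) from the Break Times & Pauses table.
# Longer ellipsis patterns come first so they match before shorter ones.
PAUSE_MARKERS = [
    ("... ... ...", 3.0),
    ("... ...", 2.0),
    ("[long pause]", 1.0),
    ("[pause]", 0.5),
    ("...", 1.5),
]


def estimate_pause_seconds(script: str) -> float:
    """Sum the approximate pause time implied by the markers in a script."""
    pattern = "|".join(re.escape(marker) for marker, _ in PAUSE_MARKERS)
    durations = dict(PAUSE_MARKERS)
    return sum(durations[match] for match in re.findall(pattern, script))
```

For the example above, one `[long pause]` plus one `... ...` comes to roughly 3 seconds of added silence.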

Multi-Speaker Dialogue

You can write dialogue-style scripts with speaker labels and stage directions. The voice engine picks up on these cues to adjust timing, emotion, and delivery. Generate each speaker's lines separately with different voices, then combine in your editor.

Dialogue Format

Use [direction] tags inline to control how lines are delivered:

Speaker 1: [starting to speak] So I was thinking we could—
Speaker 2: [jumping in] —test our new timing features?
Speaker 1: [surprised] Exactly! How did you—
Speaker 2: [overlapping] —know what you were thinking? Lucky guess!
Speaker 1: [pause] Sorry, go ahead.
Speaker 2: [cautiously] Okay, so if we both try to talk at the same time—
Speaker 1: [overlapping] —we'll probably crash the system!
Speaker 2: [panicking] Wait, are we crashing? I can't tell if this is a feature or a—
Speaker 1: [interrupting, then stopping abruptly] Bug! ...Did I just cut you off again?
Speaker 2: [sighing] Yes, but honestly? This is kind of fun.
Speaker 1: [mischievously] Race you to the next sentence!
Speaker 2: [laughing] We're definitely going to break something!

How to Produce Multi-Speaker Audio

1. Write the full dialogue script with speaker labels
2. Extract Speaker 1's lines, paste them into the agent, choose a voice (e.g. Liam), and generate
3. Extract Speaker 2's lines, choose a different voice (e.g. Sarah), and generate
4. Combine the audio files in an editor, overlapping where marked
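
Extracting each speaker's lines by hand gets tedious for longer scripts. The sketch below groups lines by their label, assuming every line follows the "Label: text" format shown earlier. The function name is hypothetical, and any line with a colon will be treated as a labeled line.

```python
from collections import defaultdict


def split_by_speaker(dialogue: str) -> dict[str, list[str]]:
    """Group dialogue lines by speaker label, keeping direction tags intact."""
    lines_by_speaker = defaultdict(list)
    for raw in dialogue.splitlines():
        label, sep, text = raw.partition(":")
        if not sep or not text.strip():
            continue  # skip blank lines and lines without a "Label:" prefix
        lines_by_speaker[label.strip()].append(text.strip())
    return dict(lines_by_speaker)
```

Joining one speaker's list with line breaks gives a script you can paste into the agent for that voice.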

Direction Tags That Work Well

Tag	Effect
`[starting to speak]`	Gradual onset, natural beginning
`[jumping in]`	Quick, eager interruption
`[surprised]`	Raised pitch, taken aback
`[cautiously]`	Careful, measured delivery
`[overlapping]`	Rushed, talking over someone
`[panicking]`	Fast, stressed delivery
`[interrupting, then stopping abruptly]`	Sharp cut, sudden silence
`[pause]`	Brief silence before continuing

Pronunciation Control (CMU Phonemes)

Note: Phoneme tags only work when the agent uses the Flash v2 or English v1 model internally. They apply only to English words; for other languages, use alias tags (see Pronunciation Dictionaries below).

What Is CMU?

CMU (Carnegie Mellon Pronouncing Dictionary) maps words to ARPABET phonemes, a standardized set of sound codes. Stress is marked with a digit on each vowel: 0 (unstressed), 1 (primary), 2 (secondary).

Example: "hello" → HH AH0 L OW1

SSML Phoneme Tag

<phoneme alphabet="cmu-arpabet" ph="HH AH0 L OW1">hello</phoneme>

Consonant Reference

Sound	CMU	Example
b	`B`	bat
d	`D`	dog
f	`F`	fan
g	`G`	goat
h	`HH`	hat
j (jar)	`JH`	joy
k	`K`	kite
l	`L`	leg
m	`M`	man
n	`N`	net
ng	`NG`	sing
p	`P`	pen
r	`R`	red
s	`S`	sun
sh	`SH`	ship
t	`T`	top
th (thin)	`TH`	think
th (this)	`DH`	this
v	`V`	van
w	`W`	wet
y	`Y`	yes
z	`Z`	zoo
zh	`ZH`	measure
ch	`CH`	chin

Vowel Reference

Sound	CMU	Example
ah (sofa)	`AH0` / `AH1`	about
ae (cat)	`AE1`	cat
ee (see)	`IY1`	see
eh (bed)	`EH1`	bed
ih (sit)	`IH1`	sit
oh (go)	`OW1`	go
oo (blue)	`UW1`	blue
uh (put)	`UH1`	put
aw (saw)	`AO1`	saw
er (bird)	`ER1`	bird
ay (my)	`AY1`	my
oy (boy)	`OY1`	boy
ow (cow)	`AW1`	cow
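
A quick check before generating can catch typos in a phoneme string. The sketch below uses only the symbols listed in the two reference tables above; the full CMU set has a few more vowels (such as AA and EY), so extend the sets if you need them. The function name is illustrative.

```python
# Symbols from the consonant and vowel reference tables in this guide.
CONSONANTS = {
    "B", "D", "F", "G", "HH", "JH", "K", "L", "M", "N", "NG", "P",
    "R", "S", "SH", "T", "TH", "DH", "V", "W", "Y", "Z", "ZH", "CH",
}
VOWELS = {"AH", "AE", "IY", "EH", "IH", "OW", "UW", "UH", "AO", "ER", "AY", "OY", "AW"}


def invalid_phonemes(ph: str) -> list[str]:
    """Return tokens that are not a known consonant or a vowel with a stress digit."""
    bad = []
    for token in ph.split():
        has_stress = token[-1] in "012"
        base = token[:-1] if has_stress else token
        if base in VOWELS and has_stress:
            continue  # vowels must carry a stress digit
        if base in CONSONANTS and not has_stress:
            continue  # consonants never carry one
        bad.append(token)
    return bad
```

An empty result means every token is a listed consonant or a listed vowel with a stress digit; it does not mean the pronunciation itself is right.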

Step-by-Step: Fixing a Pronunciation

1. Break the word into syllables: “ProAgri” → Pro-Ag-ri
2. Map each sound: P R OW1 AE1 G R IY0
3. Wrap in SSML:

<phoneme alphabet="cmu-arpabet" ph="P R OW1 AE1 G R IY0">ProAgri</phoneme>
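
If the same word appears many times in a script, the wrapping step can be automated. The sketch below builds the tag shown above and applies it to every whole-word, case-sensitive match in a script. The function names and the dictionary passed in are illustrative, not part of the agent.

```python
import re
from xml.sax.saxutils import escape, quoteattr


def phoneme_tag(word: str, arpabet: str) -> str:
    """Wrap a word in an SSML phoneme tag using CMU Arpabet."""
    return f'<phoneme alphabet="cmu-arpabet" ph={quoteattr(arpabet)}>{escape(word)}</phoneme>'


def apply_pronunciations(script: str, pronunciations: dict[str, str]) -> str:
    """Tag every whole-word occurrence of each dictionary key in one pass."""
    if not pronunciations:
        return script
    # Longest words first so "ProAgri" is preferred over a shorter overlapping key.
    words = sorted(pronunciations, key=len, reverse=True)
    pattern = re.compile(r"\b(?:" + "|".join(re.escape(w) for w in words) + r")\b")
    return pattern.sub(
        lambda m: phoneme_tag(m.group(0), pronunciations[m.group(0)]), script
    )
```

Because the replacement happens in a single pass, text inside an already inserted tag is not matched again.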

Common Examples

Word	Phonemes
Daniel	`D AE1 N Y AH0 L`
Nike	`N AY1 K IY0`
Xander	`Z AE1 N D ER0`
Agri	`AE1 G R IY0`
ProAgri	`P R OW1 AE1 G R IY0`

Tips

• Adjust stress first, then vowels — stress errors are more noticeable
• Modify one phoneme at a time and re-test
• CMU Arpabet is more consistent than IPA with current voice models
• Each word needs its own phoneme tag

Pronunciation Dictionaries

For words you use repeatedly, create a dictionary file instead of adding inline tags each time. Dictionaries use the PLS (Pronunciation Lexicon Specification) format, which is XML:

<?xml version="1.0" encoding="UTF-8"?>
<lexicon version="1.0"
  xmlns="http://www.w3.org/2005/01/pronunciation-lexicon"
  alphabet="cmu-arpabet" xml:lang="en-US">
  <lexeme>
    <grapheme>ProAgri</grapheme>
    <phoneme>P R OW1 AE1 G R IY0</phoneme>
  </lexeme>
  <lexeme>
    <grapheme>Claughton</grapheme>
    <alias>Cloffton</alias>
  </lexeme>
</lexicon>

Key Rules

• First match wins — the system uses only the first matching replacement
• Case-sensitive — create separate entries for “ProAgri” and “proagri”
• Alias tags work across all models — use them when phoneme tags aren't supported
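
A dictionary file like the one above can also be generated from plain word lists. This is a minimal sketch using Python's standard library (3.9 or later for `ET.indent`); the function name and its two arguments are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

PLS_NS = "http://www.w3.org/2005/01/pronunciation-lexicon"
XML_NS = "http://www.w3.org/XML/1998/namespace"


def build_pls(phonemes: dict[str, str], aliases: dict[str, str], lang: str = "en-US") -> str:
    """Build a PLS lexicon with one lexeme per phoneme or alias entry."""
    ET.register_namespace("", PLS_NS)  # serialize PLS as the default namespace
    lexicon = ET.Element(
        f"{{{PLS_NS}}}lexicon",
        {"version": "1.0", "alphabet": "cmu-arpabet", f"{{{XML_NS}}}lang": lang},
    )
    for tag, entries in (("phoneme", phonemes), ("alias", aliases)):
        for grapheme, value in entries.items():
            lexeme = ET.SubElement(lexicon, f"{{{PLS_NS}}}lexeme")
            ET.SubElement(lexeme, f"{{{PLS_NS}}}grapheme").text = grapheme
            ET.SubElement(lexeme, f"{{{PLS_NS}}}{tag}").text = value
    ET.indent(lexicon)
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + ET.tostring(lexicon, encoding="unicode")
```

Since matching is case-sensitive, pass each spelling you expect in scripts (for example both "ProAgri" and "proagri") as its own key.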

Text Normalization

Voice models work best with written-out text. Digits, symbols, and abbreviations often cause mispronunciations. Normalize them before pasting:

Raw	Write as
`$42.50`	forty-two dollars and fifty cents
`123-456-7890`	one two three, four five six, seven eight nine zero
`9:23 AM`	nine twenty-three A M
`Dr. Smith`	Doctor Smith
`5kg`	five kilograms
`25%`	twenty-five percent
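
A few of the rows above are regular enough to automate. The sketch below handles only `Dr.`, percentages, kilograms, and dollar amounts with numbers up to 999; phone numbers and times are left for manual editing. All function names are illustrative.

```python
import re

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
        "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
        "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]


def number_to_words(n: int) -> str:
    """Spell out an integer from 0 to 999."""
    if not 0 <= n <= 999:
        raise ValueError("only 0-999 is supported in this sketch")
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] + (f"-{ONES[ones]}" if ones else "")
    hundreds, rest = divmod(n, 100)
    return f"{ONES[hundreds]} hundred" + (f" {number_to_words(rest)}" if rest else "")


def _count(digits: str, unit: str) -> str:
    """Spell out a quantity and pluralize its unit unless the value is one."""
    n = int(digits)
    return f"{number_to_words(n)} {unit}{'' if n == 1 else 's'}"


def normalize(text: str) -> str:
    """Rewrite a handful of common patterns as spoken words."""
    text = re.sub(r"\bDr\.\s", "Doctor ", text)
    text = re.sub(r"\$(\d{1,3})\.(\d{2})\b",
                  lambda m: f"{_count(m[1], 'dollar')} and {_count(m[2], 'cent')}", text)
    text = re.sub(r"\b(\d{1,3})%",
                  lambda m: f"{number_to_words(int(m[1]))} percent", text)
    text = re.sub(r"\b(\d{1,3})\s?kg\b", lambda m: _count(m[1], "kilogram"), text)
    return text
```

Anything the patterns do not recognize is left unchanged, so review the result before pasting it into the agent.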

Audio Tags (Voice Inflections)

Insert these tags in your script to control emotion and delivery. You can click them directly from the Voice Inflections panel on the generator page, or type them manually.

Emotions

[excited] We just hit our target!
[sarcastic] Oh, what a surprise.
[curious] Have you ever wondered why?
[crying] I can't believe it's over.
[mischievously] I have a little secret...

Delivery

[whispers] This is just between us.
[sighs] Another Monday morning.
[exhales] Finally, it's done.
[muttering] I knew this would happen...
[clears throat] Right, let's begin.

Reactions & Sounds

[laughs] That was brilliant!
[chuckles] Classic.
[gulps] This is it...
[applause]

Tips

• Tag effectiveness depends on the voice — some respond better than others
• Match the tag to the voice's natural range — a calm voice won't shout convincingly
• Place the tag before the text it should affect
• You can combine tags with punctuation for more nuance

Troubleshooting

Problem	Likely Cause	Solution
Words mispronounced	Model guessing	Add CMU phoneme tags or use phonetic spelling
Numbers read wrong	Not normalized	Write out numbers as words
Emotion sounds flat	Voice doesn't match	Try a different voice or add narrative cues
Audio tag ignored	Voice incompatibility	Try a different voice — not all respond equally
Speed sounds unnatural	Extreme speed value	Stay between 0.9x and 1.25x for best quality
Pauses missing	No punctuation	Use commas, periods, ellipses, or dashes
Generation fails	Script too long or empty	Check character count; split very long scripts
Hallucinated words	Ambiguous text	Simplify complex sentences; remove special chars
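
For the "script too long" case, splitting at sentence boundaries keeps each chunk natural to read aloud. The sketch below is one way to do it. The `max_chars` value is a placeholder you must supply, since this guide does not state the agent's character limit, and a single sentence longer than the limit is kept whole rather than cut mid-sentence.

```python
import re


def split_script(script: str, max_chars: int) -> list[str]:
    """Split a script into chunks of at most max_chars where possible,
    breaking only after sentence-ending punctuation."""
    # Zero-width split after ., ! or ? when whitespace follows, so line
    # breaks and spacing inside a chunk are preserved.
    pieces = re.split(r"(?<=[.!?])(?=\s)", script.strip())
    chunks, current = [], ""
    for piece in pieces:
        if current and len(current) + len(piece) > max_chars:
            chunks.append(current)
            current = piece.lstrip()
        else:
            current += piece
    if current:
        chunks.append(current)
    return chunks
```

Generate each chunk separately with the same voice, speed, and stability preset, then join the audio files in your editor.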