
Voice Over Agent Guide

Everything you need to produce professional voice-overs with the Voice Over Agent. This guide covers voice selection, delivery control, pronunciation fine-tuning, and troubleshooting — so your scripts sound exactly how you intend.

Supported languages: 70+
Available voices: 8 curated + thousands via Voice Library
Output format: MP3 (high-quality, 44.1 kHz)
Speed range: 0.75x – 1.5x

Quick Start

  1. Choose a voice from the dropdown; each has a unique tone and style.
  2. Select your desired speed (1.0x is natural; 1.1x is the default).
  3. Paste or type your script in the text area, or click Upload .txt to load a file.
  4. Optionally insert Voice Inflections (audio tags) at the cursor position.
  5. Click Generate Voice Over; the MP3 downloads automatically.

Voices & Settings

Available Voices

Voice | Style
Liam Callahan | Narrative, American Male
Sarah | Mature & Confident, American Female
Lily | Velvety Actress, British Female
Roger | Laid-Back & Casual, American Male
Brian | Deep & Resonant, American Male
River | Relaxed & Neutral, American
Alice | Clear Educator, British Female
Charlie | Deep & Confident, Australian Male

Choosing the Right Voice

  • Corporate narration: Liam Callahan, Brian, or Alice — authoritative and clear
  • Conversational ads: Roger or River — relaxed, approachable
  • Dramatic / storytelling: Lily or Sarah — expressive range
  • Explainer / educational: Alice or Charlie — steady pacing

Speed Guide

Speed | Best For
0.75x | Slow, deliberate delivery: emphasis-heavy content, dramatic reads
0.9x | Slightly relaxed: educational, explainer videos
1.0x | Natural speaking pace: general narration
1.1x (default) | Slightly upbeat: promos, social media ads
1.25x | Energetic: fast-paced ads, teasers
1.5x | Rapid: legal disclaimers, fine-print reads
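
If speeds are chosen by a script rather than by hand, a small guard can keep values inside the supported range. The following is a minimal sketch: the range and default come from this guide, while the helper names `clamp_speed` and `in_best_quality_band` are hypothetical and not part of the agent.

```python
# Range and default come from this guide; the helper names are hypothetical.
MIN_SPEED, MAX_SPEED, DEFAULT_SPEED = 0.75, 1.5, 1.1
BEST_QUALITY_BAND = (0.9, 1.25)  # band suggested in the Troubleshooting table


def clamp_speed(requested=None):
    """Return a speed inside the supported range, or the default if none is given."""
    if requested is None:
        return DEFAULT_SPEED
    return max(MIN_SPEED, min(MAX_SPEED, float(requested)))


def in_best_quality_band(speed):
    """True when the speed sits inside the band the guide suggests for best quality."""
    low, high = BEST_QUALITY_BAND
    return low <= speed <= high


speed = clamp_speed(2.0)
print(speed, in_best_quality_band(speed))
```

Clamping silently is a design choice; raising an error on out-of-range input may suit stricter workflows better.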

Stability Presets

The Stability setting is the most important control in the Voice Over Agent. It determines how closely the generated voice follows the original reference audio.

Preset | Behavior | Best For
Creative | More emotional and expressive, but prone to hallucinations | Dramatic reads, storytelling, character voices
Natural | Closest to the original voice recording; balanced and neutral | General narration, promos, most use cases (default)
Robust | Highly stable but less responsive to directional prompts; consistent output | Corporate, legal, long-form where consistency matters

How Stability Affects Audio Tags

For maximum expressiveness with audio tags (like [excited] or [whispers]), use Creative or Natural. Robust reduces responsiveness to these directional prompts.

Neutral Voices + Stability

Neutral voices (like River or Alice) tend to be more stable across languages and styles, providing reliable baseline performance. Pairing a neutral voice with Natural stability gives you a dependable foundation without sacrificing moderate expressiveness.

  • Neutral voice + Creative — good range of expression with fewer hallucinations than emotive voices
  • Neutral voice + Natural — safest all-round choice
  • Emotive voice + Robust — tames an expressive voice for consistent output

Delivery Control

Pauses

The voice engine reads punctuation as natural pauses. Use these in your script to control rhythm:

Technique | Effect
, (comma) | Brief pause
. (period) | Full stop; natural sentence break
... (ellipsis) | Weighted, dramatic pause
Dash or -- | Short, abrupt pause (thought break)
Line break | Separates ideas with a breath

Emphasis

Use CAPITALIZATION to stress individual words:

We don't just grow crops. We grow FUTURES.

Emotion

Emotion comes from text context, not settings. Add narrative cues to guide the voice's tone:

(excitedly) This changes everything for South African farmers!

(with quiet confidence) ProAgri has been at the forefront... for decades.

Tip: generate with cues, then re-generate without them if you prefer a subtler read. The cues “prime” the voice even when removed from surrounding text.

Break Times & Pauses

Use the Insert Pause buttons below the script area to add timed breaks, or type them manually. The voice engine interprets these as natural pauses of varying length.

Duration | What to Type | Use Case
~0.5s | [pause] | Brief breath between clauses
~1.0s | [long pause] | Between sentences or ideas
~1.5s | ... | Dramatic pause, building tension
~2.0s | ... ... | Scene transition, topic change
~3.0s | ... ... ... | Long break between sections
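
When scripts are assembled programmatically, the table above can be turned into a lookup. This is a sketch only: the markers and approximate durations are copied from the table, and the function names `pause_marker` and `join_with_pause` are hypothetical helpers, not features of the agent.

```python
# Markers and approximate durations copied from the table above.
PAUSE_MARKERS = [
    (0.5, "[pause]"),
    (1.0, "[long pause]"),
    (1.5, "..."),
    (2.0, "... ..."),
    (3.0, "... ... ..."),
]


def pause_marker(seconds):
    """Return the marker whose approximate duration is closest to `seconds`."""
    return min(PAUSE_MARKERS, key=lambda pair: abs(pair[0] - seconds))[1]


def join_with_pause(first, second, seconds):
    """Place a pause marker on its own line between two script passages."""
    return f"{first}\n\n{pause_marker(seconds)}\n\n{second}"


print(join_with_pause("We don't just grow crops.", "We grow FUTURES.", 1.0))
```

Because the durations are approximate, treat the result as a starting point and adjust by ear.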

Other Pause Techniques

  • Comma , — brief, natural pause
  • Period . — full sentence stop
  • Dash or -- — abrupt thought break
  • Line break — separates ideas with a breath

Example: Using Pauses for Impact

We don't just grow crops.

[long pause]

We grow FUTURES.

... ...

ProAgri -- where agriculture meets innovation.

Multi-Speaker Dialogue

You can write dialogue-style scripts with speaker labels and stage directions. The voice engine picks up on these cues to adjust timing, emotion, and delivery. Generate each speaker's lines separately with different voices, then combine in your editor.

Dialogue Format

Use [direction] tags inline to control how lines are delivered:

Speaker 1: [starting to speak] So I was thinking we could—
Speaker 2: [jumping in] —test our new timing features?
Speaker 1: [surprised] Exactly! How did you—
Speaker 2: [overlapping] —know what you were thinking? Lucky guess!
Speaker 1: [pause] Sorry, go ahead.
Speaker 2: [cautiously] Okay, so if we both try to talk at the same time—
Speaker 1: [overlapping] —we'll probably crash the system!
Speaker 2: [panicking] Wait, are we crashing? I can't tell if this is a feature or a—
Speaker 1: [interrupting, then stopping abruptly] Bug! ...Did I just cut you off again?
Speaker 2: [sighing] Yes, but honestly? This is kind of fun.
Speaker 1: [mischievously] Race you to the next sentence!
Speaker 2: [laughing] We're definitely going to break something!

How to Produce Multi-Speaker Audio

  1. Write the full dialogue script with speaker labels
  2. Extract Speaker 1's lines, paste them into the agent, choose a voice (e.g. Liam), and generate
  3. Extract Speaker 2's lines, choose a different voice (e.g. Sarah), and generate
  4. Combine the audio files in an editor, overlapping where marked
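
Extracting each speaker's lines by hand gets tedious for longer dialogues. The sketch below groups lines by speaker label, assuming every line follows the "Label: text" shape shown in the Dialogue Format example; `split_by_speaker` is a hypothetical helper, and any line containing a colon is treated as labeled.

```python
import re
from collections import defaultdict

# Matches lines such as "Speaker 1: [surprised] Exactly!"
LINE_PATTERN = re.compile(r"^\s*([^:\[\]]+?)\s*:\s*(.+)$")


def split_by_speaker(script):
    """Group dialogue lines by speaker label, preserving line order."""
    lines_by_speaker = defaultdict(list)
    for raw_line in script.splitlines():
        match = LINE_PATTERN.match(raw_line)
        if match:
            speaker, text = match.groups()
            lines_by_speaker[speaker].append(text)
    return dict(lines_by_speaker)


script = """Speaker 1: [pause] Sorry, go ahead.
Speaker 2: [sighing] Yes, but honestly? This is kind of fun.
Speaker 1: [mischievously] Race you to the next sentence!"""

for speaker, lines in split_by_speaker(script).items():
    print(f"--- {speaker} ---")
    print("\n".join(lines))
```

Each printed group can then be pasted into the agent and generated with its own voice.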

Direction Tags That Work Well

Tag | Effect
[starting to speak] | Gradual onset, natural beginning
[jumping in] | Quick, eager interruption
[surprised] | Raised pitch, taken aback
[cautiously] | Careful, measured delivery
[overlapping] | Rushed, talking over someone
[panicking] | Fast, stressed delivery
[interrupting, then stopping abruptly] | Sharp cut, sudden silence
[pause] | Brief silence before continuing

Pronunciation Control (CMU Phonemes)

Note: Phoneme tags only work when the agent uses the Flash v2 or English v1 model internally. They apply only to English words; for other languages, use alias tags (see Pronunciation Dictionaries below).

What Is CMU?

The CMU Pronouncing Dictionary (from Carnegie Mellon University) maps words to ARPABET phonemes, a standardized set of sound codes. Stress is marked with a digit on each vowel: 0 (unstressed), 1 (primary), 2 (secondary).

Example: "hello" → HH AH0 L OW1

SSML Phoneme Tag

<phoneme alphabet="cmu-arpabet" ph="HH AH0 L OW1">hello</phoneme>

Consonant Reference

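Sound | CMU | Example
b | B | bat
d | D | dog
f | F | fan
g | G | goat
h | HH | hat
j (jar) | JH | joy
k | K | kite
l | L | leg
m | M | man
n | N | net
ng | NG | sing
p | P | pen
r | R | red
s | S | sun
sh | SH | ship
t | T | top
th (thin) | TH | think
th (this) | DH | this
v | V | van
w | W | wet
y | Y | yes
z | Z | zoo
zh | ZH | measure
ch | CH | chin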

Vowel Reference

Sound | CMU | Example
ah (sofa) | AH0 / AH1 | about
ae (cat) | AE1 | cat
ee (see) | IY1 | see
eh (bed) | EH1 | bed
ih (sit) | IH1 | sit
oh (go) | OW1 | go
oo (blue) | UW1 | blue
uh (put) | UH1 | put
aw (saw) | AO1 | saw
er (bird) | ER1 | bird
ay (my) | AY1 | my
oy (boy) | OY1 | boy
ow (cow) | AW1 | cow
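
A quick check can catch typos in a phoneme string before it goes into a tag. The sketch below validates tokens against the consonant and vowel references in this guide; `invalid_phonemes` is a hypothetical helper. The standard ARPABET set also includes the vowels AA (as in "father") and EY (as in "say"), which are not listed in the tables, so they are added here with a comment.

```python
# Symbols taken from the consonant and vowel references in this guide.
CONSONANTS = {
    "B", "D", "F", "G", "HH", "JH", "K", "L", "M", "N", "NG", "P",
    "R", "S", "SH", "T", "TH", "DH", "V", "W", "Y", "Z", "ZH", "CH",
}
VOWELS = {
    "AH", "AE", "IY", "EH", "IH", "OW", "UW", "UH", "AO", "ER",
    "AY", "OY", "AW",
    "AA", "EY",  # standard ARPABET vowels not listed in the tables
}
STRESS_DIGITS = {"0", "1", "2"}


def invalid_phonemes(ph):
    """Return the tokens in a space-separated phoneme string that are not valid."""
    bad = []
    for token in ph.split():
        if token in CONSONANTS:
            continue
        # Vowels must carry exactly one trailing stress digit.
        if token[:-1] in VOWELS and token[-1] in STRESS_DIGITS:
            continue
        bad.append(token)
    return bad


print(invalid_phonemes("HH AH0 L OW1"))  # valid string, so nothing is reported
print(invalid_phonemes("HH AH L OW1"))   # "AH" is missing its stress digit
```

The check is purely syntactic; it cannot tell whether the phonemes actually produce the pronunciation you want.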

Step-by-Step: Fixing a Pronunciation

  1. Break the word into syllables: “ProAgri” → Pro-Ag-ri
  2. Map each sound to its CMU code: P R OW1 AE1 G R IY0
  3. Wrap in SSML:
<phoneme alphabet="cmu-arpabet" ph="P R OW1 AE1 G R IY0">ProAgri</phoneme>

Common Examples

Word | Phonemes
Daniel | D AE1 N Y AH0 L
Nike | N AY1 K IY0
Xander | Z AH0 N D ER1
Agri | AE1 G R IY0
ProAgri | P R OW1 AE1 G R IY0

Tips

  • Adjust stress first, then vowels; stress errors are more noticeable
  • Modify one phoneme at a time and re-test
  • CMU Arpabet is more consistent than IPA with current voice models
  • Each word needs its own phoneme tag
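
Since each word needs its own tag, wrapping every occurrence by hand is error-prone. The following sketch builds the SSML tag shown earlier and applies it to whole-word, case-sensitive matches in a script. `phoneme_tag` and `apply_pronunciations` are hypothetical helpers; whether the agent accepts tagged text pasted this way depends on the model note above.

```python
import re
from xml.sax.saxutils import escape, quoteattr


def phoneme_tag(word, ph):
    """Wrap one word in an SSML phoneme tag using the cmu-arpabet alphabet."""
    return (
        f'<phoneme alphabet="cmu-arpabet" ph={quoteattr(ph)}>'
        f"{escape(word)}</phoneme>"
    )


def apply_pronunciations(script, pronunciations):
    """Tag every whole-word occurrence of each key in `pronunciations`."""
    if not pronunciations:
        return script
    # Longest words first so a longer entry is preferred over a shorter one.
    words = sorted(pronunciations, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(w) for w in words) + r")\b")
    return pattern.sub(
        lambda m: phoneme_tag(m.group(1), pronunciations[m.group(1)]), script
    )


print(apply_pronunciations(
    "ProAgri has been at the forefront.",
    {"ProAgri": "P R OW1 AE1 G R IY0"},
))
```

A single regex pass is used so that text inside an already inserted tag is never matched a second time.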

Pronunciation Dictionaries

For words you use repeatedly, create a dictionary file instead of adding inline tags each time. Dictionaries use PLS (XML) format:

<?xml version="1.0" encoding="UTF-8"?>
<lexicon version="1.0"
  xmlns="http://www.w3.org/2005/01/pronunciation-lexicon"
  alphabet="cmu-arpabet" xml:lang="en-US">
  <lexeme>
    <grapheme>ProAgri</grapheme>
    <phoneme>P R OW1 AE1 G R IY0</phoneme>
  </lexeme>
  <lexeme>
    <grapheme>Claughton</grapheme>
    <alias>Cloffton</alias>
  </lexeme>
</lexicon>

Key Rules

  • First match wins — the system uses only the first matching replacement
  • Case-sensitive — create separate entries for “ProAgri” and “proagri”
  • Alias tags work across all models — use them when phoneme tags aren't supported
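
For larger word lists, the dictionary file can be generated rather than typed. This sketch produces PLS XML in the same shape as the example above using Python's standard library; `build_lexicon` is a hypothetical helper, and how the resulting file is uploaded to the agent is outside its scope.

```python
import xml.etree.ElementTree as ET

PLS_NS = "http://www.w3.org/2005/01/pronunciation-lexicon"
XML_NS = "http://www.w3.org/XML/1998/namespace"


def build_lexicon(phonemes, aliases, lang="en-US"):
    """Return PLS XML text with one lexeme per dictionary entry."""
    ET.register_namespace("", PLS_NS)  # serialize PLS as the default namespace
    lexicon = ET.Element(
        f"{{{PLS_NS}}}lexicon",
        {"version": "1.0", "alphabet": "cmu-arpabet", f"{{{XML_NS}}}lang": lang},
    )
    for tag, mapping in (("phoneme", phonemes), ("alias", aliases)):
        for word, value in mapping.items():
            lexeme = ET.SubElement(lexicon, f"{{{PLS_NS}}}lexeme")
            ET.SubElement(lexeme, f"{{{PLS_NS}}}grapheme").text = word
            ET.SubElement(lexeme, f"{{{PLS_NS}}}{tag}").text = value
    ET.indent(lexicon)  # requires Python 3.9+
    body = ET.tostring(lexicon, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body


print(build_lexicon(
    {"ProAgri": "P R OW1 AE1 G R IY0"},
    {"Claughton": "Cloffton"},
))
```

Because matching is case-sensitive, add each casing you expect in scripts (for example both "ProAgri" and "proagri") as separate keys.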

Text Normalization

Voice models work best with written-out text. Digits, symbols, and abbreviations often cause mispronunciations. Normalize them before pasting:

Raw | Write as
$42.50 | forty-two dollars and fifty cents
123-456-7890 | one two three, four five six, seven eight nine zero
9:23 AM | nine twenty-three A M
Dr. Smith | Doctor Smith
5kg | five kilograms
25% | twenty-five percent
Audio Tags (Voice Inflections)

Insert these tags in your script to control emotion and delivery. You can click them directly from the Voice Inflections panel on the generator page, or type them manually.

Emotions

[excited] We just hit our target!
[sarcastic] Oh, what a surprise.
[curious] Have you ever wondered why?
[crying] I can't believe it's over.
[mischievously] I have a little secret...

Delivery

[whispers] This is just between us.
[sighs] Another Monday morning.
[exhales] Finally, it's done.
[muttering] I knew this would happen...
[clears throat] Right, let's begin.

Reactions & Sounds

[laughs] That was brilliant!
[chuckles] Classic.
[gulps] This is it...
[applause]

Tips

  • Tag effectiveness depends on the voice; some respond better than others
  • Match the tag to the voice's natural range; a calm voice won't shout convincingly
  • Place the tag before the text it should affect
  • You can combine tags with punctuation for more nuance
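
A misspelled tag (for example [whisper] instead of [whispers]) may simply be ignored or read oddly. The sketch below lists bracketed tags in a script and flags any that do not appear in this guide. `KNOWN_TAGS` is assembled from the examples above and is an assumption, not an official list; the voice engine may respond to other directions too.

```python
import re

# Tags that appear in this guide; treat the list as a starting point,
# since the voice engine may respond to other bracketed directions too.
KNOWN_TAGS = {
    "excited", "sarcastic", "curious", "crying", "mischievously",
    "whispers", "sighs", "exhales", "muttering", "clears throat",
    "laughs", "chuckles", "gulps", "applause", "pause", "long pause",
}
TAG_PATTERN = re.compile(r"\[([^\[\]]+)\]")


def find_tags(script):
    """Return every bracketed tag in the script, in order of appearance."""
    return TAG_PATTERN.findall(script)


def unlisted_tags(script):
    """Return tags that are not in KNOWN_TAGS (possible typos or custom directions)."""
    return [tag for tag in find_tags(script) if tag.strip().lower() not in KNOWN_TAGS]


print(unlisted_tags("[excited] We just hit our target! [whisper] Keep it quiet."))
```

An unlisted tag is not necessarily wrong; the check only prompts a second look.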

Troubleshooting

Problem | Likely Cause | Solution
Words mispronounced | Model guessing | Add CMU phoneme tags or use phonetic spelling
Numbers read wrong | Not normalized | Write out numbers as words
Emotion sounds flat | Voice doesn't match | Try a different voice or add narrative cues
Audio tag ignored | Voice incompatibility | Try a different voice; not all respond equally
Speed sounds unnatural | Extreme speed value | Stay between 0.9x and 1.25x for best quality
Pauses missing | No punctuation | Use commas, periods, ellipses, or dashes
Generation fails | Script too long or empty | Check character count; split very long scripts
Hallucinated words | Ambiguous text | Simplify complex sentences; remove special chars
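
For the "split very long scripts" fix, a script can be cut at sentence boundaries so no chunk ends mid-thought. This is a sketch under assumptions: `split_script` is a hypothetical helper, and the `max_chars` default of 2500 is a placeholder, since this guide does not state the actual character limit.

```python
import re


def split_script(script, max_chars=2500):
    """Split a script into chunks of at most `max_chars`, breaking after sentences.

    A single sentence longer than `max_chars` is kept whole as its own chunk.
    """
    # Zero-width split after sentence-ending punctuation keeps the original
    # whitespace (including line breaks) attached to the following sentence.
    pieces = re.split(r"(?<=[.!?])(?=\s)", script.strip())
    chunks, current = [], ""
    for piece in pieces:
        if current and len((current + piece).strip()) > max_chars:
            chunks.append(current.strip())
            current = piece
        else:
            current += piece
    if current.strip():
        chunks.append(current.strip())
    return chunks


for number, chunk in enumerate(split_script("One. Two. Three.", max_chars=9), start=1):
    print(number, repr(chunk))
```

Line breaks inside a chunk are preserved because they act as pauses; generate each chunk separately and join the audio in an editor.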