Voice Over Agent Guide
Everything you need to produce professional voice-overs with the Voice Over Agent. This guide covers voice selection, delivery control, pronunciation fine-tuning, and troubleshooting — so your scripts sound exactly how you intend.
| Feature | Details |
|---|---|
| Supported languages | 70+ |
| Available voices | 8 curated + thousands via Voice Library |
| Output format | MP3 (high-quality, 44.1 kHz) |
| Speed range | 0.75x – 1.5x |
Quick Start
1. Choose a voice from the dropdown — each has a unique tone and style.
2. Select your desired speed (1.0x is natural; 1.1x is the default).
3. Paste or type your script in the text area, or click Upload .txt to load a file.
4. Optionally insert Voice Inflections (audio tags) at the cursor position.
5. Click Generate Voice Over — the MP3 downloads automatically.
Voices & Settings
Available Voices
| Voice | Style |
|---|---|
| Liam Callahan | Narrative, American Male |
| Sarah | Mature & Confident, American Female |
| Lily | Velvety Actress, British Female |
| Roger | Laid-Back & Casual, American Male |
| Brian | Deep & Resonant, American Male |
| River | Relaxed & Neutral, American |
| Alice | Clear Educator, British Female |
| Charlie | Deep & Confident, Australian Male |
Choosing the Right Voice
- Corporate narration: Liam Callahan, Brian, or Alice — authoritative and clear
- Conversational ads: Roger or River — relaxed, approachable
- Dramatic / storytelling: Lily or Sarah — expressive range
- Explainer / educational: Alice or Charlie — steady pacing
Speed Guide
| Speed | Best For |
|---|---|
| 0.75x | Slow, deliberate delivery — emphasis-heavy content, dramatic reads |
| 0.9x | Slightly relaxed — educational, explainer videos |
| 1.0x | Natural speaking pace — general narration |
| 1.1x (default) | Slightly upbeat — promos, social media ads |
| 1.25x | Energetic — fast-paced ads, teasers |
| 1.5x | Rapid — legal disclaimers, fine-print reads |
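To see how a speed choice affects running time, a rough estimate can help before you generate. The sketch below is not part of the Voice Over Agent: `estimate_duration_seconds` is a hypothetical helper, and the 150 words-per-minute rate at 1.0x is an assumption rather than a documented figure. Only the 0.75x to 1.5x range comes from this guide.

```python
import re

MIN_SPEED = 0.75  # lower bound from the speed table
MAX_SPEED = 1.5   # upper bound from the speed table


def estimate_duration_seconds(script: str, speed: float = 1.1,
                              wpm_at_1x: float = 150.0) -> float:
    """Rough read time in seconds for a script at a speed multiplier.

    wpm_at_1x is an assumed speaking rate, not a value from the agent.
    """
    if not MIN_SPEED <= speed <= MAX_SPEED:
        raise ValueError(f"speed must be between {MIN_SPEED}x and {MAX_SPEED}x")
    # Drop bracketed audio tags such as [pause] so they are not counted as words.
    text = re.sub(r"\[[^\]]*\]", " ", script)
    words = re.findall(r"[A-Za-z0-9']+", text)
    return len(words) / (wpm_at_1x * speed) * 60.0
```

Pauses and audio tags add time on top of this figure, so treat the result as a lower bound when timing a script against a video cut.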
Stability Presets
The Stability setting is the most important control in the Voice Over Agent. It determines how closely the generated voice follows the original reference audio.
| Preset | Behavior | Best For |
|---|---|---|
| Creative | More emotional and expressive, but prone to hallucinations | Dramatic reads, storytelling, character voices |
| Natural | Closest to the original voice recording — balanced and neutral | General narration, promos, most use cases (default) |
| Robust | Highly stable but less responsive to directional prompts — consistent output | Corporate, legal, long-form where consistency matters |
How Stability Affects Audio Tags
For maximum expressiveness with audio tags (like [excited] or [whispers]), use Creative or Natural. Robust reduces responsiveness to these directional prompts.
Neutral Voices + Stability
Neutral voices (like River or Alice) tend to be more stable across languages and styles, providing reliable baseline performance. Pairing a neutral voice with Natural stability gives you a dependable foundation without sacrificing moderate expressiveness.
- Neutral voice + Creative — good range of expression with fewer hallucinations than emotive voices
- Neutral voice + Natural — safest all-round choice
- Emotive voice + Robust — tames an expressive voice for consistent output
Delivery Control
Pauses
The voice engine reads punctuation as natural pauses. Use these in your script to control rhythm:
| Technique | Effect |
|---|---|
| , (comma) | Brief pause |
| . (period) | Full stop — natural sentence break |
| ... (ellipsis) | Weighted, dramatic pause |
| — or -- | Short, abrupt pause (thought break) |
| Line break | Separates ideas with a breath |
Emphasis
Use CAPITALIZATION to stress individual words:
We don't just grow crops. We grow FUTURES.
Emotion
Emotion comes from text context, not settings. Add narrative cues to guide the voice's tone:
(excitedly) This changes everything for South African farmers!
(with quiet confidence) ProAgri has been at the forefront... for decades.
Tip: generate with cues, then re-generate without them if you prefer a subtler read. The cues “prime” the voice even when removed from surrounding text.
Break Times & Pauses
Use the Insert Pause buttons below the script area to add timed breaks, or type them manually. The voice engine interprets these as natural pauses of varying length.
| Duration | What to Type | Use Case |
|---|---|---|
| ~0.5s | [pause] | Brief breath between clauses |
| ~1.0s | [long pause] | Between sentences or ideas |
| ~1.5s | ... | Dramatic pause, building tension |
| ~2.0s | ... ... | Scene transition, topic change |
| ~3.0s | ... ... ... | Long break between sections |
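When timing a script, it can be useful to add up the pauses it contains. The following is a minimal sketch, assuming the approximate durations in the table above; `total_pause_seconds` is a hypothetical helper, and treating runs of more than three ellipsis groups as 3.0 seconds is an assumption.

```python
import re

# Approximate durations taken from the table above.
TAG_SECONDS = {"[pause]": 0.5, "[long pause]": 1.0}
ELLIPSIS_SECONDS = {1: 1.5, 2: 2.0, 3: 3.0}


def total_pause_seconds(script: str) -> float:
    """Sum the approximate length of all explicit pause markers in a script."""
    total = 0.0
    for tag, seconds in TAG_SECONDS.items():
        total += script.count(tag) * seconds
    # A run is one or more "..." groups separated by single spaces.
    for run in re.findall(r"\.{3}(?: \.{3})*", script):
        groups = run.count("...")
        # Assumption: runs longer than three groups are capped at 3.0 seconds.
        total += ELLIPSIS_SECONDS[min(groups, 3)]
    return total
```

Commas, periods, dashes, and line breaks also add short pauses, but this guide gives no durations for them, so the sketch leaves them out.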
Other Pause Techniques
- Comma (,): brief, natural pause
- Period (.): full sentence stop
- Dash (— or --): abrupt thought break
- Line break: separates ideas with a breath
Example: Using Pauses for Impact
We don't just grow crops.
[long pause]
We grow FUTURES.
... ...
ProAgri -- where agriculture meets innovation.
Multi-Speaker Dialogue
You can write dialogue-style scripts with speaker labels and stage directions. The voice engine picks up on these cues to adjust timing, emotion, and delivery. Generate each speaker's lines separately with different voices, then combine in your editor.
Dialogue Format
Use [direction] tags inline to control how lines are delivered:
Speaker 1: [starting to speak] So I was thinking we could—
Speaker 2: [jumping in] —test our new timing features?
Speaker 1: [surprised] Exactly! How did you—
Speaker 2: [overlapping] —know what you were thinking? Lucky guess!
Speaker 1: [pause] Sorry, go ahead.
Speaker 2: [cautiously] Okay, so if we both try to talk at the same time—
Speaker 1: [overlapping] —we'll probably crash the system!
Speaker 2: [panicking] Wait, are we crashing? I can't tell if this is a feature or a—
Speaker 1: [interrupting, then stopping abruptly] Bug! ...Did I just cut you off again?
Speaker 2: [sighing] Yes, but honestly? This is kind of fun.
Speaker 1: [mischievously] Race you to the next sentence!
Speaker 2: [laughing] We're definitely going to break something!
How to Produce Multi-Speaker Audio
1. Write the full dialogue script with speaker labels
2. Extract Speaker 1's lines — paste into the agent, choose a voice (e.g. Liam), generate
3. Extract Speaker 2's lines — choose a different voice (e.g. Sarah), generate
4. Combine the audio files in an editor, overlapping where marked
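Extracting each speaker's lines by hand gets tedious for longer scripts. The sketch below shows one way to automate that step, assuming every line starts with a `Label:` prefix as in the dialogue format above; `split_by_speaker` is a hypothetical helper and not a feature of the agent.

```python
import re
from collections import defaultdict

# A speaker label is everything before the first colon, with no brackets in it.
LINE_RE = re.compile(r"^\s*([^:\[\]]+?):\s*(.+)$")


def split_by_speaker(script: str) -> dict[str, list[str]]:
    """Group dialogue lines by speaker label, keeping inline [direction] tags."""
    lines_by_speaker = defaultdict(list)
    for raw in script.splitlines():
        match = LINE_RE.match(raw)
        if match:  # lines without a label are skipped
            speaker, text = match.groups()
            lines_by_speaker[speaker.strip()].append(text.strip())
    return dict(lines_by_speaker)
```

Joining one speaker's list with line breaks, for example `"\n".join(result["Speaker 1"])`, gives a block ready to paste into the script area.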
Direction Tags That Work Well
| Tag | Effect |
|---|---|
| [starting to speak] | Gradual onset, natural beginning |
| [jumping in] | Quick, eager interruption |
| [surprised] | Raised pitch, taken aback |
| [cautiously] | Careful, measured delivery |
| [overlapping] | Rushed, talking over someone |
| [panicking] | Fast, stressed delivery |
| [interrupting, then stopping abruptly] | Sharp cut, sudden silence |
| [pause] | Brief silence before continuing |
Pronunciation Control (CMU Phonemes)
What Is CMU?
CMU refers to the Carnegie Mellon University Pronouncing Dictionary, which maps words to ARPABET phonemes, a standardized set of sound codes. Stress is marked with a digit after each vowel: 0 (unstressed), 1 (primary), 2 (secondary).
Example: "hello" → HH AH0 L OW1
SSML Phoneme Tag
<phoneme alphabet="cmu-arpabet" ph="HH AH0 L OW1">hello</phoneme>
Consonant Reference
| Sound | CMU | Example |
|---|---|---|
| b | B | bat |
| d | D | dog |
| f | F | fan |
| g | G | goat |
| h | HH | hat |
| j (jar) | JH | joy |
| k | K | kite |
| l | L | leg |
| m | M | man |
| n | N | net |
| ng | NG | sing |
| p | P | pen |
| r | R | red |
| s | S | sun |
| sh | SH | ship |
| t | T | top |
| th (thin) | TH | think |
| th (this) | DH | this |
| v | V | van |
| w | W | wet |
| y | Y | yes |
| z | Z | zoo |
| zh | ZH | measure |
| ch | CH | chin |
Vowel Reference
| Sound | CMU | Example |
|---|---|---|
| ah (sofa) | AH0 / AH1 | about |
| ae (cat) | AE1 | cat |
| ee (see) | IY1 | see |
| eh (bed) | EH1 | bed |
| ih (sit) | IH1 | sit |
| oh (go) | OW1 | go |
| oo (blue) | UW1 | blue |
| uh (put) | UH1 | put |
| aw (saw) | AO1 | saw |
| er (bird) | ER1 | bird |
| ay (my) | AY1 | my |
| oy (boy) | OY1 | boy |
| ow (cow) | AW1 | cow |
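A quick check can catch typos in a phoneme string before it goes into a tag. The following is a minimal sketch, with `check_phonemes` as a hypothetical helper. The symbol sets are the standard CMU dictionary set, which also includes AA (as in "father") and EY (as in "say"), two vowels not listed in the tables above.

```python
CONSONANTS = {
    "B", "CH", "D", "DH", "F", "G", "HH", "JH", "K", "L", "M", "N",
    "NG", "P", "R", "S", "SH", "T", "TH", "V", "W", "Y", "Z", "ZH",
}
VOWELS = {
    "AA", "AE", "AH", "AO", "AW", "AY", "EH", "ER",
    "EY", "IH", "IY", "OW", "OY", "UH", "UW",
}


def check_phonemes(ph: str) -> list[str]:
    """Return a list of problems found in a space-separated phoneme string."""
    problems = []
    for symbol in ph.split():
        if symbol in CONSONANTS:
            continue
        base, stress = symbol[:-1], symbol[-1]
        if base in VOWELS and stress in "012":
            continue  # vowel with a valid stress digit
        if symbol in VOWELS:
            problems.append(f"{symbol}: vowel is missing a stress digit (0, 1, or 2)")
        else:
            problems.append(f"{symbol}: not a recognized symbol")
    return problems
```

An empty list means every symbol is well formed. It does not mean the pronunciation is right, so listening to the result is still the real test.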
Step-by-Step: Fixing a Pronunciation
1. Break the word into syllables: “ProAgri” → Pro-Ag-ri
2. Map each sound to its CMU code: P R OW1 AE1 G R IY0
3. Wrap in SSML:
<phoneme alphabet="cmu-arpabet" ph="P R OW1 AE1 G R IY0">ProAgri</phoneme>
Common Examples
| Word | Phonemes |
|---|---|
| Daniel | D AE1 N Y AH0 L |
| Nike | N AY1 K IY0 |
| Xander | Z AE1 N D ER0 |
| Agri | AE1 G R IY0 |
| ProAgri | P R OW1 AE1 G R IY0 |
Tips
- Adjust stress first, then vowels — stress errors are more noticeable
- Modify one phoneme at a time and re-test
- CMU Arpabet is more consistent than IPA with current voice models
- Each word needs its own phoneme tag
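Because each word needs its own tag, wrapping several names across a long script is easier to do programmatically. The sketch below uses the SSML tag format from this guide; `phoneme_tag` and `apply_pronunciations` are hypothetical helpers, and matching is case-sensitive and whole-word only by assumption.

```python
import re
from xml.sax.saxutils import escape, quoteattr


def phoneme_tag(word: str, ph: str) -> str:
    """Wrap one word in an SSML phoneme tag using the cmu-arpabet alphabet."""
    return f'<phoneme alphabet="cmu-arpabet" ph={quoteattr(ph)}>{escape(word)}</phoneme>'


def apply_pronunciations(script: str, pronunciations: dict[str, str]) -> str:
    """Replace every whole-word match with its own phoneme tag."""
    if not pronunciations:
        return script
    # Longest words first, so a longer entry is tried before a shorter one.
    words = sorted(pronunciations, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(w) for w in words) + r")\b")
    return pattern.sub(
        lambda m: phoneme_tag(m.group(1), pronunciations[m.group(1)]), script
    )
```

The substitution runs in a single pass, so text inside a tag that was just inserted is not wrapped a second time.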
Pronunciation Dictionaries
For words you use repeatedly, create a dictionary file instead of adding inline tags each time. Dictionaries use PLS (XML) format:
<?xml version="1.0" encoding="UTF-8"?>
<lexicon version="1.0"
    xmlns="http://www.w3.org/2005/01/pronunciation-lexicon"
    alphabet="cmu-arpabet" xml:lang="en-US">
  <lexeme>
    <grapheme>ProAgri</grapheme>
    <phoneme>P R OW1 AE1 G R IY0</phoneme>
  </lexeme>
  <lexeme>
    <grapheme>Claughton</grapheme>
    <alias>Cloffton</alias>
  </lexeme>
</lexicon>
Key Rules
- First match wins — the system uses only the first matching replacement
- Case-sensitive — create separate entries for “ProAgri” and “proagri”
- Alias tags work across all models — use them when phoneme tags aren't supported
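If the word list lives in a spreadsheet or a script, the dictionary file can be generated instead of hand-written. The following is a minimal sketch using Python's standard `xml.etree.ElementTree`; `build_lexicon` is a hypothetical helper, and it assumes Python 3.9 or later for `ET.indent`. Since entries are case-sensitive, add each casing you need as its own key.

```python
import xml.etree.ElementTree as ET

PLS_NS = "http://www.w3.org/2005/01/pronunciation-lexicon"
XML_NS = "http://www.w3.org/XML/1998/namespace"


def build_lexicon(phonemes: dict[str, str], aliases: dict[str, str],
                  lang: str = "en-US") -> str:
    """Build a PLS lexicon string from grapheme-to-phoneme and grapheme-to-alias maps."""
    ET.register_namespace("", PLS_NS)  # emit PLS as the default namespace
    lexicon = ET.Element(f"{{{PLS_NS}}}lexicon", {
        "version": "1.0",
        "alphabet": "cmu-arpabet",
        f"{{{XML_NS}}}lang": lang,  # serialized as xml:lang
    })
    for grapheme, ph in phonemes.items():
        lexeme = ET.SubElement(lexicon, f"{{{PLS_NS}}}lexeme")
        ET.SubElement(lexeme, f"{{{PLS_NS}}}grapheme").text = grapheme
        ET.SubElement(lexeme, f"{{{PLS_NS}}}phoneme").text = ph
    for grapheme, alias in aliases.items():
        lexeme = ET.SubElement(lexicon, f"{{{PLS_NS}}}lexeme")
        ET.SubElement(lexeme, f"{{{PLS_NS}}}grapheme").text = grapheme
        ET.SubElement(lexeme, f"{{{PLS_NS}}}alias").text = alias
    ET.indent(lexicon)  # pretty-print; requires Python 3.9+
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + ET.tostring(lexicon, encoding="unicode")
```

Using dictionaries as input means each grapheme can appear only once per map, which sidesteps the first-match-wins rule for duplicates within the same map.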
Text Normalization
Voice models work best with written-out text. Digits, symbols, and abbreviations often cause mispronunciations. Normalize them before pasting:
| Raw | Write as |
|---|---|
| $42.50 | forty-two dollars and fifty cents |
| 123-456-7890 | one two three, four five six, seven eight nine zero |
| 9:23 AM | nine twenty-three A M |
| Dr. Smith | Doctor Smith |
| 5kg | five kilograms |
| 25% | twenty-five percent |
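Some of these rewrites are mechanical enough to script. The sketch below covers four of the rows above (currency, percentages, kilograms, and "Dr."); `number_to_words` and `normalize` are hypothetical helpers, and the 0 to 999 limit is a simplification for illustration. Phone numbers and times are left out and still need rewriting by hand.

```python
import re

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty",
        "sixty", "seventy", "eighty", "ninety"]


def number_to_words(n: int) -> str:
    """Spell out an integer from 0 to 999 (limit chosen for this sketch)."""
    if not 0 <= n <= 999:
        raise ValueError("this sketch only handles 0-999")
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] + ("-" + ONES[ones] if ones else "")
    hundreds, rest = divmod(n, 100)
    return ONES[hundreds] + " hundred" + (" " + number_to_words(rest) if rest else "")


def _count(digits: str, noun: str) -> str:
    """Spell out a quantity and pluralize the noun unless the value is 1."""
    n = int(digits)
    return f"{number_to_words(n)} {noun}{'' if n == 1 else 's'}"


def normalize(text: str) -> str:
    """Apply a few of the rewrites from the table above."""
    # $42.50 -> forty-two dollars and fifty cents
    text = re.sub(r"\$(\d{1,3})\.(\d{2})\b",
                  lambda m: f"{_count(m[1], 'dollar')} and {_count(m[2], 'cent')}", text)
    # 25% -> twenty-five percent
    text = re.sub(r"\b(\d{1,3})%",
                  lambda m: f"{number_to_words(int(m[1]))} percent", text)
    # 5kg -> five kilograms
    text = re.sub(r"\b(\d{1,3})\s?kg\b", lambda m: _count(m[1], "kilogram"), text)
    # Dr. Smith -> Doctor Smith
    text = re.sub(r"\bDr\.\s", "Doctor ", text)
    return text
```

Numbers outside the patterns, such as four-digit amounts, pass through unchanged, so a final read-through for leftover digits is still worthwhile.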
Troubleshooting
| Problem | Likely Cause | Solution |
|---|---|---|
| Words mispronounced | Model guessing | Add CMU phoneme tags or use phonetic spelling |
| Numbers read wrong | Not normalized | Write out numbers as words |
| Emotion sounds flat | Voice doesn't match | Try a different voice or add narrative cues |
| Audio tag ignored | Voice incompatibility | Try a different voice — not all respond equally |
| Speed sounds unnatural | Extreme speed value | Stay between 0.9x–1.25x for best quality |
| Pauses missing | No punctuation | Use commas, periods, ellipses, or dashes |
| Generation fails | Script too long or empty | Check character count; split very long scripts |
| Hallucinated words | Ambiguous text | Simplify complex sentences; remove special chars |
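For the "split very long scripts" fix, breaking at sentence boundaries keeps each part natural to read. The following is a minimal sketch: `split_script` is a hypothetical helper, and `max_chars` is a placeholder you would set yourself, since this guide does not state the agent's character limit.

```python
import re


def split_script(script: str, max_chars: int) -> list[str]:
    """Split a script into chunks of at most max_chars, breaking after sentences."""
    # Split on whitespace that follows sentence-ending punctuation.
    sentences = re.split(r"(?<=[.!?])\s+", script.strip())
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        if not sentence:
            continue
        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate) <= max_chars:
            current = candidate
        else:
            if current:
                chunks.append(current)
            # A single sentence longer than max_chars is kept whole.
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

Sentences are rejoined with single spaces, so line breaks that you rely on for a breath between ideas are flattened; re-add them in each chunk if the rhythm matters.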