What you can build
Generate talking-avatar videos programmatically. Same engine and credit pricing as the web app. Available on paid plans (Basic and higher).
Two ways to integrate: a plain REST API for backends and scripts, and a native MCP (Streamable HTTP) server for AI agents (Claude, Cursor, Codex, ChatGPT-with-MCP).
New here? Start with the step-by-step Manual — it walks through scripts, b-roll, the REST API and MCP with worked examples.
Authentication
Every request needs a Bearer token. Tokens are shown once at creation time — store them in your secret manager.
Generate a token in Settings → API. Pass it on every request:
Authorization: Bearer vlm_live_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
The base URL for the REST API is:
https://staging.vlogme.ai/api/public/v1
Endpoints
A compact, predictable surface. JSON in, JSON out.
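| Method | Path | Description |
| ------ | ----------- | ---------------------------- |
| GET | /me | User id, plan, credit balance |
| GET | /voices | Voice IDs for synthesis |
| POST | /videos | Start a new render |
| GET | /videos/:id | Status + signed download URL |
| GET | /videos | List recent renders (paged) |
| DELETE | /videos/:id | Delete + refund if in-progress |
GET /me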
curl https://staging.vlogme.ai/api/public/v1/me \
-H "Authorization: Bearer $VLOGME_TOKEN"
GET /voices
curl https://staging.vlogme.ai/api/public/v1/voices \
-H "Authorization: Bearer $VLOGME_TOKEN"
GET /videos/:id
Returns current status. When status is completed, video_url is a signed download URL valid for 1 hour.
curl https://staging.vlogme.ai/api/public/v1/videos/$VIDEO_ID \
-H "Authorization: Bearer $VLOGME_TOKEN"
GET /videos
Lists your recent renders (newest first). Supports ?limit (max 100) and ?cursor.
DELETE /videos/:id
Deletes a render. If it's still in progress, credits are refunded.
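The endpoints above can be wrapped in a few lines of Python. This is a minimal sketch, not an official client: the helper names `build_request`, `send` and `wait_for_video` are hypothetical, the 600-second default timeout is an arbitrary choice, and the code assumes the GET /videos/:id body carries `status` at the top level. The 10-second interval and the `failed` status come from the polling guidance elsewhere in these docs.

```python
import json
import time
import urllib.request

BASE_URL = "https://staging.vlogme.ai/api/public/v1"


def build_request(method, path, token, body=None):
    """Build (but do not send) an authenticated request for the REST API."""
    headers = {"Authorization": f"Bearer {token}"}
    data = None
    if body is not None:
        data = json.dumps(body).encode("utf-8")
        headers["Content-Type"] = "application/json"
    return urllib.request.Request(BASE_URL + path, data=data,
                                  headers=headers, method=method)


def send(request):
    """Send a prepared request and decode the JSON response body."""
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.loads(response.read().decode("utf-8"))


def wait_for_video(get_video, video_id, interval=10.0, timeout=600.0,
                   sleep=time.sleep):
    """Poll until the render is completed or failed, or the timeout elapses.

    get_video is any callable that returns the decoded GET /videos/:id body.
    Assumption: that body exposes a top-level "status" field.
    """
    waited = 0.0
    while True:
        video = get_video(video_id)
        if video.get("status") in ("completed", "failed"):
            return video
        if waited >= timeout:
            raise TimeoutError(f"video {video_id} not finished after {waited:.0f}s")
        sleep(interval)
        waited += interval
```

A caller would typically pass `lambda vid: send(build_request("GET", f"/videos/{vid}", token))` as `get_video`. Keeping the fetch and the sleep injectable makes the loop easy to exercise without network access.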
Create a video
Provide a portrait (URL or base64) and either a script (with voice_id) or an audio file. The endpoint returns immediately and the render runs in the background.
curl -X POST https://staging.vlogme.ai/api/public/v1/videos \
-H "Authorization: Bearer $VLOGME_TOKEN" \
-H "Content-Type: application/json" \
-d '{
"portrait_url": "https://example.com/portrait.jpg",
"script": "Hello from the Vlogme API.",
"voice_id": "EXAVITQu4vr4xnSDxMaL",
"emotion_preset": "natural",
"live_subtitles": true,
"webhook_url": "https://yourapp.com/webhooks/vlogme"
}'
Optional fields: title, face_restore (0–1), watermark_text, video_prompt, video_negative_prompt, audio_url, audio_base64, portrait_base64.
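The input rules for this endpoint (a portrait, plus either a script with voice_id or an audio file) can be checked before the request is sent. The sketch below is illustrative only: `build_create_payload` and `OPTIONAL_FIELDS` are hypothetical names, the field list is taken from the example body and the optional fields listed here, and the server remains the authority on validation. It deliberately does not decide what happens when both a script and audio are supplied, since the docs do not say.

```python
OPTIONAL_FIELDS = {
    "title", "face_restore", "watermark_text", "video_prompt",
    "video_negative_prompt", "emotion_preset", "live_subtitles", "webhook_url",
}


def build_create_payload(*, portrait_url=None, portrait_base64=None,
                         script=None, voice_id=None,
                         audio_url=None, audio_base64=None, **optional):
    """Assemble a POST /videos body, checking the documented input rules first."""
    if not (portrait_url or portrait_base64):
        raise ValueError("provide portrait_url or portrait_base64")
    has_audio = bool(audio_url or audio_base64)
    if not script and not has_audio:
        raise ValueError("provide a script (with voice_id) or an audio file")
    if script and not voice_id:
        raise ValueError("script requires voice_id")
    unknown = set(optional) - OPTIONAL_FIELDS
    if unknown:
        raise ValueError(f"unknown fields: {sorted(unknown)}")
    face_restore = optional.get("face_restore")
    if face_restore is not None and not 0 <= face_restore <= 1:
        raise ValueError("face_restore must be between 0 and 1")

    payload = {
        "portrait_url": portrait_url, "portrait_base64": portrait_base64,
        "script": script, "voice_id": voice_id,
        "audio_url": audio_url, "audio_base64": audio_base64,
        **optional,
    }
    # Drop unset fields so the body only carries what the caller chose.
    return {key: value for key, value in payload.items() if value is not None}
```

Catching these mistakes locally avoids spending a round trip on an `invalid_input` response.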
Returns { id, status: "preparing" } immediately. Poll GET /videos/:id every 10 s or pass a webhook_url. Typical render time: 1–3 minutes.
Script grammar: the complete spec
The full, machine-readable DSL spec that powers multi-scene scripts, b-roll inserts and pauses. Paste it into ChatGPT, Claude or any LLM and the model will write valid scripts for you.
The grammar covers @imageN scene switches, overlay vs chain b-roll ({ ... }:D), [pause] tags, and the full list of ElevenLabs voice tags. MCP clients can also call the script_grammar_help tool to read the same spec inline.
# VlogMe Script Grammar — single source of truth
> Drop this file into ChatGPT, Claude, or any LLM. It is the complete spec for VlogMe scripts as the app actually runs them end‑to‑end. **Anything not described here does not exist.** If the prompt the AI sees and this doc ever disagree, this doc wins — fix the prompt.
---
## 0. What you are writing
A **VlogMe script** is plain text that drives a 9:16 short‑form video (TikTok / Reels / Shorts).
Inputs the user uploaded before you write anything:
- **N portrait photos**, numbered `1..N`, referenced as `@image1`, `@image2`, …
- **M audio clips** (optional), numbered `1..M`, referenced as `@audio1`, `@audio2`, …
Every photo carries **one role** (stored on the card, not in the script line):
| Role | Pastel tint | Meaning |
| ----------- | ----------- | --------------------------------------------------------------------------------- |
| `avatar` | amber | Talking head. Lip‑synced TTS or uploaded audio. |
| `video` | red | Full‑screen cinematic shot generated from the photo via Hunyuan i2v. |
| `overlay` | sky | Short clip layered **on top of** the currently‑speaking avatar. |
| `continue` | violet | A `video` shot whose first frame chains from the previous shot's last frame. |
> Roles are decided per photo, not per line. If photo 2 is `video`, **every** `@image2` reference in the script is a video shot — you don't pick the role on the line.
Output: one block of plain text using ONLY the grammar in §2.
---
## 1. Pipeline (who owns what)
```text
WishesField.tsx ← user types a brief + uploads photos/audio
│
▼
buildScriptFromBrief (server) ← src/lib/script-from-brief.functions.ts
│ Gemini Pro/Flash writes:
│ • script (the text below, §2)
│ • roleChanges (per-photo role overrides)
│ • videoPrompts (per-photo cinematic prompt for non-avatar roles)
│ • assetProposals (new @imageN / @audioN to generate)
│ • imageCaptions (Flash captions for new uploads)
│ • reply (1–3 friendly sentences shown in chat)
▼
PhotoMentionEditor.tsx ← user-editable script with inline @chips
│
▼
useFlatScriptBridge.ts ← canonicalises tags, preserves slot indices
│
▼
scenes.ts (serializer) ← emits the render plan
│
▼
render fleet ← Hunyuan / overlay / avatar pipeline
```
**Key rule**: anything that *visually customises* a non‑avatar photo (cinematic prompt, duration, role) lives on **the card** — it is read from `videoPrompts` / role state, **not** parsed back out of the script text every render. The script body is just the timeline order.
---
## 2. The only grammar that exists
There are exactly **four** line forms. Anything else is a broken response.
### 2.1 Avatar speech
```
@imageN <spoken words>
```
- Plain words, optionally with ElevenLabs emotion brackets (`[excited]`, `[whispers]`, `[laughs]`, `[pause]`). Brackets stay English even when the speech is in another language.
- Photo `N` must have role `avatar`.
### 2.2 Avatar lip‑sync to uploaded audio
```
@imageN @audioM
```
- Lip‑syncs the avatar photo to audio clip `M`.
- Two tokens on the line, nothing else. Never duplicate the same line as both text **and** audio.
### 2.3 Silent cinematic shot (video / overlay‑as‑fullscreen / continue)
```
@imageN
```
- A bare `@imageN` token on its own line.
- The cinematic instruction lives on the **card** (`videoPrompt`), not in the script body.
- Duration lives on the **card** (slider, 1–5s, default 5s).
- Photo `N` must have role `video`, `overlay`, or `continue`.
> Why bare? Because the editor renders this line as a read‑only "cinematic shot" chip backed by the card's prompt. Putting the prompt inline would desync the two surfaces and is exactly the bug the `{@imageN v:…}:D` form used to cause.
### 2.4 Overlay over a currently‑speaking avatar (the most common short‑form pattern)
```
@image1 Watch this {@image3 slow push-in on a coffee cup}:3 and boom.
```
**What an overlay actually does:** a short video clip visually **covers the avatar at the exact spot where `{…}:D` sits inside the speech line**. The avatar's voice does **not** pause — TTS keeps running through the whole line from start to finish. Only the picture is replaced for `D` seconds; when the clip ends, the avatar's face comes back and finishes the line.
This is the canonical **"talking head → b‑roll → talking head"** pattern. The person talks, a clip appears on screen while their voice keeps narrating, then the person reappears and keeps talking. Almost every short‑form video uses this constantly — reach for it before §2.3 silent shots.
- The position of `{…}:D` inside the line decides **when** the clip appears. Put the brace after the words that should still be heard while the avatar is on screen; the clip starts on the next word and runs for `D` seconds.
- The brace body **IS** the prompt — no `v:` prefix, no other prefixes.
- `:D` is duration in seconds, integer or fractional, **1..5**. Omit `:D` to default to 5.
- Photo `N` in `{@imageN …}` must have role `overlay`. (If you also need that same content as a full‑screen shot somewhere else, use a different `@imageN` — one role per photo.)
- The clip's own ambient/SFX is mixed **under** the avatar voice at low level. If you want loud sound, bake it into the overlay prompt body.
- **Multiple overlays in one speech line are allowed** and fire in order:
```
@image1 First this {@image3 slow push-in}:2 then this {@image4 quick whip-pan}:2 and we land here.
```
- **Trailing overlay** — if the line's spoken words end *before* `D` elapses, the remaining time plays the overlay with just its own ambient sound. Useful as a deliberate emotional outro:
```
@image1 And then everything changed {@image4 wide pull-back of the empty street at dusk}:5
```
- Optional extras inside the braces, separated by ` | `:
```
{@image3 slow push-in on a coffee cup | n:no people | s:gentle ambient cafe}:3
```
Most of the time just the bare prompt is enough.
#### Overlay vs silent shot — how to choose
| You want… | Use |
| ------------------------------------------------------------------------ | -------------------------------------------------- |
| Person talks the whole time, b‑roll appears in the middle of the line | **Overlay** (§2.4) — `{…}:D` inline |
| Person talks, then a wordless cinematic frame, then person talks again | **Silent shot** between two speech lines (§2.3) |
| Person talks AND the visual stays on the avatar (no clip) | Plain `@imageN <words>` (§2.1) |
#### Overlay hard don'ts
- **No braces on their own line, no line breaks inside braces.** The whole `{…}:D` block is one inline token on one speech line.
- **No `v:` prefix.** The brace body is the prompt verbatim.
- **No overlay without a surrounding avatar speech line.** A bare `{@imageN …}:D` on its own line is forbidden — if you want full‑screen, use §2.3 bare `@imageN`.
- **Don't double up.** Don't follow an overlay with a silent shot of the same content back‑to‑back; pick one.
### 2.5 Standalone audio (rare)
```
@audioN
```
- Music / sfx bed not tied to an avatar. Used only when the user uploaded a clip they want played, or asked for standalone background music spanning shots.
- Prefer baking sound into a cinematic shot's prompt instead.
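
The line forms in this section are regular enough to check mechanically. The following is a rough sketch of such a check, not the parser the app uses: `classify_line` and the regular expressions are hypothetical, and role checks are left out because roles live on the card rather than in the script text.

```python
import re

LIPSYNC = re.compile(r"@image(\d+) @audio(\d+)")
SHOT = re.compile(r"@image(\d+)")
AUDIO = re.compile(r"@audio(\d+)")
SPEECH = re.compile(r"@image(\d+) (\S.*)")
# Inline overlay: {@imageN prompt | extras}:D, with :D optional.
OVERLAY = re.compile(r"\{@image(\d+) ([^{}]+)\}(?::(\d+(?:\.\d+)?))?")
STRAY_TAG = re.compile(r"@(?:image|audio)\d+")


def classify_line(line):
    """Classify one script line, raising ValueError for anything outside §2."""
    text = line.strip()
    if m := LIPSYNC.fullmatch(text):
        return {"form": "lipsync", "image": int(m[1]), "audio": int(m[2])}
    if m := SHOT.fullmatch(text):
        return {"form": "shot", "image": int(m[1])}
    if m := AUDIO.fullmatch(text):
        return {"form": "audio", "audio": int(m[1])}
    if m := SPEECH.fullmatch(text):
        overlays = []
        for o in OVERLAY.finditer(m[2]):
            duration = float(o[3]) if o[3] else 5.0  # omitted :D defaults to 5
            prompt = o[2].split("|")[0].strip()
            if not 1 <= duration <= 5:
                raise ValueError(f"overlay duration out of range: {duration}")
            if not prompt or prompt.startswith("v:"):
                raise ValueError("overlay body must be a bare prompt")
            overlays.append({"image": int(o[1]), "prompt": prompt,
                             "duration": duration})
        leftover = OVERLAY.sub("", m[2])
        if "{" in leftover or "}" in leftover or STRAY_TAG.search(leftover):
            raise ValueError(f"stray braces or tags in speech line: {line!r}")
        return {"form": "speech", "image": int(m[1]), "overlays": overlays}
    raise ValueError(f"not a valid script line: {line!r}")
```

Running every line of a generated script through a check like this catches the legacy `v:` form, bare brace blocks and invented `:D` suffixes before the script reaches the editor.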
---
## 3. The "code‑prompt" data the AI returns
`buildScriptFromBrief` returns a JSON object (via the `submit_script` tool call). The script body in §2 is only one field; the rest are side‑channels the client merges into the project state:
```ts
{
reply: "Short friendly 1–3 sentence summary shown in chat.",
script: "the script, exactly per §2",
roleChanges: { "@image2": "video", "@image4": "continue" },
videoPrompts: { "@image2": "Surfer drops into a glassy barrel, low GoPro angle, spray flying.",
"@image4": "Same surfer carves out, slow dolly-back, distant crowd cheer." },
imageCaptions:{ "@image3": "Young woman smiling in a sunlit kitchen." },
assetProposals: {
"@image5": { kind:"image", role:"video",
prompt:"Aerial shot of a coastline at golden hour, slow lateral drift." },
"@audio2": { kind:"music", durationSeconds:30,
prompt:"Upbeat lo-fi groove, soft kick, warm pads." }
}
}
```
### 3.1 Hard rules for the data
1. **One role per photo.** `roleChanges` lists only photos whose role you are *changing* from the user's hint. Same photo cannot be both avatar and video.
2. **`videoPrompts` is required** for every `@imageN` whose final role is `video`, `overlay`, or `continue` — including newly proposed photos. Plain prose, 1–2 sentences, ≤ 200 chars, no `@tags`, no brackets, no boilerplate like "Cinematic 9:16 reference image". Subject + motion + camera move + mood + lighting + any sound that should be baked in.
3. **Every `@tag` not in the upload list** must have an `assetProposals` entry — otherwise the user sees a dead placeholder. Reuse uploaded media whenever it fits; cap proposals at ≤ 6 images / ≤ 2 audio per turn.
4. **Avatar proposals must be a person.** When `kind:"image"` and `role:"avatar"`, the `prompt` describes a **portrait of a person** (subject + framing + lighting). It **must not** echo objects, scenery, drinks, or props from the brief. If the brief is about a coffee shop, the avatar isn't a coffee cup — it's the person ordering it.
5. **Bake sound into shots.** If a cinematic shot needs sound, write it into that shot's `videoPrompt` ("…spray flying, loud wave roar"). Don't spawn a parallel `@audioN` just for sfx that belongs to one shot.
6. **One beat per line.** Two tokens per line is only legal as `@imageN @audioM` (§2.2).
7. **Total script ≤ 8000 characters.**
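
Several of these rules are mechanical and can be linted before the response is accepted. The sketch below covers only rules 2, 3 and 7; it is an illustration under assumptions, not the server's validator. `check_submission`, `hinted_roles` (upload-time role per `@imageN`) and `uploaded_tags` (the set of uploaded `@imageN` / `@audioN` tags) are hypothetical names, and every non-image proposal is counted as audio.

```python
import re

CINEMATIC_ROLES = {"video", "overlay", "continue"}
TAG = re.compile(r"@(?:image|audio)\d+")


def check_submission(data, hinted_roles, uploaded_tags):
    """Return a list of rule violations for a submit_script-style payload."""
    problems = []
    script = data.get("script", "")
    proposals = data.get("assetProposals", {})
    prompts = data.get("videoPrompts", {})

    # Final role = upload hint, then proposed role, then explicit override.
    roles = dict(hinted_roles)
    for tag, proposal in proposals.items():
        if proposal.get("kind") == "image":
            roles[tag] = proposal.get("role")
    roles.update(data.get("roleChanges", {}))

    if len(script) > 8000:                                          # rule 7
        problems.append("script exceeds 8000 characters")
    for tag, role in roles.items():                                 # rule 2
        if role in CINEMATIC_ROLES and not prompts.get(tag):
            problems.append(f"{tag} needs a videoPrompts entry")
    for tag, prompt in prompts.items():                             # rule 2
        if len(prompt) > 200 or TAG.search(prompt) or "[" in prompt or "]" in prompt:
            problems.append(f"videoPrompts[{tag}] breaks the prose rules")
    for tag in sorted(set(TAG.findall(script)) - set(uploaded_tags)):  # rule 3
        if tag not in proposals:
            problems.append(f"{tag} has no assetProposals entry")
    images = sum(1 for p in proposals.values() if p.get("kind") == "image")
    if images > 6 or len(proposals) - images > 2:                   # rule 3
        problems.append("too many asset proposals")
    return problems
```

An empty list means none of the checked rules were broken; rules 1, 4, 5 and 6 need either the line grammar or human judgement and are out of scope here.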
### 3.2 Role decision tree (when the user's hint is wrong)
The role the user picked at upload is a **hint**, not a contract. Override only when the photo clearly can't play that role:
- Landscape / object / wide shot tagged `avatar` → `video`.
- Clean talking‑head portrait tagged `video` **and** the brief gives that person a speaking line → `avatar`.
- Near‑duplicate of the previous photo, same moment, same subject → `continue`.
- Quick punchy insert (logo, prop close‑up, text frame) layered while the avatar talks → `overlay`.
Otherwise **keep the hint**. Never flip an avatar to video just because you'd prefer a cinematic shot.
---
## 4. Worked examples
### 4.1 Good
Uploaded: `@image1` (portrait, hint=avatar), `@image2` (surfer wave, hint=video).
```
@image1 [excited] Watch this drop in!
@image2
@image1 [laughs] Insane wave!
```
```ts
roleChanges: { "@image2": "video" } // already matches hint; could be omitted
videoPrompts: { "@image2": "Surfer drops into a glassy barrel, low GoPro angle, spray flying, loud wave roar." }
```
### 4.2 Good — overlay while talking
```
@image1 We open the doors {@image3 slow push-in on the neon OPEN sign}:3 and the line is already around the block.
```
```ts
roleChanges: { "@image3": "overlay" }
videoPrompts: { "@image3": "Slow push-in on a glowing neon OPEN sign, dim street light, soft hum." }
```
### 4.3 Good — invent a missing avatar
User uploaded a coffee‑shop photo, no portrait, asked for a barista monologue.
```
@image2
@image3 Welcome in — what'll it be today?
```
```ts
roleChanges: { "@image2": "video" }
videoPrompts: { "@image2": "Steam curling off an espresso shot, macro lens, warm overhead light." }
assetProposals: {
"@image3": { kind:"image", role:"avatar",
prompt:"Friendly barista in a black apron, eye-level portrait, soft window light, shallow depth of field." }
}
```
### 4.4 Bad — do NOT do this
```
@image2 ← BARE tag for an AVATAR photo. Avatars need speech (§2.1) or @audioM (§2.2).
{@image2 v:...}:5 ← The "v:" form does not exist. Use bare @image2 on its own line (§2.3) or inline {@image2 ...}:D overlay (§2.4).
{@image2}:5 ← Braces without a prompt. Forbidden.
@image2:4 @image1 hi ← Two tags + invented :4 duration on the line. Forbidden.
{
@image2 cinematic shot
}:5 ← Brace on its own line. Forbidden — the whole brace block is ONE line.
```
```ts
// BAD assetProposals
"@image1": { kind:"image", role:"avatar",
prompt:"A cold glass of beer on a wooden table." }
// The user asked for an avatar. A glass is not a person. See §3.1 rule 4.
```
---
## 5. Backwards compatibility
Older chat history may contain `{@imageN v:<prompt>}:D` lines from the previous prompt version. The server still parses those — it strips the braces, moves the `v:` body into `videoPrompts`, and emits the canonical bare `@imageN` line. **Do not produce that form in new responses.** It is dead weight kept only so old projects keep loading.
---
## 6. Quick reference card
| Want | Write |
| ------------------------------------------------- | -------------------------------------------------------------- |
| Avatar speaks | `@image1 Here is what I want to say.` |
| Avatar lip‑syncs uploaded audio | `@image1 @audio2` |
| Silent cinematic shot (video / overlay / continue) | `@image2` on its own line + `videoPrompts["@image2"]` |
| Overlay clip while avatar keeps talking | `@image1 Look at this {@image3 slow push-in}:3 right here.` |
| Standalone music / sfx bed | `@audio1` on its own line |
| Change a photo's role | `roleChanges: { "@image2": "video" }` |
| Invent a new photo | use `@image5` in the script + `assetProposals["@image5"]` |
| Invent music | use `@audio2` + `assetProposals["@audio2"] = { kind:"music" }` |
Webhooks
Pass webhook_url in the create request and we POST to it once when the render finishes — success or failure.
Verify the X-Vlogme-Signature header — it is sha256=<hmac> where the HMAC secret is the lowercase hex sha256 of your raw API token. We never store the raw token, so we sign with the same hash we keep in our database. Your code derives the secret once from the token.
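In Python, the check could look like the sketch below. It assumes the HMAC key is the ASCII bytes of the lowercase hex digest (rather than the raw 32 digest bytes) and that the HMAC is computed over the exact request body bytes; `derive_webhook_secret` and `verify_signature` are hypothetical helper names.

```python
import hashlib
import hmac


def derive_webhook_secret(raw_token: str) -> str:
    """Lowercase hex SHA-256 of the raw API token; compute once and cache."""
    return hashlib.sha256(raw_token.encode("utf-8")).hexdigest()


def verify_signature(secret: str, raw_body: bytes, header_value) -> bool:
    """Compare X-Vlogme-Signature against our own HMAC of the raw body.

    Assumption: the hex string itself (as ASCII bytes) is the HMAC key.
    """
    digest = hmac.new(secret.encode("ascii"), raw_body, hashlib.sha256).hexdigest()
    expected = ("sha256=" + digest).encode("ascii")
    received = (header_value or "").encode("utf-8")
    # compare_digest runs in constant time to avoid leaking match length.
    return hmac.compare_digest(expected, received)
```

Always verify against the bytes as received; re-serialising the parsed JSON can change whitespace or key order and break the signature.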
# Verifying X-Vlogme-Signature
secret = sha256(YOUR_RAW_TOKEN).hex() # lowercase hex
expected = "sha256=" + hmac_sha256(secret, raw_body).hex()
assert constant_time_eq(expected, request.headers["X-Vlogme-Signature"])
{
"event": "video.completed",
"video": {
"id": "...",
"status": "completed",
"video_url": "https://..."
}
}
MCP server
A native Streamable-HTTP Model Context Protocol endpoint. Authenticate with the same Bearer token your REST clients use.
POST https://staging.vlogme.ai/api/mcp
Available tools
- script_grammar_help
- list_voices
- get_balance
- estimate_credits
- list_portraits
- generate_video
- get_video
- list_my_videos
Call script_grammar_help to read the script grammar spec inline.
Smoke test
curl -X POST https://staging.vlogme.ai/api/mcp \
-H "Authorization: Bearer $VLOGME_TOKEN" \
-H "Content-Type: application/json" \
-H "Accept: application/json, text/event-stream" \
-d '{"jsonrpc":"2.0","id":1,"method":"tools/list","params":{}}'
Client setup
Claude Code, Cursor and Codex CLI speak Streamable HTTP natively. Older / stdio-only clients use the mcp-remote bridge.
Claude Code
claude mcp add --transport http vlogme https://staging.vlogme.ai/api/mcp \
--header "Authorization: Bearer $VLOGME_TOKEN"
Cursor
Add to ~/.cursor/mcp.json (or the project's .cursor/mcp.json):
{
"mcpServers": {
"vlogme": {
"url": "https://staging.vlogme.ai/api/mcp",
"headers": { "Authorization": "Bearer YOUR_TOKEN_HERE" }
}
}
}
Codex CLI, Option A: OAuth (recommended)
Opens a browser, you click "Allow" once:
codex mcp add vlogme --url https://staging.vlogme.ai/api/mcp
codex mcp login vlogme
# verify
codex mcp list # should show: vlogme Auth: OAuth
codex mcp get vlogme
codex doctor --summary
Codex CLI, Option B: API token
Use a vlm_live_* token from Settings → API. Export it first so codex doctor doesn't warn about a missing env var:
export VLOGME_TOKEN=vlm_live_xxxxxxxxxxxx
codex mcp add vlogme --url https://staging.vlogme.ai/api/mcp --bearer-token-env-var VLOGME_TOKEN
Don't combine --bearer-token-env-var with OAuth. If you added the env-var flag but never set the variable, codex doctor reports missing MCP env vars. Remove the server (codex mcp remove vlogme) and re-add it with only one mode. Start a new Codex session after any change, because a running agent won't pick up new MCP tools dynamically.
Claude Desktop / stdio-only clients
Older or stdio-only clients (Claude Desktop, legacy Codex builds) need the mcp-remote bridge. Add to claude_desktop_config.json:
{
"mcpServers": {
"vlogme": {
"command": "npx",
"args": [
"-y", "mcp-remote",
"https://staging.vlogme.ai/api/mcp",
"--header", "Authorization: Bearer YOUR_TOKEN_HERE"
]
}
}
}
End-to-end example
What a one-shot natural-language prompt actually does once your assistant is wired to the Vlogme MCP server.
"Render a 15-second talking-portrait of my first portrait, voice Adam, saying: 'Mondays are for shipping. Let's go.'"
Claude (or any MCP-capable assistant) will run, in order:
1. list_portraits → pick portrait id #1
2. list_voices → resolve "Adam" → voice id
3. estimate_credits → confirm cost ≈ 15
4. generate_video → returns a video id
5. get_video (polled) → returns the final mp4 URL
- No progress streaming yet; poll get_video every few seconds.
- Portraits cannot be uploaded through MCP (binary uploads aren't part of the protocol). Upload them in the web app or via REST first, then reference them by id.
- Per-plan rate limits apply to MCP calls the same way they do to REST.
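On the wire, each step in the sequence above is a JSON-RPC 2.0 `tools/call` message posted to the MCP endpoint, and a Streamable HTTP server may answer with plain JSON or with a server-sent event stream. The sketch below builds such a message and decodes either response type. `tool_call` and `parse_mcp_response` are hypothetical helpers, the video id shown in the usage note is made up, and tool argument names other than `get_video { id }` should be read from `tools/list` rather than assumed.

```python
import itertools
import json

_request_ids = itertools.count(1)


def tool_call(name, arguments=None):
    """Build a JSON-RPC 2.0 tools/call request for an MCP server."""
    return {
        "jsonrpc": "2.0",
        "id": next(_request_ids),
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments or {}},
    }


def parse_mcp_response(content_type, body):
    """Return the JSON-RPC messages in a JSON or text/event-stream response."""
    media_type = content_type.split(";")[0].strip().lower()
    if media_type != "text/event-stream":
        return [json.loads(body)]
    messages, data_lines = [], []
    # A trailing empty line guarantees the last event is flushed.
    for line in body.splitlines() + [""]:
        if line.startswith("data:"):
            value = line[5:]
            data_lines.append(value[1:] if value.startswith(" ") else value)
        elif line == "" and data_lines:
            messages.append(json.loads("\n".join(data_lines)))
            data_lines = []
    return messages
```

For example, `tool_call("get_video", {"id": "vid_123"})` produces the body for one polling step; send it with the same headers as the smoke test.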
The MCP server publishes OAuth metadata at /.well-known/oauth-protected-resource (RFC 9728), so any compliant client (Claude Desktop, ChatGPT) can start the sign-in flow automatically on a 401. Scripts and Cursor configs should use a Bearer token from Settings → API.
Agent skill
Drop this into .agents/skills/vlogme/SKILL.md (or the equivalent for your agent) so the agent knows when and how to use the tools.
---
name: vlogme
description: Generate talking-avatar videos via the Vlogme MCP server. Use when the user wants to turn a portrait + script into a short video.
---
You have access to the `vlogme` MCP server. Workflow:
1. Call `list_voices` once per session to discover voice ids; cache the result.
2. Call `get_balance` to confirm the user has enough credits
(≈ 1 credit per second of output, minimum 10 per render).
3. Call `generate_video` with:
- `portrait_url` (https URL of a clear face photo, 9:16 or square)
OR `portrait_base64`.
- `script` (text to speak, ≤ 30 000 chars) AND `voice_id` from step 1
— OR `audio_url`/`audio_base64` if narration already exists.
- Optional: `emotion_preset` (calm | natural | expressive | sad),
`live_subtitles` (default true), `title`, `video_prompt`.
4. Poll `get_video { id }` every 10 s until `status` is `completed`
(typical 1–3 minutes). Then `video_url` is a signed mp4 link valid
for 1 hour — surface or download it immediately.
If `status` becomes `failed`, read `error_message`. Credits are auto-refunded on failure.
Tips: scripts under ~150 words render fastest. Portrait must show exactly one clear face.
Output is always 9:16 vertical.
Errors
All errors are JSON in the shape { error: { code, message } }. Codes are stable — branch on code, not on the human-readable message.
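| Code | HTTP status | Meaning |
| -------------------- | ----------- | ------------------------------------------- |
| missing_token | 401 | No Authorization header |
| invalid_token | 401 | Token revoked or malformed |
| plan_required | 403 | Free plan; upgrade to Basic+ |
| insufficient_credits | 402 | Not enough credits to start render |
| invalid_input | 400 | Validation, unknown voice_id, etc. |
| invalid_asset | 400 | Portrait/audio URL unreachable or too large |
| invalid_json | 400 | Request body is not JSON |
| not_found | 404 | Resource doesn't exist or isn't yours |
| internal_error | 500 | Unexpected server error |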
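Branching on the stable code might look like the following sketch. `VlogmeError`, `parse_error`, `next_step` and the action labels are hypothetical, and the mapping from code to reaction (for instance, retrying later on `internal_error`) is a suggestion rather than behaviour the API prescribes. `unparseable_response` is a client-side placeholder for bodies that do not match the documented shape.

```python
import json


class VlogmeError(Exception):
    """An API error carrying the stable machine-readable code."""

    def __init__(self, status, code, message):
        super().__init__(f"{status} {code}: {message}")
        self.status = status
        self.code = code
        self.message = message


# Suggested client reactions per code (illustrative only).
ACTIONS = {
    "missing_token": "fix_credentials",
    "invalid_token": "fix_credentials",
    "plan_required": "upgrade_plan",
    "insufficient_credits": "add_credits",
    "invalid_input": "fix_request",
    "invalid_asset": "fix_request",
    "invalid_json": "fix_request",
    "not_found": "give_up",
    "internal_error": "retry_later",
}


def parse_error(status, body_text):
    """Turn an error response into a VlogmeError, tolerating odd bodies."""
    try:
        error = json.loads(body_text)["error"]
        return VlogmeError(status, error["code"], error["message"])
    except (ValueError, KeyError, TypeError):
        return VlogmeError(status, "unparseable_response", body_text[:200])


def next_step(error):
    """Pick a reaction from the code, never from the human-readable message."""
    return ACTIONS.get(error.code, "give_up")
```

Because the lookup keys on `code`, rewording of the human-readable messages on the server side does not change client behaviour.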