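What can you do with the vlogme.ai API?

Anything that needs programmatic talking-avatar video generation: personalized sales videos, AI tutors, newscasters, product explainers, NPC dialogue, localized narration at scale. A single POST request takes a portrait URL, a script, and a voice ID, and produces a finished MP4.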

How does the MCP server work with Claude Code, Codex, and Cursor?

Vlogme exposes a Streamable-HTTP MCP endpoint at /api/mcp. Add it once with the CLI snippet on this page and your agent can call list_voices, generate_video, and get_video natively. No glue code or extra SDK is required.

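How are credits billed for API renders?

The same as in the web app: about 1 credit per second of finished video, with a 10-credit minimum. Credits are charged when you POST to /videos and are refunded automatically if the render fails or you delete it before it completes.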

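Are there rate limits?

The default limits are generous (hundreds per day on Basic, thousands per day on Pro). If you need higher throughput for a launch or a bulk migration, email support@vlogme.ai and we will raise your limit.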

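Do webhooks support retries and signature verification?

Yes. Every webhook carries an X-Vlogme-Signature header (an HMAC-SHA256 of the raw body, keyed with a secret derived from your API token). Failed deliveries are retried with exponential backoff for 24 hours.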

Which plans include API access?

The API and MCP are available on all paid plans (Basic and higher). The free plan is web-only to prevent abuse.

VlogMe.AI API & MCP: Programmable AI Avatar Live Streaming

01 Overview
02 Authentication
03 Endpoints
04 Create a video
05 Script grammar
06 Webhooks
07 MCP server
08 Client setup
09 End-to-end example
10 Agent skill
11 Errors

Intro

What you can build

Generate talking-avatar videos programmatically. Same engine and credit pricing as the web app. Available on paid plans (Basic and higher).

Two ways to integrate: a plain REST API for backends and scripts, and a native MCP (Streamable HTTP) server for AI agents (Claude, Cursor, Codex, ChatGPT-with-MCP).

New here? Start with the step-by-step Manual — it walks through scripts, b-roll, the REST API and MCP with worked examples.

Pricing

≈ 1 credit per second of output, minimum 10 credits per render. Credits are charged up front and auto-refunded on failure.
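
As a rough illustration of the pricing rule above, the following sketch estimates the cost of a render from its length. The function name `approx_credits` is hypothetical, and rounding partial seconds up is an assumption, since the page only says "≈ 1 credit per second". For authoritative numbers, use the balance and estimate tools the service provides.

```python
import math

MIN_CREDITS_PER_RENDER = 10  # minimum per render, as stated on this page


def approx_credits(duration_seconds: float) -> int:
    """Approximate render cost: about 1 credit per second, never below 10."""
    if duration_seconds < 0:
        raise ValueError("duration_seconds must be non-negative")
    # Assumption: partial seconds are rounded up. The page does not say.
    return max(MIN_CREDITS_PER_RENDER, math.ceil(duration_seconds))
```

Under these assumptions, a 15-second video costs about 15 credits and a 4-second video costs the 10-credit minimum.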

Setup

Authentication

Every request needs a Bearer token. Tokens are shown once at creation time — store them in your secret manager.

Generate a token in Settings → API. Pass it on every request:

http

Authorization: Bearer vlm_live_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

The base URL for the REST API is:

url

https://staging.vlogme.ai/api/public/v1
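
A minimal sketch of how a script might attach the Bearer token to each call, using only the Python standard library. The helper name `build_request` is hypothetical; the base URL and header format are the ones shown above.

```python
import json
import urllib.request
from typing import Optional

BASE_URL = "https://staging.vlogme.ai/api/public/v1"


def build_request(path: str, token: str, method: str = "GET",
                  body: Optional[dict] = None) -> urllib.request.Request:
    """Build (but do not send) an authenticated JSON request."""
    headers = {"Authorization": f"Bearer {token}", "Accept": "application/json"}
    data = None
    if body is not None:
        # JSON in, JSON out: encode the body and declare its content type.
        data = json.dumps(body).encode("utf-8")
        headers["Content-Type"] = "application/json"
    return urllib.request.Request(BASE_URL + path, data=data,
                                  headers=headers, method=method)
```

To send the request, pass it to `urllib.request.urlopen` and decode the response with `json.load`. Read the token from an environment variable or secret manager rather than hard-coding it.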

REST

Endpoints

A compact, predictable surface. JSON in, JSON out.

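| Method | Path        | Description                    |
| ------ | ----------- | ------------------------------ |
| GET    | /me         | User id, plan, credit balance  |
| GET    | /voices     | Voice IDs for synthesis        |
| POST   | /videos     | Start a new render             |
| GET    | /videos/:id | Status + signed download URL   |
| GET    | /videos     | List recent renders (paged)    |
| DELETE | /videos/:id | Delete + refund if in-progress |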

GET /me

bash

curl https://staging.vlogme.ai/api/public/v1/me \
  -H "Authorization: Bearer $VLOGME_TOKEN"

GET /voices

bash

curl https://staging.vlogme.ai/api/public/v1/voices \
  -H "Authorization: Bearer $VLOGME_TOKEN"

GET /videos/:id

Returns current status. When status is completed, video_url is a signed download URL valid for 1 hour.

bash

curl https://staging.vlogme.ai/api/public/v1/videos/$VIDEO_ID \
  -H "Authorization: Bearer $VLOGME_TOKEN"

GET /videos

Lists your recent renders (newest first). Supports ?limit (max 100) and ?cursor.
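
One possible way to walk the paged list is sketched below. The response field names `videos` and `next_cursor` are assumptions made for illustration; this page documents only the `?limit` and `?cursor` query parameters, so check a real response before relying on them. The `fetch_page` callable is a hypothetical stand-in for whatever function performs the HTTP call.

```python
from typing import Callable, Iterator, Optional


def iter_videos(fetch_page: Callable[[Optional[str]], dict],
                max_pages: int = 50) -> Iterator[dict]:
    """Yield renders page by page until no cursor is returned."""
    cursor = None
    for _ in range(max_pages):              # hard stop against endless loops
        page = fetch_page(cursor)           # e.g. GET /videos?limit=100&cursor=...
        yield from page.get("videos", [])   # assumed field name
        cursor = page.get("next_cursor")    # assumed field name
        if not cursor:
            return
```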

DELETE /videos/:id

Deletes a render. If it's still in progress, credits are refunded.

REST

Create a video

Provide a portrait (URL or base64) and either a script (with voice_id) or an audio file. The endpoint returns immediately and the render runs in the background.

bash

curl -X POST https://staging.vlogme.ai/api/public/v1/videos \
  -H "Authorization: Bearer $VLOGME_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "portrait_url": "https://example.com/portrait.jpg",
    "script": "Hello from the Vlogme API.",
    "voice_id": "EXAVITQu4vr4xnSDxMaL",
    "emotion_preset": "natural",
    "live_subtitles": true,
    "webhook_url": "https://yourapp.com/webhooks/vlogme"
  }'

Optional fields: title, face_restore (0–1), watermark_text, video_prompt, video_negative_prompt, audio_url, audio_base64, portrait_base64.
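
The input rules above (a portrait, plus either a script with voice_id or an audio file) can be checked client-side before spending a request. This is a sketch under stated assumptions: the helper name `build_video_payload` is hypothetical, and the server's own validation is the authority. Whether a script and an audio file may be sent together is not stated here, so the sketch does not reject that case.

```python
from typing import Optional


def build_video_payload(*, portrait_url: Optional[str] = None,
                        portrait_base64: Optional[str] = None,
                        script: Optional[str] = None,
                        voice_id: Optional[str] = None,
                        audio_url: Optional[str] = None,
                        audio_base64: Optional[str] = None,
                        **optional_fields) -> dict:
    """Assemble the JSON body for POST /videos, failing early on gaps."""
    if not (portrait_url or portrait_base64):
        raise ValueError("provide portrait_url or portrait_base64")
    if script and not voice_id:
        raise ValueError("script requires voice_id")
    if not script and not (audio_url or audio_base64):
        raise ValueError("provide script (with voice_id) or audio")
    fields = dict(portrait_url=portrait_url, portrait_base64=portrait_base64,
                  script=script, voice_id=voice_id,
                  audio_url=audio_url, audio_base64=audio_base64)
    fields.update(optional_fields)  # e.g. title, live_subtitles, webhook_url
    # Drop unset fields so only the chosen inputs are sent.
    return {key: value for key, value in fields.items() if value is not None}
```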

Response

Returns { id, status: "preparing" } immediately. Poll GET /videos/:id every 10 s or use a webhook_url. Typical render time: 1–3 minutes.
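
A minimal polling loop for the flow described above might look like this. `get_video` is a hypothetical callable that performs GET /videos/:id and returns the decoded JSON. The terminal statuses `completed` and `failed` are taken from this page, and any other status is treated as still in progress. The 10-minute timeout is an arbitrary assumption.

```python
import time
from typing import Callable

TERMINAL_STATUSES = ("completed", "failed")


def wait_for_video(get_video: Callable[[str], dict], video_id: str,
                   interval_s: float = 10.0, timeout_s: float = 600.0,
                   sleep: Callable[[float], None] = time.sleep) -> dict:
    """Poll until the render reaches a terminal status or the timeout passes."""
    deadline = time.monotonic() + timeout_s
    while True:
        video = get_video(video_id)
        if video.get("status") in TERMINAL_STATUSES:
            return video
        if time.monotonic() >= deadline:
            raise TimeoutError(f"video {video_id} still rendering after {timeout_s}s")
        sleep(interval_s)  # the page suggests polling every 10 s
```

The `sleep` parameter is injectable so the loop can be tested without waiting. For production use, a webhook avoids polling altogether.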

Scripts

Script grammar — the complete spec

The full, machine-readable DSL spec that powers multi-scene scripts, b-roll inserts and pauses. Paste it into ChatGPT, Claude or any LLM and the model will write valid scripts for you.

The grammar covers @imageN scene switches, overlay vs chain b-roll ({ ... }:D), [pause] tags, and the full list of ElevenLabs voice tags. MCP clients can also call the script_grammar_help tool to read the same spec inline.
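
To make the line forms concrete, here is a rough lint sketch that classifies a script line and extracts inline overlays, following §2 of the spec reproduced on this page. It is an illustration, not the app's parser: the names `classify_line` and `overlays` are hypothetical, and the sketch does not check photo roles, which live on the card rather than in the script.

```python
import re

# Inline overlay token: {@imageN prompt | extras}:D, with :D optional.
OVERLAY_RE = re.compile(r"\{@image(\d+)\s+([^{}]+)\}(?::(\d+(?:\.\d+)?))?")


def classify_line(line: str) -> str:
    """Return the line form per §2, or "invalid"."""
    s = line.strip()
    if re.fullmatch(r"@image\d+ +@audio\d+", s):
        return "lip_sync"            # §2.2
    if re.fullmatch(r"@image\d+", s):
        return "silent_shot"         # §2.3
    if re.fullmatch(r"@audio\d+", s):
        return "standalone_audio"    # §2.5
    m = re.match(r"@image\d+\s+(.+)", s)
    if m:
        # Remove well-formed overlays; leftover braces or tags are errors.
        speech = OVERLAY_RE.sub("", m.group(1))
        if not re.search(r"[{}]|@(?:image|audio)\d+", speech):
            return "avatar_speech"   # §2.1, possibly with §2.4 overlays
    return "invalid"


def overlays(line: str) -> list:
    """Return (photo number, prompt, seconds) for each overlay on the line."""
    found = []
    for m in OVERLAY_RE.finditer(line):
        duration = float(m.group(3)) if m.group(3) else 5.0  # default is 5
        if not 1 <= duration <= 5:
            raise ValueError(f"overlay duration {duration} outside 1..5")
        prompt = m.group(2).split("|")[0].strip()  # extras follow " | "
        found.append((int(m.group(1)), prompt, duration))
    return found
```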

SCRIPT-GRAMMAR.md

# VlogMe Script Grammar — single source of truth

> Drop this file into ChatGPT, Claude, or any LLM. It is the complete spec for VlogMe scripts as the app actually runs them end‑to‑end. **Anything not described here does not exist.** If the prompt the AI sees and this doc ever disagree, this doc wins — fix the prompt.

---

## 0. What you are writing

A **VlogMe script** is plain text that drives a 9:16 short‑form video (TikTok / Reels / Shorts).

Inputs the user uploaded before you write anything:

- **N portrait photos**, numbered `1..N`, referenced as `@image1`, `@image2`, …
- **M audio clips** (optional), numbered `1..M`, referenced as `@audio1`, `@audio2`, …

Every photo carries **one role** (stored on the card, not in the script line):

| Role        | Pastel tint | Meaning                                                                           |
| ----------- | ----------- | --------------------------------------------------------------------------------- |
| `avatar`    | amber       | Talking head. Lip‑synced TTS or uploaded audio.                                   |
| `video`     | red         | Full‑screen cinematic shot generated from the photo via Hunyuan i2v.              |
| `overlay`   | sky         | Short clip layered **on top of** the currently‑speaking avatar.                   |
| `continue`  | violet      | A `video` shot whose first frame chains from the previous shot's last frame.      |

> Roles are decided per photo, not per line. If photo 2 is `video`, **every** `@image2` reference in the script is a video shot — you don't pick the role on the line.

Output: one block of plain text using ONLY the grammar in §2.

---

## 1. Pipeline (who owns what)

```text
WishesField.tsx                       ← user types a brief + uploads photos/audio
   │
   ▼
buildScriptFromBrief (server)         ← src/lib/script-from-brief.functions.ts
   │   Gemini Pro/Flash writes:
   │     • script           (the text below, §2)
   │     • roleChanges      (per-photo role overrides)
   │     • videoPrompts     (per-photo cinematic prompt for non-avatar roles)
   │     • assetProposals   (new @imageN / @audioN to generate)
   │     • imageCaptions    (Flash captions for new uploads)
   │     • reply            (1–3 friendly sentences shown in chat)
   ▼
PhotoMentionEditor.tsx                ← user-editable script with inline @chips
   │
   ▼
useFlatScriptBridge.ts                ← canonicalises tags, preserves slot indices
   │
   ▼
scenes.ts (serializer)                ← emits the render plan
   │
   ▼
render fleet                          ← Hunyuan / overlay / avatar pipeline
```

**Key rule**: anything that *visually customises* a non‑avatar photo (cinematic prompt, duration, role) lives on **the card** — it is read from `videoPrompts` / role state, **not** parsed back out of the script text every render. The script body is just the timeline order.

---

## 2. The only grammar that exists

There are exactly **four** line forms. Anything else is a broken response.

### 2.1 Avatar speech
```
@imageN <spoken words>
```
- Plain words, optionally with ElevenLabs emotion brackets (`[excited]`, `[whispers]`, `[laughs]`, `[pause]`). Brackets stay English even when the speech is in another language.
- Photo `N` must have role `avatar`.

### 2.2 Avatar lip‑sync to uploaded audio
```
@imageN @audioM
```
- Lip‑syncs the avatar photo to audio clip `M`.
- Two tokens on the line, nothing else. Never duplicate the same line as both text **and** audio.

### 2.3 Silent cinematic shot (video / overlay‑as‑fullscreen / continue)
```
@imageN
```
- A bare `@imageN` token on its own line.
- The cinematic instruction lives on the **card** (`videoPrompt`), not in the script body.
- Duration lives on the **card** (slider, 1–5s, default 5s).
- Photo `N` must have role `video`, `overlay`, or `continue`.

> Why bare? Because the editor renders this line as a read‑only "cinematic shot" chip backed by the card's prompt. Putting the prompt inline would desync the two surfaces and is exactly the bug the `{@imageN v:…}:D` form used to cause.

### 2.4 Overlay over a currently‑speaking avatar (the most common short‑form pattern)
```
@image1 Watch this {@image3 slow push-in on a coffee cup}:3 and boom.
```

**What an overlay actually does:** a short video clip visually **covers the avatar at the exact spot where `{…}:D` sits inside the speech line**. The avatar's voice does **not** pause — TTS keeps running through the whole line from start to finish. Only the picture is replaced for `D` seconds; when the clip ends, the avatar's face comes back and finishes the line.

This is the canonical **"talking head → b‑roll → talking head"** pattern. The person talks, a clip appears on screen while their voice keeps narrating, then the person reappears and keeps talking. Almost every short‑form video uses this constantly — reach for it before §2.3 silent shots.

- The position of `{…}:D` inside the line decides **when** the clip appears. Put the brace after the words that should still be heard while the avatar is on screen; the clip starts on the next word and runs for `D` seconds.
- The brace body **IS** the prompt — no `v:` prefix, no other prefixes.
- `:D` is duration in seconds, integer or fractional, **1..5**. Omit `:D` to default to 5.
- Photo `N` in `{@imageN …}` must have role `overlay`. (If you also need that same content as a full‑screen shot somewhere else, use a different `@imageN` — one role per photo.)
- The clip's own ambient/SFX is mixed **under** the avatar voice at low level. If you want loud sound, bake it into the overlay prompt body.
- **Multiple overlays in one speech line are allowed** and fire in order:
  ```
  @image1 First this {@image3 slow push-in}:2 then this {@image4 quick whip-pan}:2 and we land here.
  ```
- **Trailing overlay** — if the line's spoken words end *before* `D` elapses, the remaining time plays the overlay with just its own ambient sound. Useful as a deliberate emotional outro:
  ```
  @image1 And then everything changed {@image4 wide pull-back of the empty street at dusk}:5
  ```
- Optional extras inside the braces, separated by ` | `:
  ```
  {@image3 slow push-in on a coffee cup | n:no people | s:gentle ambient cafe}:3
  ```
  Most of the time just the bare prompt is enough.

#### Overlay vs silent shot — how to choose

| You want…                                                                | Use                                                |
| ------------------------------------------------------------------------ | -------------------------------------------------- |
| Person talks the whole time, b‑roll appears in the middle of the line    | **Overlay** (§2.4) — `{…}:D` inline                |
| Person talks, then a wordless cinematic frame, then person talks again   | **Silent shot** between two speech lines (§2.3)    |
| Person talks AND the visual stays on the avatar (no clip)                | Plain `@imageN <words>` (§2.1)                     |

#### Overlay hard don'ts

- **No braces on their own line, no line breaks inside braces.** The whole `{…}:D` block is one inline token on one speech line.
- **No `v:` prefix.** The brace body is the prompt verbatim.
- **No overlay without a surrounding avatar speech line.** A bare `{@imageN …}:D` on its own line is forbidden — if you want full‑screen, use §2.3 bare `@imageN`.
- **Don't double up.** Don't follow an overlay with a silent shot of the same content back‑to‑back; pick one.

### 2.5 Standalone audio (rare)
```
@audioN
```
- Music / sfx bed not tied to an avatar. Used only when the user uploaded a clip they want played, or asked for standalone background music spanning shots.
- Prefer baking sound into a cinematic shot's prompt instead.

---

## 3. The "code‑prompt" data the AI returns

`buildScriptFromBrief` returns a JSON object (via the `submit_script` tool call). The script body in §2 is only one field; the rest are side‑channels the client merges into the project state:

```ts
{
  reply:        "Short friendly 1–3 sentence summary shown in chat.",
  script:       "the script, exactly per §2",
  roleChanges:  { "@image2": "video", "@image4": "continue" },
  videoPrompts: { "@image2": "Surfer drops into a glassy barrel, low GoPro angle, spray flying.",
                  "@image4": "Same surfer carves out, slow dolly-back, distant crowd cheer." },
  imageCaptions:{ "@image3": "Young woman smiling in a sunlit kitchen." },
  assetProposals: {
    "@image5": { kind:"image", role:"video",
                 prompt:"Aerial shot of a coastline at golden hour, slow lateral drift." },
    "@audio2": { kind:"music", durationSeconds:30,
                 prompt:"Upbeat lo-fi groove, soft kick, warm pads." }
  }
}
```

### 3.1 Hard rules for the data

1. **One role per photo.** `roleChanges` lists only photos whose role you are *changing* from the user's hint. Same photo cannot be both avatar and video.
2. **`videoPrompts` is required** for every `@imageN` whose final role is `video`, `overlay`, or `continue` — including newly proposed photos. Plain prose, 1–2 sentences, ≤ 200 chars, no `@tags`, no brackets, no boilerplate like "Cinematic 9:16 reference image". Subject + motion + camera move + mood + lighting + any sound that should be baked in.
3. **Every `@tag` not in the upload list** must have an `assetProposals` entry — otherwise the user sees a dead placeholder. Reuse uploaded media whenever it fits; cap proposals at ≤ 6 images / ≤ 2 audio per turn.
4. **Avatar proposals must be a person.** When `kind:"image"` and `role:"avatar"`, the `prompt` describes a **portrait of a person** (subject + framing + lighting). It **must not** echo objects, scenery, drinks, or props from the brief. If the brief is about a coffee shop, the avatar isn't a coffee cup — it's the person ordering it.
5. **Bake sound into shots.** If a cinematic shot needs sound, write it into that shot's `videoPrompt` ("…spray flying, loud wave roar"). Don't spawn a parallel `@audioN` just for sfx that belongs to one shot.
6. **One beat per line.** Two tokens per line is only legal as `@imageN @audioM` (§2.2).
7. **Total script ≤ 8000 characters.**

### 3.2 Role decision tree (when the user's hint is wrong)

The role the user picked at upload is a **hint**, not a contract. Override only when the photo clearly can't play that role:

- Landscape / object / wide shot tagged `avatar` → `video`.
- Clean talking‑head portrait tagged `video` **and** the brief gives that person a speaking line → `avatar`.
- Near‑duplicate of the previous photo, same moment, same subject → `continue`.
- Quick punchy insert (logo, prop close‑up, text frame) layered while the avatar talks → `overlay`.

Otherwise **keep the hint**. Never flip an avatar to video just because you'd prefer a cinematic shot.

---

## 4. Worked examples

### 4.1 Good

Uploaded: `@image1` (portrait, hint=avatar), `@image2` (surfer wave, hint=video).

```
@image1 [excited] Watch this drop in!
@image2
@image1 [laughs] Insane wave!
```
```ts
roleChanges:  { "@image2": "video" }    // already matches hint; could be omitted
videoPrompts: { "@image2": "Surfer drops into a glassy barrel, low GoPro angle, spray flying, loud wave roar." }
```

### 4.2 Good — overlay while talking

```
@image1 We open the doors {@image3 slow push-in on the neon OPEN sign}:3 and the line is already around the block.
```
```ts
roleChanges:  { "@image3": "overlay" }
videoPrompts: { "@image3": "Slow push-in on a glowing neon OPEN sign, dim street light, soft hum." }
```

### 4.3 Good — invent a missing avatar

User uploaded a coffee‑shop photo, no portrait, asked for a barista monologue.

```
@image2
@image3 Welcome in — what'll it be today?
```
```ts
roleChanges:  { "@image2": "video" }
videoPrompts: { "@image2": "Steam curling off an espresso shot, macro lens, warm overhead light." }
assetProposals: {
  "@image3": { kind:"image", role:"avatar",
               prompt:"Friendly barista in a black apron, eye-level portrait, soft window light, shallow depth of field." }
}
```

### 4.4 Bad — do NOT do this

```
@image2                    ← BARE tag for an AVATAR photo. Avatars need speech (§2.1) or @audioM (§2.2).
{@image2 v:...}:5          ← The "v:" form does not exist. Use bare @image2 on its own line (§2.3) or inline {@image2 ...}:D overlay (§2.4).
{@image2}:5                ← Braces without a prompt. Forbidden.
@image2:4 @image1 hi       ← Two tags + invented :4 duration on the line. Forbidden.
{
  @image2 cinematic shot
}:5                        ← Brace on its own line. Forbidden — the whole brace block is ONE line.
```

```ts
// BAD assetProposals
"@image1": { kind:"image", role:"avatar",
             prompt:"A cold glass of beer on a wooden table." }
// The user asked for an avatar. A glass is not a person. See §3.1 rule 4.
```

---

## 5. Backwards compatibility

Older chat history may contain `{@imageN v:<prompt>}:D` lines from the previous prompt version. The server still parses those — it strips the braces, moves the `v:` body into `videoPrompts`, and emits the canonical bare `@imageN` line. **Do not produce that form in new responses.** It is dead weight kept only so old projects keep loading.

---

## 6. Quick reference card

| Want                                              | Write                                                          |
| ------------------------------------------------- | -------------------------------------------------------------- |
| Avatar speaks                                     | `@image1 Here is what I want to say.`                          |
| Avatar lip‑syncs uploaded audio                   | `@image1 @audio2`                                              |
| Silent cinematic shot (video / overlay / continue) | `@image2` on its own line + `videoPrompts["@image2"]`          |
| Overlay clip while avatar keeps talking           | `@image1 Look at this {@image3 slow push-in}:3 right here.`    |
| Standalone music / sfx bed                        | `@audio1` on its own line                                      |
| Change a photo's role                             | `roleChanges: { "@image2": "video" }`                          |
| Invent a new photo                                | use `@image5` in the script + `assetProposals["@image5"]`      |
| Invent music                                      | use `@audio2` + `assetProposals["@audio2"] = { kind:"music" }` |

REST

Webhooks

Pass webhook_url in the create request and we POST to it once when the render finishes — success or failure.

Verify the X-Vlogme-Signature header — it is sha256=<hmac> where the HMAC secret is the lowercase hex sha256 of your raw API token. We never store the raw token, so we sign with the same hash we keep in our database. Your code derives the secret once from the token.

python

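# Verifying X-Vlogme-Signature (raw_body is the unparsed request body as bytes)
import hashlib
import hmac
secret   = hashlib.sha256(YOUR_RAW_TOKEN.encode()).hexdigest()   # lowercase hex string
expected = "sha256=" + hmac.new(secret.encode(), raw_body, hashlib.sha256).hexdigest()
assert hmac.compare_digest(expected, request.headers["X-Vlogme-Signature"])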

json

{
  "event": "video.completed",
  "video": {
    "id": "...",
    "status": "completed",
    "video_url": "https://..."
  }
}
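
A framework-independent sketch of a webhook receiver follows: it checks the signature first and only then parses the payload. The function names are hypothetical. It assumes the HMAC key is the hex digest string itself (as UTF-8 bytes), which is how the description above reads; confirm this against a real delivery before relying on it.

```python
import hashlib
import hmac
import json


def sign(raw_token: str, raw_body: bytes) -> str:
    """Compute the expected X-Vlogme-Signature value for a body."""
    # Assumption: the key is the lowercase hex sha256 of the token, as text.
    secret = hashlib.sha256(raw_token.encode("utf-8")).hexdigest().encode("ascii")
    return "sha256=" + hmac.new(secret, raw_body, hashlib.sha256).hexdigest()


def parse_webhook(raw_token: str, raw_body: bytes, signature_header: str) -> dict:
    """Verify the signature, then return the decoded event."""
    expected = sign(raw_token, raw_body).encode("utf-8")
    received = (signature_header or "").encode("utf-8")
    if not hmac.compare_digest(expected, received):  # constant-time compare
        raise PermissionError("webhook signature mismatch")
    return json.loads(raw_body)
```

Verify against the raw bytes exactly as received. Re-serializing the parsed JSON can change whitespace or key order and break the comparison. Since the webhook fires on success or failure, branch on `video.status` after parsing.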

MCP

MCP server

A native Streamable-HTTP Model Context Protocol endpoint. Authenticate with the same Bearer token your REST clients use.

endpoint

POST  https://staging.vlogme.ai/api/mcp

Available tools

script_grammar_help

list_voices

get_balance

estimate_credits

list_portraits

generate_video

get_video

list_my_videos

LLM-friendly grammar

Writing scripts with ChatGPT or Claude? Drop this self-contained spec into the chat: https://staging.vlogme.ai/docs/script-grammar.md. MCP clients can also call script_grammar_help to read the same spec inline.

Smoke test

bash

curl -X POST https://staging.vlogme.ai/api/mcp \
  -H "Authorization: Bearer $VLOGME_TOKEN" \
  -H "Content-Type: application/json" \
  -H "Accept: application/json, text/event-stream" \
  -d '{"jsonrpc":"2.0","id":1,"method":"tools/list","params":{}}'
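
The same JSON-RPC envelope used by the smoke test can carry tool invocations through the standard MCP `tools/call` method. The sketch below only builds the headers and message bodies; the helper names are hypothetical, and the `get_video` argument shape (`id`) is taken from the agent skill on this page. Whether the server expects an `initialize` handshake before other calls is not stated here.

```python
import json
from typing import Optional


def mcp_headers(token: str) -> dict:
    """Headers matching the smoke test above."""
    return {
        "Authorization": f"Bearer {token}",
        "Content-Type": "application/json",
        "Accept": "application/json, text/event-stream",
    }


def tool_call(name: str, arguments: Optional[dict] = None,
              request_id: int = 1) -> dict:
    """Build a JSON-RPC 2.0 message that invokes one MCP tool."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments or {}},
    }


def encode(message: dict) -> bytes:
    """Serialize a message for use as the POST body."""
    return json.dumps(message).encode("utf-8")
```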

MCP

Client setup

Claude Code, Cursor and Codex CLI speak Streamable HTTP natively. Older / stdio-only clients use the mcp-remote bridge.

Claude Code

bash

claude mcp add --transport http vlogme https://staging.vlogme.ai/api/mcp \
  --header "Authorization: Bearer $VLOGME_TOKEN"

Cursor

Add to ~/.cursor/mcp.json (or the project's .cursor/mcp.json):

json

{
  "mcpServers": {
    "vlogme": {
      "url": "https://staging.vlogme.ai/api/mcp",
      "headers": { "Authorization": "Bearer YOUR_TOKEN_HERE" }
    }
  }
}

Codex CLI — Option A: OAuth (recommended)

Opens a browser, you click "Allow" once:

bash

codex mcp add vlogme --url https://staging.vlogme.ai/api/mcp
codex mcp login vlogme

# verify
codex mcp list           # should show: vlogme   Auth: OAuth
codex mcp get vlogme
codex doctor --summary

Codex CLI — Option B: API token

Use a vlm_live_* token from Settings → API. Export it first so codex doctor doesn't warn about a missing env var:

bash

export VLOGME_TOKEN=vlm_live_xxxxxxxxxxxx
codex mcp add vlogme --url https://staging.vlogme.ai/api/mcp --bearer-token-env-var VLOGME_TOKEN

Don't mix auth modes

Don't combine --bearer-token-env-var with OAuth. If you added the env-var flag but never set the variable, codex doctor reports missing MCP env vars. Remove the server (codex mcp remove vlogme) and re-add it with only one mode. Start a new Codex session after any change — a running agent won't pick up new MCP tools dynamically.

Claude Desktop / stdio-only clients

Older or stdio-only clients (Claude Desktop, legacy Codex builds) need the mcp-remote bridge. Add to claude_desktop_config.json:

json

{
  "mcpServers": {
    "vlogme": {
      "command": "npx",
      "args": [
        "-y", "mcp-remote",
        "https://staging.vlogme.ai/api/mcp",
        "--header", "Authorization: Bearer YOUR_TOKEN_HERE"
      ]
    }
  }
}

MCP

End-to-end example

What a one-shot natural-language prompt actually does once your assistant is wired to the Vlogme MCP server.

"Render a 15-second talking-portrait of my first portrait, voice Adam, saying: 'Mondays are for shipping. Let's go.'"

Claude (or any MCP-capable assistant) will run, in order:

1. list_portraits → pick portrait id #1
2. list_voices → resolve "Adam" → voice id
3. estimate_credits → confirm cost ≈ 15
4. generate_video → returns a video id
5. get_video (polled) → returns the final mp4 URL

Honest limitations

- No progress streaming yet; poll get_video every few seconds.
- Portraits cannot be uploaded through MCP (binary uploads aren't part of the protocol). Upload them in the web app or via REST first, then reference them by id.
- Per-plan rate limits apply to MCP calls the same way they do to REST.

OAuth discovery

OAuth discovery is published at /.well-known/oauth-protected-resource (RFC 9728), so any compliant client (Claude Desktop, ChatGPT) can pop the sign-in flow automatically on a 401. Scripts and Cursor configs should use a Bearer token from Settings → API.

Agents

Agent skill

Drop this into .agents/skills/vlogme/SKILL.md (or the equivalent for your agent) so the agent knows when and how to use the tools.

markdown

---
name: vlogme
description: Generate talking-avatar videos via the Vlogme MCP server. Use when the user wants to turn a portrait + script into a short video.
---

You have access to the `vlogme` MCP server. Workflow:

1. Call `list_voices` once per session to discover voice ids; cache the result.
2. Call `get_balance` to confirm the user has enough credits
   (≈ 1 credit per second of output, minimum 10 per render).
3. Call `generate_video` with:
   - `portrait_url` (https URL of a clear face photo, 9:16 or square)
     OR `portrait_base64`.
   - `script` (text to speak, ≤ 30 000 chars) AND `voice_id` from step 1
     — OR `audio_url`/`audio_base64` if narration already exists.
   - Optional: `emotion_preset` (calm | natural | expressive | sad),
     `live_subtitles` (default true), `title`, `video_prompt`.
4. Poll `get_video { id }` every 10 s until `status` is `completed`
   (typical 1–3 minutes). Then `video_url` is a signed mp4 link valid
   for 1 hour — surface or download it immediately.

If `status` becomes `failed`, read `error_message`. Credits are auto-refunded on failure.

Tips: scripts under ~150 words render fastest. Portrait must show exactly one clear face.
Output is always 9:16 vertical.

Reference

Errors

All errors are JSON in the shape { error: { code, message } }. Codes are stable — branch on code, not on the human-readable message.

| Code                 | Status | Meaning                                     |
| -------------------- | ------ | ------------------------------------------- |
| missing_token        | 401    | No Authorization header                     |
| invalid_token        | 401    | Token revoked or malformed                  |
| plan_required        | 403    | Free plan — upgrade to Basic+               |
| insufficient_credits | 402    | Not enough credits to start render          |
| invalid_input        | 400    | Validation, unknown voice_id, etc.          |
| invalid_asset        | 400    | Portrait/audio URL unreachable or too large |
| invalid_json         | 400    | Request body is not JSON                    |
| not_found            | 404    | Resource doesn't exist or isn't yours       |
| internal_error       | 500    | Unexpected server error                     |
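
Branching on the stable `code` field could look like the sketch below. The class `VlogmeError`, the helper names, and the action labels are all hypothetical; only the error shape and the codes come from the table above. Treating `internal_error` as retryable is an assumption.

```python
import json


class VlogmeError(Exception):
    """An API error decoded from the { error: { code, message } } shape."""

    def __init__(self, status: int, code: str, message: str):
        super().__init__(f"{status} {code}: {message}")
        self.status, self.code, self.message = status, code, message


def parse_error(status: int, body) -> VlogmeError:
    """Decode an error body, tolerating responses that are not valid JSON."""
    try:
        err = json.loads(body)["error"]
        return VlogmeError(status, err["code"], err.get("message", ""))
    except (ValueError, KeyError, TypeError):
        return VlogmeError(status, "unknown", "unparseable error body")


# Hypothetical action labels keyed by the documented codes.
ACTIONS = {
    "missing_token": "fix_credentials",
    "invalid_token": "fix_credentials",
    "plan_required": "upgrade_plan",
    "insufficient_credits": "top_up_credits",
    "invalid_input": "fix_request",
    "invalid_asset": "fix_request",
    "invalid_json": "fix_request",
    "not_found": "check_id",
    "internal_error": "retry_later",  # assumption: safe to retry with backoff
}


def next_action(code: str) -> str:
    """Map an error code to a coarse recovery step."""
    return ACTIONS.get(code, "report")
```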

Talking-avatar videos from your code, agent, or LLM.
