# VlogMe Script Grammar — single source of truth

> Drop this file into ChatGPT, Claude, or any LLM. It is the complete spec for VlogMe scripts as the app actually runs them end‑to‑end. **Anything not described here does not exist.** If the prompt the AI sees and this doc ever disagree, this doc wins — fix the prompt.

---

## 0. What you are writing

A **VlogMe script** is plain text that drives a 9:16 short‑form video (TikTok / Reels / Shorts).

Inputs the user uploaded before you write anything:

- **N portrait photos**, numbered `1..N`, referenced as `@image1`, `@image2`, …
- **M audio clips** (optional), numbered `1..M`, referenced as `@audio1`, `@audio2`, …

Every photo carries **one role** (stored on the card, not in the script line):

| Role        | Pastel tint | Meaning                                                                           |
| ----------- | ----------- | --------------------------------------------------------------------------------- |
| `avatar`    | amber       | Talking head. Lip‑synced TTS or uploaded audio.                                   |
| `video`     | red         | Full‑screen cinematic shot generated from the photo via Hunyuan i2v.              |
| `overlay`   | sky         | Short clip layered **on top of** the currently‑speaking avatar.                   |
| `continue`  | violet      | A `video` shot whose first frame chains from the previous shot's last frame.      |

> Roles are decided per photo, not per line. If photo 2 is `video`, **every** `@image2` reference in the script is a video shot — you don't pick the role on the line.

Output: one block of plain text using ONLY the grammar in §2.

---

## 1. Pipeline (who owns what)

```text
WishesField.tsx                       ← user types a brief + uploads photos/audio
   │
   ▼
buildScriptFromBrief (server)         ← src/lib/script-from-brief.functions.ts
   │   Gemini Pro/Flash writes:
   │     • script           (the text below, §2)
   │     • roleChanges      (per-photo role overrides)
   │     • videoPrompts     (per-photo cinematic prompt for non-avatar roles)
   │     • assetProposals   (new @imageN / @audioN to generate)
   │     • imageCaptions    (Flash captions for new uploads)
   │     • reply            (1–3 friendly sentences shown in chat)
   ▼
PhotoMentionEditor.tsx                ← user-editable script with inline @chips
   │
   ▼
useFlatScriptBridge.ts                ← canonicalises tags, preserves slot indices
   │
   ▼
scenes.ts (serializer)                ← emits the render plan
   │
   ▼
render fleet                          ← Hunyuan / overlay / avatar pipeline
```

**Key rule**: anything that *visually customises* a non‑avatar photo (cinematic prompt, duration, role) lives on **the card** — it is read from `videoPrompts` / role state, **not** parsed back out of the script text every render. The script body is just the timeline order.
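To make the key rule concrete, here is a minimal sketch of how a bare `@imageN` line could be resolved at render time. `PhotoCard`, `ShotPlan`, and `planShot` are hypothetical names used only for illustration; they are not taken from the codebase and are not part of the grammar. The point is that the prompt and the duration (1–5s, default 5s, see §2.3) are read from the card, and the script line contributes nothing but the tag.

```typescript
// Hypothetical shapes for illustration only; the real card state may differ.
type Role = "avatar" | "video" | "overlay" | "continue";

interface PhotoCard {
  role: Role;
  videoPrompt?: string;     // cinematic prompt, stored on the card
  durationSeconds?: number; // slider value, 1-5, default 5
}

interface ShotPlan {
  tag: string;
  role: Role;
  prompt: string;
  durationSeconds: number;
}

// Resolve a bare `@imageN` script line against card state.
// Nothing is parsed out of the script text except the tag itself.
function planShot(tag: string, cards: Record<string, PhotoCard>): ShotPlan {
  const card = cards[tag];
  if (!card) throw new Error(`No card for ${tag}`);
  if (card.role === "avatar") {
    throw new Error(`${tag} is an avatar; a bare tag is not a valid line for it`);
  }
  if (!card.videoPrompt) throw new Error(`${tag} has no videoPrompt on its card`);
  return {
    tag,
    role: card.role,
    prompt: card.videoPrompt,
    durationSeconds: card.durationSeconds ?? 5,
  };
}

const cards: Record<string, PhotoCard> = {
  "@image1": { role: "avatar" },
  "@image2": { role: "video", videoPrompt: "Surfer drops into a glassy barrel.", durationSeconds: 4 },
};
const plan = planShot("@image2", cards);
```

Editing the prompt on the card changes `plan.prompt` on the next render without touching the script text, which is the property the key rule protects.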

---

## 2. The only grammar that exists

There are exactly **five** line forms (§2.1–§2.5). Anything else is a broken response.

### 2.1 Avatar speech
```
@imageN <spoken words>
```
- Plain words, optionally with ElevenLabs emotion brackets (`[excited]`, `[whispers]`, `[laughs]`, `[pause]`). Brackets stay English even when the speech is in another language.
- Photo `N` must have role `avatar`.

### 2.2 Avatar lip‑sync to uploaded audio
```
@imageN @audioM
```
- Lip‑syncs the avatar photo to audio clip `M`.
- Two tokens on the line, nothing else. Never emit the same beat twice, once as spoken text (§2.1) and again as `@audioM`.

### 2.3 Silent cinematic shot (video / overlay‑as‑fullscreen / continue)
```
@imageN
```
- A bare `@imageN` token on its own line.
- The cinematic instruction lives on the **card** (`videoPrompt`), not in the script body.
- Duration lives on the **card** (slider, 1–5s, default 5s).
- Photo `N` must have role `video`, `overlay`, or `continue`.

> Why bare? Because the editor renders this line as a read‑only "cinematic shot" chip backed by the card's prompt. Putting the prompt inline would desync the two surfaces and is exactly the bug the `{@imageN v:…}:D` form used to cause.

### 2.4 Overlay over a currently‑speaking avatar (the most common short‑form pattern)
```
@image1 Watch this {@image3 slow push-in on a coffee cup}:3 and boom.
```

**What an overlay actually does:** a short video clip visually **covers the avatar at the exact spot where `{…}:D` sits inside the speech line**. The avatar's voice does **not** pause — TTS keeps running through the whole line from start to finish. Only the picture is replaced for `D` seconds; when the clip ends, the avatar's face comes back and finishes the line.

This is the canonical **"talking head → b‑roll → talking head"** pattern. The person talks, a clip appears on screen while their voice keeps narrating, then the person reappears and keeps talking. Most short‑form videos rely on it heavily, so reach for it before §2.3 silent shots.
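The picture timeline for one speech line with a single overlay can be sketched as below. This is an illustration under assumptions: the speech duration and the overlay start time are taken as given (in practice they would presumably come from TTS word timings, which this doc does not specify), and `visualSegments` is a hypothetical helper, not an app function.

```typescript
interface Segment {
  source: "avatar" | "overlay";
  start: number; // seconds from the start of the line
  end: number;
}

// speechSeconds:  how long the voice runs for the whole line (it never pauses).
// overlayStart:   the moment the word after the brace would begin.
// overlaySeconds: the D in {...}:D.
function visualSegments(
  speechSeconds: number,
  overlayStart: number,
  overlaySeconds: number
): Segment[] {
  const overlayEnd = overlayStart + overlaySeconds;
  const segments: Segment[] = [];
  if (overlayStart > 0) {
    segments.push({ source: "avatar", start: 0, end: overlayStart });
  }
  segments.push({ source: "overlay", start: overlayStart, end: overlayEnd });
  // The avatar only comes back if there is speech left after the clip.
  // If the speech ends first, the clip simply runs to its own end.
  if (overlayEnd < speechSeconds) {
    segments.push({ source: "avatar", start: overlayEnd, end: speechSeconds });
  }
  return segments;
}

// 6 s of speech; the clip covers seconds 2 to 5; the avatar returns for the last second.
const midLine = visualSegments(6, 2, 3);
```

Only the picture track is segmented here. The voice track is a single uninterrupted span from 0 to `speechSeconds`.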

- The position of `{…}:D` inside the line decides **when** the clip appears. Put the brace after the words that should still be heard while the avatar is on screen; the clip starts on the next word and runs for `D` seconds.
- The brace body **IS** the prompt — no `v:` prefix, no other prefixes.
- `:D` is duration in seconds, integer or fractional, **1..5**. Omit `:D` to default to 5.
- Photo `N` in `{@imageN …}` must have role `overlay`. (If you also need that same content as a full‑screen shot somewhere else, use a different `@imageN` — one role per photo.)
- The clip's own ambient/SFX is mixed **under** the avatar voice at low level. If you want loud sound, bake it into the overlay prompt body.
- **Multiple overlays in one speech line are allowed** and fire in order:
  ```
  @image1 First this {@image3 slow push-in}:2 then this {@image4 quick whip-pan}:2 and we land here.
  ```
- **Trailing overlay** — if the line's spoken words end *before* `D` elapses, the remaining time plays the overlay with just its own ambient sound. Useful as a deliberate emotional outro:
  ```
  @image1 And then everything changed {@image4 wide pull-back of the empty street at dusk}:5
  ```
- Optional extras inside the braces, separated by ` | `:
  ```
  {@image3 slow push-in on a coffee cup | n:no people | s:gentle ambient cafe}:3
  ```
  Most of the time just the bare prompt is enough.
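For readers who think in code, the overlay rules above can be sketched as a small parser. This is an illustration only, not the app's actual parser: `OverlayToken`, `OVERLAY_RE`, and `parseOverlays` are hypothetical names, and the extras after ` | ` are kept verbatim rather than interpreted.

```typescript
interface OverlayToken {
  image: number;           // N in @imageN
  prompt: string;          // brace body before the first " | "
  extras: string[];        // optional " | "-separated extras, kept verbatim
  durationSeconds: number; // D, defaults to 5 when ":D" is omitted
}

// Matches {@imageN body}:D with ":D" optional.
// The body may not contain braces or line breaks.
const OVERLAY_RE = /\{@image(\d+)\s+([^{}\n]+)\}(?::(\d+(?:\.\d+)?))?/g;

function parseOverlays(line: string): OverlayToken[] {
  const tokens: OverlayToken[] = [];
  OVERLAY_RE.lastIndex = 0; // the regex is global, so reset before each scan
  let m: RegExpExecArray | null;
  while ((m = OVERLAY_RE.exec(line)) !== null) {
    const parts = m[2].split(" | ").map((p) => p.trim());
    const durationSeconds = m[3] === undefined ? 5 : parseFloat(m[3]);
    if (durationSeconds < 1 || durationSeconds > 5) {
      throw new Error(`Overlay duration must be 1..5, got ${durationSeconds}`);
    }
    if (/^v:/.test(parts[0])) {
      throw new Error('The "v:" prefix does not exist; the brace body is the prompt');
    }
    tokens.push({
      image: parseInt(m[1], 10),
      prompt: parts[0],
      extras: parts.slice(1),
      durationSeconds,
    });
  }
  return tokens; // in the order they fire
}

const overlays = parseOverlays(
  "@image1 First this {@image3 slow push-in}:2 then this {@image4 quick whip-pan} and we land here."
);
```

Tokens come back in left-to-right order, matching the rule that multiple overlays fire in order.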

#### Overlay vs silent shot — how to choose

| You want…                                                                | Use                                                |
| ------------------------------------------------------------------------ | -------------------------------------------------- |
| Person talks the whole time, b‑roll appears in the middle of the line    | **Overlay** (§2.4) — `{…}:D` inline                |
| Person talks, then a wordless cinematic frame, then person talks again   | **Silent shot** between two speech lines (§2.3)    |
| Person talks AND the visual stays on the avatar (no clip)                | Plain `@imageN <words>` (§2.1)                     |

#### Overlay hard don'ts

- **No braces on their own line, no line breaks inside braces.** The whole `{…}:D` block is one inline token on one speech line.
- **No `v:` prefix.** The brace body is the prompt verbatim.
- **No overlay without a surrounding avatar speech line.** A bare `{@imageN …}:D` on its own line is forbidden — if you want full‑screen, use §2.3 bare `@imageN`.
- **Don't double up.** Don't follow an overlay with a silent shot of the same content back‑to‑back; pick one.


### 2.5 Standalone audio (rare)
```
@audioN
```
- Music / sfx bed not tied to an avatar. Used only when the user uploaded a clip they want played, or asked for standalone background music spanning shots.
- Prefer baking sound into a cinematic shot's prompt instead.
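The line forms in §2.1–§2.5 are simple enough to recognise by shape. The sketch below is a hypothetical classifier written for this doc, not the app's parser. It checks syntax only: role checks (for example "photo `N` must be an avatar") need card state, and duration range checks are left out. Treating a speech line that contains an overlay but no spoken words as invalid is this sketch's own assumption.

```typescript
type LineForm = "speech" | "speech+overlay" | "lipsync" | "shot" | "audio";

// One well-formed inline overlay token: {@imageN prompt}:D with ":D" optional.
const OVERLAY_TOKEN = /\{@image\d+\s+[^{}\n]+\}(?::\d+(?:\.\d+)?)?/g;

// Returns the line form, or null when the line matches none of them.
function classifyLine(line: string): LineForm | null {
  const s = line.trim();
  if (/^@image\d+$/.test(s)) return "shot";                // §2.3
  if (/^@audio\d+$/.test(s)) return "audio";               // §2.5
  if (/^@image\d+\s+@audio\d+$/.test(s)) return "lipsync"; // §2.2

  const speech = /^@image\d+\s+(.+)$/.exec(s);
  if (!speech) return null;
  const body = speech[1];
  if (/\{@image\d+\s+v:/.test(body)) return null;          // legacy "v:" prefix

  // Strip well-formed overlays; any brace or tag left over is malformed.
  const rest = body.replace(OVERLAY_TOKEN, " ");
  if (/[{}]|@(?:image|audio)\d/.test(rest)) return null;
  if (rest.trim() === "") return null; // assumption: an overlay needs spoken words around it

  return /\{@image\d/.test(body) ? "speech+overlay" : "speech"; // §2.4 or §2.1
}

const forms = [
  classifyLine("@image1 [excited] Watch this drop in!"),
  classifyLine("@image1 @audio2"),
  classifyLine("@image2"),
  classifyLine("@image2:4 @image1 hi"),
];
```

A `null` result corresponds to what this spec calls a broken response.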

---

## 3. The "code‑prompt" data the AI returns

`buildScriptFromBrief` returns a JSON object (via the `submit_script` tool call). The script body in §2 is only one field; the rest are side‑channels the client merges into the project state:

```ts
{
  reply:        "Short friendly 1–3 sentence summary shown in chat.",
  script:       "the script, exactly per §2",
  roleChanges:  { "@image2": "video", "@image4": "continue" },
  videoPrompts: { "@image2": "Surfer drops into a glassy barrel, low GoPro angle, spray flying.",
                  "@image4": "Same surfer carves out, slow dolly-back, distant crowd cheer." },
  imageCaptions:{ "@image3": "Young woman smiling in a sunlit kitchen." },
  assetProposals: {
    "@image5": { kind:"image", role:"video",
                 prompt:"Aerial shot of a coastline at golden hour, slow lateral drift." },
    "@audio2": { kind:"music", durationSeconds:30,
                 prompt:"Upbeat lo-fi groove, soft kick, warm pads." }
  }
}
```
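One way to picture the merge step is the sketch below. The field names come from the example above, but everything else is an assumption: which fields are optional, the `Card` shape, and the `mergeIntoCards` helper are hypothetical and not taken from the client code. The sketch applies only `roleChanges` and `videoPrompts`; `imageCaptions` and `assetProposals` are left out for brevity.

```typescript
type Role = "avatar" | "video" | "overlay" | "continue";

// Assumed shape of the submit_script payload, inferred from the example above.
interface AssetProposal {
  kind: "image" | "music"; // only the kinds shown in this doc
  role?: Role;
  durationSeconds?: number;
  prompt: string;
}

interface ScriptResponse {
  reply: string;
  script: string;
  roleChanges?: Record<string, Role>;
  videoPrompts?: Record<string, string>;
  imageCaptions?: Record<string, string>;
  assetProposals?: Record<string, AssetProposal>;
}

// Hypothetical per-photo card state on the client.
interface Card {
  role: Role;
  videoPrompt?: string;
}

// Returns a new card map; the input is not mutated.
function mergeIntoCards(
  cards: Record<string, Card>,
  res: ScriptResponse
): Record<string, Card> {
  const next: Record<string, Card> = {};
  for (const tag of Object.keys(cards)) next[tag] = { ...cards[tag] };

  const roles = res.roleChanges ?? {};
  for (const tag of Object.keys(roles)) {
    if (next[tag]) next[tag].role = roles[tag];
  }

  const prompts = res.videoPrompts ?? {};
  for (const tag of Object.keys(prompts)) {
    if (next[tag]) next[tag].videoPrompt = prompts[tag];
  }
  return next;
}

const before: Record<string, Card> = {
  "@image1": { role: "avatar" },
  "@image2": { role: "avatar" },
};
const after = mergeIntoCards(before, {
  reply: "Done.",
  script: "@image1 Watch this drop in!\n@image2",
  roleChanges: { "@image2": "video" },
  videoPrompts: { "@image2": "Surfer drops into a glassy barrel." },
});
```

The `script` field is not touched by the merge: it stays the timeline order, while role and prompt land on the cards.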

### 3.1 Hard rules for the data

1. **One role per photo.** `roleChanges` lists only photos whose role you are *changing* from the user's hint. The same photo cannot be both `avatar` and `video`.
2. **`videoPrompts` is required** for every `@imageN` whose final role is `video`, `overlay`, or `continue` — including newly proposed photos. Plain prose, 1–2 sentences, ≤ 200 chars, no `@tags`, no brackets, no boilerplate like "Cinematic 9:16 reference image". Subject + motion + camera move + mood + lighting + any sound that should be baked in.
3. **Every `@tag` not in the upload list** must have an `assetProposals` entry; otherwise the user sees a dead placeholder. Reuse uploaded media whenever it fits, and cap proposals at 6 images and 2 audio clips per turn.
4. **Avatar proposals must be a person.** When `kind:"image"` and `role:"avatar"`, the `prompt` describes a **portrait of a person** (subject + framing + lighting). It **must not** echo objects, scenery, drinks, or props from the brief. If the brief is about a coffee shop, the avatar isn't a coffee cup — it's the person ordering it.
5. **Bake sound into shots.** If a cinematic shot needs sound, write it into that shot's `videoPrompt` ("…spray flying, loud wave roar"). Don't spawn a parallel `@audioN` just for sfx that belongs to one shot.
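6. **One beat per line.** Two bare `@` tags on one line are legal only as `@imageN @audioM` (§2.2); an overlay tag inside `{…}:D` (§2.4) belongs to its speech line and does not count as a second beat.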
7. **Total script ≤ 8000 characters.**
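Several of these rules are mechanical enough to check in code. The following is a rough sketch covering rules 2, 3, and 7 only; `Payload` and `checkHardRules` are hypothetical names, the caller is assumed to supply each photo's final role and the list of uploaded tags, and reading "no brackets" as square and curly brackets is this sketch's interpretation.

```typescript
type Role = "avatar" | "video" | "overlay" | "continue";

interface Payload {
  script: string;
  videoPrompts: Record<string, string>;
  assetProposals: Record<string, { kind: string; role?: Role }>;
}

// finalRoles: role of every photo after roleChanges and proposals are applied.
// uploaded:   tags the user actually uploaded.
// Returns a list of human-readable problems; an empty list means no rule tripped.
function checkHardRules(
  p: Payload,
  finalRoles: Record<string, Role>,
  uploaded: string[]
): string[] {
  const problems: string[] = [];

  // Rule 2: every non-avatar photo needs a short, tag-free videoPrompt.
  for (const tag of Object.keys(finalRoles)) {
    if (finalRoles[tag] === "avatar") continue;
    const prompt = p.videoPrompts[tag];
    if (!prompt) problems.push(`${tag}: missing videoPrompt`);
    else if (prompt.length > 200) problems.push(`${tag}: videoPrompt over 200 chars`);
    else if (/@(?:image|audio)\d+|[\[\]{}]/.test(prompt)) {
      problems.push(`${tag}: videoPrompt contains tags or brackets`);
    }
  }

  // Rule 3: every tag used in the script that was not uploaded needs a proposal.
  const seen: Record<string, boolean> = {};
  const tagRe = /@(?:image|audio)\d+/g;
  let m: RegExpExecArray | null;
  while ((m = tagRe.exec(p.script)) !== null) {
    const tag = m[0];
    if (seen[tag]) continue;
    seen[tag] = true;
    if (uploaded.indexOf(tag) === -1 && !p.assetProposals[tag]) {
      problems.push(`${tag}: not uploaded and no assetProposals entry`);
    }
  }

  // Rule 3 caps: at most 6 image proposals and 2 audio proposals per turn.
  const kinds = Object.keys(p.assetProposals).map((t) => p.assetProposals[t].kind);
  const images = kinds.filter((k) => k === "image").length;
  if (images > 6) problems.push("more than 6 image proposals");
  if (kinds.length - images > 2) problems.push("more than 2 audio proposals");

  // Rule 7: total script length.
  if (p.script.length > 8000) problems.push("script over 8000 characters");
  return problems;
}

const clean = checkHardRules(
  {
    script: "@image1 [excited] Watch this drop in!\n@image2\n@image1 [laughs] Insane wave!",
    videoPrompts: { "@image2": "Surfer drops into a glassy barrel, low GoPro angle, spray flying, loud wave roar." },
    assetProposals: {},
  },
  { "@image1": "avatar", "@image2": "video" },
  ["@image1", "@image2"]
);
```

Rules 1, 4, 5, and 6 involve judgement about content or per-line shape, so they are not covered here.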

### 3.2 Role decision tree (when the user's hint is wrong)

The role the user picked at upload is a **hint**, not a contract. Override only when the photo clearly can't play that role:

- Landscape / object / wide shot tagged `avatar` → `video`.
- Clean talking‑head portrait tagged `video` **and** the brief gives that person a speaking line → `avatar`.
- Near‑duplicate of the previous photo, same moment, same subject → `continue`.
- Quick punchy insert (logo, prop close‑up, text frame) layered while the avatar talks → `overlay`.

Otherwise **keep the hint**. Never flip an avatar to video just because you'd prefer a cinematic shot.
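Read as code, the decision tree might look like the sketch below. Everything in it is illustrative: `PhotoTraits` and `resolveRole` are hypothetical, how the traits get detected is not specified in this doc, and the evaluation order (first matching rule wins, genuine avatar portraits are never reassigned) is an assumption based on the bullets above.

```typescript
type Role = "avatar" | "video" | "overlay" | "continue";

// Hypothetical per-photo observations; how they are detected is out of scope.
interface PhotoTraits {
  isLandscapeObjectOrWide: boolean; // clearly not a talking head
  isTalkingHeadPortrait: boolean;   // clean portrait of a person
  hasSpeakingLine: boolean;         // the brief gives this person words to say
  nearDuplicateOfPrevious: boolean; // same moment, same subject as the previous photo
  isQuickInsert: boolean;           // logo, prop close-up, text frame shown while the avatar talks
}

function resolveRole(hint: Role, t: PhotoTraits): Role {
  if (hint === "avatar") {
    // Flip an avatar only when the photo clearly cannot be a talking head.
    return t.isLandscapeObjectOrWide ? "video" : "avatar";
  }
  if (hint === "video" && t.isTalkingHeadPortrait && t.hasSpeakingLine) return "avatar";
  if (t.nearDuplicateOfPrevious) return "continue";
  if (t.isQuickInsert) return "overlay";
  return hint; // otherwise keep the hint
}

const none: PhotoTraits = {
  isLandscapeObjectOrWide: false,
  isTalkingHeadPortrait: false,
  hasSpeakingLine: false,
  nearDuplicateOfPrevious: false,
  isQuickInsert: false,
};
const keptHint = resolveRole("video", none);
```

Whatever `resolveRole` returns, only a result that differs from the hint belongs in `roleChanges`.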

---

## 4. Worked examples

### 4.1 Good

Uploaded: `@image1` (portrait, hint=avatar), `@image2` (surfer wave, hint=video).

```
@image1 [excited] Watch this drop in!
@image2
@image1 [laughs] Insane wave!
```
```ts
roleChanges:  { "@image2": "video" }    // redundant: already matches the hint, so §3.1 rule 1 says to omit it
videoPrompts: { "@image2": "Surfer drops into a glassy barrel, low GoPro angle, spray flying, loud wave roar." }
```

### 4.2 Good — overlay while talking

```
@image1 We open the doors {@image3 slow push-in on the neon OPEN sign}:3 and the line is already around the block.
```
```ts
roleChanges:  { "@image3": "overlay" }
videoPrompts: { "@image3": "Slow push-in on a glowing neon OPEN sign, dim street light, soft hum." }
```

### 4.3 Good — invent a missing avatar

User uploaded a coffee‑shop photo (`@image2` here), no portrait, and asked for a barista monologue.

```
@image2
@image3 Welcome in — what'll it be today?
```
```ts
roleChanges:  { "@image2": "video" }
videoPrompts: { "@image2": "Steam curling off an espresso shot, macro lens, warm overhead light." }
assetProposals: {
  "@image3": { kind:"image", role:"avatar",
               prompt:"Friendly barista in a black apron, eye-level portrait, soft window light, shallow depth of field." }
}
```

### 4.4 Bad — do NOT do this

```
@image2                    ← BARE tag for an AVATAR photo. Avatars need speech (§2.1) or @audioM (§2.2).
{@image2 v:...}:5          ← The "v:" form does not exist. Use bare @image2 on its own line (§2.3) or inline {@image2 ...}:D overlay (§2.4).
{@image2}:5                ← Braces without a prompt. Forbidden.
@image2:4 @image1 hi       ← Two tags + invented :4 duration on the line. Forbidden.
{
  @image2 cinematic shot
}:5                        ← Brace on its own line. Forbidden — the whole brace block is ONE line.
```

```ts
// BAD assetProposals
"@image1": { kind:"image", role:"avatar",
             prompt:"A cold glass of beer on a wooden table." }
// The user asked for an avatar. A glass is not a person. See §3.1 rule 4.
```

---

## 5. Backwards compatibility

Older chat history may contain `{@imageN v:<prompt>}:D` lines from the previous prompt version. The server still parses those — it strips the braces, moves the `v:` body into `videoPrompts`, and emits the canonical bare `@imageN` line. **Do not produce that form in new responses.** It is dead weight kept only so old projects keep loading.
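The conversion described above can be sketched roughly as follows. This is not the server's code: `LEGACY_RE` and `migrateLegacyLine` are hypothetical, and because this doc does not say what happens to the legacy `:D` value, the sketch simply discards it.

```typescript
interface Migrated {
  line: string;                         // canonical bare @imageN line
  videoPrompts: Record<string, string>; // prompt moved out of the script
}

// Matches a whole legacy line: {@imageN v:<prompt>}:D (":D" treated as optional).
const LEGACY_RE = /^\s*\{(@image\d+)\s+v:\s*([^{}\n]*?)\s*\}(?::\d+(?:\.\d+)?)?\s*$/;

// Returns null for lines that are not in the legacy form, so callers can leave them untouched.
function migrateLegacyLine(line: string): Migrated | null {
  const m = LEGACY_RE.exec(line);
  if (!m) return null;
  const videoPrompts: Record<string, string> = {};
  videoPrompts[m[1]] = m[2]; // the "v:" body becomes the card prompt
  return { line: m[1], videoPrompts };
}

const migrated = migrateLegacyLine("{@image2 v:Surfer drops in, low angle}:5");
```

Lines already in the current grammar, including inline overlays inside a speech line, do not match the pattern and pass through unchanged.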

---

## 6. Quick reference card

| Want                                              | Write                                                          |
| ------------------------------------------------- | -------------------------------------------------------------- |
| Avatar speaks                                     | `@image1 Here is what I want to say.`                          |
| Avatar lip‑syncs uploaded audio                   | `@image1 @audio2`                                              |
| Silent cinematic shot (video / overlay / continue) | `@image2` on its own line + `videoPrompts["@image2"]`          |
| Overlay clip while avatar keeps talking           | `@image1 Look at this {@image3 slow push-in}:3 right here.`    |
| Standalone music / sfx bed                        | `@audio1` on its own line                                      |
| Change a photo's role                             | `roleChanges: { "@image2": "video" }`                          |
| Invent a new photo                                | use `@image5` in the script + `assetProposals["@image5"]`      |
| Invent music                                      | use `@audio2` + `assetProposals["@audio2"] = { kind:"music" }` |