Building Multimodal Apps
Putting modalities together in a real product raises architecture questions a text-only app never faces. This page covers the decisions that recur.
Native model vs. pipeline
There are two ways to build a system that handles more than one modality.
A native multimodal model accepts several modalities directly — one model takes image and text, or audio and text. A pipeline chains specialized single-purpose models: speech-to-text → LLM → text-to-speech, or OCR → LLM.
| | Native multimodal model | Pipeline of specialists |
|---|---|---|
| Cross-modal nuance | Preserved — tone, layout, context | Lost at each boundary |
| Control & debugging | One opaque box | Inspect and swap each stage |
| Cost / latency | One call | Often cheaper, tunable per stage |
| Best stage quality | Whatever the model offers | Pick the best tool for each step |
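The "inspect and swap each stage" row can be made concrete with a small sketch. It assumes a pipeline is just an ordered list of named functions; `run_pipeline`, `StageTrace`, and the three `fake_*` stages are hypothetical stand-ins for real speech-to-text, LLM, and text-to-speech calls, not any particular library's API.

```python
import time
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class StageTrace:
    name: str
    output: Any
    seconds: float


def run_pipeline(stages: list[tuple[str, Callable[[Any], Any]]], payload: Any):
    """Run stages in order, recording each intermediate output and latency."""
    traces: list[StageTrace] = []
    for name, stage in stages:
        start = time.perf_counter()
        payload = stage(payload)
        traces.append(StageTrace(name, payload, time.perf_counter() - start))
    return payload, traces


# Hypothetical stand-ins for real model calls.
def fake_stt(audio: bytes) -> str:
    return audio.decode("utf-8")  # pretend the bytes are already a transcript


def fake_llm(text: str) -> str:
    return f"You said: {text}"


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")  # pretend this is synthesized audio


voice_pipeline = [("stt", fake_stt), ("llm", fake_llm), ("tts", fake_tts)]
result, traces = run_pipeline(voice_pipeline, b"book a table for two")
for trace in traces:
    print(f"{trace.name}: {trace.output!r} ({trace.seconds:.4f}s)")
```

Swapping a stage means replacing one entry in `voice_pipeline`, and the trace shows exactly what crossed each boundary. It also shows the cost: the LLM stage only ever sees the transcript string, so anything the transcript does not carry, such as tone, is gone by then.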
Multimodal RAG
Standard RAG assumes text. When the knowledge you want to retrieve is images, diagrams, screenshots, or audio, you have two options:
- Caption-based — at indexing time, a VLM describes each image in words; you then embed and retrieve those captions with ordinary text RAG. Simple, reuses your whole text stack, and the captions double as readable context. The cost: the caption is a lossy summary — detail the VLM didn’t mention is unsearchable.
- Multimodal embeddings — a CLIP-style model embeds images and text into one shared space, so a text query retrieves images directly with no captioning step. More faithful, but a separate model and index to run, and the retrieved image still needs describing before an LLM can reason over it.
Caption-based is the pragmatic default; reach for multimodal embeddings when visual nuance the captions miss is genuinely important.
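A minimal sketch of the caption-based route, under loud assumptions: `fake_caption` stands in for a real VLM call, and a bag-of-words vector stands in for a real text embedding model. The file names and captions are invented for illustration.

```python
import re
from collections import Counter
from math import sqrt


def fake_caption(image_id: str) -> str:
    """Hypothetical stand-in for a VLM call that describes an image in words."""
    captions = {
        "arch.png": "architecture diagram showing an api gateway in front of three services",
        "error.png": "screenshot of a login form with a red invalid password error",
        "sales.png": "bar chart of quarterly sales by region",
    }
    return captions[image_id]


def bag_of_words(text: str) -> Counter:
    """Toy text embedding: word counts. A real system would use an embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def build_index(image_ids):
    """Indexing time: caption each image once and store the caption's vector."""
    index = []
    for image_id in image_ids:
        caption = fake_caption(image_id)
        index.append((image_id, caption, bag_of_words(caption)))
    return index


def search(index, query: str, top_k: int = 1):
    """Query time: ordinary text retrieval over the captions."""
    q = bag_of_words(query)
    scored = [(cosine(q, vec), image_id, caption) for image_id, caption, vec in index]
    scored = [hit for hit in scored if hit[0] > 0]  # drop images with no overlap
    return sorted(scored, reverse=True)[:top_k]


index = build_index(["arch.png", "error.png", "sales.png"])
print(search(index, "login error screenshot"))
print(search(index, "blue button"))  # a detail no caption mentions: no hits
```

The second query illustrates the lossy-summary cost: if the image contained a blue button but the caption never said so, text retrieval cannot find it. The stored caption is returned with each hit, so it can be passed to the LLM as readable context.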
Cost and latency are different here
Mixed media breaks text-based intuitions:
- Images and audio are token-hungry. A single high-detail image can cost as much as several pages of text; audio scales with duration. Re-model your cost estimates — multimodal requests are not text requests.
- Generation is slow. Image and audio generation take seconds — run them asynchronously, stream progress, never block a request on them.
- Each modality fails its own way. A VLM misreads a chart; STT mistranscribes a name. Build a separate evaluation signal per modality — one aggregate score hides which part broke.
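The "never block a request" point can be sketched with `asyncio`. This is one possible shape, not a prescribed design: `fake_generate_image` is a hypothetical stand-in for a slow generation call, and the in-memory `jobs` dictionary stands in for whatever job store a real application would use.

```python
import asyncio
import uuid

jobs: dict[str, dict] = {}  # in-memory job store; a real app would persist this
_tasks: set[asyncio.Task] = set()  # keep references so tasks are not garbage collected


async def fake_generate_image(prompt: str) -> str:
    """Hypothetical stand-in for a slow image-generation call."""
    await asyncio.sleep(0.05)
    return f"image-for:{prompt}"


async def _run_job(job_id: str, prompt: str) -> None:
    jobs[job_id]["status"] = "running"
    try:
        jobs[job_id]["result"] = await fake_generate_image(prompt)
        jobs[job_id]["status"] = "done"
    except Exception as exc:  # record the failure instead of losing it
        jobs[job_id]["status"] = "failed"
        jobs[job_id]["error"] = str(exc)


async def submit(prompt: str) -> str:
    """Request handler: schedule the job and return an id immediately."""
    job_id = uuid.uuid4().hex
    jobs[job_id] = {"status": "queued", "result": None}
    task = asyncio.create_task(_run_job(job_id, prompt))
    _tasks.add(task)
    task.add_done_callback(_tasks.discard)
    return job_id


async def demo() -> dict:
    job_id = await submit("a lighthouse at dusk")
    status_at_return = jobs[job_id]["status"]  # generation has not started yet
    while jobs[job_id]["status"] not in ("done", "failed"):
        await asyncio.sleep(0.01)  # the client polls; the request was never blocked
    return {"status_at_return": status_at_return, **jobs[job_id]}


print(asyncio.run(demo()))
```

The handler returns a job id before any generation work has run, and the client learns the outcome by polling the job's status. The `status` field is also a natural place to report progress if the generation backend exposes it.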
Recurring patterns
- Document intelligence — PDF or scan → VLM/OCR → structured data extraction.
- Visual support — user sends a screenshot → VLM diagnoses the problem.
- Voice assistant — the voice agent pipeline: speech-to-text → LLM → text-to-speech.
- Generated media — text or image in → image/audio out, as an async job.
Each is the same discipline as any AI system: constrain the task, validate the output, handle the failure path — now across more than one kind of data.
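As an illustration of "constrain, validate, handle the failure path" for the document-intelligence pattern, here is a hedged sketch. The `Invoice` schema, the allowed currencies, and the `call_model` parameter are all assumptions made for the example; `call_model` is any function that takes document bytes and returns the model's raw text output.

```python
import json
from dataclasses import dataclass


@dataclass
class Invoice:
    invoice_number: str
    total: float
    currency: str


ALLOWED_CURRENCIES = {"USD", "EUR", "GBP"}  # illustrative constraint


def parse_invoice(raw: str) -> Invoice:
    """Validate model output; raise ValueError on anything unexpected."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    missing = {"invoice_number", "total", "currency"} - data.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    total = data["total"]
    if isinstance(total, bool) or not isinstance(total, (int, float)) or total < 0:
        raise ValueError("total must be a non-negative number")
    currency = data["currency"]
    if not isinstance(currency, str) or currency not in ALLOWED_CURRENCIES:
        raise ValueError(f"unsupported currency: {currency!r}")
    return Invoice(str(data["invoice_number"]), float(total), currency)


def extract_invoice(document: bytes, call_model, max_attempts: int = 2):
    """Try the model a bounded number of times, then fall back to human review."""
    errors = []
    for _ in range(max_attempts):
        try:
            return {"status": "ok", "invoice": parse_invoice(call_model(document))}
        except ValueError as exc:
            errors.append(str(exc))
    return {"status": "needs_review", "errors": errors}


# Hypothetical model stand-ins: one returns clean JSON, one returns chatty prose.
def good_model(document: bytes) -> str:
    return '{"invoice_number": "INV-7", "total": 120.5, "currency": "EUR"}'


def bad_model(document: bytes) -> str:
    return "Sure! The total is 120.50 euros."


print(extract_invoice(b"scan-bytes", good_model))
print(extract_invoice(b"scan-bytes", bad_model))
```

The failure path here is an explicit `needs_review` result carrying the validation errors, so a malformed extraction is routed somewhere visible rather than passed downstream as if it were data.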
Key takeaways
Choose a native multimodal model for cross-modal nuance and simplicity, or a pipeline of specialists for control and observability, while watching for information loss at the seams. For multimodal RAG, caption-based retrieval is the pragmatic default; multimodal embeddings are more faithful but add a model and index. Images and audio consume far more tokens than text and are slow to generate, so re-model cost and run generation asynchronously. Evaluate each modality separately, because each fails in its own way.