The Problem With Words

I've been building an AI-powered spatial visualization app for months now. At some point during that process, I noticed something I couldn't explain at first: generating a concept image in one AI tool and feeding it to another (a coding agent that was building the actual product) produced dramatically better results than any text prompt I could write.

Not marginally better. Dramatically better. The kind of difference where the model goes from technically satisfying the requirements to actually understanding the vision.

It reminds me of the old saying:

"A picture is worth a thousand words."

The strange part is that the AI doesn't "see" the image the way I do. It's not experiencing the aesthetic. It's not feeling the mood. So why does it work so much better than me carefully describing what I want? After all, to the AI model, text and images both collapse into machine-readable representations inside the same reasoning substrate.

I think the answer reveals something fundamental about how we should be working with AI, and it has implications way beyond prompt engineering.

Text Is Great for Constraints, Terrible for Taste

When you're directing an AI to build something, text is excellent at communicating constraints. Rules. Boundaries. Logic. You can write things like:

"No prose heuristics. Typed scene contract. Renderer-neutral substrate. Don't hardcode examples."

That's precise. The model can follow those instructions perfectly. But try communicating the felt quality bar, the part that actually determines whether the thing you're building is any good:

"This should feel alive. The canvas should carry the answer. It should make text feel obsolete. A conceptual question should become an environment, not a chart."

Read that again. It's accurate to what I want. But it's abstract. The model can interpret it a hundred different ways, and most of those interpretations will be technically valid but aesthetically wrong. The problem is that the thing I'm actually trying to communicate (the vibe, the spatial feel, the quality direction) isn't primarily logical. It's aesthetic, spatial, and experiential. Text is the wrong medium for that.

What a Concept Image Actually Compresses

A single concept image, even a rough one generated in a few seconds, packs an enormous amount of directional information into one artifact. The AI might not experience it aesthetically, but it can still extract:

  • Composition: focal points, contrast, pathways, hierarchy, balance
  • Density: how much visual information feels rich versus overwhelming
  • Annotation style: whether information feels integrated, contextual, minimal, technical, cinematic, etc.
  • Visual metaphor: journeys, systems, gravity, growth, tension, exploration, scale
  • Mood: dark, minimal, energetic, premium, organic, futuristic, playful
  • Product expectation: what kind of experience this is supposed to be before a single word is read

Each of those would take a paragraph to describe in text. Together, they'd take a full page. And even then, the model would have to reconstruct the relationships between those properties from words alone. The image delivers all of it simultaneously, with the relationships already baked in.

I Call This a Taste Vector

The concept image acts as what I've started calling a taste vector. It tells the model "aim over here," not with coordinates or specifications but with a compressed bundle of aesthetic direction that points toward the right region of the quality space.

When I was building with Codex, the difference was stark. Before I fed it concept images, it could satisfy every written goal I gave it. The code was clean. The architecture was correct. But the output felt generic: it was doing substrate cleanup when I needed spatial richness.

After the images, something clicked. The model suddenly understood the missing standard: natural conceptual questions need region/path/lane/form primitives and visual hierarchy, not just technically valid nodes with pill labels. The model's own interpretation was:

"Not templates, but a quality direction."

That's exactly right. The images didn't give it a blueprint to copy. They gave it a target to aim at.
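
To make the "region/path primitives and visual hierarchy" idea concrete, here is a minimal sketch of what a typed, renderer-neutral scene contract could look like. Every name in it (Region, Path, Scene, Emphasis, and the "exactly one focal region" rule) is a hypothetical illustration, not the actual contract from my project, and lanes and forms are left out to keep it short. The point is what it can and can't check: structure is enforceable in types, while whether the scene feels right is the part the images have to calibrate.

```python
from dataclasses import dataclass, field
from enum import Enum


class Emphasis(Enum):
    """Visual hierarchy level (hypothetical, for illustration only)."""
    BACKGROUND = 0
    SUPPORTING = 1
    FOCAL = 2


@dataclass(frozen=True)
class Region:
    """An area of the canvas that groups related ideas."""
    name: str
    emphasis: Emphasis = Emphasis.SUPPORTING


@dataclass(frozen=True)
class Path:
    """A directed route between two regions, e.g. a journey or a flow."""
    source: str
    target: str


@dataclass
class Scene:
    """Renderer-neutral description: no pixels, colors, or layout here."""
    regions: list[Region] = field(default_factory=list)
    paths: list[Path] = field(default_factory=list)

    def validate(self) -> list[str]:
        """Return contract violations; an empty list means the scene is valid."""
        names = {r.name for r in self.regions}
        errors = []
        for p in self.paths:
            for end in (p.source, p.target):
                if end not in names:
                    errors.append(f"path references unknown region: {end}")
        # Assumed rule: a scene should have exactly one focal region.
        focal = [r for r in self.regions if r.emphasis is Emphasis.FOCAL]
        if len(focal) != 1:
            errors.append(f"expected exactly one focal region, found {len(focal)}")
        return errors
```

A scene can pass validate() and still be the generic "nodes with pill labels" output described above. The contract is the text side of the handoff; the quality direction has to come from somewhere else.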

Why This Works (Even Though AI Doesn't "See")

The reason this works isn't that AI has developed taste. It's that vision models are trained on enormous numbers of images paired with descriptions, layouts, and contexts. When you hand a model a concept image, it can relate that image to the patterns it learned from all of that training data. It doesn't need to feel the mood; it can classify it. It doesn't need to experience the spatial hierarchy; it can detect it.

A concept image essentially lets you bypass the bottleneck of natural language for aesthetic communication. You're speaking to the model in a higher-bandwidth format for exactly the kind of information that text struggles to encode.

Think of it this way: explaining a melody in words is hard. Playing the melody is instant. Concept images are the equivalent of playing the melody for visual and spatial intent.

The Trap: Imitation vs. Calibration

There's one real danger with this approach, and it's worth naming explicitly: images can over-anchor the model into imitation. If you feed a concept image without the right framing, the AI will try to replicate it pixel by pixel rather than absorb the direction it represents.

The fix is simple but critical. When I hand off concept images, I always frame them as north-star references, not templates:

"Do not copy. Do not hardcode. Use these as quality calibration."

That single framing shift transforms the images from rigid blueprints into flexible taste vectors. The model absorbs the direction without being locked into the specifics. It generates in the spirit of the reference rather than in its likeness.

The Workflow

Here's how this looks in practice. This is the actual loop I've been running:

  • Write the constraints in text. Architecture decisions, data contracts, what to avoid. Text is still the best medium for this.
  • Generate 2-3 concept images for the aesthetic/spatial/experiential direction. These don't need to be polished; they just need to capture the feel.
  • Feed both to the coding agent with explicit framing: the text is the spec, the images are the quality north-star. Do not replicate, calibrate.
  • Iterate on images, not just text. When the output isn't right, instead of writing more paragraphs, generate a new concept image that corrects the direction. It's faster and more precise.

The fourth step is the key insight most people miss. We default to refining text prompts when the output is off. But if the problem is aesthetic (the model technically satisfied your spec but the result feels wrong), a new image will correct it faster than any amount of additional text.
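
The loop above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the Handoff class, the payload shape, and the "spec" / "framing" / "north_star" role labels are all hypothetical and do not correspond to any specific vendor's API. Only the framing sentence comes from my actual handoffs. You would adapt the payload to whatever multimodal format your coding agent accepts.

```python
import base64
from dataclasses import dataclass, field

# The framing sentence is the one quoted earlier; the rest is a hypothetical structure.
CALIBRATION_FRAMING = (
    "Do not copy. Do not hardcode. Use these as quality calibration."
)


@dataclass
class Handoff:
    """One iteration of the loop: text constraints plus concept images."""
    constraints: list[str]
    image_bytes: list[bytes] = field(default_factory=list)

    def to_payload(self) -> dict:
        """Build a generic, vendor-neutral payload (assumed shape)."""
        parts = [
            # The text is the spec.
            {"type": "text", "role": "spec", "text": "\n".join(self.constraints)},
            # The framing keeps the images from being treated as templates.
            {"type": "text", "role": "framing", "text": CALIBRATION_FRAMING},
        ]
        for raw in self.image_bytes:
            # The images are the quality north-star.
            parts.append({
                "type": "image",
                "role": "north_star",
                "data_base64": base64.b64encode(raw).decode("ascii"),
            })
        return {"parts": parts}

    def with_new_image(self, raw: bytes) -> "Handoff":
        """Correct the direction by adding an image instead of more prose."""
        return Handoff(self.constraints, self.image_bytes + [raw])
```

The design choice worth noticing is that with_new_image leaves the constraints untouched. When the output is off for aesthetic reasons, the spec stays stable and only the reference set changes.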

The Deeper Implication

This realization goes beyond a prompting trick. It's actually the same thesis behind what I'm building with Nocturnal, a macOS AI assistant built around spatial visualization instead of just a chat window.

The premise is that a visual can carry intent more efficiently than prose because it bundles relationships, priority, mood, structure, and direction at once. Text explains the system. An image shows the desired world. Chat interfaces force everything through a text bottleneck. Spatial interfaces let information exist in the form that best represents it.

What I stumbled into with concept images is the manual version of that idea: using image generation as a bridge between human vision and AI execution. The future is AI interfaces that make this the default mode, not a workaround.

* * *

If you're building with AI tools and you're only communicating through text, you're leaving a massive capability on the table. Generate a concept image. Frame it as a quality north-star. Watch the output snap into a completely different tier.

The image doesn't have to be photorealistic. It has to be directionally right. It can be wildly ambitious, futuristic, even beyond what you know how to build or describe yet, and that is the point. The model can translate that visual direction into architecture, structure, and implementation patterns. That's the whole trick, and it's more powerful than most people realize.