Character Consistent Image Generation: A Field Guide

Character consistent image generation is a set of techniques that make an AI model draw the same character, with the same face, hair, and clothing, across many separate images instead of inventing a new person every time. It's the difference between "a red-haired woman in a cafe" returning a stranger on each run and returning your red-haired woman, the one from frame one of your comic, your ad campaign, or your storyboard.

The quick version: there are two working methods, and they stack. First, upload reference images of your character. Nano Banana 2 accepts up to 4 character references, Nano Banana Pro up to 5, and Veo 3.1 carries up to 3 into video. Second, write a character sheet prompt, an exhaustive fixed description you paste into every generation. On BananaBanana a consistent character costs from $0.06 per 1K image on Nano Banana 2, and every demo in this post was generated on the platform, so the prompts here are copy-pasteable as-is.

I'll be honest up front about the failure modes too. Consistency in current models is good, not perfect. You'll see where it slips.

Why does the same character look different every time?

Image models don't have memory between requests. Each generation starts from noise, and your prompt is the only bridge. If the prompt says "a young woman with red hair," the model samples one of the millions of young women with red hair it can plausibly draw. Run it five times, get five people. They might share a vibe. They won't share a face.

This is called character drift, and it gets worse with vague prompts. "The same woman as before" does nothing, because there is no before. The model never saw your previous image unless you explicitly attach it.

Drift also creeps in within a single workflow. Change the camera angle and the jawline shifts. Change the outfit and suddenly the freckles are gone. In my experience the most fragile features are exactly the ones humans use to recognize people: eye shape, nose, the hairline. Backgrounds and clothing survive fine. Faces wander.

So the whole game is feeding identity back into the model on every single request, either as pixels (reference images) or as text (a locked description). Preferably both.

[Image: a row of AI-generated portrait frames where the same character gradually morphs into a different person, illustrating character drift in image generation]

How many reference images does each model accept?

Reference images are the stronger of the two methods: you upload photos of the character and the model treats them as ground truth. According to Google's Gemini API image docs, the limits differ sharply by model, and the differences are worth knowing before you pick one.

Model	Character refs	Object refs	Style refs	Price per 1K image
Nano Banana 2 Lite	none	up to 14	none	$0.03
Nano Banana 2	up to 4	up to 10	up to 3	$0.06
Nano Banana Pro	up to 5	up to 6	none	$0.11

The surprise in that table is Lite. It takes the most object references of the three (14), but the docs are explicit that they're object references only, with no character consistency support. We learned this the practical way: Lite happily accepts a face as input and then treats it like a product photo, matching the jacket and losing the person. If your workflow is character-driven, Lite is out, whatever the price says. (For everything else it's a genuinely good budget model; we wrote a separate guide on when Lite is enough.)
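If you script your generations, it can help to encode the table as data and check a request before you pay for it. The sketch below is only that: the model keys and the `check_refs` helper are names I made up for illustration, not part of any API, and the numbers are copied from the table above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RefLimits:
    character_refs: int
    object_refs: int
    style_refs: int


# Limits copied from the table above. The keys are hypothetical identifiers,
# not official model names.
LIMITS = {
    "nano-banana-2-lite": RefLimits(character_refs=0, object_refs=14, style_refs=0),
    "nano-banana-2": RefLimits(character_refs=4, object_refs=10, style_refs=3),
    "nano-banana-pro": RefLimits(character_refs=5, object_refs=6, style_refs=0),
}


def check_refs(model, character_refs=0, object_refs=0, style_refs=0):
    """Return a list of problems; an empty list means the request fits."""
    limits = LIMITS[model]
    requested = {
        "character_refs": character_refs,
        "object_refs": object_refs,
        "style_refs": style_refs,
    }
    problems = []
    for kind, count in requested.items():
        allowed = getattr(limits, kind)
        if count > allowed:
            problems.append(f"{model}: {count} {kind} requested, {allowed} allowed")
    return problems


# A face sent to Lite is flagged before it gets treated like a product photo.
print(check_refs("nano-banana-2-lite", character_refs=1))
```

This only checks each reference type against its own cap. Whether the caps also share a combined ceiling is something to confirm in the docs before relying on it.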

Between the other two: Nano Banana 2 at $0.06 is where I'd start. Pro's fifth character slot and stronger identity lock matter for multi-character scenes and final assets, and at $0.11 per image it's not much of a premium when the image actually ships somewhere.

Angle variety matters more than count. Three references from one angle teach the model one angle. A front shot, a profile, and a three-quarter view teach it a head.
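One way to apply the angle rule when you have more candidate photos than slots is to fill the slots with one image per angle first and only then allow repeats. This is a minimal sketch, assuming you have labeled each photo with an angle yourself; the `pick_varied` helper and the file names are hypothetical.

```python
def pick_varied(refs, limit):
    """Pick up to `limit` reference paths, preferring one per angle.

    `refs` is a list of (path, angle_label) pairs in order of preference.
    Images whose angle is already covered are used only if slots remain.
    """
    seen_angles = set()
    first_of_angle = []
    repeats = []
    for path, angle in refs:
        if angle in seen_angles:
            repeats.append(path)
        else:
            seen_angles.add(angle)
            first_of_angle.append(path)
    return (first_of_angle + repeats)[:limit]


candidates = [
    ("front_1.png", "front"),
    ("front_2.png", "front"),
    ("profile.png", "profile"),
    ("three_quarter.png", "three-quarter"),
]
print(pick_varied(candidates, limit=3))
# ['front_1.png', 'profile.png', 'three_quarter.png']
```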

[Image: stacks of character reference photos from different angles feeding into an AI image model, showing how reference images guide character consistent generation]

What is the character sheet method?

A character sheet is a fixed block of text describing your character exhaustively, which you paste verbatim into every prompt. It's the text-only fallback when you have no reference photos yet, and it's how you create the reference photos in the first place.

The rule: describe the character the way a police sketch artist would want it. Age, build, skin, hair color and cut, eye color, distinguishing marks, one or two anchor accessories. Vague adjectives drift; concrete nouns hold.

Here's the exact block used for the demos below:

a woman in her late 20s with shoulder-length copper-red wavy hair,
pale skin with faint freckles across the nose and cheekbones,
green eyes, a small silver hoop earring in her left ear,
wearing a mustard-yellow corduroy jacket over a white t-shirt

Both images below were generated with Nano Banana Pro from that same block. Only the scene text around it changed. No reference images were attached; this is pure text-anchored consistency:

[Image: AI generated character consistency demo, a red-haired woman in a mustard corduroy jacket reading by a cafe window, generated with Nano Banana Pro]

[Image: the same AI character, a red-haired woman in a mustard corduroy jacket, on a rooftop at golden hour, showing character consistent image generation with Nano Banana Pro]

Same person? Close. The face holds up well, the jacket and earring are locked, the freckles survived both scenes. If you pixel-peep you'll find the hair length varies slightly. That's the honest ceiling of text-only consistency, which is why the pro workflow combines methods: generate your best character image from the sheet, then feed that image back as a character reference for everything after. Text sets the identity, pixels enforce it.

Two smaller tips from running this in production. Anchor accessories (the earring, the jacket) do a lot of recognition work for very few tokens; give every character one. And keep the sheet under about 60 words, because past that the scene description starts fighting the character description for the model's attention.
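If you build prompts in code, the sheet method boils down to one constant and one function. Here's a rough sketch: the sheet is the block from above, while `build_prompt`, the comma join between sheet and scene, and the hard 60-word cutoff are my own illustrative choices (the guide only says "under about 60 words").

```python
# The character sheet from this guide, kept as one immutable constant so
# every prompt gets the exact same wording.
CHARACTER_SHEET = (
    "a woman in her late 20s with shoulder-length copper-red wavy hair, "
    "pale skin with faint freckles across the nose and cheekbones, "
    "green eyes, a small silver hoop earring in her left ear, "
    "wearing a mustard-yellow corduroy jacket over a white t-shirt"
)

# Assumption: treat the "about 60 words" guideline as a hard limit.
MAX_SHEET_WORDS = 60


def build_prompt(sheet, scene):
    """Combine the unchanged sheet with per-image scene text."""
    word_count = len(sheet.split())
    if word_count > MAX_SHEET_WORDS:
        raise ValueError(
            f"character sheet is {word_count} words; keep it under {MAX_SHEET_WORDS}"
        )
    return f"{sheet}, {scene}"


print(len(CHARACTER_SHEET.split()))  # 41
print(build_prompt(CHARACTER_SHEET, "reading by a cafe window, soft morning light"))
```

The point of the constant is that nobody retypes the description from memory. One paraphrased adjective is enough to start drift.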

Multi-character scenes and AI storyboards

Two consistent characters in one frame is where cheap tricks stop working. A merged text sheet for both characters tends to cross-contaminate: character A borrows B's hair, B inherits A's jacket. Probably fixable with enough reruns, but reruns cost money.

Reference images fix it properly. Nano Banana 2 takes up to 4 character references and Nano Banana Pro up to 5, per Google's docs, and those slots can be split across different people. Upload two or three shots of each character, then write the scene naming them by role ("the red-haired woman hands a coffee to the gray-bearded man"). Identity stays attached to the right person far more reliably, though hands passing objects between two people are still, in 2026, a lottery.
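Splitting slots between characters is easy to get lopsided by hand. A minimal sketch of a round-robin split is below, so that no character is starved when there are more photos than slots; `split_slots` and the file names are hypothetical, and the limits of 4 and 5 are the ones quoted above.

```python
def split_slots(refs_by_character, limit):
    """Share `limit` reference slots across characters, one image at a time.

    `refs_by_character` maps a character name to its reference paths in
    order of preference. Returns the same mapping, trimmed to fit.
    """
    queues = {name: list(paths) for name, paths in refs_by_character.items()}
    chosen = {name: [] for name in refs_by_character}
    remaining = limit
    while remaining > 0 and any(queues.values()):
        for name, queue in queues.items():
            if remaining == 0:
                break
            if queue:
                chosen[name].append(queue.pop(0))
                remaining -= 1
    return chosen


refs = {
    "red-haired woman": ["w_front.png", "w_profile.png", "w_three_quarter.png"],
    "gray-bearded man": ["m_front.png", "m_profile.png", "m_three_quarter.png"],
}
print(split_slots(refs, limit=4))  # two images each on Nano Banana 2
print(split_slots(refs, limit=5))  # three and two on Nano Banana Pro
```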

Storyboards are the natural next step, and there are two ways to do them. Frame by frame: one generation per panel, character refs attached to each, $0.06 a panel on Nano Banana 2, so a six-panel board runs $0.36. Or single-image: ask Nano Banana Pro for a multi-panel layout in one generation. The panel below was made that way, one $0.11 request, same character sheet as above:

[Image: four-panel AI storyboard generated in a single Nano Banana Pro request, the same red-haired character consistent across all panels: waking up, cycling, presenting, and on a rooftop]

Consistency inside a single image is essentially free, since the model draws all panels in one pass and can see its own work. The trade-off is resolution: in a four-panel layout, each panel gets a quarter of the frame. For pitch decks and animatics that's fine. For panels you'll ship individually, generate frame by frame and pay the $0.06 per panel.
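The storyboard arithmetic above is simple enough to do in your head, but here it is as a sketch in case you want to budget a longer board. Prices are the per-image figures quoted in this post, stored in cents to avoid floating-point rounding; the function names and model keys are made up for illustration.

```python
# Per-image prices from this post, in cents. Keys are hypothetical identifiers.
PRICE_CENTS = {"nano-banana-2": 6, "nano-banana-pro": 11}


def frame_by_frame_cents(panels, model="nano-banana-2"):
    """One generation per panel, each at the model's per-image price."""
    return panels * PRICE_CENTS[model]


def single_image_cents(model="nano-banana-pro"):
    """One multi-panel generation, regardless of how many panels it holds."""
    return PRICE_CENTS[model]


print(frame_by_frame_cents(6))  # 36 -> $0.36 for a six-panel board
print(frame_by_frame_cents(4))  # 24 -> $0.24 for a four-panel board
print(single_image_cents())     # 11 -> $0.11 for a single multi-panel image
```

Reruns aren't counted here, so treat these as floor prices rather than a full budget.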

Can you carry a character into video?

Yes, and this is where the pipeline gets fun. Veo 3.1 supports what Google's Vertex AI docs call asset reference images: up to 3 photos of a person, character, or product, whose appearance the model preserves in the generated clip. In our generator this is the Subject Reference section, and the same 3-image limit applies to Gemini Omni Flash if you want sound (flat $1.00 per clip; see our full Omni review).

The character sheet still earns its keep in video prompts. The clip below is Veo 3.1 Fast, text only, same description block as the images above, plus motion and camera language:

The prompt: the character sheet, then "walking toward the camera along a city street at golden hour, slow dolly back, medium shot, shallow depth of field." Six seconds, silent 720p, $0.52. She's recognizably the same character as the stills, which for a text-only handoff between two different models is honestly better than I expected.

A caveat from our generation logs: asset references sometimes flatten expressions. The face matches but plays a little waxy, especially in wide shots where it occupies few pixels. Close-ups fare better. If a clip comes back with a dead-eyed lead, reframe tighter rather than rerolling the same shot.

The full stack, then: character sheet → hero images on Nano Banana Pro → those images as character refs for every still and as asset refs for video. One identity, one pipeline, images from $0.06 and clips from $0.10 on the pricing page.

FAQ

Which AI model is best for character consistency?

Nano Banana Pro, by the numbers: up to 5 character reference images and the strongest identity lock in the family, at $0.11 per 1K or 2K image. Nano Banana 2 is the value pick at $0.06 with 4 character slots plus style references, which Pro lacks. Avoid Nano Banana 2 Lite for this use case; it doesn't support character references at all.

How do I keep the same face across AI images without reference photos?

Write a character sheet: a fixed 40–60 word description covering age, hair, eyes, skin, distinguishing marks, and one anchor accessory, and paste it unchanged into every prompt. Then use your best output as a reference image going forward. Text alone gets you roughly there; text plus image refs keeps you there.

Can I use a real person's photo as a character reference?

Technically the API accepts any photo. Legally and ethically, only use photos you have rights to and the person's consent for. Generating recognizable real people without consent violates most platforms' terms, ours included.

How many reference images should I upload?

Fewer, more varied images beat many similar ones. Three shots covering front, profile, and three-quarter angles usually outperform five near-identical selfies. The hard caps: 4 character images on Nano Banana 2, 5 on Nano Banana Pro, 3 on Veo 3.1 video.

Does character consistency work for non-human characters?

Mostly yes. Mascots, robots, and stylized animals often hold better than humans, because their identity lives in bold shapes and colors rather than subtle facial geometry. The same methods apply: distinctive fixed features in the sheet, reference images once you have a canonical design.