AI Image to Video: Turn Any Photo into a Moving Clip

AI image to video is a generation technique where a model takes one still picture, treats it as the first frame of a clip, and invents everything that happens next: motion, camera movement, lighting changes, sometimes sound. You supply a photo and a text description of what should move. The model returns a short video, usually 4 to 10 seconds, in which your exact image comes alive.

The short version, if that's all you need: upload a photo in the BananaBanana generator, describe the motion, wait 2–5 minutes. A silent 4-second clip from your photo starts at $0.10 on Veo 3.1 Lite, sound raises that to $0.18, and Gemini Omni Flash animates a photo with a full soundtrack for a flat $1.00. New accounts get $0.10 free, which covers exactly one silent test clip.

We run four video models in production, so this guide is written from actual generation logs rather than marketing pages. Including the failures. Some of those are more instructive than the successes.

How does AI image to video work?

Under the hood the still image is passed to the model as a conditioning frame. Google's Gemini API video docs describe Veo 3.1 as supporting "image-based direction" and "frame-specific generation," which in plain English means your picture steers the whole clip and you can pin exact frames where you want them. The model reads the scene, guesses at the physics (how the water in your photo would ripple, how hair would move in wind), and renders frame after frame forward from your starting point.

The practical consequence: the first frame of the output is your photo, nearly pixel for pixel. Everything after it is invented. Good prompts spend their words on the invention, not on re-describing what the model can already see.

This works startlingly well for scenes with obvious latent motion. A portrait breathes and blinks. Steam rises off coffee. A parked car pulls away. It works less well when the photo contains no plausible motion at all: feed it a flat product render on a white background with no motion cue in the prompt and you tend to get a slow, slightly nervous zoom. Not broken, just boring.

[Image: A vintage photograph held in two hands with the scene inside it starting to ripple and move, illustrating AI image to video animation]

Which model turns a photo into video best?

All four video models on the platform accept an image as the first frame. They differ in ceilings and in what you pay.

Feature	Veo 3.1	Veo 3.1 Fast	Veo 3.1 Lite	Gemini Omni Flash
Price (4s, silent)	$0.70	$0.35	$0.10	—
Price (4s, with audio)	$1.50	$0.50	$0.18	$1.00 flat, any length
Max resolution	4K	4K	1080p	720p only
Duration	exact 4/6/7/8 s	exact 4/6/7/8 s	exact 4/6/7/8 s	3–10 s, model decides
Audio	optional toggle	optional toggle	optional toggle	always on
Last frame + loops	yes	yes	yes	no
Extend to 148 s	yes	yes	no	no, conversational editing instead

My default is Veo 3.1 Fast. It keeps the useful controls (exact duration, an optional last frame, extension) at half the flagship's silent price and a third of its price with audio, and for social-sized output the quality gap against full Veo 3.1 rarely shows. Lite at $0.10 is where I test whether an idea moves at all before spending real money on it. Full Veo 3.1 earns its price when the clip involves water, fabric, or collisions, or when the delivery spec says 4K. Omni Flash is the pick when the sound matters as much as the picture; we took it apart in a separate Omni Flash API review if you want the details, including the 720p cap, which honestly stings for anything full-screen.
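If you would rather let the table choose for you, here is a minimal Python sketch of that decision. The prices and capabilities are copied from the table above; the `VideoModel` class and the `cheapest_model` function are illustrative names of ours, not part of any API. It only covers 4-second clips, since those are the only prices listed, and it treats Omni Flash's flat $1.00 as its price even though that model picks its own length.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class VideoModel:
    name: str
    silent_4s: Optional[float]  # USD, 4 s silent clip; None when audio is always on
    audio_4s: float             # USD, 4 s clip with audio
    max_height: int             # 2160 = 4K, 1080 = 1080p, 720 = 720p
    last_frame: bool            # supports last frame and loops
    extension: bool             # supports extending up to 148 s


# Values taken from the comparison table in this article.
MODELS = [
    VideoModel("Veo 3.1", 0.70, 1.50, 2160, True, True),
    VideoModel("Veo 3.1 Fast", 0.35, 0.50, 2160, True, True),
    VideoModel("Veo 3.1 Lite", 0.10, 0.18, 1080, True, False),
    VideoModel("Gemini Omni Flash", None, 1.00, 720, False, False),
]


def cheapest_model(audio: bool = False, min_height: int = 720,
                   last_frame: bool = False,
                   extension: bool = False) -> Optional[Tuple[float, str]]:
    """Return (price, name) of the cheapest model meeting the requirements."""
    candidates = []
    for m in MODELS:
        if m.max_height < min_height:
            continue
        if last_frame and not m.last_frame:
            continue
        if extension and not m.extension:
            continue
        # A model without a silent tier is billed at its audio price.
        use_audio_price = audio or m.silent_4s is None
        price = m.audio_4s if use_audio_price else m.silent_4s
        candidates.append((price, m.name))
    return min(candidates) if candidates else None


print(cheapest_model())                 # cheapest silent draft
print(cheapest_model(min_height=2160))  # cheapest model that reaches 4K
```

A lookup like this says nothing about quality, which is why the paragraph above still matters: the cheapest model that ticks the boxes is not always the one that handles water or fabric well.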

The clip above is a Veo 3.1 Fast generation from our production pipeline, made for this article: a framed photograph of a sailboat on a desk, and the prompt asks the sea inside the frame to start moving while the camera slowly pushes in. One take, no editing, silent by choice (Fast's silent tier is what we use for drafts). That "photo waking up" effect is essentially what the model does to your own uploads.

How to generate video from a photo, step by step

The whole flow on the generator takes about a minute of your time plus render time.

1. Switch the generator to video mode and pick a model. For a first try, Veo 3.1 Lite at $0.10 is the cheap way to learn.
2. Upload your photo as the first frame. JPG or PNG, and it gets cropped to the video's aspect ratio, so check the edges before you generate.
3. Pick 16:9 or 9:16. Vertical works fine for TikTok and Reels; the model handles portrait framing without complaint.
4. Write the motion prompt (next section) and, optionally, a negative prompt of up to 1,000 characters for things you don't want.
5. Choose duration and whether you want audio. On Veo models these are exact settings; Omni Flash decides length on its own.
6. Generate. Videos take roughly 2–5 minutes. You can close the tab; results wait in your history for 30 days.

Balance is only charged when a generation succeeds. Failed renders refund automatically, which matters more in video than in images because you'll iterate.
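If you script your generations, it can help to check the settings from the steps above before anything is submitted. The sketch below is a hypothetical pre-flight check in Python: `VideoJob`, its field names, and `check_job` are made up for illustration and do not mirror any real request schema. Only the limits themselves (JPG or PNG, 16:9 or 9:16, 1,000-character negative prompt, exact Veo durations) come from this guide.

```python
from dataclasses import dataclass
from typing import List, Optional

VEO_MODELS = {"Veo 3.1", "Veo 3.1 Fast", "Veo 3.1 Lite"}
OMNI = "Gemini Omni Flash"
VEO_DURATIONS = {4, 6, 7, 8}            # exact durations listed in the table
ASPECT_RATIOS = {"16:9", "9:16"}
MAX_NEGATIVE_PROMPT_CHARS = 1000
IMAGE_SUFFIXES = (".jpg", ".jpeg", ".png")


@dataclass
class VideoJob:
    # Hypothetical field names, chosen for readability only.
    model: str
    first_frame: str                    # path to the starting photo
    prompt: str                         # the motion prompt
    aspect_ratio: str = "16:9"
    duration_s: Optional[int] = None    # leave unset for Omni Flash
    audio: bool = False
    negative_prompt: str = ""


def check_job(job: VideoJob) -> List[str]:
    """Return a list of problems; an empty list means the settings look sane."""
    problems = []
    if job.model not in VEO_MODELS and job.model != OMNI:
        problems.append(f"unknown model: {job.model}")
    if not job.first_frame.lower().endswith(IMAGE_SUFFIXES):
        problems.append("first frame should be a JPG or PNG")
    if not job.prompt.strip():
        problems.append("motion prompt is empty")
    if job.aspect_ratio not in ASPECT_RATIOS:
        problems.append("aspect ratio must be 16:9 or 9:16")
    if len(job.negative_prompt) > MAX_NEGATIVE_PROMPT_CHARS:
        problems.append("negative prompt exceeds 1,000 characters")
    if job.model in VEO_MODELS and job.duration_s not in VEO_DURATIONS:
        problems.append("Veo duration must be 4, 6, 7, or 8 seconds")
    if job.model == OMNI and job.duration_s is not None:
        problems.append("Omni Flash picks its own length; leave duration unset")
    return problems


job = VideoJob("Veo 3.1 Lite", "sailboat.jpg",
               "the sea is rolling, slow push in", duration_s=4)
print(check_job(job))
```

Failed renders are refunded, so a check like this does not save money. It saves the 2–5 minutes you would otherwise wait to learn that a setting was off.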

[Image: A three-panel storyboard showing a hand uploading a photo, a settings panel, and a finished video player, illustrating the image to video workflow]

How do you write motion prompts for image to video?

Rule one: describe the motion, not the scene. The scene is already in the photo. "A woman in a red coat stands on a bridge" wastes the prompt on facts the model has. "She turns toward the camera and smiles as wind lifts her hair, slow dolly-in" is the same length and all signal.

Camera words do a lot of work with Veo. Dolly in, orbit, handheld, static tripod shot, slow pan left. The models were clearly trained on footage descriptions, and they respond to film vocabulary more reliably than to loose phrasing like "make it dynamic." I'd probably avoid "dynamic" altogether. Nobody agrees on what it means, including the model.

One lesson we learned the expensive way while producing demo clips for this blog: these models keep whatever is already happening in the first frame and quietly drop actions scheduled to start later. We once asked Omni Flash for a pancake flip "around the middle of the clip" and got ten seconds of a pancake just sitting there. Three times in a row, a dollar per take. Rewriting the action as already in progress fixed it on the first try. So: "syrup is pouring onto the stack" beats "syrup starts pouring after two seconds," and for image to video specifically, pick a starting photo where the action you want is at least plausible mid-motion.

A few patterns that keep earning their keep:

Portraits: "subtle breathing, a slow blink, then a small smile; camera locked" reads as alive without the uncanny warp big movements cause.
Products: "slow 180-degree orbit around the object, studio lighting stays constant" is the workhorse. Keep the background simple in the source photo.
Landscapes: name the moving elements ("clouds drift right, grass sways, light shifts warmer") or you'll get the nervous-zoom fallback.
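The rules in this section are simple enough to turn into a rough prompt lint. The Python sketch below is a heuristic of ours, not anything the models expose: the phrase lists are assumptions drawn from the examples in this guide, and a clean result does not guarantee a good clip. It flags actions scheduled for later, the word "dynamic", and prompts with no camera direction at all.

```python
import re
from typing import List

# Phrases that schedule an action for later instead of describing it in progress.
DEFERRED_PATTERNS = [
    r"\bafter \w+ seconds?\b",
    r"\baround the middle\b",
    r"\bhalfway through\b",
    r"\blater in the clip\b",
]
VAGUE_PATTERNS = [r"\bdynamic\b"]
# Word boundaries matter: a plain substring test would find "pan" in "pancake".
CAMERA_PATTERN = r"\b(camera|dolly|orbit|handheld|tripod|pan|zoom|push)\b"


def lint_motion_prompt(prompt: str) -> List[str]:
    """Return warnings for a motion prompt; an empty list means nothing was flagged."""
    text = prompt.lower()
    warnings = []
    if any(re.search(p, text) for p in DEFERRED_PATTERNS):
        warnings.append("deferred action: describe it as already in progress")
    if any(re.search(p, text) for p in VAGUE_PATTERNS):
        warnings.append("vague wording: name the motion instead")
    if not re.search(CAMERA_PATTERN, text):
        warnings.append("no camera direction: consider 'slow dolly-in' or 'camera locked'")
    return warnings


for p in ("syrup starts pouring after two seconds",
          "syrup is pouring onto the stack, camera locked"):
    print(p, "->", lint_motion_prompt(p))
```

The camera warning is only a nudge. A landscape prompt that names its moving elements can work without one, so treat that line as a reminder rather than an error.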

[Image: A director's clapperboard next to handwritten prompt notes with arrows sketching camera moves over a photograph, illustrating motion prompt writing]

Loops, last frames, and 148-second videos

Veo 3.1 models take an optional last frame as well as the first. Give it two different photos and it renders a transition between them, which is how you morph a day shot into a night shot or a sketch into a finished product. Give it the same photo for both and you get a clip that ends where it started: a clean loop, ready for a website background or a cinemagraph-style post. This is my favorite underused feature on the platform.

Extension is the other Veo trick. A finished clip can be continued by 7 seconds with a new prompt, and the chain can run up to 148 seconds total with full visual continuity (Veo 3.1 and Fast only; Lite and Omni Flash sit this one out). Starting a long chain from a photo you own is the closest thing to storyboard-free filmmaking the current APIs offer. It compounds cost too, so budget before you fall in love with the idea.
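The arithmetic behind that ceiling is worth a quick look before you plan a long chain. The Python sketch below assumes only what is stated above: every extension adds exactly 7 seconds and the total may not exceed 148 seconds. The function names are ours, and it deliberately says nothing about cost, because per-extension pricing is not covered in this guide.

```python
EXTENSION_STEP_S = 7   # each extension continues the clip by 7 seconds
MAX_CHAIN_S = 148      # ceiling for the whole chain


def max_extensions(initial_s: int) -> int:
    """How many 7-second extensions fit on top of an initial clip."""
    return (MAX_CHAIN_S - initial_s) // EXTENSION_STEP_S


def chain_length(initial_s: int, extensions: int) -> int:
    """Total seconds after a number of extensions; rejects chains past the cap."""
    total = initial_s + extensions * EXTENSION_STEP_S
    if total > MAX_CHAIN_S:
        raise ValueError(f"{total} s exceeds the {MAX_CHAIN_S} s limit")
    return total


for start in (4, 6, 7, 8):
    n = max_extensions(start)
    print(f"{start} s start: {n} extensions, {chain_length(start, n)} s total")
```

Only an 8-second opening clip lands exactly on 148 seconds, after 20 extensions. Shorter openings still fit 20 extensions but stop a few seconds short of the cap.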

Omni Flash replaces extension with conversational editing: describe a change to the finished clip ("make it dusk, keep everything else the same") and pay another dollar for the revision. Different philosophy. Editing refines one moment; extension grows a timeline.

[Image: A strip of film frames bent into a perfect circle on a pastel background, illustrating a looped AI video that ends where it began]

FAQ

Can I turn a photo into a video for free?

What is the cheapest image to video API option?

Can the generated video include sound?

Can I animate a vertical photo for TikTok or Reels?

How long can an AI video made from a photo be?