AI search is multimodal, but it still grounds answers in text. So the way to optimize images and video is to attach a clean text layer to every visual asset: descriptive alt text and real captions for images, full transcripts and caption tracks for video, accurate titles and descriptions, and ImageObject or VideoObject schema. YouTube and other multimodal sources are heavily favored in Google AI Overviews and AI Mode, with video making up about 23.3% of AI Overviews citations, so a transcribed, well-described video is one of the highest-leverage assets you can publish.
Why does AI search care about images and video at all?
Because the answer surfaces changed. Google AI Overviews and AI Mode are multimodal: they pull thumbnails, diagrams, and video clips directly into the synthesized answer, and they favor sources that come with rich media. Video alone makes up roughly 23.3% of AI Overviews citations, and YouTube is one of the most-cited domains across those answers. A page that explains a concept in words and shows it in a captioned image or a transcribed video gives the model two ways to understand and credit you instead of one.
Here is the catch that trips most people up: the model is not really "watching" your video or "looking" at your photo the way a person does. It leans on the text attached to that media. So the goal is not to make prettier visuals; it is to make your visuals legible by wrapping each one in metadata a language model can read. That single idea drives everything below.
How do I optimize images so AI can read them?
Start with the four pieces of text every meaningful image needs. First, descriptive alt text written for a human, not stuffed with keywords: describe what the image actually shows. Second, a real caption placed near the image that states the fact the image proves. Third, a descriptive file name like concentrate-dilution-ratio.png instead of IMG_4821.png. Fourth, surrounding text that explains the same point in words, because the paragraph next to an image is part of how the model interprets it.
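As a minimal sketch of those four pieces working together, the helper below renders an image with its alt text and a visible caption. The function name, the /img/ path, and the sample alt and caption wording are hypothetical placeholders; only the file name comes from the example above.

```python
from html import escape


def render_figure(src: str, alt: str, caption: str) -> str:
    """Return a <figure> that pairs an image with alt text and a visible caption."""
    if not alt.strip():
        # Assumption: this helper is only used for images that carry meaning,
        # so empty alt text is treated as a mistake rather than a choice.
        raise ValueError("meaningful images need non-empty alt text")
    return (
        "<figure>\n"
        f'  <img src="{escape(src, quote=True)}" alt="{escape(alt, quote=True)}">\n'
        f"  <figcaption>{escape(caption)}</figcaption>\n"
        "</figure>"
    )


if __name__ == "__main__":
    # Hypothetical copy: describe what the image shows, then state the fact it proves.
    print(
        render_figure(
            src="/img/concentrate-dilution-ratio.png",
            alt="Measuring cup showing one part concentrate mixed with four parts water",
            caption="A 1:4 dilution: one part concentrate to four parts water.",
        )
    )
```

The surrounding paragraph still has to be written by hand; the helper only guarantees that no meaningful image ships without its alt text and caption.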
For images that carry a key fact (a chart, a product shot, a diagram), add ImageObject schema with a caption and content URL so the asset is described in structured data too. Compress the file so it loads fast, since a slow or uncrawlable image is invisible. The pattern mirrors how AI reads everything else on your page; if you want the underlying mechanics, our guide "Schema markup, the language AI actually reads" covers the structured-data side in depth.
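One way to produce that structured data is to generate the JSON-LD from the same fields you already wrote for the image. This is a sketch, not a required implementation: the function name, the example.com URL, and the sample text are assumptions, while contentUrl, name, caption, and description are standard schema.org ImageObject properties.

```python
import json
from typing import Optional


def image_object_jsonld(
    content_url: str, name: str, caption: str, description: Optional[str] = None
) -> str:
    """Build a JSON-LD <script> tag describing one image as a schema.org ImageObject."""
    data = {
        "@context": "https://schema.org",
        "@type": "ImageObject",
        "contentUrl": content_url,
        "name": name,
        "caption": caption,
    }
    if description:
        data["description"] = description
    # Escape "</" so the payload cannot close the <script> element early.
    payload = json.dumps(data, indent=2).replace("</", "<\\/")
    return f'<script type="application/ld+json">\n{payload}\n</script>'


if __name__ == "__main__":
    # Hypothetical values for illustration only.
    print(
        image_object_jsonld(
            content_url="https://example.com/img/concentrate-dilution-ratio.png",
            name="Concentrate dilution ratio",
            caption="A 1:4 dilution: one part concentrate to four parts water.",
        )
    )
```

Keeping the caption in the schema identical to the visible caption avoids handing the engine two slightly different descriptions of the same image.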
What makes a video legible to a multimodal model?
The highest-value thing you can attach to a video is a full transcript. A transcript turns minutes of speech into extractable text, the exact format a model can lift a sentence from and cite. Right behind it is an accurate caption track (real captions, not the rough auto-generated kind), which both helps viewers and gives the engine a time-aligned text version of the content.
Then come the basics that are easy to skip: a clear, descriptive title that names the question the video answers, a description that summarizes the content in plain language with the key points up top, and timestamps or chapters so individual segments are addressable. Each of those is text the model reads to decide what your video is about and which span of it answers a given query.
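If you already have corrected, timed captions, one source can feed both the caption track and the on-page transcript. The sketch below assumes cues arrive as (start seconds, end seconds, text) tuples, which is a made-up input shape, and it writes standard WebVTT; the sample cue text is illustrative.

```python
from typing import List, Tuple

Cue = Tuple[float, float, str]  # (start seconds, end seconds, spoken text)


def vtt_timestamp(seconds: float) -> str:
    """Format seconds as the HH:MM:SS.mmm timestamp WebVTT expects."""
    ms = round(seconds * 1000)
    hours, ms = divmod(ms, 3_600_000)
    minutes, ms = divmod(ms, 60_000)
    secs, ms = divmod(ms, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{ms:03d}"


def to_webvtt(cues: List[Cue]) -> str:
    """Render cues as a WebVTT caption track."""
    blocks = [
        f"{vtt_timestamp(start)} --> {vtt_timestamp(end)}\n{text.strip()}"
        for start, end, text in cues
    ]
    return "WEBVTT\n\n" + "\n\n".join(blocks) + "\n"


def to_transcript(cues: List[Cue]) -> str:
    """Join the cue text into one plain-text transcript for the page."""
    return " ".join(text.strip() for _, _, text in cues)


if __name__ == "__main__":
    # Hypothetical cues for illustration.
    cues = [
        (0.0, 4.0, "Today we answer one question: how much water do you add?"),
        (4.0, 9.5, "Use one part concentrate to four parts water."),
    ]
    print(to_webvtt(cues))
    print(to_transcript(cues))
```

Generating both outputs from the same corrected cues keeps the caption track and the transcript from drifting apart.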
This is also why publishing on YouTube specifically pays off: it is one of the domains AI Overviews and AI Mode cite most, and it hands you transcripts, captions, chapters, and descriptions as native features. We wrote a full playbook on that leverage in "Use YouTube to win AI Overviews."
Do I need video schema, and how much does it help?
Yes, for any video you want surfaced. VideoObject schema tells search and AI engines the video's name, description, thumbnail, upload date, duration, and, critically, its transcript and content URL. It removes ambiguity: instead of inferring what your video is, the engine is handed a structured description of it. That makes the video eligible for richer treatment and easier for a multimodal answer to pull.
For self-hosted video, add VideoObject directly. For embedded YouTube, the platform carries much of this for you, which is one more reason to host there and embed back. Pair the schema with the on-page text (the transcript visible on the page and a heading that asks the question the video answers), and you have given the model three independent signals pointing at the same fact. Three signals beat one every time, and AI grounds its answers by reconciling many signals at once, the same way it does for text, as we cover in "How to rank in Google AI Overviews."
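For a self-hosted video, the markup can be assembled from the fields listed above. Treat this as a sketch under stated assumptions: the function names, example.com URLs, and sample values are hypothetical, while name, description, thumbnailUrl, uploadDate, duration, contentUrl, and transcript are schema.org VideoObject properties, with duration written as an ISO 8601 string.

```python
import json
from datetime import date


def iso8601_duration(total_seconds: int) -> str:
    """Convert a length in seconds to an ISO 8601 duration such as PT12M34S."""
    hours, rem = divmod(int(total_seconds), 3600)
    minutes, seconds = divmod(rem, 60)
    out = "PT"
    if hours:
        out += f"{hours}H"
    if minutes:
        out += f"{minutes}M"
    if seconds or not (hours or minutes):
        out += f"{seconds}S"
    return out


def video_object(name, description, thumbnail_url, content_url,
                 uploaded: date, length_seconds: int, transcript: str) -> dict:
    """Return a schema.org VideoObject as a plain dict, ready for json.dumps."""
    return {
        "@context": "https://schema.org",
        "@type": "VideoObject",
        "name": name,
        "description": description,
        "thumbnailUrl": thumbnail_url,
        "contentUrl": content_url,
        "uploadDate": uploaded.isoformat(),
        "duration": iso8601_duration(length_seconds),
        "transcript": transcript,
    }


if __name__ == "__main__":
    # Hypothetical values for illustration only.
    data = video_object(
        name="How much water do you add to concentrate?",
        description="A short walkthrough of the 1:4 dilution ratio.",
        thumbnail_url="https://example.com/img/dilution-thumb.jpg",
        content_url="https://example.com/video/dilution.mp4",
        uploaded=date(2025, 1, 15),
        length_seconds=754,
        transcript="Use one part concentrate to four parts water.",
    )
    print(json.dumps(data, indent=2))
```

The printed JSON goes inside a script tag of type application/ld+json on the page that also shows the transcript, so the schema and the visible text agree.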
How does AI use my media to actually cite me?
A citation happens when the engine can connect a span of its answer to a specific source and credit that source. Media earns that connection through its text layer. When AI Overviews shows a video card or a thumbnail next to a claim, it chose that asset because the surrounding metadata (the transcript, the caption, the description) matched the sub-question it was answering. No readable text, no match, no card.
So the workflow is concrete. Pick the buyer questions you want to win. For each, decide whether an image or a video answers it best. Produce the asset, then layer on the text: alt and caption for an image, transcript and schema for a video, plus surrounding copy that states the answer in words. That is the whole loop, media to metadata to multimodal AI to citation, and it is the same answer-first discipline we apply to written pages, which we map out in the GEO content brief.
What mistakes quietly keep your media out of AI answers?
A handful of avoidable errors do most of the damage. Empty or decorative alt text on images that actually carry meaning. No transcript on video, which leaves the model with almost nothing to read. Auto-captions left uncorrected, so the text layer is full of errors the engine then trusts. Generic file names and titles that describe nothing. And blocked media: images or video the crawler cannot reach because of robots rules or lazy-loading that never resolves for bots.
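Several of these mistakes can be caught mechanically. The sketch below is a rough audit using only the Python standard library. The class and function names, the generic-name pattern, and the sample markup are assumptions, and it cannot tell a decorative image from a meaningful one, so treat its output as a list to review rather than a verdict.

```python
import re
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser

# Assumed pattern for camera-style or placeholder names such as IMG_4821.png.
GENERIC_NAME = re.compile(r"^(img|dsc|image|photo|screenshot)[-_ ]?\d*\.\w+$", re.I)


class MediaAudit(HTMLParser):
    """Collect images without alt text or with generic file names, and videos without a caption track."""

    def __init__(self):
        super().__init__()
        self.issues = []
        self._video_src = None
        self._video_has_track = False

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "img":
            src = a.get("src") or ""
            if not (a.get("alt") or "").strip():
                self.issues.append(f"missing alt text: {src}")
            if GENERIC_NAME.match(src.rsplit("/", 1)[-1]):
                self.issues.append(f"generic file name: {src}")
        elif tag == "video":
            self._video_src = a.get("src") or "(inline video)"
            self._video_has_track = False
        elif tag == "track" and self._video_src is not None:
            if a.get("kind") in ("captions", "subtitles"):
                self._video_has_track = True

    def handle_endtag(self, tag):
        if tag == "video" and self._video_src is not None:
            if not self._video_has_track:
                self.issues.append(f"video without caption track: {self._video_src}")
            self._video_src = None


def audit_media(html: str) -> list:
    """Return a list of human-readable issues found in the markup."""
    parser = MediaAudit()
    parser.feed(html)
    parser.close()
    return parser.issues


def blocked_by_robots(robots_txt: str, user_agent: str, url: str) -> bool:
    """Check a media URL against robots.txt rules supplied as text."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    return not rules.can_fetch(user_agent, url)


if __name__ == "__main__":
    # Hypothetical page fragment and robots rules.
    page = '<img src="/img/IMG_4821.png"><video src="/v/demo.mp4"></video>'
    for issue in audit_media(page):
        print(issue)
    print(blocked_by_robots("User-agent: *\nDisallow: /media/", "Googlebot",
                            "https://example.com/media/chart.png"))
```

The audit does not cover uncorrected auto-captions or lazy-loading that never resolves for bots; those still need a manual check of the caption text and of the page as a crawler receives it.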
The fix for all of them is the same discipline: treat every visual asset as if it ships with a paragraph of text, because to a model, it does. We run one visibility engine across more than 10 brands, and media optimization is just another lane on that engine, not a separate project. When the text layer is clean, images and video stop being dead weight the AI skips and start being the assets that get you surfaced.
Questions people ask
Give every meaningful image descriptive alt text written for a human, a real caption near the image that states what it shows, a descriptive file name, and a place in surrounding text that explains the same fact in words. AI search reads images as part of the page, so the words around an image and the words attached to it are what let a model understand and credit it. Add ImageObject schema for key images.
Video does earn AI citations. Google AI Overviews and AI Mode lean heavily on multimodal sources, and YouTube is one of the most cited domains in those answers, with video making up roughly 23.3% of AI Overviews citations. A video with a full transcript, an accurate caption track, a clear title and description, and VideoObject schema gives the model text it can lift, which is how a video earns a citation rather than just a view.
Transcripts and captions are the highest-value metadata because they turn audio and motion into extractable text. After that, descriptive alt text and captions for images, accurate titles and descriptions, and ImageObject or VideoObject schema. The pattern is consistent: every visual asset needs a text layer the model can read, because models still ground answers mostly in words.
Want this done for you?
Want to know if your images and video are pulling their weight in AI answers? Start with an AI visibility audit.