What are VLM AI Models and What is FastVLM?
Vision Language Models (VLMs) are AI systems that can understand both images and text at the same time. This allows them to do tasks like:
- Answer questions about photos
- Analyze charts
- Read text within images
If you’ve ever tried uploading an image to ChatGPT and asking questions about it — you’ve used a VLM.
Historically, these models have faced a massive problem: processing high-resolution images takes a lot of computing power and causes major delays.
Getting back to the same example with ChatGPT — it can take a while for it to reply about the image, right?
Also, traditional vision encoders, especially Vision Transformers (ViTs), scale poorly to higher image resolutions.
Higher resolution images generate more visual tokens, which are the basic units of information that the AI processes.
We all know what happens when we feed a very large document to ChatGPT, and this works the same way: more visual tokens mean the language model takes longer to start generating a response. This delay is called the time-to-first-token (TTFT), and it has been a major obstacle keeping VLMs from being useful in many situations.
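To make that concrete, here is a quick back-of-the-envelope calculation. A ViT-style encoder cuts the image into fixed-size patches and turns each patch into one visual token, so the token count grows with the square of the resolution. The 14-pixel patch size below is a typical ViT setting, used purely for illustration:

```python
# Rough illustration: visual tokens produced by a ViT-style encoder that
# splits the image into fixed-size patches (patch size 14 is a common choice).
def visual_token_count(height: int, width: int, patch_size: int = 14) -> int:
    return (height // patch_size) * (width // patch_size)

print(visual_token_count(336, 336))    # 576 tokens
print(visual_token_count(1152, 1152))  # 6724 tokens -- roughly 12x more work before the first word appears
```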
What Apple Did
Apple's research team solved both the lag problem and the accuracy problem by creating FastViTHD, a novel hybrid vision encoder at the heart of FastVLM. Now, this is going to get a bit technical, so please bear with me.
FastViTHD is an encoder with a five-stage design that combines different processing techniques to work as efficiently as possible:
- The first three stages use RepMixer blocks to process the data quickly.
- The last two stages use multi-headed self-attention mechanisms.
This sounds very complex, but essentially, Apple created their own DLSS for the world of AI image models.
Their solution also has a special design that lets the encoder generate four times fewer visual tokens for the language model to process, which speeds up the entire pipeline.
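To make the idea a little less abstract, here is a heavily simplified PyTorch sketch of the same pattern: cheap convolutional mixing in the early stages, self-attention only on a coarser grid in the late stages, and an extra downsampling step so far fewer tokens reach the language model. The layer choices and sizes below are illustrative stand-ins, not Apple's actual FastViTHD code.

```python
# Toy hybrid vision encoder: convolutional mixing early, attention late,
# with an extra downsample so fewer visual tokens reach the language model.
# All names and sizes here are made up for illustration.
import torch
import torch.nn as nn

class ConvMixingStage(nn.Module):
    """Stand-in for a RepMixer-style stage: cheap convolutional token mixing."""
    def __init__(self, dim):
        super().__init__()
        self.mix = nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim)  # depthwise mixing
        self.mlp = nn.Conv2d(dim, dim, kernel_size=1)                         # pointwise channel mix

    def forward(self, x):
        return x + self.mlp(self.mix(x))

class AttentionStage(nn.Module):
    """Stand-in for a multi-headed self-attention stage on a coarse token grid."""
    def __init__(self, dim, heads=8):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):                                # x: (B, C, H, W)
        b, c, h, w = x.shape
        seq = x.flatten(2).transpose(1, 2)               # (B, H*W, C)
        normed = self.norm(seq)
        seq = seq + self.attn(normed, normed, normed)[0]
        return seq.transpose(1, 2).reshape(b, c, h, w)

class ToyHybridEncoder(nn.Module):
    """Five stages: three convolutional, then a downsample, then two attention."""
    def __init__(self, dim=64):
        super().__init__()
        self.stem = nn.Conv2d(3, dim, kernel_size=4, stride=4)           # patchify the image
        self.conv_stages = nn.ModuleList([ConvMixingStage(dim) for _ in range(3)])
        self.downsample = nn.Conv2d(dim, dim, kernel_size=2, stride=2)   # halve H and W -> 4x fewer tokens
        self.attn_stages = nn.ModuleList([AttentionStage(dim) for _ in range(2)])

    def forward(self, image):
        x = self.stem(image)
        for stage in self.conv_stages:                   # early stages: fast conv mixing
            x = stage(x)
        x = self.downsample(x)                           # shrink the grid before attention
        for stage in self.attn_stages:                   # late stages: self-attention
            x = stage(x)
        return x.flatten(2).transpose(1, 2)              # visual tokens handed to the LLM

tokens = ToyHybridEncoder()(torch.randn(1, 3, 256, 256))
print(tokens.shape)  # torch.Size([1, 1024, 64]) -- 4x fewer tokens than the 4096 without the downsample
```

The real FastViTHD differs in many details, but the shape of the idea is the same: save the expensive attention for the last, smallest feature maps.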
When this model is used on consumer devices, it could let you point a camera at the ingredients on your kitchen table and get live cooking suggestions as the AI processes the scene in real time.
FastVLM Benchmarks
The performance improvements delivered by FastVLM are remarkable across multiple benchmarks:
- FastVLM's time-to-first-token (TTFT) is 3.2 times faster than that of the commonly used LLaVA-1.5 setup.
- When compared to LLaVA-OneVision at its maximum resolution (1152 x 1152), FastVLM is just as accurate but 85 times faster.
- The vision encoder is 3.4 times smaller than similar systems, making it more practical for use on devices that don't have a lot of resources.
In tests against other recent models, FastVLM consistently comes out ahead. Compared to ConvLLaVA, which used the same language model and training data, FastVLM scored 8.4% higher on TextVQA and 12.5% higher on DocVQA while running 22% faster. The advantage grows at higher resolutions, where FastVLM processes images twice as fast as ConvLLaVA.
Compared to MM1, another well-known vision-language model, FastVLM performs just as well, if not better, across various tests while producing five times fewer visual tokens. It also outperforms Cambrian-1 while running 7.9 times faster.
Why This Matters
FastVLM is a major step forward in making vision-language AI useful for real-world applications. The big decrease in processing time and computational needs creates new opportunities for:
- Running on mobile devices: The smaller model size and faster processing make it possible to use advanced vision-language AI on smartphones and tablets.
- Working in real-time: The 85x speed improvement allows for real-time document analysis, visual question answering, and image understanding.
- Improved experiences: AI assistants respond faster, making them more natural and responsive when talking about images.
FastVLM performs so well because it intelligently balances image resolution, processing speed, token count, and model size. Whereas older models relied on complicated token-pruning strategies or processed multiple lower-resolution tiles, FastVLM achieves this balance simply by scaling the input image appropriately.
Where to Access FastVLM
Apple has made FastVLM publicly available on the Hugging Face Model Hub. You can find the 0.5-billion-parameter version under the name "apple/FastVLM-0.5B".
To run FastVLM, you'll need:
- PyTorch deep learning framework
- Hugging Face Transformers library (version 4.37.0 or later recommended)
- PIL (Python Imaging Library) for image processing
- CUDA-capable GPU for optimal performance (though CPU inference is possible with adjusted settings)
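As a starting point, here is a minimal loading sketch. It assumes the checkpoint can be loaded through Transformers' auto classes with trust_remote_code=True (the repository ships its own model code); the exact prompt format and image preprocessing are defined by the model card, so follow it for the full inference call.

```python
# Minimal loading sketch for FastVLM from the Hugging Face Hub. This covers
# only the loading step; the prompt template and image preprocessing come
# from the repository's custom code, so consult the model card for inference.
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "apple/FastVLM-0.5B"
device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.float16 if device == "cuda" else torch.float32,
    trust_remote_code=True,   # the repo ships its own model code
).to(device)

image = Image.open("example.jpg").convert("RGB")  # hypothetical local image

# Once the inputs (prompt plus preprocessed image) are built as described in
# the model card, generation is the standard Transformers call, e.g.:
#   output_ids = model.generate(**inputs, max_new_tokens=128)
#   print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```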
Bottom Line
Apple released its own AI model, but it's not a new version of ChatGPT. Instead, it's a Vision Language Model (VLM) called FastVLM.
VLMs are AI models that can process both images and text. Apple's release is important because the model is up to 85 times faster than similar models from competitors, and its vision encoder is about 3.4 times smaller. This should allow it to run efficiently on consumer hardware.
This could lead to many interesting uses for an AI assistant that can process images, or even video, in real time and respond to what it sees. And because everything happens on your device, your images never have to leave it, which eases the usual privacy concerns. We will probably see new features powered by FastVLM soon, and that's exciting.