Local LLM hardware calculator and planner

Estimate VRAM usage and tokens-per-second for running Llama, Qwen, DeepSeek, Mistral, and other open-source LLMs on any GPU.

Configuration
Model
Quantization
GPU
Workload
Context length: 1K · 2K · 4K · 8K · 16K · 32K · 64K · 128K
Concurrent requests: 1 · 8 · 16 · 24 · 32
Offload weights to CPU/RAM
Slower, but lets oversized models run.
Run inside Atomic Chat
Atomic Chat compresses the KV cache 6× with zero quality loss, so you can run bigger context windows on the same GPU.
Learn how it works →
Live inference preview
Stream at ~0 tok/s
Live LLM response will appear here at realistic inference speed.
Try a question
Atomic Chat

Stop paying for AI. Own it.

Free, local AI chat for Mac — run Llama, Qwen, DeepSeek, Mistral and 1,000+ more models privately on your own hardware. Open-source. Zero cost.

Download Atomic Chat

Live preview powered by Overchat AI

Why Use Our LLM VRAM Calculator?

With the Overchat AI LLM Hardware Requirements Calculator, you can estimate how much VRAM or unified memory you need to run different local AI models on a discrete GPU or a Mac. You can also preview how fast streaming will feel in the chat, with and without CPU/RAM offloading.

The best VRAM calculator

The Overchat AI LLM Inference Hardware Calculator allows you to choose from the most popular AI models and see how they perform on many systems with granular customization. You can set different context lengths and the number of concurrent tasks.

Overchat AI VRAM calculator
🚀

Every popular open-source LLM

Our LLM VRAM calculator covers all the major open-source families, including Llama, Qwen, Mistral, DeepSeek, Gemma, Phi, and GPT OSS. You can also pick different parameter sizes within each family.

💻

Quantization-aware VRAM math

Quantization compresses model weights. Heavier compression lets you run larger models on modest hardware, but output quality gradually degrades. With Overchat AI, you can find the optimal quantization level for your setup.
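As a rough illustration of the math (a simplified sketch, not the calculator's exact formula; the bits-per-weight figures below are assumed approximate effective values for common quantization levels), weight memory scales with parameter count times bits per weight:

```python
# Rough weight-memory estimate: parameters x bits-per-weight / 8.
# Assumed approximate effective bits per weight for common quantization
# levels; embeddings and output layers are often kept at higher
# precision, so real numbers run slightly higher.
BITS_PER_WEIGHT = {
    "FP16": 16.0,
    "Q8_0": 8.5,
    "Q5_K_M": 5.7,
    "Q4_K_M": 4.8,
}

def weight_vram_gib(params_billion: float, quant: str) -> float:
    bits = BITS_PER_WEIGHT[quant]
    return params_billion * 1e9 * bits / 8 / 1024**3

for quant in BITS_PER_WEIGHT:
    print(f"7B @ {quant}: {weight_vram_gib(7, quant):.1f} GiB")
# FP16 ~13.0, Q8_0 ~6.9, Q5_K_M ~4.6, Q4_K_M ~3.9 GiB
```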

♾️

Context length & batch size modeled

The KV cache also takes up memory, and it grows as you chat and fill more of the model's context window. This has a significant impact on real-world performance. With Overchat AI, you can see how each model will perform in realistic scenarios, not just on the first question.
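A common back-of-the-envelope formula for the cache is 2 × layers × KV heads × head dimension × context length × batch size × bytes per element. The sketch below uses Llama-2-7B-style dimensions as an assumed example; models with grouped-query attention keep far fewer KV heads and therefore a much smaller cache:

```python
def kv_cache_gib(n_layers: int, n_kv_heads: int, head_dim: int,
                 context_len: int, batch_size: int = 1,
                 bytes_per_elem: int = 2) -> float:
    """Keys + values stored for every layer, KV head, and token."""
    n_bytes = (2 * n_layers * n_kv_heads * head_dim
               * context_len * batch_size * bytes_per_elem)
    return n_bytes / 1024**3

# Assumed Llama-2-7B-like shape: 32 layers, 32 KV heads, head_dim 128, FP16 cache
print(kv_cache_gib(32, 32, 128, 4096))    # ~2 GiB at 4K context
print(kv_cache_gib(32, 32, 128, 32768))   # ~16 GiB at 32K context
```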

About Overchat AI

Don’t have the VRAM? Skip the hardware and run ChatGPT, Claude, Gemini, DeepSeek, and more on Overchat AI — no GPU required.

Overchat AI Interface

What is an LLM VRAM Calculator?

An online Large Language Model (LLM) VRAM calculator estimates how much GPU memory is needed to run an LLM. With Overchat AI, you can also simulate different workloads and use the interactive chat widget to see the real-world typing speed you would experience in your chatbot, taking into account realistic KV cache build-up, context usage, and more. You can also see how speed is affected if the model doesn't fit entirely into memory.

Our tool is benchmarked against real Atomic Chat, vLLM, llama.cpp, and Hugging Face Transformers runs.

This way, you can select the largest model that will run smoothly on your system. First, choose your graphics card, iMac, or MacBook configuration. Then, select the model you want to run from the dropdown menu. Finally, drag the context and concurrent users sliders to set a realistic load. The dial on the right will indicate whether the model fits into memory and how the memory will be allocated. Scroll down to the chat simulation to see the realistic streaming speed of your chosen model on your hardware.

Features Of Our LLM Inference Calculator

Real-time VRAM breakdown. You can adjust the quantization, context length, and batch size (the number of concurrent requests), and the memory breakdown updates instantly. This lets you see not only whether your system can theoretically run a particular model, but also how it will perform in the real world and where the bottlenecks will be.

Overchat AI also shows generation speed in tokens per second and time to first token. Use the chat simulation widget to see how it will feel to run the model on your system in real life.

When a model is too large for a single card, the calculator offers a CPU offloading mode that shows how the model will perform with part of its weights in system RAM. You'll notice that streaming slows down and time to first token increases significantly, so you can decide whether the additional wait is acceptable.
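The intuition behind that slowdown, as a simplified memory-bandwidth-bound sketch (real speed also depends on compute, kernel efficiency, and batch size; the bandwidth figures are illustrative assumptions, not measurements): each generated token has to read essentially all of the weights, and the offloaded fraction travels over the much slower system-RAM/PCIe path.

```python
def decode_tok_per_s(weight_gb: float, gpu_bw_gbs: float = 1000.0,
                     offload_frac: float = 0.0,
                     cpu_path_gbs: float = 50.0) -> float:
    """Very rough upper bound on decode speed, assuming every token
    reads all weights once and speed is limited by memory bandwidth."""
    time_on_gpu = weight_gb * (1 - offload_frac) / gpu_bw_gbs
    time_off_gpu = weight_gb * offload_frac / cpu_path_gbs
    return 1.0 / (time_on_gpu + time_off_gpu)

print(decode_tok_per_s(3.9))                    # 7B @ Q4 fully on GPU: ~256 tok/s ceiling
print(decode_tok_per_s(3.9, offload_frac=0.3))  # 30% offloaded: ~38 tok/s
```

Offloading even a third of the weights can cut that ceiling by an order of magnitude, which is why the chat simulation feels so much slower with offload enabled.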

The calculator comes with presets for many modern NVIDIA GPUs as well as Apple Silicon devices, from base M1 Macs up to the latest M-series Max and Ultra chips.

FAQ

What is an LLM VRAM calculator?

An LLM VRAM calculator estimates how much GPU memory a large language model needs for inference, based on the model size, quantization, context length, and batch size. This lets you see exactly why a setup works or doesn't.

How to use Overchat AI LLM hardware calculator?

Select a model, set the quantization level, and choose your GPU in the calculator above. You can also turn on CPU/RAM offload to see whether a model that otherwise doesn't fit in memory can still run. The panel on the right shows how much VRAM the model will use and whether your system can run it.

How much VRAM do I need to run an LLM locally?

As a rough guide, a 7B model needs around 6–8 GB of VRAM, a 13B model needs 10–12 GB, and a 70B model needs 40–48 GB. That is just to load the model into memory. As you keep chatting, longer context builds up a KV cache that takes even more memory. You can see how this impacts performance in our LLM VRAM Calculator above.
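Those ranges follow from the same arithmetic as above; a quick sketch, assuming roughly 4.8 effective bits per weight after quantization plus a couple of gigabytes for the KV cache and runtime overhead:

```python
# Rough guide behind the numbers above (assumed ~4.8 bits/weight after
# quantization, plus ~2 GB of KV cache and runtime overhead).
for params_b in (7, 13, 70):
    weights_gb = params_b * 1e9 * 4.8 / 8 / 1e9
    print(f"{params_b}B: ~{weights_gb:.0f} GB weights + ~2 GB cache/overhead")
# 7B -> ~4 + 2 GB, 13B -> ~8 + 2 GB, 70B -> ~42 + 2 GB
```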

From The Blog

Overchat AI For All Platforms

Available on Web, iOS, and Android. Access your AI assistant anywhere, anytime.

Overchat AI Desktop and mobile interfaces