AI cockpits compete on compute, but the real limit is memory

Andy Qiu

As TOPS stops being the bottleneck, memory bandwidth is becoming the deciding factor for the next generation of smart cockpits.

In the last year or two, almost every new car launch has talked about an "AI cockpit." The number used to prove it is almost always TOPS, the chip's compute power. From Qualcomm's Snapdragon Cockpit Elite to MediaTek's Dimensity Auto, and many "large model running in the car" announcements, the headline compute number keeps going up.

But move from the launch stage to real engineering, and you notice something that is rarely said out loud: when these chips actually run a model that is large enough to be useful, the thing that decides the experience is usually not TOPS. It is a quieter number: memory bandwidth, measured in GB/s. Bandwidth, however, is only one side of it. Memory as a whole (its bandwidth, its capacity, and how it is shared inside the chip) is becoming the foundation that decides how good the next generation of smart cockpits will be.

TOPS was the right number, until the workload changed

There is a good historical reason why TOPS came first. In the earlier era of AI, the main job was computer vision: detecting objects, reading lane lines and watching the cabin. That kind of work needs a lot of computation, so compute really was the first limit and TOPS was the right way to measure it.

What changed is the workload itself, and it changed in two ways at the same time.

First, generative large models came into the car. When a large model writes its answer, it produces one token (roughly one word-piece) at a time. To produce each token, the chip has to read all of the model's weights from memory again. This step is limited by memory bandwidth, not by compute.

Second, and more important, cockpit AI is moving from "answer when asked" to "act on its own": planning tasks and running several agents together. An agent is not one bigger model. It means several models stay loaded at once, the context and memory keep growing, and the system calls tools again and again. This needs more compute, but it also makes two demands on memory: enough capacity to hold several loaded models and a growing KV cache, and enough bandwidth to read all of that on every step. In short, generative models were only the start. The move to agents is what really brings the memory problem to the front.

It helps to split "fast" into two parts. Prefill is the step that reads your request; it sets how long you wait for the first word, and it uses compute (TOPS). Decode is the step that writes the answer; it sets how fast the words come out, and it uses memory bandwidth. So TOPS decides how quickly the assistant starts, and GB/s decides how smoothly it keeps talking. In a cockpit the request is often not short (a system prompt, retrieved manual text, several turns of conversation), so prefill and TOPS still matter. What generative and agent workloads mainly make heavier is the second step, writing the answer. That step depends on memory bandwidth, and it is the part least visible on a spec sheet.
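The two phases can be put into a rough latency model. The sketch below is illustrative only: the function names are hypothetical, the rule of roughly 2 operations per parameter per token is a common approximation rather than a vendor figure, and the utilization and efficiency defaults are assumptions, as is the example chip.

```python
def time_to_first_token_s(prompt_tokens, params_billion, peak_tops, utilization=0.3):
    """Prefill is compute-bound: every prompt token passes through all the weights."""
    ops = 2 * params_billion * 1e9 * prompt_tokens  # ~2 ops per parameter per token
    return ops / (peak_tops * 1e12 * utilization)   # utilization is an assumed value


def decode_tokens_per_s(weights_gb, peak_bandwidth_gbs, efficiency=0.6):
    """Decode is bandwidth-bound: each new token re-reads all the weights."""
    return peak_bandwidth_gbs * efficiency / weights_gb  # efficiency is assumed


def response_time_s(prompt_tokens, answer_tokens, params_billion, weights_gb,
                    peak_tops, peak_bandwidth_gbs):
    """Total time = wait for the first token + time to stream the answer."""
    ttft = time_to_first_token_s(prompt_tokens, params_billion, peak_tops)
    rate = decode_tokens_per_s(weights_gb, peak_bandwidth_gbs)
    return ttft + answer_tokens / rate


# Hypothetical chip: 40 TOPS, 60 GB/s, running a 7B model stored in 3.5 GB
ttft = time_to_first_token_s(512, 7, 40)
total = response_time_s(512, 100, 7, 3.5, 40, 60)
print(f"first token after ~{ttft:.1f} s, full 100-token answer after ~{total:.1f} s")
```

Under these assumed values the wait for the first token is well under a second, while streaming the answer takes most of the total time, which is the split between TOPS and GB/s described above.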

A simple calculation, and some real numbers

Here is a rough calculation. A 7B model, after 4-bit compression, takes about 3.5 GB of memory. To reach about 20 tokens per second (close to human speaking speed), the decode step alone must read those 3.5 GB twenty times every second, or about 70 GB/s. And that is only the floor. It assumes the memory bus is fully efficient, and it does not count the KV cache or anything else sharing the same bus. Real chips reach only 50–70% of their peak bandwidth, so to actually hold 20 tokens per second you need peak bandwidth of roughly 100–140 GB/s, well above the 70 GB/s floor.
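The same arithmetic can be written out as a small sketch. The function name is hypothetical, and the model deliberately ignores the KV cache and every other client on the bus, so the results are lower bounds under those assumptions.

```python
def required_peak_bandwidth_gbs(weights_gb, tokens_per_s, bus_efficiency=1.0):
    """Peak bandwidth needed for decode when every token re-reads all weights."""
    if not 0 < bus_efficiency <= 1:
        raise ValueError("bus_efficiency must be in (0, 1]")
    return weights_gb * tokens_per_s / bus_efficiency


# 7B parameters at 4 bits each: 7e9 * 4 / 8 bytes = 3.5 GB
weights_gb = 7e9 * 4 / 8 / 1e9

floor = required_peak_bandwidth_gbs(weights_gb, 20)       # ideal bus: 70 GB/s
at_70 = required_peak_bandwidth_gbs(weights_gb, 20, 0.7)  # about 100 GB/s
at_50 = required_peak_bandwidth_gbs(weights_gb, 20, 0.5)  # 140 GB/s

print(f"floor {floor:.0f} GB/s, realistic range {at_70:.0f} to {at_50:.0f} GB/s")
```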

SemiDrive describes almost the same calculation in its own words. To output the first token within one second for a 512-token input, and then hold 20 tokens per second, it says you need an NPU of about 30–40 TOPS together with about 90 GB/s of bandwidth. It also points out that today's cockpit chips usually have enough NPU compute; what stops them from running a 7B model on the device is memory bandwidth, which is still stuck around 60–70 GB/s. Real tests point the same way. Vendor headline numbers (Snapdragon 8 Elite at about 70 tok/s, 8 Elite Gen 5 at about 220 tok/s) are usually measured on small models. On a Snapdragon 8 Elite, a real test of a useful 4-bit Llama 3.2 3B model ran at about 10 tok/s.

Looking across chips: the new ones all raise bandwidth. Put a few recent cockpit and central-compute chips side by side, and the direction is clear:

Chip (role)	NPU compute	Memory bandwidth	Target on-device model
Qualcomm 8295 (current mainstream cockpit, 5nm)	~30 TOPS	~60–70 GB/s	7B is a stretch
MediaTek CT-X1 (next-gen cockpit, 3nm)	46+ TOPS	not disclosed (vendor says memory subsystem beats the 8295)	13B
SemiDrive X10 (next-gen cockpit, 4nm)	40 TOPS	154 GB/s (128-bit LPDDR5X)	7B multimodal
NVIDIA Thor (central compute)	central-class*	273 GB/s (256-bit LPDDR5X)	LLM/VLM/VLA
Tesla AI4 (driving/vision, 7nm)	not disclosed**	~384 GB/s (GDDR6)	end-to-end vision / FSD

*Thor is a central-compute chip; its compute figure is not directly comparable to cockpit chips.

**Tesla does not officially publish AI4's TOPS, and outside estimates vary widely; the point here is its bandwidth route.

The point is this. Among cockpit chips, the NPU numbers are close (30–46 TOPS). What really sets apart "how big a model it can run" is the memory system. SemiDrive's X10 has only about 10 TOPS more than the 8295, but more than double the bandwidth, and that is what lets it run a 7B model on the device. CT-X1 supports 13B because of a stronger memory subsystem. The central-compute Thor goes all the way to a 256-bit bus and 273 GB/s. The pattern is the same: chips designed for AI keep raising memory bandwidth and bus width. This is becoming the shared design choice for the next generation of cockpit chips.

HBM, the highest-bandwidth memory, cannot be used in cars yet because of cost, heat, and automotive qualification, so the industry has to get as much as it can out of LPDDR5X. Some go further: Tesla's AI4 uses GDDR6 to push bandwidth to about 384 GB/s, at the cost of more power and heat. Elon Musk himself calls memory bandwidth "the choke point" for AI inference; he dropped the older HW3 because its bandwidth was too low, and he says the next chip, AI5, will target about 5× the bandwidth.
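To see how the bandwidth column translates into experience, the sketch below estimates a decode ceiling for a 4-bit 7B model (about 3.5 GB) on each chip with a disclosed bandwidth. It is a rough comparison under stated assumptions: 60% bus efficiency (the middle of the 50–70% range), 65 GB/s as the midpoint for the 8295, no KV cache, and nothing else using the bus. Thor and AI4 are included only for their bandwidth figures, not as cockpit chips.

```python
WEIGHTS_GB = 3.5   # 7B model at 4 bits per weight
EFFICIENCY = 0.6   # assumed: midpoint of the 50-70% range
TARGET_TPS = 20    # roughly human speaking speed

# Peak bandwidth in GB/s, taken from the table above
peak_bandwidth_gbs = {
    "Qualcomm 8295": 65,   # assumed midpoint of ~60-70 GB/s
    "SemiDrive X10": 154,
    "NVIDIA Thor": 273,
    "Tesla AI4": 384,
}


def decode_ceiling_tps(peak_gbs, weights_gb=WEIGHTS_GB, efficiency=EFFICIENCY):
    """Upper bound on tokens/s when each token re-reads all the weights."""
    return peak_gbs * efficiency / weights_gb


ceilings = {chip: decode_ceiling_tps(bw) for chip, bw in peak_bandwidth_gbs.items()}

for chip, tps in ceilings.items():
    verdict = "meets" if tps >= TARGET_TPS else "misses"
    print(f"{chip}: ~{tps:.0f} tok/s ceiling, {verdict} the {TARGET_TPS} tok/s target")
```

Under these assumptions the 8295 lands near 11 tok/s and the X10 near 26 tok/s, which matches the claim that bandwidth, not the 10 TOPS difference, separates the two.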

And this memory is not used by the AI model alone

This is where a car is harder than a phone. A cockpit chip uses one shared pool of memory (a unified memory architecture). The GPU drawing several screens, the camera feeds, and the audio and video pipeline all use the same LPDDR memory bus as the NPU that runs the model. When one inference suddenly takes tens of GB/s, and the whole chip only has about 60–70 GB/s, the result may not be "the AI is slow." It may be that the screens drop frames and the navigation stutters. Or, to keep the screens and navigation smooth, the system limits how much bandwidth the model is allowed to use. So the bandwidth wall is not only about whether the AI is fast. It is about whether the AI, once it really runs, slows down everything else in the cockpit.
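A shared bus can be treated as a budget. The sketch below subtracts the bandwidth reserved for protected clients from the sustained total and shows what is left for the model. The reserved figures are placeholders invented for illustration, not measurements from any chip, and the function name is hypothetical.

```python
def model_bandwidth_headroom_gbs(peak_gbs, efficiency, reserved_gbs):
    """Sustained bandwidth left for the model after protected clients are served."""
    sustained = peak_gbs * efficiency
    return max(0.0, sustained - sum(reserved_gbs.values()))


# Placeholder reservations in GB/s (assumptions, for illustration only)
reserved = {"display and GPU": 12.0, "cameras": 6.0, "audio and video": 2.0}

headroom = model_bandwidth_headroom_gbs(peak_gbs=65, efficiency=0.6,
                                        reserved_gbs=reserved)
tokens_per_s = headroom / 3.5   # 4-bit 7B model, about 3.5 GB of weights

print(f"{headroom:.0f} GB/s left for the model, ~{tokens_per_s:.1f} tok/s")
```

With these placeholder values, a chip that looks like 65 GB/s on paper leaves the model about 19 GB/s, or roughly 5 tokens per second on a 7B model.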

This is not just theory. The industry is already using two-SoC designs and separate AI boxes to give the AI its own memory and bandwidth. And the move to agents makes it tighter: several loaded agents, each with its own KV cache, draw on the same budget, in both capacity and bandwidth.
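The capacity side of that budget can be estimated with the standard KV-cache size formula: keys plus values, for every layer, KV head, and token. The sketch below is a minimal illustration; the model shape (32 layers, 8 KV heads, head dimension 128, 16-bit cache values) and the three agents are hypothetical and are not taken from any product named in this article.

```python
from dataclasses import dataclass


def kv_cache_gb(context_tokens, layers, kv_heads, head_dim, bytes_per_value=2):
    """KV cache size: keys and values (2x) for every layer, KV head, and token."""
    return 2 * layers * kv_heads * head_dim * context_tokens * bytes_per_value / 1e9


@dataclass
class Agent:
    name: str
    weights_gb: float
    context_tokens: int

    def resident_gb(self, **model_shape):
        """Memory this agent keeps loaded: its weights plus its KV cache."""
        return self.weights_gb + kv_cache_gb(self.context_tokens, **model_shape)


# Hypothetical model shape (assumption, for illustration only)
SHAPE = dict(layers=32, kv_heads=8, head_dim=128)

# Hypothetical agents, each a 4-bit 7B model with a different context length
agents = [
    Agent("voice assistant", 3.5, 8192),
    Agent("navigation planner", 3.5, 4096),
    Agent("vehicle-manual helper", 3.5, 16384),
]

capacity_gb = sum(a.resident_gb(**SHAPE) for a in agents)
print(f"resident memory for {len(agents)} agents: ~{capacity_gb:.1f} GB")
```

Each agent's decode step also reads its own weights and KV cache, so the same per-agent figure feeds the bandwidth side of the budget.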

How to deal with the bandwidth wall: several paths already in use

This is not a dead end. The industry is working on several levels at the same time. (This is a description of what is happening, not a prescription.)

Hardware. Widen the memory bus (from 128-bit to 256-bit), use faster LPDDR, and add priority rules in the memory controller so screens and cameras are protected. Where needed, use a second SoC or separate AI memory to keep the workloads apart. The cost is more chip area, more power, and a higher bill of materials.
Software (the cheapest bandwidth). Strong quantization (W4A16), KV-cache quantization and management (paged attention, sliding windows), and speculative decoding. Each of these either moves fewer bytes or produces more tokens per read. When the hardware is fixed, this gives the most value, and it is the part a Tier-1 should own as its own technology.
Architecture. Use a set of small models with routing instead of one large model. Split the work between the car and the cloud, sending the heaviest generation to the cloud, but count the ongoing cost and the need for a connection. And treat the memory and bandwidth budget as a first-class design rule at the start of production, leaving headroom for the 12–15 year life of the car.
Self-developed silicon and new architecture. More and more OEMs are building their own chips to keep memory and architecture in their own hands (NIO's Shenji NX9031, XPeng's Turing, and Tesla's FSD are all examples). The one aimed most directly at the memory wall is Li Auto's self-developed Mahe 100: instead of just piling on TOPS, it uses a dataflow architecture that lets data flow between on-chip compute units and cuts repeated reads and writes to DRAM, so it gets more usable compute out of the same raw number (Li Auto claims one chip delivers about 3× the effective compute of NVIDIA's Thor). The new L9 carries two of them, for 2,560 TOPS combined. This points to a second way around the memory wall: not only "add more bandwidth," but also "redesign the architecture so you need less of it."
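The software levers listed above can be compared with a simple bytes-per-token view: quantization shrinks the bytes read for each token, and speculative decoding yields more than one token per pass over the weights. The sketch below is a rough model under stated assumptions: 40 GB/s of sustained bandwidth and an average of 1.8 accepted tokens per pass are hypothetical values, and the cost of running the draft model is ignored.

```python
def decode_tps(params_billion, bits_per_weight, sustained_gbs,
               tokens_per_weight_read=1.0):
    """Bandwidth-bound decode rate for a dense model.

    tokens_per_weight_read > 1 models speculative decoding, where one pass
    over the weights verifies several drafted tokens at once.
    """
    weights_gb = params_billion * bits_per_weight / 8
    return sustained_gbs / weights_gb * tokens_per_weight_read


SUSTAINED_GBS = 40.0   # hypothetical bandwidth actually available to the model

fp16 = decode_tps(7, 16, SUSTAINED_GBS)         # 14 GB of weights per token
w4 = decode_tps(7, 4, SUSTAINED_GBS)            # 3.5 GB of weights per token
w4_spec = decode_tps(7, 4, SUSTAINED_GBS, 1.8)  # assumed acceptance rate

for label, tps in [("FP16", fp16), ("W4", w4), ("W4 + speculative", w4_spec)]:
    print(f"{label}: ~{tps:.1f} tok/s")
```

On the same fixed hardware, the assumed values move a 7B model from about 3 tok/s to about 20 tok/s, which is why the software path gives the most value once the chip is chosen.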

These are not either/or. Real platforms usually combine them. Together they decide how much "intelligence" fits inside a fixed, and increasingly expensive, memory budget.

What this means for chip selection

Memory (its bandwidth, its capacity, and how it is shared) is becoming the base that decides the next-generation cockpit experience, and with DRAM and automotive LPDDR prices climbing, it is also a cost-and-supply decision. The practical takeaway: in your next platform selection and RFQ, treat memory as a first-class metric alongside TOPS. Ask for sustained (not peak) bandwidth, the headroom left for AI while the screens and cameras are running, and real tokens-per-second on a reference 7B model, and leave headroom for the life of the car. Compute decides whether the cockpit can start; memory decides whether it stays good to use.

SBD Automotive can benchmark your platform choices on the metrics that actually decide on-device AI: bandwidth, bandwidth per dollar, shared-memory headroom, and real on-device throughput. To discuss what this means for your chip, software, and supplier roadmap: info@sbdautomotive.com

Andy Qiu, Senior Manager at SBD Automotive
