The short version, since you probably scrolled here first:
For most people, right now, rent via the API and wait.
The hardware just got more expensive, the memory spike behind it may cool off, and the models actually worth obsessing over don't fit on a single box anyway.
Go local only if one of these is you:
- you have data that genuinely can't leave your hardware
- you're running a heavy workload constantly, many hours a day
- you really want the quiet and low power of something like a Mac cluster
- you like to tinker and want to learn
If that's you, keep reading and I'll show you exactly which tier you land in. If it's not, you just saved yourself a few grand. You're welcome.
This whole question blew up because two things hit this month: the government pulled a couple of frontier models over export rules, and Apple jacked up hardware prices overnight. Both have people scrambling to own their own gear.
Start here: four questions that tell you where you belong
Before we talk about a single piece of hardware, answer these four questions, because they sort you faster than any spec sheet.
- Do you need to load big models, or run them fast? These are two different things and people conflate them all the time. Unified memory (like the 128GB in a DGX Spark or a maxed-out MacBook) is great for loading a large model so it fits at all. Memory bandwidth is the thing that actually gives you tokens per second once it is loaded. If you have a lot of memory but low bandwidth, the model fits and then runs slowly, and a lot of people are surprised by that.
- Are you doing agentic workloads, and do you mind if they take longer? If you are running agents that grind through a task in the background and you are fine letting them cook for a while, slower tokens per second matter way less. If you need a snappy interactive response every time, bandwidth becomes the whole game.
- Is cost actually a consideration? Be honest about your real usage, because once you add the price of the card, and the energy to keep a machine running, and your time keeping it healthy, the API is cheaper than people expect for the amount most folks actually use.
- Do you have a privacy requirement? Is there data that genuinely cannot leave your hardware, like regulated customer data or something under an agreement you signed? That is one of the few things that flips this whole decision on its own.
Answer those four and you will already know which tier you are in. The rest of this is just filling in the details.
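If it helps to see that sorting written down, here is a minimal Python sketch. The class, the field names, and the order of the checks are my own illustration (none of this is a formal rule), and it only encodes the "go local only if" list from the top plus the bandwidth-versus-capacity question.

```python
from dataclasses import dataclass


@dataclass
class Answers:
    # All field names are hypothetical labels for the questions above.
    needs_fast_interactive: bool  # snappy replies vs. agents that can cook in the background
    heavy_constant_use: bool      # a heavy workload, many hours a day
    data_must_stay_local: bool    # regulated data or data under a signed agreement
    wants_quiet_low_power: bool   # e.g. a Mac cluster
    likes_to_tinker: bool         # you want to learn


def recommend(a: Answers) -> str:
    """Return a rough recommendation string. A sketch, not a rule."""
    reasons = []
    if a.data_must_stay_local:
        reasons.append("privacy")
    if a.heavy_constant_use:
        reasons.append("constant heavy workload")
    if a.wants_quiet_low_power:
        reasons.append("quiet and low power")
    if a.likes_to_tinker:
        reasons.append("learning")
    if not reasons:
        return "rent via the API and wait"
    # Assumption: interactive use is bandwidth-bound, background agents care more about fit.
    priority = "memory bandwidth" if a.needs_fast_interactive else "memory capacity"
    return f"go local ({', '.join(reasons)}); prioritize {priority}"


print(recommend(Answers(False, False, False, False, False)))
print(recommend(Answers(True, False, True, False, False)))
```

Treat the output as a starting point for the conversation with yourself, and be honest on the usage field in particular.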
The decision tree

The hardware reality people get wrong
This surprised me when I first started looking at this…
The RTX 5090 and the RTX PRO 6000 Blackwell are the same chip with the same memory bandwidth. They are both built on the same Blackwell die and they both move memory at around 1,792 GB/s. The only real difference is how much memory they carry, which is 32GB on the 5090 and 96GB on the PRO 6000. So if the model you want to run fits inside 32GB, the cheaper card gives you essentially the same decode speed. The expensive card is not buying you more tokens per second. It is buying you room to run bigger models at that same speed.
The other question I get constantly is the difference between a DGX Spark and an RTX PRO 6000, and the answer comes down to memory bandwidth and unified memory. The DGX Spark has 128GB of unified memory, which is wonderful for loading a really big model so it fits. Its bandwidth is only around 273 GB/s, though, and low bandwidth means fewer tokens per second. The PRO 6000 has less memory at 96GB but its bandwidth is around 1,792 GB/s, so whatever fits on it runs a lot faster. One box is built to fit big things, the other is built to run things fast, and you want to be clear about which problem you actually have.
Device | Memory | Bandwidth | Approx. price (after June 25) | Best at |
RTX 5090 | 32GB GDDR7 | ~1,792 GB/s | ~$2,000 | Fast inference on anything that fits in 32GB |
RTX PRO 6000 Blackwell | 96GB GDDR7 (ECC) | ~1,792 GB/s | ~$7,500 to $8,400 | Bigger models on one card, fast, with real concurrency |
M5 Max MacBook Pro (128GB) | 128GB unified | ~614 GB/s | ~$4,099 | Quiet, power efficient, and “clusterable” |
DGX Spark | 128GB unified | ~273 GB/s | ~$4,000 | Loading big models, CUDA experiments (slow decode) |
AMD Ryzen AI Max+ 395 (Strix Halo) | 128GB unified | ~256 GB/s | ~$2,000 to $3,400 | Cheapest 128GB box (software still maturing) |
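If you want rough numbers on the capacity-versus-bandwidth point, here is a minimal sketch of the usual back-of-the-envelope rule: during decode, each new token has to read every active weight once, so tokens per second tops out near memory bandwidth divided by the bytes of active weights. The function name and the 27B-at-4-bit example model are my own assumptions for illustration, and the bandwidth figures are the ones from the table. Real throughput lands below this ceiling because of KV-cache reads, compute, and software overhead.

```python
def decode_ceiling_tok_s(bandwidth_gb_s: float, active_params_b: float,
                         bits_per_weight: float) -> float:
    """Upper bound on decode tokens/s if every active weight is read once per token."""
    # Billions of parameters times bytes per parameter gives GB read per token.
    gb_read_per_token = active_params_b * bits_per_weight / 8
    return bandwidth_gb_s / gb_read_per_token


# Bandwidth figures (GB/s) taken from the table above.
DEVICES_GB_S = {
    "RTX 5090": 1792,
    "RTX PRO 6000 Blackwell": 1792,
    "M5 Max MacBook Pro": 614,
    "DGX Spark": 273,
    "Strix Halo": 256,
}

# Hypothetical example: a 27B dense model at 4 bits per weight (13.5 GB of weights).
for name, bandwidth in DEVICES_GB_S.items():
    ceiling = decode_ceiling_tok_s(bandwidth, active_params_b=27, bits_per_weight=4)
    print(f"{name}: ~{ceiling:.0f} tok/s ceiling")
```

The takeaway is the ratio, not the absolute numbers. On this estimate the 5090 and the PRO 6000 tie, and both come out about 6.5 times ahead of the DGX Spark on the same model.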
The timing problem nobody is pricing in
I’m having extreme FOMO and I think this part matters most for those who have an urge to buy this week.
Apple did not raise prices because it wanted to. It raised them because of a memory chip shortage, and that shortage is being driven by everybody building out AI data centers and buying up all the memory. DRAM prices rose as much as 98 percent in the first quarter of this year and are set to climb another 58 to 63 percent this quarter, and they have gone up more than fourfold since late 2025. Apple's own CEO called it a hundred-year flood and said the increases were unavoidable.
And the machines that got hit the hardest are exactly the ones you would buy for local AI. The thing that makes them good for AI is a lot of memory, and memory is the scarce thing. The M3 Ultra Mac Studio jumped $1,300, which is about 33 percent, and the M5 Max MacBook Pro and the M4 Max Mac Studio each went up $500, while the iPhone did not move at all.
So sit with the irony for a second. The data center boom is what is taxing the consumer hardware that people want to buy so they can stop depending on data centers. And the FOMO is telling you to buy now before it gets worse, when buying now means buying at the very top of a memory price spike that Apple itself is hinting may come back down. Buying a depreciating asset at peak prices because you are afraid of missing out is the kind of thing you look back on and wince at. I could be totally wrong here but…just trying to not get caught up in the X hype rn.
The gotchas nobody mentions
Let’s say I took the bite and spent the money. Where do I end up?
Mixing a fast card with a slow card does not add up the way you think.
If you drop a 5090 in next to an older 4090 or 3090 and split a model across them, the cards have to talk to each other over the PCIe bus, which is your slowest link in the whole system, and the slower card ends up gating your tokens per second for the layers it holds. You do not get 5090 speed plus 4090 speed. You get something closer to the weaker card plus the overhead of shuffling data back and forth. Most people assume that two powerful cards just work together and the speeds combine, and that is not how it plays out.
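Here is a rough sketch of why the speeds do not add. It assumes a plain layer split where each card reads its share of the weights in turn for every token, so the per-token times add up. The function name, the 40GB example model, and the even split are hypothetical, and the 936 GB/s figure for the 3090 is its published spec, not a number from this article. PCIe cost is left as an optional parameter because it depends heavily on the setup.

```python
def split_decode_tok_s(weights_gb: float, shares: dict[str, float],
                       bandwidth_gb_s: dict[str, float],
                       hop_overhead_s: float = 0.0) -> float:
    """Decode ceiling when layers are split across cards that run one after another."""
    if abs(sum(shares.values()) - 1.0) > 1e-9:
        raise ValueError("shares must sum to 1")
    # Each card spends (its share of the weights) / (its bandwidth) on every token.
    seconds_per_token = sum(
        weights_gb * share / bandwidth_gb_s[card] for card, share in shares.items()
    )
    # Optional fixed cost for each hand-off between cards over PCIe.
    seconds_per_token += hop_overhead_s * (len(shares) - 1)
    return 1.0 / seconds_per_token


bandwidth = {"RTX 5090": 1792, "RTX 3090": 936}
shares = {"RTX 5090": 0.5, "RTX 3090": 0.5}
rate = split_decode_tok_s(40, shares, bandwidth)
print(f"split across both cards: ~{rate:.1f} tok/s ceiling")
```

With these assumed numbers the pair tops out at roughly 31 tokens per second. That sits between what each card's bandwidth alone would allow on 40GB of weights (about 45 and 23), and nowhere near the 68 you would get if the speeds simply combined.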
Two PRO 6000s is not plug and play either, and this one is rough.
The PRO 6000 has no NVLink, so two of them also talk over PCIe, which runs at a small fraction of the bandwidth a real NVLink connection would give you. Splitting one model across both cards (tensor parallelism) needs an enormous amount of interconnect bandwidth because the cards sync at every single layer, and on PCIe that becomes the bottleneck. On top of that, the optimized all-reduce kernel in the popular serving software does not even support these cards' Blackwell compute capability yet, so you have to disable it and fall back to a more conservative path. People are reporting weeks of fighting deadlocks, building the inference stack from source because the prebuilt images do not support Blackwell, and tuning low-level driver and PCIe settings just to get stable dual-card inference running. Pipeline parallelism works better over PCIe, but it mostly helps you serve more users at once rather than making any single request faster.
SO…this is where the decision tree quietly walks you off a cliff. The moment you genuinely need fast multi-GPU, you are looking at H100, H200, or B200 class cards with real NVLink, and that is data center hardware at data center prices. So "I'll just buy two cards and link them" turns into "I guess I need a data center," and that is the point where most people should stop and ask whether they needed any of this in the first place.
Low, medium, high, extreme
Here is the most useful way I can frame it. Find the tier that matches your hardware today, see what you can actually run, and look at what the next step costs before you take it.
Low: a single consumer GPU (24 to 32GB), an M-series Mac, or a unified-memory mini PC.
- This is probably what you already have, and it covers a lot of real work.
- You can run the small-active mixture-of-experts and dense models really well, things like Qwen3.6 (the 27B or 35B-A3B with only 3B active), Gemma-class models, and smaller GLM variants, all quantized and fast.
- This is plenty for a local coding assistant and for running agents on your own machine. If you outgrow it, the next step is more memory bandwidth or more VRAM, not more boxes.
Medium: one RTX PRO 6000 Blackwell (96GB).
- This is the sweet spot if you genuinely need to go bigger.
- You can run up to a 70B dense model at FP8, or an 80B-class mixture-of-experts model, on a single card, fast, with enough room to handle real concurrency for agentic work.
- This is the build I spec'd, and it still holds up as the best single-box answer. The catch is that this is also the rung where you have to decide if you are going to keep climbing, because the next step up gets expensive in a hurry.
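A quick way to sanity-check whether a model belongs on a given card is weights-only arithmetic: parameters times bits per weight, divided by eight. The sketch below does that and leaves a slice of memory free for KV cache and runtime. The function names and the 80 percent headroom figure are my own assumptions, so adjust the headroom for your context length and concurrency.

```python
def weights_gb(params_b: float, bits_per_weight: float) -> float:
    """Weights-only footprint in GB for a model with params_b billion parameters."""
    return params_b * bits_per_weight / 8


def fits(params_b: float, bits_per_weight: float, memory_gb: float,
         headroom: float = 0.8) -> bool:
    """True if the weights fit in `headroom` of memory, leaving the rest for KV cache."""
    return weights_gb(params_b, bits_per_weight) <= memory_gb * headroom


print(fits(70, 8, 96))  # 70B dense at FP8 on a 96GB PRO 6000 -> True
print(fits(70, 8, 32))  # the same model on a 32GB 5090 -> False
print(fits(27, 4, 24))  # 27B at 4-bit on a 24GB consumer card -> True
```

This only tells you whether the model loads. How fast it runs once loaded is the bandwidth question from earlier.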
High: a unified-memory box, or a small Apple Silicon cluster, for the big mixture-of-experts models.
- Here you are accepting slower tokens per second to fit something large, like a 671B model with 37B active parameters at a 4-bit quant.
- A single 128GB box (DGX Spark or Strix Halo) adds capacity at low bandwidth, but a 671B model at 4 bits is roughly 335GB of weights, so one box will not hold it on its own. The Apple cluster is the more interesting version of this tier, and I will get to it in its own section because it deserves it.
Extreme: the actual frontier open models.
- The models everyone is talking about right now, GLM-5.2 at 744B, MiniMax M3, Kimi K2.7 at a trillion parameters, DeepSeek V4 at 1.6T, do not fit on any single-box consumer build at a usable quant.
- To run these yourself you are into multi-card H100, H200, or B200 systems with NVLink, and honestly, at that point the API is right there and it is cheap.
- GLM-5.2 runs around $1.40 per million input tokens and $4.40 per million output tokens, and MiniMax M3 is around $0.30 and $1.20. At those rates you can buy API tokens for a very long time before self-hosted hardware pays for itself.
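To make "a very long time" concrete, here is a minimal break-even sketch. The per-token prices are the ones quoted above and the $16,000 hardware figure is the Mac cluster price from this article. The monthly usage (100 million input and 20 million output tokens) and the function names are assumptions I made up for the example. It also generously assumes the local box could serve the same model at the same quality, which is exactly what the Extreme tier says it cannot.

```python
def monthly_api_cost(input_mtok: float, output_mtok: float,
                     in_price: float, out_price: float) -> float:
    """Monthly API spend. Usage is in millions of tokens, prices in $ per million."""
    return input_mtok * in_price + output_mtok * out_price


def months_to_break_even(hardware_usd: float, monthly_api_usd: float,
                         monthly_power_usd: float = 0.0) -> float:
    """Months until the hardware price equals the API spend it replaces, net of power."""
    saved = monthly_api_usd - monthly_power_usd
    if saved <= 0:
        return float("inf")  # the box never pays for itself
    return hardware_usd / saved


# Hypothetical usage: 100M input and 20M output tokens per month.
glm = monthly_api_cost(100, 20, in_price=1.40, out_price=4.40)
minimax = monthly_api_cost(100, 20, in_price=0.30, out_price=1.20)
print(f"GLM-5.2: ${glm:.0f}/month, break-even in {months_to_break_even(16_000, glm):.0f} months")
print(f"MiniMax M3: ${minimax:.0f}/month, break-even in {months_to_break_even(16_000, minimax):.0f} months")
```

At that assumed usage GLM-5.2 costs about $228 a month, so $16,000 of hardware takes roughly 70 months to pay back before you count electricity or your time. Plug in your own measured usage before trusting any of it.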
The Apple Silicon cluster option, before you call it a data center
This is the one I want to add for the folks who care about power and quiet, because it is a real thing people are running now and it sits in a sweet spot right below the data center.
Each M5 Max chip does up to 614 GB/s of memory bandwidth with 128GB of unified memory, and that bandwidth is more than double what a DGX Spark or a Strix Halo box gives you, so each node is genuinely quick on decode. The new part is that macOS 26.2 added RDMA over Thunderbolt 5, which moves data directly from one machine's memory to another's while skipping most of the operating system overhead, and at WWDC this month Apple shipped a distributed stack on top of it (JACCL and MLX distributed) that lets you wire about four Macs together in a mesh and run one model across all of them. While it may sound like a research preview…it actually SHIPPED!!
So you can chain four M5 Max MacBooks, each with 128GB, into roughly 512GB of shared memory, and run a 400B-plus mixture-of-experts model, or even a trillion-parameter model. Apple's own demo ran the trillion-parameter Kimi across four M3 Ultras at 28 tokens per second, and people in the community are already running a 397B model across three M5 Max laptops sitting on a desk. The whole thing sips power compared to a rack of GPUs, it is quiet, and it fits on a desk.
The honest tradeoff comes in two parts. First, the interconnect between the machines is Thunderbolt 5, which lands around 7.5 GB/s in practice, and that is your bottleneck, so this is slower than a real GPU. On the big models you are looking at something like 14 to 28 tokens per second, where a single H100 might do 71 on the same model. Second, it is not cheap, because four M5 Max MacBooks at around $4,099 each is more than $16,000, and that price just went up. But if you want capacity and quiet and low power draw, and you do not mind modest speed, this is a legitimate path that exists between a single box and a data center, and some people genuinely prefer it.
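The cluster arithmetic is simple enough to keep in a few lines. This sketch totals memory and price for the four-laptop setup and compares each node's memory bandwidth against the Thunderbolt link, using the figures quoted in this section. The function and key names are hypothetical.

```python
def cluster_summary(nodes: int, mem_gb_per_node: float, usd_per_node: float,
                    node_bandwidth_gb_s: float, link_gb_s: float) -> dict:
    """Totals for a small cluster, plus how much slower the link is than local memory."""
    total_mem = nodes * mem_gb_per_node
    total_usd = nodes * usd_per_node
    return {
        "memory_gb": total_mem,
        "price_usd": total_usd,
        "usd_per_gb": total_usd / total_mem,
        # How many times faster a node reads its own memory than it talks to a neighbor.
        "memory_to_link_ratio": node_bandwidth_gb_s / link_gb_s,
    }


summary = cluster_summary(nodes=4, mem_gb_per_node=128, usd_per_node=4099,
                          node_bandwidth_gb_s=614, link_gb_s=7.5)
for key, value in summary.items():
    print(f"{key}: {value:,.1f}")
```

The ratio is the number to stare at. Each node reads its own memory around 80 times faster than it can talk to its neighbor, which is why the link, not the chips, sets the pace.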
About the "buy now and pray it runs frontier models later" bet
A lot of people are buying hardware today on the hope that 18 months from now a model as good as Opus 4.8 or Fable will shrink down and run on their laptop. The trajectory is real, and I want to be fair about that, because small models really do keep catching up to last year's big ones.
BUT…there is a catch. The models that fit on a laptop are heavily quantized, and quantization is not free. The research is pretty consistent that 8-bit is basically lossless, and 4-bit holds up fine on easy tasks and then starts to degrade on the hard ones, and harder tasks degrade up to four times more than easy ones. Long-context tasks are where it really bites, with 4-bit dropping as much as 59 percent on long inputs, and the damage getting worse as the context grows. And smaller models take the quantization hit harder than big ones. That long-context, reasoning-heavy, agentic work is exactly the kind of thing you would want a frontier model for in the first place. (THIS IS AS OF JUNE 2026…)
So even if a model as good as today's Opus technically fits on your laptop in 18 months, the version that fits is the quantized one that is noticeably weaker on the things that matter most, and the actual frontier will have moved on by then anyway. Buying now and praying is really stacking three bets at once: that the model shrinks to your hardware, that it survives quantization on your specific tasks, and that your hardware is still the right shape for whatever the workload turns into. That is a lot to bet on a piece of hardware that loses value the moment you open the box.
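To get a feel for why 4-bit costs more than 8-bit, here is a toy sketch. It pushes a list of random numbers (a stand-in for one weight tensor) through plain round-to-nearest quantization and measures how far they move. This is my own simplified illustration, not how production quantizers work (those use per-group scales and calibration data), and weight error is not the same thing as task accuracy. It only shows that fewer bits means coarser steps.

```python
import random


def quantize_roundtrip(values: list[float], bits: int) -> list[float]:
    """Symmetric round-to-nearest quantization to `bits` bits, then back to floats."""
    levels = 2 ** (bits - 1) - 1  # 127 steps per side for 8-bit, 7 for 4-bit
    scale = max(abs(v) for v in values) / levels
    return [round(v / scale) * scale for v in values]


def mean_abs_error(values: list[float], bits: int) -> float:
    """Average distance between the original values and their quantized copies."""
    restored = quantize_roundtrip(values, bits)
    return sum(abs(a - b) for a, b in zip(values, restored)) / len(values)


random.seed(0)
weights = [random.gauss(0.0, 1.0) for _ in range(10_000)]  # fake weight tensor
err8 = mean_abs_error(weights, 8)
err4 = mean_abs_error(weights, 4)
print(f"8-bit error: {err8:.4f}, 4-bit error: {err4:.4f}, ratio: {err4 / err8:.1f}x")
```

The 4-bit copy lands much further from the original than the 8-bit one, and that extra error is the room where the hard-task and long-context degradation has space to show up.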
So, is it worth it, and should you do it now?
I say all this as someone who actually went down the road. I have a DGX Spark on my desk, I spec'd out a full RTX PRO 6000 workstation, and I returned an $8,000 build at Micro Center before I even picked it up.
For most people, the honest answer is that going local can absolutely be worth it, but right now is a rough moment to pull the trigger.
The API is the best use of money once you count the card and the energy and your time, and on top of that the hardware just got more expensive and harder to get, and the memory spike that caused it may ease over the next year. Going local is genuinely worth it when you have a privacy requirement, or you are running heavy workloads constantly, or you specifically want the power efficiency and quiet of something like a Mac cluster, or someone else is paying for the card. Outside of those, waiting is not losing.
And the irony on the control angle is the same one we started with. The frontier models that are actually worth worrying about getting pulled do not fit on local hardware anyway, so owning a box does not buy you the thing that scared you. It buys you privacy and steady-state cost control.
What to do instead
Prototype on the API, find your real use case, measure what you actually spend in tokens, and then make the hardware decision with real data instead of a vibe. That order matters, because almost everyone who buys first ends up owning a box that does not match the workload they eventually land on. And if you can, wait for memory prices to cool before you buy, because right now you are shopping at the top of a spike.
Spend the waiting time learning the things that transfer no matter which hardware shows up. Get comfortable with how to quantize and serve models with something like vLLM, SGLang, or MLX, build a real feel for the difference between memory capacity and memory bandwidth, and learn how to design agentic workloads that are aware of what they cost. Those skills make you ready to pull the trigger the second you have a genuine reason, and they do not depreciate the way a GPU does the moment you open the box.
The honest answer here is the patient one. Most people should keep paying for the API, keep building, and let the hardware and the prices come back to them. Waiting is not falling behind. The models and the software are moving so fast that the patient builder who prototypes cheaply and learns the fundamentals is in a better spot than the person who dropped sixteen grand on a cluster this week. As always, stay curious, keep shipping, and let’s cook!

