I Want Local Inference. What Do?
I want local inference. What do?
It is the question I started asking myself in late 2024, and the one I now get asked by friends often enough that I owe them a real answer rather than the abbreviated version I tend to give over coffee. Mine took about eighteen months and four hardware generations to arrive at. Mac minis first. Then Mac Studios. Then a pair of NVIDIA-platform Sparks (the Asus variant, not the DGX-branded version). Then a custom 4-GPU workstation, which I wrote up separately in workstation-1: There Are Many Like It, But This One Is Mine. Each tier solved the wall the previous one had hit and surfaced a new one. This post is the path I actually took, the lessons that accumulated, and what each tier is genuinely good at — so you can decide where you want to get off the train.
One meta-note before the journey. When I started this in late 2024, there were no YouTube tutorials for clustering Sparks because Sparks did not exist yet, no good walkthroughs for Exo on Mac minis, and the documentation for what was then called MLX distributed was either nonexistent or a single forum post. Claude was the only co-designer I had that had read enough of the underlying docs and source to keep up. A lot of what I describe below was worked out by talking it through with a model that knew more than the open internet did at the time.
The minis
I started with Mac minis. Two M4s, 16GB of unified memory each, bought in late 2024 / early 2025. My mental model at the time was naïve in the specific way that someone who has not done their homework is naïve: I saw Exo, the distributed-inference framework everyone in the community was experimenting with, and I assumed the constraint I was solving was a compute constraint. If I could wire enough minis together, I figured, I could get an inference box capable of running a real model. This was before I had spent any time seriously researching what running a real model actually involves.
The minis taught me, very quickly, that the constraint was memory, not compute. 16GB of unified memory is not enough to load anything you would want to talk to. By the time you fit a usable model and its KV cache, you have left no room for the operating system to be happy, and the moment you ask the model to do real work the box starts to swap. No amount of distribution-layer cleverness rescues you from a memory ceiling that low.
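The memory arithmetic behind that ceiling can be sketched in a few lines. This is a rough estimate, not a measurement: the 14B model shape, the 4-bit quantization, the 32k context, and the 4GB of OS headroom are all hypothetical numbers chosen for illustration, and real runtimes add overhead on top.

```python
def model_footprint_gb(params_b, bits_per_weight, n_layers, n_kv_heads,
                       head_dim, context_tokens, kv_bytes=2):
    """Rough resident size of weights plus KV cache, in GB (1e9 bytes)."""
    weights = params_b * 1e9 * bits_per_weight / 8
    # Keys and values: 2 tensors per layer, one vector per KV head per token.
    kv_cache = 2 * n_layers * n_kv_heads * head_dim * kv_bytes * context_tokens
    return (weights + kv_cache) / 1e9

# Hypothetical 14B-parameter model, 4-bit weights, 16-bit KV cache, 32k context.
need = model_footprint_gb(params_b=14, bits_per_weight=4, n_layers=48,
                          n_kv_heads=8, head_dim=128, context_tokens=32_768)
os_reserve_gb = 4  # assumed headroom for the OS and everything else running

fits = need + os_reserve_gb <= 16
print(f"model needs ~{need:.1f} GB; fits in a 16 GB mini: {fits}")
```

Under those assumptions the weights are only about half the story; the KV cache for a long context is nearly as large again, and that is what pushes a 16GB box into swap.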
What I got out of the minis instead is a pair of small reliable computers that host some home monitoring, the occasional metrics exporter, and a few small personal projects. They are excellent at that. They are not your local-inference box; I did not get out of them what I had wanted to, but they did teach me the right first lesson.
Memory is the first wall.
The Studios
The lesson from the minis was that memory was the wall, not compute, so I bought memory. Two Mac Studios with M3 Ultra chips, 512GB of unified memory each, in late 2025 / early 2026. The framing wasn’t just that this was the largest unified-memory option Apple sold; it was that the Studios are the fastest and cheapest path to half a terabyte of memory you can actually serve from. The going price of 512GB of DDR5 ECC RDIMM alone in 2026 is roughly what a maxed Studio costs end-to-end, and the Studio bundles in an M3 Ultra and the rest of the machine on top of that. Two of them is a full terabyte of model space across the pair. If memory was the constraint, this was the answer the market actually offered.
In one respect it was. The Studios will load enormous models. I have run baa-ai/GLM-5.1-RAM-420GB-MLX — a 420GB MLX checkpoint of GLM-5.1 — distributed across the pair, using Exo over a Thunderbolt 5 link as the inter-node fabric. That is a model size you cannot run on consumer hardware, and the Studios did it without complaint, sitting quietly on the desk while serving tokens.
What I learned, sitting at the desk next to them, is that unified memory solves “can I load it” without solving “is it usable.” Generating a token means streaming the model’s active weights out of memory, so even with the M3 Ultra’s wide LPDDR5 memory bus, decoding is bandwidth-bound well below the speeds you actually want to serve tokens at. You get a lot of memory and modest throughput, which is the right tradeoff if your workload is “one user, one model, one stream of tokens at a time” and the wrong tradeoff for almost everything else. Adding concurrent streams does not win the throughput back. The Studios are slow under multi-stream load in a way that is structural, not tunable.
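A back-of-the-envelope version of the bandwidth argument, as a sketch under stated assumptions: it treats single-stream decoding as purely memory-bound, ignores KV-cache reads and compute, and uses Apple's published 819 GB/s peak figure for the M3 Ultra, which real workloads will not fully reach. The two model shapes are hypothetical.

```python
def decode_tokens_per_sec(bandwidth_gb_s, active_params_b, bits_per_weight):
    """Upper bound on single-stream decode speed when memory-bound.

    Each generated token has to stream every active weight from memory once,
    so tokens/s cannot exceed bandwidth divided by bytes read per token.
    """
    bytes_per_token = active_params_b * 1e9 * bits_per_weight / 8
    return bandwidth_gb_s * 1e9 / bytes_per_token

M3_ULTRA_GB_S = 819  # published peak; sustained bandwidth is lower in practice

# Hypothetical models, both at 4 bits per weight:
dense = decode_tokens_per_sec(M3_ULTRA_GB_S, active_params_b=70, bits_per_weight=4)
moe = decode_tokens_per_sec(M3_ULTRA_GB_S, active_params_b=40, bits_per_weight=4)

print(f"dense 70B:           at most ~{dense:.0f} tok/s")
print(f"MoE with 40B active: at most ~{moe:.0f} tok/s")
```

The useful part is the shape of the formula rather than the exact numbers: capacity decides what loads, but bytes read per token divided into bandwidth decides how fast it talks.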
Steve Jobs’s original framing for the Mac was a computer for the everyman — a unit of computer, the right size and the right complete shape for one person to use. The Studios are the inference version of that. A Studio is a unit of inference: one human, one model, sufficient bandwidth for sustained single-stream conversation. It is not a server. Treating it as one is fighting the architecture.
Treating them as what they actually are, on the other hand, gives them a real ongoing role. The Studios are the brains of my operation today: the place I schedule longer-running work, the place I send jobs where model capability matters more than wall-clock latency. Drafting reports. Checking output from smaller models on tasks I want a more capable model to verify. Running batch analyses I can come back to in an hour. Anything where I am not sitting at the keyboard waiting for the next token. They load any model I care about and they serve it patiently. That is a real role, and it is the one they fill in the fleet now.
The other wall the Studios surfaced was distribution. Exo over Thunderbolt 5 is functional, but it is flaky. The clustering breaks often enough that I keep a recovery runbook on hand for getting the RDMA links and MLX distributed back into a known state. It works lately. It did not work at the beginning, and I would not characterize it as a path of low operational cost. MLX itself is a smaller ecosystem than CUDA — fewer model checkpoints, fewer runtimes, less community velocity — and that matters more than I expected it to.
Memory was solved. Bandwidth became the second wall. Distributed clustering became the third.
The Sparks
The next decision was to get out of Mac land. I had been comfortable in MLX and Exo and Thunderbolt fabrics for a year, and the next thing to learn was the side of the industry that has spent twenty years figuring out how to ship GPU compute. NVIDIA. CUDA. The toolchain everyone else builds on. I bought two Asus Ascent GX10 units — the Asus-branded version of the GB10 platform, not the NVIDIA-branded DGX Spark. Two Asus units cost roughly what one DGX Spark costs. Same silicon, same firmware, a different sticker on the chassis.
The Sparks are GB10 Grace Blackwell Superchips with 128GB of unified LPDDR5x each, ConnectX-7 networking alongside a separate 10GbE port, and roughly 500 TFLOPS of FP4 compute. On paper they are a generation past the Studios on raw arithmetic, and in practice they live up to it. The real unlock was training. I had been training on the Studios; it is possible there, just slow enough that you do not want to iterate against it. The Sparks turned training from a thing I would do reluctantly into a thing I do as a default development loop. I am currently training Qwen 3.5-9B on a database-querying task, which is a clean, verifiable, easy-to-grade problem and a good frame to build a pipeline against.
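To make "verifiable and easy to grade" concrete, here is a minimal sketch of one way such a task can be scored: run the model's SQL and a reference query against the same database and compare the rows. This is an illustration, not my actual pipeline; the `books` table, the queries, and the function names are all hypothetical, and it uses SQLite from the Python standard library.

```python
import sqlite3

def run_query(conn, sql):
    """Return the result rows as a sorted list, or None if the SQL fails."""
    try:
        return sorted(conn.execute(sql).fetchall(), key=repr)
    except sqlite3.Error:
        return None

def grade(conn, predicted_sql, reference_sql):
    """Score 1.0 if both queries return the same rows (order-insensitive)."""
    expected = run_query(conn, reference_sql)
    got = run_query(conn, predicted_sql)
    return 1.0 if got is not None and got == expected else 0.0

# Hypothetical fixture database.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE books (title TEXT, year INTEGER);
    INSERT INTO books VALUES ('A', 1999), ('B', 2005), ('C', 2011);
""")
reference = "SELECT title FROM books WHERE year > 2000"

print(grade(conn, "SELECT title FROM books WHERE year >= 2005", reference))  # 1.0
print(grade(conn, "SELECT title FROM books", reference))                     # 0.0
print(grade(conn, "SELEC title FROM books", reference))                      # 0.0
```

Grading on result sets rather than on query text means two differently written but equivalent queries both score, and malformed SQL simply scores zero. A real harness would also want to run untrusted queries against a read-only copy of the database.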
Inference works on them too, and shines under concurrent load. Where a Studio gets slow with two simultaneous streams, two Sparks fan out across many. The book-scanning OCR pipeline I run is a perfect Spark workload: many pages, many requests, a modest model, long wall-clock runtime, embarrassingly parallel. The two Sparks chew through it in a way the Studios could not. They also produce no meaningful heat in the office even when training all day — for a machine that lives next to a desk, that is a real property.
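The shape of that embarrassingly parallel workload can be sketched with `asyncio`: alternate pages across the two nodes and cap the number of requests in flight. The endpoint URLs are hypothetical, and `ocr_page` is a stand-in that sleeps instead of making a real HTTP call, so treat this as the fan-out pattern only.

```python
import asyncio
from itertools import cycle

ENDPOINTS = ["http://spark-1:8000", "http://spark-2:8000"]  # hypothetical hosts

async def ocr_page(endpoint, page_id):
    """Stand-in for a real HTTP request to an inference server."""
    await asyncio.sleep(0.01)  # simulate request latency
    return f"{endpoint} -> text of page {page_id}"

async def run_pipeline(page_ids, max_in_flight=8):
    """OCR every page, alternating endpoints, with bounded concurrency."""
    limit = asyncio.Semaphore(max_in_flight)

    async def worker(endpoint, page_id):
        async with limit:  # at most max_in_flight requests run at once
            return await ocr_page(endpoint, page_id)

    tasks = [worker(ep, pid) for ep, pid in zip(cycle(ENDPOINTS), page_ids)]
    return await asyncio.gather(*tasks)  # results come back in page order

results = asyncio.run(run_pipeline(range(20)))
print(len(results), "pages done; first:", results[0])
```

Because no page depends on any other, the only tuning knob that matters is how many requests each node can batch before its own throughput flattens out.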
The wall they did not solve is the unified-memory wall. 128GB per node is more than enough for the Qwen-class models I want to train, but it caps what you can run at inference without distributed coordination. The Atlas / NVFP4 distributed path across two Sparks works, but it has the operational cost any distributed-inference path has, and it does not change the underlying bandwidth math. When I bought the Sparks there were no YouTube tutorials for clustering them; the documentation was sparse, the official examples assumed one node, and I worked the inter-node setup out with Claude over the course of a long afternoon. That is roughly how every step of this journey went.
Training and concurrent inference unlocked. Bandwidth and model size are still walls.
Workstation 1
The three options above are roughly the sub-$10K off-the-shelf landscape for local inference. Mac minis at the floor, a Mac Studio in the middle, an Asus Ascent GX10 (or a pair) at the top. There is a fourth route I am not covering here: the 3090-beast, a DIY rig stacked with secondhand NVIDIA consumer cards on a workstation platform. That is its own world — used-market dynamics, driver dramas, multi-GPU coordination, power and cooling decisions — and it is frequently the cheapest VRAM-per-dollar option in the sub-$10K range. It deserves its own post. If you want a sealed-box answer to “I want local inference, what do?” the three above are what you are choosing between. Everything more capable than that either costs meaningfully more or requires building your own from new silicon. The next step up the ladder is the latter, and that is what I did.
So I built one. Four RTX PRO 6000 Blackwell Max-Q cards in a custom water loop on a Threadripper PRO platform, 384GB of consolidated VRAM, real PCIe bandwidth, the CUDA toolchain at full force, no distributed coordinator in the hot path. The full build story — silicon, platform, power, cooling, the parts that broke — is in workstation-1: There Are Many Like It, But This One Is Mine. The short version is that consolidating into one machine on real GPUs solved the bandwidth and ecosystem walls the Sparks left behind, at the cost of getting comfortable with custom cooling and residential 20A power math.
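The VRAM total and the flavor of the 20A power math can be sketched as follows. The per-card figures (96GB, 300W for the Max-Q variant) are the published specs; the CPU power, the allowance for everything else, the PSU efficiency, the 120V circuit, and the 80% continuous-load derating are assumptions for illustration, not measurements from my build.

```python
GPUS = 4
VRAM_PER_GPU_GB = 96     # RTX PRO 6000 Blackwell
GPU_WATTS = 300          # Max-Q board power
CPU_WATTS = 350          # assumed Threadripper PRO TDP
OTHER_WATTS = 150        # assumed: pumps, fans, drives, memory
PSU_EFFICIENCY = 0.92    # assumed

total_vram_gb = GPUS * VRAM_PER_GPU_GB
dc_load = GPUS * GPU_WATTS + CPU_WATTS + OTHER_WATTS
wall_draw = dc_load / PSU_EFFICIENCY

# Rule of thumb: plan continuous loads at 80% of the breaker rating.
circuit_budget = 120 * 20 * 0.8  # volts * amps * derating (assumed 120V circuit)

print(f"VRAM: {total_vram_gb} GB")
print(f"estimated wall draw: {wall_draw:.0f} W of {circuit_budget:.0f} W budget")
print(f"headroom: {circuit_budget - wall_draw:.0f} W")
```

Under these assumptions the machine fits on a single 20A circuit with little to spare, which is why the circuit effectively has to be dedicated to it.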
Each tier in this journey solved the wall the previous tier hit, and exposed a new one. The minis surfaced the memory wall. The Studios solved memory and surfaced the bandwidth and distributed-clustering walls. The Sparks unlocked training and concurrent inference and left the bandwidth and model-size walls in place. The workstation solved bandwidth and ecosystem and brought a new ceiling: total VRAM is whatever fits in one machine.
The practical answer to “I want local inference, what do?” is to pick by which walls you can live with, not by which hardware looks most impressive.
- Mac minis are not your inference box. They are excellent at being small reliable computers — home monitoring, metrics, small personal services — and that is a real role. If your goal is local inference, do not start here.
- A Mac Studio is a unit of inference: one human, one model, sufficient speed for single-stream conversation, enormous memory for the price. Two Studios distributed gives you a terabyte of unified memory and a runbook for keeping the cluster up. Best if your workloads are single-user, MLX-compatible, and tolerant of wall-clock latency.
- A pair of Sparks (Asus or DGX) is the sweet spot for training and concurrent inference of medium-sized models. Training pipelines, OCR and batch processing, multi-user inference of anything that fits in 128GB per node. Quiet enough for an office.
- A custom workstation with new GPUs is the endgame: consolidated VRAM at PCIe speed, the full CUDA toolchain, no distributed coordinator in the hot path. Most expensive route, biggest local models, and you have to be willing to build it.
If I were starting again from scratch in mid-2026, I would take roughly the same path — but I would compress it. Skip the minis; the memory wall is well-documented enough now not to need a personal lesson. Start at one Mac Studio. Add a pair of Asus Sparks when training becomes a default development loop. Build a workstation when total throughput or model size finally crosses what two Sparks can do. Each tier still has a role in the fleet after you outgrow it for the original purpose. That is the honest version of the answer.