Engineering · 12 min read

Building Local-First in 2026

Two years ago, building a local-first AI app meant choosing between Whisper (slow, accurate) or nothing. Today, we’re shipping Oatmeal with real-time transcription and LLM summaries running entirely on-device.

The gap between “this is theoretically possible” and “this ships as a product” is enormous. Here’s what we learned.

The stack

Before diving into challenges, here’s what we’re actually running:

Speech-to-text: Parakeet TDT 1.1B via parakeet-mlx. NVIDIA’s model, ported to Apple’s MLX framework. ~5.8% word error rate on clean audio. Processes faster than real-time on M1+.

LLM summaries: Qwen3-4B (4-bit quantized) via MLX-Swift. Embedded directly in the app—no Ollama, no separate process, no daemon management. Qwen3’s dual-mode capability lets us toggle “thinking” on/off per request—complex analysis gets reasoning, quick tasks stay fast.
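Qwen3 exposes that toggle as prompt-level soft switches (/think and /no_think), which is one straightforward way to wire it. In the sketch below, generate(prompt:) stands in for the actual MLX-Swift generation call, so treat it as illustrative rather than our exact integration:

```swift
// Hypothetical sketch of per-request "thinking" control for Qwen3.
// The /think and /no_think soft switches are documented Qwen3 behavior;
// generate(prompt:) is a placeholder for the real MLX-Swift call.
enum SummaryDepth {
    case quick      // skip chain-of-thought, answer fast
    case analytical // let the model reason before answering
}

func summaryPrompt(for transcript: String, depth: SummaryDepth) -> String {
    let toggle = (depth == .quick) ? "/no_think" : "/think"
    return """
    Summarize the following meeting transcript as concise bullet points. \(toggle)

    \(transcript)
    """
}

// Usage (placeholder call):
// let text = try await generate(prompt: summaryPrompt(for: transcript, depth: .analytical))
```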

Audio capture: Core Audio Taps for system audio (macOS 14.2+), AVAudioEngine for microphone. Two separate streams = free speaker diarization.
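The microphone half is stock AVAudioEngine. A trimmed sketch; the buffer size and the handler are illustrative, and the Core Audio tap for system audio is a separate capture path not shown here:

```swift
import AVFoundation

// Minimal microphone capture; the system-audio stream (Core Audio process
// taps, macOS 14.2+) is handled elsewhere. Buffer size is an assumption.
final class MicCapture {
    private let engine = AVAudioEngine()

    func start(onBuffer: @escaping (AVAudioPCMBuffer) -> Void) throws {
        let input = engine.inputNode
        let format = input.outputFormat(forBus: 0)
        // ~0.1 s of audio per callback at 44.1 kHz, forwarded to the STT engine.
        input.installTap(onBus: 0, bufferSize: 4096, format: format) { buffer, _ in
            onBuffer(buffer)
        }
        engine.prepare()
        try engine.start()
    }

    func stop() {
        engine.inputNode.removeTap(onBus: 0)
        engine.stop()
    }
}
```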

Storage: SQLite via GRDB (structured data) + SQLiteVec (vector embeddings). Local-first, no network required.
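The structured-data half looks roughly like this with GRDB; the schema below is illustrative, and the sqlite-vec virtual table that holds embeddings isn't shown:

```swift
import GRDB

// Sketch of the structured-data side only. Table and column names are
// assumptions; the embeddings live in a separate sqlite-vec table.
struct TranscriptSegment: Codable, FetchableRecord, PersistableRecord {
    static let databaseTableName = "transcriptSegment"
    var id: Int64?
    var meetingId: Int64
    var startTime: Double   // seconds from start of recording
    var text: String
}

func openDatabase(at path: String) throws -> DatabaseQueue {
    let dbQueue = try DatabaseQueue(path: path)
    try dbQueue.write { db in
        try db.create(table: "transcriptSegment") { t in
            t.autoIncrementedPrimaryKey("id")
            t.column("meetingId", .integer).notNull()
            t.column("startTime", .double).notNull()
            t.column("text", .text).notNull()
        }
    }
    return dbQueue
}
```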

Almost all Swift (the STT engine runs in a Python subprocess; more on that below). All on-device. No cloud API calls for core functionality.

Why MLX won

We evaluated every option: llama.cpp, Ollama, Core ML, raw PyTorch. MLX won for reasons that surprised us.

Performance. On Apple Silicon, MLX sustains ~230 tokens/sec with 5-7ms median latency. That’s 1.5x faster than llama.cpp in our benchmarks. The unified memory architecture means no CPU-GPU data transfer overhead—arrays just live in shared memory.

Swift integration. MLX-Swift isn’t a wrapper around a Python library. It’s a native Swift API that mirrors the Python one. We can build UI that responds to generation progress without bridging nightmares.

Lazy loading. Models load on-demand. First AI feature invoked? That’s when the LLM loads. Cold start is ~3 seconds on M1, faster on newer chips. Users who never touch summarization never pay that load cost.
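The mechanics are nothing exotic: hold the model behind something that loads it on first request and caches the result. A generic sketch; loadQwenSummarizer() is a placeholder for the real MLX-Swift loading code:

```swift
// Generic load-on-first-use wrapper. The Qwen loader in the usage comment
// is a placeholder, not our real API.
actor LazyModel<Model: Sendable> {
    private var task: Task<Model, Error>?
    private let load: @Sendable () async throws -> Model

    init(load: @escaping @Sendable () async throws -> Model) {
        self.load = load
    }

    // First caller pays the ~3 s cold start; later callers reuse the instance.
    func get() async throws -> Model {
        if let task { return try await task.value }
        let loader = load
        let task = Task { try await loader() }
        self.task = task
        return try await task.value
    }

    func unload() {
        task?.cancel()
        task = nil
    }
}

// let summarizer = LazyModel { try await loadQwenSummarizer() }
```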

The downside? Model load times are slower than GGUF-based solutions. And MLX only works on Apple hardware—fine for a macOS app, impossible if you need cross-platform.

Parakeet vs. Whisper

Everyone asks why not Whisper. The answer is straightforward: Parakeet is faster and more accurate for English.

NVIDIA’s Parakeet TDT 0.6B sits at #1 on the Hugging Face ASR leaderboard. The 1.1B variant we use has better accuracy with minimal speed penalty. On M1, we process 1 hour of audio in roughly 20 seconds—fast enough for real-time streaming with buffer room.

Whisper’s advantage is multilingual support and ecosystem maturity. If you need 50 languages, use Whisper. For English transcription where speed matters, Parakeet wins.

The parakeet-mlx port runs the model natively on Apple Silicon without the NVIDIA dependency. We spawn it as a Python subprocess managed by uv (more on that nightmare below).

The Python problem

Parakeet needs Python. Oatmeal is a Swift app. Bridging these worlds cleanly is genuinely hard.

What we tried first: PythonKit to embed a Python interpreter directly. This is theoretically elegant—single binary, no external dependencies. In practice, it’s a disaster. Dependency conflicts, GIL contention, crash-prone interop.

What we shipped: A JSON-RPC daemon. The Python STT engine runs as a subprocess, communicating over stdin/stdout. Swift sends audio chunks, Python returns transcripts with word-level timestamps.
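The Swift half of that bridge is plain Foundation: spawn the process, hold onto its pipes, write line-delimited JSON. The request shape and method name below are illustrative; only the stdin/stdout transport matches what we just described:

```swift
import Foundation

// Sketch of the Swift side of the stdin/stdout bridge. The "transcribe"
// method name and params are assumptions; the transport is line-delimited
// JSON-RPC over the subprocess's standard pipes.
final class STTDaemon {
    private let process = Process()
    private let stdinPipe = Pipe()
    private let stdoutPipe = Pipe()

    func start(pythonURL: URL, serverScript: URL) throws {
        process.executableURL = pythonURL
        process.arguments = [serverScript.path]
        process.standardInput = stdinPipe
        process.standardOutput = stdoutPipe
        try process.run()
    }

    // One JSON object per line in each direction.
    func send(_ request: [String: Any]) throws {
        var data = try JSONSerialization.data(withJSONObject: request)
        data.append(UInt8(ascii: "\n"))
        stdinPipe.fileHandleForWriting.write(data)
    }
}

// daemon.send(["jsonrpc": "2.0", "id": 1, "method": "transcribe",
//              "params": ["audio_base64": chunk.base64EncodedString()]])
```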

The trick is bootstrapping. Users don’t have Python environments configured correctly. We bundle uv (the fast Python package manager from Astral) and create an isolated venv on first run. The daemon downloads Parakeet weights from Hugging Face, sets up dependencies, and starts serving.
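Roughly what that bootstrap looks like from the Swift side. The uv subcommands (venv, pip install) are real; the exact arguments, paths, and error handling here are illustrative:

```swift
import Foundation

// First-run bootstrap sketch. Paths and arguments are assumptions; Parakeet
// weights download later, when the daemon first loads the model.
func run(_ tool: URL, _ arguments: [String]) throws {
    let p = Process()
    p.executableURL = tool
    p.arguments = arguments
    try p.run()
    p.waitUntilExit()
    guard p.terminationStatus == 0 else {
        throw NSError(domain: "Bootstrap", code: Int(p.terminationStatus), userInfo: nil)
    }
}

func bootstrapSTTEnvironment(bundledUV: URL, supportDir: URL) throws -> URL {
    let venv = supportDir.appendingPathComponent("stt-venv")
    // 1. Create an isolated virtual environment with the bundled uv binary.
    try run(bundledUV, ["venv", venv.path])
    // 2. Install the STT stack into that environment.
    let python = venv.appendingPathComponent("bin/python")
    try run(bundledUV, ["pip", "install", "--python", python.path, "parakeet-mlx"])
    return python
}
```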

This works, but it’s slow on first launch (~30-60 seconds to install everything) and adds 2GB+ of downloads. Not elegant. Very practical.

The first-run problem

On-device AI has a fundamental UX problem: the models need to exist somewhere.

For Oatmeal, first run means downloading:

  • Parakeet TDT 1.1B (~400MB)
  • Qwen3-4B 4-bit (~2.5GB)
  • Python dependencies (~200MB)

Total: roughly 3GB before the app is fully functional.

You can’t hide this. You can only make it bearable:

  1. Progressive download. Transcription works before summarization is ready. Users can start recording immediately while the LLM downloads in the background.

  2. Clear progress UI. “Downloading AI models (1.2 GB / 2.4 GB)” is better than a spinning wheel. Users understand big downloads take time.

  3. Graceful degradation. No network? App still opens, shows previous transcripts, records new audio. AI features just show “waiting for models.”

  4. Resume support. Partial downloads survive app restarts. Don’t make users re-download 1.5GB because they closed the window.
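Point 4 is mostly URLSession plumbing: cancel by producing resume data when the app quits, feed it back on the next launch. A sketch; where the resume blob gets persisted (the file URL below) is up to you:

```swift
import Foundation

// Resume-support sketch built on URLSession's resume-data APIs. A production
// version would use a delegate for progress reporting; omitted here.
final class ModelDownloader {
    private var task: URLSessionDownloadTask?
    private let resumeFile: URL

    init(resumeFile: URL) { self.resumeFile = resumeFile }

    func start(url: URL, completion: @escaping (URL?, Error?) -> Void) {
        let session = URLSession.shared
        let handler: (URL?, URLResponse?, Error?) -> Void = { location, _, error in
            completion(location, error)
        }
        // Pick up a partial download from a previous launch if we have one.
        if let data = try? Data(contentsOf: resumeFile) {
            task = session.downloadTask(withResumeData: data, completionHandler: handler)
        } else {
            task = session.downloadTask(with: url, completionHandler: handler)
        }
        task?.resume()
    }

    // Called on quit: stash resume data so the next launch can continue.
    func suspendForQuit() {
        let file = resumeFile
        task?.cancel(byProducingResumeData: { data in
            try? data?.write(to: file)
        })
    }
}
```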

The industry hasn’t settled on a pattern here. Jan AI and LM Studio do it well; most apps don’t. We copied Jan’s approach: treat model download as an explicit, visible step in onboarding.

What still sucks

Honest accounting of what’s hard:

Model staleness. Edge models don’t update like cloud APIs. We’ll ship with Qwen3-4B; when better models drop, users need to manually update or wait for an app update. There’s no good solution here.

Accuracy edge cases. Cloud ASR trained on billions of hours beats local models on heavy accents, domain jargon, and crosstalk. We’re honest in marketing: “competitive accuracy” means we’re close, not better.

Memory pressure. Running STT and LLM simultaneously on a base M1 (8GB) requires careful memory management. We unload the STT model before loading the LLM for summarization. Works, but adds latency.
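The pattern is boring but worth stating: one place owns which heavyweight model is resident, and a swap always unloads before it loads. A generic sketch with placeholder types (the real ones wrap parakeet-mlx and the MLX LLM):

```swift
// Placeholder model types standing in for the real STT and LLM wrappers.
struct SpeechModel { static func load() async throws -> SpeechModel { .init() } }
struct SummaryModel { static func load() async throws -> SummaryModel { .init() } }

// Serializes the "one heavyweight model resident at a time" rule for 8 GB Macs.
// The swap is where the extra latency mentioned above comes from.
actor ModelResidency {
    private var speech: SpeechModel?
    private var summary: SummaryModel?

    func speechModel() async throws -> SpeechModel {
        if let speech { return speech }
        summary = nil                        // free the LLM before loading STT
        let model = try await SpeechModel.load()
        speech = model
        return model
    }

    func summaryModel() async throws -> SummaryModel {
        if let summary { return summary }
        speech = nil                         // unload STT before the 4-bit LLM
        let model = try await SummaryModel.load()
        summary = model
        return model
    }
}
```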

Testing. How do you CI test a system that requires GPU access and 2GB of model weights? We don’t have a good answer. Integration tests run locally on dev machines. Unit tests mock the AI boundaries.
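The mocking itself is ordinary dependency injection; the names below are illustrative, not our real API:

```swift
import Foundation

// Treat the AI layer as an I/O boundary: production code depends on a
// protocol, unit tests inject a canned implementation.
struct TranscriptLine {
    let start: TimeInterval
    let text: String
}

protocol Transcribing {
    func transcribe(audio: Data) async throws -> [TranscriptLine]
}

// The production conformance wraps the Parakeet daemon (not shown here).

// Test double: returns fixtures instantly, no GPU, no model weights.
struct MockTranscriber: Transcribing {
    var canned: [TranscriptLine] = [
        TranscriptLine(start: 0.0, text: "Let's start with the roadmap."),
        TranscriptLine(start: 4.2, text: "Shipping is blocked on the model download."),
    ]

    func transcribe(audio: Data) async throws -> [TranscriptLine] {
        canned
    }
}
```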

Documentation. MLX-Swift docs are improving but sparse. Half our understanding came from reading source code. parakeet-mlx has almost no documentation—we forked it to fix bugs.

What’s getting better

The trajectory is encouraging:

Hardware acceleration. M5’s neural engine provides 4x speedup over M4 for LLM inference. The gap between local and cloud is shrinking with every chip generation.

Framework maturity. Apple gave MLX dedicated sessions at WWDC 2025 and is treating it as a strategic priority, not a side project. Expect better tooling, more model ports, tighter Xcode integration.

Model efficiency. Parakeet 0.6B nearly matches Whisper Large (1.55B) in accuracy. Qwen3-4B beats Llama-3.2-3B on every benchmark while adding dual-mode reasoning. Smaller, faster, good-enough models are the future.

Apple’s Foundation Models. macOS 26 and iOS 26 expose Apple’s on-device LLM to developers for free. We haven’t integrated yet, but it’s coming. Zero-cost, zero-latency LLM access as a platform feature changes the calculus for every app.

Advice for builders

If you’re considering local-first AI:

  1. Pick a focused use case. “Local ChatGPT” is a losing proposition—cloud will always be more capable. “Local transcription for privacy-conscious users” is defensible.

  2. Start with MLX if macOS-only. It’s faster than alternatives on Apple Silicon and Swift integration actually works.

  3. Budget for first-run UX. You’ll spend more time on model downloading, progress indication, and error handling than on the AI integration itself.

  4. Mock the AI layer aggressively. Your tests shouldn’t require GPU access. Treat AI models as I/O boundaries and mock them.

  5. Ship with version-pinned models. Don’t auto-update model weights. Stability matters more than chasing the latest release.

Why bother?

Given the complexity, why not just call the OpenAI API?

Because the use case demands it.

Meeting transcription involves confidential conversations—investor calls, HR discussions, legal meetings. Users who care about this won’t upload to cloud APIs. Local processing isn’t a nice-to-have; it’s the entire value proposition.

And because it’s finally possible.

Two years ago, real-time local transcription was science fiction for consumer hardware. Today, it runs on a MacBook Air. The window for building local-first AI apps is now—before cloud giants integrate everything into platforms and close the opportunity.