In early December I was working on an internal integration project — bidirectional sync between two databases, change streams, echo prevention, the whole thing. Something that would have taken a team a few months only last year came together in about four days.

I was bowled over! I’d been writing Go since 2016, and suddenly I was shipping production-quality infrastructure in days instead of weeks: full context threading through the whole app, OpenTracing support piggybacking on top of that, and a robust integration test framework that could verify the app’s behavior in a production-like setting.

The obvious next question was: what else could I build this fast?

What followed was a month-long experiment in “vibecoding” (as the kids call it): porting DefraDB, a Go database, to Rust, with the original Go test suite as an automated specification. This post is about what that process actually looked like in practice, the idea that made it possible, the tooling I built to manage it, how the work changed shape as the project matured, and what it taught me about where the real bottleneck in software development is.


Conception#

DefraDB is a P2P database with CRDT conflict resolution, about 120,000 lines of Go source backed by 210,000 lines of integration tests. I help build it at Source Network. We needed a Rust version for embedded devices, edge deployments, and WASM targets. A from-scratch port would normally take a one-pizza team multiple quarters.

But something about this codebase made it uniquely suited for AI-assisted porting: it has an extensive integration test suite. Every package has clear interfaces. The tests define expected behavior precisely: same inputs, same outputs, same error conditions. And critically, Go has excellent C FFI support through cgo, and Rust compiles to C-compatible shared libraries. That meant I could compile the Rust implementation as a shared library, link it into the Go test harness via FFI, and run the existing integration tests against the Rust code directly.

Same test. Same inputs. Same expected outputs. Different implementation language underneath.

This is what made the whole effort conceivable. DefraDB’s test suite is almost twice as large as the source code it tests — 210,000 lines of tests for 120,000 lines of implementation. I began looking at the test suite as a machine-readable specification of what “correct” means. The AI would generate the Rust implementation. The Go tests would keep it honest. I’d provide the architecture decisions. When FFI tests pass, the Rust implementation is correct by definition. When they fail, the failure message tells me exactly what’s wrong.

Before this project I didn’t write Rust. I’d read plenty of it: over years of working in blockchain I was constantly adjacent to Rust codebases, but I’d always been on the Go side of that divide.

i guess we doin rust now

That turned out to matter less than I expected.


Hello World#

January 12, 2026. First commit: “Initial defradb.rs project structure.” Then: “Implement CRDT subsystem with LWW, Counter, and Composite CRDTs.” Then tests for the CRDT subsystem. 46 commits that day.

The workflow was a worktree-per-package approach. Each Rust crate mapped to a Go package. I’d open the Go source, feed it to Claude along with the interfaces and test expectations, and iterate until the Rust implementation matched the behavior. Each module went through 5-6 review rounds — not rubber stamps, but real architectural review where I’d question package structure, naming, and testing approach, and repeatedly check back against the Go code.

By the end of week one, the core was taking shape: storage layer, CRDT engine, document model, schema system — plus the foundations of the query engine, HTTP API, and P2P networking. Week two brought integration: the FFI bridge, access control, and Go-Rust interoperability testing.

This early phase was fast and messy. I let Claude generate aggressively, accepting structural debt. Files grew to thousands of lines. Interfaces weren’t quite right. But we knew the behavior was correct because the Go implementation provided a guide we could follow.

Green = new features. Yellow = bug fixes. Cyan = refactoring. The two peaks: day one scaffolding (Jan 12) and the FFI test parity push (Jan 28).

The chart shows the two peaks that define the project’s rhythm: the initial scaffolding burst (Jan 12-16) and the FFI test parity push (Jan 28). The valley in between is where architecture gets harder and the easy wins run out.


Build your own tools#

Having a ready specification is one thing. Being able to see where my implementation stood against it was another.

Running FFI tests manually gets old fast. The first version was raw shell commands — set CGO flags, point at the right library path, run go test with the right tags. Easy to forget a flag. Easy to test against a stale build. That became a Makefile. The Makefile helped but didn’t scale: each test package needed its own invocation, tracking which packages passed on which branch was manual, and comparing results across runs meant reading terminal output.

So I built ffi-test — a Rust CLI that manages the whole loop. It builds the FFI library, runs Go integration tests against it, captures structured results, generates reports, and tracks progress across branches. Six subcommands: run, status, diff, logs, packages, worktree.

This is what the dashboard looks like today:

$ ffi-test status
FFI Test Status: main (all worktrees) @ 8ecf13c

Package                 Branch           Timestamp      Pass   Fail   Skip  Total   Rate
────────────────────────────────────────────────────────────────────────────────────────
acp                     main             02-10 19:50     345      0     48    393     87%
backup                  main             02-10 19:50      22      0      0     22    100%
collection              collection       02-10 14:42      19      0      2     21     90%
collection_version      collection       02-10 15:22     406      0      2    408     99%
encryption              encryption       02-10 18:11      32      0      6     38     84%
explain                 main             02-09 22:57     249      0      0    249    100%
index                   index            02-10 14:52     365      0      0    365    100%
mutation                perf             02-10 18:49     202      1     20    223     90%
net                     net              02-11 17:19     108      0      0    108    100%
query                   refactor-query   02-11 17:14     946      0      0    946    100%
subscription            main             02-10 19:18      13      0      0     13    100%
view                    main             02-10 19:51       1      0      0      1    100%
────────────────────────────────────────────────────────────────────────────────────────
TOTAL (102 packages)                                    2752      2     85   2839     96%

The branch column is key. Different features are tested on different worktrees, and the status view composites them into a single dashboard. The timestamp tells me how fresh each result is. I can drill into any package to see subpackage-level detail:

$ ffi-test status net
Package                                Branch   Timestamp      Pass   Fail   Skip  Total   Rate
───────────────────────────────────────────────────────────────────────────────────────────────
net/info                               net      02-11 15:37       4      0      0      4    100%
net/simple/peer                        net      02-11 14:46      13      0      0     13    100%
net/simple/peer/subscribe/collection   net      02-11 14:52      15      0      0     15    100%
net/simple/peer/subscribe/document     net      02-11 15:02      11      0      0     11    100%
net/simple/replicator                  net      02-11 17:18      19      0      0     19    100%
net/sync                               net      02-11 15:25       5      0      0      5    100%
net/sync/branchable_collection         net      02-11 15:35       5      0      0      5    100%
...
───────────────────────────────────────────────────────────────────────────────────────────────
TOTAL (15 packages)                                             108      0      0    108    100%

The iteration loop is: run ffi-test run query, see which tests fail, fix the Rust implementation, run it again. Reports are stored as JSON in ~/.defra-ffi-reports/, organized by branch and package. No CI server, no cloud infrastructure. Just a tight local feedback loop between the specification (Go tests) and the implementation (Rust code).

I found that at this pace, the tooling I built around the process mattered as much as the code itself. Claude can generate a module, but it can’t tell me which of 2,839 tests are failing or which worktree has stale results when running all the tests takes 5 hours. I needed a read-evaluate-print loop over the whole project — a dashboard that made the gap between where I was and where correct was visible at a glance.


Grinding out tests#

With the scoreboard running, the work became a series of worktrees in parallel. Pick a package, run the tests, fix what’s broken, watch the numbers climb. The commit messages tell the story:

344/387 — fix(ffi): Add cross-collection copy, relational ID, and self-ref validation
339/387 — fix(ffi): Improve collection version patching compatibility
405/433 — feat(query): Fix GroupBy/aggregate parity — 93.5%
411/433 — feat(query): Fix parser validation and operation name support — 94.9%
427/433 — feat(ffi): Add delete blocks, branchable collection blocks, and multi-doc CID queries
435/435 — fix(ffi): Achieve 100% query/simple test parity
308/308 — feat(index): Achieve 100% FFI index test parity

Not all packages were created equal. The hardest by far was net — some of the P2P networking tests were inherently flaky, several had long timeouts, and debugging distributed behavior across an FFI boundary was slow, frustrating work. Collection tests were second — they packed a huge number of features into each test (versioning, patching, schema evolution, branchable collections). Index tests were a different kind of hard: many index types had only a handful of tests each, so I was often implementing an entire feature to pass a few cases, then moving on to the next. Query had the most tests (946) but fewer distinct features — the same query planner exercised across thousands of edge cases.

Getting from 90% to 100% on any package took longer than getting from 0% to 90%. Query went from zero to 93% in two days, then took two more weeks to close the remaining 7% as the test count grew from 433 to 946. Some tests were initially skipped because we hadn’t implemented all the FFI functions needed to exercise them, so we slowly uncovered the whole suite as we worked.


Each layer is a crate. The project grows from 685 lines to 196,000 in 31 days. The dip on Jan 28 reflects branch work — FFI test parity was being built on a feature branch that hadn't merged the latest P2P and CLI work yet.

Twenty-two crates, built from nothing to 196,000 lines over 31 days. The query engine is the tallest layer — 44,000 lines, nearly a quarter of the total. You can see the branch rhythm in the dip on Jan 28: the FFI parity push was happening on a feature branch that hadn’t merged the latest P2P and CLI work yet. When the branches converged in early February, everything clicked back into place.


Early commits average 900-3500 lines — whole subsystems landing at once. By late January, commits drop to 200-500 lines as work shifts to targeted fixes against the FFI test suite.

Toggle between views: the default shows average lines changed per commit — early days averaged 900-3500 lines as whole subsystems dropped in, shrinking to 200-500 as work shifted to targeted fixes. The additions/deletions view shows the churn underneath: green bars are lines added, red bars are lines removed. The big refactoring days jump out — Jan 26-27 show net-negative changes as code got reorganized, and Feb 10-11 show massive symmetric churn as files got split during contraction.


Expand / Contract#

Looking at the whole month, a clear rhythm emerged. It wasn’t “AI writes code.” It was expand and contract.

The rhythm operated at two levels. Within a single PR, Claude would generate a ton of code, and review would trim it back — contract it. And at the macro level, across the whole project, the same expand-and-contract flow played out phase by phase.

Expand: generate a module fast. Accept structural debt. Let files grow large. Focus on behavioral correctness — do the tests pass? Don’t worry about the Rust being idiomatic or the files being well-organized. Get the logic right.

Contract: refactor. Split multi-thousand-line files into focused modules. Clean up interfaces. Remove dead code. Make the Rust actually readable. This is where architectural taste matters: knowing what a good module boundary looks like, where to draw the abstraction lines, what to name things.

The contraction phases are slower and more human-intensive. Claude can complete complex refactors and clean up code, but I still needed to decide when to clean up and what to clarify. That judgment came from ten years of building distributed systems, not from Rust expertise.


The query crate's expand/contract rhythm at the file level. Yellow area: average lines per file — peaks at 612 during the FFI push, then drops to 297 as files get split into focused modules. Cyan line: file count — holds steady during expansion, then jumps from 68 to 150 during contraction. Colored bands mark the project phases.

The query crate makes this visible at the file level. Average file size climbs steadily through the scaffolding and architecture phases — peaking at 612 lines per file during the FFI push. Files are fat with working but unstructured code. Then the splits start: 68 files become 150, average size drops to 297, and the crate keeps growing. The code didn’t shrink. It got reorganized.

730+ commits and counting. 193,000 lines of Rust across 22 crates. The velocity came from the contract phases being fast when I had strong opinions about structure, not from the expand phases being automated.


Using defradb.rs in anger#

Somewhere around week four, with the FFI dashboard mostly green, I started wanting to use the thing. Not just test it, run it. So I deployed the Rust binary to the Hetzner VPS that runs my side projects and started looking for something real to throw at it.

I’d been meaning to do something with my X archive for a while: Twitter lets you export everything — tweets, likes, DMs, the whole history. I had 16 years of data sitting in a zip file: 24,000 tweets and 136,000 liked tweets going back to 2009. That seemed like a reasonable stress test for a database I’d just built.

I wrote an import script that parses the JS-wrapped JSON from the archive export, builds GraphQL mutations, and fires them into DefraDB in parallel. Checkpoint files for resume on failure. Identity-scoped writes via the keyring. The whole import ran on the VPS against the Rust binary — not the Go version, the one that didn’t exist six weeks earlier.

Then P2P sync. DefraDB uses libp2p for CRDT replication, and by this point the net package was passing 100% of its FFI tests. I pointed my local Mac at the VPS peer and watched 160,000 documents replicate over. Same data, two nodes, one running code I’d never written a line of by hand.

Then I tested embeddings. DefraDB has a schema directive (@embedding) that generates vector embeddings via Ollama on document creation. I loaded nomic-embed-text on the VPS and let it chew through the archive — 768-dimensional vectors for semantic search across a decade and a half of tweets. CPU-only on the VPS, so about 0.4 tweets per second. My local Mac with a GPU did the same work at 200 tweets per second.

The final piece was an MCP server — a small FastMCP script that exposes the archive as tools Claude can call directly. Search by text, search by semantic similarity, pull threads, get stats. I dropped the config into this repo’s .mcp.json and suddenly Claude had access to my entire tweet history as structured context.

Which is how we got here. The Rust database I built with Claude is running on a VPS, storing my tweets, serving them through an MCP server, back to Claude, who helped me write this post about building it. The loop closed itself.

I’d tweeted about this approach during week one before I had the data to back it up. Now I do.


Learnings#

So what did a month of this teach me?

The discourse around AI-assisted development keeps centering on “vibecoding”: the idea that a description is enough and the AI builds it. That framing misses what’s actually interesting.

The bottleneck in software development isn’t typing speed. It’s knowing what correct looks like. Architecture, edge cases, failure modes, interface design. The accumulated judgment that turns a pile of functions into a system that serves a purpose for whoever deploys it. AI doesn’t replace that judgment. What it does is make it portable. I’ve spent years building distributed databases in Go. I know what a storage layer should look like, how CRDTs should merge, what a query planner needs to handle, what cryptographic libraries are good, etc… That knowledge transferred directly into managing an AI writing Rust, a language where I could read the code but couldn’t write it from scratch.

The deeper insight I took away: when an existing codebase has good tests, the specification is already written. The test suite is the spec. The FFI bridge is the verification. My role becomes architecture and acceptance, deciding what to build and confirming it was right. Claude handled the translation. This worked because DefraDB’s Go codebase is well-tested. It would not have worked on a codebase with poor test coverage or unclear interfaces. Garbage specs in, garbage code out, regardless of how good the model is.

And as models improve, the scope of what counts as “good enough specification” widens. The human role is shifting from writing code, to writing specifications, to doing acceptance testing on AI-generated specifications. Each step up the abstraction ladder means the same domain expertise covers more ground.


The road to v1.0#

The Rust port isn’t done. 96% test parity across 102 packages means there are still edge cases to chase, performance to profile, and contraction phases to finish on several crates.

What comes next is proving the Rust implementation in the places Go can’t go. The browser build is first: DefraDB compiled to WASM, running entirely client-side, giving users a fully verified data store with complete audit trails in the browser. Crypto wallets, local-first apps, anything that needs structured data with provenance guarantees — all without a server roundtrip.

The embedded build is the other target. DefraDB’s P2P replication and query engine in a package small enough for resource-constrained devices. The combination of CRDT sync, GraphQL queries, and a tiny memory footprint is what makes DefraDB differentiated — and Rust is the language that makes that combination practical.

The 1.0 milestone is compatibility guarantees backed by the same integration tests running the same CLI commands against both Go and Rust implementations, expecting identical results. The specification that got us here becomes the contract going forward.