Very impressed with this report. Whenever I read TigerBeetle's claims on reliability and scalability, I'd think "ok, let's wait for the Jepsen report".
This report found a number of issues, which might be a cause for concern. But I think it's a positive, because they didn't just fix the issues: they expanded their internal test suite to catch similar bugs in future. With such an approach to engineering, I feel like in 10 years TigerBeetle will have achieved the "just use Postgres" level of default database in its niche of financial applications.
Also great work aphyr! I feel like I learned a lot reading this report.
jorangreef 11 hours ago [-]
Thanks!
Yes, we have 6,000+ assertions in TigerBeetle. A few of these were overtight, hence some of the crashes. But those were the assertions doing their job, alerting us that we needed to adjust our mental model, which we did.
Otherwise, apart from a small correctness bug in an internal testing feature we added (only in our Java client, and only for Jepsen to facilitate the audit), there was only one correctness bug found by Jepsen, and it didn’t affect durability. We’ve written about it here: https://tigerbeetle.com/blog/2025-06-06-fuzzer-blind-spots-m...
Finally, to be fair, TigerBeetle can survive (and is tested to survive) more faults than Postgres can, since it was designed with an explicit storage fault model, using research that was not available when Postgres was released in ‘96. TB’s fault models are further tested with Deterministic Simulation Testing, and we use techniques such as static memory allocation following NASA’s Power of Ten Rules for Safety-Critical Code. There are known scenarios in the literature that will cause Postgres to lose data, which TigerBeetle can detect and recover from.
For more on this, see the section in Kyle’s report on helical fault injection (most Raft and Paxos implementations were not designed to survive this) as well as a talk we gave at QCon London: https://m.youtube.com/watch?v=_jfOk4L7CiY
jrpelkonen 10 hours ago [-]
Hi Joran,
I have followed TigerBeetle with interest for a while, and thank you for your inspirational work and informative presentations.
However, you have stated on several occasions that the lack of memory safety in Zig is not a concern since you don't dynamically allocate memory post startup. Yet one of the defects uncovered here (#2435) was caused by dereferencing an uninitialized pointer. I find this pretty concerning, so I wonder if there is something you will be doing differently to eliminate all similar bugs going forward?
AndyKelley 6 hours ago [-]
TigerBeetle uses ReleaseSafe optimization mode, which means that the pointer was in fact initialized to 0xaaaaaaaaaaaaaaaa. Since nothing is mapped to this address, it reliably causes a segfault. This is equivalent to an assertion failure.
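For illustration, a minimal standalone Zig sketch (not TigerBeetle code) of the behavior Andrew describes:

```zig
const std = @import("std");

// Build with `zig build-exe example.zig -O ReleaseSafe`. In Debug and
// ReleaseSafe modes, Zig fills `undefined` memory with the byte pattern
// 0xaa, so an uninitialized pointer holds 0xaaaaaaaaaaaaaaaa on 64-bit
// targets. That address is never mapped, so dereferencing it faults
// immediately -- an effect equivalent to a failed assertion, rather
// than a silent read of stale data.
pub fn main() void {
    const ptr: *u32 = undefined; // 0xaa bytes in safe build modes
    std.debug.print("ptr = {*}\n", .{ptr}); // prints u32@aaaaaaaaaaaaaaaa
    // ptr.* = 1; // uncommenting this reliably segfaults in ReleaseSafe
}
```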
jrpelkonen 6 hours ago [-]
That’s good to hear! Thanks for the clarification.
matklad 9 hours ago [-]
Note that that's a bug in the client, in the Zig-Java FFI code, which is inherently unsafe. We'd likely have made a similar bug in Rust.
Which is, yeah, one of the bigger technical challenges for us --- we ship language-native libraries for Go, Node, Java, C#, Python, and Rust, and, as in the Tolstoy novel, each one is peculiar in its own way. What's worse, they aren't directly covered by our deterministic simulator. That's one of the major reasons why we invest in full-system simulation with Jepsen, Antithesis, and Vortex (https://tigerbeetle.com/blog/2025-02-13-a-descent-into-the-v...). We are also toying with the idea of generating _more_ of that code, so there's less room for human error. Maybe one day we'll even do fully native clients (e.g., pure Java, pure Go), but we are not there yet.
One super-specific in-progress thing: at the moment, the _bulk_ of the client testing is duplicated per client, and also the _bulk_ of the testing is example-based. Building a simulator/workload is a lot of work, and duplicating it for each client is unreasonable. What we want to do here is to use a multi-process architecture, where a single Zig process generates the workloads and interesting sequences of commands for clients, and then in each client we implement just a tiny "interpreter" for the workload language, getting a test suite for free. This is still WIP though!
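As a rough illustration of that interpreter idea (hypothetical command set and dispatch, not TigerBeetle's actual workload language, which as noted is still WIP):

```zig
const std = @import("std");

// A central workload-generator process would stream commands like these to
// each language client; the client then only needs this tiny interpreter
// loop instead of its own full test suite.
const Command = enum { create_account, create_transfer, lookup_account };

fn execute(cmd: Command, arg: u128) void {
    // In a real client these branches would call the native client library.
    switch (cmd) {
        .create_account => std.debug.print("create_account id={}\n", .{arg}),
        .create_transfer => std.debug.print("create_transfer id={}\n", .{arg}),
        .lookup_account => std.debug.print("lookup_account id={}\n", .{arg}),
    }
}

pub fn main() void {
    // Stand-in for commands received from the generator process.
    const script = [_]struct { cmd: Command, arg: u128 }{
        .{ .cmd = .create_account, .arg = 1 },
        .{ .cmd = .create_account, .arg = 2 },
        .{ .cmd = .create_transfer, .arg = 100 },
        .{ .cmd = .lookup_account, .arg = 1 },
    };
    for (script) |step| execute(step.cmd, step.arg);
}
```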
Regarding the broader memory safety issue in the database: we did have a couple of memory safety bugs, which were caught early in testing. We did have one very bad aliasing bug, which would have been totally prevented by Rust, and which slipped through the bulk of our testing and into the release (it was caught in testing _after_ it was introduced): https://github.com/tigerbeetle/tigerbeetle/pull/2774. Notably, while the bug was bad enough to completely mess up our internal data structure, it was immediately caught by an assert down the line, and downgraded from a correctness issue to a small availability issue (just restarting the replica would fix it). Curiously, the root cause of that bug was that we over-complicated our code. Long before the actual bug, we felt uneasy about the data structure in question and thought about refactoring it away (that refactor is underway; hilariously, it looks like just "removing" the thing, without any other code changes, improves performance!).
So, on balance, yeah, Rust would've prevented a small number of easy bugs, and one gnarly bug, but then the entire thing would have to look completely different, as the architecture of TigerBeetle is not at all Rust-friendly. I'd be curious to see someone replicate the single-threaded, io_uring, no-malloc-after-startup architecture in Rust! I personally don't know off the top of my head whether that would work or not.
jcalabro 8 hours ago [-]
I remember reading a similar thing about FoundationDB with their DST a while back. Over time, they surfaced relatively few bugs in the core server, but found a bunch in the client libraries because the clients were more complicated and were not run under their DST.
Anyways, really interesting report and project. I also like your youtube show - keep up the great work! :)
matklad 6 hours ago [-]
Oh, important clarification from andrewrk (https://lobste.rs/c/tf6jng), which I totally missed myself: this isn't actually a dereference of an uninitialized pointer, it's a dereference of a pointer which is explicitly set to a specific, invalid value.
jrpelkonen 6 hours ago [-]
This is indeed an important point; the way I originally understood the bug was that the memory was not initialized at all. Thanks for the clarification.
anarazel 2 hours ago [-]
> There are known scenarios in the literature that will cause Postgres to lose data, which TigerBeetle can detect and recover from.
What are you referencing here?
jorangreef 2 hours ago [-]
The scenarios described in our QCon London talk linked above.
This surveys the excellent storage fault research from UW-Madison, and in particular:
“Can Applications Recover from fsync Failures?”
“Protocol-Aware Recovery for Consensus-Based Storage”
Finally, I'd recommend watching “Consensus and the Art of Durability”, our talk from SD24 in NYC last year: https://www.youtube.com/watch?v=tRgvaqpQPwE
> [disks are] somewhere between non-Byzantine fault tolerance and Byzantine fault tolerance ... you expect the disk to be almost an active adversary ...
> ...
> so you start to see just a single disk as a distributed system
My goodness, not at all! If you can't trust the interface to a local disk, then you're lost at a fundamental level. And even ignoring that, a disk is an implementation detail of a node in a distributed system; whatever properties that disk may have are visible only to that local node, are irrelevant in the context of the broader system, and are the local node's responsibility to manage before it communicates anything with other nodes in that broader system.
Combined with https://www.youtube.com/watch?v=tRgvaqpQPwE it seems like the author/presenter is conflating local/disk-related properties/details with distributed/system-based requirements/guarantees. If consensus requires a node to have durably persisted some bit of state before it sends a particular message to other nodes in the distributed system, then it doesn't matter how that persistence is implemented; it only matters how that persistence is observable. Disks, FS caches, etc. aren't requirements, they're just one of many possible implementation choices.
SOLAR_FIELDS 11 hours ago [-]
I always get excited to read Kyle’s write ups. I feel like I level up my distributed systems knowledge every time he puts something out.
jitl 10 hours ago [-]
Really happy to see TigerBeetle live up to its claims as verified by aphyr - because it's good to see that when you take the right approach, you get the right results.
Question about how people end up using TigerBeetle. There's presumably a lot of external systems and other databases around a TigerBeetle install for everything that isn't an Account or Transfer. What's the typical pattern for those less reliable systems to square up to TigerBeetle, especially to recover from consistency issues between the two?
jorangreef 4 hours ago [-]
Joran from TigerBeetle here! Thanks! Really happy to see the report published too.
The typical pattern for integrating TigerBeetle is to differentiate between the control plane (Postgres for general-purpose workloads, or OLGP) and the data plane (TigerBeetle for transaction processing, or OLTP).
All your users (names, addresses, passwords etc.) and products (descriptions, prices etc.) then go into OLGP as your "filing cabinet".
And then all the Black Friday transactions these users (or entities) make, to move products from inventory accounts to shopping cart accounts, and from there to checkout and delivery accounts—all these go into OLTP as your "bank vault". TigerBeetle lets you store up to 3 user data identifiers per account or transfer, to link events (between entities) back to your OLGP database, which describes these entities.
This architecture [1] gives you a clean "separation of concerns", allowing you to scale and manage the different workloads independently. For example, if you're a bank, it's probably a good idea not to keep all your cash in the filing cabinet with the customer records, but rather to keep the cash in the bank vault, since the information has different performance/compliance/retention characteristics.
This pattern makes sense because users change their name or email address (OLGP) far less frequently than they transact (OLTP).
Finally, to preserve consistency on the write path, you treat TigerBeetle, the OLTP data plane, as your "system of record". When a "move to shopping cart" or "checkout" transaction comes in, you first write any data dependencies to OLGP (and, say, S3 if you have related blob data), and then finally you commit your transaction by writing to TigerBeetle. On the read path, you query your system of record first, preserving strict serializability.
Does that make sense? Let me know if there's anything here we can drill into further!
[1] https://docs.tigerbeetle.com/coding/system-architecture/
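For concreteness, a minimal sketch of that write-path ordering, with hypothetical stand-ins (writeOrderMetadata, createTransfer) for the real Postgres and TigerBeetle client calls:

```zig
const std = @import("std");

// Dependencies go to the general-purpose store (OLGP) first; the transfer
// into TigerBeetle (OLTP) is the final, atomic "commit point".
fn writeOrderMetadata(order_id: u128) !void {
    // Stand-in for an INSERT into Postgres (and any S3 blob writes).
    std.debug.print("OLGP: stored metadata for order {}\n", .{order_id});
}

fn createTransfer(order_id: u128, amount: u64) !void {
    // Stand-in for a create_transfers call via a TigerBeetle client.
    std.debug.print("OLTP: committed transfer {} amount {}\n", .{ order_id, amount });
}

pub fn main() !void {
    const order_id: u128 = 42;
    try writeOrderMetadata(order_id); // 1. write data dependencies first
    try createTransfer(order_id, 99); // 2. then commit: TB is system of record
    // On the read path, query TigerBeetle first to preserve strict
    // serializability, then join against OLGP rows via the stored ids.
}
```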
This is a particularly fun Jepsen report after reading their fuzzer blind spots post.
It looks like the segfaults on the JNI side would not have been prevented if Rust or some other memory-safe language were being used - and the lack of memory safety bugs elsewhere gives some decent proof that TigerBeetle's approach to Zig programming (TigerStyle iirc, lol) does what it sets out to do.
matklad 9 hours ago [-]
See https://news.ycombinator.com/item?id=44201189. We did have one bug where Rust would've saved our bacon (instead, the bacon was saved by an assertion, so it was just slightly crispy, not charred).
EDIT: But, yeah, totally, if not for TigerStyle, we'd die to nasal demons!
12_throw_away 4 hours ago [-]
A small appreciation for the section entitled "Panic! At the Disk 0": <golf clap>
FlyingSnake 7 hours ago [-]
Love the wonderfully detailed report. Getting it tested and signed off by Jepsen is such a huge endorsement for TigerBeetle. It’s not even reached v1.0, and I can’t wait to see it hit new milestones in the future.
Special kudos to the founders who are sharing great insights in this thread.
jorangreef 49 minutes ago [-]
Yes, Kyle did an incredible job and I also love the detail he put into the report. I kept saying to myself: “this is like a work of art”, the craftsmanship and precision.
Appreciate your kind words too, and I look forward to sharing something new in our talks at SD25 in Amsterdam soon!
ryeats 8 hours ago [-]
I think it is interesting, but obvious in hindsight, that the distributed system under test needs to report the time/order in which things actually happened, to enable accurate validation against an external model of the system instead of using wall-clock time.
matklad 8 hours ago [-]
Note that this works because we have strict serializability. With weaker consistency guarantees, there isn't necessarily a single global consistent timeline.
This is an interesting meta pattern where doing something _harder_ actually simplifies the system.
Another example is that, because we assume that the disk can fail and need to include a repair protocol, we get state synchronization for a lagging replica "for free", because it is precisely the same situation as when the entire disk gets corrupted!
aphyr 7 hours ago [-]
To build on this--this is something of a novel technique in Jepsen testing! We've done arbitrary state machine verification before, but usually that requires playing forward lots of alternate timelines: one for each possible ordering of concurrent operations. That search (see the Knossos linearizability checker) is an exponential nightmare.
In TigerBeetle, we take advantage of some special properties to make the state machine checking part linear-time. We let TigerBeetle tell us exactly which transactions happen. We can do this because it's a.) strong serializable, b.) immutable (in that we can inspect DB state to determine whether an op took place), and c.) exposes a totally ordered timestamp for every operation. Then we check that that timestamp order is consistent with real-time order, using a linear-time cycle detection approach called Elle. Having established that TigerBeetle's claims about the timestamp order are valid, we can apply those operations to a simulated version of the state machine to check semantic correctness!
I'd like to generalize this to other systems, but it's surprisingly tricky to find all three of those properties in one database. Maybe an avenue for future research!
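A minimal sketch of that check, with hypothetical types (the real checker is Jepsen's Clojure code, and Elle uses linear-time cycle detection rather than this quadratic scan):

```zig
const std = @import("std");

// Each completed op carries the real-time window in which it executed plus
// the total-order timestamp the database assigned. If op A completed in real
// time before op B was invoked, A's database timestamp must be smaller.
// Having validated that, ops can be replayed in timestamp order against a
// simulated state machine to check semantics.
const Op = struct {
    invoke_ns: u64, // real time the client issued the op
    complete_ns: u64, // real time the client saw it complete
    db_timestamp: u64, // the database's total-order timestamp
};

fn checkRealTimeOrder(ops: []const Op) bool {
    // Assumes ops are sorted by invocation time. O(n^2) for clarity only.
    for (ops, 0..) |a, i| {
        for (ops[i + 1 ..]) |b| {
            if (a.complete_ns < b.invoke_ns and a.db_timestamp > b.db_timestamp)
                return false; // timestamp order contradicts real-time order
        }
    }
    return true;
}

pub fn main() void {
    const ops = [_]Op{
        .{ .invoke_ns = 0, .complete_ns = 10, .db_timestamp = 1 },
        .{ .invoke_ns = 20, .complete_ns = 30, .db_timestamp = 2 },
    };
    std.debug.print("consistent: {}\n", .{checkRealTimeOrder(&ops)});
    // Next step (not shown): sort by db_timestamp and apply each op to a
    // simulated ledger state machine, comparing its results to the real ones.
}
```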
eevmanu 7 hours ago [-]
I have a question that I hope is not misinterpreted, as I'm asking purely out of a desire to learn. I am new to distributed systems and fascinated by deterministic simulation testing.
After reading the Jepsen report on TigerBeetle, the related blog post, and briefly reviewing the Antithesis integration code in the GitHub workflow, I'm trying to better understand the testing scope.
My core question is: could these bugs detected by the Jepsen test suite have also been found by the Antithesis integration?
This question comes from a few assumptions I made, which may be incorrect:
- I thought TigerBeetle was already comprehensively tested by its internal test suite and the Antithesis product.
- I had the impression that the Antithesis test suite was more robust than Jepsen's, so I was surprised that Jepsen found an issue that Antithesis apparently did not.
I'm wondering if my understanding is flawed. For instance:
1. Was the Antithesis test suite not fully capable of detecting this specific class of bug?
2. Was this particular part of the system not yet covered by the Antithesis tests?
3. Am I fundamentally comparing apples and oranges, misunderstanding the different strengths and goals of the Jepsen and Antithesis testing suites?
I would greatly appreciate any insights that could help me understand this better. I want to be clear that my goal is to educate myself on these topics, not to make incorrect assumptions or assign responsibility.
matklad 3 hours ago [-]
To add to what aphyr says, you generally need three components for generative testing of distributed systems:
1. Some sort of environment, which can run the system. The simplest environment is to spin up a real cluster of machines, but ideally you want something fancier, to improve performance, control over responses of external APIs, determinism, reproducibility, etc.
2. Some sort of load generator, which makes the system in the environment do interesting things
3. Some sort of auditor, which observes the behavior of the system under load and decides whether the system behaves according to the specification.
Antithesis mostly tackles problem #1, providing a deterministic simulation environment as a virtual machine. The same problem is tackled by Jepsen (by using real machines, but injecting faults at the OS level), and by TigerBeetle's own VOPR (which is co-designed with the database, and for that reason can run the whole cluster on just a single thread). These three approaches are complementary and are good at different things.
For this bug, the critical parts were #2 and #3 --- writing a workload generator and auditor that can actually trigger the bug. Here, it was aphyr's 1600 lines of TigerBeetle-specific Clojure code that triggered and detected the bug (and then we patched _our_ equivalent to also trigger it). Really, what's buggy here is not the database, but the VOPR. A database having bugs is par for the course; you can't avoid bugs through sheer force of will. So you need a testing strategy that can trigger most bugs, and any bug that slips through points to a deficiency in the workload generator.
https://github.com/jepsen-io/tigerbeetle/blob/main/src/jepse...
aphyr 2 hours ago [-]
And honestly--designing a generator for a system like this is hard. Really hard. I struggled for weeks to get something that didn't just fail 99% of requests trivially, and it's an (ahem) giant pile of probabilistic hacks. So I wouldn't be too hard on the various TB test generators here!
Yeah, TigerBeetle's blog post goes into more detail here, but in short, the tests that were running in Antithesis (which were remarkably thorough) didn't happen to generate the precise combination of intersecting queries and out-of-order values that were necessary to find the index bug, whereas the Jepsen generator did hit that combination.
There are almost certainly blind spots in the Jepsen test generators too--that's part of why designing different generators is so helpful!
eevmanu 6 hours ago [-]
Thanks for your answer, aphyr, and for this amazing analysis.
jorangreef 3 hours ago [-]
(Note also that 90% of our deterministic simulation testing is done by the VOPR, TigerBeetle's own deterministic simulator, which we built in-house and which runs on a fleet of 1,000 dedicated CPU cores 24/7. We also use Antithesis, but as a second layer of DST.)
To understand why the query engine bug slipped through, see: https://tigerbeetle.com/blog/2025-06-06-fuzzer-blind-spots-m...
TigerBeetle is something I’m interested in. I see there is no C or Zig client listed in the clients documentation. I thought these would be the first ones to exist, given it is written in Zig. Do they exist, or are they maybe still WIP?
koakuma-chan 12 hours ago [-]
Curious whether they've got any large bank or stock exchange using TigerBeetle
jorangreef 11 hours ago [-]
Joran, creator and CEO from TigerBeetle here!
At a national level, we’re working with the Gates Foundation to integrate TigerBeetle into their non-profit central bank switch that will be powering Rwanda’s National Digital Payments System 2.0 later this year [1].
At an enterprise level, TigerBeetle already powers customers processing 100M+ transactions per month in production, and we recently signed our first $2B fintech unicorn in Europe, with a few more in the US about to close. Because of the move to realtime transaction processing around the world [2], there’s been quite a bit of interest from companies wanting to move to TigerBeetle for more performance.
Finally, to your question, some of the founders of Clear Street, a fairly large brokerage on Wall Street, have since invested [3] in TigerBeetle.
[1] https://mojaloop.io/how-mojaloop-enables-rndps-2-0-ekash/
[2] https://tigerbeetle.com/blog/2024-07-23-rediscovering-transa...
[3] https://tigerbeetle.com/company
Have you had a difficult time convincing customers to use a product written in a pre-1.0 programming language?
jorangreef 4 hours ago [-]
Zig's pre-1.0 status refers more to API stability than to quality. The language and tooling already have, at least in my own experience, more quality than if we had picked C, which was the only other choice available to us when we made the decision to invest in Zig's trajectory back in 2020, given that we needed to do static allocation and that any sort of global allocator was out of the question.
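For readers unfamiliar with that static-allocation style, a minimal generic Zig sketch (not TigerBeetle code): all capacity is fixed by explicit limits and initialized once at startup, so no allocator is ever consulted afterwards.

```zig
const std = @import("std");

// Explicit limit chosen up front, in the spirit of NASA's Power of Ten rules.
const max_clients = 64;

const Client = struct { id: u32, in_use: bool = false };

// The whole pool lives in static memory; there is no malloc at any point.
var clients: [max_clients]Client = undefined;

pub fn main() void {
    // Initialize the pool once at startup; after this, no allocation occurs,
    // and "out of clients" is an explicit, testable condition, not an OOM.
    for (&clients, 0..) |*c, i| c.* = .{ .id = @intCast(i) };
    std.debug.print("pool ready: {} slots\n", .{clients.len});
}
```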
But, no. On the commercial side, I don't think we've had one conversation with a prospect or CTO or engineering team where they were concerned that we picked a systems language for the next thirty years. And while Zig is a beautiful, perfect replacement for C, I think the real reason the question has never come up is that our customers come to us, instead of us going to them. We're not trying to convince anyone. They already appreciate the extensive end-to-end testing we do on everything we ship.
However, I should emphasize again, that given all the assertions, fuzzing and DST we do, Zig's quality can't be overstated. It holds up.
matklad 8 hours ago [-]
From the user's perspective, this doesn't matter at all. Zig is an implementation detail: what we actually ship is a fully statically linked native executable for the database, and a "links only libc" (because thread locals!) .a/.so native "C" library for clients. Nothing will change for the user if we decide to rewrite the thing in Rust, or C, or Hare; nothing Zig-specific leaks out.
From the developer's perspective, the big thing is that we don't have any dependencies, so updating the compiler is just a small amount of work once in a while, and not your typical ecosystem-wide coordination problem. Otherwise, Zig's pretty much "finished" for our use-case; it more or less just works.
diggan 11 hours ago [-]
> some of the founders of Clear Street, a fairly large brokerage on Wall Street have since invested [3] in TigerBeetle
"Invested" in terms of "giving you money" or in terms of "Now uses the database themselves"? I read it as the first, but I think the question is about usage, not investments.
jorangreef 11 hours ago [-]
Both. In terms of investing and planning to migrate.
thomaspaine 8 hours ago [-]
I work on the ledgering system at clear street and as far as I know we have no plans to do this. We evaluated it internally a few years ago and found that the account and transaction model was too different from ours to migrate over.
jorangreef 7 hours ago [-]
Hi Thomas, yes, I was there. However, this is something that Sachin and I subsequently discussed last year (Sachin recently provided the TPS footnote used in the report here). I understand the roadmap may since have changed, but this is to the best of my knowledge.
sachnk99 6 hours ago [-]
Hi -- Sachin here, one of the founders of Clear Street. To clarify:
- The investment in TigerBeetle was done personally, not through Clear Street.
- I'm no longer actively involved day-to-day as CTO at Clear Street, but while I was, TigerBeetle was a solution we very much had in mind as our volumes were increasing.
That said, roadmaps change, priorities shift, etc. If TigerBeetle had existed when we started Clear Street, I very much would have used it, and it would have saved me from many headaches.
diggan 11 hours ago [-]
Thanks for the clarification :)
jorangreef 10 hours ago [-]
You too! :)
SOLAR_FIELDS 11 hours ago [-]
Not a bank or exchange but I work for a very large fintech and we are using it on our newer products.
jorangreef 11 hours ago [-]
Awesome to hear that! Are we chatting in Slack? Or please DM me or Lewis. Would love to chat!
nindalf 11 hours ago [-]
I think if they had, they'd brag about it on their homepage. So far the biggest endorsement from there is from some YouTuber. A popular YouTuber, no doubt, but a YouTuber nevertheless.
koakuma-chan 11 hours ago [-]
Yeah, TigerBeetle itself and their testing suite look impressive, but putting Primeagen there makes them look like Next.js or Cursor.
jorangreef 11 hours ago [-]
That’s a talk for engineers that was streamed on the Primeagen’s channel and went a bit viral. If you haven’t watched it yet, it’s a technical intro to TigerBeetle.
Otherwise check out https://tigerbeetle.com/company if you want more about the corporate side.
If you can stand listening to that guy speak, it's worth a watch.
jorangreef 11 hours ago [-]
I actually love the pace at which Prime speaks, but I feel awkward at hearing my own voice. Hopefully the ideas stand on merit!
andyferris 10 hours ago [-]
I found the line about TigerBeetle's model assuming entire-disk-sector errors but not bit/byte errors rather interesting: as someone who has created error-correcting codes, this seems out of line with my understanding. The only situation in which I can see this working is where the disk or driver encodes and decodes the sectors... and (on any disk/driver I would care to store an important transactional database on) it would be reporting tonnes of (possibly corrected) faults before TigerBeetle was even aware.
Or possibly my mental model of how physical disks and the driver stack behave these days is outdated.
matklad 9 hours ago [-]
Just to clarify, our _model_ totally assumes bit/byte errors! It's just that our fuzzer was buggy and wasn't actually exercising those faults!
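To make the distinction concrete, a minimal sketch of a storage fault injector (hypothetical code, not TigerBeetle's actual VOPR) that can corrupt either a whole simulated sector or a single bit within it; a fuzzer that only ever takes the first branch never exercises bit/byte faults:

```zig
const std = @import("std");

const sector_size = 4096;

// Inject either a whole-sector corruption or a single bit flip. A storage
// fault model should exercise both; a fuzzer bug that skips the bit-flip
// branch is exactly the kind of blind spot discussed above.
fn corruptSector(random: std.Random, sector: []u8) void {
    if (random.boolean()) {
        random.bytes(sector); // whole-sector corruption: overwrite everything
    } else {
        const bit = random.uintLessThan(usize, sector.len * 8); // one bit flip
        sector[bit / 8] ^= @as(u8, 1) << @intCast(bit % 8);
    }
}

pub fn main() void {
    var prng = std.Random.DefaultPrng.init(42); // fixed seed: deterministic, like DST
    var sector = [_]u8{0} ** sector_size;
    corruptSector(prng.random(), &sector);
}
```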
I think it should be http://pmg.csail.mit.edu/papers/vr-revisited.pdf (http scheme, not https)?
And now I have some Friday evening reading material.
jorangreef 11 hours ago [-]
It should be fixed soon!
The VSR 2012 paper is one of my favorites as is “Protocol-Aware Recovery for Consensus-Based Storage”, which is so powerful.
Hope you enjoy the read!
Ygg2 11 hours ago [-]
TigerBeetle is impressive, but it's a single-purpose DB. Unless you fit within the account/ledger model, it's extremely restrictive.
SOLAR_FIELDS 11 hours ago [-]
That is 100% correct. You use TigerBeetle when you need a really good double-entry accounting system that is open source. You wouldn't use it for much else. Which makes it great software: it's purpose-made to solve one problem really well.
saaaaaam 10 hours ago [-]
That's a slightly redundant criticism though - it doesn't present itself as anything other than a single-purpose database designed for financial transactions.
That's like saying that rice noodles are no good for making risotto. At the core they are both rice...
Ygg2 8 hours ago [-]
People seem to describe it as OLTP, and one of the first DBs to come up in an OLTP search is MySQL.
dumah 8 hours ago [-]
OLTP (Online Transaction Processing) is a database paradigm optimized for handling high volumes of short, fast transactions in real-time, typically supporting day-to-day operational activities like order processing, inventory updates, and customer account management where data integrity and quick response times are critical.
Another paradigm is OLAP, in which aggregation of large datasets is the principal concern.
Ygg2 7 hours ago [-]
Yes, I'm aware. It seems there is now a further bifurcation: OLTP is no longer general-purpose; now it's also for only one narrow use case.
jorangreef 11 hours ago [-]
Joran from TigerBeetle here!
Yes, TigerBeetle specializes only for transaction processing (OLTP). It’s not a general-purpose (OLGP) DBMS.
That said, we have customers from energy to gaming, and of course fintech.
wiradikusuma 10 hours ago [-]
If memory serves, TigerBeetle is/was not free for production? I can't find the Pricing page, but I kinda remember reading about it somewhere (or it was implied) a while back.
jorangreef 10 hours ago [-]
The DBMS is Apache 2.0, and our customers pay us (well) for everything else: to run, integrate, migrate, operate, and support it.
For more on our open source thinking and how this is orthogonal to business model (and product!), see our interview with the Changelog: https://m.youtube.com/watch?v=Yr8Y2EYnxJs
boris 6 hours ago [-]
I watched that, but I didn't find it convincing. Let's take the AWS example brought up in the talk. I think the "compete on the interface, not the (open source) implementation" idea misses (at least) the following points:
1. AWS will take your initial and ongoing investment in the implementation but they don't have to share theirs with you. Specifically, they will take your improvements but their own improvements (say some performance optimizations) they can keep to themselves. It's good business sense if it allows them to further differentiate their "improved" offering from your "vanilla" service.
2. Competing on the interface in this case really means competing on related services like management, etc. So your thesis is that you will provide a better/cheaper managed service than AWS. Even if that's true (a big if), most of the time the decision of which service to use will have little to do with technical merit. I.e., we already use AWS, have SLAs painfully negotiated, get volume discounts, etc. Do we really want to go through all of this with another vendor just for one extra service?
Just a couple of thoughts that will hopefully help you sharpen your thesis.
kristoff_it 5 hours ago [-]
> AWS will take your initial and ongoing investment in the implementation but they don't have to share theirs with you. Specifically, they will take your improvements but their own improvements (say some performance optimizations) they can keep to themselves. It's good business sense if it allows them to further differentiate their "improved" offering from your "vanilla" service.
In practice, all I've seen from AWS is adding integrations with their internal orchestrators and not much else. Back when I was at Redis Labs, AWS added TLS support to Redis and was dying to get that upstreamed (so that they wouldn't have to maintain the patch), except that as far as I understood nobody upstream wanted that code. In other words, hypothetical improvements by AWS (and other clouds) are extremely overrated. When it comes to TigerBeetle, I would put the chance that they introduce bugs and vulnerabilities much higher than the possibility that they add any meaningful improvement over what the actual experts (the TigerBeetle team) have already done.
> Do we really want to go through all of this with another vendor just for one extra service.
That's a great point, and in fact I've seen AWS purposefully offer insane (in Europe maybe we would say anti-competitive) discounts precisely to prevent Redis Labs from gaining market share. I'm sure they will try the same with TB once it becomes mainstream enough. What TB has that Redis doesn't have is the fact that it's a database designed for truly mission-critical stuff (i.e. counting the money) and maybe customers will be willing to go through the extra motions to ensure they get the best service they can (assuming TB will be able to provide that).
boris 4 hours ago [-]
> In other words, hypothetical improvements by AWS (and other Clouds) are extremely overrated.
Interesting. In a recent thread (I think it was about Redis going back to open source) an AWS employee was bragging about substantial concurrency optimizations they implemented in Valkey. At the time I thought it could have been a great differentiator to keep proprietary, but perhaps they decided to sacrifice it to help make sure Valkey takes over Redis's mindshare.
kristoff_it 1 hours ago [-]
That's a special case for sure, given the new fight for supremacy between the two forks. That said, you can see antirez in all those threads bickering with the AWS people over exactly who introduced what.
jorangreef 3 hours ago [-]
To be clear, we have no problem if all the hyperscalers decide to offer TigerBeetle as their flagship OLTP database. That builds trust and is a good thing for the ecosystem as a whole.
We also don't expect (or need) anyone to contribute improvements upstream to us. That's open source!
Finally, open source is not the same thing as product. There are thousands of companies around the world who make high quality products that people pay for. TigerBeetle is no different.