Jepsen
Distributed systems safety analysis
64 Tweets · 0 Following · 2,262 Followers
Jepsen Jan 30
Replying to @mpenet
This is what I get for writing "2019-TODO" in October and only updating the TODO part before release! Should be fixed momentarily; I've just been waiting on gcloud to pick up the changes. Takes forever.
Jepsen Jan 30
New Jepsen analysis! We talk about etcd's kv operations, watches, and locks. KV ops look strict serializable, and watches deliver all changes in order (with a minor undocumented edge case around revision zero). As usual, distributed locks aren't real:
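For readers who haven't run into the "locks aren't real" argument before, here is a minimal, self-contained Python sketch of the underlying problem: a lease-based lock alone can't guarantee mutual exclusion, because a paused holder can keep writing after its lease expires, so the downstream store needs a fencing token to reject stale writers. The class names, lease times, and token scheme below are invented for illustration; this is not the etcd API and not the report's test code.

```python
"""Toy illustration: lease-based lock + fencing token. All names hypothetical."""

class LockService:
    """Grants a lock with a lease and a monotonically increasing fencing token."""
    def __init__(self, lease_ms):
        self.lease_ms = lease_ms
        self.token = 0
        self.holder = None
        self.expires_at = 0

    def acquire(self, client, now_ms):
        if self.holder is None or now_ms >= self.expires_at:
            self.token += 1
            self.holder = client
            self.expires_at = now_ms + self.lease_ms
            return self.token        # fencing token handed to the holder
        return None                  # lock currently held

class Storage:
    """Accepts a write only if its fencing token is the newest it has seen."""
    def __init__(self):
        self.max_token = 0
        self.value = None

    def write(self, value, token):
        if token < self.max_token:
            return False             # stale holder: write rejected
        self.max_token = token
        self.value = value
        return True

lock, store = LockService(lease_ms=100), Storage()

t1 = lock.acquire("client-1", now_ms=0)    # client-1 holds the lock...
# ...then stalls (GC pause, network delay) past its lease.
t2 = lock.acquire("client-2", now_ms=150)  # lease expired: client-2 acquires
assert store.write("from-2", t2)           # accepted (newer token)
assert not store.write("from-1", t1)       # client-1 wakes up; rejected (stale token)
# Without the token check, both writes would succeed and the "lock" would not
# have provided mutual exclusion.
```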
Jepsen Nov 19
Replying to @mf @natevanben and 4 others
Oh, whoops, that's right. I was thinking about txns going into the log as one Paxos op, and log sealing being a second process. But of course, the only actual Paxos op is "This is the entire log segment"; no separate blocking phase required.
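To make that idea concrete, here is a toy Python sketch of a Calvin-style batching sequencer in which the decided consensus value is the whole log segment, so sealing the window and replicating its contents are one and the same proposal. Every name here is hypothetical, and the consensus layer is a stand-in; this is an illustration of the point in the thread, not Fauna's actual implementation.

```python
"""Toy sketch: the entire epoch's batch is a single consensus value."""

class ToyConsensus:
    """Stand-in for Paxos/Raft: deciding a value delivers it to all replicas."""
    def __init__(self, replicas):
        self.replicas = replicas

    def propose(self, value):
        # One consensus round decides the whole batch at once.
        for r in self.replicas:
            r.deliver(value)

class Replica:
    def __init__(self, name):
        self.name = name
        self.log = []            # decided segments, in order

    def deliver(self, segment):
        self.log.append(segment)

class Sequencer:
    """Buffers txns for the current epoch, then proposes the batch as one op."""
    def __init__(self, consensus):
        self.consensus = consensus
        self.epoch = 0
        self.buffer = []

    def submit(self, txn):
        self.buffer.append(txn)

    def close_epoch(self):
        # Sealing the window *is* the proposal: the decided value is the
        # entire segment, so there is no second, separate sealing round.
        segment = (self.epoch, tuple(self.buffer))
        self.consensus.propose(segment)
        self.epoch += 1
        self.buffer = []

replicas = [Replica("a"), Replica("b"), Replica("c")]
seq = Sequencer(ToyConsensus(replicas))
seq.submit("t1"); seq.submit("t2")
seq.close_epoch()
assert all(r.log == [(0, ("t1", "t2"))] for r in replicas)
```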
Jepsen Nov 19
Replying to @mf @natevanben and 4 others
Does that mean that the effective txn commit latency is actually 3, not 2, network hops? 2 in the actual commit path, and an additional one implied by waiting for a log segment sealing message from remote leaders?
Jepsen Nov 18
Replying to @natevanben @fauna and 3 others
... one thing I hadn't considered until now, though, was that if we assume clocks are perfectly synced, and all nodes seal simultaneously, yes, you do need to wait for a third inter-DC hop for remote leaders to inform you of the window being sealed... , care to jump in here?
Jepsen Nov 18
Replying to @natevanben @fauna and 3 others
Oh, I think I get what you're driving at here. Yes, I think you need to get a commit index, and specifically, one for the decision to seal the log window, but I don't think that's in the blocking path of txns--if you couple windows to the Raft term, the leader can seal independently.
Jepsen Nov 18
Replying to @natevanben @fauna and 3 others
I may have forgotten to explain this in the talk, but you can (in most cases) pick a log segment with a local leader, because, in general, every DC will have leaders for one (or more) log segments. If all local leaders are down, yes, you'd have to proxy, which would add hops.
Jepsen Sep 19
No, we haven't worked together on 4.0. Perhaps they were thinking of 's analysis of 3.6.4, or 's analysis of 3.4.0? That's the most recent work we've done with Mongo.
Jepsen Sep 18
Replying to @GoTurboOfficial
Yes, we've talked! ComDB2 relies on synchronized clocks.
Jepsen Sep 18
Replying to @jepsen_io
I'm committed to giving everyone the most accurate, rigorous reporting on database correctness that I can, and I encourage vendors to do the same. Be open, honest, and nuanced in your writing. That honesty is good for users, and it builds trust in your team.
Jepsen Sep 18
Replying to @jepsen_io
But still, some vendors do misrepresent the results of our work together, and this bugs me. I expected vendors to call each other out for this sort of thing, because they're more than willing to write take-downs over interpretation of benchmarks, but so far that hasn't happened.
Jepsen Sep 18
Replying to @jepsen_io
Most vendors are telling the truth here: by the time we conclude our collaboration, the safety issues we found have usually been addressed, and the test suite often passes. Many vendors also follow up "passes" with a description of the issues we found, which I think is honest.
Jepsen Sep 18
Replying to @jepsen_io
In more general terms, almost every database we tested with Jepsen fails, sometimes in dozens of ways, before its test suite passes. That's how we know Jepsen is *working*! The vendor headline that comes out of that process is usually "X passes Jepsen".
Jepsen Sep 18
Replying to @jepsen_io
Crashes, unavailability, and performance problems aren't usually reported by Jepsen as "failing" results, because we're primarily concerned with checking safety, rather than liveness problems. It's hard to say how slow is too slow. We file and discuss these issues qualitatively.
Jepsen Sep 18
Replying to @jepsen_io
There are other issues that we found in our work, like slowly spawning an ever-increasing number of backend worker processes which eventually consume all resources and kill the machine. This one's still open too.
Jepsen Sep 18
Replying to @jepsen_io
Because these problems involve schema changes (e.g. creating tables), they may not impact users frequently. YugaByte doesn't think they're relevant to the core transactional mechanism in YugaByte DB, which is why they're not discussing them when they say "Jepsen tests passed".
Jepsen Sep 18
Again, YugaByte DB's Jepsen tests did not pass. They do not currently pass. Correctness issues we identified in our collaboration, mainly due to non-transactional schema changes, are still unaddressed. YugaByte and I have talked about this.
Jepsen Sep 5
Replying to @Yugabyte
An open question in my mind: can non-transactional schema changes (e.g. adding a column) result in *data-level* serializability violations? What would those anomalies look like? I'm honestly not sure, but it's something we can explore going forward!
Jepsen Sep 5
Replying to @Yugabyte
So... when YugaByte says they "pass Jepsen" () they're only talking about the parts of the test suite which look at changes to data records in the absence of schema changes. We think that's most important for users, and it's the vast majority of our tests
Jepsen Sep 5
Replying to @Yugabyte
The impact of this issue (like many of the problems we found in schema modification) is limited to a short time around table creation. Schema changes in general aren't transactional, so this might occur during other changes, like adding/removing columns--we haven't looked yet.