Tweets

Jepsen @jepsen_io · Jan 30

This is what I get for writing "2019-TODO" in October and only updating the TODO part before release!
Should be fixed momentarily, I've just been waiting on gcloud to pick up the changes. Takes forever.

Jepsen @jepsen_io · Jan 30

New Jepsen analysis! We talk about etcd's kv operations, watches, and locks. KV ops look strict serializable, and watches deliver all changes in order (with a minor undocumented edge case around revision zero). As usual, distributed locks aren't real:
jepsen.io/analyses/etcd-…
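
A minimal sketch of the failure mode behind that last line, assuming a lease-based lock like etcd's. Nothing here is etcd's actual API; the token stands in for something like the lock key's revision, and the toy store refuses writes from stale holders, which is the usual fencing mitigation.

    # Toy model of why "holding a lease-based lock" is not enough on its own.
    # The token is a stand-in for a monotonically increasing value such as the
    # lock key's revision; the store rejects writes from stale lock holders.

    class FencedStore:
        def __init__(self):
            self.highest_token = -1
            self.value = None

        def write(self, value, token):
            # A holder whose lease expired (GC pause, partition) shows up with
            # an older token than the current holder's and is rejected.
            if token < self.highest_token:
                raise RuntimeError(f"stale token {token} rejected")
            self.highest_token = token
            self.value = value

    store = FencedStore()
    store.write("from holder A", token=7)   # A acquires the lock (revision 7)
    store.write("from holder B", token=9)   # A pauses; lease lapses; B acquires (revision 9)
    try:
        store.write("late write from A", token=7)   # A wakes up, still believing it holds the lock
    except RuntimeError as err:
        print(err)   # without the token check, A would silently clobber B's work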

Jepsen @jepsen_io · Nov 19

Oh, whoops, that's right. I was thinking about txns going into the log as one paxos op, and log sealing being a second process. But of course, the only actual paxos op is "This is the entire log segment"; no separate blocking phase required.
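
A schematic contrast between the two readings in this tweet, with `propose` standing in for whatever consensus round the system actually runs; this illustrates the distinction, not Fauna's code.

    # "propose" is a stand-in for one consensus (Paxos) operation.

    def commit_window_misreading(txns, propose):
        # Each txn replicated as its own op, then a separate, blocking seal op.
        for t in txns:
            propose(("txn", t))
        propose(("seal",))

    def commit_window_as_described(txns, propose):
        # The leader buffers txns locally; the only consensus op is
        # "this is the entire log segment".
        propose(("segment", list(txns)))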

Jepsen @jepsen_io · Nov 19

Does that mean that the effective txn commit latency is actually 3, not 2, network hops? 2 in the actual commit path, and an additional one implied by waiting for a log segment sealing message from remote leaders?

Jepsen @jepsen_io · Nov 18

... one thing I hadn't considered until now, though: even if we assume clocks are perfectly synced and all nodes seal simultaneously, yes, you do need to wait a third inter-DC hop for remote leaders to inform you that the window has been sealed... @fauna, care to jump in here?

Jepsen @jepsen_io · Nov 18

Oh, I think I get what you're driving at here. Yes, I think you need a commit index, specifically one for the decision to seal the log window, but I don't think that's in the blocking path of txns--if you couple windows to the Raft term, the leader can seal independently.
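
One way to read "couple windows to the Raft term", sketched with made-up names and numbers: if a window is identified by (term, window index) and its boundary is a pure function of those, the leader can decide a window is sealed without an extra consensus round in the commit path. A sketch of the idea under discussion, not how Fauna actually implements it.

    # Made-up sketch: window identity is (term, window_index), derived purely
    # from the term and a clock, so sealing needs no separate consensus round.

    WINDOW_MS = 10

    def window_id(term, now_ms):
        return (term, now_ms // WINDOW_MS)

    class Leader:
        def __init__(self, term):
            self.term = term
            self.buffers = {}                       # window id -> buffered txns

        def submit(self, txn, now_ms):
            self.buffers.setdefault(window_id(self.term, now_ms), []).append(txn)

        def maybe_seal(self, now_ms):
            # Every window strictly older than the current one is sealed locally;
            # each sealed window then becomes one replicated log-segment op.
            current = window_id(self.term, now_ms)
            sealed = {w: txns for w, txns in self.buffers.items() if w < current}
            for w in sealed:
                del self.buffers[w]
            return sealed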

Jepsen @jepsen_io · Nov 18

I may have forgotten to explain this in the talk, but you can (in most cases) pick a log segment with a local leader, because, in general, every DC will have leaders for one (or more) log segments. If all local leaders are down, yes, you'd have to proxy, which would add hops.
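
A toy version of that routing claim, with invented datacenter names: segment leaders are spread across DCs, so a coordinator can usually pick a locally led segment and only proxies (paying extra hops) when every local leader is down. Purely schematic, not Fauna's topology code.

    SEGMENT_LEADERS = {0: "us-east", 1: "us-west", 2: "eu-west"}   # segment -> leader's DC

    def pick_segment(local_dc, down_segments=()):
        local = [s for s, dc in SEGMENT_LEADERS.items()
                 if dc == local_dc and s not in down_segments]
        if local:
            return local[0], "local leader: no extra inter-DC hop"
        remote = [s for s in SEGMENT_LEADERS if s not in down_segments]
        return remote[0], "proxy to a remote leader: extra hops"

    print(pick_segment("us-east"))                      # uses segment 0's local leader
    print(pick_segment("us-east", down_segments=(0,)))  # falls back to proxying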

Jepsen @jepsen_io · Sep 19

No, we haven't worked together on 4.0. Perhaps they were thinking of @meatcomputer's analysis of 3.6.4, or @aphyr's analysis of 3.4.0? That's the most recent work we've done with Mongo.
jepsen.io/analyses/mongo…
jepsen.io/analyses/mongo…

Jepsen @jepsen_io · Sep 18

Yes, we've talked! ComDB2 relies on synchronized clocks. bloomberg.github.io/comdb2/transac…

Jepsen @jepsen_io · Sep 18

I'm committed to giving everyone the most accurate, rigorous reporting on database correctness that I can, and I encourage vendors to do the same. Be open, honest, and nuanced in your writing. That honesty is good for users, and it builds trust in your team.

Jepsen @jepsen_io · Sep 18

But still, some vendors do misrepresent the results of our work together, and this bugs me. I expected vendors to call each other out for this sort of thing, because they're more than willing to write take-downs over interpretation of benchmarks, but so far that hasn't happened.

Jepsen @jepsen_io · Sep 18

Most vendors are telling the truth here: by the time we conclude our collaboration, the safety issues we found have usually been addressed, and the test suite often passes. Many vendors also follow up "passes" with a description of the issues we found, which I think is honest.

Jepsen @jepsen_io · Sep 18

In more general terms, almost every database we tested with Jepsen fails, sometimes in dozens of ways, before its test suite passes. That's how we know Jepsen is *working*! The vendor headline that comes out of that process is usually "X passes Jepsen". pic.twitter.com/bJBT3MDw74

Jepsen @jepsen_io · Sep 18

Crashes, unavailability, and performance problems aren't usually reported by Jepsen as "failing" results, because we're primarily concerned with checking safety rather than liveness. It's hard to say how slow is too slow. We file and discuss these issues qualitatively.

Jepsen @jepsen_io · Sep 18

There are other issues that we found in our work, like slowly spawning an ever-increasing number of backend worker processes which eventually consume all resources and kill the machine. This one's still open too. github.com/YugaByte/yugab…

Jepsen @jepsen_io · Sep 18

Because these problems involve schema changes (e.g. creating tables), they may not impact users frequently. YugaByte doesn't think they're relevant to the core transactional mechanism in YugaByte DB, which is why they're not discussing them when they say "Jepsen tests passed".

Jepsen @jepsen_io · Sep 18

Again, YugaByte DB's Jepsen tests did not pass. They do not currently pass. Correctness issues we identified in our collaboration, mainly due to non-transactional schema changes, are still unaddressed. YugaByte and I have talked about this.
blog.yugabyte.com/announcing-yug… pic.twitter.com/lu8gwTq8re

Jepsen @jepsen_io · Sep 5

An open question in my mind: can non-transactional schema changes (e.g. adding a column) result in *data-level* serializability violations? What would those anomalies look like? I'm honestly not sure, but it's something we can explore going forward!

Jepsen @jepsen_io · Sep 5

So... when YugaByte says they "pass Jepsen" (blog.yugabyte.com/yugabyte-db-di…) they're only talking about the parts of the test suite which look at changes to data records in the absence of schema changes. We think that's most important for users, and it's the vast majority of our tests.

Jepsen @jepsen_io · Sep 5

The impact of this issue (like many of the problems we found in schema modification) is limited to a short time around table creation. Schema changes in general aren't transactional, so this might occur during other changes, like adding/removing columns--we haven't looked yet.
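
For concreteness, the rough shape of a check that exercises that window, written as Python pseudocode with a hypothetical `db` client (the real Jepsen tests are Clojure and considerably more involved): create a table, write to it immediately from several clients, and verify that every acknowledged write is still readable.

    # Hypothetical "db" client with an execute() method; illustrative only.
    import threading

    def worker(db, table, start, acked):
        for i in range(start, start + 100):
            try:
                db.execute(f"INSERT INTO {table} (id) VALUES ({i})")
                acked.append(i)          # the database acknowledged this write
            except Exception:
                pass                     # errors are fine; lost acks are not

    def check_table_creation_window(db):
        db.execute("CREATE TABLE t (id INT PRIMARY KEY)")   # non-transactional DDL
        acked = []
        threads = [threading.Thread(target=worker, args=(db, "t", n * 1000, acked))
                   for n in range(5)]
        for t in threads: t.start()
        for t in threads: t.join()
        present = {row[0] for row in db.execute("SELECT id FROM t")}
        missing = [i for i in acked if i not in present]
        assert not missing, f"acknowledged but missing rows: {missing}"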