@jwildeboer This appears to have a lot more to do with not testing your recovery procedure, not having backups, and using non-ECC RAM in servers than it does with ZFS. No filesystem can protect you from bad processes. I don't know about the full-scan-on-import thing; I've been using ZFS for over a decade and have never had that happen. It seems like, on top of not bothering to test their recovery procedures, their ops folks also barely know ZFS.

@freakazoid I didn't see that they were not using ECC. I only saw a note that "server RAM" might be a cause. But I am always very careful not to jump to conclusions from the outside.


@jwildeboer Except for the conclusion that it's somehow ZFS's fault ;-)

@freakazoid That's your interpretation of my words. I didn't say or imply that. I found the story interesting and shareworthy for these reasons:

- One server can cause a lot of problems, even when ZFS seems to be set up in a way that should guarantee high resilience.
- Finding the root cause can be quite difficult.
- Features missing from older versions came as an unexpected surprise and severely slowed down the recovery process.

It's an insightful postmortem of a ZFS failure mode. Hence I shared it.

@jwildeboer I guess we should also take into consideration the fact that, IIRC, the boxes involved were running Nexenta's old implementation of ZFS.
I've already seen this happen on old IllumOS boxes.


@jwildeboer It's also a good cautionary tale for those who think "Oh it's triple-replicated so we don't need backups." You always need backups. And you need to test your backups. Just like you need to test every other recovery procedure.

@freakazoid Yes, it's a cautionary tale, highlighting a lot of points to review in any DR/Failure process. And that's why I shared it. Not many companies are as transparent as #Gandi in sharing such info.
