A cautionary tale for the #ZFS people: the #Gandi storage failure postmortem https://news.gandi.net/en/2020/01/postmortem-of-the-failure-of-one-hosting-storage-unit-at-lu-bi1-on-january-8-2020/
@jwildeboer This appears to have a lot more to do with not testing your recovery procedure, not having backups, and using non-ECC RAM in servers than it does with ZFS. There is no filesystem that can protect you from bad processes. I don't know about the full-scan-for-import thing; I've been using ZFS for over a decade and have never had that happen. It seems like, on top of not bothering to test their recovery procedures, their ops folks also barely know ZFS.
@freakazoid That's your interpretation of my words. I didn't say or imply that. I found the story interesting and shareworthy for these reasons:
- One server can cause a lot of problems, even when ZFS seems to be set up in a way that should guarantee high resilience.
- Finding the root cause can be quite difficult.
- Features missing from older versions came as a surprise and severely slowed down the recovery process.
It's an insightful postmortem of a ZFS failure mode. Hence I shared it.
@jwildeboer It's also a good cautionary tale for those who think "Oh it's triple-replicated so we don't need backups." You always need backups. And you need to test your backups. Just like you need to test every other recovery procedure.