For years, I have recommended RAID 5 or RAID 6 to the people I architect SANs for, selecting one or the other depending on the disk type and use. Over the years, people have asked: beyond redundancy, why would you use one over the other? One huge reason: bit rot!

With SSDs being so prevalent, people have forgotten about bit rot, the silent corruption of stored data over time. This is because SSDs have error-correcting code, or ECC for short, built in.

So, what experience made me write about this? Well, like any good geeky IT engineer, I enjoy tinkering at home. I had a simple RAID 1 setup that I had been using for a few years to hold some data. I had drive failures in the past and swapped the drives out. Over this time, however, I began to notice defects in videos and other data corruption. This was due to bit rot. A mirror has no parity bits to check against, so when the two copies disagree, nothing says which one is correct, and tiny amounts of data were being lost. Now, this was a home lab with replaceable data. But if it were key data at home or in the office, this would be a big problem.

This is where parity RAID comes in. If you're running production data that you can't afford to lose on regular spinning drives, RAID 6 is the minimum. Why? Simple. RAID 5 allows for one drive to be lost, yes. But during the rebuild, you no longer have a way to check against bit rot. At this point, you have the same protection as RAID 0: none. As the data is rebuilt onto the new drive, there is no way to double-check that it is 100% correct. With RAID 6, in the event of a single drive failure, you still have one drive's worth of parity left to check against.
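To make the rebuild problem concrete, here is a minimal sketch in Python of single-parity (RAID 5 style) reconstruction on one stripe. It is a toy model, not how any real controller is implemented: the four-byte blocks, the drive layout and the helper name xor_blocks are all made up for illustration. It shows that a degraded array rebuilds the missing block from whatever the surviving drives return, so a silently flipped bit on a survivor ends up baked into the rebuilt data.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte position by byte position."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# One stripe on a hypothetical 4-drive array: three data blocks plus parity.
data = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_blocks(data)

# Drive 0 fails. Its block is rebuilt from the surviving blocks and the parity.
rebuilt = xor_blocks([data[1], data[2], parity])
clean_rebuild_ok = rebuilt == data[0]

# Same failure, but a single bit had already rotted on drive 1.
rotted = bytes([data[1][0] ^ 0x01]) + data[1][1:]
bad_rebuild = xor_blocks([rotted, data[2], parity])

# With drive 0 gone there is nothing left to compare against, so the
# flipped bit is copied straight into the "rebuilt" block.
corruption_propagated = bad_rebuild != data[0]
```

With a second, independent parity block (the RAID 6 case), the degraded array would still have something to check that rebuild against.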

Another interesting point that I have never been asked about in my profession is data scrubbing, and it is just as important. Scrubbing is a scheduled task that runs through the array to check for bit rot. It does this by reading the drives and ensuring that the recalculated parity matches what is stored; if it does not, it corrects the error.
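The checking half of a scrub can be sketched in a few lines of Python. Again, this is a simplified model under my own assumptions: the stripe layout and the names xor_parity and scrub are hypothetical, and real arrays do this at the block device level. The sketch only flags stripes whose stored parity no longer matches the data; how the error is then corrected depends on the RAID level and implementation.

```python
from functools import reduce

def xor_parity(blocks):
    """Compute the XOR parity of a list of equal-length byte blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

def scrub(stripes):
    """Return the indices of stripes whose stored parity no longer matches."""
    return [
        index
        for index, (blocks, stored_parity) in enumerate(stripes)
        if xor_parity(blocks) != stored_parity
    ]

# Build two healthy stripes: (data blocks, parity written at the same time).
stripes = []
for blocks in ([b"vid0", b"vid1", b"vid2"], [b"doc0", b"doc1", b"doc2"]):
    stripes.append((list(blocks), xor_parity(blocks)))

before = scrub(stripes)  # healthy array, nothing to report

# Simulate bit rot: flip one bit in the last block of the second stripe.
blocks, _ = stripes[1]
blocks[2] = bytes([blocks[2][0] ^ 0x80]) + blocks[2][1:]

after = scrub(stripes)  # the scheduled scrub now flags stripe 1
```

Running this on a schedule is the point: the mismatch is found while the array is still healthy, not in the middle of a rebuild.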

There are also file systems with this capability, such as BTRFS and ZFS, the latter being more popular due to its greater maturity. These can only repair bit rot if you let the file system manage the redundancy itself, as a RAID replacement; otherwise, it can only detect bit rot, not repair it.
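The difference between detecting and repairing can be shown with a small Python sketch. This is not how ZFS or BTRFS are actually implemented; it is a toy model that assumes a checksum is stored separately from the data, and the names checksum and read_block are hypothetical. With only one copy, a bad checksum can only be reported. With a second copy managed by the same layer, the good copy can be found and used to heal the bad one.

```python
import hashlib

def checksum(block: bytes) -> str:
    """Checksum kept separately from the data it protects."""
    return hashlib.sha256(block).hexdigest()

def read_block(copies, expected):
    """Read a block from one or more copies, verifying against its checksum.

    Returns (data, status), where status is "ok", "repaired" or
    "unrecoverable". Bad copies are overwritten in place when a good one exists.
    """
    good = next((copy for copy in copies if checksum(copy) == expected), None)
    if good is None:
        return None, "unrecoverable"  # rot detected, nothing to repair from
    if all(copy == good for copy in copies):
        return good, "ok"
    for index in range(len(copies)):
        copies[index] = good  # self-heal every copy from the verified one
    return good, "repaired"

original = b"family photo"
expected = checksum(original)
rotted = b"family phoTo"  # simulated bit rot

# File system on top of a single device: detection only.
single = [rotted]
single_result = read_block(single, expected)

# File system managing its own mirror: detection and repair.
mirror = [rotted, original]
mirror_result = read_block(mirror, expected)
```

In the single-copy case the result is (None, "unrecoverable"); in the mirrored case the good data is returned and both copies end up correct.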

Whatever tech you choose, be it a file system, RAID or something similar, if you are using spinning drives or SSDs of 3.84 TB or larger, use a system with double redundancy. Once it is installed, ensure that data scrubbing is running, as this will pick up errors and deal with them.