Confessions of a Raid1 Unbeliever

mike
Raid nightmare: how much confidence are you putting in your Raid 1 (mirror)?

My setup: HP ML115, Server 2003 R2, embedded on-board NVRAID controller, 30 clients.
Two 1TB drives used for data (not OS), configured in Raid 1 (mirror). Raid 1 = smart move, right? About 650GB of data. No problems for almost 3 years, then.....
The server started freezing and had to be hard reset. After a month of exploration I discovered (I'm a little slow) that it only happened during ShadowProtect backups of the entire drive. The rest of the time users could use the server and everything seemed normal.
The error log always shows Source NVRAID, Event ID 11: "The driver detected a controller error on ..." (no further information).
In the BIOS the Raid is reported as "Healthy".
An I/O problem? No, it looks OK: the queue is not very big and there is plenty of space on the drives. Could it be extra-long file paths/names (>255 characters)? Maybe. I found a few and fixed those.
Ran Robocopy to back up the entire Raid drive and found from the Robocopy log that the server hung when it hit a certain file. Deleted the file and everything seemed OK for 2-3 days, then it started again.
Ran Robocopy again and it stopped on another file, the same file every time. Deleted this file. All is sweetness and light.
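The trick that found the bad file (copy everything, then see where the log stops) can be sketched in a few lines of Python. This is only an illustration under assumptions, not what Robocopy does internally: the function name `scan_tree` and the log line format are made up for the example. It writes each path to the log before reading the file, so if the machine hangs, the last "READING" line names the culprit.

```python
import os


def scan_tree(root, log_path, chunk_size=1024 * 1024):
    """Read every file under root; return a list of (path, error) failures."""
    failures = []
    with open(log_path, "a", encoding="utf-8") as log:
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in sorted(filenames):
                path = os.path.join(dirpath, name)
                # Log and force to disk BEFORE reading, so a hard hang
                # still leaves this path as the last line in the log.
                log.write(f"READING {path}\n")
                log.flush()
                os.fsync(log.fileno())
                try:
                    with open(path, "rb") as fh:
                        # Read the whole file in chunks and discard the data;
                        # the point is only to make the drive deliver it.
                        while fh.read(chunk_size):
                            pass
                except OSError as exc:
                    failures.append((path, str(exc)))
                    log.write(f"FAILED  {path}: {exc}\n")
                else:
                    log.write(f"OK      {path}\n")
    return failures
```

Keep the log file on a different drive from the one being scanned, otherwise the log itself sits on the suspect disk.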

Then about a week later a user reported a directory (folder) missing, and I saw another drive letter on the server! What? Two independent hard drives? Sure enough, the Raid was broken. Rebuild, right? But hang on, what EXACTLY does rebuild do: copy A to B, B to A, or synchronize? I checked and compared each drive and the data is different. Jeepers! The missing folder is on drive "B" (drive 1) but not on drive "A" (drive 0). It seems drive 0 is in some sense the "master" drive? Eh? I also noticed that it is not so obvious which physical drive is which!
How long ago did the Raid break? It can't be long, can it, as I am rebooting the server every 2-3 days. Where is the users' data going now? Presumably to drive A (0), since that's the only drive with the shares.
So how did this (large) folder get onto drive B (1)?
Of course, now I'm not at all confident of my monthly + daily incremental ShadowProtect backups of the Raid, since I'm not sure which drive the data came from. I suspect drive A (0), so I could be missing the missing folders! So now my monthly + daily incrementals are unreliable. Great!
Now I'm getting paranoid and spend the better part of a weekend copying the data off each drive, A (0) and B (1), major folder by major folder, and synchronising them using EasytoSync. Good job I have spare hard drives in an external box!
Now I think I have all the data in one place!
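For anyone facing the same "which half of the mirror has what?" question, a rough Python sketch of the comparison step follows. It is not how EasytoSync works, just one way to list the differences before merging anything; `compare_trees` is a hypothetical helper name, and it assumes both ex-mirror drives are mounted and readable at the same time. It compares file contents byte by byte, so on 650GB it will be slow.

```python
import filecmp
import os


def compare_trees(left, right):
    """Return (only_left, only_right, differing) as sorted relative paths."""
    only_left, only_right, differing = [], [], []

    def walk(rel):
        l_dir = os.path.join(left, rel)
        r_dir = os.path.join(right, rel)
        l_names = set(os.listdir(l_dir))
        r_names = set(os.listdir(r_dir))
        # Entries present on one side only; a whole missing folder is
        # reported once, by its top-level name.
        only_left.extend(os.path.join(rel, n) for n in l_names - r_names)
        only_right.extend(os.path.join(rel, n) for n in r_names - l_names)
        for name in sorted(l_names & r_names):
            child = os.path.join(rel, name)
            l_path = os.path.join(left, child)
            r_path = os.path.join(right, child)
            l_is_dir = os.path.isdir(l_path)
            r_is_dir = os.path.isdir(r_path)
            if l_is_dir and r_is_dir:
                walk(child)
            elif l_is_dir or r_is_dir:
                differing.append(child)  # folder on one side, file on the other
            elif not filecmp.cmp(l_path, r_path, shallow=False):
                differing.append(child)  # same name, different contents

    walk("")
    return sorted(only_left), sorted(only_right), sorted(differing)
```

Calling it with the two drive roots gives three lists: what exists only on drive A, what exists only on drive B, and what exists on both but with different contents. The last list is the one that needs a human decision.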
So forget rebuild, since I've no idea what it is actually doing (I think it only copies from drive 0 to drive 1). I delete the Raid, reformat the drives, create the Raid again and restore the data. (Can't run CHKDSK on the Raid array.)
All seems OK for 3 days. Made a ShadowProtect main backup + 2 incrementals. Then it all started again! NVRAID error.
Is it one of the hard drives, the connections, or the Raid controller on the motherboard?
I've no reason to suspect the hard drives, have I? It's a reported Raid controller error. Again I check my backups (very tedious), break and delete the Raid and run the drives independently. OK for 2-3 days, so it's the motherboard! No, because then I get a drive controller error on drive "A". Ah! It's a hard drive problem? I remove the drive and check it using HDTune. It's OK. Quick format OK. Now I try a long format and it fails.
Next day I power up this suspect drive and it quick formats, no problem, and HDTune says OK. One hour later it fails a long format and HDTune reports everything in the red. No good!

Conclusion: the hard drive is faulty ONLY when it's hot, as it will be when ShadowProtect is copying 600GB of data. The rest of the time it's OK.

Second conclusion: Raid 1 is a waste of time for data storage and can cause more problems than it's worth. So I'm sticking with NO Raid (doubling my data capacity is a side benefit) and relying on monthly + daily incrementals plus a daily Robocopy of everything to another hard drive (just in case I need to check the log for a corrupt file). Paranoid I may be, but at least I feel more confident of finding any hard drive problem, and users will only lose recent data, worst case. Perhaps I should check the logs every morning, or get it to send me an email if it fails! I also keep one drive off-site and swap it out every week. Also, I'd hazard a guess that when users "lose" folders/files it's because they or someone else deleted them!
Now it might be that you have a server with drive caddies with little LEDs on to warn of failure. If not, best of luck!
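On the "check the logs every morning, or get an email" idea: Robocopy returns an exit code that is a set of bit flags, and per Microsoft's documentation any value of 8 or more means at least one failure. A small Python sketch of decoding it is below. The function names are made up for the example, and the emailing part is deliberately left out since it depends on your mail setup.

```python
# Robocopy exit code bit flags, as documented by Microsoft.
ROBOCOPY_FLAGS = {
    1: "one or more files copied",
    2: "extra files or directories found at the destination",
    4: "mismatched files or directories found",
    8: "some files or directories could not be copied",
    16: "fatal error, no files copied",
}


def describe_exit_code(code):
    """Return a list of human-readable meanings for a Robocopy exit code."""
    if code == 0:
        return ["nothing copied, no failures"]
    return [text for bit, text in ROBOCOPY_FLAGS.items() if code & bit]


def needs_attention(code):
    """True when the exit code reports at least one failure (8 or above)."""
    return code >= 8
```

If the daily job is started from Python with `subprocess.run(...)`, the `returncode` attribute of the result can be passed straight to `needs_attention`, and an alert sent only when it returns True.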

So...
You can't rely on the "Healthy" report on the Raid array, and who checks it every day?
You may have no warning of failure.
The data on the "Raid" drives may be different after failure and you don't know when it happened.
Your backups become suspect. This is a real concern!

If you disagree with my conclusions about the usefulness of Raid 1 (for DATA), I'd love to hear from you.
If you have the misfortune to experience any such problems, I suggest you change the hard drive(s) first! Hindsight is great, isn't it?
Mind you, I confess I'm not an IT pro, just a long-term volunteer doing what I can at a charity in Thailand, picking it up as I go along.
Now I'll take a deep breath! I feel better with it off my chest. Cheers, or
Chok Dee, as they say in Thailand
Mike
  • HAL 9000 Moderator

    Use the HDD maker's testing utility to check the drives.

    It checks the little important things and also reads what the drive's own log has recorded, and it's the easiest way to confirm that an HDD is functioning correctly.

    The proper way to test any HDD is to test it in the system that is failing and, if it fails, remove the drive, fit it to another computer and retest. If it fails the second test, the drive has gone to Silicon Heaven, or at the very least is on its way there. If it passes the second test, the motherboard, data lead or power supply in the first computer is damaged and you need to do a bit more investigation.

    Also, if it's reported that any HDD has triggered its overheat warning, scrap the drive and fit a replacement. Of course, enabling SMART doesn't do any harm either.

    Col
