Emergency RAID Recovery

From Nuclear Physics Group Documentation Pages

How we saved /data1 on pumpkin

DON'T EVEN THINK ABOUT DOING ANYTHING IN THIS SECTION UNLESS YOU ABSOLUTELY HAVE TO AND ARE A RAID MASTER

If the array is simply misdetected or otherwise not behaving correctly, follow the (elsewhere undocumented) repair procedure in the Areca web interface. Misdetection is typically caused by moving an entire drive array to another machine, or by rearranging drives while the machine is powered off.

  • go to the repair section
  • type RESCUE into the box
  • reboot
  • type SIGNAT into the box
  • type LeVeL2ReScUe into the box (yes, capitalization matters)
  • reboot
  • the array should be back; type SIGNAT again to make sure it stays detected



If the volume has failed (as in dead)

  • Power down the system
  • Remove hot spares and other potentially confusing drives
  • Boot the system back up
  • Check the settings on the failed volume and PRINT THESE OUT
  • Delete the volume
  • Create a new volume with EXACTLY the same settings as on the page you printed out, and make sure that the initialization mode is "no init for rescue". If you mess this up, all data is lost!

Once the array is back, it likely has some filesystem errors.

  • Mount the filesystem read-only, back up your data as best you can, unmount it, and run e2fsck
  • Repair errors
  • Reboot
  • See if you can read from and write to the array's filesystem now. Run another e2fsck to make doubly sure
  • Reboot again
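
The first step above could look roughly like the following sketch. The device name, mount point and backup target are placeholders (assumptions, not values recorded from pumpkin), and rsync is assumed to be available; any copy tool would do. By default the script only prints the commands it would run.

```shell
#!/bin/sh
# Sketch only: DEV, MNT and BACKUP are hypothetical placeholders.
DEV="${DEV:-/dev/sdX1}"         # block device of the rescued volume (assumed name)
MNT="${MNT:-/data1}"            # where the filesystem is normally mounted
BACKUP="${BACKUP:-/mnt/backup}" # somewhere NOT on the rescued array (assumed path)

# Print each command instead of running it unless DRY_RUN=0 is set.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

check_array_fs() {
    # Read-only mount, so nothing is written to the damaged filesystem yet.
    run mount -o ro "$DEV" "$MNT" || return 1
    # Copy off whatever is readable; errors here are expected on a damaged
    # filesystem, so carry on regardless.
    run rsync -a "$MNT/" "$BACKUP/"
    # e2fsck must only be run on an unmounted filesystem.
    run umount "$MNT" || return 1
    # Forced check; answer the repair prompts by hand.
    run e2fsck -f "$DEV"
}

check_array_fs
```

Run it as-is to review the plan. Once the placeholders have been replaced with real values, running it as root with DRY_RUN=0 executes the commands.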

Congrats! You've saved the day!

Saving a dismantled software RAID

Read the mdadm man page first. Once you're completely confident about how everything works, run, for example, mdadm --incremental --run /dev/sdb1.
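
As a rough sketch, it can help to inspect the member before handing it to mdadm and to check the result afterwards. The partition name below is just the example from above, and the inspection steps are a suggestion rather than part of the original procedure. By default the script only prints the commands it would run.

```shell
#!/bin/sh
# Sketch only: PART is a placeholder for one member partition of the old array.
PART="${PART:-/dev/sdb1}"

# Print each command instead of running it unless DRY_RUN=0 is set.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

reassemble_member() {
    # Read-only: dumps the md superblock (array UUID, RAID level, device role).
    run mdadm --examine "$PART" || return 1
    # Hand the member to mdadm; --run starts the array as soon as the
    # minimum number of devices is present instead of waiting for all of them.
    run mdadm --incremental --run "$PART" || return 1
    # Show which md device came up and what state it is in.
    run cat /proc/mdstat
}

reassemble_member
```

If the --examine output does not look like the array you expect, stop there and go back to the man page.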