Fujitsu Develops Fast Recovery Process for Multiple Disk Failures

Fujitsu Laboratories Ltd. has developed a disk-recovery technique that is faster than existing methods and that can handle multiple failures. RAID is a widely used technology for protecting data against disk failures, but the huge amounts of data generated through the use of web services make recovery slow.

Fujitsu claims that its new high-speed recovery process maintains the same tolerance to disk failures as existing RAID technology while speeding up recovery by 20% or more when there are multiple failures, such as when two disks fail simultaneously. Space efficiency can be traded off flexibly to suit the usage scenario.

The technique works by managing blocks of data (the units of storage) in groups, using a structure that Fujitsu describes as unique. This grouping is what preserves RAID-level tolerance to disk failures while shortening recovery.

In standard RAID implementations, such as the widely used RAID 5 and RAID 6, all the parity is used to protect all the data. If a given disk fails, each piece of data stored on it must be reconstructed from the parity that protects it together with the remaining data. This makes recovery lengthy and increases the risk that additional data will be lost while recovery is still in progress. For example, when using 48 disks, each with 4-TB capacity and 15 Mbps random I/O performance, recovering from simultaneous failures of two disks is calculated to take more than 10 hours.
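To make the cost of conventional recovery concrete, here is a minimal sketch of single-parity (RAID 5 style) reconstruction using XOR. The stripe width, block contents, and the helper name `xor_blocks` are illustrative assumptions, not details from Fujitsu's announcement.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte strings together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

# One stripe across 5 disks: 4 data blocks plus 1 parity block (assumed layout).
data = [b"AAAA", b"BBBB", b"CCCC", b"DDDD"]
parity = xor_blocks(data)

# The disk holding block 2 fails. Because the parity covers the whole stripe,
# every surviving data block and the parity must be read to rebuild it.
lost = 2
survivors = [blk for i, blk in enumerate(data) if i != lost] + [parity]
rebuilt = xor_blocks(survivors)
blocks_read = len(survivors)

print(rebuilt == data[lost], blocks_read)
```

The number of blocks read grows with the stripe width, which is why wide stripes on large disks take so long to rebuild.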

In Fujitsu's technology, the range of protection offered by each parity does not cover all of the data, but rather is limited to a portion. Additionally, Fujitsu developed a unique approach using a partially overlapping range of parity protection to protect any of the pieces of data from loss. When a disk fails, only the minimum combination of parity and data needed for recovery is used, which shortens recovery time.

In addition, data and parity are distributed over the different disks that make up a storage system. When a disk fails, recovery is performed by selecting the parity with the minimum amount of processing to recover the lost data that had been stored on that disk.
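The idea of limited, partially overlapping protection ranges and of picking the cheapest usable parity can be sketched as follows. This is only an illustration under stated assumptions: the four ranges in `ranges`, the XOR parity, the helper names, and the simplification that parity blocks themselves survive are all hypothetical and do not represent Fujitsu's actual layout.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte strings together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

# Six data blocks, each assumed to live on a different disk.
data = {i: bytes([65 + i]) * 4 for i in range(6)}

# Hypothetical protection ranges: three narrow, partially overlapping
# parities plus one wide parity.
ranges = {
    "pA": (0, 1, 2),
    "pB": (2, 3, 4),
    "pC": (4, 5, 0),
    "pWide": (0, 1, 2, 3, 4, 5),
}
parity = {name: xor_blocks([data[i] for i in members])
          for name, members in ranges.items()}

def pick_parity(lost, missing):
    """Return the usable parity covering `lost` that needs the fewest reads."""
    usable = [name for name, members in ranges.items()
              if lost in members
              and not any(m in missing for m in members if m != lost)]
    return min(usable, key=lambda name: len(ranges[name]))

def rebuild(lost, missing, store):
    """Rebuild one block; returns (parity used, blocks read, rebuilt bytes)."""
    name = pick_parity(lost, missing)
    peers = [store[m] for m in ranges[name] if m != lost]
    return name, len(peers) + 1, xor_blocks(peers + [parity[name]])

# Two data blocks fail at once; only the survivors are available.
missing = {1, 2}
store = {i: blk for i, blk in data.items() if i not in missing}

# Block 2 first: pA and pWide also contain missing block 1, so pB is chosen.
name2, reads2, store[2] = rebuild(2, missing, store)
missing.discard(2)
# Block 1 can now use pA together with the freshly rebuilt block 2.
name1, reads1, store[1] = rebuild(1, missing, store)

print(name2, reads2, name1, reads1)
```

In this toy layout each rebuild reads three blocks instead of the six a single wide parity would require. A real system would also have to handle lost parity blocks and failure patterns for which no narrow parity is usable; here `pick_parity` would simply raise an error in that case.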

Fujitsu says that with the multilayered, overlapping structure of parity-protection range, there are mutual tradeoffs between recovery time (dependent on the minimum data-processing volume needed to restore data), probability of data loss (dependent on the number of parities that protect each piece of data), and capacity utilization efficiency (dependent on the ratio between data and parity). The range of parity protection can be tuned to provide the best balance given the importance of the data being stored.
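The three-way trade-off can be illustrated with a small calculation over two hypothetical layouts. The function `summarize`, both layouts, and the use of "number of parities covering a block" as a rough stand-in for loss probability are assumptions made for this sketch, not figures from Fujitsu.

```python
def summarize(num_data, ranges):
    """Return (capacity efficiency, fewest parities covering any block,
    worst-case blocks read to rebuild a single lost block)."""
    efficiency = num_data / (num_data + len(ranges))
    cover = [sum(1 for r in ranges if b in r) for b in range(num_data)]
    # Rebuilding via a range reads its other members plus the parity,
    # which equals the size of the range.
    reads = [min(len(r) for r in ranges if b in r) for b in range(num_data)]
    return efficiency, min(cover), max(reads)

everything = tuple(range(6))

# Two parities that each cover all six data blocks (RAID 6-like ranges).
wide_only = [everything, everything]

# Three narrow, partially overlapping ranges plus one wide range.
overlapping = [(0, 1, 2), (2, 3, 4), (4, 5, 0), everything]

wide_summary = summarize(6, wide_only)
overlap_summary = summarize(6, overlapping)
print(wide_summary)
print(overlap_summary)
```

Under these assumptions both layouts keep at least two parities over every block, but the overlapping layout halves the reads needed for a single-block rebuild (3 instead of 6) at the cost of lower capacity efficiency (0.6 instead of 0.75).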

Fujitsu Laboratories plans to continue making improvements to this recovery technology with the goal of a practical implementation during fiscal 2015.