Storage Software: Availability
Tidbits from our Storage Analytics team:
Most unavailability events are transient and short (90% last < 10 min)
It pays to wait before initiating recovery operations
Fault bursts are important:
10% of faults are part of a correlated burst
Most small bursts have no rack correlation
Most large bursts are highly rack-correlated
Correlated failures reduce the benefit of replication (R = number of replicas):
Uncorrelated R=2 to R=3 => MTTF grows by 3500x
Correlated R=2 to R=3 => MTTF grows by 11x
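The effect of correlation on the R=2 to R=3 gain can be illustrated with a toy Markov-chain model. This is a minimal sketch, not the Storage Analytics team's model: the function names `mttf` and `replication_gain`, the single-repair-at-a-time assumption, and the "burst destroys every replica at once" assumption are all hypothetical simplifications, and the rates passed in are made up. It is not expected to reproduce the 3500x and 11x figures above.

```python
def mttf(replicas, fail_rate, repair_rate, burst_rate=0.0):
    """Mean time until all replicas are lost, starting with all replicas live.

    Toy continuous-time Markov chain (all rates per unit time, all assumed):
      - each live replica fails independently at `fail_rate`
      - one missing replica at a time is restored at `repair_rate`
      - a correlated burst destroys every replica at once at `burst_rate`
    """
    # Let T[k] be the expected time to data loss with k live replicas.
    # Eliminate T[k+1] from the top down so that T[k] = a[k] + b[k] * T[k-1].
    f = replicas * fail_rate
    a = {replicas: 1.0 / (f + burst_rate)}
    b = {replicas: f / (f + burst_rate)}
    for k in range(replicas - 1, 0, -1):
        f = k * fail_rate
        den = f + repair_rate + burst_rate - repair_rate * b[k + 1]
        a[k] = (1.0 + repair_rate * a[k + 1]) / den
        b[k] = f / den

    # Back-substitute from T[0] = 0 (data already lost) up to T[replicas].
    t = 0.0
    for k in range(1, replicas + 1):
        t = a[k] + b[k] * t
    return t


def replication_gain(fail_rate, repair_rate, burst_rate=0.0):
    """Factor by which MTTF grows when going from R=2 to R=3."""
    return (mttf(3, fail_rate, repair_rate, burst_rate)
            / mttf(2, fail_rate, repair_rate, burst_rate))


if __name__ == "__main__":
    lam, rho = 0.001, 1.0  # made-up failure and repair rates
    print("uncorrelated gain:", replication_gain(lam, rho))
    print("correlated gain:  ", replication_gain(lam, rho, burst_rate=1e-6))
```

In this sketch the uncorrelated gain scales roughly with `repair_rate / fail_rate`, while any nonzero `burst_rate` caps MTTF near `1 / burst_rate` regardless of R, so the third replica buys much less.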
source: Google Storage Analytics team
D. Ford, F. Popovici, M. Stokely, V-A. Truong, F. Labelle, L. Barroso, S. Quinlan, and C. Grimes
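The first observation above (most events are short, so it pays to wait) suggests a simple delayed-recovery policy. The following is a minimal sketch under stated assumptions: the `Node` type, the `nodes_to_recover` helper, and the 15-minute `RECOVERY_DELAY_S` threshold are hypothetical names and values chosen for illustration, not taken from the slide.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical threshold: longer than the 10 minutes within which
# 90% of events are reported to resolve on their own.
RECOVERY_DELAY_S = 15 * 60


@dataclass
class Node:
    name: str
    # Timestamp (seconds) at which the node was first seen down; None if up.
    down_since: Optional[float] = None


def nodes_to_recover(nodes: List[Node], now: float,
                     delay_s: float = RECOVERY_DELAY_S) -> List[str]:
    """Return names of nodes that have been down for at least `delay_s`.

    Nodes that are down for less than the delay are skipped, on the
    assumption that most such outages are transient.
    """
    return [
        n.name
        for n in nodes
        if n.down_since is not None and now - n.down_since >= delay_s
    ]


if __name__ == "__main__":
    cluster = [Node("a"), Node("b", down_since=0.0), Node("c", down_since=800.0)]
    print(nodes_to_recover(cluster, now=900.0))
```

The trade-off is that a longer delay avoids wasted re-replication for transient events but leaves data under-replicated for longer when the failure is real.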