Silent Data Corruptions at Scale
A Meta (Facebook) study of silent data corruptions (SDCs) in large-scale data centers found that faulty hardware can produce incorrect results without crashing the system, impacting 1.7% of machines annually and requiring careful monitoring to distinguish SDCs from hard failures.
Background
- The 2021 paper, from Facebook/Meta researchers, studies "silent data corruptions" (SDCs): hardware errors in which a CPU produces wrong results without crashing or logging an error.
- Unlike crashes or blue screens, SDCs are invisible: the system seems fine, but its outputs are subtly wrong, and the errors can spread unnoticed into databases, ML models, or scientific calculations.
- Key finding: CPUs that pass all factory tests can still fail under rare operating conditions (voltage drops, temperature shifts). At Facebook's scale (hundreds of thousands of servers), these rare glitches become inevitable.
- The paper's practical detection method: run the same computation on two CPU cores and compare results. This let Facebook identify faulty hardware before it caused user-visible harm.
- This work changed how large-scale operators think about hardware reliability: instead of assuming that every CPU that passes testing behaves identically, they treat hardware faults as a normal, manageable risk.
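
The cross-core comparison mentioned in the list above can be illustrated with a minimal Python sketch. It is not the paper's tooling: the names `run_pinned`, `cross_check`, and `workload` are hypothetical, the workload is an arbitrary deterministic computation, and CPU pinning relies on `os.sched_setaffinity`, which is only available on Linux. The sketch also assumes results are hashable so they can be counted.

```python
import os
from collections import Counter


def run_pinned(core_id, fn):
    """Run fn() with this process pinned to core_id (Linux only), then restore."""
    if not hasattr(os, "sched_setaffinity"):
        raise RuntimeError("CPU pinning requires os.sched_setaffinity (Linux)")
    previous = os.sched_getaffinity(0)
    os.sched_setaffinity(0, {core_id})
    try:
        return fn()
    finally:
        # Always restore the original affinity mask, even if fn() raises.
        os.sched_setaffinity(0, previous)


def cross_check(fn, core_ids, runner=run_pinned):
    """Run fn once per core and return (results, suspects).

    results maps each core ID to the value it produced. suspects lists the
    cores whose value differs from the most common one. If there is no clear
    majority (for example, two cores that disagree), every core is reported,
    because the comparison alone cannot tell which one is wrong.
    """
    results = {core: runner(core, fn) for core in core_ids}
    counts = Counter(results.values())
    if len(counts) <= 1:
        return results, []  # all cores agree
    majority_value, majority_count = counts.most_common(1)[0]
    tied = sum(1 for n in counts.values() if n == majority_count) > 1
    if tied:
        return results, sorted(results)
    return results, sorted(c for c, v in results.items() if v != majority_value)


def workload():
    """Hypothetical deterministic workload; any repeatable computation works."""
    return int(1.1 ** 53)
```

On a Linux host, `cross_check(workload, sorted(os.sched_getaffinity(0)))` would compare every core the process is allowed to use. The `runner` parameter exists so the comparison logic can be exercised with a simulated faulty core, since a healthy machine should never report a suspect. Note that a single agreeing run does not prove a core is sound: as the list above says, these faults may appear only under rare operating conditions.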