List of state-of-the-art checkpointing and failure recovery mechanisms for machine learning.

CheckFreq
Paper | Code

Check-N-Run
Paper

GPM
Paper | Code

Swift
Paper

LightCheck
Paper | Code