"ENVELOPE - Effizienz und Zuverlässigkeit: Selbstorganisation in HPC-Systemen“ is a project funded by the Federal Ministry of Education and Research (BMBF) with a period of three years. The project started in January 2017.
ENVELOPE aims to improve the efficiency and resiliency of heterogeneous HPC systems. To do so, it relies on self-organization techniques that are independent of the underlying hardware. It combines global mechanisms that consider the whole computing cluster with node-level observation and application support to improve the systems' resiliency. The rising number of compute nodes, and the accompanying increase in system components, results in an ever-growing probability of node failures. Reliability is crucial for future HPC systems; however, traditional methods such as globally coordinated checkpointing are not always an option because of their high resource demands. ENVELOPE therefore links traditional resiliency schemes with proactive failure detection for individual components. This approach allows the running application to continue gracefully. The heterogeneous characteristics of the system are hidden from the application programmer.
The Institute for Automation of Complex Power Systems focuses on the proactive migration of processes by leveraging container solutions. These are combined with existing checkpointing mechanisms to minimize migration costs. In addition, the application can be restarted from the checkpoints if the migration cannot be completed in time.
On 8 October 2018, Dr. Lankes presented the first results of the ENVELOPE project at the HPC status conference of the Gauß-Allianz in Erlangen.