ScalNEXT (Optimierung des Datenmanagements und des Kontrollflusses von Rechenknoten für Supercomputing; in English, optimization of the data management and control flow of compute nodes for supercomputing) is a research project funded by the BMBF as part of the SCALEXA program. The project aims to use smart networking hardware to increase computational efficiency and flexibility in clusters for high-performance computing (HPC).
Project Development Team – Information and Communication Technology Topics / Team Leader Simulation Infrastructure and HPC
- +49 241 80 49724
Modern HPC systems are usually built as clusters: individual, largely independent nodes, each running its own operating system instance, are coupled by a basic resource or job management system and connected via a network. The networks used for this purpose, such as InfiniBand, Slingshot or Tofu, offer high bandwidth, but their latency is bounded by physical limits. Furthermore, they are mostly passive, i.e., they are used only for communication between the nodes. In addition to the actual computing tasks, data management and network control remain with the nodes and are thus spread across the endpoints that lie farthest apart in the system. This leads to high latencies for management and control tasks, scaling bottlenecks due to the large number of active end components, and communication bottlenecks due to the need for synchronization messages.
Modern networks, however, make it possible to move many of these tasks into the network, where they sit more centrally in the system and avoid these scaling problems. These so-called smart networks, which are reconfigurable and programmable, are already used in modern telecommunications and data centers, along with techniques such as software-defined networking (SDN). In the HPC domain, however, their use is still limited. To make smart networks feasible in HPC, several challenges still need to be solved. These include secure virtualization of user-level network resources, development of simple APIs that match existing programming approaches, and modification of operating systems to support global, cross-network approaches.
The ScalNEXT project addresses these challenges and develops new technologies to enable the use of smart networks in HPC. The goal of ScalNEXT is to increase the scalability of HPC systems and applications. We are developing technologies that allow core data management and control flow functionality to be offloaded from the nodes onto the network (onto NICs and switches), and we will apply them to three core application areas: Modeling and Simulation (ModSim), Data Analysis and I/O (HPDA), and Machine Learning (ML/AI). In all three areas, this will, firstly, reduce the load on compute nodes, which can then be fully dedicated to the necessary computations. Secondly, management and control tasks will be transferred to the more tightly coupled and centrally located network resources. This will result in a significant increase in computational efficiency and scalability, and will enable the offloading of computations close to the data.
More information on the project can be found on the Gauß-Allianz website.
We gratefully acknowledge financial support from the BMBF (German Federal Ministry of Education and Research), grant number 16ME0688.