Computing platforms have limited resources: memory, cache, bandwidth, and processing power. Traditionally, algorithms have been designed for optimal computational complexity and are therefore assumed to make efficient use of processing power, while memory, cache, and bandwidth limitations have often been ignored. Yet these resources frequently become the primary factors limiting overall performance, especially (but not only) when using accelerators such as GPUs. GPUs offer greater processing capability and superior energy efficiency compared to CPUs, which has made them a crucial element of many computing systems over the past decade.
In this course, we will present, on the one hand, algorithmic approaches recently proposed to use all of these resources efficiently and, on the other hand, how to implement such algorithms on real hardware platforms. The typical use case is linear algebra computations (matrix operations), which underlie both "traditional" high-performance computing applications and modern neural network computations.
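To make the typical use case concrete, here is a minimal, untuned dense matrix multiplication in C++ (an illustrative sketch of ours, not course material). Its O(n³) triple loop is the baseline whose data-movement costs, rather than flop count, the course will analyze:

```cpp
#include <vector>
#include <cstddef>

// Naive dense matrix multiplication C = A * B for row-major n x n matrices.
// Every element of A and B is read n times, so cache and bandwidth, not
// arithmetic, typically limit its performance.
void matmul(const std::vector<double>& A, const std::vector<double>& B,
            std::vector<double>& C, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j) {
            double sum = 0.0;
            for (std::size_t k = 0; k < n; ++k)
                sum += A[i * n + k] * B[k * n + j];
            C[i * n + j] = sum;
        }
}
```

Blocked (tiled) variants of this loop nest, covered later in the course, reduce the number of slow-memory transfers without changing the arithmetic.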
Figures: a sequential machine; a distributed memory machine; the Nvidia Volta 100 architecture.
Outline
Introduction
Course details and evaluation methods
Relevance of this course
Models and algorithms for CPU computations
Matrix multiplications
I/O complexity bounds
Data transfer cost models
Communication avoiding sequential and parallel algorithms
2D/2.5D/3D algorithms
Directed acyclic graph (DAG) scheduling
Memory aware computations
Exploratory research topics
GPU fundamentals
GPU vs CPU architecture
CUDA basics
Memory model, threads, blocks
Memory optimizations using shared memory
Coalescing and memory alignment
Tensor cores
Exploratory research topics
Implementation on CPUs and GPUs
Parallelization using OpenMP and MPI
Parallelization using CUDA
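As a small taste of the loop-level parallelization covered in the implementation part, here is a hedged sketch of an OpenMP-parallelized vector update (the function name and example are ours, not from the course). If the compiler is invoked without OpenMP support, the pragma is simply ignored and the loop runs serially with the same result:

```cpp
#include <vector>

// y = a*x + y (the BLAS "axpy" operation), parallelized with OpenMP.
// Each iteration is independent, so the pragma can safely distribute
// iterations of the loop across the available threads.
void axpy(double a, const std::vector<double>& x, std::vector<double>& y) {
    const long long n = static_cast<long long>(x.size());
    #pragma omp parallel for  // split the iteration space across threads
    for (long long i = 0; i < n; ++i)
        y[i] += a * x[i];
}
```

A CUDA version of the same operation would instead launch one thread per element; the course compares these programming models on such kernels.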
Throughout the course, we will look at several research projects related to parallel computation in high performance computing, machine learning, and data analytics.
Prerequisite
Experience with C/C++ is expected. Knowledge of parallel algorithms will be helpful, but not required.
Evaluation
The evaluation will be based on the following components:
2 pen-and-paper-based assignments (20% weight)
2-3 small programming assignments (30% weight)
Project (50% weight): Each student will select a topic based on their interests and work on it. The output will be assessed based on a written report and an oral presentation.
Recommended reading (evolving)
Bibliography
The course will be based on recent research articles in the area. For GPU programming, however, we will follow these textbooks: