Over the last few decades, the speed of computation has improved drastically, while the speed of data transfer has improved much more slowly. As a result, current computing systems are often bottlenecked by the large volume of data they must move. One promising approach to this problem is to design algorithms that minimize data transfers. The goal of this course is to present a variety of such algorithms for matrix and tensor computations.
Tensors are multi-dimensional arrays and arise in several domains, e.g., data
mining, neuroscience, quantum computing, molecular dynamics, and computer vision.
The course will present several tensor computations in detail.
The course will also discuss various approaches to compute communication lower
bounds (which specify how much data transfer is required) for different matrix
and tensor computations. Establishing communication lower bounds helps one
identify efficient algorithms, namely those that exactly attain the bounds.
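As a concrete example (a standard result from the literature, stated here only for illustration), for multiplying two n×n matrices with a classical algorithm on a sequential machine whose fast memory holds M words, the number of words W moved between slow and fast memory satisfies

```latex
% Classic communication lower bound for n x n matrix multiplication
% on a machine with fast memory of size M (Hong & Kung):
W = \Omega\!\left(\frac{n^{3}}{\sqrt{M}}\right)
```

Blocked algorithms attain this bound, which is what makes them communication optimal.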
Part 1: Communication lower bounds and optimal algorithms for matrix computations
In this part, we will first study the seminal work of Hong and Kung (from 1981)
to determine the minimum data transfer between levels of a memory hierarchy on a
sequential machine (a single processor).
After that, we will review other approaches for obtaining communication lower bounds on sequential
and parallel systems. We will also study, for both types of systems, some algorithms that attain the
lower bounds.
Matrix multiplications
Pebble games
Data transfer cost models
Memory dependent and independent communication lower bounds
Communication avoiding sequential and parallel algorithms
2D/2.5D/3D algorithms
Strassen's matrix multiplication
Matrix factorizations
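To make the idea of a communication-avoiding sequential algorithm concrete, here is a minimal sketch (my own illustration, not code from the course) of blocked matrix multiplication. The tile size b is assumed to be chosen so that three b×b tiles fit in fast memory; the transfer volume then drops from O(n^3) for the naive loop ordering to O(n^3/b), matching the lower bound when b is on the order of sqrt(M).

```cpp
#include <algorithm>
#include <vector>

// Blocked (tiled) C = C + A*B for n x n row-major matrices.
// Working on b x b tiles keeps the operands of each partial
// product resident in fast memory, reducing data transfers.
void blocked_matmul(int n, int b,
                    const std::vector<double>& A,
                    const std::vector<double>& B,
                    std::vector<double>& C) {
    for (int ii = 0; ii < n; ii += b)
        for (int jj = 0; jj < n; jj += b)
            for (int kk = 0; kk < n; kk += b)
                // Multiply the (ii,kk) tile of A by the (kk,jj) tile of B
                // and accumulate into the (ii,jj) tile of C.
                for (int i = ii; i < std::min(ii + b, n); ++i)
                    for (int k = kk; k < std::min(kk + b, n); ++k) {
                        double a = A[i * n + k];
                        for (int j = jj; j < std::min(jj + b, n); ++j)
                            C[i * n + j] += a * B[k * n + j];
                    }
}
```

The std::min guards make the code correct even when b does not divide n, which keeps the sketch usable for the kind of loop-permutation experiments discussed in the lectures.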
Part 2: Tensor computations
In this part, we will first look at tensor notations and different types of popular
tensor operations. After that, we will study widely used tensor decompositions and
different algorithms to compute them. We will also analyze the bottleneck computations
of various tensor decomposition algorithms and discuss parallel and communication
optimal approaches for those computations.
Tensor notations
Use of tensors in data analytics and quantum computing
Matricized tensor times Khatri-Rao product (MTTKRP) computations
Multiple tensor times matrix computations
Part 3: Guest lectures on related popular topics
We will have guest lectures from experts on related topics.
Memory-aware scheduling
Sparse matrix and tensor computations
Hierarchical and block low-rank matrices
Randomized methods to improve algorithmic costs
In the course, we will also look at several interesting research projects
related to matrix/tensor computations in high performance computing, machine learning,
data analytics, molecular simulations, and quantum computing.
Prerequisite
Experience with algorithm analysis and mathematical optimization will be helpful,
but not required.
Lecture notes
Lecture 1 (Sep 12): Overview of the course slides
and sequential matrix multiplication slides. Suggested reading pdf.
Assignment 1 is out with two questions (please see the slides).
Lecture 2 (Sep 14): Costs of some MPI routines slides.
A C++ program that practically analyzes all loop permutations for the sequential matrix
multiplication code.
Lecture 3 (Sep 19): Parallel matrix multiplication slides. Suggested reading pdf.
Assignment 2 is out (please see the slides), and it is due by Sep 28.
Lecture 5 (Sep 26): Symmetric computations slides. Suggested reading pdf.
Assignment 3 is out (please see the slides), and it is due by Oct 5.
Lecture 6 (Sep 28): Introduction to tensors slides.
Suggested reading pdf. Assignment 4 is out (please see the slides),
and it is due by Oct 10. If the unfoldings of the specified tensor are not very clear, please have a look at all
three unfoldings here.
Lecture 7 (Oct 3): Low rank approximations of tensors slides.
Suggested reading for Tensor train decomposition pdf.
The research topics/articles for the course project are out (please see the slides and inform me of your choices).
Lecture 8 (Oct 5): Guest lecture by Loris Marchal on Memory-aware scheduling slides.
Lecture 9 (Oct 10): Guest lecture by Grégoire Pichon on Low-Rank compression in sparse
solvers slides.
Lecture 11 (Oct 17): Multiple tensor times matrix computations slides.
Suggested reading pdf. Final assignment is out (please see the slides), and it is due by Oct 26.
Lecture 12 (Oct 19): Guest lecture by Bora Uçar on sparse matrices and
bipartite graphs slides.
Lecture 13 (Oct 24): Remaining portions of Multiple tensor times matrix computations.
Lecture 14 (Nov 7): Implementation of CP/Tucker/Tensor train decompositions in Matlab/Octave files.
Lectures 15-16 (Nov 9): Project presentations and discussion.
Acknowledgement: I am extremely thankful to Grey Ballard and Loris Marchal for sharing source files of slides of their courses. I am using some of their slides in this course.
Evaluation
The evaluation will be based on the following weightings:
5 homework assignments (best 3 will be counted): 40%
Project: 60%
Course project for the final evaluation
Each student, or a group of two students, will prepare a 5-6 page report and give a presentation
on one of the topics/articles below. The project report is due by Nov 6.
Research topics
Communication costs of a specific matrix factorization (LU, QR, TSQR, RRQR, ...)
Use of tensors in a particular domain, for example, neuroscience, data analysis, molecular simulations, quantum computing, face recognition, ...
Tensor train representation for the weight matrices of fully connected layers (Tensorizing Neural Networks). Presentation on Nov 9 at 16h50 by Maxime Just.
If you want to propose another topic/article, I would be happy to discuss it with you.