Classes
struct	KernelConfig

Functions
template<class T, class Abi, KernelConfig Conf, index_t RowsReg, StorageOrder OA, StorageOrder OD>
void	trtri_copy_microkernel (uview< const T, Abi, OA > A, uview< T, Abi, OD > D, index_t k) noexcept
template<class T, class Abi, KernelConfig Conf, index_t RowsReg, index_t ColsReg, StorageOrder OD>
void	trmm_microkernel (uview< const T, Abi, OD > Dr, uview< T, Abi, OD > D, index_t k) noexcept
template<class T, class Abi, KernelConfig Conf, StorageOrder OA, StorageOrder OD>
void	trtri_copy_register (view< const T, Abi, OA > A, view< T, Abi, OD > D) noexcept

Variables
template<class T, class Abi>
constexpr index_t	ColsReg = RowsReg<T, Abi>
template<class T, class Abi, KernelConfig Conf, StorageOrder OA, StorageOrder OD>
const constinit auto	trtri_copy_lut
template<class T, class Abi, KernelConfig Conf, StorageOrder OD>
const constinit auto	trmm_lut
template<class T, class Abi>
constexpr index_t	RowsReg
	Register block size of the matrix-matrix multiplication micro-kernels.

Class Documentation

◆ batmat::linalg::micro_kernels::trtri::KernelConfig

struct batmat::linalg::micro_kernels::trtri::KernelConfig

Class Members
MatrixStructure	struc = MatrixStructure::LowerTriangular

Function Documentation

◆ trtri_copy_microkernel()

template<class T, class Abi, KernelConfig Conf, index_t RowsReg, StorageOrder OA, StorageOrder OD>

void batmat::linalg::micro_kernels::trtri::trtri_copy_microkernel	(	const uview< const T, Abi, OA >	A,
		const uview< T, Abi, OD >	D,
		const index_t	k )

noexcept

Parameters

A	k×RowsReg.
D	k×RowsReg.
k	Number of rows in A and D.

Inverts the top block of A and stores it in the top block of D, then multiplies the bottom blocks of D by this block (on the right).

Definition at line 19 of file trtri.tpp.
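
The two steps described above can be illustrated with a plain scalar reference. This is only a sketch under stated assumptions, not the library's kernel: it ignores batching and SIMD, uses row-major std::vector<double> storage instead of uview, and the names trtri_copy_reference and n (standing in for RowsReg) are hypothetical. It also assumes that the bottom rows of A are the input to the product and that no extra sign is applied to the bottom block; this page does not state either detail.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical scalar reference, not the library's micro-kernel.
// A and D are row-major k×n matrices; the top n×n block of A is assumed
// to be lower triangular with a nonzero diagonal.
inline void trtri_copy_reference(const std::vector<double> &A,
                                 std::vector<double> &D,
                                 std::size_t k, std::size_t n) {
    assert(k >= n && A.size() == k * n && D.size() == k * n);
    // Step 1: D_top = inv(A_top), by forward substitution, one column at a time.
    for (std::size_t j = 0; j < n; ++j) {
        for (std::size_t i = 0; i < n; ++i) {
            if (i < j) { // strictly upper part of the inverse is zero
                D[i * n + j] = 0.0;
                continue;
            }
            double s = (i == j) ? 1.0 : 0.0;
            for (std::size_t l = j; l < i; ++l)
                s -= A[i * n + l] * D[l * n + j];
            D[i * n + j] = s / A[i * n + i];
        }
    }
    // Step 2 (assumption): D_bottom = A_bottom * D_top, i.e. the inverted
    // block multiplies the remaining rows on the right.
    for (std::size_t i = n; i < k; ++i) {
        for (std::size_t j = 0; j < n; ++j) {
            double s = 0.0;
            // D_top is lower triangular, so entries with l < j are zero.
            for (std::size_t l = j; l < n; ++l)
                s += A[i * n + l] * D[l * n + j];
            D[i * n + j] = s;
        }
    }
}
```

A reference like this is mainly useful for checking a vectorized kernel against known values on small inputs.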

◆ trmm_microkernel()

template<class T, class Abi, KernelConfig Conf, index_t RowsReg, index_t ColsReg, StorageOrder OD>

void batmat::linalg::micro_kernels::trtri::trmm_microkernel	(	const uview< const T, Abi, OD >	Dr,
		const uview< T, Abi, OD >	D,
		const index_t	k )

noexcept

Parameters

Dr	RowsReg×k lower trapezoidal
D	k×ColsReg
k	Number of rows in D.

Computes the product Dr·D and stores the result in the bottom block of D.

Definition at line 98 of file trtri.tpp.

◆ trtri_copy_register()

template<class T, class Abi, KernelConfig Conf, StorageOrder OA, StorageOrder OD>

void batmat::linalg::micro_kernels::trtri::trtri_copy_register	(	view< const T, Abi, OA >	A,
		view< T, Abi, OD >	D )

noexcept

Definition at line 132 of file trtri.tpp.

Variable Documentation

◆ ColsReg

template<class T, class Abi>

index_t batmat::linalg::micro_kernels::trtri::ColsReg = RowsReg<T, Abi>

constexpr

Definition at line 26 of file trtri.hpp.

◆ trtri_copy_lut

template<class T, class Abi, KernelConfig Conf, StorageOrder OA, StorageOrder OD>

const constinit auto batmat::linalg::micro_kernels::trtri::trtri_copy_lut

inline constinit

Initial value:

= make_1d_lut<RowsReg<T, Abi>>([]<index_t Row>(index_constant<Row>) {
    return trtri_copy_microkernel<T, Abi, Conf, Row + 1, OA, OD>;
})

Definition at line 29 of file trtri.hpp.
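
The initializer above builds a table with one kernel instantiation per row count, so that a run-time size can select a compile-time specialization. The following is a minimal sketch of that pattern, not the library's make_1d_lut: make_lut_sketch, kernel_sketch, kernel_fn, lut_sketch and dispatch_sketch are all hypothetical names, and a generic lambda is used in place of the templated lambda so that the sketch also compiles as C++17.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <type_traits>
#include <utility>

// Hypothetical stand-in for a micro-kernel templated on its row count.
template <std::size_t Rows>
std::size_t kernel_sketch(std::size_t k) noexcept {
    return Rows * k; // placeholder body: reports rows × k
}

using kernel_fn = std::size_t (*)(std::size_t) noexcept;

// Hypothetical helper: calls f with integral_constant<0> ... <N-1>
// and collects the returned function pointers in an array.
template <class F, std::size_t... Is>
constexpr auto make_lut_impl(F f, std::index_sequence<Is...>) {
    return std::array<kernel_fn, sizeof...(Is)>{
        f(std::integral_constant<std::size_t, Is>{})...};
}

template <std::size_t N, class F>
constexpr auto make_lut_sketch(F f) {
    return make_lut_impl(f, std::make_index_sequence<N>{});
}

// Entry i holds the kernel instantiated for i + 1 rows, mirroring the
// `Row + 1` in the initializer shown above.
inline constexpr auto lut_sketch = make_lut_sketch<4>([](auto row) {
    return kernel_fn{kernel_sketch<decltype(row)::value + 1>};
});

// Run-time dispatch: pick the specialization for `rows` (1-based).
inline std::size_t dispatch_sketch(std::size_t rows, std::size_t k) noexcept {
    assert(rows >= 1 && rows <= lut_sketch.size());
    return lut_sketch[rows - 1](k);
}
```

Because the table is a constant expression, the lookup costs one indexed load and an indirect call, and every specialization is still compiled with its block size known at compile time.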

◆ trmm_lut

template<class T, class Abi, KernelConfig Conf, StorageOrder OD>

const constinit auto batmat::linalg::micro_kernels::trtri::trmm_lut

inline constinit

Initial value:

= make_2d_lut<RowsReg<T, Abi>, ColsReg<T, Abi>>(
    []<index_t Row, index_t Col>(index_constant<Row>, index_constant<Col>) {
        return trmm_microkernel<T, Abi, Conf, Row + 1, Col + 1, OD>;
    })

Definition at line 35 of file trtri.hpp.

◆ RowsReg

template<class T, class Abi>

index_t batmat::linalg::micro_kernels::gemm::RowsReg

inline constexpr

Register block size of the matrix-matrix multiplication micro-kernels.

AVX-512 has 32 vector registers; we use 25 of them for a 5×5 accumulator block of matrix C, leaving some registers for loading A and B.

AVX2 has 16 vector registers; we use 9 of them for a 3×3 accumulator block of matrix C, leaving some registers for loading A and B.

Note: A block size of 4×4 is slightly faster than 3×3 for large matrices, because the even block size results in full cache lines being consumed. For small matrices, 3×3 is faster because it does not spill any registers in the micro-kernels. 2×2 is slower than 3×3 for both small and large matrices (tested using GCC 15.1 on an i7-10750H).

Assuming that the platform has at least 16 vector registers, we use 9 of them for a 3×3 accumulator block of matrix C, leaving some registers for loading A and B.

NEON has 32 vector registers; we use 16 of them for a 4×4 accumulator block of matrix C, leaving plenty of registers for loading A and B.

Definition at line 13 of file avx-512.hpp.
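
The register counts quoted above follow from simple arithmetic: an n×n accumulator block occupies n² vector registers, and whatever remains is available for loading A and B. The sketch below checks the quoted figures; spare_registers is a hypothetical helper, not part of the library.

```cpp
// Hypothetical helper: vector registers left over after reserving an
// n×n accumulator block (one register per accumulator).
constexpr int spare_registers(int total_registers, int block_size) {
    return total_registers - block_size * block_size;
}

static_assert(spare_registers(32, 5) == 7, "AVX-512, 5x5 block");
static_assert(spare_registers(16, 3) == 7, "AVX2 or 16-register platform, 3x3 block");
static_assert(spare_registers(32, 4) == 16, "NEON, 4x4 block");
// A 4×4 block on 16 registers leaves none free for loads, which is
// consistent with the register spilling mentioned in the note above.
static_assert(spare_registers(16, 4) == 0, "16 registers, 4x4 block");
```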
