batmat 0.0.18
Batched linear algebra routines
batmat::linalg::micro_kernels::gemm_diag Namespace Reference

Classes

struct  KernelConfig

Functions

template<class T, class Abi, KernelConfig Conf, index_t RowsReg, index_t ColsReg, StorageOrder OA, StorageOrder OB, StorageOrder OC, StorageOrder OD>
std::conditional_t< Conf.track_zeros, std::pair< index_t, index_t >, void > gemm_diag_copy_microkernel (const uview< const T, Abi, OA > A, const uview< const T, Abi, OB > B, const std::optional< uview< const T, Abi, OC > > C, const uview< T, Abi, OD > D, const uview_vec< const T, Abi > d, const index_t k) noexcept
 Generalized matrix multiplication D = C ± A⁽ᵀ⁾ diag(d) B⁽ᵀ⁾. Single register block.
template<class T, class Abi, KernelConfig Conf, StorageOrder OA, StorageOrder OB, StorageOrder OC, StorageOrder OD>
void gemm_diag_copy_register (const view< const T, Abi, OA > A, const view< const T, Abi, OB > B, const std::optional< view< const T, Abi, OC > > C, const view< T, Abi, OD > D, view< const T, Abi > d) noexcept
 Generalized matrix multiplication D = C ± A⁽ᵀ⁾ diag(d) B⁽ᵀ⁾. Using register blocking.

Variables

template<class T, class Abi>
constexpr index_t ColsReg = RowsReg<T, Abi>
template<class T, class Abi, KernelConfig Conf, StorageOrder OA, StorageOrder OB, StorageOrder OC, StorageOrder OD>
const constinit auto gemm_diag_copy_lut
template<MatrixStructure Struc>
constexpr auto first_column
template<index_t ColsReg, MatrixStructure Struc>
constexpr auto last_column
template<class T, class Abi>
constexpr index_t RowsReg
 Register block size of the matrix-matrix multiplication micro-kernels.

Class Documentation

◆ batmat::linalg::micro_kernels::gemm_diag::KernelConfig

struct batmat::linalg::micro_kernels::gemm_diag::KernelConfig
Class Members
bool negate = false
bool track_zeros = false
MatrixStructure struc_C = MatrixStructure::General

Function Documentation

◆ gemm_diag_copy_microkernel()

template<class T, class Abi, KernelConfig Conf, index_t RowsReg, index_t ColsReg, StorageOrder OA, StorageOrder OB, StorageOrder OC, StorageOrder OD>
std::conditional_t< Conf.track_zeros, std::pair< index_t, index_t >, void > batmat::linalg::micro_kernels::gemm_diag::gemm_diag_copy_microkernel ( uview< const T, Abi, OA > A,
uview< const T, Abi, OB > B,
std::optional< uview< const T, Abi, OC > > C,
uview< T, Abi, OD > D,
uview_vec< const T, Abi > diag,
index_t k )
noexcept

Generalized matrix multiplication D = C ± A⁽ᵀ⁾ diag(d) B⁽ᵀ⁾. Single register block.

Definition at line 35 of file gemm-diag.tpp.
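
To make the formula D = C ± A diag(d) B concrete, here is a scalar reference sketch. It is not the vectorized micro-kernel: Mat and gemm_diag_reference are hypothetical names, the transposed variants are left out, and treating an absent C as zero is an assumption about the optional argument.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <vector>

// Hypothetical dense row-major matrix for this sketch only
// (not batmat's view or uview types).
struct Mat {
    std::size_t rows, cols;
    std::vector<double> data;
    Mat(std::size_t r, std::size_t c) : rows(r), cols(c), data(r * c, 0.0) {}
    double &operator()(std::size_t i, std::size_t j) { return data[i * cols + j]; }
    double operator()(std::size_t i, std::size_t j) const { return data[i * cols + j]; }
};

// Scalar reference for D = C ± A diag(d) B.
// A is m×k, B is k×n, d has k entries, C (if present) and D are m×n.
// Assumption: a missing C is treated as the zero matrix.
inline void gemm_diag_reference(const Mat &A, const Mat &B, const std::optional<Mat> &C,
                                Mat &D, const std::vector<double> &d, bool negate) {
    assert(A.cols == B.rows && d.size() == A.cols);
    assert(D.rows == A.rows && D.cols == B.cols);
    for (std::size_t i = 0; i < D.rows; ++i) {
        for (std::size_t j = 0; j < D.cols; ++j) {
            double acc = 0.0;
            for (std::size_t l = 0; l < A.cols; ++l)
                acc += A(i, l) * d[l] * B(l, j); // scale the l-th rank-one term by d[l]
            const double c = C ? (*C)(i, j) : 0.0;
            D(i, j) = negate ? c - acc : c + acc;
        }
    }
}
```

With B equal to the identity, the result reduces to A with its columns scaled by d, which gives a quick sanity check.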

◆ gemm_diag_copy_register()

template<class T, class Abi, KernelConfig Conf, StorageOrder OA, StorageOrder OB, StorageOrder OC, StorageOrder OD>
void batmat::linalg::micro_kernels::gemm_diag::gemm_diag_copy_register ( view< const T, Abi, OA > A,
view< const T, Abi, OB > B,
std::optional< view< const T, Abi, OC > > C,
view< T, Abi, OD > D,
view< const T, Abi > diag )
noexcept

Generalized matrix multiplication D = C ± A⁽ᵀ⁾ diag(d) B⁽ᵀ⁾. Using register blocking.

Definition at line 108 of file gemm-diag.tpp.
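
Register blocking means the output D is processed in tiles small enough for their accumulators to stay in vector registers. The sketch below only shows how an m×n output could be split into such tiles, with smaller remainder tiles at the edges. Tile, register_tiles and the loop order are assumptions for illustration, not the traversal used in gemm-diag.tpp.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

using index_t = std::ptrdiff_t; // assumption: a signed index type

// Hypothetical description of one register block of the output matrix.
struct Tile {
    index_t row, col;   // top-left corner of the block in D
    index_t rows, cols; // block size, at most R×C
};

// Split an m×n output into blocks of at most R×C entries.
// Blocks on the bottom and right edges may be smaller.
inline std::vector<Tile> register_tiles(index_t m, index_t n, index_t R, index_t C) {
    assert(R > 0 && C > 0);
    std::vector<Tile> tiles;
    for (index_t j = 0; j < n; j += C)
        for (index_t i = 0; i < m; i += R)
            tiles.push_back({i, j, std::min(R, m - i), std::min(C, n - j)});
    return tiles;
}
```

For a 7×5 output with 3×3 blocks this produces six tiles, and the last one covers the 1×2 corner. Each tile would then be handed to a kernel specialized for its size.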

Variable Documentation

◆ ColsReg

template<class T, class Abi>
index_t batmat::linalg::micro_kernels::gemm_diag::ColsReg = RowsReg<T, Abi>
constexpr

Definition at line 36 of file gemm-diag.hpp.

◆ gemm_diag_copy_lut

template<class T, class Abi, KernelConfig Conf, StorageOrder OA, StorageOrder OB, StorageOrder OC, StorageOrder OD>
const constinit auto batmat::linalg::micro_kernels::gemm_diag::gemm_diag_copy_lut
inline constinit
Initial value:
[]<index_t Row, index_t Col>(index_constant<Row>, index_constant<Col>) {
})
Symbols referenced in the initializer:
consteval auto make_2d_lut(F f): returns a 2D array. Definition at line 25 of file lut.hpp.
gemm_diag_copy_microkernel(): single register block kernel. Definition at line 35 of file gemm-diag.tpp.
index_constant: alias of std::integral_constant< index_t, I >. Definition at line 10 of file lut.hpp.

Definition at line 16 of file gemm-diag.tpp.
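
The lookup table appears to map a (Row, Col) pair to a kernel instantiated for that block size, so the size of a remainder block can be chosen at run time while the kernel itself is still specialized at compile time. The following sketch builds such a table of function pointers by hand. It does not use batmat's make_2d_lut: block_area_kernel, make_row, make_lut, sketch_lut, dispatch and the "index plus one" size mapping are all hypothetical.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <utility>

using index_t = std::ptrdiff_t; // assumption: a signed index type

// Hypothetical kernel, specialized at compile time on its block size.
template <index_t Rows, index_t Cols>
constexpr index_t block_area_kernel() { return Rows * Cols; }

using kernel_fn = index_t (*)();

// One table row: kernels for a fixed row count and 1..MaxCols columns.
template <std::size_t MaxCols, std::size_t Row, std::size_t... Cs>
constexpr std::array<kernel_fn, MaxCols> make_row(std::index_sequence<Cs...>) {
    return {{&block_area_kernel<static_cast<index_t>(Row + 1),
                                static_cast<index_t>(Cs + 1)>...}};
}

// Full table: entry [r][c] holds the kernel for an (r+1)×(c+1) block.
template <std::size_t MaxRows, std::size_t MaxCols, std::size_t... Rs>
constexpr std::array<std::array<kernel_fn, MaxCols>, MaxRows>
make_lut(std::index_sequence<Rs...>) {
    return {{make_row<MaxCols, Rs>(std::make_index_sequence<MaxCols>{})...}};
}

inline constexpr auto sketch_lut = make_lut<3, 3>(std::make_index_sequence<3>{});

// Run-time dispatch to the kernel compiled for a rows×cols block.
inline index_t dispatch(index_t rows, index_t cols) {
    assert(rows >= 1 && rows <= 3 && cols >= 1 && cols <= 3);
    return sketch_lut[rows - 1][cols - 1]();
}
```

Here dispatch(2, 3) calls block_area_kernel<2, 3>. A real table would hold pointers to kernels with the full argument list instead of this toy function.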

◆ first_column

template<MatrixStructure Struc>
auto batmat::linalg::micro_kernels::gemm_diag::first_column
inline constexpr
Initial value:
= [](index_t row_index) { return Struc == MatrixStructure::UpperTriangular ? row_index : 0; }

Definition at line 22 of file gemm-diag.tpp.

◆ last_column

template<index_t ColsReg, MatrixStructure Struc>
auto batmat::linalg::micro_kernels::gemm_diag::last_column
inline constexpr
Initial value:
= [](index_t row_index) {
    return Struc == MatrixStructure::LowerTriangular ? std::min(row_index, ColsReg - 1)
                                                     : ColsReg - 1;
}

Definition at line 26 of file gemm-diag.tpp.
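
Taken together, first_column and last_column give the inclusive range of columns touched in each row of a register block for a given matrix structure. The sketch below reproduces the two initial values from this page with a local stand-in for MatrixStructure and counts the entries they cover. The enum definition, the index_t alias and the helper stored_entries are assumptions for illustration.

```cpp
#include <algorithm>
#include <cstddef>

using index_t = std::ptrdiff_t; // assumption: a signed index type

// Local stand-in for batmat's MatrixStructure, limited to the
// enumerators named on this page.
enum class MatrixStructure { General, LowerTriangular, UpperTriangular };

// First column touched in a given row (inclusive).
template <MatrixStructure Struc>
inline constexpr auto first_column =
    [](index_t row_index) { return Struc == MatrixStructure::UpperTriangular ? row_index : 0; };

// Last column touched in a given row (inclusive).
template <index_t ColsReg, MatrixStructure Struc>
inline constexpr auto last_column = [](index_t row_index) {
    return Struc == MatrixStructure::LowerTriangular ? std::min(row_index, ColsReg - 1)
                                                     : ColsReg - 1;
};

// Hypothetical helper: number of entries covered in the first `rows` rows.
template <index_t ColsReg, MatrixStructure Struc>
constexpr index_t stored_entries(index_t rows) {
    index_t count = 0;
    for (index_t i = 0; i < rows; ++i)
        count += last_column<ColsReg, Struc>(i) - first_column<Struc>(i) + 1;
    return count;
}
```

For a 3×3 block this covers all 9 entries in the general case and 6 entries for either triangular structure.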

◆ RowsReg

template<class T, class Abi>
index_t batmat::linalg::micro_kernels::gemm::RowsReg
inline constexpr

Register block size of the matrix-matrix multiplication micro-kernels.

AVX-512 has 32 vector registers; we use 25 registers for a 5×5 accumulator block of matrix C, leaving some registers for loading A and B.

AVX2 has 16 vector registers; we use 9 registers for a 3×3 accumulator block of matrix C, leaving some registers for loading A and B.

Note
A block size of 4×4 is slightly faster than 3×3 for large matrices, because the even block size results in full cache lines being consumed. For small matrices, 3×3 is faster because it does not spill any registers in the micro-kernels. 2×2 is slower than 3×3 for both small and large matrices (tested using GCC 15.1 on an i7-10750H).

Assumes that the platform has at least 16 vector registers; we use 9 registers for a 3×3 accumulator block of matrix C, leaving some registers for loading A and B.

NEON has 32 vector registers; we use 16 registers for a 4×4 accumulator block of matrix C, leaving plenty of registers for loading A and B.

Definition at line 13 of file avx-512.hpp.
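
The register budgets quoted for RowsReg can be checked with simple arithmetic: a square block of size R needs R×R accumulator registers, and whatever remains is available for loading A and B. The helper spare_registers below is hypothetical and only restates the numbers given in the description.

```cpp
#include <cstddef>

using index_t = std::ptrdiff_t; // assumption: a signed index type

// Registers left over after reserving a block×block accumulator tile.
constexpr index_t spare_registers(index_t total_registers, index_t block) {
    return total_registers - block * block;
}

static_assert(spare_registers(32, 5) == 7);  // AVX-512: 5×5 block, 25 accumulators
static_assert(spare_registers(16, 3) == 7);  // AVX2 or any 16-register platform: 3×3 block
static_assert(spare_registers(32, 4) == 16); // NEON: 4×4 block, 16 accumulators
static_assert(spare_registers(16, 4) == 0);  // 4×4 on 16 registers leaves none for A and B
```

The last line is consistent with the note that a 3×3 block avoids register spills on a 16-register target, since a 4×4 block would leave no register free for operands.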