# SGEMM Optimization


English | 简体中文

A compact CUDA SGEMM learning project that walks from a readable baseline kernel to Tensor Core WMMA, with cuBLAS verification and a CMake-first build.
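Every kernel in the ladder computes the same operation: single-precision `C = A * B` for row-major matrices, with `A` of shape M×K, `B` of shape K×N, and `C` of shape M×N. As an illustration (not the project's actual test code), the CPU reference that a baseline kernel mirrors looks like this:

```cpp
#include <cstddef>

// Reference SGEMM: C = A * B, row-major, no alpha/beta scaling.
// A is M x K, B is K x N, C is M x N.
void sgemm_reference(const float* A, const float* B, float* C,
                     std::size_t M, std::size_t K, std::size_t N) {
    for (std::size_t i = 0; i < M; ++i) {
        for (std::size_t j = 0; j < N; ++j) {
            float acc = 0.0f;
            for (std::size_t k = 0; k < K; ++k) {
                acc += A[i * K + k] * B[k * N + j];
            }
            C[i * N + j] = acc;
        }
    }
}
```

The naive CUDA kernel assigns one thread per output element and runs the inner `k` loop unchanged; each later rung of the ladder reorganizes this same computation for memory reuse.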

## What makes it useful

- One optimization ladder: naive -> tiled -> bank-conflict-free -> double-buffered -> Tensor Core.
- Comparable kernel interfaces: every FP32 kernel uses the same `(A, B, C, M, K, N, stream)` launcher shape.
- Verification-first harness: kernel output is checked against cuBLAS, with separate tolerances for the FP32 and Tensor Core paths.
- Learning-oriented docs: the full walkthrough lives on GitHub Pages instead of being duplicated in the README.
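The verification step above amounts to a max-relative-error check against the cuBLAS result, with a looser bound for the WMMA path because Tensor Cores consume FP16 inputs. A minimal sketch of that check — the tolerance values and function name here are illustrative placeholders, not the project's actual thresholds or API:

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>

// Illustrative tolerances: FP32 kernels accumulate in FP32, so they
// stay close to cuBLAS; WMMA kernels use FP16 inputs and get a
// looser bound. These are placeholder values, not the project's.
constexpr float kTolFp32       = 1e-4f;
constexpr float kTolTensorCore = 5e-2f;

// Returns true if every element of `got` is within `tol` relative
// error of the reference result `ref` (e.g. from cublasSgemm).
bool verify_against_reference(const float* got, const float* ref,
                              std::size_t n, float tol) {
    for (std::size_t i = 0; i < n; ++i) {
        // Guard against division by tiny reference values.
        float denom = std::max(std::fabs(ref[i]), 1.0f);
        if (std::fabs(got[i] - ref[i]) / denom > tol) return false;
    }
    return true;
}
```

Keeping the tolerance a parameter lets the same harness validate both kernel families without duplicating the comparison loop.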

## Quick start

```shell
git clone https://github.com/LessUp/sgemm-optimization.git
cd sgemm-optimization

cmake -S . -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build -j$(nproc)
./build/bin/sgemm_benchmark -a
ctest --test-dir build
```

Runtime tests and benchmarks require a CUDA-capable local machine. Hosted CI is limited to compile-time, formatting, repository-structure, OpenSpec, and Pages checks.

## Start here

| Goal | Entry point |
| --- | --- |
| Use the project site | GitHub Pages |
| Build and run once | Getting Started |
| Follow the kernel ladder | Learning Path |
| Inspect the source layout | Architecture |
| Read the normative specs | Specifications |

## Source map

```
src/kernels/   CUDA SGEMM implementations
src/utils/     CUDA RAII, verification, benchmark helpers
src/main.cu    benchmark CLI
tests/         Google Test coverage against cuBLAS
openspec/      stable specs and change workflow
docs/          learning documentation mirrored on Pages
```

## License

MIT. See LICENSE.md.