English | 简体中文
A compact CUDA SGEMM learning project that walks from a readable baseline kernel to Tensor Core WMMA, with cuBLAS verification and a CMake-first build.
- One optimization ladder: naive -> tiled -> bank-conflict-free -> double-buffer -> Tensor Core.
- Comparable kernel interfaces: every FP32 kernel uses the same `(A, B, C, M, K, N, stream)` launcher shape.
- Verification-first harness: kernel output is checked against cuBLAS, with separate tolerances for the FP32 and Tensor Core paths.
- Learning-oriented docs: GitHub Pages carries the full walkthrough instead of duplicating it in the README.
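As a hedged sketch of the bottom rung of the ladder and the shared launcher shape (names, block sizes, and signatures here are illustrative, not the repository's actual code):

```cuda
#include <cuda_runtime.h>

// Baseline SGEMM: C = A * B, row-major, one thread per output element.
// A is M x K, B is K x N, C is M x N.
__global__ void sgemm_naive(const float* A, const float* B, float* C,
                            int M, int K, int N) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < M && col < N) {
        float acc = 0.0f;
        for (int k = 0; k < K; ++k)
            acc += A[row * K + k] * B[k * N + col];
        C[row * N + col] = acc;
    }
}

// Every FP32 kernel in the ladder exposes this same launcher shape,
// so the benchmark can swap implementations behind one signature.
void launch_sgemm_naive(const float* A, const float* B, float* C,
                        int M, int K, int N, cudaStream_t stream) {
    dim3 block(16, 16);  // illustrative thread tile
    dim3 grid((N + block.x - 1) / block.x, (M + block.y - 1) / block.y);
    sgemm_naive<<<grid, block, 0, stream>>>(A, B, C, M, K, N);
}
```

Each later rung (tiled, bank-conflict-free, double-buffered) keeps this launcher shape and changes only how the kernel stages data through shared memory.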
```sh
git clone https://github.com/LessUp/sgemm-optimization.git
cd sgemm-optimization
cmake -S . -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build -j$(nproc)
./build/bin/sgemm_benchmark -a
ctest --test-dir build
```

Runtime tests and benchmarks require a CUDA-capable local machine. Hosted CI is limited to compile-time, formatting, repository-structure, OpenSpec, and Pages checks.
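The cuBLAS comparison can be pictured as an element-wise relative-error check with a path-dependent tolerance. A minimal host-side sketch, assuming illustrative tolerance values (not the project's actual settings):

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Compare a kernel's output against a reference result (e.g. from cuBLAS).
// Tensor Core paths accumulate in reduced precision, so they get a looser bound.
bool matches_reference(const std::vector<float>& out,
                       const std::vector<float>& ref,
                       bool tensor_core_path) {
    const float tol = tensor_core_path ? 1e-2f : 1e-4f;  // illustrative values
    for (std::size_t i = 0; i < out.size(); ++i) {
        // Guard the denominator so near-zero reference entries don't blow up.
        float denom = std::max(std::fabs(ref[i]), 1.0f);
        if (std::fabs(out[i] - ref[i]) / denom > tol) return false;
    }
    return true;
}
```

The same harness runs both paths; only the tolerance passed in differs.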
| Goal | Entry point |
|---|---|
| Use the project site | GitHub Pages |
| Build and run once | Getting Started |
| Follow the kernel ladder | Learning Path |
| Inspect the source layout | Architecture |
| Read the normative specs | Specifications |
```
src/kernels/   CUDA SGEMM implementations
src/utils/     CUDA RAII, verification, and benchmark helpers
src/main.cu    benchmark CLI
tests/         Google Test coverage against cuBLAS
docs/          learning documentation mirrored on Pages
openspec/      stable specs and change workflow
```
MIT. See LICENSE.md.