… update C++ operator interface Signed-off-by: Abhishek <abhi.dtu11@gmail.com>
for more information, see https://pre-commit.ci
Greptile Summary
This PR adds variable-shape support to the grouped MXFP8 swizzle kernel, replacing the previous hard error on non-uniform …
Confidence Score: 4/5
Safe to merge after resolving the unchecked CUDA API calls (flagged in a prior thread) that can silently produce a zero-block launch.
Prior review threads identified a P1 issue (unchecked cudaGetDevice/cudaOccupancyMaxActiveBlocksPerMultiprocessor leaving persistent_blocks at zero on failure, silently doing nothing). That issue appears unaddressed in this head SHA. New findings in this review are P2-only (dead tensor_id variable, shadowed constexpr declarations inside the lambda). The P1 ceiling caps confidence at 4.
transformer_engine/common/swizzle/swizzle.cu: the persistent-grid launch block in the variable-shape path needs error-checked CUDA occupancy API calls.
Important Files Changed
Flowchart
%%{init: {'theme': 'neutral'}}%%
flowchart TD
A[swizzle_grouped_scaling_factors] --> B{is_variable_shape?}
B -- No --> C[Uniform shape fast path\nper-tensor 3D grid launch]
B -- Yes --> D[Variable shape path]
D --> E[Compute persistent_blocks\nvia cudaOccupancyMaxActiveBlocksPerMultiprocessor]
E --> F[Launch persistent kernel\npersistent_blocks x TB_DIM squared]
F --> G[Warp 0 computes total_blocks\nvia warp reduction into shared mem]
G --> H[Persistent loop over linear_block_id]
H --> I[Linear scan over tensors\nto resolve tensor_id and scale base offset]
I --> J{rowwise?}
J -- Yes --> K[swizzle_row_scaling_kernel_impl\nvec_load_size in 1 2 4]
J -- No --> L[swizzle_col_scaling_kernel_impl\nvec_load_size in 1 2 4]
K --> H
L --> H
Reviews (2): Last reviewed commit: "[pre-commit.ci] auto fixes from pre-comm..."
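The "persistent loop" and "linear scan over tensors" steps in the flowchart can be modelled on the host. The following is a minimal C++ sketch, not the kernel from the PR: `BlockLocation`, `resolve_block`, and `blocks_for` are hypothetical names, and the per-tensor block counts are assumed to be known up front. It only illustrates how a linear block ID maps to a tensor and a local block, with `tensor_id == -1` marking the end of the work.

```cpp
#include <cstddef>
#include <vector>

struct BlockLocation {
  int tensor_id;    // -1 means the linear ID lies past the last tensor
  int local_block;  // block index within the resolved tensor
};

// Linear scan: walk the tensors, accumulating block counts, until the
// half-open range [first_block, end_block) contains linear_block_id.
// Precondition: linear_block_id >= 0.
BlockLocation resolve_block(const std::vector<int>& blocks_per_tensor, int linear_block_id) {
  int first_block = 0;
  for (std::size_t t = 0; t < blocks_per_tensor.size(); ++t) {
    const int end_block = first_block + blocks_per_tensor[t];
    if (linear_block_id < end_block) {
      return {static_cast<int>(t), linear_block_id - first_block};
    }
    first_block = end_block;
  }
  return {-1, 0};
}

// Persistent loop: block `block_idx` of a grid of `grid_size` blocks handles
// linear IDs block_idx, block_idx + grid_size, ... until the scan reports -1.
// Precondition: grid_size > 0 (otherwise the loop would never advance).
std::vector<BlockLocation> blocks_for(int block_idx, int grid_size,
                                      const std::vector<int>& blocks_per_tensor) {
  std::vector<BlockLocation> work;
  for (int id = block_idx;; id += grid_size) {
    const BlockLocation loc = resolve_block(blocks_per_tensor, id);
    if (loc.tensor_id < 0) break;
    work.push_back(loc);
  }
  return work;
}
```

The scan costs O(num_tensors) per block ID, which is cheap for small groups; a precomputed prefix sum with binary search would be an alternative for large groups.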
| int device_id; | ||
| cudaGetDevice(&device_id); | ||
| int num_SMs; | ||
| cudaDeviceGetAttribute(&num_SMs, cudaDevAttrMultiProcessorCount, device_id); | ||
| // Find out how many blocks of this specific kernel can fit on one SM | ||
| int max_active_blocks_per_sm; | ||
| cudaOccupancyMaxActiveBlocksPerMultiprocessor( | ||
| &max_active_blocks_per_sm, | ||
| grouped_swizzle_scaling_variable_shape_kernel<SF_TILE_DIM_M, SF_TILE_DIM_K>, | ||
| TB_DIM * TB_DIM, // block size | ||
| max_slm_size // dynamic shared memory | ||
| ); | ||
| int persistent_blocks = num_SMs * max_active_blocks_per_sm; |
Unchecked CUDA API calls can silently produce zero-block launches
cudaGetDevice, cudaDeviceGetAttribute, and cudaOccupancyMaxActiveBlocksPerMultiprocessor are all called without NVTE_CHECK_CUDA. If any of these fail, max_active_blocks_per_sm is left with an indeterminate (or zero) value, which can make persistent_blocks = 0. A launch with 0 blocks runs no work; CUDA reports it as an invalid launch configuration, but that error only surfaces if the launch status is checked afterwards, so otherwise the output buffer stays uninitialized with no error raised.
| int device_id; | |
| NVTE_CHECK_CUDA(cudaGetDevice(&device_id)); | |
| int num_SMs; | |
| NVTE_CHECK_CUDA(cudaDeviceGetAttribute(&num_SMs, cudaDevAttrMultiProcessorCount, device_id)); | |
| // Find out how many blocks of this specific kernel can fit on one SM | |
| int max_active_blocks_per_sm; | |
| NVTE_CHECK_CUDA(cudaOccupancyMaxActiveBlocksPerMultiprocessor( | |
| &max_active_blocks_per_sm, | |
| grouped_swizzle_scaling_variable_shape_kernel<SF_TILE_DIM_M, SF_TILE_DIM_K>, | |
| TB_DIM * TB_DIM, // block size | |
| max_slm_size // dynamic shared memory | |
| )); | |
| NVTE_CHECK(max_active_blocks_per_sm > 0, "Occupancy query returned 0 blocks per SM."); | |
| int persistent_blocks = num_SMs * max_active_blocks_per_sm; |
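To illustrate why the `max_active_blocks_per_sm > 0` guard matters, here is a minimal host-only C++ sketch that models a persistent grid as a plain loop. `run_persistent_grid` and `checked_persistent_blocks` are hypothetical names for illustration, not functions from the PR, and the sketch does not model CUDA's own launch-error reporting. With a grid size of zero no work item is processed, so the output keeps whatever it held before.

```cpp
#include <stdexcept>
#include <vector>

// Host-side model of a persistent grid: "block" b handles work items
// b, b + grid_size, b + 2 * grid_size, ... and marks each one as done.
void run_persistent_grid(int grid_size, std::vector<int>& out) {
  const int total_work = static_cast<int>(out.size());
  for (int block = 0; block < grid_size; ++block) {
    for (int item = block; item < total_work; item += grid_size) {
      out[item] = 1;
    }
  }
}

// Hypothetical guard: refuse to build a grid from a non-positive query result
// instead of letting a zero (or garbage) value reach the launch.
int checked_persistent_blocks(int num_sms, int max_active_blocks_per_sm) {
  if (num_sms <= 0 || max_active_blocks_per_sm <= 0) {
    throw std::runtime_error("occupancy query returned no usable blocks");
  }
  return num_sms * max_active_blocks_per_sm;
}
```

Failing loudly at the point where the grid size is computed keeps the error close to its cause, rather than leaving it to show up later as uninitialized output.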
| if (!is_variable_shape) { | ||
| // Fallback to uniform shape implementation | ||
| NVTE_CHECK(input->all_same_shape(), "Grouped swizzle requires uniform tensor shapes."); | ||
| NVTE_CHECK(input->all_same_last_dim() && input->all_same_first_dim(), | ||
| "Grouped swizzle requires uniform tensor shapes."); |
Dead code: redundant assertions inside !is_variable_shape branch
is_variable_shape is defined as !input->all_same_shape(), so inside if (!is_variable_shape) the two NVTE_CHECK calls are tautologies — they can never fire. They add noise and could mislead future readers into thinking the branch can handle non-uniform shapes. Consider removing them or converting them to a comment.
| if (!is_variable_shape) { | |
| // All tensors share the same shape; use the optimised uniform-shape path. |
| if (int_stride % 2 != 0) int_stride++; | ||
| int* d_block_offsets = reinterpret_cast<int*>(workspace); | ||
| int* d_global_counter = d_block_offsets + num_tensors + 1; | ||
| int* d_total_blocks = d_global_counter + 1; |
d_total_blocks is written but never consumed
d_total_blocks is populated by compute_grouped_swizzle_setup (as *total_blocks = current_block_offset) but is never read by the persistent kernel or any host code afterward. The persistent grid terminates via the tensor_id == -1 sentinel, not via a stored total. If this field was intended as a diagnostic or future guard, a comment would clarify its purpose; otherwise it can be removed to avoid confusing future maintainers and wasting a device-side write.
| size_t num_tensors = input.num_tensors(); | ||
| size_t num_int_elems = num_tensors + 3; // n+1 block_offsets + gc + tb | ||
| if (num_int_elems % 2 != 0) num_int_elems++; // pad to even for size_t alignment | ||
| size_t workspace_size = num_int_elems * sizeof(int) + (num_tensors + 1) * sizeof(size_t); | ||
| workspace_size = roundup(workspace_size, 256); | ||
| auto workspace = | ||
| allocateSpace(std::vector<size_t>{workspace_size}, transformer_engine::DType::kByte, false); | ||
| NVTE_SCOPED_GIL_RELEASE({ | ||
| nvte_swizzle_grouped_scaling_factors(swizzle_input.data(), swizzle_output.data(), | ||
| getDataPtr(workspace), at::cuda::getCurrentCUDAStream()); |
Workspace allocated unconditionally even for uniform-shape inputs
The workspace is only consumed by the variable-shape code path in swizzle_grouped_scaling_factors. For uniform shapes the pointer is accepted but immediately ignored. Gating the allocation on whether variable shapes are present (e.g., first_dims.data_ptr != nullptr || last_dims.data_ptr != nullptr) would avoid a small but unnecessary device allocation on every invocation with uniform tensors. This is a performance suggestion, not a correctness issue.
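For a sense of how small this allocation is, the size arithmetic from the snippet above can be reproduced on the host. This is a sketch under stated assumptions: `int` is taken to be 4 bytes and `size_t` 8 bytes (fixed-width types stand in for them), `roundup` is assumed to round up to the next multiple, and `round_up` and `grouped_swizzle_workspace_bytes` are hypothetical names, not functions from the PR.

```cpp
#include <cstddef>
#include <cstdint>

// Round `value` up to the next multiple of `multiple` (multiple > 0).
constexpr std::size_t round_up(std::size_t value, std::size_t multiple) {
  return (value + multiple - 1) / multiple * multiple;
}

// Mirrors the sizing in the snippet: (n + 1) block offsets, one global
// counter, and one total-blocks slot as 32-bit ints, padded to an even count
// so the following 64-bit array stays 8-byte aligned, then (n + 1) 64-bit
// entries, with the total rounded up to 256 bytes.
std::size_t grouped_swizzle_workspace_bytes(std::size_t num_tensors) {
  std::size_t num_int_elems = num_tensors + 3;
  if (num_int_elems % 2 != 0) ++num_int_elems;
  const std::size_t bytes = num_int_elems * sizeof(std::int32_t) +
                            (num_tensors + 1) * sizeof(std::uint64_t);
  return round_up(bytes, 256);
}
```

For 4 tensors this gives 8 ints (32 bytes) plus 5 64-bit entries (40 bytes), so 72 bytes rounded up to 256; for 100 tensors it gives 416 + 808 = 1224 bytes, rounded up to 1280.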
| cudaFuncAttributeMaxDynamicSharedMemorySize, max_slm_size)); | ||
| int device_id; | ||
| cudaGetDevice(&device_id); |
Caching attributes like the number of SMs and the max active blocks per device would be ideal to reduce CPU overhead on each call.
We already have a function in transformer_engine/common/util/cuda_runtime.cpp called "sm_count". Could you please use that here?
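As an illustration of the caching idea, here is a minimal host-only C++ sketch of a per-device attribute cache. It is not the implementation of `sm_count` in cuda_runtime.cpp, which is not shown in this thread; `DeviceAttributeCache` is a hypothetical name, and the query callback stands in for a real call such as cudaDeviceGetAttribute.

```cpp
#include <functional>
#include <mutex>
#include <unordered_map>
#include <utility>

// Caches one integer attribute per device ID. The (possibly expensive) query
// runs at most once per device; later calls return the stored value.
class DeviceAttributeCache {
 public:
  explicit DeviceAttributeCache(std::function<int(int)> query) : query_(std::move(query)) {}

  int get(int device_id) {
    std::lock_guard<std::mutex> lock(mutex_);  // keep concurrent callers safe
    auto it = cache_.find(device_id);
    if (it == cache_.end()) {
      it = cache_.emplace(device_id, query_(device_id)).first;
    }
    return it->second;
  }

 private:
  std::function<int(int)> query_;
  std::mutex mutex_;
  std::unordered_map<int, int> cache_;
};
```

Device properties such as the SM count do not change during a process's lifetime, which is what makes this kind of cache safe; an occupancy result would additionally need to be keyed on the kernel and its launch parameters.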
| // Fallback to uniform shape implementation | ||
| NVTE_CHECK(input->all_same_shape(), "Grouped swizzle requires uniform tensor shapes."); | ||
| NVTE_CHECK(input->all_same_last_dim() && input->all_same_first_dim(), | ||
| "Grouped swizzle requires uniform tensor shapes."); |
These checks might not be needed, given that we used input->all_same_shape() to reach this stage.
vthumbe1503 left a comment
I think that workspace allocation + a small kernel for computing offsets + a persistent kernel might be overkill for swizzling. @int-smart Do you have performance numbers for the swizzling kernel on Blackwell, by any chance?
How about we follow an SM-filling grid pattern like the one in the grouped_bias_add kernel in this PR?
https://github.com/NVIDIA/TransformerEngine/pull/2885/changes/BASE..b64559af9b89d816b8d7ffba4f5273e556d90c8e#diff-fa75cbeb11caf588f79b811be355c8f00b0cf5d4b807c259b94f2a40ffc8db6f
With this pattern, the thread block ID is decided dynamically based on sum(first_dims), and at the same time we divide the rows of grouped_tensor uniformly among the SMs. However, it only handles variable first_dims (the idea needs to be extended to other cases, such as all dims being variable).
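One possible reading of that pattern, sketched on the host in C++: the total row count sum(first_dims) is split into near-equal contiguous chunks, one per block, and each row is mapped back to its tensor by a running sum. This is an assumption about the general idea, not a transcription of the grouped_bias_add kernel in the linked PR; `RowRange`, `rows_for_block`, and `tensor_of_row` are hypothetical names.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

struct RowRange {
  std::int64_t begin;  // inclusive
  std::int64_t end;    // exclusive
};

// Split total_rows into num_blocks contiguous chunks of ceil(total / blocks)
// rows; trailing blocks may receive a short or empty range.
// Precondition: num_blocks > 0.
RowRange rows_for_block(std::int64_t total_rows, int num_blocks, int block_idx) {
  const std::int64_t rows_per_block = (total_rows + num_blocks - 1) / num_blocks;
  const std::int64_t begin = std::min(total_rows, rows_per_block * block_idx);
  const std::int64_t end = std::min(total_rows, begin + rows_per_block);
  return {begin, end};
}

// Map a global row index to the tensor that owns it, or -1 if out of range.
int tensor_of_row(const std::vector<std::int64_t>& first_dims, std::int64_t row) {
  std::int64_t end = 0;
  for (std::size_t t = 0; t < first_dims.size(); ++t) {
    end += first_dims[t];
    if (row < end) return static_cast<int>(t);
  }
  return -1;
}
```

Because the split is by rows rather than by tensors, a block's range can straddle a tensor boundary, so per-row (or per-tile) tensor lookup is needed inside the block.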
@vthumbe1503 Will check the PR and get back to you.
With regards to Blackwell, I don't have the numbers tbh. I can generate them for the RTX 40 series.
@vthumbe1503 Consolidated to one kernel, removed the shared memory allocation, and tried to stick to the PR you mentioned. Let me know if this works. It seems to perform better on an RTX 4070 than my last approach. There are still some optimizations that can be done, but they would need more shared memory allocation.
Description
Grouped swizzle with variable shapes. Not sure whether this is needed; if not, the PR can be closed.
Fixes #2451