[PyTorch] Fix stale columnwise data usage #2925
Conversation
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
Greptile Summary

This PR fixes stale columnwise data usage.
Confidence Score: 3/5

Mostly a safe bug fix, but one module-level regression in linear.py drops a defensive assignment that guards against an AttributeError in quantize_weight. The core fix (elif→if and FSDP2 tensor changes) is correct and well-tested for the primary code paths. A P1 regression in linear.py removes a guard for the weight_quantizer=None + QuantizedTensor scenario, turning a silent fallback into a potential crash. While this path may be rarely hit in practice, the asymmetry with how layernorm_mlp.py and layernorm_linear.py handle the same pattern makes linear.py's divergence worth addressing.

Important Files Changed

transformer_engine/pytorch/module/linear.py — dropped defensive quantizer assignment before the quantize_weight call
Flowchart

```mermaid
%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[Forward pass enters module] --> B{weight is QuantizedTensorStorage AND not debug?}
    B -- Yes --> C[weight_quantizer = weight._quantizer]
    B -- No --> D{weight_quantizer is not None?}
    C --> D
    D -- Yes --> E[set_usage rowwise=True columnwise=is_grad_enabled ...]
    D -- No --> F[skip set_usage]
    E --> G[quantize_weight]
    F --> G
    G --> H{FSDP2 reshard_after_forward?}
    H -- Yes --> I[columnwise = is_backward_pass]
    H -- No --> J[columnwise = is_backward_pass OR grad_enabled]
    J --> K{columnwise=True but _columnwise_data is None?}
    K -- Yes --> L[RuntimeError raised - mxfp8_tensor only]
    K -- No --> M[all-gather sharded tensors]
    I --> M
    M --> N[GEMM forward]
```
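The branch at node H in the flowchart above can be sketched as a small helper. This is a hypothetical illustration of the decision logic only; `pick_columnwise_usage` and its argument names are not part of the Transformer Engine API.

```python
# Hypothetical sketch of the columnwise-usage decision from the flowchart.
# Function and argument names are illustrative, not actual TE code.

def pick_columnwise_usage(reshard_after_forward: bool,
                          is_backward_pass: bool,
                          grad_enabled: bool) -> bool:
    """Decide whether the all-gathered FSDP2 weight shard needs columnwise data."""
    if reshard_after_forward:
        # The weight is gathered again for backward, so only the backward
        # pass itself needs the columnwise layout.
        return is_backward_pass
    # Without resharding, the forward-time gather must already carry
    # columnwise data whenever a backward pass may follow.
    return is_backward_pass or grad_enabled


print(pick_columnwise_usage(True, False, True))   # reshard path ignores grad state
print(pick_columnwise_usage(False, False, True))  # grad enabled forces columnwise
```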
```python
if weight_quantizer is not None:
    if isinstance(weight, QuantizedTensor) and not debug:
        weight_quantizer = weight._quantizer
```
Dropped defensive weight_quantizer assignment breaks the quantize_weight call
The original `elif isinstance(weight, QuantizedTensor): weight_quantizer = weight._quantizer` handled the case where `weight_quantizer` arrives as `None` while `weight` is already a `QuantizedTensor`. In that path, `quantize_weight` immediately dereferences `quantizer.rowwise_usage` (line 710 of base.py) and will raise `AttributeError: 'NoneType' object has no attribute 'rowwise_usage'`.

The new code only re-assigns `weight_quantizer` when it is already non-`None`, so the previously guarded scenario now crashes instead of falling back to the weight's own quantizer. The missing assignment should be:
```python
if weight_quantizer is not None:
    if isinstance(weight, QuantizedTensor) and not debug:
        weight_quantizer = weight._quantizer
    columnwise_usage = ...
    weight_quantizer.set_usage(rowwise=True, columnwise=columnwise_usage)
elif isinstance(weight, QuantizedTensor):
    # weight_quantizer is None but weight is pre-quantized -- pick up its quantizer
    weight_quantizer = weight._quantizer
```

```diff
+columnwise_usage = is_backward_pass or torch.is_grad_enabled()
 sharded_tensors = (self._rowwise_data, rowwise_scale_inv)
-columnwise_usage = self._quantizer.columnwise_usage
 if columnwise_usage:
     if self._columnwise_data is None or columnwise_scale_inv is None:
         raise RuntimeError(
             "FSDP2 (reshard_after_forward=False) needs columnwise MXFP8 data "
             "for the upcoming backward pass, but the local shard has none. "
             "Ensure the weight is quantized with columnwise_usage=True before "
             "this all-gather."
         )
```
torch.is_grad_enabled() can be True during eval without torch.no_grad()
model.eval() alone does not disable the gradient tape — torch.is_grad_enabled() stays True unless the caller wraps the eval loop with torch.no_grad(). In that situation columnwise_usage becomes True, but the local shard may not have _columnwise_data (it was never quantized with columnwise support during eval), so the new RuntimeError fires.
Users who ran eval with grads enabled previously got silently incorrect (stale) data; they now get a hard crash. While the crash is more correct, the error message could guide them:
Suggested change:

```diff
         raise RuntimeError(
             "FSDP2 (reshard_after_forward=False) needs columnwise MXFP8 data "
             "for the upcoming backward pass, but the local shard has none. "
             "Ensure the weight is quantized with columnwise_usage=True before "
-            "this all-gather."
+            "this all-gather. If you are running evaluation without requiring "
+            "gradients, wrap the eval loop with torch.no_grad()."
         )
```
```python
# 2. Eval forward -- should set columnwise=False on primary FP8 weight
# quantizers, simulating the start of an evaluation loop.
run_forward(is_eval=True)
for q in get_weight_quantizers():
```
The test doesn't make sense to me. I don't think we should be toggling the quantizer usages in the case of quantized_model_init at all.

This breaks the very principle that a quantized tensor and its internal quantizer shouldn't be in conflict with each other. And here, the columnwise_data is present for the quantized tensor even though columnwise_usage is set to False.
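The invariant this comment describes can be sketched with stand-in types. `QuantizerStub`, `QuantizedTensorStub`, and the checker below are hypothetical illustrations, not the real Transformer Engine classes:

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical stand-ins for TE's quantizer / quantized-tensor pair; the real
# classes differ. This only illustrates the consistency invariant at issue.

@dataclass
class QuantizerStub:
    rowwise_usage: bool
    columnwise_usage: bool

@dataclass
class QuantizedTensorStub:
    _quantizer: QuantizerStub
    _rowwise_data: Optional[object]
    _columnwise_data: Optional[object]

def usages_consistent(t: QuantizedTensorStub) -> bool:
    """Each usage flag should match the presence of its data buffer."""
    return (t._quantizer.rowwise_usage == (t._rowwise_data is not None)
            and t._quantizer.columnwise_usage == (t._columnwise_data is not None))

# The conflicting state flagged above: columnwise data is present even though
# the internal quantizer says columnwise_usage=False.
conflicting = QuantizedTensorStub(QuantizerStub(True, False), object(), object())
consistent = QuantizedTensorStub(QuantizerStub(True, True), object(), object())
print(usages_consistent(conflicting), usages_consistent(consistent))  # False True
```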
```python
if isinstance(weights[0], QuantizedTensorStorage) and not debug:
    weight_quantizers = [weight._quantizer for weight in weights]
    for weight_quantizer in weight_quantizers:
```
I don't think these changes are needed for this file or any of the other files in the PR. If the weight is already quantized, it doesn't make sense to change its internal quantizer and leave the quantized weight and its internal quantizer in a state of conflict with each other.

In general, in the case of quantized_model_init, if we change a quantized tensor's internal quantizer, the quantized tensor should also be updated to carry the appropriate usages.
```diff
 else:
     rowwise_usage = True
-    columnwise_usage = self._quantizer.columnwise_usage
+    columnwise_usage = is_backward_pass or torch.is_grad_enabled()
```
I think it still makes sense to use `self._quantizer.columnwise_usage` as the source of truth for what data is "really available" in the sharded quantized tensor, and to throw an error if that usage doesn't match `is_backward_pass or torch.is_grad_enabled()` (similar to the MXFP8 tensor).

What we are doing here is silently creating columnwise data after the all-gather for the all-gathered tensor, even though the original sharded data tensor didn't have that data.

In my opinion, I am against any change here, since even doing such a validation and throwing an error is going to incur CPU overhead when using `torch.is_grad_enabled()`.
Same comment applies to all the other FSDP2-related changes.
Description
This PR sets columnwise usage correctly for all quantizers instead of retaining the value in the quantizer, which may be incorrect after resuming training post-validation steps, as the columnwise usage is set to False for eval mode.

Type of change
Changes
Checklist: