Add single-dispatch layer-by-layer multi-head attention #91
Draft
Conversation
andrej (Collaborator, Author) commented Apr 6, 2026
Can we reuse the reference from the existing mha? (Note: does not include RoPE and Q, K, V projections, but some code reuse should be possible.)
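As context for the reuse question, a reference MHA extended with the pieces the existing one lacks might look roughly like the NumPy sketch below. This is illustrative only: `rope` and `mha_reference` are hypothetical names, not functions from this repository, and the rotary-embedding convention (split-half pairing) is an assumption.

```python
import numpy as np

def rope(x, base=10000.0):
    """Apply rotary position embeddings to x of shape (seq, heads, head_dim)."""
    seq, n_heads, head_dim = x.shape
    half = head_dim // 2
    freqs = base ** (-np.arange(half) / half)          # per-pair rotation rates
    angles = np.arange(seq)[:, None] * freqs[None, :]  # (seq, half)
    cos = np.cos(angles)[:, None, :]                   # broadcast over heads
    sin = np.sin(angles)[:, None, :]
    x1, x2 = x[..., :half], x[..., half:]
    return np.concatenate([x1 * cos - x2 * sin,
                           x1 * sin + x2 * cos], axis=-1)

def mha_reference(x, wq, wk, wv, wo, n_heads):
    """Multi-head attention with RoPE and Q, K, V, and output projections."""
    seq, d = x.shape
    hd = d // n_heads
    q = rope((x @ wq).reshape(seq, n_heads, hd))
    k = rope((x @ wk).reshape(seq, n_heads, hd))
    v = (x @ wv).reshape(seq, n_heads, hd)
    out = np.empty_like(q)
    for h in range(n_heads):                            # one head at a time
        scores = q[:, h] @ k[:, h].T / np.sqrt(hd)
        scores -= scores.max(axis=-1, keepdims=True)    # numerically stable softmax
        p = np.exp(scores)
        p /= p.sum(axis=-1, keepdims=True)
        out[:, h] = p @ v[:, h]
    return out.reshape(seq, d) @ wo
```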
Contributor
📊 Test Results for Test Example Applications: 1d87fe8 (2026_04_07_21_05_39) IRONCLAD, tested on:
📈 Trends (vs main branch) for Test Example Applications: 1d87fe8 (2026_04_07_21_05_39) IRONCLAD Trends
llama_3.2_1b
llama_3.2_1b_prompt_1024_tokens_1
llama_3.2_1b_prompt_1024_tokens_40
llama_3.2_1b_prompt_13_tokens_1
llama_3.2_1b_prompt_13_tokens_40
llama_3.2_1b_prompt_2048_tokens_1
llama_3.2_1b_prompt_2048_tokens_40
Contributor
CI Test Results: ea275b5 (2026_04_20_20_26_40) IRONCLAD - CI Summary
Examples
Small
Extensive
Krackan - Small: IRONCLAD, tested on:
Trends: IRONCLAD Trends
GPT2-Small-256seq
H2
Llama3.2-256seq
M_128-K_128-num_aie_columns_1-tile_size_input_32-tile_size_output_128
M_1792-K_896-N_1152-num_aie_columns_8-b_col_maj_False-c_col_maj_True-m_64-k_32-n_48-trace_size_0-partition_N_1
M_192-K_384-N_64-num_aie_columns_4-b_col_maj_False-c_col_maj_False-m_48-k_96-n_16-trace_size_0-partition_N_1
M_192-K_384-N_64-num_aie_columns_4-b_col_maj_True-c_col_maj_True-m_48-k_96-n_16-trace_size_0-partition_N_1
M_2048-K_2048-N_2048-num_aie_columns_1-b_col_maj_False-c_col_maj_False-m_64-k_64-n_64-trace_size_0-partition_N_1
M_2048-K_2048-N_2048-num_aie_columns_2-b_col_maj_True-c_col_maj_False-m_64-k_64-n_64-trace_size_0-partition_N_1
M_2048-K_2048-N_2048-num_aie_columns_8-b_col_maj_True-c_col_maj_True-m_64-k_64-n_64-trace_size_0-partition_N_1
M_2048-K_8192-num_aie_columns_1-tile_size_input_1-tile_size_output_2048
M_2048-K_8192-num_aie_columns_2-tile_size_input_1-tile_size_output_1024
M_2048-K_8192-num_aie_columns_4-tile_size_input_1-tile_size_output_512
M_2048-K_8192-num_aie_columns_8-tile_size_input_1-tile_size_output_256
M_384-K_1536-N_1792-num_aie_columns_4-b_col_maj_True-c_col_maj_False-m_32-k_48-n_64-trace_size_0-partition_N_1
M_64-K_512-N_256-num_aie_columns_4-b_col_maj_True-c_col_maj_False-m_16-k_64-n_64-trace_size_0-partition_N_4
M_8192-K_2048-num_aie_columns_1-tile_size_input_4-tile_size_output_1024
M_8192-K_2048-num_aie_columns_2-tile_size_input_4-tile_size_output_1024
M_8192-K_2048-num_aie_columns_4-tile_size_input_4-tile_size_output_1024
M_8192-K_2048-num_aie_columns_8-tile_size_input_4-tile_size_output_1024
M_896-K_1792-N_640-num_aie_columns_8-b_col_maj_False-c_col_maj_True-m_32-k_64-n_80-trace_size_0-partition_N_1
input_length_2048-num_aie_columns_1-num_channels_1-tile_size_2048
input_length_2048-num_aie_columns_1-num_channels_1-tile_size_2048-group_size_32
input_length_2048-num_aie_columns_1-num_channels_2-tile_size_1024
input_length_2048-num_aie_columns_1-num_channels_2-tile_size_1024-group_size_32
input_length_2048-num_aie_columns_1-tile_size_2048
input_length_2048-num_aie_columns_1-tile_size_2048-scalar_factor_3.0
input_length_2048-num_aie_columns_2-num_channels_1-tile_size_1024
input_length_2048-num_aie_columns_2-num_channels_1-tile_size_1024-group_size_32
input_length_2048-num_aie_columns_2-num_channels_2-tile_size_512
input_length_2048-num_aie_columns_2-num_channels_2-tile_size_512-group_size_32
input_length_2048-num_aie_columns_2-tile_size_1024
input_length_2048-num_aie_columns_2-tile_size_1024-scalar_factor_3.0
input_length_2048-num_aie_columns_4-num_channels_1-tile_size_512
input_length_2048-num_aie_columns_4-num_channels_1-tile_size_512-group_size_32
input_length_2048-num_aie_columns_4-num_channels_2-tile_size_256
input_length_2048-num_aie_columns_4-num_channels_2-tile_size_256-group_size_32
input_length_2048-num_aie_columns_4-tile_size_512
input_length_2048-num_aie_columns_4-tile_size_512-scalar_factor_3.0
input_length_2048-num_aie_columns_8-num_channels_1-tile_size_256
input_length_2048-num_aie_columns_8-num_channels_1-tile_size_256-group_size_32
input_length_2048-num_aie_columns_8-num_channels_2-tile_size_128
input_length_2048-num_aie_columns_8-num_channels_2-tile_size_128-group_size_32
input_length_2048-num_aie_columns_8-tile_size_256
input_length_2048-num_aie_columns_8-tile_size_256-scalar_factor_3.0
input_length_2048-num_cores_1-num_channels_1-bypass_False-tile_size_2048
input_length_2048-num_cores_16-num_channels_2-bypass_False-tile_size_128
input_length_2048-num_cores_2-num_channels_1-bypass_False-tile_size_1024
input_length_2048-num_cores_2-num_channels_2-bypass_False-tile_size_1024
input_length_2048-num_cores_4-num_channels_1-bypass_False-tile_size_512
input_length_2048-num_cores_4-num_channels_2-bypass_False-tile_size_512
input_length_2048-num_cores_8-num_channels_1-bypass_False-tile_size_256
input_length_2048-num_cores_8-num_channels_2-bypass_False-tile_size_256
seq_len_16384-dim_64-num_heads_1-num_pipelines_8-num_kv_heads_0
Krackan - Examples: IRONCLAD, tested on:
Trends: IRONCLAD Trends
llama_3.2_1b_prompt_1024_tokens_1
llama_3.2_1b_prompt_1024_tokens_40
llama_3.2_1b_prompt_13_tokens_1
llama_3.2_1b_prompt_13_tokens_40
Phoenix - Small: IRONCLAD, tested on:
Trends: IRONCLAD Trends
GPT2-Small-256seq
H2
Llama3.2-256seq
M_128-K_128-num_aie_columns_1-tile_size_input_32-tile_size_output_128
M_192-K_384-N_64-num_aie_columns_4-b_col_maj_False-c_col_maj_False-m_48-k_96-n_16-trace_size_0-partition_N_1
M_192-K_384-N_64-num_aie_columns_4-b_col_maj_True-c_col_maj_True-m_48-k_96-n_16-trace_size_0-partition_N_1
M_2048-K_2048-N_2048-num_aie_columns_1-b_col_maj_False-c_col_maj_False-m_64-k_64-n_64-trace_size_0-partition_N_1
M_2048-K_2048-N_2048-num_aie_columns_2-b_col_maj_True-c_col_maj_False-m_64-k_64-n_64-trace_size_0-partition_N_1
M_2048-K_8192-num_aie_columns_1-tile_size_input_1-tile_size_output_2048
M_2048-K_8192-num_aie_columns_2-tile_size_input_1-tile_size_output_1024
M_2048-K_8192-num_aie_columns_4-tile_size_input_1-tile_size_output_512
M_384-K_1536-N_1792-num_aie_columns_4-b_col_maj_True-c_col_maj_False-m_32-k_48-n_64-trace_size_0-partition_N_1
M_64-K_512-N_256-num_aie_columns_4-b_col_maj_True-c_col_maj_False-m_16-k_64-n_64-trace_size_0-partition_N_4
M_8192-K_2048-num_aie_columns_1-tile_size_input_4-tile_size_output_1024
M_8192-K_2048-num_aie_columns_2-tile_size_input_4-tile_size_output_1024
M_8192-K_2048-num_aie_columns_4-tile_size_input_4-tile_size_output_1024
input_length_2048-num_aie_columns_1-num_channels_1-tile_size_2048
input_length_2048-num_aie_columns_1-num_channels_1-tile_size_2048-group_size_32
input_length_2048-num_aie_columns_1-num_channels_2-tile_size_1024
input_length_2048-num_aie_columns_1-num_channels_2-tile_size_1024-group_size_32
input_length_2048-num_aie_columns_1-tile_size_2048
input_length_2048-num_aie_columns_1-tile_size_2048-scalar_factor_3.0
input_length_2048-num_aie_columns_2-num_channels_1-tile_size_1024
input_length_2048-num_aie_columns_2-num_channels_1-tile_size_1024-group_size_32
input_length_2048-num_aie_columns_2-num_channels_2-tile_size_512
input_length_2048-num_aie_columns_2-num_channels_2-tile_size_512-group_size_32
input_length_2048-num_aie_columns_2-tile_size_1024
input_length_2048-num_aie_columns_2-tile_size_1024-scalar_factor_3.0
input_length_2048-num_aie_columns_4-num_channels_1-tile_size_512
input_length_2048-num_aie_columns_4-num_channels_1-tile_size_512-group_size_32
input_length_2048-num_aie_columns_4-num_channels_2-tile_size_256
input_length_2048-num_aie_columns_4-num_channels_2-tile_size_256-group_size_32
input_length_2048-num_aie_columns_4-tile_size_512
input_length_2048-num_aie_columns_4-tile_size_512-scalar_factor_3.0
input_length_2048-num_cores_1-num_channels_1-bypass_False-tile_size_2048
input_length_2048-num_cores_2-num_channels_1-bypass_False-tile_size_1024
input_length_2048-num_cores_2-num_channels_2-bypass_False-tile_size_1024
input_length_2048-num_cores_4-num_channels_1-bypass_False-tile_size_512
input_length_2048-num_cores_4-num_channels_2-bypass_False-tile_size_512
input_length_2048-num_cores_8-num_channels_2-bypass_False-tile_size_256
Phoenix - Examples: IRONCLAD, tested on:
Trends: IRONCLAD Trends
…o multiple invocations for large sequence lengths
…es for scale/add single scalar, allow more buffers to alias to reduce memory usage
A "naive" alternative implementation of multi-head attention to the currently checked-in dataflow design. It is a simple layer-by-layer implementation, but it uses the single-dispatch mechanism to fuse everything into one MLIR file, saving on CPU roundtrips and XRT overheads.
Includes two variants:
Q,K,V. This matches the functionality of the checked-in dataflow MHA.
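The layer-by-layer structure the description refers to can be sketched in NumPy as below. This is a conceptual sketch only, not the PR's IRON code: `matmul_kernel` and `softmax_kernel` are hypothetical stand-ins for the individual layers that the single-dispatch mechanism would fuse into one MLIR module and launch with a single XRT call.

```python
import numpy as np

def matmul_kernel(a, b):
    # Stand-in for one matmul layer that would be dispatched to the device.
    return a @ b

def softmax_kernel(s):
    # Stand-in for one row-wise softmax layer (numerically stable form).
    s = s - s.max(axis=-1, keepdims=True)
    e = np.exp(s)
    return e / e.sum(axis=-1, keepdims=True)

def layer_by_layer_attention(q, k, v):
    # Three separate layers: scores, softmax, weighted sum. Dispatched
    # naively, each would cost a host roundtrip; the single-dispatch
    # design emits all of them into one MLIR file so the host pays the
    # launch overhead once for the whole sequence of layers.
    scores = matmul_kernel(q, k.T) / np.sqrt(q.shape[-1])
    probs = softmax_kernel(scores)
    return matmul_kernel(probs, v)
```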