Transformer in FINN: Scaled Dot-Product Attention#13

Merged
iksnagreb merged 90 commits into dev from feature/attention on Feb 6, 2025
Conversation

iksnagreb commented on Jan 20, 2025

Adds support for multi-head scaled dot-product attention, i.e., the core operation of a Transformer, to FINN. This includes compiler integration of hardware operators for the attention mechanism and for multi-head splitting and merging, as well as the related graph transformations. This heavily depends on the related streamlining of scaled dot-product attention: #12
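For reference, the operation this PR maps to hardware can be sketched in a few lines of NumPy. This is a generic illustration of scaled dot-product attention, not the FINN operator itself (the hardware version works on quantized integer tensors with multithreshold activations):

```python
import numpy as np

def scaled_dot_product_attention(q, k, v):
    # q: (seq_q, d_k), k: (seq_k, d_k), v: (seq_k, d_v)
    scores = q @ k.T / np.sqrt(q.shape[-1])
    # Numerically stable softmax over the key dimension
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v  # (seq_q, d_v)
```

Multi-head attention runs this per head on split projections of the inputs and concatenates the results, which is what the multi-head splitting/merging operators in this PR implement in hardware.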

  • Add attention-hlslib dependency to fetch-repos.sh, see https://github.com/iksnagreb/attention-hlslib
  • Figure out how to integrate the Brevitas modifications...
  • There are probably some undocumented fixes/modification lying around on some other branches...

To support a complete Transformer, the following PRs must be merged:

WIP: merge branch for testing the integration of all Transformer-related PRs until they are fully merged into dev: https://github.com/eki-project/finn-plus/tree/transformer

iksnagreb added 30 commits April 3, 2024 15:21
Currently this is not a HLSCustomOp, but a QONNX CustomOp.

Implemented are the first operator attributes, ONNX graph/model construction,
and a rather improvised Python-mode node execution for debugging.
This causes the C++ simulation to fail as multithreshold activations are
not implemented on the HLS side yet.
Note: The threshold parameters are generated and included but not
connected to the attention operator yet. The attention operator uses
uninitialized thresholds of the same type and shape.
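The multithreshold activations mentioned above can be illustrated with a small NumPy sketch: each output is the count of per-channel thresholds the input meets or exceeds, yielding an integer activation. This is a simplified stand-in for illustration, not FINN's actual MultiThreshold implementation:

```python
import numpy as np

def multithreshold(x, thresholds):
    # x: (channels,) input values
    # thresholds: (channels, n_thres), sorted ascending per channel
    # Output: number of thresholds each input meets or exceeds,
    # i.e. a quantized integer activation in [0, n_thres].
    return (x[:, None] >= thresholds).sum(axis=1)
```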
Note: Currently there is no method for optimizing the accumulator width
of both the HLSCustomOp and the python simulation. Thus, to make the
tests pass, both must be specified manually to the maximum possible
accumulator bitwidth. Doing the MinimizeAccumulatorWidth transform would
cause the HLS and python operator behavior to diverge.
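The "maximum possible accumulator bitwidth" used as the manual fallback follows from a standard worst-case bound: a dot product of n terms, each the product of an a-bit and a b-bit operand, needs roughly a + b + log2(n) bits. A hypothetical helper (not the FINN MinimizeAccumulatorWidth transform) might compute it like this:

```python
import math

def max_accumulator_bits(a_bits, b_bits, n, signed=True):
    # Worst-case accumulator width for a dot product of n terms,
    # each a product of an a_bits and a b_bits operand.
    if signed:
        # Largest product magnitude: (-2^(a-1)) * (-2^(b-1)), summed n times
        max_mag = (1 << (a_bits - 1)) * (1 << (b_bits - 1)) * n
        return math.ceil(math.log2(max_mag + 1)) + 1  # +1 sign bit
    max_val = ((1 << a_bits) - 1) * ((1 << b_bits) - 1) * n
    return math.ceil(math.log2(max_val + 1))
```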
Note: This is currently not controlling the memory used by the internal
threshold operations and also not controlling the resource type used for
implementing the floating-point operations within the softmax. These are
all still handled by the tools' automatic strategy.
This is a temporary solution to get at least node-by-node RTL simulation
of models working by simply skipping the attention operator.
The inferred shape is not taken from the model graph but from the node
attributes specifying the shape.
Instead of manually squeezing all shapes, explicit Squeeze and Unsqueeze
operations are inserted into the graph before deleting and redoing all
shape annotations from scratch. This should be more robust and keeps the
interface (data layout) the model exposes to the outside.

Wraps Im2Col operations in Unsqueeze-Squeeze operators to shield them from
squeezing, as Im2Col always operates on 4-dimensional layouts.
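The shielding pattern for a rank-sensitive node can be illustrated with a NumPy stand-in: re-insert the squeezed dimension before the 4D-only operation and drop it again afterwards, so the rest of the squeezed graph is unaffected. The function names here are hypothetical, chosen only for illustration:

```python
import numpy as np

def shielded_op(x3d, op4d, axis=0):
    # Stand-in for wrapping a 4D-only node (e.g. Im2Col) in a squeezed graph:
    x4d = np.expand_dims(x3d, axis=axis)   # Unsqueeze: restore the 4D layout
    y4d = op4d(x4d)                        # node that requires 4 dimensions
    return np.squeeze(y4d, axis=axis)      # Squeeze: back to the 3D view
```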
@iksnagreb iksnagreb self-assigned this Jan 28, 2025
@iksnagreb iksnagreb marked this pull request as ready for review February 6, 2025 09:10
@iksnagreb iksnagreb merged commit 2fbcd6e into dev Feb 6, 2025
1 check failed
LinusJungemann added a commit that referenced this pull request Jun 24, 2025
* Remove hardcoded batch size from kernel execution

* Implement setBatchSize for complete Stack

* Remove RingBuffer from Synchronous Inference and add full batch mapping

* Deduplicate batchsize in basedriver & fix unittests

* Fix integrationtests

* Change input kernel code to run concurrently to output kernel code

* Optimize inference of lower batch sizes

* Increase packing performance

* Further optimize OpenMP

* Optimize Utils

* Some small changes

* Add example data

* Small Amounts of cleanup

* Change Driver to run without XRT managed kernels

* Add more efficient version of execute method

* Hotfix FPGA bricking

* Simplify inference interface to speed up inference

* Update unittest

* Simplify code

* Update CMake

* Fix Release Build CMakeLists

* Fix wrong old variable names in CMake

* Fix formatting

* Change format target

* Add changes to paper version

* Add final paper changes

* Add basic host mem functionality

* Add switch for Host Memory Access and fix unittests for User Managed Kernels support

* Revert timing changes for paper

* Formatting changes

* Remove unnecessary benchmark

* Small changes

* Clean up and update dependencies

* Merge dev into paperVersion

* Fix setting of Host Mem Var and update cppcheck config

* Update CI definition

* Fix typo in CI

* Remove hardcoded path from examples

* Fix linting for json files

* Expand integrationTests

* Update FPGA PCIe signatures

* Increase timelimits of jobs

* Switch CI partition to HACC for testing

* Bump Graphviz version

* Optimize CI

* Fix integrationtest path

* Update CI and add performance benchmark

* Fix paths

* Change logger and add expected performance results to synchronous inference benchmark

* Update expected results

* Add missing path change

* Add regression tests

* Add test condition to regression test

* Fix broken bash script in CI

* Fix broken bash script in CI

* Update dependencies in CI pipeline

* Fix missing boost lib

* Fix missing libs

* Change number of processors to be correct and simplify regression tests

* Fix typo in ci

* Fix floating point comparison

* Add debug print to CI

* Add debug print to CI

* Filter colored output

* Filter colored output

* Update .gitlab-ci.yml

* Update .gitlab-ci.yml

* Update .gitlab-ci.yml
LinusJungemann added a commit that referenced this pull request Jun 24, 2025
* Merge dev into main for v1.2 release (#13)

* Pending changes exported from your codespace

* Remove boost from being shipped with the driver

* Update CI

* Refactor build configuration: remove mdspan submodule, update CMakeLists for output directories, and enhance FINNDriver with static configuration check

* update README.md

* Format FinnDatatypes.hpp

* Fix linting

* Update src/FINNCppDriver/FINNDriver.cpp

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Status: Merged into FINN+