Overview
With their inherent flexibility, AMD adaptive SoCs and FPGAs are ideal for high-performance or multi-channel digital signal processing (DSP) applications that can take advantage of hardware parallelism. AMD adaptive SoCs and FPGAs combine this processing bandwidth with comprehensive solutions, including easy-to-use design tools for hardware designers, software developers, and system architects.
Hardware Parallelism
A standard Von Neumann DSP architecture requires 256 cycles to complete a 256-tap FIR filter, while adaptive SoCs and FPGAs can achieve the same result in a single clock cycle.

This massive parallelism translates into exceptional levels of DSP performance:
- 49.5 TeraMACs of fixed-point performance (8-bit)
- 23.1 TeraFLOPs for single-precision floating point
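The 256-tap cycle-count comparison above can be sketched with a small software model. This is only an illustration: it counts multiply-accumulate steps rather than real hardware cycles, the coefficient and sample values are hypothetical, and the "one output per clock" figure assumes a fully pipelined datapath (throughput, not latency).

```python
# Illustrative model only: counts MAC steps, not real hardware cycles.

def fir_sequential(window, coeffs):
    """One shared MAC unit: one tap per cycle, so N taps cost N cycles per output."""
    acc = 0
    cycles = 0
    for x, c in zip(window, coeffs):
        acc += x * c
        cycles += 1
    return acc, cycles


def fir_parallel(window, coeffs):
    """N multipliers feeding an adder tree: all products form side by side.

    Assumption: the datapath is fully pipelined, so one new output is
    produced per clock once the pipeline is full.
    """
    products = [x * c for x, c in zip(window, coeffs)]  # N multipliers in parallel
    return sum(products), 1


TAPS = 256
coeffs = [(i % 7) - 3 for i in range(TAPS)]        # hypothetical integer coefficients
window = [(i * 5) % 11 - 5 for i in range(TAPS)]   # hypothetical input samples

seq_result, seq_cycles = fir_sequential(window, coeffs)
par_result, par_cycles = fir_parallel(window, coeffs)
print(seq_cycles, par_cycles, seq_result == par_result)
```

Both functions compute the same dot product; only the number of steps needed per output sample differs.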
Comprehensive DSP Solutions
AMD DSP solutions include silicon, IP, reference designs, development boards, tools, documentation, and training to enable a wide range of applications in a breadth of markets, including, but not limited to, wireless communications, data center, and aerospace and defense.
Comprehensive Development Flows
Various tool flows are available for different use models and different levels of design abstraction:
Hardware designers can design in:
- RTL and system-level design with the Vivado™ Design Suite
- C/C++
- MATLAB® and Simulink® environment using Vitis™ Model Composer
Software developers accustomed to developing in C/C++ can design using:
System architects can rapidly evaluate new algorithms with:
- Vitis Model Composer for system modeling in the MATLAB or Simulink environment
- Vitis HLS for algorithmic exploration in C or C++
Choose Your Solution
With AMD adaptive SoCs and FPGAs, designers can use multiple flows to deploy their DSP applications depending on design approach and level of abstraction.
Versal AI Engines: Meeting the High Performance DSP Compute Demands of Next-Generation Applications
In many dynamic and evolving DSP markets, such as aerospace, defense, automotive/industrial, and test/measurement, applications are pushing for ever-increasing DSP compute acceleration while remaining power efficient.
With Moore's Law and Dennard Scaling no longer following their traditional trajectory, moving to the next-generation silicon node alone cannot deliver the benefits of lower power and cost with better performance, as in previous generations.
Responding to this non-linear increase in demand by next-generation DSP applications, like polyphase channelization and beamforming, AMD has developed a new innovative processing technology, the AI Engine, as part of the AMD Versal™ architecture.
A Closer Look at Versal AI Engines:
AI Engine Architecture
AI Engines are architected as 2D arrays consisting of multiple AI Engine tiles and allow for a very scalable solution across the Versal portfolio, ranging from 10s to 100s of AI Engines in a single device, servicing the compute needs of a breadth of applications.
Benefits include:
Software Programmability
- C programmable via the Vitis Unified Software Platform
- Model-based programming via Vitis Model Composer
- Learn more about the AI Engine DSP Design Process
Deterministic
- Dedicated instruction and data memories
- Dedicated connectivity between AI Engine tiles, paired with DMA engines, for scheduled data movement
Efficiency
- Delivers increased DSP compute density when compared with traditional programmable logic, while reducing dynamic power consumption.
Based on an ASIC-class architecture, AMD adaptive SoCs and FPGAs combine multi-hundred gigabit-per-second I/O bandwidth with over 49 TeraMACs of fixed-point DSP performance in the Versal™ Premium series. The AMD DSP slice and its parallelism are key to the achievable DSP performance in the latest generation of AMD FPGAs.
DSP Slice Architecture
The DSP58 slice in Versal devices is the sixth generation of DSP slices in AMD architectures.
This dedicated DSP processing block is implemented in full custom silicon that delivers leading power/performance, allowing efficient implementations of popular DSP functions such as a multiply-accumulator (MACC), multiply-adder (MADD), or complex multiply.
The slice also provides capabilities to perform different kinds of logic operations, such as AND, OR and XOR operations.
The Versal device DSP58 architecture builds on the success of the UltraScale™ FPGA DSP48E2 with further enhancements:
- Wider multiplier (27 x 24 bits)
- Single-precision, floating-point multiplier
- 18x18 complex multiplication using two back-to-back DSPs
- INT8 vector dot product mode
These enhancements help DSP-critical applications perform more computation within the DSP58 slice before going into the FPGA fabric, ultimately leading to both resource and power savings.
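To make the complex-multiply item more concrete: a complex product normally needs four real multiplications, but a well-known rearrangement needs only three at the cost of extra additions. The sketch below shows that arithmetic in plain Python. It is an illustration of the identity only; how a given DSP slice actually maps the operation is an assumption here, and the function names are hypothetical.

```python
def complex_mult_4(a, b, c, d):
    """(a + jb) * (c + jd) with four real multiplies (textbook form)."""
    return a * c - b * d, a * d + b * c


def complex_mult_3(a, b, c, d):
    """Same product with three real multiplies and extra add/subtract steps."""
    k1 = c * (a + b)
    k2 = a * (d - c)
    k3 = b * (c + d)
    # real = ac - bd, imag = ad + bc
    return k1 - k3, k1 + k2


# Hypothetical operands at the limits of a signed 18-bit range.
operands = [(-131072, 131071, 131071, -131072), (3, 4, 5, 6), (0, -7, 9, 2)]
for a, b, c, d in operands:
    print(complex_mult_3(a, b, c, d) == complex_mult_4(a, b, c, d))
```

The additions inside `k1`, `k2`, and `k3` are the kind of work a pre-adder can absorb, which is why fewer multipliers can be enough for the same result.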
DSP48E2 (UltraScale) vs DSP58 (Versal) Slice Features
| Function | UltraScale | Versal |
|---|---|---|
| DSP Tile/Slice Type | DSP48E2 | DSP58 |
| Multiple Add/Sub/Acc operations | Yes | Yes |
| Multiplier and MACC | 27x18 | 27x24 |
| Squaring: [(A or B) ± D]² | Yes | Yes |
| WMUX Feedback Ultra Efficient Complex Multiply CMACC | 3 x DSP48E2 | 2 x DSP58 |
| SIMD Support | Yes | Yes |
| Integrated Pattern Detect Circuitry | Yes | Yes |
| Integrated Logic Unit | Yes | Yes |
| Wide Mux Functions | 48-bit | 58-bit |
| Wide XOR | 96-bit | 116-bit |
| Single Precision Floating Point Multiplier | No | Yes |
| Optional 96-bit Output | Yes | Yes |
| Cascade Routing | Yes | Yes |
| Pipeline Registers | Yes | Yes |
| D Pre-adder | Yes | Yes |
| Sequential Complex Multiply, AB dyn access | Yes | Yes |
| AB Register Pipeline Balancing Improved | Yes | Yes |
Featured Videos:
- Utilizing the Squaring MUX in the DSP48E2 slice (Video)
- Utilizing the Wide MUX Feedback in the DSP48E2 slice (Video)
Tools and Flows
Depending on your designing preferences, AMD has tools supporting RTL, C/C++ and model-based design entry. This flexibility in the design flow, along with an extensive DSP IP catalog, facilitates easier adoption of AMD tools and devices.

Visit Tools, Libraries & Frameworks for more information.
DSP Performance Metrics
The following table shows some of the key DSP performance metrics for the UltraScale™, UltraScale+™, and Versal™ families. For adaptive SoC device performance, see the Software Developer section.
| | Kintex UltraScale | Kintex UltraScale+ | Virtex UltraScale | Virtex UltraScale+ | Versal AI Core | Versal AI Edge | Versal Prime | Versal Premium |
|---|---|---|---|---|---|---|---|---|
| System Logic Elements (K) | 318–1,451 | 356–1,143 | 783–5,541 | 862–3,780 | 540–1,968 | 44–1,139 | 329–2,233 | 833–7,352 |
| DSP Slices | 768–5,520 | 1,368–3,528 | 600–2,880 | 2,280–12,288 | 928–1,968 | 90–1,312 | 464–3,984 | 1,140–14,352 |
| 27x18 Multipliers | 768–5,520 | 1,368–3,528 | 600–2,880 | 2,280–12,288 | 928–1,968 | 90–1,312 | 464–3,984 | 1,140–14,352 |
| INT8 GOPs (1) | 1,774–14,315 | 4,263–11,000 | 1,554–7,469 | 7,108–38,318 | 6,403–13,579 | 62–9,052 | 3,201–27,489 | 7,866–99,029 |
| INT16 GOPs | 1,014–8,180 | 2,436–6,286 | 888–4,268 | 4,062–21,896 | 2,134–4,526 | 21–3,017 | 1,067–9,163 | 2,622–33,010 |
| Complex INT18 GOPs | 676–5,453 | 1,624–4,191 | 592–2,845 | 2,708–14,597 | 913–1,937 | 8–1,291 | 456–3,920 | 1,122–14,122 |
| Single Precision Floating Point (GFLOPs) (2) | 320–2,685 | 800–1,673 | 294–1,411 | 1,354–7,299 | 1,494–3,168 | 14–2,112 | 747–6,414 | 1,835–23,107 |
We have introduced software development environments and a comprehensive set of familiar and powerful tools, libraries, and methodologies that allow software developers to target AMD adaptive SoCs and FPGAs with ease. With the Vitis™ unified software platform, a high-level abstraction environment, AMD offers GPU-like and familiar embedded application development and runtime experiences for C, C++, and OpenCL development.
AMD MPSoCs and Versal Devices
The Zynq™ UltraScale+™ MPSoC and the Versal architecture combine a powerful processing system (PS), incorporating Arm® Cortex® processors, and user-programmable logic (PL), in a single device.
Application Profiling for Acceleration
The Vitis unified software platform provides the ability to profile a given application and allows for the creation of hardware accelerators to run more efficiently in the Programmable Logic (PL), where the flexibility and parallelism of the FPGA are leveraged to provide large performance improvements. This also enables other functions of the application to run in the Processing System (PS) in parallel if desired.
By targeting AMD adaptive SoCs and FPGAs, many DSP and embedded applications will see improved efficiency and reduced power.
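One common way to reason about a profiling result before building an accelerator is Amdahl's law: the overall gain depends on how much of the runtime the accelerated function accounts for. The sketch below is a generic estimate, not a feature of the Vitis platform; the function name and the example numbers are hypothetical.

```python
def overall_speedup(hot_fraction, kernel_speedup):
    """Amdahl's law estimate for offloading one profiled hotspot.

    hot_fraction:   share of total runtime spent in the hotspot (0.0 to 1.0),
                    as reported by a profiler.
    kernel_speedup: how much faster the hotspot runs once accelerated.
    """
    if not 0.0 <= hot_fraction <= 1.0:
        raise ValueError("hot_fraction must be between 0 and 1")
    if kernel_speedup <= 0:
        raise ValueError("kernel_speedup must be positive")
    return 1.0 / ((1.0 - hot_fraction) + hot_fraction / kernel_speedup)


# Hypothetical profile: 80% of runtime in one filter kernel, accelerated 20x.
estimate = overall_speedup(0.8, 20.0)
print(round(estimate, 2))
```

The estimate ignores data-transfer overhead between processor and accelerator and any overlap between the two, so it is best treated as a first-order guide for choosing which functions to accelerate.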
Features and DSP Performance of AMD SoC Devices
The following tables show some of the key features and DSP performance metrics for the AMD Zynq 7000 SoC and Zynq UltraScale+ MPSoC families. For non-SoC device performance, visit the Hardware Designer section.
| Processing System | Zynq 7000 SoC | Zynq UltraScale+ MPSoC |
|---|---|---|
| Application Processing Unit (APU) | | |
| Real-Time Processing Unit (RPU) | - | |
| Multimedia Processing | - | |
| Dynamic Memory Interface | DDR3, DDR3L, DDR2, LPDDR2 | DDR4, LPDDR4, DDR3, DDR3L, LPDDR3 |
| High-Speed Peripherals | USB 2.0, Gigabit Ethernet, SD/SDIO | PCIe® Gen2, USB3.0, SATA 3.1, DisplayPort, Gigabit Ethernet, SD/SDIO |
| Security | RSA, AES, and SHA, ARM TrustZone® | RSA, AES, and SHA, ARM TrustZone |
| Max I/O Pins | 128 | 214 |

| Programmable Logic | Zynq 7000 SoC | Zynq UltraScale+ MPSoC |
|---|---|---|
| System Logic Elements (K) | 23–444 | 103–1,045 |
| Max Memory (Mb) | 1.8–26.5 | 5.3–70.6 |
| Max I/O Pins | 100–362 | 252–668 |
| DSP Slices | 60–2,020 | 240–3,528 |
| 18x18 Multipliers | 60–2,020 | 240–3,528 |
| Fixed Point Performance (GMACs) (1) | 42–1,313 | 213–3,143 |
| Fixed Point Performance For Symmetric Filters (GMACs) (1) (2) | 84–2,626 | 426–6,286 |
| INT8 GOPs (1) (3) | 84–2,626 | 745–11,000 |
| INT16 GOPs (1) | 84–2,626 | 426–6,286 |
| Single Precision Floating Point (GFLOPs) (1) (4) | 23–716 | 142–1,673 |
| Single Precision Floating Point (GFLOPs) (1) (5) | 17–537 | 106–1,571 |
| Half Precision Floating Point (GFLOPs) (1) (6) | 34–1,074 | 212–3,142 |
Notes:
1. All performance calculations are based on -2 speed grade parts for Zynq 7000 adaptive SoC and -3 for Zynq UltraScale+ MPSoC
2. Using the pre-adder, DSP performance can be increased 2x for symmetric filters
3. Please refer to WP486 – Deep Learning with INT8 Optimization on AMD Devices (Not applicable for Zynq devices)
4. Single Precision Floating Point performance using Floating Point Operator core with 3 DSP slices
5. Single Precision Floating Point performance using Floating Point Operator core with 4 DSP slices
6. Half Precision Floating Point performance using Floating Point Operator core with 2 DSP slices
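The 2x figure for symmetric filters follows from simple arithmetic: when the coefficients are mirror images of each other, the two samples that share a coefficient can be added first and multiplied once. A minimal sketch of that folding, with hypothetical sample and coefficient values and function names chosen only for illustration:

```python
def fir_direct(window, coeffs):
    """One multiply per tap."""
    acc = 0
    mults = 0
    for x, c in zip(window, coeffs):
        acc += x * c
        mults += 1
    return acc, mults


def fir_symmetric(window, coeffs):
    """Pre-add the two samples sharing a coefficient, then multiply once."""
    n = len(coeffs)
    if list(coeffs) != list(coeffs)[::-1]:
        raise ValueError("coefficients must be symmetric")
    acc = 0
    mults = 0
    for i in range(n // 2):
        acc += (window[i] + window[n - 1 - i]) * coeffs[i]  # pre-adder, then multiplier
        mults += 1
    if n % 2:  # odd length: the centre tap has no partner
        acc += window[n // 2] * coeffs[n // 2]
        mults += 1
    return acc, mults


coeffs = [1, -2, 3, 4, 4, 3, -2, 1]   # hypothetical symmetric coefficients
window = [5, -1, 2, 7, -3, 0, 6, 4]   # hypothetical input samples

direct_out, direct_mults = fir_direct(window, coeffs)
folded_out, folded_mults = fir_symmetric(window, coeffs)
print(direct_out == folded_out, direct_mults, folded_mults)
```

With half as many multiplies per output, the same number of multipliers can serve twice as many taps, which is where the doubled GMACs figure for symmetric filters comes from.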
To learn more about AMD adaptive SoCs and MPSoCs, go to:
DSP in the Processing Subsystem
The Processing System (PS) provides DSP processing capabilities by way of the different ARM processing cores.
For more information on DSP capabilities in the ARM processors, visit:
- Cortex-A Series Family
- SIMD and Advanced SIMD (NEON) technologies
- ARM Floating Point Architecture
Some useful examples can be found at the following locations:
For Zynq UltraScale+ MPSoC, see UG1211 for a demonstration of an FFT using the ARM NEON instruction set.
For Zynq 7000 SoC, the following Tech Tips are available on Xilinx wiki when targeting the Cortex-A9 and ARM SIMD:
AMD Data-type Support
AMD devices offer very flexible data-type support. Varying precisions of fixed point, floating point, and integer are supported natively in AMD tools, with floating point implemented with the aid of the Floating Point Operator IP core.
Floating Point designs implemented on FPGAs will always lead to higher resource and power usage compared to Fixed Point or Integer implementations. Converting to a fixed point solution where possible will bring large benefits:
- Fewer FPGA resources
- Lower power
- Lower cost
For more details on the benefits of converting from floating point to fixed point data types, please read WP491.
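As a rough illustration of what a floating-point to fixed-point conversion involves, the sketch below quantizes values to a signed 16-bit format with 15 fractional bits (often called Q1.15) and multiplies them using integer arithmetic only. The format choice, the saturation behavior, and the helper names are assumptions made for this example, not a description of any AMD tool.

```python
def to_fixed(value, frac_bits=15, total_bits=16):
    """Quantize a float to a signed fixed-point integer, saturating at the limits."""
    scale = 1 << frac_bits
    lowest = -(1 << (total_bits - 1))
    highest = (1 << (total_bits - 1)) - 1
    quantized = int(round(value * scale))
    return max(lowest, min(highest, quantized))


def to_float(fixed, frac_bits=15):
    """Convert a fixed-point integer back to a float for inspection."""
    return fixed / (1 << frac_bits)


def fixed_mul(a, b, frac_bits=15):
    """Integer multiply, then shift right to restore the scaling (truncates)."""
    return (a * b) >> frac_bits


x = to_fixed(0.5)
y = to_fixed(0.25)
product = fixed_mul(x, y)
print(to_float(product))                     # product of 0.5 and 0.25 in fixed point
print(abs(to_float(to_fixed(0.3)) - 0.3))    # quantization error for 0.3
```

The multiply itself is a plain integer operation of the kind that maps directly onto a DSP slice multiplier; the cost of the conversion is a bounded quantization error that has to be checked against the precision the algorithm needs.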
Benchmarks
The tables below show a small selection of algorithms and the performance improvements possible when using an AMD device, in particular the fabric in the programmable logic (PL), to accelerate the design.
| Algorithm | CPU/GPU | Zynq UltraScale+ MPSoC | Advantage |
|---|---|---|---|
| Stereo LocalBM @ 2K | ARM: 0.5 FPS/Watt; Nvidia: 3.5 FPS/Watt | 146 FPS/Watt | 292x (vs. ARM); 42x (vs. Nvidia) |
| Optical Flow (Lucas-Kanade) | ARM: 0.1 FPS/Watt; Nvidia: 0.8 FPS/Watt | 7.1 FPS/Watt | 9.3x |
| GoogleNet (Batch=1) | ARM: 0.1 Imgs/s/W; Nvidia: 8.8 Imgs/s/W | 53 Imgs/s/W | 530x (vs. ARM); 6x (vs. Nvidia) |
Notes:
- ARM: Quad-core A53 run on Raspberry Pi @ 1200MHz
- Nvidia benchmarks were done using Tegra X1
- Optical Flow (LK) – Window Size 11x1

| Algorithm | CPU/DSP | Zynq 7000 | Advantage |
|---|---|---|---|
| Forward Projection | ARM: 3 sec/view | 0.016 sec/view | 188x |
| Motion Detection | ARM: 0.7 FPS | 67 FPS | 90x |
| Noise Reduction-Sobel | ARM: 1 FPS | 67 FPS | 60x |
| Canny Edge Detection | ARM: 0.66 FPS | 40 FPS | 45x |
| 3D Image Reconstruction | ARM: 75k | 8k | 9x |
| DPD | ARM: 506 ms | 31.3 ms | 16x |
| FIR | TI DSP: 64020 ns | 1200 ns | 53x |
| FFT | TI DSP: 1036 ns | 128 ns | 8x |
Notes:
- ARM results use only the Cortex-A9 core on the Zynq devices
- TI benchmarks were done using C66 DSP core
AMD high-level design tools like Vitis Model Composer and High-Level Synthesis provide a level of abstraction that empowers system architects and domain experts to rapidly evaluate new algorithms and focus on developing the differentiating parts of their design. The complete AMD DSP solution is a combination of these design tools, IP, reference designs, methodologies, and boards that work together to get to a working production design in the shortest time possible.
The Vitis Model Composer is a Model-Based design tool that leverages the MATLAB and Simulink environment to define, test and implement production quality DSP algorithms in programmable logic in a fraction of traditional RTL development times.
The tool provides:
- 100+ optimized DSP blocks, many with C simulation models for 2-3X faster simulation vs RTL
- Integration of RTL, IP, Simulink, MATLAB and C/C++ components of a DSP system
- Bit and cycle accurate floating and fixed-point simulations
- Hardware co-simulation to accelerate simulation and validate algorithm on working hardware
- Automatic code generation from Simulink to packaged IP or low-level HDL
- Automatic generation of HDL test bench, including test vectors
Learn more about Vitis Model Composer:
- Introduction to Vitis Model Composer (Video)
- Vitis Model Composer (Documentation)
High Level Synthesis
High-Level Synthesis, included in the Vitis unified software platform, enables portable C, C++, and SystemC algorithm specifications to be targeted directly to AMD FPGAs and adaptive SoCs without the need to create RTL. Just as compilers translate C/C++ to different processor architectures, the HLS compiler translates C/C++ to AMD FPGAs and adaptive SoCs.
Learn more about Vitis High-Level Synthesis:
- Getting Started with Vitis High-Level Synthesis (Video)
- Vitis High Level Synthesis User Guide (Documentation)
Tools & Ecosystem
AMD provides best-in-class tools to enable Digital Signal Processing (DSP) applications to be implemented efficiently and at low power on AMD adaptive SoCs and FPGAs. Whether you are designing with RTL, C/C++/SystemC, or MATLAB/Simulink, the AMD tools below can facilitate your DSP design and reduce your time-to-market.
Libraries and Frameworks
AMD offers a range of libraries that are optimized for performance, resource utilization and ease of use.
| Libraries & Frameworks | Description | Application |
|---|---|---|
| GitHub Repositories | AMD has created GitHub repositories, which contain useful examples for many applications including DSP-related functions. | |
| Vitis Accelerated Libraries | AMD has created an extensive set of open-source, performance-optimized libraries that offer out-of-the-box acceleration with minimal to zero code changes to your existing applications. | Vitis Libraries |
Partners, Boards & Kits
AMD and its partners work together to produce tools and boards to ease the adoption of AMD FPGAs and SoCs for DSP applications across many market segments.
| Partner | Description | Solution |
|---|---|---|
| Avnet DSP-Centric Development Kits and Modules | MathWorks and leading high-speed analog supplier Avnet offer DSP-centric development kits and production-ready system-on-modules (SOM) for embedded vision, software-defined radio, and high-performance motor control. | Avnet |
| MathWorks Computing Software | MathWorks MATLAB® and Simulink® can significantly reduce adaptive SoC and FPGA system development time. | MathWorks |
| Analog Devices Add-On Boards | The AD-FMCDAQ2-EBZ FMC board is a self-contained data acquisition and signal synthesis prototyping platform that supports ease of use and enables quicker end-system signal processing development. | Analog Devices |
Resources
Multi-Channel Fractional SRC Filter in HLS
This application note focuses on the design of a multi-channel fractional sample rate conversion (SRC) filter using Vivado High-Level Synthesis (HLS), which takes C++ source code and generates highly efficient, synthesizable Verilog or VHDL code for the FPGA.
Footnotes:
1. Please refer to WP486 – Deep Learning with INT8 Optimization on AMD Devices
2. Single Precision Floating Point performance using Floating Point Operator core with 3 DSP slices in UltraScale+

