
AMD Zen vs Zen 2 vs Zen 3 vs Zen 4 vs Zen 5 Core Architecture: The Road to Ryzen 9000 CPUs

The fifth iteration of AMD’s Zen core architecture is launching later this month. Zen 5 marks nearly seven years of Ryzen processors, which began with the release of Zen in 2017. Since then, we’ve gotten Zen+, Zen 2, Zen 3, and Zen 4 at the heart of the Ryzen 2000, 3000, 5000, and 7000 series CPUs. The Zen 5 core that powers the Ryzen 9000 series brings the largest overhaul of the front-end design so far, making it wider, faster, and more efficient. Here’s a look at the five iterations of Zen.

AMD Zen vs Zen 2 vs Zen 3 vs Zen 4 vs Zen 5: CPU front end

The Branch Predictor and I-Cache

At the top we have the branch predictor, which controls the flow of instructions like a ship's navigator:



if condition
    {dosomething}
else
    {dosomethingelse}
  • It predicts whether the next instruction will be a branch and, if so, what kind (conditional or unconditional).
  • Conditional branches are typically “if/else” constructs, while unconditional branches are always taken.
  • The next step is to calculate the address of the target instruction (usually in the cache).
  • The branch target buffer (BTB) is instrumental here. It holds a history of the last n branches (taken or not) and the destination addresses (PC) of their targets.
  • The branch predictor allows the CPU to keep executing before the location of the next instruction has been resolved. This is called speculative execution.
Figure: Fetch/Decode
  • The Zen core architecture uses a hashed perceptron branch predictor with a 3-level BTB.
    • The branch predictor on Zen and Zen 2 stores up to 2 branches per BTB entry.
    • The L0 BTB holds 4 forward and 4 backward taken branches. The L1 BTB has 256 entries, while the L2 BTB has 4096.
    • There is also an Indirect Target Array (ITA) with 512 entries for indirect branches.
    • The 64 KB L1I cache is paired with an 8-entry L0 TLB, a 64-entry L1 TLB, and a 512-entry L2 TLB.
  • Zen 2 uses an L1 hashed perceptron predictor and an L2 TAGE predictor.
    • The L0 BTB holds 8 forward and 8 backward taken branches. The L1 BTB has 512 entries, while the L2 BTB has 7168.
    • The ITA is doubled to 1024 entries.
    • The L1I cache is 32 KB in size, with a 64-entry L1 TLB and a 512-entry L2 TLB.
  • Zen 3 improves branch prediction accuracy and bandwidth, with a lower mispredict penalty. Most taken branches incur a zero-bubble penalty (the pipeline does not stall).
    • The L1 BTB grows to 1024 entries, and the L2 BTB holds up to 6.5K entries.
    • The ITA is expanded by 50% to 1536 entries.
  • Zen 4 can predict 2 taken branches per cycle.
    • The L1 BTB holds up to 1536 entries, and the L2 BTB has 7168 entries.
    • The ITA is expanded to approximately 3K entries.
  • Zen 5 can predict up to 2 taken and 2 ahead branches per cycle.
    • The L1I cache size is unchanged, but it can receive up to 2x 32B (previously 32B) of data from the L2 cache per cycle.
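
The direction-prediction step can be illustrated with a toy model. The following is a minimal sketch of a perceptron-style predictor indexed by a hash of the branch address. It is not AMD's actual design: the table size, history length, training threshold, hash function, and the names `PerceptronPredictor` and `run_loop_branch` are all assumptions chosen for illustration.

```python
class PerceptronPredictor:
    """Toy perceptron branch predictor (illustrative; all sizes are assumptions)."""

    def __init__(self, table_size=256, history_len=8, threshold=16):
        self.table_size = table_size
        self.threshold = threshold
        # One weight vector (bias + one weight per history bit) per table row.
        self.weights = [[0] * (history_len + 1) for _ in range(table_size)]
        # Global history: +1 for taken, -1 for not taken (oldest first).
        self.history = [-1] * history_len

    def _row(self, pc):
        # Hash the branch address into the weight table.
        return self.weights[(pc ^ (pc >> 8)) % self.table_size]

    def _output(self, pc):
        w = self._row(pc)
        return w[0] + sum(wi * hi for wi, hi in zip(w[1:], self.history))

    def predict(self, pc):
        # A non-negative dot product means "predict taken".
        return self._output(pc) >= 0

    def update(self, pc, taken):
        y = self._output(pc)
        t = 1 if taken else -1
        # Train on a misprediction or when confidence is low.
        if (y >= 0) != taken or abs(y) <= self.threshold:
            w = self._row(pc)
            w[0] += t
            for i, hi in enumerate(self.history):
                w[i + 1] += t * hi
        # Shift the real outcome into the global history.
        self.history = self.history[1:] + [t]


def run_loop_branch(iterations=200, trip_count=4):
    """Accuracy on a loop branch: taken (trip_count - 1) times, then not taken."""
    predictor = PerceptronPredictor()
    correct = 0
    for i in range(iterations):
        taken = (i % trip_count) != trip_count - 1
        if predictor.predict(0x400) == taken:
            correct += 1
        predictor.update(0x400, taken)
    return correct / iterations
```

An "always taken" guess would be right 75% of the time on this pattern. The perceptron does better once it has learned that the outcome repeats every four branches, which is the kind of history correlation real predictors exploit at much larger scale.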

The decoders, op-cache and op-queue

The decoders take the (complex) instructions from the instruction cache, via the instruction queue, and break them into simpler micro-operations. These are passed to the micro-op queue and also stored in the op-cache. Micro-ops in the micro-op queue are then sent to the execution backend.

Decoders are quite power-hungry, so making the op-cache more efficient and accurate is key. Instructions cached in the op-cache allow the front end to bypass the decoders, thereby improving throughput and efficiency.
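
To make the bypass idea concrete, here is a minimal sketch of a micro-op cache sitting in front of a decoder. The `OpCache` class, the LRU replacement policy, the `decode` helper, and the textual instruction format are hypothetical and exist only for this illustration; real op-caches are organized very differently.

```python
from collections import OrderedDict


class OpCache:
    """Toy micro-op cache keyed by instruction address (LRU replacement assumed)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()

    def lookup(self, addr):
        if addr in self.entries:
            self.entries.move_to_end(addr)  # mark as most recently used
            return self.entries[addr]
        return None

    def insert(self, addr, uops):
        self.entries[addr] = uops
        self.entries.move_to_end(addr)
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict the least recently used


def decode(instruction):
    """Hypothetical decoder: a memory operand adds a separate load micro-op."""
    mnemonic = instruction.split()[0]
    if "[" in instruction:
        return ["load " + instruction[instruction.index("["):], "alu " + mnemonic]
    return ["alu " + mnemonic]


def fetch(program, trace, cache):
    """Run an address trace and return (decoder_uses, op_cache_hits)."""
    decoder_uses = hits = 0
    for addr in trace:
        uops = cache.lookup(addr)
        if uops is None:
            uops = decode(program[addr])  # slow, power-hungry path
            cache.insert(addr, uops)
            decoder_uses += 1
        else:
            hits += 1  # decoders are bypassed
    return decoder_uses, hits


program = {0: "add rax, [rbx]", 4: "sub rcx, rdx", 8: "jnz 0"}
loop_trace = [0, 4, 8] * 10
print(fetch(program, loop_trace, OpCache(capacity=8)))  # (3, 27)
```

In this toy loop the decoders are used only on the first iteration. With a cache too small to hold the loop body (for example `capacity=2`), every fetch would fall back to the decoders, which hints at why op-cache capacity grows across the Zen generations.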

  • The Zen core uses a 4-wide decoder, which can decode four instructions per cycle.
    • It is fed by a 20x 16B instruction queue, which sends up to four instructions to the decoders.
    • The op-cache has 2048 entries and can send up to 8 micro-ops to the micro-op queue.
    • The micro-op queue has 72 entries and receives up to 8 micro-ops from the op-cache or up to 4 from the decoders.
    • Up to 6 micro-ops are dispatched to the integer/floating-point backends.
  • Zen 2 doubles the size of the op-cache to 4096 entries.
  • Zen 3 increases the instruction queue to 24x 16B.
  • Zen 4 increases the op-cache capacity to 6.75K entries and can send up to 9 micro-ops to the micro-op queue.
  • Zen 5 makes drastic changes to the decoders.
    • It features 2x 4-wide decoders that can each deliver up to 4 micro-ops.
    • The decoders are fed by a dual-ported instruction fetch.
    • The micro-op cache has been reduced to 6K entries, but it can deliver 12 (6x2) micro-ops to the micro-op queue.
    • Dispatch has been widened to 8-wide, so up to 8 micro-ops can be sent to the execution backend.
    • When SMT is enabled, each thread gets one 4-wide decoder.
Figure: Zen 5 front end

ROB, Schedulers and Registers

The reorder buffer (ROB) is a critical part of out-of-order processors. It ensures that results are committed to the registers in the original program order, even when instructions finish executing out of order. It feeds the schedulers, which hold instructions and track their operands. When the operands for a scheduled instruction become available, it is sent to the execution units.
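
The in-order commit behavior can be sketched in a few lines. This is a simplified model under stated assumptions: the function name `retire_in_order` and the instruction labels are hypothetical, and the model ignores ROB capacity, exceptions, and mispredictions.

```python
from collections import deque


def retire_in_order(program, completion_order):
    """Return the order in which instructions retire.

    program: instruction names in program order.
    completion_order: the same names in the order their execution finishes.
    """
    rob = deque(program)  # entries are allocated in program order
    done = set()
    retired = []
    for name in completion_order:
        done.add(name)  # the result is ready, but it may have to wait
        # Retire from the head only while the oldest entry is complete.
        while rob and rob[0] in done:
            retired.append(rob.popleft())
    return retired


# i2 finishes first but cannot retire until i0 and i1 are done.
print(retire_in_order(["i0", "i1", "i2", "i3"], ["i2", "i0", "i3", "i1"]))
# ['i0', 'i1', 'i2', 'i3']
```

A larger ROB lets more completed instructions wait behind a slow one at the head, which is one reason its size keeps growing from generation to generation.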


Register renaming is another crucial part of OoO execution. When two or more instructions use the same architectural register but are otherwise independent of each other, the processor maps each write to a different physical register. The renamed instructions can then execute in parallel without introducing data hazards.
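
A minimal sketch of renaming follows, assuming an unlimited pool of physical registers and a simple three-operand instruction format. The `rename` function and the `rN`/`pN` naming are hypothetical, and real hardware also recycles physical registers through a free list, which is omitted here.

```python
def rename(instructions, num_arch_regs=4):
    """Rename (dst, src1, src2) architectural registers to physical ones."""
    # Initially, architectural register rN maps to physical register pN.
    mapping = {f"r{i}": f"p{i}" for i in range(num_arch_regs)}
    next_free = num_arch_regs
    renamed = []
    for dst, src1, src2 in instructions:
        # Sources read the current mapping, which preserves true dependencies.
        s1, s2 = mapping[src1], mapping[src2]
        # Every write gets a fresh physical register, removing WAR/WAW hazards.
        mapping[dst] = f"p{next_free}"
        next_free += 1
        renamed.append((mapping[dst], s1, s2))
    return renamed


code = [
    ("r1", "r2", "r3"),  # r1 = r2 op r3
    ("r0", "r1", "r2"),  # r0 = r1 op r2 (truly depends on the first)
    ("r1", "r2", "r3"),  # r1 = r2 op r3 (reuses r1, but is independent)
]
print(rename(code))
# [('p4', 'p2', 'p3'), ('p5', 'p4', 'p2'), ('p6', 'p2', 'p3')]
```

After renaming, the third instruction writes `p6` instead of `p4`, so it no longer conflicts with the first two and could execute alongside them.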

Figure: Zen 2 to Zen 3 CCD
  • The Zen core has a 192-entry ROB, with separate integer and floating-point schedulers, register files, and execution units.
    • The FP rename is 6-wide, while the integer rename is 4-wide.
    • It has 4x 14-entry integer and 2x 14-entry AGU schedulers.
    • The FP scheduling queue has 96 entries.
    • The integer register file has 168 entries, while the floating-point register file has 160 entries.
  • Zen 2 increases the ROB size to 224 entries and doubles the width of the floating-point data paths.
    • It has 4x 16-entry integer schedulers and 1x 28-entry AGU scheduler.
    • The integer register file has been expanded to 180 entries.
    • The FP side has a 64-entry non-scheduling queue and a 36-entry scheduler.
  • Zen 3 has a slightly larger ROB with 256 entries.
    • It has 4x 24-entry integer+AGU schedulers.
    • The FP side has a 64-entry non-scheduling queue and 2x 32-entry scheduling queues.
    • The integer register file has been expanded to 192 entries.
  • Zen 4 increases the ROB size to 320 entries.
    • The integer register file has 224 entries, while the FP side has 192x 512-bit registers.
    • There is also an AVX512 mask register file with 68 entries.
  • Zen 5 increases the ROB size to 448 entries.
    • The FP rename is 6-wide, while the integer rename is 8-wide.
    • The FP side has 3x 32-entry schedulers and a 96-entry non-scheduling queue.
    • The integer register file has 240 (64b) entries, while the FP register file has 384 (512b) entries.
Figure: Zen 5 back-end INT execution

Execution units and memory subsystem

The execution units perform the arithmetic, floating-point, address-generation, load/store, and branch calculations that the program requires. Modern CPU cores have multiple independent execution ports that are specialized for specific operations, such as FADD, FMUL, FMA, ALU, AGU, LD/ST, etc. The results are written to the registers or forwarded to the retire queue.

Figure: Zen 4 back end
  • The Zen core features ten execution ports.
    • The integer side has 4x ALU and 2x AGU ports.
    • The FP side has 2x FMUL/FMA and 2x FADD execution units (128-bit).
    • The load buffer has 72 entries and the store buffer has 44 entries.
    • Zen is capable of 2x 128-bit loads and 1x 32B stores.
    • It is backed by a 32 KB 8-way data cache and a 512 KB L2 cache.
  • Zen 2 has eleven execution ports.
    • The integer side has 4x ALU and 3x AGU ports.
    • The FP side has 2x FMA and 2x FADD units (256-bit).
    • The store queue has 48 entries.
    • The load/store bandwidth is increased to 256 bits per cycle.
  • Zen 3 has fourteen execution ports.
    • The integer side has 4x ALU, 3x AGU, and 1x branch execution units.
    • The FP side has 2x FMA, 2x FADD, and 2x store units.
    • The store queue has 64 entries, while the load queue has 116 entries.
    • Load bandwidth is up to 3x 64-bit (or 2x 256-bit) per cycle, and store bandwidth is up to 2x 64-bit (or 1x 256-bit).
  • Zen 4 retains the 14-wide execution backend.
    • Zen 4 can execute AVX512 instructions by double-pumping the 256-bit wide floating-point units.
    • The L2 cache has been increased to 1 MB, 8-way.
  • Zen 5 has sixteen execution ports.
    • The integer side consists of 6x ALU ports and 4x AGU ports.
    • The FP side has 2x FMUL, 2x FADD, and 2x IntD/StD execution ports.
    • Zen 5 supports native AVX512 execution using 512b data paths.
    • It is capable of 4x 64-bit or 2x 256-bit loads and 2x 128-bit/256-bit or 1x 512-bit stores per cycle.
    • It also increases the L1 data cache to 48 KB 12-way, with 4 read and 2 write operations per cycle.
    • The L1 to L2/FP bandwidth has been doubled to 64 bytes per cycle.
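
The difference between double-pumped and native AVX512 execution can be sketched as follows. This is a conceptual model only, not a cycle-accurate one: the function names are hypothetical, vectors are modeled as lists of 16 lanes (16 x 32-bit = 512 bits), and the returned "pass" count simply reflects how many trips through the data path the operation needs.

```python
def vector_add(a, b):
    """Lane-wise add of two equal-length vectors."""
    return [x + y for x, y in zip(a, b)]


def add_512_double_pumped(a, b, lanes_per_pass=8):
    """Add two 16-lane (512-bit) vectors as two 8-lane (256-bit) passes."""
    assert len(a) == len(b) == 2 * lanes_per_pass
    low = vector_add(a[:lanes_per_pass], b[:lanes_per_pass])    # pass 1
    high = vector_add(a[lanes_per_pass:], b[lanes_per_pass:])   # pass 2
    return low + high, 2  # (result, number of passes)


def add_512_native(a, b):
    """Add the same vectors on a full 512-bit data path in a single pass."""
    return vector_add(a, b), 1


a = list(range(16))
b = [10] * 16
split_result, split_passes = add_512_double_pumped(a, b)
native_result, native_passes = add_512_native(a, b)
print(split_result == native_result, split_passes, native_passes)  # True 2 1
```

Both approaches produce the same result. The wider data path simply needs fewer passes per 512-bit operation, at the cost of more hardware.
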
Figure: Zen 5 back-end FP execution

AMD Zen 1 vs Zen 2 vs Zen 3 vs Zen 4 vs Zen 5: Front-end summary

| | Zen | Zen 2 | Zen 3 | Zen 4 | Zen 5 |
|---|---|---|---|---|---|
| L1I cache | 64 KB | 32 KB | 32 KB | 32 KB | 32 KB |
| ITLB entries | 64 L1/512 L2 | 64 L1/512 L2 | 64 L1/512 L2 | 64 L1/512 L2 | 64 L1/512 L2 |
| BTB entries | 256 L1/4K L2 | 512 L1/7K L2 | 1024 L1/6.5K L2 | 1536 L1/7K L2 | ? |
| Instruction queue | 20x 16B | 20x 16B | 24x 16B | 24x 16B | ? |
| Decoder width | 4-wide | 4-wide | 4-wide | 4-wide | 2x 4-wide |
| Micro-op cache entries | 2K | 4K | 4K | 6.75K | 6K |
| Op-cache bandwidth (uops) | 8 | 8 | 8 | 9 | 2x 6 |
| Dispatch (uops) | 6 | 6 | 6 | 6 | 8 |

AMD Zen 1 vs Zen 2 vs Zen 3 vs Zen 4 vs Zen 5: Back-end summary

| | Zen | Zen 2 | Zen 3 | Zen 4 | Zen 5 |
|---|---|---|---|---|---|
| ROB entries | 192 | 224 | 256 | 320 | 448 |
| INT scheduler | 6x 14-entry | 4x 16-entry (1x 28-entry AGU) | 4x 24-entry | 4x 24-entry | ? |
| FP scheduler | 96 entries | 36 entries | 2x 32-entry | 2x 32-entry | 3x 32-entry |
| INT registers | 168 entries | 180 entries | 192 entries | 224 entries | 240 entries |
| FP registers | 160 entries | 160 entries | 160 entries | 192x 512-bit (+68 AVX512 mask) | 384x 512-bit |
| ALU ports | 4 | 4 | 4 (+1 branch) | 4 (+1 branch) | 6 |
| AGU ports | 2 | 3 | 3 | 3 | 4 |
| FP ports | 4 (128b) | 4 (256b) | 4 | 4 (+2 F2I) | 4 (+2 StD/IntD) |
| LD/ST queue | 72/44 entries | 72/48 entries | 116/64 entries | 136/64 entries | ? |
| LD/ST bandwidth | 32B/2x 128b | 2x 256b | 2x 256b/256b | 2x 256b/256b | 2x 256b/512b |
| L1D | 32 KB | 32 KB | 32 KB | 32 KB | 48 KB |
| L2 | 512 KB | 512 KB | 512 KB | 1 MB | 1 MB |
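
As a quick worked example, the generation-over-generation growth of the ROB can be computed from the ROB figures in the back-end summary. The helper name `generational_growth` is hypothetical; the input numbers are taken directly from the table.

```python
rob_entries = {"Zen": 192, "Zen 2": 224, "Zen 3": 256, "Zen 4": 320, "Zen 5": 448}


def generational_growth(values):
    """Percentage increase of each generation over the previous one."""
    names = list(values)
    return {
        names[i]: round(100 * (values[names[i]] - values[names[i - 1]]) / values[names[i - 1]], 1)
        for i in range(1, len(names))
    }


print(generational_growth(rob_entries))
# {'Zen 2': 16.7, 'Zen 3': 14.3, 'Zen 4': 25.0, 'Zen 5': 40.0}
```

By this measure, Zen 5 makes the largest single-generation jump in ROB capacity of the five architectures.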
