Document: 21464 Internal Design Specification
Order Number: XX-63CFF-C0
Revision: 11K
Year: 2001
Pages: 714
Original: 36MB
Original Filename:
OCR Text
Compaq Confidential

21464 Internal Design Specification

Available Internally from: HTTP://segsrv.hlo.dec.com/arana

This document specifies the internal design for the Alpha microprocessor that is also known as EV8 and Arana.

Revision/Update Information: Revision 1.1K, January 2001

Compaq Computer Corporation
Shrewsbury, Massachusetts

The information in this publication is subject to change without notice. COMPAQ COMPUTER CORPORATION SHALL NOT BE LIABLE FOR TECHNICAL OR EDITORIAL ERRORS OR OMISSIONS CONTAINED HEREIN, NOR FOR INCIDENTAL OR CONSEQUENTIAL DAMAGES RESULTING FROM THE FURNISHING, PERFORMANCE, OR USE OF THIS MATERIAL. THIS INFORMATION IS PROVIDED "AS IS" AND COMPAQ COMPUTER CORPORATION DISCLAIMS ANY WARRANTIES, EXPRESS, IMPLIED OR STATUTORY AND EXPRESSLY DISCLAIMS THE IMPLIED WARRANTIES OF MERCHANTABILITY, FITNESS FOR PARTICULAR PURPOSE, GOOD TITLE AND AGAINST INFRINGEMENT.

This publication contains information protected by copyright. No part of this publication may be photocopied or reproduced in any form without prior written consent from Compaq Computer Corporation.

© Compaq Computer Corporation 2001. All rights reserved. Printed in the U.S.A.

COMPAQ, the Compaq logo, the Digital logo, and VAX are registered in the United States Patent and Trademark Office. Pentium is a registered trademark of Intel Corporation. Other product names mentioned herein may be trademarks and/or registered trademarks of their respective companies.

Contents

Preface

1 Introduction
1.1 Terminology and Conventions

2 Architecture Overview
2.1 New Features
2.1.1 Processor Features
2.1.2 Memory Features
2.1.3 Multiprocessor Features
2.2 Microarchitecture Diagram
2.3 Simultaneous Multithreading (SMT)
2.4 Instruction Unit
2.4.1 Instruction Fetch Unit - the Ibox
2.4.2 Dependency Mapper Unit - the Pbox
2.4.3 Instruction Issue and Retire Unit - the Qbox
2.5 Execution Unit
2.5.1 Register File
2.5.2 Integer Instruction Execution Unit - the Ebox
2.5.3 Floating-Point Instruction Execution Unit - the Fbox
2.5.3.1 Functional Units
2.6 Memory Controller Unit - the Mbox
2.7 External Interface
2.7.1 Scache Controller - the Cbox
2.7.2 Router - the Rbox
2.7.3 Rambus Interface - the Zbox
2.7.4 Cache Coherency Protocol
2.7.4.1 Introduction to the Protocol
2.7.4.2 Structures that Maintain the Cache Coherence
2.8 Pipeline Organization
2.8.1 Pipeline Diagram
2.8.2 Conversion Between Negative Integer and Alphabet
2.8.3 Basic Pipeline Stage Conversion Equations
2.8.4 Conversion Table
2.9 Instruction Execution Pipelines and Latency
2.10 Instruction Issue and Retire Rules
2.10.1 Issue Rules
2.10.1.1 Bidding Rules
2.10.2 Retirement Rules
2.10.2.1 Completion Rules
2.11 Implementation-Specific Architecture Features
2.11.1 New Instructions
2.11.1.1 Thread Synchronization
2.11.1.2 Short Vector SIMD (Single Instruction Stream, Multiple Data Streams)
2.11.2 CMOV Instruction Processing
2.11.2.1 Integer CMOV Specification
2.11.2.2 Native CMOV
2.11.2.3 Floating-Point FCMOVxx Specification
2.11.2.4 Native FCMOV
2.11.2.5 Implementation
2.11.2.5.1 Native CMOV
2.11.2.5.2 Legacy CMOV
2.11.3 Mapper Alignment
2.12 Interrupts
2.12.1 IPR Access Mechanism
2.12.1.1 HW_MFPR and HW_MTPR PALcode Instructions
2.13 AMASK and IMPLVER Instruction Processing and Values
2.14 Performance Monitoring

3 Instruction Fetch Unit - the Ibox
3.1 Features
3.2 Major Sections
3.3 Forward Path Pipeline
3.4 Index Unit
3.4.1 Fetch TPU Chooser
3.4.2 Line Predictor
3.4.3 Thread Index Latches
3.4.3.1 (Re)Starting/Resuming the Pipe
3.4.3.1.1 Exceptions
3.4.3.1.2 Misprediction - PC Calc
3.4.3.1.3 Thread Resume - Line Predictor (two indexes)
3.4.3.2 Other Index Latch Tracking Functions
3.4.4 Thread Training Latches
3.5 Instruction Processing Unit
3.5.1 Icache Data Array
3.5.2 Icache Tag Array
3.5.3 Store-Sets Based Memory Dependence Predictor
3.5.4 Collapsing Buffer
3.5.4.1 Instruction Buffer
3.5.4.1.1 Data Path
3.5.4.1.2 Control Path
3.5.4.2 Collapser
3.5.4.2.1 Data Path
3.5.4.2.2 Start/End Buffer
3.5.4.2.3 [title illegible in original scan]
3.5.4.2.4 CMov
3.6 Control Flow Prediction Unit
3.6.1 Conditional Branch Prediction
3.6.1.1 Branch Prediction Components
3.6.1.1.1 Branch History (LGHist)
3.6.1.1.2 Prediction Tables
3.6.1.1.3 Bank Selection
3.6.1.1.4 Unshuffle Network
3.6.1.1.5 Backend logic and checkpoint information
3.6.1.2 Branch Training
3.6.1.2.1 Predictor Training
3.6.1.2.2 Hysteresis Training
3.6.1.3 PAL mode
3.6.2 Jump Target Predictor
3.6.3 Return Address Stack
3.7 PC Unit
3.7.1 PC Calculation
3.7.2 PC Compare
3.7.2.1 Index Mispredicts
3.7.2.2 Icache Hit Determination
3.7.2.3 Icache Access Violation
3.7.2.4 Icache Way Mispredict Determination
3.7.2.5 Instruction Cache Fill Request
3.8 Fill Unit
3.8.1 Instruction Translation Buffer
3.8.1.1 Architecture
3.8.1.2 IPRs That Affect the ITB
3.8.1.3 ITB Operations
3.8.1.3.1 Fills
3.8.1.3.2 Reads
3.8.1.3.3 Invalidates
3.8.2 Instruction Fill Unit
3.8.2.1 Demand Misses
3.8.2.1.1 Demand case: simple
3.8.2.1.2 Demand case: index and way match of active request: "piggybacking"
3.8.2.1.3 Demand case: flip_way active
3.8.2.1.4 Demand case: capacity stall
3.8.2.2 Prefetching
3.8.2.2.1 Prefetch case: simple
3.8.2.2.2 Prefetch cases: tag match or page boundary crossing
3.8.2.2.3 Prefetch case: Index CAM match
3.8.2.2.4 Prefetch case: alternate TPU demand during prefetching
3.8.2.2.5 Prefetch cases: badpath indication during prefetching
3.8.2.3 Fill
3.8.2.3.1 Predecode Bit Generation
3.8.2.3.2 Predecode Bits for Control Flow Instructions
3.8.2.3.3 Fill Data Routing
3.9 Checkpoint Unit
3.9.1 Checkpoint Table Components
3.9.1.1 Checkpoint Table Functions
3.9.1.1.1 Restarting on an exception
3.9.1.1.2 Restoring Predictor States
3.9.1.1.3 Predictor Training
3.10 Ibox Interfaces
3.10.1 Pbox Interface
3.10.2 Qbox Interface
3.10.3 Ebox Interface
3.10.4 Mbox Interface
3.10.5 Cbox Interface

4 Dependency Mapper Unit - the Pbox
4.1 Dependency Analysis: General Concepts
4.2 INum Space
4.2.1 INum Age Comparison
4.3 Component Details
4.3.1 INum Mapper (IMP)
4.3.1.1 Design Considerations
4.3.1.2 Design Architecture
4.3.1.3 Map Predecode Bits from the Ibox
4.3.2 Physical Register Map (PMP)
4.3.2.1 Design Considerations
4.3.2.2 Design Architecture
4.3.3 INum Allocator (INA)
4.3.3.1 Design Considerations
4.3.3.2 Design Architecture
4.3.3.3 Map Thread Chooser (MTC)
4.3.4 Mapper Exception Logic (MEX)
4.3.4.1 Design Considerations
4.3.4.2 Design Architecture
4.3.5 Memory Queue Allocation Unit (MQA)
4.3.5.1 Allocation
4.3.5.2 Background and Terminology
4.3.5.3 Basic Allocation Loop
4.3.5.4 Reset
4.3.5.5 Deallocation
4.3.5.6 Kills
4.3.5.7 Retires
4.3.5.8 Quiesce
4.3.5.9 Merge Buffer Purging
4.3.6 Instruction Decoder (IDC)
4.3.6.1 Design Considerations
4.3.6.2 Design Architecture
4.3.7 Load/Store Serial Number Allocator (LSN)
4.3.7.1 Design Considerations
4.3.7.2 Design Architecture
4.3.8 Post-Map Skid Buffer (PSB)
4.3.8.1 Design Considerations
4.3.8.2 Design Architecture
4.3.9 RC/RS Interrupt Flag Widget (RIF)
4.3.9.1 Design Considerations
4.3.9.2 Design Architecture
4.3.10 Bid/Grant Exception Logic (BEL)
4.3.10.1 Design Considerations
4.3.10.2 Design Architecture
4.3.11 Retire/Kill Unit (RKU)
4.3.11.1 Design Considerations
4.3.11.2 Design Architecture
5 Instruction Issue and Retire Unit - the Qbox
5.1 Scheduling Decisions - General Concepts
5.2 Component Details
5.2.1 Instruction Queue (IQ) Generalities
5.2.1.1 Design Considerations
5.2.1.2 Design Architecture
5.2.2 Queue Entry Table (QET) and Reallocation Logic (RAL)
5.2.2.1 Design Considerations
5.2.2.2 Design Architecture
5.2.2.2.1 Algorithm
5.2.2.3 Physical Organization
5.2.3 Dependency Arrays (DAs)
5.2.3.1 Design Considerations
5.2.3.2 Design Architecture
5.2.3.3 Physical Organization
5.2.4 Picker Arrays (PKs)
5.2.4.1 Design Considerations
5.2.4.2 Design Architecture
5.2.5 Bid Enable Logic (BID)
5.2.5.1 Design Considerations
5.2.5.2 Design Architecture
5.2.5.3 Physical Organization
5.2.6 FPCR Control Unit (FCR)
5.2.7 Profile-Me Data Collection (PRM)
5.2.8 Source Register Number Arrays (SRNs)
5.2.8.1 Design Considerations
5.2.8.2 Design Architecture
5.2.9 Destination Register Number Array (DRN)
5.2.9.1 Design Considerations
5.2.9.2 Design Architecture
5.2.10 Load/Store Number High-Water Marker (HWM)
5.2.10.1 Design Considerations
5.2.10.2 Design Architecture
5.2.11 Load/Poison Re-Arm Widget (LPR)
5.2.11.1 Design Considerations
5.2.11.2 Design Architecture
5.2.12 Post-Issue Logic (PIL)
5.2.12.1 Design Considerations
5.2.12.2 Design Architecture
5.2.13 Oldest CSR Selector (OCS)
5.2.13.1 Design Considerations
5.2.13.2 Design Architecture
5.2.14 Queue Chunk Allocator/Deallocator (ALC)
5.2.14.1 Design Considerations
5.2.14.2 Design Architecture
5.2.15 In-Flight Table (IFx)
5.2.15.1 Design Considerations
5.2.15.2 Design Architecture
5.2.16 Completion Unit (CMP)
5.2.16.1 Design Considerations
5.2.16.2 Design Architecture
5.2.16.2.1 Completion
5.2.16.2.2 Kills
5.2.16.2.3 Retirement
5.2.16.2.4 Mbox Interface
5.2.17 Payload Array (PAY)
5.2.17.1 Design Considerations
5.2.17.2 Design Architecture
5.2.18 Exception Kill Logic (EKC)
5.2.18.1 Design Considerations
5.2.18.2 Design Architecture

6 Integer Execution Unit - the Ebox
6.1 Major Components
6.1.1 Datapath
6.1.2 Timing
6.2 Integer Clusters
6.2.1 Adder
6.2.2 Shifter
6.2.3 Logic Box
6.2.4 Register File Operand Interface
6.2.5 Virtual Address Generator
6.2.6 Load Data Interface
6.2.7 Multimedia Interface
6.2.8 Global Control
6.2.9 Store Data Interface
6.3 Operand Steering
6.4 Register Caches
6.4.1 Writing the Rcache
6.4.2 Reading the Rcache
6.5 Multimedia Unit
6.5.1 Inputs and Outputs
6.5.2 Signal Nomenclature
6.5.3 Timing
6.5.4 Instruction Decode/Control Section
6.5.5 MVI Section
6.5.6 ALU
6.5.6.1 TADD, TSUB, PADD, PSUB, CMPWGE, MIN, MAX Instructions
6.5.6.2 TABSERR Instruction
6.5.6.3 TSQERR Instruction
6.5.6.4 Min/Max Instruction
6.5.7 Multiplier Array
6.5.8 Count Logic
6.5.9 Compare Word, Saturation, and the 21264 Min Max
6.5.10 MinMax Logic
6.5.11 Pack, Unpack, Permute Byte
6.5.12 Shifter
6.5.13 Delay
6.5.14 Integer Multiplier
6.6 Debug Features
6.7 Testability Features
6.8 External Interfaces: Ibox, Qbox, Pbox, Mbox, Register File, Fbox
6.8.1 Ibox
6.8.2 Qbox
6.8.3 Pbox
6.8.4 Mbox
6.8.5 Register File
6.8.6 Fbox
6.8.7 Global
6.9 IPRs
6.10 Exceptions
6.11 Poisoned Data
6.12 Format Conversions

7 Register File
7.1 Test Structures
7.1.1 Timing
7.1.2 Read Timing
7.1.3 Write/Read Timing
7.2 External Interfaces
7.2.1 Qbox to Register File Interface
7.2.2 Ebox to Register File Interface
7.2.3 Fbox to Register File Interface
7.2.4 Global Register File Interface
8 Floating-Point Execution Units - the Fbox
8.1 Major Sections
8.2 Interface Section
8.2.1 External Interface
8.2.2 Qbox Timing to Fbox
8.2.3 Fbox Pipeline Timing
8.2.4 Register File/Operand Bus
8.2.5 Loads/Stores to/from Fbox
8.2.6 Register Cache (F_RGC)
8.2.7 The Operand Steering Unit (F_OSU)
8.2.8 Interface Control (F_INT)
8.2.9 Divide and SQRT - Qbox interface
8.2.10 Fbox Exceptions
8.3 Fbox Floating-Point Control Register (FPCR)
8.3.1 FPCR Format
8.4 Fbox Multiplier Unit - F_MUL and F_GML
8.4.1 FMUL Operation
8.5 Fbox Add Pipeline
8.6 Fbox Add Pipe1 - F_AP1
8.6.1 Operation
8.6.1.0.1 Phase F0A
8.6.1.0.2 Phase F0B
8.6.1.0.3 Phase F1A
8.6.1.0.4 Phase F1B
8.6.1.0.5 Phase F2A
8.6.1.0.6 Phase F2B
8.7 Fbox Add Pipe2 - F_AP2
8.7.1 Cycle 1 Operation
8.7.1.1 Fraction
8.7.1.2 Exponent
8.7.1.3 Control
8.7.2 Cycle 2 Operation
8.7.2.1 Fraction
8.7.2.2 Exponent/Control
8.7.3 Cycle 3 Operation
8.7.3.1 Fraction
8.7.3.2 Exponent/Control
8.8 Fbox Short Pipe - F_SHP
8.8.1 Short Instructions
8.8.1.1 CPYS, CPYSN, CPYSE
8.8.1.2 FCMOVEQ, FCMOVGE, FCMOVGT, FCMOVLE, FCMOVLT, FCMOVNE
8.8.2 Unusual Input Operands
8.8.2.1 Unusual Cases
8.8.2.2 IEEE Data
8.8.2.2.1 ADDS, ADDT
8.8.2.2.2 DIVS, DIVT
8.8.2.2.3 MULS, MULT
8.8.2.2.4 SQRTS, SQRTT
8.8.2.2.5 SUBS, SUBT
8.8.3 Floating-Point Control Register (FPCR)
8.8.3.1 Reading the FPCR
8.8.3.2 Dynamic Rounding
8.8.3.3 Exceptions
8.9 Fbox Divider - F_DIV
8.9.1 Divider Description
8.9.2 The Divider in Detail
8.9.3 Over-Redundant Digits to Binary and Rounding
8.10 Fbox Square-Root Unit - F_SQR
8.11 Fbox Graphics Pipeline
8.11.1 Paired SP Floating-point Operate Instruction Format
8.11.2 Register and Memory Formats
8.11.3 Rounding Modes
8.11.4 Exceptions
8.11.5 Paired Single-Precision Instructions
8.11.5.1 Graphics Add Pipeline: F_GAD
8.11.5.2 Fraction Datapath
8.11.5.2.1 OP_MUX
8.11.5.2.2 FTA, FTB
8.11.5.2.3 FGT
8.11.5.2.4 LXD and EXP PRED
8.11.5.2.5 LXS and LXE
8.11.5.2.6 FI1/FI2 MUX and the LEFT/LR Shifters
8.11.5.2.7 RND CSA and ADDER
8.11.5.3 Exponent Data Path
8.11.5.3.1 EDIFF ADDER
8.11.5.3.2 EDIFF DETECT
8.11.5.3.3 ER MUX
8.11.5.3.4 EXP_RES_ADD
8.12 G_AD Control
8.12.1 Fraction Data Path
8.13 Sticky Bit Calculation

9 Memory Instruction Execution Unit - the Mbox
9.1 Major Inputs & Outputs
9.1.1 Inputs
9.1.2 Outputs
9.2 Dcache
9.3 Dtags
9.4 Load Queue
9.5 Merge Buffer
9.6 Pre-MAF
9.7 Store Queue (SQA and SQD)
9.8 Translation Buffers
9.9 Back End Bus
9.10 Operations
9.10.1 Read Requests
9.10.2 Prefetches
9.10.3 Write Requests
9.10.4 Retries
9.10.5 Dcache Misses
9.10.6 Load Locked/Store Conditional
9.10.7 Traps
9.10.8 Invalidates/Probes
9.10.9 Memory Barriers
9.10.10 Multi-threading
9.11 Interfaces
9.11.1 Pipeline Legend
9.12 Data Address Translation Buffer (DTB)
9.12.1 Timing
9.12.2 What Data are Compared on a DTB Lookup?
9.12.2.1 The TPU Group
9.12.2.2 Granularity Hints
9.12.3 64K Pages
9.12.4 Hit Determination
9.12.5 Returned Status
9.12.6 Effects of a DTB Miss
9.12.6.1 Speculative and Duplicate DTB Entries
9.12.7 Data Storage in the PTE
9.12.8 IPRs That Affect the Contents or Behavior of the DTB
9.12.9 Superpages
9.12.10 Possible Support for Generic Superpages
9.12.10.1 Page Table Array (PTA) Implementation
9.12.10.2 Virtual Address Array (VAA) Implementation
9.12.11 Replacement Policy
9.12.12 DTB Size
9.12.13 ITB Usage
9.12.14 Reset and Testability
9.12.15 Issues
9.13 Store Logic
9.13.1 Overview
9.13.2 Store Issue Flow
9.13.3 Load Issue Flow
9.13.4 Store Copy-Out Flow
9.13.5 Block Allocate Flow (TBD)
9.13.6 Things Not Done
9.14 Merge Buffer
9.14.1 Overview
9.14.2 Merge Buffer Allocation
9.14.2.1 Boundary Case
9.14.3 Merge Buffer Writes to Dcache
9.14.4 Scache Writes
9.14.5 Probe handling in the Merge Buffer
9.14.6 Line fill and Merge Buffer
9.14.7 IO Stores
9.14.8 Store Conditional Support
9.14.9 MB and WMB Processing
9.14.10 MAF request
9.14.11 Cache Movement ops (WH64, Evict)
9.14.12 Merge Buffer States
9.14.13 Data Array
9.14.14 Address Array
9.14.15 Control Section
9.15 Load Queue
9.15.1 Load Queue Allocation
9.15.2 (Age) Young Vector generation
9.15.3 Load Queue Limit and Block Allocation
9.15.4 Thread Choosing
9.15.5 Block Assignment
9.15.6 Load Issue
9.15.7 Load Retries
9.15.8 Dcache Miss
9.15.8.1 MAF Pick
9.15.8.2 Load Queue Pick
9.15.9 Scache Line Miss
9.15.10 Load Queue retry - Bank Conflict
9.15.11 Retry at retirement
9.15.12 Retry Block
9.15.12.1 Pick Oldest Retry
9.15.12.2 Oldest and Next Oldest Retry Chooser
9.15.12.3 Thread Chooser
9.15.13 Prefetches
9.16 Load Traps
9.16.1 DTB trap
9.16.1.1 Load/store Order Trap
9.16.1.2 Inval Trap (Traps Due to Probe-invalidates)
9.16.1.3 MGB Trap (Traps Due To Merge Buffer Dispatches On Back End Bus)
9.16.1.4 Trap Summary
9.16.2 Trap Resolution
9.16.3 Thread chooser
9.16.4 Kill Bus
9.16.5 Litmus 1 Handling
9.17 Dcache Tags
9.17.1 Front End Tags
9.17.1.1 Timing
9.17.1.2 Tag Operations
9.17.2 Back End Tag
9.17.2.0.1 Tag Operations
9.17.3 IPRs
9.18 Dcache Array
9.18.1 Read Dcache
9.18.2 Write Dcache
9.18.3 Bypass Fill Data
9.18.4 Structure
9.19 Pre-MAF
9.19.1 Merge Buffer Requests
9.19.2 D-stream Queue
9.19.3 Killing Retries
9.19.4 I-stream Queue
9.20 Mbox Back End Bus
9.21 Internal Processor Registers
9.21.1 Implicitly Written IPRs

10 Internal Ring Bus

11 Second-Level Cache and Controller (Cbox)
11.1 Cbox Overview
11.2 Sbox Overview
11.3 Scache Control - the CS Partition
11.3.1 Overall Pipeline Flow
11.3.2 Miss Address File - the MAF
11.3.2.1 Overview
11.3.2.2 Principle of Operation
11.3.2.2.1 Requests from the Core
11.3.2.2.2 Fills/Responses from the System
11.3.2.2.3 Probes From Other Processors
11.3.2.3 MAF Pipeline Timing Diagram and Pipeline Overview
11.3.2.3.1 CZ, C0: MAF Arbitration Logic
11.3.2.3.2 C1: MAF Bank Conflict Detection Logic / MAF CAM / MAF RD
11.3.2.3.3 Exceptions
11.3.2.3.4 C1: MAF CAM / MAF RD
11.3.2.3.5 C2: MAF Logic
11.3.2.3.6 C3-C6: Scache Tag Access
11.3.2.3.7 C7: Fill Pipe Control
11.3.2.4 Contents of MAF Entries
11.3.2.5 MAF Allocation/Merge/Retry
11.3.2.6 MAF Deallocation
11.3.3 RSQ
11.3.4 Internal Probe Queue - the IPQ
11.3.5 Probe Queue - the PRQ
11.3.5.0.1 Principle of Operation
11.3.5.1 Probe Address File (MAF) Contents per Entry
11.3.6 Victim Address File - the VAF
11.3.6.1 Victim Address File (VAF) Contents per Entry
11.3.6.2 Principle of Operation
11.3.6.3 Secondary VAF Flows
11.3.6.4 Reserved VAF Entries
11.3.7 System Interface (SYS)
11.3.7.1 Principle of Operation
11.3.7.1.1 Response FIFO Entry Fields
11.3.7.1.2 Request FIFO Entry Fields
11.3.8 System Request Queue (SRQ)
11.3.8.1 Principle of Operation
11.3.9 Retry Queue (RTQ)
11.3.9.1 Principle of Operation
11.3.10 TTQ
11.4 Fill Datapath - the CF Partition
11.4.1 FBE
11.4.2 VDB
11.4.3 FOB
11.4.4 DBM
11.4.5 RBI
11.4.6 RBO
11.5 Scache Tag Array - the ST Partition
11.5.0.1 Principle of Operation
11.5.0.2 Pipeline Stages
11.5.0.3 State Transition
11.5.0.4 Stale Fill Table
11.5.0.5 The 21464 Scache Least Recently Used (LRU) Scheme
11.5.0.6 Scache Tag ECC Code
11.6 Scache Data Array - the SG Partition
11.7 Flows
11.7.1 Overall Pipeline Flow
11.7.1.1 Pipe Operation
11.7.1.2 Pipeline Timing Diagrams
11.7.1.2.1 Scache Control Pipeline Stages
11.7.1.3 Resource Conflict
11.7.1.4 Scache Bank Conflict Check
11.7.2 Fill and LRU Evict Flow
11.7.2.1 Hiccup Flow
11.7.3 Probe Flow
11.7.4 Mbox Request Flow
11.7.5 Victim Flow
11.7.6 Retry Flow
11.8 Special Support
11.8.1 Input - Output
11.8.1.1 I/O Request Ordering and Merging
11.8.1.2 I/O System Request
11.8.1.3 Others
11.8.1.4 I/O Request Flow
11.8.1.5 I/O Specific Structures/Operations
11.8.1.6 I/O System Request Timing
11.8.1.7 I/O Request Packet Format
11.8.1.7.1 Read I/O (RDIO)
11.8.1.7.2 Write I/O (WRIO)
11.8.2 Memory Barriers - the MB Instruction
11.8.3 Load-Locked Store-Conditional (LDx_L/STx_C) Instruction Processing
11.8.3.1 Lock Register for Each Thread
11.8.3.2 Stx_C Issuing
11.8.4 Prefetch/Modify
11.9 IPRs, CSRs, and Error Handling
11.9.1 Required IPRs and CSRs
11.9.2 Error Handling
11.9.3 Cbox Deadlock Avoidance Mechanisms
11.10 Profiling Support
11.11 Stuff From Original Cbox Spec Not in Outline
11.11.1 Scache Index (paddr<18:6>) Conflict
11.11.2 ShrToDirty[STC]Req
11.11.3 Scache Tag Launch Pipe
11.11.4 Probe Processing in Cbox
11.11.5 Order Dependency
11.11.6 Possible Race Conditions and Other Concerns
11.11.7 CBox mechanisms

12 Cache Coherence Protocol Processing
12.1 Introduction to the Protocol
12.2 Structures that Maintain the Cache Coherence
12.2.1 Miss Address File (MAF)
12.2.2 System Request Queue (SRQ)
12.2.3 Victim Buffer
12.2.4 Probe Queue (PRQ)
12.2.5 DIFT
12.3 Overview of the Cache Coherency Protocols
12.3.1 Comparison Between 21364 and 21464 Cache Coherence Protocols
12.3.2 Onchip Directory Cache
12.3.3 Coherence Messages are Split into Three Types
12.4 Protocol Races
12.5 Probe Processing
12.6 Coherence State
12.7 MAF Address CAM
12.8 Scache Hit
12.9 VAF Address CAM
12.10 Directory Responses
12.11 System Command Opcodes
12.12 Protocol Message Descriptions
12.12.1 IO CHANNEL Message Details
12.12.1.1 RdBytes, RdLWs, RdQWs, RdIPR
12.12.1.2 WrBytes, WrLWs, WrQWs, WrIPR
12.12.2 REQUEST CHANNEL Message Details
12.12.2.1  ReadReq
12.12.2.2  ReadSharedReq
12.12.2.3  ReadModReq
12.12.2.4  FetchReq
12.12.2.5  SharedtoDirtyReq
12.12.2.6  SharedtoDirtySTCReq
12.12.2.7  InvaltoDirtyReq
12.12.3  FORWARD CHANNEL Message Details
12.12.3.1  ReadForward, ReadSharedForward, ReadModForward, FetchForward, InvaltoDirtyForward
12.12.3.2  SharedInvalSingle
12.12.3.3  SharedInvalBroadcast
12.12.4  RESPONSE CHANNEL Message Details
12.12.4.1  BlkShared
12.12.4.2  BlkExclusiveCnt
12.12.4.3  BlkInval
12.12.4.4  BlkIO
12.12.4.5  Victim
12.12.4.6  VictimtoShared
12.12.4.7  VictimAckExcl
12.12.4.8  VictimAckShared
12.12.4.9  InvaltoDirtyRespCnt
12.12.4.10  SharedtoDirtySuccessCnt
12.12.4.11  SharedtoDirtyProbCnt
12.12.4.12  SharedtoDirtyFail
12.12.4.13  NXMResp
12.12.4.14  ERRResp
12.12.4.15  InvalAck
12.12.4.16  WrIOAck
12.12.4.17  WrIONAck
12.12.4.18  VictimClean
12.12.4.19  VictimCleantoShared
12.12.4.20  ForwardAckExcl
12.12.4.21  ForwardAckShared
12.12.4.22  ForwardMiss
12.12.4.23  SharedtoDirtyComplete
12.12.4.24  SharedtoDirtyRelease
12.12.5  SPECIAL CHANNEL Message Details
12.12.5.1  NZNOP
12.12.5.2  SpecialInvalBroadcast
12.13  Protocol Race Descriptions
12.13.1  Early Forward Race
12.13.2  Late Forward Race
12.13.3  Dual Victim Race
12.13.4  Early InvalAck Race
12.13.5  Early InvalShared Race
12.13.6  Wrong SharedtoDirtySuccess Race
12.13.7  A Note on SharedtoDirties and their Resolution
12.13.8  Special Store-Conditional Support
12.13.9  Local CBOX Too Far Ahead

13  Router Interface - the Rbox
13.1  Protocol Messages
13.1.1  Messages on the IO_CHANNEL
13.1.2  Messages on the REQUEST_CHANNEL
13.1.3  Messages on the FORWARD_CHANNEL
13.1.4  Messages on the RESPONSE_CHANNEL
13.1.5  Messages on a SPECIAL_CHANNEL
13.2  Message Format Details
13.2.1  Route Information
13.2.2  Flow Control and Dealloc Information
13.2.3  Packet Formats
13.2.3.1  IO_CHANNEL Formats
13.2.3.2  REQUEST_CHANNEL Format
13.2.3.3  FORWARD_CHANNEL Format
13.2.3.4  RESPONSE_CHANNEL Formats
13.2.3.5  SPECIAL_CHANNEL Formats
13.2.3.6  INPUT I/O PORT HEADER TICK Formats
13.2.3.7  ROUTE FIELD Format
13.3  SharedInvalBroadcast Details
13.4  I/O Port and I/O ASIC Assumptions
13.5  Interrupt Delivery
13.6  DMA Device Assumptions
13.6.1  I/O DMA Access and Exclusive Caching
13.6.2  I/O DMA Access via Timeouts
13.7  I/O Space Ordering and Assumptions

14  Rambus Interface - the Zbox
14.1  The 5th Rambus Channel

15  Miscellaneous Interfaces
15.1  The GIO Port
15.1.1  Signals
15.1.2  Transactions
15.1.3  Registers
15.1.3.1  GIO_CNFG
15.1.3.2  GIO_ADDR
15.1.3.3  GIO_DATA
15.1.4  Use
15.1.4.1  Differences In Implementation Between the 21364 and 21464

16  Internal Processor Registers
16.1  Internal Processor Register Summary
16.1.1  PALcode Coding Rules
16.1.2  IPR Issues
16.1.3  Reset
16.2  Ibox IPRs
16.2.1  Cycle Counter Register - CC[tpu]
16.2.2  DTB Single-Miss Return Address Register - DTBMS_RET_ADDR[tpu]
16.2.3  Exception Address Register - EXC_ADDR[tpu]
16.2.4  Exception Summary Register - EXC_SUM[tpu]
16.2.5  Ibox CPU Configuration Register - CPU_CNFG
16.2.6  Ibox TPU Configuration Register - TPU_CNFG
16.2.7  Ibox Control Register - I_CTL[tpu]
16.2.8  Ibox Process Mode Register - I_MODE[tpu]
16.2.9  Ibox Process Context Register - I_PCTX[tpu]
16.2.10  Icache Status Register - IC_STAT[tpu]
16.2.11  Icache Flush Register - IC_FLUSH[tpu]
16.2.12  Icache Flush (ASM=0) Register - IC_FLUSH_ASM[tpu]
16.2.13  ITB Invalidate Multiple Register - ITB_IM[tpu]
16.2.14  ITB Invalidate Single Register - ITB_IS[tpu]
16.2.15  Instruction PTE Array Write Register - ITB_PTE[tpu]
16.2.16  Instruction Tag Array Write Register - ITB_TAG[tpu]
16.2.17  Instruction Virtual Address Format Register - IVA_FORM[tpu]
16.2.18  PALcode Base Address Register - PAL_BASE[tpu]
16.2.19  PALcode Temp Registers - PAL_TEMP1[tpu], PAL_TEMP2[tpu]
16.3  Mbox IPRs
16.3.1  Dcache Control Register - DC_CTL
16.3.2  Dcache Status Register - DC_STAT[tpu]
16.3.3  DTB Invalidate Multiple Register - DTB_IM[tpu]
16.3.4  DTB Invalidate Single Register - DTB_IS[tpu]
16.3.5  DTB PTE Array Write Registers - DTB_PTE0[tpu], DTB_PTE1[tpu]
16.3.6  DTB Tag Array Write Registers - DTB_TAG0[tpu], DTB_TAG1[tpu]
16.3.7  Mbox Control Register - M_CTL[tpu]
16.3.8  Mbox Process Mode Register - M_MODE[tpu]
16.3.9  Mbox Process Context Register - M_PCTX[tpu]
16.3.10  Mbox Memory Management Status Register - M_STAT[tpu]
16.3.11  Quiesce Timeout Register - QUIESCE_TIMEOUT[tpu]
16.3.12  Virtual Address Register - VA[tpu]
16.3.13  Virtual Address Format Register - VA_FORM[tpu]
16.3.14  Watch Physical Address Register - WATCH_PHYS_ADDR[tpu]
16.4  Cbox IPRs
16.4.1  Hardware Interrupt Clear Register - HW_INT_CLR[tpu]
16.5  Rbox IPRs
16.5.1  Router Configuration1 (R,W) - R_CFG1
16.5.2  Router Configuration2 (R,W) - R_CFG2
16.5.3  Router Channel {N,S,E,W} Configuration1 (R,W) - R_n_CFG1
16.5.4  Router Channel {N,S,E,W} Configuration2 (R,W) - R_n_CFG2
16.5.5  Router Channel {N,S,E,W} Timer1 Configuration (R,W) - R_n_T1CFG
16.5.6  Router Channel {N,S,E,W} Timer2 Configuration (R,W) - R_n_T2CFG
16.5.7  Router Channel {N,S,E,W} Error Status (R,W1C) - R_n_ERR
16.5.8  Router Channel {N,S,E,W} Performance Counter (R,W) - R_n_PERF
16.5.9  Router I/O-Port Configuration1 Register (R,W) - R_IO_CFG1
16.5.10  Router I/O-Port Configuration2 Register (R,W) - R_IO_CFG2
16.5.11  Router I/O-Port Buffer Size (R,W) - R_IO_BUFSIZ
16.5.12  Router I/O-Port Timer1 Configuration (R,W) - R_IO_T1CFG
16.5.13  Router I/O-Port Timer2 Configuration (R,W) - R_IO_T2CFG
16.5.14  Router I/O-Port Error Status (R,W1C) - R_IO_ERR
16.5.15  Router I/O-Port Performance Counter (R,W) - R_IO_PERF
16.5.16  Router Local-Port Error Status Register (R,W1C) - R_LOC_ERR
16.5.17  Router Routing Table Register (R,W) - R_ROUT
16.5.18  Router WHOAMI Register (R,W) - R_WHOAMI
16.5.19  Router Overall-Timer-Control Register (R,W) - R_OVER
16.5.20  Router Interrupt Status (R,W1C) - R_INT_STAT
16.5.21  Router Interrupt Mask (R,W) - R_INT_MASK
16.5.22  Router Interrupt Request (WO) - R_INT_REQ
16.5.23  Router Interrupt Queue Register (RO) - R_INT_QUE
16.5.24  Router Interrupt Queue Add Register (WO) - R_INT_QUEADD
16.5.25  Router Interval Timer Register (R,W) - R_INTER_TIM
16.5.26  Router Scratch Register 1 (R,W) - R_SCRATCH1
16.5.27  Router Scratch Register 2 (R,W) - R_SCRATCH2
16.6  Zbox IPRs
16.6.1  DRAM Error Status 1 - ZBOXn_DRAM_ERR_STATUS1
16.6.2  DRAM Error Status 2 - ZBOXn_DRAM_ERR_STATUS2
16.6.3  DRAM Error Status 3 - ZBOXn_DRAM_ERR_STATUS3
16.6.4  DRAM Error Control - ZBOXn_DRAM_ERROR_CTL
16.6.5  DRAM Timing Control 1 - ZBOXn_DRAM_TIMING_CTL1
16.6.6  DRAM Timing Control 2 - ZBOXn_DRAM_TIMING_CTL2
16.6.7  DRAM Timing Control 3 - ZBOXn_DRAM_TIMING_CTL3
16.6.7.1  Calculating Read to Write and Write to Read Spacing
16.6.7.2  Terminology
16.6.7.3  Ideal Rambus
16.6.7.4  Non-Ideal Rambus
16.6.8  DRAM Refresh Control - ZBOXn_DRAM_REFR_CTL
16.6.9  DRAM Calibration Control 1 - ZBOXn_DRAM_CALIB_CTL1
16.6.9.1  Temperature Calibration Interval
16.6.9.2  Current Control Interval
16.6.10  DRAM Calibration Control 2 - ZBOXn_DRAM_CALIB_CTL2
16.6.10.1  Read to Current Control Transition
16.6.10.2  Temperature Calibrate to Read Transition
16.6.10.3  Read to Temperature Calibrate Transition
16.6.11  DRAM Timing Control 4 - ZBOXn_DRAM_TIMING_CTL4
16.6.12  DRAM Refresh Row - ZBOXn_DRAM_REFRESH_ROW
16.6.13  DRAM Initialization Control - ZBOXn_DRAM_INIT_CTL
16.6.14  DIFT Control - ZBOXn_DIFT_CTL
16.6.15  DRAM Error Address - ZBOXn_DRAM_ERR_ADR
16.6.16  DIFT Timeout - ZBOXn_DIFT_TIMEOUT
16.6.17  DRAM Mapper Control - ZBOXn_DRAM_MAPPER_CTL
16.6.18  Zbox Performance Counter 0 - ZBOXn_ZPM_CTR0
16.6.19  Zbox Performance Counter 1 - ZBOXn_ZPM_CTR1
16.6.20  Zbox Performance Control - ZBOXn_ZPM_CTL
16.6.21  Zbox Sweep Directory Bits - ZBOXn_DRAM_SWEEP_DIR
16.6.22  Zbox Force-Error Address Register - ZBOXn_FRC_ERR_ADR
16.6.23  Zbox DIFT Error Status - ZBOXn_DIFT_ERR_STATUS
16.6.24  Zbox RAC Control - ZBOXn_RAC_CTL

17  Privileged Architecture Library Code
17.1  HW_LD and HW_ST Instructions
17.2  HW_MFPR and HW_MTPR Instructions
17.2.1  HW_MFPR Instruction
17.2.2  HW_MTPR Instruction
17.3  Execution of the RET Instruction in PALmode
17.4  CMOV Execution Within PALcode
17.5  PALcode Restrictions and Guidelines
17.5.1  Restriction 1: PALcode Must Guarantee That IPR Writes Retire Before Returning
17.5.2  Restriction 2: IFETCHB Required Between IPR Writes in the Same IPR Group
17.5.3  Restriction 3: Mbox IPRs Must be Written Twice to Ensure Correct Slotting
17.5.4  Restriction 4: All Instructions in the DTB Writer Block Must be in the Same Map Block
17.5.5  Restriction 5: All Four DTB MTPR Instructions Must Appear in the Same Fetch Block
17.5.6  Restriction 6: Non-DTB Writer Block DTBMS_RET_ADDR MFPRs Require IFETCHB
17.5.7  Restriction 7: IFETCHB Required Between Non-DTB Writer Block and DTB Writer Block MxPRs
17.5.8  Restriction 8: Padding Required Between DTB Writer Block and DTB-Dependent Instructions
17.5.9  Restriction 9: PALcode Must Not Allow Writes to INVALID DTB_PTE Entries to Retire
17.5.10  Restriction 10: TAG and PTE Must be Written as Pairs with TAG Writes Before PTE Writes
17.5.11  Restriction 11: Register-Dependent MTPRs Must Not Have Read Class Dependent MxPRs
17.5.12  Restriction 12: CMOV Instructions Cannot Specify PALcode Shadow Registers as Destinations
17.5.13  Restriction 13: PALmode Native CMOV Instructions Cannot Specify R24 or R25 as Destinations
17.5.14  Restriction 14: PALmode JMP Instructions Must be Followed by IFETCHB
17.5.15  Guideline 15: No Push or Pop Instructions in the First Fetch Block of a PALmode Flow
17.5.16  Restriction 16: PALmode MT_FPCR Must be Followed by IFETCHB

18  Initialization and Configuration

19  Performance Monitoring
19.1  Instruction Based Profiling
19.1.1  Profiling Methodology
19.1.2  Initiating an Instruction Profile Sample
19.1.3  Instruction Profile Record IPRs
19.1.3.1  Data/Event IPRs
19.1.3.2  Timeline/Latency IPRs
19.1.3.3  Aggregate Event/Data IPRs
19.2  Memory Reference Performance Monitoring
19.2.1  Cbox Performance CSRs
19.2.1.1  Cbox Performance Control - CBOX_PRF_CTL<31:0>
19.2.1.2  Cbox Performance Address - CBOX_PRF_ADR<63:0>
19.2.1.3  Cbox Performance Status - CBOX_PRF_STS<25:0>
19.2.1.4  Cbox Performance Match - CBOX_PRF_MAT<25:0>
19.2.1.5  Cbox Performance Match Value - CBOX_PRF_MATV<25:0>
19.2.1.6  Cbox Performance Counter - CBOX_PRF_CNT<31:0>
19.2.2  Zbox Performance CSRs
19.2.2.1  Zbox Performance Counter 0 - ZBOXn_ZPM_CTR0<31:0>
19.2.2.2  Zbox Performance Counter 1 - ZBOXn_ZPM_CTR1<31:0>
19.2.2.3  Zbox Performance Control - ZBOXn_ZPM_CTL<31:0>
19.2.3  Rbox Performance CSRs
19.2.3.1  Rbox Port Performance Counter - RBOX_n_PERF<27:0>
19.2.3.2  Rbox IO Port Performance Counter - RBOX_IO_PERF<27:0>
19.3  Addendum: Implementation Notes
19.3.1  From Data/Event IPRs
19.3.2  Following Table 17-4

20  Hardware Debug Features
20.1  Debug Process
20.2  Feature Overview
20.2.1  Scan
20.2.2  Trace Bus
20.2.3  Internal Processor Registers
20.2.4  Derived Signals
20.3  Global Support
20.3.1  Scan
20.3.2  Trace Bus
20.3.3  Trigger Logic
20.4  Box Support
20.4.1  Ibox
20.4.2  Pbox/Qbox
20.4.3  Ebox/Register File
20.4.4  Mbox
20.5  Software Support

21  Testability and Diagnostics
21.1  Global Block Diagram
21.1.1  Group 1 - Array BiST/BiSR Satellites
21.1.2  Group 2 - BiST Satellites
21.1.3  Group 3 - Observability Registers (LFSRs)
21.1.4  Group 4 - Scan Islands (TBD)
21.1.5  Group 5 - Boundary Scan Register
21.2  Test Pins
21.3  Central Port Controller
21.3.1  IEEE 1149.1 Test Access Port Controller
21.3.2  Port Configuration and FireWall Logic
21.3.3  Clock Control Unit
21.3.4  Tbox Reset Engine
21.3.5  SROM Engine
21.4  Dot1 Test Decode and Dispatch Logic
22  Error Detection and Error Handling
22.1  Disruptions
22.1.1  High-Level Features
22.1.2  Low-Level Features

23  Hardware Interface
23.1  Signal Pad Requirements

24  New Instructions

25  System Configurations

26  Physical Addressing and Input/Output

27  Requirements to Support "Tandem"

A  Instruction Decoding
A.1  Instruction Format
A.2  Predecodes
A.3  Instruction Latency
A.4  Execution Pipelines
A.5  Instruction Info (INST_INFO<15:0>)
A.6  Specific Opcode and Instruction Type Decoding
A.6.1  Opcode 00, CALL_PAL
A.6.2  Opcodes 01 through 07, Reserved
A.6.3  Opcode 10, Integer Add/Subtract/Compare
A.6.4  Opcode 11, Integer Logical
A.6.5  Opcode 12, Integer Shift
A.6.6  Opcode 13, Integer Multiply
A.6.7  Opcode 14, ITOFx and Floating-Point Square Root
A.6.8  Opcode 15, VAX Floating-Point
A.6.9  Opcode 16, IEEE Floating-Point
A.6.10  Opcode 17, Miscellaneous Floating-Point
A.6.11  Opcode 18, Miscellaneous
A.6.12  Load and Store Instructions
A.6.13  Opcode 1C, Integer Multimedia
A.6.14  Branch and Jump Instructions
A.6.15  PALcode Instructions

B  LDx_ARM/QUIESCE Instruction Characteristics
B.1  Relationship Between SMT and LDx_ARM/QUIESCE
B.2  Goals for the LDx_ARM and QUIESCE Instruction Definition
B.2.1  Specific LDx_ARM Instruction Characteristics
B.2.1.1  Instruction Description
B.2.2  Specific QUIESCE Instruction Characteristics
B.2.2.1  Data Sharing Using LDx_ARM/Quiesce
B.3  Proposed Opcode Assignments
B.4  Implementation
B.4.1  Interaction of Interrupts and QUIESCE
B.4.2  Quiesce-Related Hardware
B.4.3  Reallocation Hardware Resources During Quiesce
B.4.4  Issues to Consider While Finalizing the Hardware Design
B.5  Alternative Proposals to the LDx_ARM/QUIESCE Current Design
B.5.1  Timer-Based
B.5.2  Unified QUIESCE Instruction
B.5.3  Use architectural Registers to Enforce LDx_ARM/QUIESCE Dependency
B.5.4  Add LDx_ARM Functionality to LDx_L
B.5.5  Define QUIESCE to be a load and test
B.5.6  Define QUIESCE to be a read of memory and compare with a register
B.6  Open Issues

C  Proposed Memory Management IPR Design
C.1  Motivation for This Design
C.2  Page Table Assumptions
C.3  I-Stream (I_CTL) and D-Stream (M_CTL) Control Registers
C.3.1  I_CTL
C.3.2  M_CTL
C.3.3  PAGE_SIZE, VA_SIZE, and REDUCED_PAGE_TABLE Field Combinations
C.4  VA_FORM and IVA_FORM
C.4.1  The Transformation From VA to VA_FORM
C.4.2  43-bit VA / 8 KB Page
C.4.3  52-bit VA / 64 KB Page
C.4.4  52-bit VA / 64 KB Page / Reduced Page Tables
C.5  Sign Extension Checking
C.5.1  Previous Implementation
C.5.2  Proposed Implementation

Glossary

Index

Figures
2-1  21464 Block Diagram
2-2  21464 Pipeline Stage Diagram
3-1  Ibox Block Diagram
3-2  Line Predictor Block Diagram
3-3  High level diagram of the 21464 branch predictor
3-4  Jump Predictor Block Diagram
3-5  Instruction Fill Unit (IFU) Request and Fill Sections
3-6  Instruction Fill Unit (IFU) Demand Subsection
3-7  Instruction Fill Unit (IFU) Prefetch Subsection
3-8  Instruction Fill Unit (IFU) Fill Section
4-1  Pbox Block Diagram
4-2  The INum Circle
5-1  Simplified View of One-Half of the Instruction Queue
5-2  Simplified View of Full Instruction Queue
5-3  Simplified Diagram of QET and Pickers for Two Pipelines
5-4  Tracking Data-Ready Instructions
6-1  Ebox Block Diagram
6-2  Ebox Datapath Block Diagram
6-3  Cluster Section Organization
6-4  Ebox ITOFx and FTOIx Floating-Point Store Data Paths
6-5  Ebox Register Cache Block Diagram
6-6  Ebox Register Cache Multiport Static RAM Block Diagram
6-7  Ebox Register Cache Single-Cycle Result Flow
6-8  Ebox Register Cache Multi-Cycle Result Flow
6-9  Writing Entries in the Ebox Register Cache
6-10  Ebox Multimedia Unit Block Diagram
6-11  Ebox Multimedia Unit Pipeline Timing
6-12  Ebox Multimedia Unit MVI Section Block Diagram
6-13  Ebox Multimedia Unit Arithmetic Logic Unit
6-14  Ebox Multimedia Unit Computation of the Min/Max Instruction
6-15  Ebox Multimedia Unit Multiplier Array Block Diagram
6-16  Ebox Multimedia Unit Multiplier Array Tree Adder
6-17  Ebox Multimedia Unit Min/Max Logic Block Diagram
6-18  Ebox Multimedia Unit Shifter
6-19  Ebox Multimedia Unit Integer Multiplier
7-1  Register File Block Diagram
8-1  Fbox Organization
8-2  Register Cache
8-3  FPCR Update Mechanism
8-4  F_AP1 Block Diagram
8-5  CMP Instruction Logic
8-6  F_AP2 Block Diagram
8-7  Fbox Floating-Point Control Registers
8-8  F_SHP Block Diagram
8-9  F_DIV Block Diagram
8-10  F_SQR Block Diagram
8-11  F_GAD Block Diagram for One-Half of the Pair
9-1  Address and Data Path
9-2  Scache Write-Through Process
9-3  Merge Buffer Entry States
9-4  Pre-MAF Queue
15-1  GIO Port Read Transaction Timing
15-2  GIO Port Write Transaction Timing
15-3  GIO_CNFG Register
15-4  GIO_ADDR Register
15-5  GIO_DATA
16-1  Cycle Counter Register - CC[tpu]
16-2  DTB Single-Miss Return Address Register - DTBMS_RET_ADDR[tpu]
16-3  Exception Address Register - EXC_ADDR[tpu]
16-4  Exception Summary Register - EXC_SUM[tpu]
16-5  Ibox CPU Configuration Register - CPU_CNFG
16-6  Ibox TPU Configuration Register - TPU_CNFG
16-7  Ibox Control Register - I_CTL[tpu]
16-8  Ibox Process Mode Register - I_MODE[tpu]
16-9  Ibox Process Context Register - I_PCTX[tpu]
16-10  Icache Status Register - IC_STAT[tpu]
16-11  Icache Flush Register - IC_FLUSH[tpu]
16-12  Icache Flush (ASM=0) Register - IC_FLUSH_ASM[tpu]
16-13  ITB Invalidate Multiple Register - ITB_IM[tpu]
16-14  ITB Invalidate Single Register - ITB_IS[tpu]
16-15  Instruction PTE Array Write Register - ITB_PTE[tpu]
16-16  Instruction Tag Array Write Register - ITB_TAG[tpu]
16-17  Instruction Virtual Address Format Register - IVA_FORM[tpu]
16-18  PALcode Base Address Register - PAL_BASE[tpu]
16-19  PALcode Temp Registers - PAL_TEMP1[tpu], PAL_TEMP2[tpu]
16-20  Dcache Control Register - DC_CTL
16-21  Dcache Status Register - DC_STAT[tpu]
16-22  DTB Invalidate Address Space Register - DTB_IASN[tpu]
16-23  DTB Invalidate Multiple Register - DTB_IM[tpu]
16-24  DTB Invalidate Single Register - DTB_IS[tpu]
16-25  DTB PTE Array Write Registers - DTB_PTE0[tpu], DTB_PTE1[tpu]
16-26  DTB Tag Array Write Registers - DTB_TAG0[tpu], DTB_TAG1[tpu]
16-27  Mbox Control Register - M_CTL[tpu]
16-28  Mbox Process Mode Register - M_MODE[tpu]
16-29  Mbox Process Context Register - M_PCTX[tpu]
16-30  Mbox Memory Management Status Register - M_STAT[tpu]
16-31  Quiesce Timeout Register - QUIESCE_TIMEOUT[tpu]
16-32  Virtual Address Register - VA[tpu]
16-33  Virtual Address Format Register - VA_FORM[tpu]
16-34  Watch Physical Address Register - WATCH_PHYS_ADDR[tpu]
16-35  Hardware Interrupt Clear Register - HW_INT_CLR[tpu]
16-36  DRAM Error Status 1
16-37  DRAM Error Status 2
16-38  DRAM Error Status 3
16-39  DRAM Error Control
16-40  DRAM Timing Control 1
16-41  DRAM Timing Control 2
16-42  DRAM Timing Control 3
16-43  DRAM Refresh Control
16-44  DRAM Calibration Control 1
16-45  DRAM Calibration Control 2
16-46  DRAM Timing Control 4
16-47  DRAM Refresh Row
16-48  DRAM Initialization Control
16-49  DIFT Control
16-50  DRAM Error Address
16-51  DIFT Timeout
16-52  DRAM Mapper Control
16-53  Interpretation of Row High
16-54  Zbox Performance Counter 0
16-55  Zbox Performance Counter 1
16-56  Zbox Performance Control
16-57  Zbox Sweep Directory Bits
16-58  Zbox Force-Error Address Register
16-59  Zbox DIFT Error Status Register
16-60  Zbox RAC Control Register
17-1  HW_LD/HW_ST Instruction Format
17-2  HW_MFPR Instruction Format
17-3  HW_MTPR Instruction Format
17-4  RET Instruction Fields
19-1  Captured Timeline for Each Profiled Instruction
20-1  Trace Bus Timing Relationships
20-2  Trace Bus Routing
20-3  Trigger Logic
21-1  Basic Tbox Contract
21-2  Tbox Global Block Diagram
21-3  Central Port Controller
21-4  TAP Controller State Machine
21-5  Tbox Reset Engine
21-6  Tbox Reset Engine State Diagram
21-7  SROM Engine State Diagram
A-1  Instruction Formats

Tables
2-1  Microarchitecture Major Sections Summary
2-2  Ibox Major Component Summary
2-3  Pbox Major Component Summary
2-4  Qbox Major Component Summary
2-5  Ebox Major Component Summary
2-6  Ebox Cluster Section Summary
2-7  Fbox Major Component Summary
2-8  Fbox Functional Unit Summary
2-9  Mbox Major Component Summary
2-10  Cbox Major Component Summary
2-11  Negative Integers to Alphabetics Conversion
2-12  Pipeline Stage Conversion Equations
2-13  Pipeline Stage Conversion
2-14  Instruction Execution Pipelines and Latency
2-15  Thread Synchronization Instructions
2-16  Short Vector SIMD Instructions
3-1  Ibox Major Sections
3-2  Ibox Main Pipeline
3-3  Icache Data Array Cache Block Contents
3-4  Icache Tag Array Predecode for Fetch Blocks
3-5  Fields in the Start/End Buffer
3-6  Fetch-Block Exit Conditions
3-7  PC1 Calculation
3-8  Conditions that Squash the Second Fetch Chunk
3-9  Hardware PC Calculation Components
3-10  Matrix Legend
3-11  NextPC 0 Calculation Matrix
3-12  Icache Mispredict Signalling
3-13  Superpage support in the Main ITB
3-14  Granularity Hint (GH) Mapping
3-15  IPRs that Affect the ITB
3-16  ITB Invalidate Operations
3-17  Predecode Bits Defined by the Ibox Instruction Fill Unit
3-18  Ibox Predecode Bit Summary
3-19  Fields in a Pre-Map Table Entry
3-20  Collapsed Fields Stored Into a Post-Map Table Entry at Map Time
3-21  Post-Map Table Entry Fields
3-22  Fields that are Available from Collapsing Buffer at Map Time
3-23  Fields in Post-Map Table Entry That are Created During Execute (E) and Kill Time (K)
3-24  Exception Types and Restart Address
3-25  Creating Slot-Based Predictor States From Mapped Information in the Post-Map Table
3-26  Restoring Predictor States on a Restart
4-1  Pbox Components
4-2  INum Age Relationship
4-3  Predecode Value Meaning for I%MAP_INST_14A_H[7:0]<35:32>
5-1  Qbox Component Summary
6-1  Ebox Major Component Summary
6-2  Interbox Timing Relationships
6-3  Integer Cluster Sections
6-4  Instructions Serviced by the Ebox Addr Unit
6-5  Instructions Serviced by the Ebox Shifter Unit
6-6  Instructions Serviced by the Ebox Logic Box Unit
6-7  Instructions Serviced by the Ebox Virtual Address Generator Unit
6-8  Instructions Serviced by the Ebox Load Data Interface Unit
6-9  Instructions Serviced by the Ebox Multimedia Interface Unit
6-10  Instructions Serviced by the Ebox Store Data Interface Unit
6-11  Ebox Register Cache Single-Cycle Result Flow
6-12  Ebox Register Cache Multi-Cycle Result Flow
6-13  Ebox Cycle Timing of Operand Control Information
6-14  Ebox Multimedia Unit Min/Max Instruction Byte Reshuffling
6-15  Instruction Information From the Qbox to the Ebox
6-16  Exceptions Reported by the Ebox
6-17  Ebox Reserved Opcode Exceptions
6-18  Ebox/Fbox/Mbox Data Conversion Matrix
7-1  Register File Read Timing
7-2  Register File Write/Read Timing
8-1  Fbox Pipeline Functional Units, Instructions, and Latencies
8-2  Operation of a Single Fbox Pipe - all Operands From Register File
8-3  Timing for Load Data
8-4  Pipeline Stages of Fbox Register Cache
8-5  FDIV_SP (9 cycles), FDIV_DP (14 cycles)
8-6  FSQRT_SP (12 cycles), FSQRT_DP (28 cycles)
8-7  Arithmetic Exceptions
8-8  Fbox Exception Signaling Timing
8-9  FPCR Update/Floating-Point Arithmetic Trap Legend
8-10  Fbox Retire-Time Exception (RTE) Encodings
8-11  Floating-Point Control Register Format
8-12  Exponent Difference Estimation
8-13  Filing of Extension Word for F_AP2 Instructions
8-14  Arithmetic Instruction Explicit Dynamic Rounding Bits
8-15  FPCR Dynamic Rounding Bits
8-16  Maskable Exceptions
8-17  F_DIV Timing Sequence
8-18  Paired SP Floating-point Operate Instruction Format
8-19  Paired Single-Precision
8-20  Paired Single-Precision Instructions
8-21  Fl1/Fl2 Shifter Operand/Control Selection
8-22  Fraction Data Path
8-23  Operand Data Fraction and Exponent Data Paths
8-24  Equations of Sticky Bit Calculation
9-1  Mbox Major Components
9-2  Memory Operation (Launch)
9-3  HW_MTPR TB Invalidate, TAG or PTE Issue
9-4  HW_MTPR TB Invalidate or PTE Retire
9-5  HW_MTPR TB PTE Retire Bubble
9-6  HW_MTPR TB Invalidate Retire Bubble
9-7  Granularity Hint Encoding
9-8  Trap Summary
9-9  Dcache Front-End Tag Timing
11-1  Cbox Pipeline Stages
11-2  MAF Pipeline Timing Diagram
11-3  Scache Tag Array Bank Conflicts
11-4  Contents of Each MAF Entry
11-5  PRQ Contents for Each Entry
11-6  VAF Commands
11-7  VAF Contents For Each Entry
11-8  Main Victim Flow for Each Cbox Pipeline Stage
11-9  System Interface Section Response FIFO Entry Fields
11-10  System Interface Section Response FIFO Entry Fields
11-11  Scache Tag Array Pipeline Stages
11-12  Scache Tag State Transition Table
11-13  Stale Fill Table (SFT)
11-14  Scache Least Recently Used (LRU) State Bits
11-15  Scache TAG Syndrome Bits
11-16  Scache Control Pipeline Diagram
11-17  Resource and Order Conflicts
11-18  Scache Control Pipeline Stages
11-19  Required Resource
11-20  Scache Bank Conflict Timing
11-21  Miss Request Command Summary
11-22  Victim Command Summary
11-23  I/O Request Packet Format
11-24  Scache Block State
11-25  Scache Tag Request Command
11-26  Scache Access Order to the Same Cache Block
12-1  Comparison Between 21364 and 21464 Cache Coherence Protocols
12-2  MAF Coherence State Bits
12-3  Forwards hit MAF (Full Address Match)
12-4  Response Hit MAF (MAF Index)
12-5  Miss Requests from Mbox
12-6  Forwards From (Remote) Directory
12-7  Responses (Fills) from System
12-8  VAF Hit
12-9  Directory State Request Responses
12-10  System Command Opcodes
12-11  Location of Useful Data for Fully-Merged WrQW's and WrIPR's
12-12  Location of Useful Data for Fully-Merged WrLW's
12-13  Location of Useful Data for Quadword Specified by QWADD(5,3) of a WrByte
12-14  Location of Useful Data in a BlkIO in Response to a Fully-Merged RdQW or RdIPR
12-15  Location of Useful Data in Response to Fully-Merged RdLW's
12-16  Location of Useful Data in Quadword Specified by QWADD(5,3) of a BlkIO Packet
12-17  ALERT Wire Allocation
13-1  Messages on the IO_CHANNEL
13-2  Messages on the REQUEST_CHANNEL
13-3  Messages on the FORWARD_CHANNEL
13-4  Messages on the RESPONSE_CHANNEL
13-5  Messages on a SPECIAL_CHANNEL
13-6  Route Information Bits
13-7  Dealloc 3-Bit Variable-Length Encoding (IPs)
13-8  Buffer Message Formats
13-9  Dealloc 3-Bit Encoding (I/O port)
13-10  I/O Port Buffer Size and Number
13-11  Zport Buffer Message Format
13-12  Cport Buffer Message Format
13-13  Packet Formats
13-14  IO_CHANNEL Formats (3 Ticks)
13-15  REQUEST_CHANNEL Format
13-16  FORWARD_CHANNEL Format
13-17  RESPONSE_CHANNEL Formats
13-18  SPECIAL_CHANNEL Formats
13-19  INPUT I/O PORT HEADER TICK Formats
13-20  ROUTE FIELD Format
13-21  Interrupt Level Sources
13-22  Router IO_CHANNEL Point-to-Point Rules
15-1  GIO Port Signals
15-2  GIO_CNFG Register Field Descriptions
15-3  GIO_ADDR Register Fields Description
15-4  GIO_DATA Register Fields Description
15-5  GIO Address Space Registers Defined by Marvel
16-1  Internal Processor Register Summary
16-2  IPR Initialization Classification
16-3  IPR Reserved Field Type Definitions
16-4  Cycle Counter Register Fields Description
DTB Miss Return Address Register Field Descriptions ............................ . 11-35 11-37 11-39 11-40 11-42 11-44 11-50 11-57 11-60 11-61 12-5 12-10 12-11 12-12 12-14 12-15 12-16 12-18 12-18 12-20 12-22 12-23 12-23 12-28 12-28 12-29 12-35 13-2 13-3 13-3 13-4 13-5 13-6 13-7 13-8 13-9 13-9 13-9 13-10 13-11 13-12 13-13 13-13 13-14 13-15 13-15 13-16 13-19 13-22 15-1 15-3 15-3 15-4 15-4 16-1 16-6 16-6 16-7 16-8 Compaq Confidential 5 January 2001 ··· Subject To Change xxvii 16-6 16-7 16-8 16-9 16-10 16-11 16-12 16-13 16-14 16-15 16-16 16-17 16-18 16-19 16-20 16-21 16-22 16-23 16-24 16-25 16-26 16-27 16-28 16-29 16-30 16-31 16-32 16-33 16-34 16-35 16-36 16-37 16-38 16-39 16-40 16-41 16-42 16-43 16-44 16-45 16-46 16-47 16-48 16-49 16-50 16-51 16-52 16-53 16-54 16-55 16-56 16-57 16-58 16-59 16-60 16-61 16-62 16-63 16-64 Exception Address Register Field Descriptions .................................. . Exception Summary Register Field Descriptions ................................ . CPU Configuration Register Fields Description .................................. . lbox TPU Configuration Register Field Descriptions .............................. . lbox Control Register Field Descriptions ....................................... . lbox Process Mode Register Fields Description ................................. . lbox Process Context Register Field Descriptions ............................... . lcache Status Register Fields Descriptions ..................................... . lcache Flush Register Fields Description ...................................... . lcache Flush (ASM = O) Register Fields Description .............................. . ITB Invalidate Multiple Register Fields Descriptions .............................. . ITB Invalidate Single Register Fields Description ................................ . Instruction PTE Array Write Register Field Descriptions. . . . . . . . . . . . . . . . . . . . . . . . . . . . Instruction Tag Array Write Register Fields Description . . . . . . . . . . . . . . . . . . . . . . . . . . . . Instruction VA Format Register (43-Bit VA) Fields Description....................... Instruction VA Format Register (52-Bit VA, REDUCED-PT=0) Fields Description.. . . . . . . Instruction VA Format Register (52-Bit VA, REDUCED-PT =1) Fields Description . . . . . . . . PALcode Base Address Entry Points and Offsets................................. PALcode Base Address Register Fields Description. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Dcache Control Register Field Descriptions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Dcache Status Register Field Descriptions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . OTB Invalidate Multiple Register Fields Description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . OTB Invalidate Single Register Fields Description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . DTB_PTE Array Write Registers Fields Descriptions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . OTB Tag Array Write Registers Fields Description. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Mbox Control Register Fields Description ...................................... . Mbox Process Mode Register Field Descriptions ................................ . Mbox Process Context Register Field Descriptions .............................. . Mbox Memory Management Status Register Field Descriptions..................... . 
Quiesce Timeout Register Field Descriptions ................................... . Instruction VA Format Register (43-Bit VA) Fields Description. . . . . . . . . . . . . . . . . . . . . . . Instruction VA Format Register (52-Bit VA, REDUCED-PT =0) Fields Description . . . . . . . . Instruction VA Format Register (52-Bit VA, REDUCED-PT=1) Fields Description . . . . . . . . Watch Physical Address Register Fields Description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Hardware Interrupt Clear Register Fields Description. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Router-Configuration1 Register Fields Description. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Router-Configuration2 Register Fields Description. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Router-{N,S,E,W}-Configuration1 Register Fields Description . . . . . . . . . . . . . . . . . . . . . . . Router Channel {N,S,E,W} Configuration2 Register Fields Description. . . . . . . . . . . . . . . . Router {N,S,E,W} Timer1 Configuration Register Fields Description . . . . . . . . . . . . . . . . . . Router {N,S,E,W} Timer2 Configuration Register Fields Description . . . . . . . . . . . . . . . . . . Router {N,S,E,W} Error Status Register Fields Description ........................ . Router {N,S,E,W} Performance Counter Register Fields Description ................. . Router 1/0-Port Configuration Register Fields Description ......................... . Router 1/0-Port Configuration 2 Register Field Description ......................... . Router 1/0-Port Buffer Size Register Fields Description ........................... . Routerl/0-Port Timer1 Configuration Register Fields Description .................... . Router 1/0-Port Timer2 Configuration Register Fields Description ................... . Router 1/0-Port Error Status Register Fields Description .......................... . Router 1/0-Port Performance Counter Register Fields Description ................... . Router 1/0-Port Error Status Register Fields Description .......................... . Router Routing Table Register Fields Description ............................... . WhoAml Register Fields Description .......................................... . Router Overall-Timer-Control Register Fields Description ......................... . Router Overall-Timer-Control Register Fields Description ......................... . DRAM Error Status 1 Fields Description ....................................... . DRAM Error Status 2 Fields Description ....................................... . DRAM Error Status 3 Register Fields Description ................................ . DRAM Error Control Register Fields Description ................................ . 
16-9 16-10 16-11 16-12 16-13 16-15 16-16 16-17 16-17 16-18 16-18 16-19 16-20 16-21 16-21 16-22 16-22 16-22 16-23 16-24 16-25 16-27 16-27 16-28 16-30 16-30 16-32 16-33 16-34 16-35 16-36 16-36 16-37 16-37 16-38 16-38 16-39 16-41 16-42 16-43 16-43 16-44 16-45 16-45 16-47 16-47 16-48 16-48 16-48 16-49 16-50 16-50 16-51 16-51 16-52 16-53 16-54 16-56 16-57 Compaq Confidential xxviii 5 Jam1c1ry 2001 ···Subject To CfJange 16-65 16-66 16-67 16-68 16-69 16-70 16-71 16-72 16-73 16-74 16-75 16-76 16-77 16-78 16-79 16-80 16-81 16-82 16-83 16-84 17-1 17-2 17-3 17-4 17-5 17-6 19-1 19-2 19-3 19-4 19-5 19-6 19-7 19-8 19-9 19-10 19-11 19-12 19-13 19-14 19-15 19-16 19-17 19-18 19-19 21-1 21-2 21-3 21-4 21-5 22-1 22-2 22-3 22-4 22-5 23-1 A-1 A-2 A-3 DRAM Timing Control 1 Fields Description ..................................... . DRAM Timing Control 2 Fields Description .................................... . DRAM Timing Control 3 Fields Description ..................................... . DRAM Refresh Control Fields Description ..................................... . DRAM Calibration Control 1 Fields Description .................................. . DRAM Calibration Control 2 Fields Description .................................. . DRAM Timing Control 4 Fields Description ..................................... . DRAM Refresh Row Fields Description ........................................ . DRAM Initialization Control Fields Description .................................. . PID Control Fields Description .............................................. . DRAM Error Address Fields Description ....................................... . DIFT Timeout Fields Description ............................................. . DRAM Mapper Control Fields Description ...................................... . Zbox Performance Counter O Fields Description ................................. . Zbox Performance Counter 1 Fields Description ................................. . Zbox Performance Control Fields Description ................................... . Zbox Sweep Directory Bits Fields Description ................................... . Zbox Force-Error Address Fields Description ................................... . Zbox DIFT Error Status Fields Description ..................................... . Zbox RAC Control Fields Description ......................................... . HW_LD/HW_ST Instruction Fields Description .................................. . HW_MFPR Fields Description ............................................... . MT_MTPR Instruction Fields Description ...................................... . GPR[1 :O] Encoding ....................................................... . RET Instruction Mode Transitions ............................................ . RET Instruction Fields Description ........................................... . Control IPRs for Instruction-Based Profiling .................................... . IAGG_EVENT and MAGG_EVENT IPRs ...................................... . Fields in the PRO_PC<63:0> and PR1_PC<63:0> ............................... . Fields in PR_l_INF0<63:0> ................................................ . Fields in PR_Q_INF0<63:0> ............................................... . Fields in PRO_MEM_INF0<63:0> and PR1_MEM_INF0<63:0> .................... . Fields in PRO_DMISS_INF0<63:0> and PR1_DMISS_INF0<63:0> ................. . PRn_TIMELINE IPRs ..................................................... . Fields in PR_ST_LATENCY<63:0> ........................................... . 
Aggregate Event Counter IPRs .............................................. . Fields in CBOX_PRF_CTL<31 :0> ............................................ . Fields in CBOX_PRF_ADR<63:0> ........................................... . Fields in CBOX_PRF_STS<25:0> ........................................... . Fields in CBOX_PRF_CNT <31 :0> ........................................... . Fields in ZBOXn_ZPM_CTR0<31:0> ......................................... . Fields in ZBOXn_ZPM_CTL 1<31 :0>.......................................... . Fields in ZB0Xn_ZPM_CTL<31 :0> ........................................... . Fields in RBOX_n_PERF<27:0> ............................................. . Fields in RBOX_IO_PERF<27:0> ............................................ . Array Test Command Broadcast Bus ......................................... . Simple BiSt Command Bus ................................................. . Observability Register Command Bus ........................................ . Dedicated Test Port Pins ................................................... . Shared Test Pins ........................................................ . Key to Table 22-2, "Summary of Disruption High-Level Features' ................... . Summary of Disruption High-Level Features .................................... . Disruption PALcode Entry Points ............................................ . Key to Table 22-5, "Summary of Disruption Low-Level Features' ................... . Summary of Disruption Low-Level Features .................................... . Signal Pad Requirements .................................................. . Opcode Groups .......................................................... . Predecode Logic Groups ................................................... . Opcode 10 Instruction Decoding ............................................. . 16-59 16-62 16-64 16-67 16-69 16-70 16-71 16-72 16-73 16-75 16-76 16-76 16-78 16-84 16-85 16-85 16-88 16-89 16-91 16-92 17-1 17-3 17-4 17-5 17-6 17-6 19-3 19-5 19-6 19-7 19-10 19-11 19-11 19-13 19-15 19-17 19-17 19-18 19-18 19-19 19-19 19-20 19-20 19-23 19-23 21-3 21-4 21-4 21-4 21-5 22-3 22-3 22-7 22-8 22-8 23-1 A-1 A-3 A-6 Compaq Confidentia I 5 January 2001 ··· Subject To Change xx ix A-4 A-5 A-6 A-7 A-8 A-9 A-10 A-11 A-12 A-13 A-14 A-15 A-16 A-17 B-1 B-2 C-1 C-2 C-3 Opcode 10 Specific Logic Functions Within the Integer Adder ...................... . Opcode 11 Instruction Decoding ............................................. . Opcode 12 Instruction Decoding ............................................. . Opcode 13 Instruction Decoding ............................................. . Opcode 13 Specific Logic Functions Within the Integer Adder ...................... . Opcode 14 Instruction Decoding ............................................. . Opcode 15 Instruction Decoding ............................................. . Opcode 16 Instruction Decoding ............................................. . Opcode 17 Instruction Decoding ............................................. . Opcode 18 Instruction Decoding ............................................. . Load and Store Instruction Decoding ......................................... . Opcode 1C Instruction Decoding ............................................ . Branch and Jump Instruction Decoding ....................................... . PALcode Instruction Decoding .............................................. . SMT AMASK Instruction Bit ................................................ . 
Proposed LDx_ARM/QUIESCE Opcode Assignments
I_CTL Field Definitions
M_CTL Field Definitions
Valid and Invalid PAGE_SIZE, VA_SIZE, and REDUCED_PAGE_TABLE Combinations

Preface

Audience

This specification is for system designers and programmers who are involved in the Alpha 21464 microprocessor engineering project.

Organization

This specification contains the following chapters. A top-level presentation of the main topics in these chapters is presented in Chapter 2.

Chapter 1, Introduction, which describes the terminology and conventions that are used in this specification.
Chapter 2, Architecture Overview, which summarizes the 21464 new features and design organization.
Chapter 3, Instruction Fetch Unit - the Ibox, which describes the first part of the instruction unit microarchitecture.
Chapter 4, Dependency Mapper Unit - the Pbox, which describes the second part of the instruction unit microarchitecture.
Chapter 5, Instruction Issue and Retire Unit - the Qbox, which describes the third part of the instruction unit microarchitecture.
Chapter 6, Integer Execution Unit - the Ebox, which describes how integer instructions are executed.
Chapter 7, Register File, which describes the creation and management of the virtual and physical registers in that file.
Chapter 8, Floating-Point Execution Units - the Fbox, which describes how floating-point instructions are executed.
Chapter 9, Memory Instruction Execution Unit - the Mbox, which describes how memory-reference instructions are executed.
Chapter 10, Internal Ring Bus, which describes the bus that connects the Cbox, Rbox, and Zbox.
Chapter 11, Second-Level Cache and Controller (Cbox), which describes how the second-level cache is controlled.
Chapter 12, Cache Coherence Protocol Processing, which describes how caches in a multiprocessor system maintain their coherency.
Chapter 13, Router Interface - the Rbox, which describes the interprocessor switch.
Chapter 14, Rambus Interface - the Zbox, which describes that interface.
Chapter 15, Miscellaneous Interfaces, which describes the GIO Port.
Chapter 16, Internal Processor Registers, which describes those registers.
Chapter 17, Privileged Architecture Library Code, which describes the interface between the microarchitecture and the PALcode environment.
Chapter 18, Initialization and Configuration, which describes the sequences that are used in the initialization and configuration of the microprocessor, along with their characteristics.
Chapter 19, Performance Monitoring, which describes the means available for monitoring the performance of the 21464.
Chapter 20, Hardware Debug Features, which describes the physical capabilities that have been placed in the 21464 to aid debugging.
Chapter 21, Testability and Diagnostics, which describes the capabilities that have been placed in the 21464 to aid in testing and performing diagnostics.
Chapter 22, Error Detection and Error Handling, which describes the various error detection mechanisms that have been placed in the 21464 and the corresponding recovery procedures.
Chapter 23, Hardware Interface, which describes the 21464 at the level of its interface pins.
Chapter 24, New Instructions, which describes instructions that are new for the 21464.
Chapter 25, System Configurations, which describes considerations for configuring systems.
Chapter 26, Physical Addressing and Input/Output, which describes physical addressing and input/output considerations.
Chapter 27, Requirements to Support "Tandem", which describes those parts of the design that are significant to Tandem machines that will use the 21464.
A Glossary, which provides definitions of terms used in this specification whose meanings can be specific to it.
An Index, which provides the appropriate references into the specification.

Related Documentation

The following documents are included by reference in this specification:

• The Alpha System Reference Manual (the SRM), Version 7
• The ALPHA_SRM notesfile, which includes an ongoing discussion of topics related to this design

To obtain an SRM and access to the ALPHA_SRM notesfile, send mail to Audrey.Reith@Compaq.com.

The following documents are referenced in this specification. These documents can provide historical context, supporting information, additional information, or be of general interest to those using this specification. These documents are available in the same general directory as the specification and can be viewed in your browser.

1 Introduction

1.1 Terminology and Conventions

This section defines the abbreviations, terminology, and other conventions used throughout this document.

Abbreviations

• Binary Multiples

The abbreviations K, M, and G (kilo, mega, and giga) represent binary multiples and have the following values:

K = 2^10 (1,024)
M = 2^20 (1,048,576)
G = 2^30 (1,073,741,824)

For example:

2KB = 2 kilobytes = 2 x 2^10 bytes
4MB = 4 megabytes = 4 x 2^20 bytes
8GB = 8 gigabytes = 8 x 2^30 bytes
2K pixels = 2 kilopixels = 2 x 2^10 pixels
4M pixels = 4 megapixels = 4 x 2^20 pixels

• Register Access

The abbreviations used to indicate the type of access to register fields and bits have the following definitions:

IGN   Ignore. Bits and fields specified are ignored on writes.
MBZ   Must Be Zero. Software must never place a nonzero value in bits and fields specified as MBZ. A nonzero read produces an Illegal Operand exception. Also, MBZ fields are reserved for future use.
RAZ   Read As Zero. Bits and fields return a zero when read.
RC    Read Clears. Bits and fields are cleared when read. Unless otherwise specified, such bits cannot be written.
RES   Reserved. Bits and fields are reserved by Compaq and should not be used; however, zeros can be written to reserved fields that cannot be masked.
RO    Read Only. The value may be read by software. It is written by hardware. Software write operations are ignored.
RO,n  Read Only, and takes the value n at power-on reset. The value may be read by software. It is written by hardware. Software write operations are ignored.
RW    Read/Write. Bits and fields can be read and written.
RW,n  Read/Write, and takes the value n at power-on reset. Bits and fields can be read and written.
W1C   Write One to Clear. If read operations are allowed to the register, then the value may be read by software. If it is a write-only register, then a read operation by software returns an UNPREDICTABLE result. Software write operations of a 1 cause the bit to be cleared by hardware. Software write operations of a 0 do not modify the state of the bit.
W1S   Write One to Set. If read operations are allowed to the register, then the value may be read by software. If it is a write-only register, then a read operation by software returns an UNPREDICTABLE result. Software write operations of a 1 cause the bit to be set by hardware. Software write operations of a 0 do not modify the state of the bit.
WO    Write Only. Bits and fields can be written but not read.
WO,n  Write Only, and takes the value n at power-on reset. Bits and fields can be written but not read.

• Sign extension

SEXT(x) means x is sign-extended to the required size.

Addresses

Unless otherwise noted, all addresses and offsets are hexadecimal.

Aligned and Unaligned

The terms aligned and naturally aligned are interchangeable and refer to data objects that are powers of two in size. An aligned datum of size 2^n is stored in memory at a byte address that is a multiple of 2^n; that is, one that has n low-order zeros. For example, an aligned 64-byte stack frame has a memory address that is a multiple of 64. A datum of size 2^n is unaligned if it is stored at a byte address that is not a multiple of 2^n.

Bit Notation

Multiple-bit fields can include contiguous and noncontiguous bits contained in square brackets ([]). Multiple contiguous bits are indicated by a pair of numbers separated by a colon [:]. For example, [9:7,5,2:0] specifies bits 9, 8, 7, 5, 2, 1, and 0. Similarly, single bits are frequently indicated with square brackets. For example, [27] specifies bit 27. See also Field Notation.

Caution

Cautions indicate potential damage to equipment or loss of data.

Data Units

The following data unit terminology is used throughout this manual.

Term       Words   Bytes   Bits   Other
Byte       1/2     1       8      -
Word       1       2       16     -
Longword   2       4       32     Dword
Quadword   4       8       64     2 longwords

Do Not Care (X)

A capital X represents any valid value.

External

Unless otherwise stated, external means not contained in the chip.

Field Notation

The names of single-bit and multiple-bit fields can be used rather than the actual bit numbers (see Bit Notation). When the field name is used, it is contained in square brackets ([]). For example, RegisterName[LowByte] specifies RegisterName[7:0].

Note

Notes emphasize particularly important information.

Numbering

All numbers are decimal or hexadecimal unless otherwise indicated. The prefix 0x indicates a hexadecimal number. For example, 19 is decimal, but 0x19 and 0x19A are hexadecimal (also see Addresses). Otherwise, the base is indicated by a subscript; for example, 100₂ is a binary number.

Ranges and Extents

Ranges are specified by a pair of numbers separated by two periods (..) and are inclusive. For example, a range of integers 0..4 includes the integers 0, 1, 2, 3, and 4. Extents are specified by a pair of numbers in square brackets ([]) separated by a colon (:) and are inclusive. Bit fields are often specified as extents. For example, bits [7:3] specifies bits 7, 6, 5, 4, and 3.

Register Figures

The gray areas in register figures indicate reserved or unused bits and fields.
Bit ranges that are coupled with the field name specify the bits of the named field that are included in the register. The bit range may, but need not necessarily, correspond to the bit Extent in the register. Signal Names The following examples describe signal-name conventions used in this document. AlphaSignal[n:n] Boldface, mixed-case type denotes signal names that are assigned internal and external to the 21464 (that is, the signal traverses a chip interface pin). AlphaSignal_x[n:n] When a signal has high and low assertion states, a lowercase italic x represents the assertion states. For example, Signa1Name_x[3:0] represents Signa1Name_H[3:0] and Signa1Name_L[3:0]. UNDEFINED Operations specified as UNDEFINED may vary from moment to moment, implementation to implementation, and instruction to instruction within implementations. The operation may vary in effect from nothing to stopping system operation. UNDEFINED operations may halt the processor or cause it to lose information. However, UNDEFINED operations must not cause the processor to hang, that is, reach an unhalted state from which there is no transition to a normal state in which the machine executes instructions. UNPREDICTABLE UNPREDICTABLE results or occurrences do not disrupt the basic operation of the processor; it continues to execute instructions in its normal manner. Further: • Results or occurrences specified as UNPREDICTABLE may vary from moment to moment, implementation to implementation, and instruction to instruction within implementations. Software can never depend on results specified as UNPREDICTABLE. • An UNPREDICTABLE result may acquire an arbitrary value subject to a few constraints. Such a result may be an arbitrary function of the input operands or of any state information that is accessible to the process in its current access mode. UNPREDICTABLE results may be unchanged from their previous values. Operations that produce UNPREDICTABLE results may also produce exceptions. • An occurrence specified as UNPREDICTABLE may happen or not based on an arbitrary choice function. The choice function is subject to the same constraints as are UNPREDICTABLE results and, in particular, must not constitute a security hole. Specifically, UNPREDICTABLE results must not depend upon, or be a function of, the contents of memory locations or registers that are inaccessible to the current process in the current access mode. Compaq Confidential 1-4 Introduction 5 January 2001 ···Subject To Change Terminology and Conventions Also, operations that may produce UNPREDICTABLE results must not: Write or modify the contents of memory locations or registers to which the current process in the current access mode does not have access, or Halt or hang the system or any of its components. For example, a security hole would exist if some UNPREDICTABLE result depended on the value of a register in another process, on the contents of processor temporary registers left behind by some previously running process, or on a sequence of actions of different processes. x Do not care. A capital X represents any valid value. Compaq Confidential 5 January 2001 ··· Subject To Change Introduction 1-5 New Features 2 Architecture Overview This chapter presents an overview of the major parts of the 21464 microarchitecture. 
• The new features of the 21464 • Microarchitecture diagram, a high-level view of the overall architecture • Simultaneous multithreading (SMT), the essential new performance element of the 21464 design • • • • • • • • • • • • Instruction unit, composed of the Ibox, Pbox, and Qbox Execution unit, composed of the Register File, Ebox, and Fbox Memory controller unit, composed of the Mbox External interface, composed of the Cbox, Rbox, and Zbox Pipeline organization Instruction Execution Pipelines and Latency Instruction issue and retire rules New Instructions Implementation-specific execution of the CMOV and FCMOV instructions Interrupt handling AMASK and IMPLVER instruction values Performance monitoring features 2.1 New Features The 21464 can be summarized as follows. 2.1.1 Processor Features The processor has the following characteristics: • Instruction issue and execute out of order Dynamic four-way simultaneous multi-threading (SMT) Up to eight instructions mapped, issued, executed, and retired per cycle, from the following menu: Up to eight integer operations, including branches Com p.aq Confidentia I 5 January 2001 ··· Subject To Change Architecture Overview 2-1 New Features Up to four floating-point operations Up to four memory references Up to four multimedia operations • Latency is one cycle for most integer operations and three cycles for loads and most floating-point operations • • Store-sets memory dependence predictor - for predicting store-load dependencies Fetches up to 16 instructions for each cycle • Collapsing instruction buffer for merging basic blocks • Up to 256 instructions in flight • • 128 entry instruction queue 1.4 GHz clock rate, resulting in a 700 psec cycle • • Peak instruction rate exceeds 11 billion instructions per second (gigaops) • Unified register file (integer and floating point) New SIMD instructions for video, graphics, and signal-processing applications 512 quadword capacity 16 read ports 8 write ports • Instruction L1 cache (lcache) 64 KB capacity, 2-way pseudo-set associative 64 byte (16-instruction) block size 8 instructions per cycle from each of two addresses Bandwidth is 90 GB/sec Parity protected • Data L1 cache (Dcache) 64 KB capacity Two-way set associative 64 byte (8-quadword) block size 8 bytes per cycle read from each of three addresses - Write-through concurrent with reads, subject to bank conflict - Function unit bandwidth: 32 bytes per cycle, resulting in 45 GB/sec Hit latency is three cycles Fill/write bandwidth is 64 bytes per cycle, resulting in 90 GB/sec Parity protected • Onchip L2 cache (Scache) 3 MB capacity Compaq Confidential 2-2 Architecture Overview 5 Jc1nuc1ry 2001 -· Subject To Change New Features Six-way set-associative 64 byte block size Write-back DC/IC fill bandwidth: 64 bytes per cycle, resulting in 90 GB/sec Best-case hit latency is 10 cycles DC write-through bandwidth is 16 bytes per cycle, resulting in 22 GB/sec Peak Scache fill rate is 32 bytes per cycle, resulting in 45 GB/sec ECC protected by quadwords • 52-bit virtual address, 48-bit physical address, 8-bit ASN • SK and 64K page sizes, granularity hint for bigger contiguous regions • Independent 128-entry fully-associative ITB and DTB, with superpages for kernel maps 2.1.2 Memory Features The memory has the following characteristics: • • Glueless interface to Rambus main memory • • • • • • • Each port consists of four channels Two independent interleavable RDRAM ports per processor, with optional fourway processor striping All transactions in units of 64-byte 
(512-bit) blocks Optional redundant fifth channel protects against full-chip failure Each channel supports up to 32 RDRAM chips Each processor can support up to 256 RDRAM chips (plus redundancy) With 1 GB parts, 32 GB per processor (35 address bits) Peak processor memory bandwidth is 200M blocks per second, or 12.8 GB per second. • With redundant channel deployed, system tolerates total failure of a memory chip plus single-bit errors in another chip. • Without redundant channel, system corrects single-bit errors in memory and detects double-bit errors. 2.1.3 Multiprocessor Features The 21464 provides the following support for multiprocessor configurations: • Up to 512 processors with main memory and coherent caches • Fully-distributed, non-blocking, directory-based CC-NUMA coherence protocol • Optional I/O node per processor, may have cache and/or I/O memory but not cacheable main memory • Glueless torus configuration - others possible with switch ASIC's Compaq Confidential 5 January 2001 ···Subject To Change Architecture Overview 2-3 Microarchitecture Diagram • • • • Maximum total physical memory is 244 bytes = 16 Terabytes . Peak instruction rate exceeds 5.7 trillion instructions per second (teraops) . Buffered crossbar switch fabric with virtual circuits In 21364 mode, each network port supports 3.2 GB per second in and out Four-port network throughput is 12.8 GB per second • In 21464 mode, each port supports 4.8 GB per second in and out. Five-port network throughput is 24 GB per second • Bisection bandwidth of a l 6x32 torus, cut across the narrow axis, is more than 300 GB per second 2.2 Microarchitecture Diagram Figure 2-1 shows a simplified block diagram of the 21464 microarchitecture. As listed in Table 2-1, the microarchitecture of the 21464 is separated into four major sections or units, each of which contain one or more functional subsections, called boxes. Table 2-1 Microarchitecture Major Sections Summary Major Section Subsection Description Instruction Unit Ibox Instruction fetch unit Pbox Instruction processing (dependency resolution) unit Qbox Instruction issue and retire unit Execution Unit Register file Ebox Integer instruction execution unit Fbox Floating-point instruction execution unit Memory Controller Unit Mbox Memory-reference instruction execution unit External Interface Unit Cbox Second-level cache (Scache) controller Rbox Router controller Zbox Rambus memory controller Compaq Confidentia I 2-4 Architecture Overview 5 J,1nu,1ry 2001 -· Subject To Cf1ange Simultaneous Multithreading {SMT) Figure 2-1 21464 Block Diagram >< 0 Retie/Kl Uni m a. >< ~ canp1e1ai Uni a Register File (512 entry) Process in >< 0 m 0 l/O >< ~ a: Rambus RI~ RanbLS RIMVI N S E W 2.3 Simultaneous Multithreading (SMT) SMT differs from the more traditional forms of hardware multithreading in that every thread can compete for issue slots at every cycle. Traditional multithreading designs tend to invoke alternative threads only on second-level cache misses or to schedule the threads in a rigid, round-robin fashion. The result is less resource utilization and less performance improvement. Compaq Confidential 5 January 2001 ·- Subject To Change Architecture Overview 2-5 Simultaneous Multithreading {SMT) The 21464 can execute up to four programs simultaneously, each program running in one of the four thread processing units (TPUs). While each TPU has some dedicated hardware, most resources are shared between the four TPU s. 
Maximum single-stream performance occurs when a single program is the only active thread in the CPU. In that case, most chip resources are available to the active TPU (the single program) and the design makes no compromises in single-stream performance. On the other hand, because many programs cannot always use all the chip resources, it is often possible to at least double overall throughput by running four programs simultaneously. SMT adds very little cost to a single processor and can be used either to increase throughput while executing independent programs or to speed up a single task that has been decomposed into separate threads. The 21464 adds two instructions, LDx_ARM and QUIESCE, to provide easy synchronization between cooperating threads. LDx_ARM sets up a memory address register that monitors memory traffic. QUIESCE suspends a thread until the memory location is written or a time-out counter expires; the thread does not consume any resources while waiting for the signal to continue. See Appendix B for information on the LDx_L and QUIESCE instructions. Each TPU has its own dedicated program state that consists of 32 integer registers, 32 floating-point registers, a PC, and internal processor registers (IPRs). Also, some microarchitectural structures, such as the Return Stack and the Instruction Buffer, are statically divided into four parts, with each part dedicated to a TPU. However, most microarchitecture structures are dynamically shared among the TPUs on an as-needed basis. Dynamically shared structures include the caches, the translation buffers, the branch predictor, functional execution units, and the instruction queue. Overview of SMT Operation Most thread-specific operations take place in the Ibox, the front end of the CPU pipeline. The Ibox is time-multiplexed on a cycle-by-cycle basis between four active threads, each being given equal priority. The Ibox thread fetch chooser normally fetches instructions from that thread with the fewest instructions in the instruction queue. This policy helps programs with high ILP go as fast as they can, yet provides instructions to programs with low ILP as those instructions are needed. The Ibox accesses the line predictor after the thread fetch chooser selects a thread from which to fetch an instruction. At the next cycle, the resulting indexes are used to access the Icache and the branch predictor. The two fetch chunks are stored in the instruction buffer. The thread map chooser selects a thread and reads its two oldest fetch chunks from the instruction buffer and collapses them into a single map chunk, which is sent to the mapper in the Pbox. The mapper maps the registers, assigns INums, and slots the instructions to the various pickers as the instructions are entered into the instruction queue. The INum space is divided into a four segments for the four threads (the four TPUs) and the INums are thread-specific because they keep track of program order. The instruction queue then contains a mix of instructions from the four threads. Once instructions are in the instruction queue, they are eligible to issue when their source operands are available, regardless of which thread they belong to. The oldest, issueready instruction is chosen by each picker and sent to the appropriate Ebox or Fbox execution unit. Because the threads have no register dependencies on each other, it is much more likely that eight instructions are continuously ready to issue. 
Compaq Confidential 2-6 Architecture Overview 5 Jc1n1.uiry 2001 ~-Subject To Change Instruction Unit 2.4 Instruction Unit The instruction unit consists of the Ibox, Pbox, and Qbox. The Ibox is the instruction fetch engine. It provides high instruction-stream bandwidth to the remainder of the chip. Specifically, the Ibox delivers instructions directly to the Pbox, which is responsible for instruction number (INum) resource management, dependence analysis and register renaming. From there, instructions proceed to the Qbox, where they await the resolution of their source register dependencies. Once an instruction's register dependencies have been resolved, the instruction is issued, provided that it wins arbitration for an appropriate functional unit in the Ebox (arithmetic and logic integer operations), Fbox (arithmetic floating point operations) or Mbox (memory operations). Once an instruction has completed execution, it retires when it is the oldest non-retired instruction in the machine for the appropriate thread processing unit (TPU) context. 2.4.1 Instruction Fetch Unit - the lbox Instruction stream bandwidth is one of the major factors in overall chip performance. A program cannot execute faster than the rate of its instructions entering the machine. Achieving sufficient instruction bandwidth for a machine that can execute up to eight instructions per cycle poses several challenges. In order to meet those challanges, the Ibox contains many new features that were not designed into prior Alpha implementations. Features The Ibox delivers up to eight instructions per cycle to the remainder of the machine. The Ibox maintains the correct program counter (PC) while the CPU executes programs and receives interrupts and exceptions to properly redirect the machine. The Ibox contains the following new features to support high-bandwidth instructionstream fetching, advanced control flow prediction, simultaneous multi threading (SMT), and memory dependence prediction: • • An !cache size of 64 KB or 16K instructions • • • • • A fetch TPU chooser that creates a resource-balanced SMT fetch engine Up to two potentially noncontiguous cache blocks are fetched per cycle Advanced branch prediction that predicts up to 16 branches per cycle History-based jump target prediction A collapsing buffer that facilitates over-fetching and merging fetch blocks Memory dependence prediction that uses store sets • • A simultaneous multithreaded fill unit • An anti-thrashing !cache fill policy Advanced hardware !stream prefetching Compaq Confide11tial 5 January 2001 ·-Subject To Change Architecture Overview 2-7 Instruction Unit The Ibox components can be grouped into the following major sections: Table 2-2 lbox Major Component Summary Name Description Checkpoint Unit The Checkpoint Unit maintains state for restarting the CPU in the event of an exception, and trains the control flow predictors and the memory dependence predictor. In the event of an exception, the Checkpoint Unit resets the PC, branch predictor, jump target predictor, and return stack to the state that existed just before the fetch of the instruction that caused an exception. Training information for the branch and jump target predictors is also kept and used to train the predictors at the retirement time of branch or jump instructions. Control Flow Prediction Unit The Control Flow Prediction Unit predicts PC changes at fetch-time for instructions that can change control flow when executed: conditional branches, computed jumps, and subroutine returns. 
There is a corresponding dedicated predictor for each: the conditional branch predictor, the jump target predictor, and the return address stack. Fill Unit The Fill Unit fetches instructions from lower-level memory and can fetch instruction blocks for multiple TPUs simultaneously. The Fill Unit maintains a dynamic hardware prefetcher that attempts to fill the Icache with blocks that would have missed in the future. The Fill Unit also contains the Icache Translation Buffer (ITB) that translates virtual PC miss addresses to physical addresses before making memory requests. Index Unit The Index Unit produces up to two indexes per cycle. The indexes are usually predictions from the Line Predictor that are used to access the Icache, Branch Predictor, and Store Sets Array. The Index Unit also contains the Fetch TPU Chooser that arbitrates among multiple TPUs that are ready to fetch instructions. The indexes that are produced will have an associated TPU that is sent along with the indexes down the Ibox pipeline. The Line Predictor itself consists of a sequential and non-sequential component, to address the sequential and non-sequential code sequences of the running programs. Instruction Processing Unit The Instruction Processing Unit stores and retrieves instructions and associated tags and data into its 64KB !cache and associated tag array. Instruction pre-decode bits are also stored in the Icache data and tag arrays to speed instruction processing in the Ibox and instruction format decoding in the Pbox. The Instruction Processing Unit also contains the Store Sets Array, which produces memory synchronization identifiers called store sets for potentially every load and store operation. The store sets instruct the Pbox to create explicit dependencies between certain loads and stores. The Instruction Processing Unit also contains the Collapsing Buffer, which stores instruction blocks that are driven by the Icache and collapses up to two instruction blocks per cycle to deliver up to 8 instructions per cycle to the Pbox. PC Unit The PC Unit maintaines the program counters for each TPU. Typically, the PC Unit calculates PCs based on the exiting instructions of the fetch blocks (such as branches, jumps, returns, fallthrough, and so forth), but it also can be reset by interrupts and exceptions. The PC Unit is also responsible for determining Icache misses, index mispredicts, and way mispredicts in the Ibox pipeline. 2.4.2 Dependency Mapper Unit - the Pbox The Pbox processes instructions that are fetched by the Ibox. The Pbox assigns INums (instruction numbers) to the instructions, analyzes the data dependencies between instructions, and maps their architectural source and destination values into physical registers. The Pbox also maintains data structures that allow recovery of all relevant processor state that corresponds to the architectural state of the machine prior to any unretired instruction. This allows the processor to perform rapid trap recovery in the presence of branch mispredicts or other exception conditions. The Pbox passes the renamed instructions to the Qbox for scheduling and dispatch. Compaq Confidentia I 2-8 Architecture Overview 5 Jc1nuc1ry 2001 --·Subject To Cfumge Instruction Unit The Pbox consists of the following components: Table 2-3 Pbox Major Component Summary Name Description Bid/Grant Exception Logic Chooses which of the pending kills from all TPUs should be broadcast to the rest of the chip. Instruction Decoder Decodes each of the eight instructions that arrive in a cycle. 
The decoder is placed early in the pipe to aid slotting decisions and to provide inputs to the load/store flow control mechanisms and to the IPR interlock mechanisms INum Allocator Allocates INums to new map blocks sent down by the Ibox. Also contains the Map Thread Chooser, which picks the next thread that will map instruction blocks and subsequently informs the Ibox INumMapper Maps source operand registers (VReg) into the INum of the last writer for the source operand Load/Store Serial Number Allocator Associates a sequential identifier with each load instruction (LUum) and a second identifier with each store instruction (SNum). These LNums and SNums prevent deadlock and manage flow control into the Mbox load and store queues Mapper Exception Logic When notified of an exception by the Bid/Grant Exception Logic, rolls the Inum Mapper, Physical Register May, Load/Store Serial Number, and RC/RS Interupt Flag Widget state back to the trap point Physical Register Map Allocates physical destination registers to each dispatched instruction. This table is also used to map virtual register operands into the corresponding physical registers Post-Map Skid Buffer Holds a silo of the last few map blocks that have passed through the Pbox forward path RC/RS Interrupt Flag Widget Maintains state necessary to implement the RC/RS instructions Retire/Kill Unit Communicates the identity of retired and/or killed instructions to all concerned boxes by way of the Retire/Kill bus Memory Queue Allocation Unit Governs the allocation and deallocation of load queue (LQ) and store queue (SQ) chunks to memory instructions. Also controls the High-Water Mark (HWM) that is sent to the Qbox to regulate the issuing of loads and stores. 2.4.3 Instruction Issue and Retire Unit - the Qbox The Qbox processes instructions that are renamed by the Pbox, and determines an appropriate schedule for those instructions. Instructions cannot be executed until they are "data ready", until their dependencies have been resolved. The Qbox can identify a data-ready instruction by checking to see that both of its parent entries have asserted their result-ready signals. This method is called a "decoded-space" dependence array. The Qbox attempts to choose the "best" 8 instructions to execute for each tic of the clock from a "window" of 128 candidates that are received from the Pbox. Each of the eight scheduling pipelines can handle a subset of the 128 candidate instructions. Because the subset can contain (in some cases) up to half of the instructions in the window, the Qbox includes "pickers" that choose the best instruction out of a set of 64 candidates. Scheduling is a four step process: 1. Identify all data-ready instructions. Compaq Confidential 5 January 2001 ···Subject To Change Architecture Overview 2-9 Instruction Unit 2. For each pipe, select the ''oldest" data-ready instruction enabled for execution in that pipe. 3. Assert the result-ready signal that corresponds to each selected instruction, so that all instructions that are stored in the instruction queue can see that the chosen instructions have been issued. 4. For each instruction in the instruction queue, test the result-ready signal for each operand for each instruction in that queue. The Qbox selects the eight best "data ready" instructions for execution in eight integer pipeline units and four floating-point pipeline units. In addition, the Qbox selects up to four data-ready branch instructions for resolution in each cycle. 
It also retires all eligible instructions, committing them to architectural state. The Qbox consists of the following components: Table 2-4 Qbox Major Component Summary Name Description Bid Enable Logic Prevents otherwise-ready instructions from bidding in pipes that cannot service them, either because of a slotting decision or because of non-data-related resource conflict. Completion Unit Tracks which instructions have issued, which have passed their trap points, which are I/O instructions, and which have retired. Dependency Arrays Contains an identifier for the producer of each operand for each instruction in the instruction queue. Destination Register Number Array Contains the destination register specifiers for each instruction. This array are separately located from the SRN because it is not on any performance-critical paths. Exception Kill Logic Removes from the Instruction Queue any instructions that have been killed due to an exception. FPCR Control Controls the update of the FPCR in the Fbox. The FCR, along with the native mode FPCR trap and PALmode fetch barrier, guarantees the correct architecture (in-order) behavior of writing and reading the FPCR register. InFlight Table Tracks instructions that have issued and feeds INums that have passed their trap points to the Completion unit. Instruction Queue The queue from which instructions are picked for execution. Load/Poison Re-arm Widget Handles notification of load/miss events from the Mbox and ensures that all instructions that depend on a missed load will replay at some later time. The LPR also determines when individual instructions are eligible to be deallocated. Load/Store Number High-Water Marker Disables load and store instructions whose LSNums indicate that there may not be space available for them in the Mbox load/store queues. Also contains the logic for preserving the consistency of the DTB on misses. Oldest CBR Selector Identifies the oldest conditional branch issuing in the current cycle (that is, the one most likely to cause a misprediction). Payload Array Contains all the instructions and the register file addresses of all operands. Picker Arrays On each cycle, chooses the oldest data-ready instruction for each execution pipeline. Post-Issue Logic Gathers bubble requests and routes them to the appropriate pipelines. The Post-Issue Logic is also responsible for sequencing completion signals for the floating-point pipelines. Compaq Confidential 2-1 o Architecture Overview 5 Jc1nwtry 2001 m Subject To Change Execution Unit Table 2-4 Qbox Major Component Summary (Continued) Name Description Profile-Me Collection Collects the following instruction-time-oriented performance data for the two inflight profile-me instructions: data ready, bid, issue, deallocation, and queue chunk deallocation. Queue Chunk Allocator/ Manages the 32 chunks for instruction queue allocation. Picks the two chunks to be Deallocator allocated to the next group of eight instructions. Queue Entry Table Translates INum dependencies delivered from the Pbox INum Mapper stage into queue entry number dependencies. The queue entry table also sets the No Live Dependency bits, when, for example, an instruction is data-ready upon entry into the queue. Source Registers Number Arrays Contain the indexes of the physical registers assigned to each source operand of each instruction. These arrays (there are two) are kept close to the dependence/bid/grant logic as the launch of the input physical register specifiers may be a critical path. 
2.5 Execution Unit The execution unit receives instruction information from the Qbox Payload Array and the Qbox source and destination register number arrays (SRNs and DRN). The former is received directly by the Ebox or Fbox execution units; the latter by the Register File. 2.5.1 Register File Although the Alpha architecture only defines 64 registers, the 21464 is a multithreaded, out-of-order machine that requires many more than 64 registers to keep its pipelines full. The four independent threads each require 64 registers, and an additional 256 temporary registers are used to rename registers of inflight instructions to eliminate write-after-read and write-after-write conflicts. At 65 bits per entry, 512-entries result in a 4KB register file. Eight parallel execution units can consume up to 16 source operands and can produce up to eight results per cycle. The 21464 implements each of 32K 'not-so-little' RAM cells with 16 read ports and 8 write ports. Although such an implementation is not trivial, defining a register file with fewer ports would have forced the Qbox to either issue instructions based on the number of operands needed from the register file, or trap whenever the set of issued instructions needed more than the available number of ports. 2.5.2 Integer Instruction Execution Unit - the Ebox The Ebox executes those Alpha instructions that do not reference memory and are not floating point. The Ebox contains multiple copies of its various processing elements, allowing the Qbox to schedule as many as eight instructions per cycle. Compaq Confidential 5 January 2001 ··· Subject To Change Architecture Overview 2-11 · Execution Unit Table 2-5 lists the Ebox major components. Table 2-5 Ebox Major Component Summary Component Description Integer Units (8) The integer functional units execute the traditional integer arithmetic and logical instructions as well as performing the address generation and data formatting of memory instructions. Multimedia Units (4) The multimedia units execute the newer integer instructions targeted at accelerating multimedia operations and also perform integer multiplication. Register Caches (4) The register caches store recently written register values allowing dependent instructions to issue before the register file is updated. Structurally, the Ebox processing elements are organized into eight functional units, each of which executes a predefined subset of the instruction set, as listed in Table 2-6. Each integer functional unit is a logical collection of processing elements that collectively execute a specific set of Alpha instructions, and each functional unit is organized as four clusters of two units each. Table 2-6 Ebox Cluster Section Summary Section Name In Units Description Adder A full 64-bit signed integer adder that produces a complete result each cycle. Services the following instructions: 0-7 Type Instructions Add ADDL, ADDLN, ADDQ, ADDQN, S4ADDL, S8ADDL, S4ADDQ, S8ADDQ SUBL, SUBL/V, SUBQ, SUBQN, S4SUBL, S8SUBL, S4SUBQ, S8SUBQ CMPBGE, CMPULT, CMPEQ, CMPULE, CMPLT, CMPLE LDAH, LDA, RS, RC Sub Compare Other Cross Cluster Result Interface 0-7 Receives one-cycle results from the other functional units, bypasses the data onto the operand busses if immediately needed, and latches the data for writing into the local register cache. Global Control 0-7 Decodes the instruction information sent by the Qbox and coordinates the various processing elements within a functional unit. 
Load Data Interface 4-7 Interfaces the data returned from the Mbox to the functional units and register caches. Services the following instructions: Type Instructions Load LDL, LDQ, LDQ_U, LDL_L, LDQ_L, LDBU, LDWU, LDG; LDS, LDT,LDF HW_LD, STx_C Special Compaq Confidential 2-12 Architecture Overview 5 JamJc1ry 2001 ···Subject To Change Execution Unit Table 2-6 Ebox Cluster Section Summary (Continued) Section Name In Units Logic Box 0-7 Description Performs logical and arithmetic operations. Services the following instructions: Type Instructions Cmove CMOVLBS, CMOVLBC, CMOVNE, CMOVLT, CMOVGE, CMOVLE, CMOVGT BLBC, BEQ, BLT, BLE, BLBS, BNE, BGE, BGf AND, BIC, BIS, ORNOT, XOR, EQV AMASK, IMPLVER, SEXTB, SEXTW Branch Logical Special Multimedia Operand Interface 4-7 Forwards the instruction operands from the corresponding integer functional unit to the multimedia clusters. Each multimedia cluster is associated with the lower integer functional unit in a cluster and derives its operands from that functional unit. Services the following instructions: Type Instructions Multiply Multimedia Store MULL, MULLN, MULQ, MULQN, UMULH Opcode lC.XX, except SEXTB, SEXTW STL, STQ, STQ_U, STL_C, STQ_C, STB, STW, STG, STS, STT, Special ITOFF, ITOFS, ITOFf, HW_ST STF Register File Operand Interface 0-7 Interfaces the operands from the register file to the Ebox opbusses. Also bypasses literals onto the opbusses. Register File Result Pipe 0-3 Handles staging of different result latencies, floating-point load format conversion and forwarding of results to the register file. Compaq Confidential 5 January 2001 ·- Subject To Change Architecture Overview 2-13 Execution Unit Table 2-6 Ebox Cluster Section Summary (Continued) Section Name In Units 0-3 Shifter Store Data Interface Description A full 64-bit shifter that produces a complete result each cycle. Services the following instructions: Type Instructions Shift Mask Extract Insert Zap SRL, SLL, SRA MSKBL, MSKWL, MSKLL, MSKQL, MSKWH, MSKLH, MSKQH EXTBL, EXTWL, EXTLL, EXTQL, EXTWH, EXTLH, EXTQH JNSBL, JNSWL, JNSLL, JNSQL, JNSWH, JNSLH, JNSQH ZAP,ZAPNOT Interfaces to the store data buses (to the Mbox). This unit is not actually part of the integer clusters but resides in a separate partition to the right of the integer clusters. Services the following instructions: Type Instructions Store STL, STQ, STQ_U, STL_C, STQ_C, STB, STW, STG; STS, SIT, Special ITOFS, ITOFF, ITOFT, FTOIS, FTOIT STF Virtual Address Generator 4-7 Computes the 16-bit displacement add and factors the big/little endian control to form a correct virtual memory address. Services the following instructions: Type Instructions Load LDL, LDQ, LDQ_U, LDL_L, LDQ_L, LDBU, LDWU, LDG; LDS, LDT,LDF STL, STQ, STQ_U, STL_C, STQ_C, STB, STW, STG; STS, SIT, STF Store Jump Special JM~JSR,RE~JSR_COROUTINE TRAPB, EXCB, MB, WMB, ECB, FETCH, FETCH_M, WH64, HW_LD, HW _ST, HW_MTPR, LDx_ARM, QUIESCE 2.5.3 Floating-Point Instruction Execution Unit - the Fbox The Fbox executes all current Alpha floating-point instructions and the new paired single-precision instructions. The Fbox receives instructions from the Qbox, by way of the Ebox, and receives operands from the register file, the load data buses (up to three), or its own register caches. The Fbox returns floating-point results to the Register File and floating-point store data to the Mbox, again by way of the Ebox. The Fbox returns exception information to the Qbox. 
Compaq Confidential 2-14 Architecture Overview 5 J<1nuary 2001 ·-Subject To Change Execution Unit Table 2-7 lists the Fbox major components. Table 2-7 Fbox Major Component Summary Component Description Floating-point control register (FPCR) Contains rounding information and trap disable bits used by the floating-point operate instructions, and exception status information from floating-point operate instructions. The FPCR is read from and written to the floating-point registers by the MF_FPCR and MT_FPCR instructions. In addition, all operate instructions use the dynamic rounding mode bits to round the results and the trap disable bits to signal traps when an exception is detected. Interface control (F_JN1) Performs a partial decode of the opcode, function code, and thread processor unit (TPU) to determine if a valid floating-point instruction has been issued. The F_INT also contains logic that allows direct access to internal operand buses from Register File operand buses, and logic to dispatch floating-point store data to the Ebox from either the result data of pipelines F _PO and F_Pl, or from the register cache. Operand steering unit (F_OSU) Performs comparisons against incoming physical register (Preg) numbers to determine the source of input operands to the Fbox pipelines. Pipeline Clusters (F_Pn) The Fbox is organized as four identical clusters, each cluster consisting of one execution pipeline. The four pipelines, F_PO through F_P3, allow up to four floating-point operate instructions to be issued at each cycle. Two copies of a register cache, one for each set of two pipelines, are included to allow the results of recently completed instructions to be used with minimal delay. Each pipeline contains the functional units needed to execute the various floatingpoint instructions. Register cache (F_RGC) Contains staging logic and static RAM that latch and hold recently generated result data of the Fbox pipelines as well as copies of incoming floating-point loads. The result data is eventually dispatched to the Register File. However, this result and load data can be used in subsequent floating-point operations without incurring the transit time delay in returning data from the Register File 2.5.3.1 Functional Units Table 2-8 lists the instructions that are executed by each functional unit in the Compaq Confidential 5 January 2001 --· Subject To Change Architecture Overview 2-15 Memory Controller Unit - the Mbox Fbox. Table 2-8 Fbox Functional Unit Summary Functional Unit Instructions Add pipe 1 : F_APl ADD,SUB,CMP Add pipe 2 : F_AP2 ADD/SUB (align> 1), CVTff, CVTfq, CVTqf, CVTql, CVTlq Divider : F_DIV DIV 1 Graphics ADD: F_GAD Paired single-precision except PMUL, PARCPL, and PARSQRT Graphics MUL : F_GML Paired single-precision MUL type instructions: PMUL, PARCPL, PARSQRT Mull Unit : F_MUL MUL Short pipe : F_SHP CPYSx, FCMOV, FBxx Special operands (Zeros, Denormal OPD, NANs, INF,RES.OPD),INPUT EXCEPTIONS, Mx_FPCR SQRT1 Square root : F_SQR 1 See Section 2.4.3 for instruction issue rules regarding the DIV and SQRT instructions. 2.6 Memory Controller Unit - the Mbox The Mbox executes Alpha memory access instructions, including integer and floatingpoint load and store, memory barrier, prefetch, write-hint, load-locked, and store-conditional. The Mbox can process up to four instructions per cycle, out of order. At each cycle, the Mbox can accept as many as three load instructions and as many as two store instructions, for a maximum of four operations. 
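A minimal restatement of that per-cycle acceptance constraint is shown below. It is illustrative only and does not model how instructions are actually assigned to the four Mbox ports described next.

```c
/*
 * Per-cycle Mbox acceptance limits as stated in the text: at most three
 * loads, at most two stores, and at most four memory operations in total.
 */
#include <stdbool.h>

bool mbox_can_accept(int loads, int stores)
{
    return loads  <= 3 &&
           stores <= 2 &&
           loads + stores <= 4;
}
```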
The Mbox is solely responsible for tracking memory reference instructions that have issued but not retired, and for ensuring that the final effect of memory reference instructions is equivalent to sequential execution of the thread, within the Alpha SRM definition of equivalence. The Mbox also receives fill data from the Cbox and, to maintain cache coherence, processes probes that the Cbox receives from the rest of the system. There are two data input busses, each of which is associated with a store port. The Mbox has four instruction ports to handle loads, stores, and prefetches. The Mbox can return data on three of those ports, so the Mbox can accept a maximum of three loads issued per cycle. Of the four ports: • • • Two can perform loads and prefetches One can perform loads, stores and prefetches One can perform only stores Compaq Confidential 2-16 Architecture Overview 5 Jc111uary 2001 -- Subject To Change External Interface Table 2-9 lists the Mbox major components. Table 2-9 Mbox Major Component Summary Component Description Dcache 64KB of data storage, with a write-allocate, write-through write-policy Dtags lK entries of tag storage, arranged as 2-way set-associative with 4 read ports and 1 write port Load Queue 64-entry queue that holds issued, but not-retired load addresses. Handles load ordering traps and re-issuing of loads Merge Buffer 16-entry buffer that accumulates Store data before writing it into the Dcache and Cbox Pre-MAF 16-entry buffer that holds the addresses of loads that have missed in the Dcache and need further activity in the Cbox. Store Queue 64-entry queue that holds store addresses & data before stores have retired. Used to satisfy load requests to addresses with uncompleted stores Translation Buffers 128-entry, fully-associative with 4 read ports to perform the virtual-to-physical address transactions 2. 7 External Interface The responsibilites of the external interface unit include: • Resolve misses in the !cache and Dcache, either in the Scache, local memory, or remote memory. • Ensure that data written by the processor is made visible coherently to other processors and 1/0 nodes. • Communicate with other nodes in a multiprocessor configuration so that the total memory space can be shared. • Control Rambus memories to provide physical memory to the multiprocessor. • Implement a coherence protocol that ensures that all processors have a consistent image of memory. • Accept and prioritize interrupt requests, delivering thread-specific requests to the Qbox. The external interlace unit consists of three major subsections that work together, in conjunction with the cache coherency protocol, to present a distributed, shared, coherent, cached, multiprocessor memory (CC-NUMA) to the 21464 core. 2.7.1 Scache Controller - the Cbox The Cbox controls the second-level cache (Scache). In particular, the Cbox controls: • • • Write-through from the Mbox • Probes from the system Requests for cache blocks from the Ibox and Mbox Fills and displaced victims Compaq Confidentia I 5 January 2001 --· Subject To Change Architecture Overview 2-17 Ex.1ernal Interface The Cbox contains the following major components: Table 2-10 Cbox Major Component Summary Component Description Miss address file (MAF) Holds requests from the processor whilst being processed. Victim address file (VAF) Victim data buffer (VDB) Hold blocks being sent back to the system either as displacement victims or in response to system probes. Probe address file (PAF) Holds probes waiting to be processed. 
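The specification does not detail MAF behavior here; the following is a hypothetical sketch of the kind of bookkeeping a miss address file performs. The merge-on-matching-block policy, the entry count, and the block size are assumptions for illustration, not the documented Cbox design.

```c
/*
 * Hypothetical miss-address-file (MAF) allocation sketch.  The spec only
 * says the MAF holds processor requests while they are processed; merging
 * requests to the same block and the sizes shown are illustrative
 * assumptions, not the 21464 Cbox behavior.
 */
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAF_ENTRIES 16                 /* placeholder size */
#define BLOCK_SHIFT 6                  /* assume 64-byte cache blocks */

struct maf_entry {
    bool     valid;
    uint64_t block;                    /* physical block address */
    unsigned waiters;                  /* requests coalesced onto this miss */
};

static struct maf_entry maf[MAF_ENTRIES];

/* Returns true if the request was recorded (new entry or merged);
 * false means the MAF is full and the request must be retried. */
bool maf_allocate(uint64_t paddr)
{
    uint64_t block = paddr >> BLOCK_SHIFT;
    struct maf_entry *free_slot = NULL;

    for (int i = 0; i < MAF_ENTRIES; i++) {
        if (maf[i].valid && maf[i].block == block) {
            maf[i].waiters++;          /* merge with the outstanding miss */
            return true;
        }
        if (!maf[i].valid && free_slot == NULL)
            free_slot = &maf[i];
    }
    if (free_slot == NULL)
        return false;                  /* structural stall */

    free_slot->valid   = true;
    free_slot->block   = block;
    free_slot->waiters = 1;
    return true;
}
```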
2.7 .2 Router - the Rbox The Rbox provides the interprocessor switch - the communication fabric by which 21464 processors are interconnected to form glueless multiprocessor systems. The Rbox interfaces the local processor and memory to I/O controllers, all other processors, and their associated memories, through five bidirectional ports. The Rbox includes the following physical components: • Port input queues -packets received from interface but not yet transferred to an output queue • • Port output queues - packets waiting to be transferred to a connected processor • Arbitration - Routing tables - translate destination node number or mask into output port selection and virtual channel 2.7.3 Rambus Interface - selects among port input queues for transfer to output queues the Zbox The Zbox provides a glueless interface to two independent interleaved arrays of Rambus memories for processor's main memory, including cache-coherence directory. Each array consists of four busses, each accessing up to 32 DRAM chips. The Zbox includes the following physical components: • Rambus queues and sequencer - controls attached Rambus memories for read and write operations. Includes scheduling table and page status. • Directory management and coherence protocol state machine. • Directory in flight table- D IFT records requests to the local memory that cannot complete immediately because required data is "in-flight" somewhere in the system. 2.7 .4 Cache Coherency Protocol The 21464 adopts the 21364 cache coherence protocol with small enhancements. The protocol is a directory based CC-NUMA and tolerates out-of-order channels except for the 1/0 channel, thereby supporting an adaptive packet routing. 2.7 .4.1 Introduction to the Protocol The coherence protocol is the mechanism that lets numerous processors maintain a consistent image of the contents of memory, as required by the Alpha SRM. Compaq Confidentia I 2-18 Architecture Overview 5 Janwiry 2001 ···Subject To Change Pipeline Organization The 21464 increases reliability and load-distribution by using multiple resources for enforcing cache coherence. Further, the 21464 uses nondeterministic routing, which makes the best use of available network resources. Such routing allows two messages to take different paths and get out of order, even if they start and end at the same nodes. The protocol is designed to ensure that all processors that cause and/or observe changes in memory, see those changes occur in the same apparent order, even though the messages between processors and memories may get out of order. The order observed by all processors is the order in which requests are serviced in their home memory and, in particular, in the Mbox directory in-flight table (the DIFT). Caches communicate with the DIFT as they manipulate memory data, and the DIFT delays multiple requests for any individual block until it has coordinated previous requests with any caches affected by those requests. The protocol, as managed by the DIFT, is concerned with the transitions between states, and with performing the transitions in such a way that as much of the communication latency as possible is kept out of the critical paths. The memory system is designed with the expectation that a disproportionate fraction of the memory traffic produced by any processor is addressed to its own local memory; this is true for most multiprocessor applications, though precisely how much is highly application-dependent. 
The protocol uses this fact, and the onchip communication between a cache and its local controller, to optimize references to the local memory. The Dcache optimizes the directory accesses for requests from local and remote processors. The onchip Dcache stores the directory information of the most frequently used cache blocks to minimize memory accesses for directory information. The Dcache is updated by requests from the local Cbox and remote processors, thereby eliminating the need for the LPR.

2.7.4.2 Structures that Maintain the Cache Coherence

Cache coherence is maintained by using the following structures:

• Miss address file (MAF)
• System request pending queue (SRQ)
• Victim buffer
  - Victim address file (VAF)
  - Victim data buffer (VDB)
• Probe queue (PRQ)
• Directory in-flight table (DIFT)

2.8 Pipeline Organization

The pipeline is organized as follows.

2.8.1 Pipeline Diagram

Figure 2-2 shows the 21464 pipeline stages. Note the following symbol meanings in Figure 2-2:

Symbol  Meaning
V8      Exception funnel timing. The cycle at which an exception kill is driven onto the Retire/Kill Bus. Its position in this diagram is relative to the first good-path instruction block after the exception kill is posted on the Retire/Kill Bus.
CMP1    Completion of instructions issued from the 4 main computation pipes that caused no exceptions.
RET1    The earliest retire cycle (Retire Bus cycle) of instructions completed in CMP1.
CMP2    Completion of instructions issued from the 4 main computation pipes which may cause an exception (including all the floating-point instructions).
RET2    The earliest retire cycle (Retire Bus cycle) of instructions completed in CMP2. This is also the V3 timing of a Retire Time Exception.
CMP3    Completion of instructions issued from the 4 memory pipes.
RET3    The earliest retire cycle (Retire Bus cycle) of instructions completed in CMP3. This is also the V3 timing of a Retire Time Exception.

In Figure 2-2, alphabetic characters that follow the box letter (such as the W in the second row's PW) signify negative integers and are defined in Table 2-11.

Figure 2-2 21464 Pipeline Stage Diagram
[The pipeline stage diagram is not reproducible in this text rendering.]

2.8.2 Conversion Between Negative Integer and Alphabet

Table 2-11 shows the conversion between negative integers and the alphabet.

Table 2-11 Negative Integers to Alphabetics Conversion

A    B    C    D    E    F    G    H    I    J    K    L    M
-26  -25  -24  -23  -22  -21  -20  -19  -18  -17  -16  -15  -14

N    O    P    Q    R    S    T    U    V    W    X    Y    Z
-13  -12  -11  -10  -9   -8   -7   -6   -5   -4   -3   -2   -1

2.8.3 Basic Pipeline Stage Conversion Equations

The basic pipeline stage conversion equations are as follows. The conversions are tabulated in Table 2-13.

Table 2-12 Pipeline Stage Conversion Equations

To  From
I   P + 4
P   Q + 2
Q   R + 4
R   E + 4
E   F + 0
F   M + 1
M   C + 3
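The two tables amount to a fixed per-box offset: the Table 2-13 entry in row Y (To), column X (From) is the amount to add to a box-X stage number to obtain the equivalent box-Y stage number. The following C fragment is a minimal, illustrative restatement of Tables 2-12 and 2-13; the enum and base[] names are not from the specification, and the base values are simply the chained Table 2-12 offsets.

```c
/*
 * Illustrative only: pipeline stage conversion per Tables 2-12 and 2-13.
 * base[] holds each box's stage-0 position expressed in Ibox stage numbers,
 * derived by chaining the Table 2-12 equations (I = P+4, P = Q+2, ...).
 */
#include <stdio.h>

enum box { BOX_I, BOX_P, BOX_Q, BOX_R, BOX_E, BOX_F, BOX_M, BOX_C, BOX_V };

static const int base[] = {
    [BOX_I] = 0,   /* Ibox reference                     */
    [BOX_P] = 4,   /* I = P + 4                          */
    [BOX_Q] = 6,   /* P = Q + 2                          */
    [BOX_R] = 10,  /* Q = R + 4                          */
    [BOX_E] = 14,  /* R = E + 4                          */
    [BOX_F] = 14,  /* E = F + 0                          */
    [BOX_M] = 15,  /* F = M + 1                          */
    [BOX_C] = 18,  /* M = C + 3                          */
    [BOX_V] = -6,  /* from Table 2-13, row I, column V   */
};

/* Convert stage <a> in box <from> to the equivalent stage in box <to>.
 * The difference (result - a) reproduces the Table 2-13 entry at
 * row <to>, column <from>. */
static int convert_stage(enum box from, int a, enum box to)
{
    return a + base[from] - base[to];
}

int main(void)
{
    /* Example: Qbox stage 0 corresponds to Ebox stage -8,
     * which Table 2-11 encodes as the letter S. */
    printf("Q0 as an E stage: %d\n", convert_stage(BOX_Q, 0, BOX_E));
    return 0;
}
```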
2.8.4 Conversion Table

From box X stage <a> to box Y stage <b>, <b> = <a> + <the entry at the intersection of row Y (To) and column X (From) in Table 2-13>.

Table 2-13 Pipeline Stage Conversion (rows = To, columns = From)

To \ From   I    P    Q    R    E    F    M    C    V
I           --   +4   +6   +10  +14  +14  +15  +18  -6
P           -4   --   +2   +6   +10  +10  +11  +14  -10
Q           -6   -2   --   +4   +8   +8   +9   +12  -12
R           -10  -6   -4   --   +4   +4   +5   +8   -16
E           -14  -10  -8   -4   --   +0   +1   +4   -20
F           -14  -10  -8   -4   -0   --   +1   +4   -20
M           -15  -11  -9   -5   -1   -1   --   +3   -21
C           -18  -14  -12  -8   -4   -4   -3   --   -24
V           +6   +10  +12  +16  +20  +20  +21  +24  --

2.9 Instruction Execution Pipelines and Latency

Instruction Latency

Defines the parent-to-child issue latency. Also identifies any cross-pipeline delay associated with broadcasting the parent's results to other pipelines. Instructions that are not pipelined are also identified as "bubbling" for completion. Latency is shown in Table 2-14 in the following formats:

Format  Meaning
n       N-cycle latency to a child in any pipeline
m+n     M-cycle latency, plus an extra n cycles to other pipelines
n+B     N-cycle latency, non-pipelined; requires a bubble (B) to signal completion

Execution Pipelines

In Table 2-14, the Pipelines column identifies those of the eight pipelines in which the instruction can execute. The actual slotting algorithm is a function of the types and positions of the instructions in each map block. Details about instruction slotting can be found in Section 2.10. That an instruction is slotted to a particular pipeline does not mean it must execute there: follow-me capabilities in the Qbox allow instructions whose operands are data-ready in another allowed pipeline in the same half of the queue to issue from that pipeline. Pipelines 0, 2, 5, and 7 are in one half of the queue; pipes 1, 3, 4, and 6 are in the other half. Pipelines are described in Table 2-14 in the following formats:

Format   Meaning
0-7      Can execute in any pipe
0-3      Can execute in pipes 0, 1, 2, or 3
0,3      Can execute in only pipes 0 or 3
0-1      Can execute in only pipes 0 or 1, and not both in the same cycle
Alt 0-3
Can execute in pipes 0, 1, 2, or 3, but does not issue to the same pipe in consecutive cycles Table 2-14 Instruction Execution Pipelines and Latency Mnemonic Pipelines Latency Mnemonic Pipelines Latency 113 1 11 PALcode (Opcodes as follows:) 00 CALL_PAL 0-1 5 1D HW_MTPR 6,7 0-1 1B HW_LD 6,7 3 lF HW_ST 4,5 19 HW_MFPR 0-1 5 1E IFETCHB 4,5 Add/Subtract/Compare (Opcode 10) ADDL 0-7 1+ 1 S4ADDQ 0-7 1+1 ADDQ 0-7 1+1 S4SUBL 0-7 1+1 CMPBGE 0-7 1+1 S4SUBQ 0-7 1+1 Compaq Confidential 2-22 Architecture Overview 5 January 2001 ··· Subject To Change Instruction Execution Pipelines and Latency Table 2-14 Instruction Execution Pipelines and Latency Mnemonic Pipelines Latency Mnemonic Pipelines Latency CMPEQ 0-7 1+ 1 S8ADDL 0-7 1+ 1 CMPLE 0-7 1+ 1 S8ADDQ 0-7 1+ 1 CMPLT 0-7 1+ 1 S8SUBL 0-7 1+1 CMPULE 0-7 1+ 1 S8SUBQ 0-7 1+1 CMPULT 0-7 1+ 1 SUBL 0-7 1+1 S4ADDL 0-7 1+1 SUBQ 0-7 1+1 AMA SK 0-7 1+1 CMOVLE 0-7 1+1 AND 0-7 1+ 1 CMOVLT 0-7 1+1 BIC 0-7 1+ 1 CMOVNE 0-7 1+1 BIS 0-7 1+1 CMOV2 0-7 1+1 CMOVEQ 0-7 1+1 EQV 0-7 1+1 CMOVGE 0-7 1+ 1 INOP 0-7 1+1 CMOVGf 0-7 1+1 ORNOT 0-7 1+1 CMOVLBC 0-7 1+ 1 XOR 0-7 1+1 CMOVLBS 0-7 1+ 1 extbh 2 0-3 1+ 1 INSWH 0-3 1+1 EXTBL 0-3 1+ 1 INSWL 0-3 1+1 2 0-3 1+1 Integer Logical (Opcode 11) Integer Shift (Opcode 12) EXTLH 0-3 1+1 mskbh EXTLL 0-3 1+1 MSKBL 0-3 1+1 EXTQH 0-3 1+ 1 MSKLH 0-3 1+1 EXTQL 0-3 1+ 1 MSKLL 0-3 1+1 EXTWH 0-3 1+1 MSKQL 0-3 1+1 EXTWL 0-3 1+1 MS KWH 0-3 1+ 1 insbh 2 0-3 1+ 1 MSKWL 0-3 1+1 INSBL 0-3 1+1 SLL 0-3 1+1 INSLH 0-3 1+1 SRA 0-3 1+1 INS LL 0-3 1+1 SRL 0-3 1+1 INSQH 0-3 1+ 1 ZAP 0-3 1+1 INSQL 0-3 1+1 ZAPNOT 0-3 1+1 Compaq Confidential 5 January 2001 -· Subject To Change Architecture Overview 2-23 Instruction Execution Pipelines and Latency Table 2-14 Instruction Execution Pipelines and Latency Mnemonic Pipelines latency Mnemonic Pipelines latency MULL 4-5 5 UMULH 4-5 5 MULQ 4-5 5 Integer Multiply (Opcode 13) Integer to Floating Register Transfer (Opcode 14) ITOFF 6,7 5 SQRTG Alt 0-3 18+1 ITOFS 6,7 5 SQRTS Alt 0-3 33+1 ITOFT 6,7 5 SQRTI Alt 0-3 33+1 SQRTF Alt0-3 18+1 VAX Floating-Point (Opcode 15) ADDF 0-3 3+1 CVTQF 0-3 3+1 ADDG 0-3 3+1 CVTQG 0-3 3+1 CMPGEQ 0-3 3+1 DIVF Alt 0-3 9+B+1 CM PGLE 0-3 3+1 DIVG Alt 0-3 13+B+l CMPGLT 0-3 3+1 MULF 0-3 3+1 CVTDG 0-3 3+1 MULG 0-3 3+1 CVTGD 0-3 3+1 SUBF 0-3 3+1 CVTGF 0-3 3+1 SUBG 0-3 3+1 CVTGQ 0-3 3+1 IEEE Floating-Point (Opcode 16) ADDS 0-3 3+1 CVTIQ 0-3 3+1 ADDT 0-3 3+1 CVTTS 0-3 3+1 CMPIEQ 0-3 3+1 DIVS Alt 0-3 9+B+1 CMPTLE 0-3 3+1 DIVT Alt 0-3 13+B+l CMPTLT 0-3 3+1 MULS 0-3 3+1 CMPTUN 0-3 3+1 MULT 0-3 3+1 CVTQS 0-3 3+1 SUBS 0-3 3+1 CVTQT 0-3 3+1 SUBT 0-3 3+1 Miscellaneous Floating-Point (Opcode 17) CPYS 0-3 1+1 FCMOVGE 0-3 1+1 CPYSE 0-3 1+1 FCMOVGT 0-3 1+1 Compaq Confidentia I 2-24 Architecture Overview 5 Jc1nuc1ry 2001 ··· Subject To Change Instruction Execution Pipelines and Latency Table 2-14 Instruction Execution Pipelines and Latency Mnemonic Pipelines Latency Mnemonic Pipelines Latency CPYSN 0-3 1+1 FCMOVLE 0-3 1+1 CVTLQ 0-3 3+1 FCMOVLT 0-3 1+1 CVTQL 0-3 3+1 FCMOVNE 0-3 1+1 FCMOV2 0-3 1+1 MF_FPCR 0,3 3+1 FCMOVEQ 0-3 1+1 MT_FPCR 0,3 Miscellaneous (Opcode 18) CCB 4,5 QUIESCE 4,5 ECB 4,5 RC 4,5 1+1 RPCC 0-1 5 RS 4,5 1+1 EXCB FETCH_M FETCH 3 3 TRAPB LDL_ARM 6,7 3 WH64 4,5 LDQ_ARM 6,7 3 WH64EN4 4,5 WMB 4,5 MB 5 Multimedia (Opcode 1C) CMPLGE 2,3 5 TSQERRzzz 2,3 5 CMPWGE 2,3 5 TSUBzzz 2,3 5 CTLZ 2,3 5 UNPKBL 0' 1 5 CTPOP 2,3 5 UNPKBW 0' 1 5 CTTZ 2,3 5 UPKSBW4 0' 1 5 FfOIS 4,5 5 UPKSWL2 0' 1 5 FfOIT 4,5 5 UPKUBW4 0' 1 5 GPKBLB4 0' 1 5 UPKUWL2 0' 1 5 MAXzzz 2,3 5 VADDSL2 2,3 5 MINSB8 2,3 5 VADDUL2 2,3 5 MINSW4 2,3 5 
VADDzzz 2,3 5 MINUB8 2,3 5 VMINMAXSL2 2,3 5 MINUW4 2,3 5 VMINMAXUL2 2,3 5 PERMB8 0' 1 5 VMINMAXzzz 2,3 5 PERR 2,3 5 VMULHUW4 2,3 5 PKLB 0' 1 5 VMULLUW4 2,3 5 PKSLW4 0' 1 5 VSLB8 0' 1 5 Compaq Confidential 5 January 2001 ···Subject To Change Architecture Overview 2-25 Instruction Execution Pipelines and Latency Table 2-14 Instruction Execution Pipelines and Latency Mnemonic Pipelines Latency Mnemonic Pipelines Latency PKSWB8 0' 1 5 VSLL2 0' 1 5 PKULW4 0' 1 5 VSLW4 0' 1 5 PKUWB8 0' 1 5 VSRAB8 0' 1 5 PKWB 5 VSRAL2 0' 1 5 SEXTB 0' 1 0-7 1+1 VSRAW4 0' 1 5 SEXTW 0-7 1+1 VSRB8 0' 1 5 TABSERRzzz 2,3 5 VSRL2 0' 1 5 TADDzzz 2,3 5 VSRW4 5 TMULUSB8 2,3 5 VSUBSL2 0' 1 2,3 TMULUSW4 2,3 5 VSUBUL2 2,3 5 TMULzzz 2,3 5 VSUBzzz 2,3 5 5 Load and Store (Opcodes as follows:) 08 LDA 0-7 1+1 26 STS 4,5 36 09 LDAH 0-7 1+1 27 STT 4,5 36 OA LDBU 6,7 3 28 LDL 6,7 3 OB LDQ_U 6,7 3 29 LDQ 6,7 3 oc LDWU 6,7 2A LDL_L 6,7 3 OD STW 4-5 3 36 2B LDQ_L 6,7 OE STB 4-5 36 2C STL 4-5 3 36 OF STQ_U 4.5 36 2D STQ 36 20 LDF 6,7 5 2E STL_C 4-5 4-57 21 LDG 6,7 5 2F STQ_C 4-57 3 22 LDS 6,7 5 23 LDT 6,7 24 STF 4-5 5 36 25 STG 4-5 36 3 Branch and Jump (Opcodes as follows:) lA.O JMP 0-1 5 36 FBGE 0' 1 lA.1 JSR 0-1 5 37 FBGf lA.2 RET 0-1 5 38 BLBC 0' 1 0-7 lA.3 JSR_CO 0-1 5 39 BEQ 0-7 30 BR 0-1 5 3A BLT 0-7 Compaq Confidentia I 2-26 Architecture Overview 5 JanutAry 2001 ·- Subject To Change Instruction Issue and Retire Rules Table 2-14 Instruction Execution Pipelines and Latency Mnemonic Pipelines 3B BLE 0-7 0' 1 3C BLBS 0-7 FBLE 0' 1 3D BNE 0-7 34 BSR 0-1 3E BGE 0-7 35 FBNE 0-3 3F BGf 0-7 Mnemonic Pipelines 31 FBEQ 0' 1 32 FBLT 33 Latency 5 Latency 1 HW_MTPR instructions can specify a writer class to create an issue dependency to future HW_MxPR instructions. HW_MxPR instructions that indentify a reader class dependency are scheduled to issue no earlier than 1 cycle after the HW_MTPR instruction that wrote the class dependency. HW_MTPR instructions can also specify writer class dependencies that are satisfied on completion, rather than issue. HW_MxPR instructions that identify a reader class dependency against this type of writer class are scheduled to issue no earlier than three cycles after the issue of the completion bubble signal to the writer. The 21464 only allows specifying completion dependencies against HW_MTPR instructions that target the Mbox; those that target the Ibox are ignored. 2 The mskbh, insbh and extbh decodes are not formally defined by the Alpha SRM because all combinations of inputs produce a zero result. The generalized decoding in the 21464 Integer Shifter does not special case these code points and produces a zero result. 3 FETCHx instructions never actually issue from the Qbox but are completed immediately and therefore act as NOPs. 4 The WH64EN instruction is currently proposed as ECO#l27 to the Alpha SRM. 5 MB instructions never formally issue from the Qbox but are instead sent to the Mbox as soon as they enter the Qbox. MB instructions do not complete until the Mbox notifies the Qbox that the necessary conditions have been met. 6 Although store instructions do not produce a register result and therefore do not have normal dependents, the Ibox store-set logic can create dependency groups of loads and stores. A load that is a storeset dependent on a store instruction has an effective issue latency of three cycles from the issue of the store. 7 Store conditional instructions issue as stores to pipelines 4 and 5, but bubble back completion to the Qbox. 
Final completion of the STx_C instruction appears on the load pipes 6 and 7. 2.10 Instruction Issue and Retire Rules 2.10.1 Issue Rules In order to issue from the Qbox instruction queue (the IQ), instructions must bid in, and be granted by, a picker. Slotting determines each instruction's "preferred pipe", i.e. in which picker it may bid. Each cycle, the oldest bidding instruction in a picker is granted. Only instructions that are bidding in a given cycle are candidates for grants. 2.10.1.1 Bidding Rules The following rules apply to initial issue; there are additional qualifications for instructions that must re-issue. General Bids In general, instructions may bid when all of their source operands are result-ready. Additional conditions apply in the following cases. Compaq Confidential 5 January 2001 ··· Subject To Change Architecture Overview 2-27 Instruction Issue and Retire Rules • Stores and loads that are slotted for a load picker may bid if the following are true: They are result-ready They are below their high-water mark (which signifies that the Mbox has sufficient resources to exectute them) They are not dependent on a DTB writer block • Loads must satisfy any store-set dependencies prior to being enabled to bid. • All loads and stores are speculatively assumed to be below their high-water mark when they first allocate into the IQ; their actual status is available one cycle later. Any loads or stores that are granted as the result of a bid that was based on false high-water mark speculation are retracted and do not issue from the IQ. • Jwnps (JMP, JSR, JSR_COROUTINE, RET), direct branches (BR, BSR), RPCC, CALL_PAL, HW_MTPR/HW_MTPR for lbox IPRs, and HW_LD/WrChk may only bid if they are result-ready and their slotted picker is load-enabled in the current cycle. • Because floating-point divide and square root instructions are not pipelined operations, they must not issue from the same picker on subsequent cycles and are thus enabled to bid only on every other cycle. Unfortunately, the IQ logic does not have time to disable bids for an FDIV or FSQRT functional unit for which an instruction has been been granted on the immediately preceding cycle. Therefore, the 21464 globally disables all FDIV and FSQRT bids every other cycle to give the IQ time to determine exactly which instructions may safely bid. • Instructions expressly identified as NOPs do not bid or issue but are allocated into the IQ as invalid (i.e. empty) entries. • MB instructions do not issue from the IQ but are subject to special retire conditions as described in Section 2.10.2. • Instructions stop bidding if they are killed. Instructions that are killed after being granted, but before being issued from the IQ, do not issue. Follow Me Bids: Instructions that have a cross cluster delay become "locally" result-ready in the cluster in which their result is produced one cycle earlier than they become "globally" resultready in the rest of the IQ. Instructions that are locally result-ready, and meet all other bid criteria, may bid in the relevant picker for that one-cycle window, even if it is not their preferred picker. This is known as a "follow me" bid, since the dependent instructions follows their parent into a cluster. Instructions are only enabled to make a follow me bid in pickers from which they may actually issue - in other words, they must be of a type supported by the functional units serviced by the picker. 
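The core picker rule above (each cycle, the oldest bidding instruction in a picker is granted) can be sketched as follows. The IQ entry fields and the age encoding are simplified placeholders; in particular, real INums wrap, and a bid is qualified by all of the conditions listed above.

```c
/*
 * Minimal sketch of the Section 2.10.1 picker rule: among the entries
 * bidding in this picker this cycle, grant the oldest.
 */
#include <stdbool.h>
#include <stdint.h>

struct iq_entry {
    bool     valid;
    bool     bidding;     /* all bid conditions met this cycle */
    uint32_t inum;        /* smaller INum == older instruction (simplified) */
};

/* Returns the index of the granted entry, or -1 if nothing is bidding. */
int picker_grant(const struct iq_entry *iq, int entries)
{
    int grant = -1;
    for (int i = 0; i < entries; i++) {
        if (!iq[i].valid || !iq[i].bidding)
            continue;
        if (grant < 0 || iq[i].inum < iq[grant].inum)
            grant = i;    /* keep the oldest bidder seen so far */
    }
    return grant;
}
```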
Compaq Confidential 2-28 Architecture Overview 5 J(111u(1ry 2001 - Subject To Change lmplementation..Specific Architecture Features 2.10.2 Retirement Rules An instruction is eligible to retire if it is complete and all older, unretired instructions within its TPU are also complete. The Qbox Completion Unit (CMP) retires instructions one INum block at a time, but signals retire eligibility to the Retire/Kill Bus on as fine as a per-instruction granularity (see the Completion Unit descripton for more details). 2.10.2.1 Completion Rules In general, instructions that have passed their poison point and their trap point - that is, the last point in time when they can cause a disruption - are completed, with the following exceptions. • Some memory instructions pass their trap point very late in the pipeline and are therefore speculatively completed and subsequently uncompleted when any disruption information becomes available. • Instructions identified as NOPs complete immediately upon allocation into the IQ . • Killed instructions are automatically completed. • MB and STx_C instructions are completed only when the Mbox indicates to the CMP that it may do so. • The Mbox flags I/O operations for the CMP. I/O operations may complete normally, but the CMP may not retire any block containing them until the Mbox signals that this is permitted. • There is a facility to drain the Completion Unit pipeline in the event of an external probe, in order in insure consistency between TPUs and/or CPUs. Note that the time interval between an instruction's issue and completion depends on the particular picker from which the instruction issues. Instructions issuing on the four primary ALU pickers have a faster completion path than the others. 2.11 Implementation-Specific Architecture Features 2.11.1 New Instructions 2.11.1.1 Thread Synchronization Using a multithreading architecture, the 21464 implements three new instructions that enhance the performance of multithread processing. Table 2-15 Thread Synchonization Instructions Mnemonic Operation LDL_ARM Load Longword and Arm the Watch Register LDQ_ARM Load Quadword and Arm the Watch Register Quiesce Wait on Access to the Watch Register Compaq Confidential 5 January 2001 -·Subject To Change Architecture Overview 2-29 lmplementation..specific Architecture Features 2.11.1.2 Short Vector SIMD (Single Instruction Stream, Multiple Data Streams) The short vector SIMD instructions provide a complete set of vectorized integer operations for multimedia and signal processing applications. They allow the processing of multiple elements in each machine cycle by vectoring smaller data types that are packed into a quadword. 
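As an illustration of the packed-quadword arithmetic behind the instructions listed in Table 2-16 below, the following sketch performs a VADDUW4-style parallel add of the four 16-bit words packed in a quadword. The per-lane wraparound shown is an assumption for illustration; the architected overflow and saturation behavior is defined by the instruction set extension, not by this sketch.

```c
/*
 * Lane-wise packed add: treat a 64-bit quadword as four independent
 * 16-bit words and add them in parallel (modular per lane).
 */
#include <stdint.h>

uint64_t vadduw4(uint64_t a, uint64_t b)
{
    uint64_t result = 0;

    for (int lane = 0; lane < 4; lane++) {
        uint16_t x   = (uint16_t)(a >> (16 * lane));
        uint16_t y   = (uint16_t)(b >> (16 * lane));
        uint16_t sum = (uint16_t)(x + y);        /* modular per-lane add */
        result |= (uint64_t)sum << (16 * lane);
    }
    return result;
}
```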
Table 2-16 Short Vector SIMD Instructions Mnemonic Operation Tree Operations TABSERRSB8 Tree Absolute Error Byte TABSERRSW4 Tree Absolute Error Word TAB SERRUBS Unsigned Tree Absolute Error Byte TABSERRUW4 Unsigned Tree Absolute Error Word TADDSB8 Tree Add Byte TADDSW4 Tree Add Word TADDUB8 Unsigned Tree Add Byte TADDUW4 Unsigned Tree Add Word TMULSB8 Tree Multiply Byte TMULSW4 Tree Multiply Word TMULUB8 Unsigned Times Unsigned Tree Multiply Byte TMULUSB8 Unsigned Times Signed Tree Multiply Byte TMULUSW4 Unsigned Times Signed Tree Multiply Word TMULUW4 Unsigned Times Unsigned Tree Multiply Word TSQERRSB8 Tree Squared Error Byte TSQERRSW4 Tree Squared Error Word TSQERRUB8 Unsigned Tree Squared Error Byte TSQERRUW4 Unsigned Tree Squared Error Word TSUBSB8 Tree Subtract Byte TSUBSW4 Tree Subtract Word TSUBUB8 Unsigned Tree Subtract Byte TSUBUW4 Unsigned Tree Subtract Word Vector Operations CMPLGE Compare LongWord CMPWGE Compare Word GPKBLB4 Graphics Pack Byte PERMB8 Permute Bytes PKSLW4 Pack Signed Longwords to Words Compaq Confidential 2-30 Architecture Overview 5 Jc1m1c1ry 2001 -· Subject To Change lmplementation..Specific Architecture Features Table 2-16 Short Vector SIMD Instructions Mnemonic Operation PKSWB8 Pack Signed Words to Bytes PKULW4 Pack Unsigned Longwords to Words PKUWB8 Pack Unsigned Words to Bytes UPKSBW4 Unpack Signed Bytes to Words UPKSWL2 Unpack Signed Words to Longwords UPKUBW4 Unpack Unsigned Bytes to Words UPKUWL2 Unpack Unsigned Words to Longwords VADDSB8 Parallel Add Byte VADDSL2 Parallel Add Longword VADDSW4 Parallel Add Word VADDUB8 Unsigned Parallel Add Byte VADDUL2 Unsigned Parallel Add Longword VADDUW4 Unsigned Parallel Add Word VMINMAXSB8 Parallel MIN/MAX Byte VMINMAXSL2 Parallel MIN/MAX Long Word VMINMAXSW4 Parallel MIN/MAX Word VMINMAXUB8 Parallel Unsigned MIN/MAX Byte VMINMAXUL2 Unsigned Parallel MIN/MAX LongWord VMINMAXUW4 Unsigned Parallel MIN/MAX Word VMULHUW4 Parallel High Multiply Word VMULLUW4 Parallel Multiply Word VSLB8 Parallel Shift Left Byte VSLL2 Parallel Shift Left Longword VSLW4 Parallel Shift Left Word VSRAB8 Parallel Shift Right Arithmetic Byte VSRAL2 Parallel Shift Right Arithmetic Longword VSRAW4 Parallel Shift Right Arithmetic Word VSRB8 Parallel Shift Right Byte VSRL2 Parallel Shift Right Longword VSRW4 Parallel Shift Right Word VSUBSB8 Parallel Subtract Byte VSUBSL2 Parallel Subtract Longword VSUBSW4 Parallel Subtract Word Compaq Confidential 5 January 2001 ···Subject To Change Architecture Overview 2-31 lmplementation..specific Architecture Features Table 2-16 Short Vector SIMD Instructions Mnemonic Operation VSUBUB8 Unsigned Parallel Subtract Byte VSUBUL2 Unsigned Parallel Subtract Longword VSUBUW4 Unsigned Parallel Subtract Word 2.11.2 CMOV Instruction Processing With register renaming, the CMOV instructions must be treated as having three source operands. A CMOVx Ra, Rb, Re instruction tests Ra for the x condition and, if true, moves the contents of Rb into Re. If the condition is false, Re is left alone. Because of renaming, the newly assigned Re register does not already have a copy of the old Re, so a move has to be done in this case as well. This requires the hardware to read Ra, Rb, and Re as sources, and to write Re as a destination as well. Because the Pbox can only map two source registers and one destination register per instruction, the third source is a problem.The 21264 solved the problem by breaking the CMOV instruction into two separate instructions, CMOVl and CMOV2. 
The 21464 adopts a similar solution -when the 21464 encounters a CMOV, it inserts an additional instruction, CMOV2, into the instruction stream. However, unlike the 21264, if the instruction following the CMOV is the NOP that is described in Section 2.11.2.2, the 21464 replaces that NOP with the CMOV2, instead of creating a new space. That allows the 21464 to map up to four CMOV instructions per cycle. This pair of instructions is called the native CMOV; its implementation is described in Section 2.11.2.5. The pair of native CMOV instructions is mapped at full bandwidth and they require no further treatment in the 21464 pipeline. 2.11.2.1 Integer CMOV Specification CMOV instructions use the architected integer operate instruction format: CMOVxx Ra.rq, Rb.rq, Rc.wq Ra.rq, #b.ib, Rc.wq The operation consists of testing Ra for the condition specified by the xx condition and, if true, the value in Rb is written to register Re, as follows: IF TEST(Rav, Condition_based_on_Opcode) THEN Re<-- Rbv The different conditions specified by the function field are: CMOVxx Opcode.Function Field Condition Under Which Re<- Rbv CMOVEQ 11.24 CMOVGE 11.46 Re <-- Rbv if Rav = 0 Re <- Rbv if Rav ;;::: 0 CMOVGT 11.66 Re <-- Rbv if Rav > 0 CMOVLBC 11.16 Re <-- Rbv if Rav bit 0 is clear CMOVLBS CMOVLE 11.14 11.64 Re <-- Rbv if Rav bit 0 is set Re <-- Rbv if Rav ::;; 0 CMOVLT 11.44 Re <-- Rbv if Rav < 0 CMOVNE 11.26 Re <- Rbv if Rav -:F- 0 Compaq Confidential 2-32 Architecture Overview 5 January 2001 - Subject To Change lmplementath.·:mMSpecific Architecture Features As described in Section 2.11.2, the 21464 breaks CMOV instructions into CMOVxxl and CMOV2. For each of these instructions, the CMOVxxl instruction has the form: CMOVxx Ra.rq, Rc.rq, Rc.wq For each of these instructions, the CMOV2 instruction has the form: CMOV2 Rc.rq,Rb.rq,Rc.wq Rc.rq, #b.ib, Re. wq CMOV2 has opcode/function field 11.68, which is currently an unused function field in the Alpha architecture. Because the architecture does not require that unused function code to trap, there is no conflict with the 21464 opcode detector. 2.11.2.2 Native CMOV The native CMOV-nop that is recognized and replaced with CMOV2 is: NOP R31, R31, R31 NOP has opcode/function field 11.20 (same as BIS). 2.11.2.3 Floating-Point FCMOVxx Specification Floating-point CMOV instructions use the architected floating-point operate instruction format: FCMOVxx Fa.rq, Fb.rq, Fe. wq The operation consists of testing Fa for the condition specified by the xx condition and, if true, the value in Fb is written to register Fe, as follows: IF TEST(Fav, Condition_based_on_Opcode) THEN Fe <--Fbv The different conditions specified by the function field are: FCMOVxx Opcode.function field Condition Under Which Fe <- Fbv FCMOVEQ FCMOVGE FCMOVGf FCMOVLE FCMOVLT FCMOVNE 17.02A 17.020 17.02F 17.02E 17.02C 11.02 Fe<- Fbv if Fav = 0 Fe<- Fbv if Fav;;::: 0 Fe<- Fbv if Fav > 0 Fe<- Fbv if Fav:::;; 0 Fe<- Fbv if Fav < 0 Fe <-- Fbv if Fav "# 0 As described in Section 2.11.2, the 21464 breaks FCMOV instructions into FCMOVxxl and FCMOV2. For each of these instructions, the FCMOVxxl instruction has the form: FCMOVxxl Fc.rq, Fc.rq, Fe. wq The FCMOV2 instruction has the foon: FCMOV2 Fc.rq, Fb.rq, Fc.wq FCMOV2 has opcode/function field 17.068, which is currently an unused field in the Alpha architecture for that opcode. Because the architecture does not require that unused function code to trap, there is no conflict with the 21464 opcode detector. 
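The net effect of the two-instruction decomposition can be summarized behaviorally as follows. This is a modeling sketch only: the intermediate value produced by CMOVxx1 carries the old Rc value together with the test outcome (held in a 65th predicate bit by the implementation, as described in Section 2.11.2.5), and CMOV2 then selects between that carried value and Rb. The struct below is a modeling convenience, not a hardware format.

```c
/*
 * Behavioral sketch of the CMOVxx1 / CMOV2 decomposition.
 */
#include <stdbool.h>
#include <stdint.h>

struct cmov_temp {
    uint64_t value;        /* old Rc value, copied forward */
    bool     predicate;    /* result of the CMOVxx test on Ra */
};

/* CMOVxx1 Ra, Rc_old -> temp : evaluate the condition, carry old Rc. */
struct cmov_temp cmovxx1(bool condition_on_ra, uint64_t rc_old)
{
    struct cmov_temp t = { rc_old, condition_on_ra };
    return t;
}

/* CMOV2 temp, Rb -> Rc_new : pick Rb if the predicate was set. */
uint64_t cmov2(struct cmov_temp t, uint64_t rb)
{
    return t.predicate ? rb : t.value;
}

/* Net effect matches the architected definition:
 * if TEST(Rav) then Rc <- Rbv, else Rc keeps its old value. */
```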
Compaq Confidentia I 5 January 2001 -· Subject To Change Architecture Overview 2-33 lmplementation~Specific Architecture Features 2.11.2.4 Native FCMOV The native FCMOV-nop that is recognized and replaced with the above FCMOV2 is: FNOP F31, F31, F31 FNOP has opcode/function field 17.020 (same as CPYS). 2.11.2.5 Implementation 2.11.2.5.1 Native CMOV At Icache fill time, the Ibox does a partial decode of the 16 instructions being loaded into the !cache. Within each halfblock of eight instructions, pairs of CMOVs and CMOV-nops are detected, and the CMOV-nop is replaced by the CMOV2 instruction. The CMOV-nop instruction is only decoded to a degree sufficient to guarantee that it is an effective NOP. This includes detecting that the destination register is number 31 and making sure that the opcode is 11 or 17. Predicate Bit When the Ebox (or Fbox in the case of FCMOV) sees the CMOVxx Ra, Re, Re instruction, it tests Rav for the xx condition, copies Rev into the low 64 bits of the renamed Re register, and if the xx condition is true, sets a sixty-fifth bit (the predicate bit) in the register. If the condition is false, the bit is cleared. When the Ebox (or Fbox in the case of FCMOV) sees the CMOV2 Re, Rb, Re instruction, it tests the predicate bit in Re, and if set, copies Rbv into the new Re. If the predicate bit is not set, the Ebox (or Fbox) copies Rev into the new Re. The predicate bit is never set unless the 21464 is in the middle of executing the two parts of a CMOV instruction. A CMOV2 with the predicate bit clear is a NOP, since it copies Rev into Re. Since interrupts are taken on aligned eight-instruction boundaries, and CMOV does not cause exceptions, the 21464 never takes an interrupt or exception with the predicate bit set. A CMOV2 instruction can be executed in isolation if software branches to the CMOV2 half of a native CMOV sequence. The original placeholder with a destination of R31/ F31 has been remapped to a CMOV2 with the same destination as the original CMOVxx instruction. Since the predicate is guaranteed to be false, the CMOV2 instruction is effectively a NOP that just copies Re to Re. Execution within PALmode Because the shadow register replacement process in PALmode is keyed to different registers numbers for Rb and Re, the 21464 does not correctly replace the inserted reference to Re for native CMOVxxl instructions in PAL mode. See Section 17.4 for information. 2.11.2.5.2 Legacy CMOV Legacy CMOV s are CMOV instructions not followed by the designated native CMOV-nop instruction. Legacy CMOV instructions are detected at !cache fill time, and a predecode bit is set for each such instruction. When this instruction is fetched, the Collapsing Buffer notices the set bit and create~ a CMOV2 instruction by making a whole new instruction chunk. This new chunk can still be merged with the next fetchchunk, but this method is limited to mapping at-most one CMOV per cycle. Compaq Confidential 2-34 Architecture Overview 5 Jam1c1ry 2001 ··· Subject To Change Interrupts 2.11.3 Mapper Alignment Although the 21464 hardware tries to schedule instructions in an optimal way, there are occasions where software would like some control of how instructions are mapped and assigned to functional units. For this purpose, the 21464 defines the MAP_ALIGN instruction. When MAP_ALIGN is placed in the last slot of an aligned half-block of eight instructions, it causes that chunk to start a new map-chunk when mapped. 
That is, the last chunk is not merged in the collapsing buffer with the previous fetch chunk. The encoding for the MAP_ALIGN instruction is: XOR R31, R31, R31 The Opcode/function field is 11.40 Implementation At !cache fill time, the Ibox looks for the MAP_ALIGN instruction in the last slot of the aligned fetch chunk. If found, it sets the MA predecode bit. When this chunk is fetched, the Collapsing Buffer sees the set predecode bit and starts a new map-chunk, beginning with the current fetch chunk. See Table 3-17. This instruction is only partly decoded. Probably all instructions of the type XOR *, *, R3 l have the effect of starting a new map block when fetched as the last instruction in a fetch chunk. 2.12 Interrupts Interrupt handling in the 21464 is unlike most earlier processors in three important respects: • It has no external mechanism for continuously-asserted interrupt requests; all requests are made as network transactions, and held in the processor awaiting service. This implies a requirement for handshaking around the clearing of interrupt requests, to ensure that future requests are propagated to the processor. • The processor has multiple threads, each capable of running PAL code, interruptlevel service, or user code while the others are active. This implies a requirement for interlocking among the threads which might be servicing interrupts which was not necessary in earlier uniprocessors. • 1/0 devices can implement a programmable Interrupt ID register, whose value can be sent with an interrupt request to permit PALcode to vector directly to the appropriate service routine. External interrupt requests are transmitted through the network as IOWr messages to a block of processor-specific registers. The requestor will receive WrIOAck, except in the case that the message is directed to the Interrupt ID (IID) queue, and that queue is full, when the response will be WrIONack. After receiving WrIOAck, the requestor is expected not to retransmit the request until it has received an explicit release from interrupt software, or it times out. After WrlONack, the requester can choose to send the request to another processor, retry the same one, or wait for a software timeout. Compaq Confidential 5 January 2001 -~Subject To Change Architecture Overview 2-35 AMASK and IMPLVER Instruction Processing and Values 2.12.1 IPR Access Mechanism 2.12.1.1 HW_MFPR and HW_MTPR PALcode Instructions PALcode uses the HW_MFPR and HW _MTPR instructions to access the internal processor registers. The HW_MFPR instruction reads the value from the specified IPR into the integer register specified by the Ra field. The HW_MTPR instruction writes the value from the integer register specified by the Rb field into the specified IPR. See Section 17 .2 for information. 2.13 AMASK and IMPLVER Instruction Processing and Values The AMASK and IMPLVER instructions appear to the rest of the 21464 as normal Integer Logic Box (Opcode 11) instructions, but are handled specially by the Ebox. The Ebox ignores the registers specified in the instruction and forces the CPU feature mask constant onto the Ra operand bus whenever an AMASK instruction is decoded and the implementation version constant onto the Rb operand bus whenever the IMPLVER instruction is decoded. 
For both these instructions the logic box performs the following operation: Re = Rb & -Ra; Given that the Alpha SRM requires Ra== R31, the equations reduce to: AMASK IMPLVER Re= Rb &-CPU_feature_mask Re= Implementation_version The current constant values are: CPU_feature_mask (AMASK) Implementation_version (IMPLVER) Ox1F07 Ox04 2.14 Performance Monitoring Performance monitoring hardware provides information about the running CPU in order to: • Drive profiling-directed-feedback optimizations to improve application performance. • Guide the OS Scheduler to better utilize the TPU contexts. • Provide architectural feedback for future alpha microprocessor and system implementations. To satisfy those goals, the 21464 supports three types of performance monitoring: • An instruction-based profiling algorithm called ProfileMe. Instruction-based profiling is performed by sampling the dynamic instruction stream running on the 21464. Sampled instructions are chosen at fetch time based upon a software-programmable IPR and are monitored while in-flight in the CPU. Latencies and events are recorded for two separate instructions into a set of profile record IPRs. When both instructions have finished utilizing CPU resources, a general interrupt to PALcode is triggered. Compaq Confidentia I 2-36 Architecture Overview 5 J,1nu,1ry 2001 -·Subject To Cfumge Periormance Monitoring ! . The general interrupt service routine reads the INTERRUPT-SUMMARY IPR to detennine that the interrupt was caused by an instruction profile event. A privileged PAL routine can then read out the associated data for each profiled instruction by issuing MFPRs to the profile record IPRs. In continuous sampling, software would record the data from the current sample and reinitialize the software-programmable IPR to begin the process for selecting the next pair of sampled instructions. • Aggregate event-based performance counters for monitoring IPC per TPU, as well as intra-thread resource contention of Caches, TBs, and the branch predictor. Aggregate performance counters provide expedient insight into chip resource contention problems, especially among processes running on the different TPUs simultaneously. The most potentially problematic resources are the caches (!cache, Dcache and Scache ), the translation buffers (ITB, DTB) and the branch predictor. Misses/mispredicts in each of these structures can be counted. Overall performance can also be monitored by using the cycle counter and the retired instructions counter to obtain retired instructions per cycle per TPU. There are three aggregate performance counters: the cycle counter, the retired instructions counter, and one general event counter that can count one of the other specified events (!cache miss, Dcache miss, Scache miss, ITB miss, DTB miss or Branch mispredict). The retired instructions, and general event counter are actually four counters that count events per TPU simultaneously. • Hardware for monitoring memory addresses that was developed for the 21364 and is being supported by the 21464. Memory reference performance monitoring hardware is identical to that of the 21364. While the 21464 designers intend to support the same functionality, this specification may change to reflect architectural differences in the memory subsystem of the two processors. Instead of IPRs, this performance monitoring hardware is controlled and collected via IO mapped CSRs. There are separate sections for the Cbox, Zbox and Rbox. 
Compaq Confidential 5 Jam.u1ry 2001 - Subject To Change Architecture Overview 2-37 Performance Monitoring Compaq Confidential 2-38 Architecture Overview 5 Jc1nw1ry 2001 -- Subject To Change Features 3 Instruction Fetch Unit - the lbox The Ibox is the instruction fetch engine for the 21464. It is responsible for providing high instruction stream bandwidth to the remainder of the chip. Specifically, the Ibox delivers instructions directly to the Pbox, which is responsible for instruction number (INum) resource management, dependence analysis, and register renaming. From there, instructions proceed to the Qbox, where they await the resolution of their source register dependencies. Once an instruction's register dependencies have been resolved, it is issued, provided that it wins arbitration for an appropriate functional unit in the Ebox (arithmetic and logic integer operations), Fbox (arithmetic floating point operations), or Mbox (memory operations). When an instruction has completed execution, it retires when it is the oldest non-retired instruction in the machine for the appropriate Thread Processing Unit (TPU) context. Instruction stream bandwidth is one of the major factors in overall chip performance. A program cannot execute faster than the rate of instructions entering the machine. Achieving sufficient instruction bandwidth for a machine that can execute up to eight instructions per cycle poses several challenges. In order to meet those challanges, the Ibox contains many new features that were not designed into prior Alpha implementations. 3.1 Features The Ibox is responsible for: • Delivering up to eight instructions per cycle to the remainder of the machine • Maintaining the correct program counter (PC) while the CPU executes programs • Receiving interrupts and exceptions to properly redirect the machine The following new features have been added to the Ibox to support high bandwidth instruction stream fetching, advanced control flow prediction, simultaneous multithreading (SMT), and memory dependence prediction: • • • • • • Fetching up to two potentially non-contiguous cache blocks per cycle Fetch TPU Chooser - to create a resource-balanced SMT fetch engine Advanced Branch Prediction - predicting up to 16 branches per cycle History based Jump Target Prediction . Collapsing Buffer- to facilitate over-fetching and merging fetch blocks Memory Dependence Prediction using Store Sets Compaq Confidential 5 January 2001 ··· Subject To Change Instruction Fetch Unit - the lbox 3-1 Major Sections • Advanced Hardware I-Stream pre-fetching • Simultaneous Multithreaded Fill Unit • Anti-thrashing Instruction Cache fill policy 3.2 Major Sections Figure 3-1 Shows the Ibox block diagram. Following the figure is a list of the major sections. Figure 3-1 lbox Block Diagram FromEBox FromQBox Checkpoint Unit Index Unit Control Prediction Unit PC Unit Instruction Unit ToPBox 8 instructions Fill Unit ToMBox FromCbox Compaq Cordidentia l 3-2 Instruction Fetch Unit-the lbox 5 January 2001 ···Subject To Clumge Major Sections The Ibox can be thought of as containing the following major sections: Table 3-1 lbox Major Sections Name Description Section Checkpoint Unit The Checkpoint Unit maintains state for restarting the CPU in the event of an exception, and trains the control flow predictors and the memory dependence predictor. 
3.9 Upon an exception, the Checkpoint Unit resets the following to the state that existed just before the fetch of the instruction that caused an exception: the PC, branch predictor, jump target predictor, and return stack . The Checkpoint Unit also keeps training information for the branch and jump target predictors, to train the predictors at the retirement time of branch or jump instructions. Control Flow Prediction Unit The Control Flow Prediction Unit predicts PC changes at fetch-time for instructions that can change control flow when executed. 3.6 Control flow instructions are conditional branches, computed jumps, and subroutine returns. There is a dedicated predictor for each: the conditional branch predictor, the jump target predictor, and the return address stack. Fill Unit 3.8 The Fill Unit fetches instructions from lower-level memory. The Fill Unit can simultaneously fetch instruction blocks for multiple TPUs. The Fill Unit also maintains a dynamic hardware prefetcher that attempts to fill the !cache with blocks that would have missed in the future. The Fill Unit also contains the !cache Translation Buffer (ITB) that must translate virtual PC miss addresses to physical addresses before making memory requests. Index Unit 3.4 The Index Unit produces up to two indices per cycle. The indices are usually predictions from the Line Predictor that are used to access the !cache, Branch Predictor, and Store Sets Array. The index unit also contains the Fetch TPU Chooser that arbitrates among multiple TPU s that are ready to fetch instructions. The produced indices have an associated TPU that is sent along with them down the Ibox pipeline. The Line Predictor itself consists of a sequential and non-sequential component, to address the sequential and non-sequential code sequences of the running programs. Instruction Processing Unit The Instruction Processing Unit stores and retrieves instructions and their associated tags and data, and contains the following: 3 .5 1 The 64KB !cache and it's associated tag array. Instruction pre-decode bits are also stored in the !cache Data and Tag Arrays to speed instruction processing in the Ibox and instruction format decoding in the Pbox. 2 The Store Sets Array, which produces memory synchronization identifiers (store sets) for potentially every load and store operation. The store sets instruct the Pbox to create explicit dependencies between certain loads and stores. 3 The Collapsing Buffer, which stores instruction blocks that are driven by the !cache and collapses up to two instruction blocks per cycle to deliver up to 8 instructions per cycle to the Pbox. PC Unit 3.7 The PC Unit maintains the program counters for each TPU. Mostly it calculates PCs based upon the exiting instructions of the fetch blocks (for example, branches, jumps, returns, fall-through, and so forth), but it also can be reset by interrupts and exceptions. The PC Unit is also determines !cache misses, index mispredicts, and way mispredicts in the Ibox pipeline. Compaq Confidential 5 January 2001 ···Subject To Change Instruction Fetch Unit - the lbox 3-3 Forward Path Pipeline 3.3 Forward Path Pipeline The main Ibox pipeline is shown in Table 3-2: Table 3-2 lbox Main Pipeline 10 11 12 13 14 TPU Select Index Gen. Icache Access Collapse Drive to Pbox BPR Predict PC Cale JPR Predict RPR Predict IO The Index Unit comprises the functionality in stages IO and Il. In IO, the Fetch TPU Chooser arbitrates among TPU's that are ready to fetch instructions, and selects one each cycle. 
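The Fetch TPU Chooser policy summarized in Table 3-1 and detailed in Section 3.4.1 (each cycle, pick the ready TPU that is consuming the fewest Ibox pipeline resources, breaking ties round-robin) can be sketched as follows. The ready qualification and the resource metric are simplified placeholders.

```c
/*
 * Illustrative Fetch TPU Chooser: choose the ready TPU with the fewest
 * in-flight Ibox resources; ties resolve round-robin after the last pick.
 */
#define NUM_TPUS 4

struct tpu_state {
    int ready;        /* not stalled on an Icache fill or buffer space */
    int resources;    /* e.g. in-flight fetch chunks + collapsing buffer entries */
};

/* Returns the chosen TPU, or -1 if no TPU is ready this cycle. */
int fetch_tpu_choose(const struct tpu_state tpu[NUM_TPUS], int last_pick)
{
    int best = -1;

    for (int i = 1; i <= NUM_TPUS; i++) {
        int t = (last_pick + i) % NUM_TPUS;   /* rotate the starting point */
        if (!tpu[t].ready)
            continue;
        if (best < 0 || tpu[t].resources < tpu[best].resources)
            best = t;                         /* strictly fewer resources wins */
    }
    return best;
}
```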
11 The Line Predictor generates up to two valid Icache indices for the selected thread. The Icache indices are predicted because the accessing PCs are not known this early in the pipeline. I2 The Icache is accessed, and two blocks of up to eight instructions each are read out, along with their corresponding tags and other information. In parallel with the Icache access, the control flow predictors operate to provide conditional branch, jump target or return address predictions for branch, jump or return instructions that are being read out of the Icache simultaneously. 13 The instructions, along with the control flow predictions, provide enough information to calculate two PCs. The PCs are compared with Icache tags and Line Predictor indices to determine whether the fetches hit in the Icache and the predicted indices were correct. If the Icache accesses were correct, the instructions are buffered in the Collapsing Buffer, which reads out up to two fetch blocks per cycle and collapses the instructions into an eightinstruction map block. I4 The map blocks are sent on to the Pbox for mapping. 3.4 Index Unit 3.4.1 Fetch TPU Chooser The Fetch TPU Chooser (FTC) is responsible for choosing one of the TPUs each cycle. The chosen TPU's indices will be driven to the !cache and Branch Predictor. Each cycle, the FTC will choose the TPU that is consuming the fewest Ibox pipeline resources, and is ready to fetch instructions. Ties can occur, and are broken using a round robin algorithm. In order to monitor Ibox pipeline utilization, the FTC receives input from the collapsing buffer each cycle that indicates the number of entries consumed. The FTC also receives input from the pipeline latches at each stage to monitor the number of in-flight fetch chunks that may eventually consume entries in the instruction buffer. The FTC is responsible for not selecting a TPU if fetching its' corresponding instructions will overflow the instruction buffer. The FTC evaluates whether a TPU is ready to fetch instructions each cycle by receiving input each cycle that either enable or disable a TPU for arbitration. A TPU will be disabled from arbitration if it is awaiting a pending !cache Compaq Confidential 3-4 Instruction Fetch Unit - the lbox 5 Jc1nwiry 2001 - Subject To Change Index Unit fill, or if that TPU's collapsing buffer is about to become full. The TPU will be reenabled for arbitration when the pending fill returns or when some collapsing buffer entries are freed. The FTC is also responsible for ensuring that every TPU makes forward progress. It does this by detecting when a TPU has not mapped real instructions for a very long time, and stalling the other TPUs until the starving TPU maps instructions. The CPU can be configured to run 1,2,3 or 4 TPUs. An IPR will indicate whether each of the TPUs is "alive" or not. Clearing a TPU's "alive" bit will disable that TPU, that is, the machine will no longer fetch instructions from that context. 3.4.2 Line Predictor Figure 3-2 Line Predictor Block Diagram PC Table Line Predictor PC Cale !cache The primary function of the Line Predictor is to provide two indices to the Icache by which it can look up instructions. The index for the Icache are bits <15:2> of the full address. Bit <15> in the Icache index is a way bit; it selects which way or set stores the instructions. One way bit implies that there are only two ways a fetch block can be stored in the Icache. 
The way bit <15> can be inverted in contrast to the original bit 15 of the PC to place it the other way inside the Icache. This mechanism stops two different addresses that have the same lower bits <15:2> from occupying the same cache slot. This is also known as thrashing. To maximize effectiveness, there are three different prediction arrays. Already it can be seen that there is a need to predict two indices. One array could be used to predict both indices, but performance can be dramatically increased if some optimizations are done. First, predicting the two indices separately allows different index schemes and thus better independent predictions for the two slots. Secondly, a sequential predictor requires less area by storing a single bit that indicates a sequential index is to be predicted (a Compaq Confidential 5 January 2001 --· Subject To Change Instruction Fetch Unit - the lbox 3-5 Index Unit sequential index can be generated with an adder and the current index). A non-sequential array can then have more room for storing purely non-sequential indices while the sequential array can be made quite large due to its small storage requirements. It is infeasible to implement a sequential index generator for Slotl due to timing constraints, therefore only SlotO prediction will have a sequential predictor in addition to a nonsequential predictor. While the line predictor is in a free-running state, meaning that it's prediction is perfect, the line predictor can simply obtain its input index from its own output. This is the secondary function of the Line Predictor, to provide itself with a lookup index. The index that the Line Predictor will use to index itself happens to be the second index it sends to the Icache (Slotl). The read index for the Line Predictor is actually broken up into three indices to access each of it's three arrays. Additionally, the Line Predictor uses a "squash" bit to index itself. There is some hashing involved of the bits to get the final read indices. The three arrays are: slotO nonsequential array, slotO sequential array, and the slotl nonsequential array. Each array is indexed slightly differently, but they all use the same bits. Since the two non-sequential arrays are actually smaller than the addressable index space, hashing is employed to yield the best performance. For SlotO, there are two hash indices, sequential and nonsequential. • Sequential - <14:5>,<15>,<4:2> • Nonsequential - <14:5>,<15>,(<4> I <3> I <2>) For Slotl there is just one hashed nonsequential index • <14:5><15>,(<3> I <2>),(<16> /\ <4>) Bits <14:5> are commonly decoded for all three predictors. Since the arrays can do the hashing on the fly, only non-hashed index bits need to be stored in the nonsequential arrays. The SlotO sequential array has 16k entries and the SlotO nonsequential array has 4k entries. The Slotl nonsequential array has 8k entries - more than SlotO nonsequential because Slotl does not have the benefit of a sequential predictor. The Line Predictor index also has a additional "squash" bit. Sometimes the backend of the Ibox pipe can't handle slotO and slotl at the same time. Without a squash mechanism, the Line Predictor would have restart Slotl via a line mispredict (which costs 2-3 cycles for that TPU). Instead, the Line Preditor will be indexed by the squash bit instead of bit <15>. In the normal, no-squash case, the squash bit is the same as bit <15>. When PCC detects the squash case, it will simply invert the squash bit from the normal case. 
So now the Line Predictor can be trained with an "alternate" index by inverting the squash bit so that it now is the inverse of bit <15>. This new "alternate" index will be trained to re-predict slotl (in the slotO position) again. It's up to the PC calc section to flip the squash bit for the Line Predictor after it first realizes it can't handle both slots. This way the Line Predictor can keep moving without taking a mispredict for Slotl every time the backend can't process it. Compaq Confidential 3-6 Instruction Fetch Unit-the lbox 5 January 2001 - Subject To Change Index Unit 3.4.3 Thread Index Latches 3.4.3.1 {Re)Starting/Resuming the Pipe When the Line Predictor mis-predicts, it needs to be restarted with an index other than its own output (because it's bad path now). There needs to be some mechanism of generating an index for the !cache and Line Predictor from somewhere other than the Line Predictor itself. The simplest way to provide this capability is to put a mux on the lookup index that picks between the Line Predictor output and an alternate PC. Old predictions for a sleeping thread also need to be stored until the thread is awakened. We have many different sources for an alternate PC. This forms the basis for the thread index latches. In general, PCs from all restarts come from one of three places : • PAL BASE+ OFFSET • Checkpoint Tables. Jump/Return addresses, alternate PC's, etc. that are stored here. • PCO and PC 1 - Calculated values for the PC from PC Cale section There are correspondingly three types of restarts: exceptions, misprediction, and thread resume. 3.4.3.1.1 Exceptions There are three types of exceptions that can change the PC: Post Map, Ibox internal, and interrupts. • Post Map Post Map exceptions have top priority. All indices for Post Map exceptions are received from the Checkpoint Table, while the signaling of the event can come to the Ibox through two interfaces: the fast path and the Efunnel (Exception Funnel). Fast path exceptions are signaled by the Ebox and Qbox, while the Exception Funnel is entirely contained within the Pbox. Exceptions that are caused by mispredicted conditional branches can use the special fast path bypass, which reduces the mispredicting branch penalty. These exceptions can only be acted upon if nothing is coming from the exception funnel that cycle. The Qbox sends the Ibox the INum, TPU, and prediction of the oldest issued CBR every cycle while the Checkpoint Table sends a restart index to the Line Predictor. Two cycles later, the Ebox will send the result of the prediction for that CBR and if there is a mispredict, the exception is taken and the new index is ready to load into the index latches. The Exception Funnel is a Pbox widget that filters exceptions such that only the oldest exceptions are signaled to the Ibox. It works by the Pbox sending the Ibox a signal indicating what restart address to use and which TPU is excepting. The Checkpoint Table will use this information to select an index to send to the Line Predictor. Since there are all sorts of delays between boxes inside the 21464, there must be some kind of guard against taking a bad-path (younger) exception after an exception has already occurred until the kill has destroyed all remnants of the bad path. To take a bad-path exception is bad, very bad. Older exceptions are ok, however, since they are on the good path by definition. The main problem faced here is that it is not known when all bad-path instructions have been killed. 
Luckily, there is a large window of opportunity. Compaq Confidential 5 January 2001 -·Subject To Change Instruction Fetch Unit - the lbox 3-7 Index Unit Kills happen relatively quickly compared to how long it takes for the first good path instructions to get issued. What's needed is a window of time after an exception is taken during which younger exceptions will be masked out. This is just implemented as a counter. The count wil be determined as follows: Ibox pipeline stages + Pbox pipeline stages+ Qbox pipeline stages until earlist possible issue, which is hopefully longer than the kill time for bad-path instructions. Note that this covers both the fast-path CBR exception interface and the Efunnel interface • lbox internal Ibox internal exceptions are medium priority, only losing priority to the Post Map exceptions. All indices for these exceptions come to the thread index latches by way of line mispredicts and non-index faults (see below). This means that the Ibox internal exception indices are not directly fed to the thread index latches. Instead they are sent to PC Cale where they will cause an index mispredict. Pipe control will guarantee that only one TPU can take an Ibox internal exception per cycle. There are five types of internal exceptions: Reset, Warm Reset, Uncorrectable ECC error, ITB miss, and Read Access Violation. Each of these will be described in more detail in another section. Although the indices physically from PC Cale, they in fact originate from the Checkpoint Tables as a PAL BASE + OFFSET. • Interrupts Interrupts have the lowest priority. The Cbox sends the Ibox a 4 bit TPU vector indicating an interrupt on that TPU. Then pipe control will load Pal Base + Offset into the PC latches. This will cause a line mispredict and PCC will send the correct index to the Line Predictor to start on. This is the same mechanism as for Ibox internal interrupts. It's important to note that even though Ibox internal exceptions and interrupts have a priority ordering, the Line Predictor can not distinguish the difference between the two. It is up to pipe control to prioritize these. 3.4.3.1.2 Misprediction - PC Cale When the Line Predictor mispredicts, the new start index comes from the PCC (PC Cale) section. There are two possiblites: PCO and PCl, depending on which slot mispredicted. Additionally, a restart can ocurr a cycle later because of a tag problem this is called a non-index fault. In this case, the PC must be piped one cycle to line up with the restart indication. Here are all the restart cases signaled to the line predictor from PCC in order from youngest to oldest: 1. SLOTO mispredicts This mispredict is the youngest mispredict and therefore has the least priority. The index comes from PC CALC which must go directly into the thread latch. PCC will signal this case late in 13 which makes it a critical path into the thread latches. 2. SLOTl mispredict Again, the index comes from PC CALC which must go directly into the thread latch and is signaled late in I4A so this is also a critical path to write PCl into the thread latch. 3. SLOTO non-index fault Compaq Confidential 3-8 Instruction Fetch Unit - the lbox 5 Jc1nuary 2001 m Subject To Chtmge Index Unit The index here is the same as SLOTO mispredict, but is piped by one cycle (I4A) in the index latches. A way mispredict can be signaled additionally by PCC in I4A. In this case, bit <1 S> of the PC needs to be inverted before writing into the thread index latch. 
Also, a bit is sent with this new "inverted" index to tell the PC CALC section not to way mispredict again. 4. SLOTl non-index fault The index is the same as SLOTl mispredict, but is piped by one cycle (ISA) in the index latches. Again, a way mispredict can be signaled additionally by PCC in ISA. In this case, bit <1 S> of the PC needs to be inverted before writing into the thread index latch. Also, a bit is sent with this new "inverted" index to tell the PC CALC section not to way mispredict again. 3.4.3.1.3 Thread Resume - Line Predictor (two indexes) When the Fetch Thread Chooser switches threads, the predictions for the previously active thread need to be saved so that when that thread is reselected in the future, the indices for SLOTO and SLOTl are ready to index the Line Predictor and !cache. Other wise, you would lose performance, as the Line Predictor would be generating indices for the thread that just stopped and not the thread that just started. This is the default index source if the thread is selected and no other exceptions have happened. 3.4.3.2 Other Index Latch Tracking Functions There are few more things the index latches need to track besides indexes: Slotl Valid, bank conflict, ITB enable, squash prediction, way mispredict restarts, and a guard mechanism: • Slotl Valid Slotl valid is set if the index came from the line predictor. For all other cases it is invalidated. • Bank Conflict This means a read and write to the same bank has occurred. Writes take precedence so the predictions that come out of the line predictor in the next cycle are not valid. A bank conflict signal is sent to pipe control (PCC) so that it can invalidate the cycle and the next cycle the index is retried. • ITB enable When a uITB miss has ocurred, the pipe has to be restarted and the ITB n~ds to be enabled to process the !cache miss. The thread latches hold on to the ITB enable state so that when the TPU is selected, the ITB can be enabled. • Squash Prediction As explained previously, the line predictor arrays hold a squash bit for squash prediction. Squash prediction is calculated by XORing bit <14> from the prediction with the squash bit. This prediction is then stored in the index latches so that the prediction can be sent down the pipe when the TPU is resumed from sleep. • Way Mispredict Restart When a way mispredict is signaled from PCC, a bit of state needs to accompany the restart index to indicate that a way mispredict is to be ignored. • Guard Compaq Confidential 5 January 2001 -- Subject To Change Instruction Fetch Unit - the lbox 3-9 Index Unit As a safety precaution, there is a guard mechanism set for one cycle after any restart. The guard causes the index latches to ignore any PC Cale exceptions (like mispredicts ). In theory, PC calc should not signal a exception after a restart since all pipe stages are killed. The first possible index that could cause an exception is the restart index. The guard is in place only in case pipe control can't kill it's pipe stages in time. 3.4.4 Thread Training Latches In order for the Line Predictor to actually predict correctly, it needs to be trained to predict the correct indices when it is wrong. Training will require the corrected index to write into the array, the index to write this new data with, and a write enable signal. In truth, there are three separate arrays that are trained independently. SlotO Sequential, SlotO nonsequential, and Slotl nonsequential are all read simulataneously. 
This means that the training index (the write index) for all three arrays is the same. In fact, word lines are shared between all three arrays for both reads and writes. Writes, however, are exclusive between SlotO and Slotl. This is true because if SlotO mispredicts, Slotl is killed so it is not known if it's prediction was correct. Slotl could be trained with the data that was already in the array but this is difficult to implement so Slotl and SlotO will have exclusive write enables. Similarly, if Slotl mispredicts it must mean that SlotO was predicted correctly so SlotO doens't need to be trained. SlotO also has another case: sequential/nonsequential training. There are three training cases: • SlotO predicted sequential and mispredicts (nonsequential) -The sequential array must be written with a 0, or nonsequential prediction. The nonsequential array must be trained with the nonsequential index. Two arrays are trained at the same time. • SlotO predicted nonsequential and mispredicts nonsequentially - The nonsequential prediction was wrong and needs to be trained with a new nonsequential index. • SlotO predicted nonsequential and mispredicts sequentially - The sequential array needs to be trained for a sequential prediction. The nonsequential array is left alone. It is important that the nonsequential isn't written in this case even though it may seem harmless. The truth is that the sequential array has many more entries than the nonsequential array. The nonsequential prediction may have actually been the prediction for a different index that happened to alias to the same entry in the array. In this case, the nonsequential prediction should be left alone since it may be an accurate prediction for a different index. So now it can be seen that three write enables are needed. SlotO Seq, SlotO Nonseq, and Slotl (nonseq). There are two more pieces to training. The training index and the training data. The training index is just the index used to access the predictions that were wrong. The job of the training latches is to hold on to this index for each thread. The training data is just the restart PC index bits sent back by PCC plus the squash bit. Training will only ocur for a thread when training data is available, an index mispredict or way mispredict has ocurred, and that thread is selected by the thread chooser. Therefore, it is the training latches job to keep the write enables and write index for each thread until the thread is selected. The training data comes from the thread index latches since the write data is the same as the current read index in the line pred arrays. Compaq Confidential 3-10 Instruction Fetch Unit- the lbox 5 Jc·muary 2001 ·-Subject To Change Instruction Processing Unit If it happens that the index being restarted is the same as the write index used for the train then a condition known as bank conflict will ocurr. This means that both a read and a write are trying to access the same bank and the line predictor array can't handle both at the same time. When this happens, writes will take priority over reads. The cycle of the write there will be no vaild indices coming out of the line predictor. The pipe control must insert a bubble in the pipe for this thread since there are no valid !cache read indices. During the bubble cycle what will happen is that the read index that caused the bank conflict will be tried again so that the next cycle two valid indices will be read out of the line predictor and sent to the !cache. 
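Pulling together the three Slot0 training cases listed earlier in this section, the write-enable decision can be summarized with a small C sketch; the structure and names are illustrative, not the actual control logic.

    #include <stdbool.h>

    /* Illustrative summary of the Slot0 Line Predictor training cases. */
    struct slot0_train {
        bool write_seq;     /* train the Slot0 sequential array          */
        bool seq_value;     /* value written into the sequential bit     */
        bool write_nonseq;  /* train the Slot0 nonsequential array       */
    };

    static struct slot0_train
    slot0_training(bool predicted_sequential, bool actual_sequential)
    {
        struct slot0_train t = { false, false, false };

        if (predicted_sequential && !actual_sequential) {
            /* Predicted sequential, was nonsequential: clear the sequential
             * bit and write the corrected nonsequential index (two arrays). */
            t.write_seq    = true;  t.seq_value = false;
            t.write_nonseq = true;
        } else if (!predicted_sequential && !actual_sequential) {
            /* Nonsequential prediction was wrong: retrain the nonsequential index. */
            t.write_nonseq = true;
        } else if (!predicted_sequential && actual_sequential) {
            /* Mispredicted sequentially: train the sequential bit only and leave
             * the nonsequential entry alone (it may belong to an aliasing index). */
            t.write_seq = true;  t.seq_value = true;
        }
        /* Correct predictions require no training. */
        return t;
    }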
3.5 Instruction Processing Unit The Instruction Processing Unit consists of the !cache data array, the !cache tag array, store sets based memory dependence predictor, and the collapsing buffer. 3.5.1 lcache Data Array The !cache is 64KB. It is pseudo 2-way associative, with a thrash-remap fill policy. A cache block can be stored in one of two possible locations. Most blocks will be stored using direct mapped indexing. However, if two blocks are detected as repeatedly competing for the same direct mapped cache location then one of the blocks will be remapped by inverting the MSB of its index. This condition is detected in the Fill Unit using a thrash detector (see Section 3.8). The !cache array is made up of 8 banks. Each cycle the Ibox attempts to fetch two half blocks, or fetch chunks from the !cache, one for slotO, the other for slotl. If there isn't an !cache fill occurring in a given cycle, the slotO fetch is always allowed. The slotl fetch is only allowed if its fetch is for an entry in a different bank from the slotO fetch, or if it is for the exact same block (either half) in the !cache as the slotO fetch. This allows fetching two "fetch chunks" per cycle without double pumping the !cache, as long as the two fetch chunks are not for two different cache blocks in the same cache bank. It has been observed that pairs of consecutive fetch blocks in a variety of benchmarks are about 50% likely to be consecutive. Since consective fetch chunks are either on the exact same cache block, or are in the next cache bank (due to bank interleaving), sequential program access to the cache is guaranteed not to have a read bank conflict and should always be capable of reading two fetch chunks per cycle. Non-sequential fetch chunks that are separated by a multiple of 8 cache blocks will attempt to access multiple cache blocks in the same cache bank, in this case the SlotO read will be given priority over the slotl read. The 21464 provides the MAP_ALIGN instruction, which allows software some control over mapping fetch chunks, in disregard for the efficiencies just described. See Section 2.11.3 for information. When a cache miss occurs, a cache fill operation fetches and writes a full cache block of 16 instructions through the Fill Unit. Fills are given higher priority than either a slot 0 or slot 1 read. If a fill is occurring to the same bank as either a slot 0 or slot 1 read in a given cycle, neither read will be allowed, and the two reads could be replayed the following cycle. The Index Unit provides the two read indices (slot 0 and slot 1) to the !cache along with a valid bit per read index. The Fill Unit provides the write index during an !cache fill along with a valid bit per !cache bank. The !cache decoders contain logic that arbitrates slotO/slotl read conflicts, and the read/write conflicts. Compaq Confidential 5 January 2001 ···Subject To Change Instruction Fetch Unit- the lbox 3-11 Instruction Processing Unit The Icache Data Array is parity protected. 
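The slot 0 / slot 1 / fill arbitration described above can be paraphrased in a brief C sketch; the bank and block derivations, as well as all names, are assumptions made for illustration.

    #include <stdbool.h>
    #include <stdint.h>

    /* Illustrative single-cycle Icache arbitration: a fill to a conflicting bank
     * blocks both reads; otherwise slot 0 always reads, and slot 1 reads only if
     * it targets a different bank than slot 0 or the exact same cache block.
     * Block numbers are assumed to select one of the 8 banks by their low 3 bits. */
    struct icache_grant { bool slot0; bool slot1; };

    static struct icache_grant
    icache_arbitrate(bool fill_valid, uint32_t fill_block,
                     bool s0_valid,   uint32_t s0_block,
                     bool s1_valid,   uint32_t s1_block)
    {
        struct icache_grant g = { false, false };
        uint32_t fill_bank = fill_block & 7;
        uint32_t s0_bank   = s0_block   & 7;
        uint32_t s1_bank   = s1_block   & 7;

        /* Fills have priority; conflicting reads are replayed the next cycle. */
        if (fill_valid && ((s0_valid && s0_bank == fill_bank) ||
                           (s1_valid && s1_bank == fill_bank)))
            return g;

        g.slot0 = s0_valid;
        g.slot1 = s1_valid &&
                  (!s0_valid || s1_bank != s0_bank || s1_block == s0_block);
        return g;
    }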
The bits in an entire cache block of the Icache data array consist of:

Table 3-3 Icache Data Array Cache Block Contents

Bits              Description
I[15:0]<31:0>     Sixteen 32-bit modified instructions
CY[15:0]          One overflow bit for each of the 16 instructions
CI[15:0]          One incremented branch target carry bit for each of the 16 instructions
CM[15:0]          One CMOV/FCMOV predecode bit for the CBF, for each of the 16 instructions
PQ[15:0]<3:0>     Four predecode bits for the Pbox, for each of the 16 instructions
DP<9:0>           10 parity bits

3.5.2 Icache Tag Array

The Icache is virtually indexed, and virtually and physically tagged. The primary function of the Icache Tag Array is to hold and deliver the virtual address tags for the corresponding instruction blocks in the Icache Data Array. These tags are compared with the full virtual PC calculated in the PC Unit to determine if Icache accesses were hits. Address space numbers (ASNs) and the address space match bit (ASM) are also stored in the tag array to determine whether the virtually addressed block that was filled into the cache can be used by the current process that is accessing the cache, which has its own ASN assignment.

Physical tags are also stored in the cache to facilitate Icache sharing between two processes that are addressing the same physical memory but were not assigned the same ASN. The two processes must also be using the same virtual index bits to be able to share the Icache. If the physical tag stored in the Icache Tag Array is the same as the physical address of the translated virtual PC, the Ibox allows physical Icache hits. The reason for this is to allow multiple threads to share the Icache when running on different TPUs. There is more explanation about how Icache hits are determined in the PC Compare section of the PC Unit documentation.

As described above in the Icache Data Array section, the Icache is pseudo 2-way set associative. Occasionally, when a thrash is detected by the thrash detector in the Fill Unit, the cache will be filled using the alternate location (the same as the direct location's index, except the top bit is inverted). The Line Predictor in the Index Unit will learn to predict the alternate way for that fetch block. In order to do this, when the wrong "way" is accidentally fetched, the PC compare logic needs to determine that the wrong way was fetched and, instead of taking an Icache miss, simply try to fetch the instructions and tags at the alternate way's location. Instead of reading out two full tags, the Icache tag array stores a partial tag for the alternate way. If the partial tag matches the PC, but the primary tag does not, the Ibox attempts to fetch from the alternate way before initiating a cache miss. The alternate tag is 9 bits in the tag array: VA<23:15>.

The Tag Array also stores and retrieves a variety of other information to support several Ibox design choices and features:

• Each TPU belongs to one of four TPU groups. Instead of having one valid bit per TPU in the Icache, there is one valid bit per TPU group. Software can configure multiple TPUs to be part of the same group, in which case they can virtually hit on each other's Icache blocks, provided the ASNs match or the ASM bit is set.

• For Icache use by PALcode, a physical fill bit is stored. When the accessing TPU is fetching PALcode, it can hit on Icache blocks that have the physical-fill bit set and bypass the ASN comparison.
This is needed because PALcode is always physically mapped.

• When a block is filled, its corresponding protection level is written into the Icache tags. The protection level is either U, S, E, or K as specified in the Alpha SRM.

• The Fill Unit can write an istream block into the Icache even if it suffered an uncorrectable ECC error while residing in the Scache. Once the uncorrectable ECC error is detected it will trigger an interrupt, but to keep the Ibox from fetching and processing the bogus block, a bit indicating the uncorrectable error is set in the Icache Tags.

• In order to expedite PC calculation and fetch block processing, a number of instruction attributes are predecoded during Icache fills and then stored into the Icache Tag Array. These bits determine whether each instruction that was filled into the Icache was a conditional branch, unconditional branch, computed jump, or other instruction. They also determine whether the return predictor should be used (i.e., does the instruction perform a push or pop operation on the stack), and whether the jump target predictor needs to be used.

• The Icache Tag Array is protected by parity.

Here is a complete list of the contents of the Icache Tag Array for a 16-instruction block:

• TPU group valid<3:0>
• Virtual address <51:16>
• Physical address <47:13>
• Alternate virtual address <23:15>
• ASN<7:0>
• ASM
• USEK<3:0>
• Physical fill bit
• ECC uncorrectable bit
• Icache Tag Predecodes [15:0]<3:0>
• Parity<3:0>

Predecodes in the Icache Tag for each instruction of each fetch block:

Table 3-4 Icache Tag Array Predecode for Fetch Blocks

UE CB P2 P3 PUSH POP   Meaning
0 0 0 0 0 0            Fall through (CBR or JMP in PALMODE)
0 0 0 0 1 0 0 1 0 1 0 x x x x   Not used - don't care
0 x x x x 0 1 0 0 0    CBR (conditional branch)
0 1 1 1 1 x x          Not used - don't care
0 x x                  Not used - don't care
1 0 0 0 0              JMP
1 0 0 0 0              BR (unconditional branch)
1 0 1 0 1              RET (pops return stack)
1 0 1 x x              IFETCHB
1 1 0 0                JSR (pushes return stack)
1 1 0 0                BSR (pushes return stack) (JSR in PALMODE)
1 1 1 1                JCR (pops and pushes return stack)
1 1 1 0                CALLPAL (pushes return stack)
0 0 0 0 0 0 1 0 1      Not used - don't care
                       Not used - don't care
                       Not used - don't care

The Icache Tag array has 8 banks. The slot 0, slot 1, and fill arbitration happens exactly as described in the Icache Data Array section above.

3.5.3 Store-Sets Based Memory Dependence Predictor

The 21464's out-of-order core could execute a load before a prior store that writes to the same memory location. If this happens the load will get the wrong value. When the store finally executes, this memory-order violation will be detected and the load and all subsequent dependent instructions will be aborted and re-executed, resulting in a performance penalty. This dilemma has created the need for memory dependence prediction. The goals of memory dependence prediction are 1) to predict the load instructions that, if allowed to execute, would cause a memory-order violation and 2) to delay the execution of these loads only as long as is necessary to avoid such a violation.

Our memory dependence predictor is based upon the concept of store sets. A store set for a specific load is the set of all stores upon which the load has ever depended. A load's store set can be approximated in hardware by first allowing speculation of all loads around older stores.
If a load executes before a store upon which it depends, the processor detects a memory-order violation when the store is executed and adds the store to that load's store set. Essentially the processor discovers and remembers a load's store set during program execution. The store set is then used to predict which stores a load must wait for before executing. The table that holds store set IDs is in the Ibox.

For more information about store sets see: George Chrysos and Joel Emer, "Memory Dependence Prediction using Store Sets," in Proc. ISCA-25, July 1998.

Store-sets based prediction replaces the load wait table in the 21264. The store sets predictor implementation has 16 store set identifiers, has 4K entries, and the table is cleared periodically based upon an IPR. The 4K-entry store sets array and the Icache array are read simultaneously based upon an index generated by the Index Unit. Each store set array entry is 5 bits long: one valid bit and four bits for the store set identifier. 8 store set IDs are read out contiguously for each fetch chunk. Logic in the Pbox determines whether the instructions are loads or stores, to know whether to utilize the store set ID or not. The store set IDs then create predicted dependencies between loads and stores in the Pbox dependence mapper.

The store set entry table in the Ibox is trained when a store/load order violation is broadcast from the Mbox. The training of store set entries is governed by the following rules:

1. If neither the load nor the store has been assigned a store set ID, one is created and assigned to the store instruction.
2. If the load has been assigned a store set ID, but the store has not, the store is assigned the load's store set ID.
3. If the store has been assigned a store set ID, but the load has not, the load is assigned the store's store set ID.
4. If both the load and the store have already been assigned store set IDs, one of the two store set IDs is declared the "winner". The instruction assigned the losing store set ID is assigned the winning store set ID. The winner is the lower-numbered store set ID.

Rule one states that when neither the store nor the load involved in the load/store order violation has a store set ID, a new store set ID is created. The store set ID is created by hashing the lower bits of the load's PC: bits <21:2> are XOR-folded down (four XOR stages) to produce the 4-bit ID.

As mentioned above, the store set entry table's valid bits are cleared periodically based upon an IPR. The bits in the IPR that govern the store set table's clearing frequency are defined here:

IPR Bits   Clear Freq
0000       Every cycle (store sets disabled)
0001       Every 1k cycles
0010       Every 2k cycles
0011       Every 4k cycles
0100       Every 8k cycles
0101       Every 16k cycles
0110       Every 32k cycles
0111       Every 64k cycles
1000       Every 128k cycles
1001       Every 256k cycles
1010       Every 512k cycles (recommended setting)
1011       Every 1m cycles
1100       Every 2m cycles
1101       Every 4m cycles
1110       Every 8m cycles
1111       Every 16m cycles

3.5.4 Collapsing Buffer

The job of the collapsing buffer is two-fold. It buffers instruction chunks (called fetch chunks) from the Icache and merges usable instructions from these buffered chunks into map-able chunks (called map chunks) that are sent to the Pbox.
The collapsing buffer is capable storing and merging 2 fetch chunks per cycle. Map chunks are only sent to the Pbox upon request of the Pbox. 3.5.4.1 Instruction Buffer 3.5.4.1.1 Data Path Each TPU has it's own buffer (implemented as a queue) of 16 entries with each entry holding one fetch chunk (8 instructions). The IC ache sends the instruction buffer up to two fetch chunks every cycle during 13 from a single TPU (slot 0 and slot 1). If the corresponding TPU's buffer is empty and the Pbox requests a map chunk from that TPU, then slot 0 can be bypassed as a map chunk during 13. Slot 1 cannot be bypassed. Both fetch chunks are written during 13 into the instruction buffer queue corresponding to the TPU they were fetched from (regardless of whether one slot was bypassed or not). The queue addressing logic keeps track of the head and tail of each buffer by using two single bit pointers arranged in a ring. Each buffer is addressed for write by taking into account the position of the tail pointer, and whether there are one or two slots being written. The queue read addressing is similar to the write addressing, except that the head and tail pointers and their relative locations determine whether one or two slots are read. Normally, two fetch chunks will be read from the buffer on each cycle during 13, except in the case there is only one fetch chunk available in the buffer. The 21464 provides the MAP_ALIGN instruction, which allows software some control over mapping fetch chunks, in disregard for the efficiencies just described. See Section 2.11.3 for information. Compaq Confidentia I 3-16 Instruction Fetch Unit - the lbox 5 Jc1nuary 2001 -~Subject To CJumge Instruction Processing Unit In the event that there has been an internal Ibox fault (misprediction, cache miss, etc ... Not a branch or jump mispredict) corresponding to slot 0, then the collapsing buffer can catch this by not advancing the write pointers, although the fetch chunks are written. If the buffer was empty before the write, a bypass occurs and all of the valid bits sent to the Pbox are pulled low, as that slot is invalid. Unfortunately, faults for Slotl occurs one cycle after SlotO events. This leads to trouble because slotl can be written and then read out of the buffer at the same time a Slotlfault happens. In the event that there has been an internal Ibox fault corresponding to slot 1, each buffer has the ability to "undo" the last write by backing up the write pointers. Additionally, all instructions in the map chunk are invalidated just as they are sent to the Pbox. This undo ability allows the collapsing buffer to capture wrong instructions before they get sent to the Pbox. This is vital, since it means that the Ibox will not have to hunt down and kill instructions in other boxes. 3.5.4.1.2 Control Path To ensure that only map chunks with valid instructions get sent to the Pbox, some additional signaling is needed. First, the Pbox must tell the collapsing buffer some information. The first thing the Pbox needs to do is request a map chunk. It does so by selecting a TPU in 13 and informing the collapsing buffer of the TPU choice. Sometimes, however, the Pbox selects a TPU and finds that it can't map the map chunk it received from the collapsing buffer. To ensure that this map chunk is not lost, the Pbox will tell the collapsing buffer when to advance the read pointers. When the Pbox is able to map the chunk successfully, it will tell the collapsing buffer to advance the read pointers. 
Otherwise, the pointers are left where they are so that the map chunk can be retried later. In order for the Pbox to make choices about which TPU to map, the collapsing buffer will send the Pbox a signal indicating which TPU has instructions in its buffer. Unfortunately, there exists a lag between detecting the emptiness of a buffer and the Pbox actually requesting map chunk. One reason for this is that the Pbox needs to know the state of the buffer one cycle before it can be calculated! This can lead to two problems: • The Pbox overlooks a TPU with valid instructions. • The Pbox requests a map chunk from a TPU that it was told had instructions but is actually empty when the buffer gets read. This causes the Pbox to map a chunk of instructions that are all invalid. A similar case occurs when there is a late kill and the buffer has only bad path instructions. The late kill, as mentioned earlier, clears the valid mask and will cause the buffer to be emptied. In the next cycle, the write pointers will be fixed. In the cycle after this, the buffer is finally declared empty. This will cause the TPU to indicate it is not empty for the kill cycle, the cycle the pointers get fixed, and the cycle after that (remember the bid signal is actually on cycle stale). In this case, the collapsing buffer will tell the Pbox to abort any future map attempts on this TPU after the kill is detected so that no more invalid map chunks get mapped in the mapper. As mentioned earlier, there is a valid signal sent to the Pbox. This signal contains a valid bit for each instruction in the collapsing buffer which indicates that the instruction is valid for mapping. The collapsing buffer sometimes uses this valid mask for last minute kills as described previously. Compaq Confidential 5 January 2001 ···Subject To Change Instruction Fetch Unit - the lbox 3-17 Instruction Processing Unit 3.5.4.2 Collapser 3.5.4.2.1 Data Path The collapser's overall job is merge two fetch chunks into one 8 instruction map chunk. The collapsing buffer collapser receives two fetch chunks from the instruction buffer in 13. To collapse, the first valid instruction (START) and exit point instruction (END) for each fetch chunk are read from the start/end buffer. The invalid instructions (instructions outside START and END) are stripped off of the two fetch chunks and then the second fetch chunk is appended to first. Finally, the instructions are left justified within the map chunk such that the first valid instruction is always the first instruction. The operation of the collapser is fairly simple. However, when you fold in the fact that there may be more than 8 valid instructions in two 8 instruction chunks. In this case, the collapsing buffer needs to modify and store the START position for the left over fetch chunk. In the case that a legacy CMOV instruction causes the end of the block, an additional bit needs to be stored that indicates that the instruction corresponding to the modified START is a CMOV2 instruction. 3.5.4.2.2 Start/End Buffer The start/end buffer not only stores the start and end of the valid instructions in a line, but also the CMOV predecode bits. This buffer is broken up into 4 queues, much like the instruction buffer. Each queue holds 16 entries. Each entry is 25 bits. 
Table 3-5 Fields in the Start/End Buffer Field Contents CMOV_PRE<7:0> CMOV mask for the corresponding FETCH_CHUNK START<7:0> 1-hot start of valid instruction in the FETCH_CHUNK END<7:0> 1-hot end of valid instruction in the FETCH_CHUNK PAL_MODE Single bit field that indicates if the fetch chunk on this line is a PAL mode chunk or not. The START data is written into the buffer from the Line Predictor and the END data is written in from the Branch Predictor exit logic. The CMOV predecodes com from the lcache Tags and Pal mode will come from Pipe Control. 3.5.4.2.3 New Start Calculation When two 8-instruction fetch chunks are collapsed into one 8-instruction map chunk, there is always the possibility that there will be left over instructions in the second fetch chunk. So, the start of FETCH_CHUNKl needs to be modified to the new start of the valid instructions in that chunk. This is actually a rather simple calculation, and there is plenty of time to do it. A wrench gets thrown into the works if there is a CMOV in the 8 instruction map chunk. In this case, the new start will correspond with the location of the CMOV in the fetch chunk. 3.5.4.2.4 CMov It is assumed that the reader of this document has previously read Handling CMov. Compaq Confidential 3-18 Instruction Fetch Unit -the lbox 5 J('intiary 2001 -·Subject To Change Control Flow Prediction Unit The CMOV instruction needs to be spit into two halves CMOVl, and CMOV2. To accommodate the CMOV2 instruction, map chunks are always ended on the CMOVl (which sits in the same position as the original CMOV). The next map chunk will then begin with a CMOV2. The remaining 7 instructions in the map chunk will be collapsed as normal. The legacy CMOV will also require that the CMOV predecode mask be modified and stored for the next collapsing. 3.6 Control Flow Prediction Unit 3.6.1 Conditional Branch Prediction Conditional branches are ubiquitous in most programs. However, it takes at least 13 cycles in the deeply pipelined 21464 before the outcome of the branch is known. Hence, the 21464 processor utilizes an aggressive branch predictor to provide the ability to speculatively fetch beyond conditional branches The 21464 branch predictor belongs to the class of skewed branch predictors. In this class of predictors, multiple prediction tables are used that operate independently to generate a prediction of their own while a majority vote decides the final branch rediction .. For more details, please refer to the technical report by Sezec and Michaud . The 21464 used a modified form of a skewed predictor in which an additional level of prediction is used to choose between the majority vote and one of the prediction tables. There are four tables that constitute the branch predictor that are termed the GO, GI, BM and CH arrays whose sizes are 64K, 64K, 16K and 64K bits respectively. Tables GO, Gland BM serve as prediction tables while CH serves as the chooser. Each entry in the table has 8 prediction bits corresponding to 8 instructions in the fetch chunk. An entry for each of these tables is indexed using a unique function that is based on a combination of the branch history bits as well as the address bits used for accessing the instruction cache in the current and previous fetch slot. The 8 prediction bits from each of these tables are further rearranged (unshuffled) using another function of the history and address bits. 
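In outline, each of the eight prediction bits for a fetch slot is resolved as in the following sketch, where the chooser bit selects between the BM prediction and the majority of the three prediction tables. Which chooser polarity selects which source is an assumption here, as are the names; this is a paraphrase of the scheme described above, not the actual logic.

    /* Illustrative per-bit resolution of the skewed branch predictor:
     * the chooser (CH) picks either the majority of G0/G1/BM or BM alone. */
    static inline unsigned final_prediction(unsigned g0, unsigned g1,
                                            unsigned bm, unsigned ch)
    {
        unsigned majority = (g0 & g1) | (g0 & bm) | (g1 & bm);  /* 2-of-3 vote */
        return ch ? bm : majority;
    }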
The final set of 8 predictions for the fetch slot is thus derived after the unshuffle which is followed by choosing between the majority of GO, G 1 and BM or the prediction bits of BM itself using the CH (chooser) bits. The overall block diagram of the prediction mechanism is illustrated in the figure 1. f A. Seznec and P. Michaud, "Dealiased hybrid branch predictors", IRISA report, Feb 1999, http:// www.irisa.fr/caps/PROJECTS/Architecture Compaq Confidential 5 January 2001 ··· Subject To Change Instruction Fetch Unit - the lbox 3-19 Control Flow Prediction Unit Figure 3-3 High level diagram of the 21464 branch predictor The prediction tables are further complemented with additional hysteresis tables. The sizes of the individual hysteresis tables for GO, Gl, CH and BM are 32K, 64K, 32K and 16K bits respectively. It must be noted that unlike traditional schemes, the 21464 predictor does not have a unique hysteresis bit associated with every prediction bit. Rather, the prediction entries are permitted to share hysteresis bits as can be seen from the GO table that has 64K prediction bits but only 32K hysteresis bits. A reduction in the number of hysteresis bits was shown to have little performance impact while saving valuable die-space. The hysteresis bits prevent the prediction bits to change on the first incorrect prediction thereby disregarding transient changes in branch behavior. The hysteresis bits may be strengthened on a correct prediction and weakened on an incorrect prediction. Unlike the hysteresis bits, the prediction bits may change only on incorrect predictions based upon the state of their corresponding hysteresis bits. The complete training of the hysteresis and predictor arrays is done at instruction retire time. In the following sections, we describe in greater detail the different components and functionality of the branch predictor and training mechanism. 3.6.1.1 Branch Prediction Components 3.6.1.1.1 Branch History(LGHist) It has been shown that using the past behavior of the branches is extremely helpful in predicting future branches. The traditional method of maintaining the global history (also known as ghist) is to record the outcome of each and every branch that is executed in the program. To ease implementation, the 21464 uses a modified version of ghist called the linebased ghist (or lghist for short) that records branch history on a fetch line basis. The lghist scheme takes into consideration only the behavior of the last branch of the fetch line. If the branch is in the first half of the fetch line (words 0, 1, 2 or 3), a 0 is entered for a not taken branch while a 1 is entered for a taken branch. On the other hand, if the last branch happens to be in the second half of the fetch line (words 4, 5, 6 or 7), a 1 is entered for a not taken branch while a 0 is entered for a taken branch. Compaq Confidential 3-20 Instruction Fetch Unit - the lbox 5 Janwiry 2001 ·- Subject To Change Control Flow Prediction Unit It must be mentioned that the branch history used for predicting the branches in the slot that is currently being fetched would not include the information of the slots fetched in the previous cycle. This is because an extra pipeline stage is required to record the predicted outcome of the branches in the lghist. The predicted outcome can be recorded only after discarding those predictions that would play no role when the instructions in the fetch slot are executed. 
This is achieved by considering (a) entry point in the fetch slot (b) identifying the true conditional branches among the 8 instructions in the fetch slot using pre-decode information and (c) unconditional exit instructions (such as jump, return or unconditional branch instructions). Furthermore, in a given cycle, if two slots are being fetched, the second slot would not only lack the history of the slots that were fetched in the previous cycle but also of the first one that is currently being fetched. Hence, to maintain consistency, the history information used for prediction is always made three slots old for both the slots in a given cycle. The fact that the lghist was modified for a particular fetch slot is maintained using the shift distance bits. Note that the lghist would change only in the presence of valid conditional branches in the fetch slot. The shift distance information is particularly valuable on restarts when the restored lghist has to be aged by three slots before being used to access the branch predictor. A maximum of 3 shift distance bits needs to be maintained corresponding to the three-slot aging factor. In addition to the shift distance bits, another bit called the no shift bit is used. This bit prevents the shift distance bits from being modified more than once for the same fetch slot when it is restarted (on an exception). On a restart from an exception, the checkpoint table restores the lghist and shift distance after updating them appropriately depending upon the presence of a conditional branch before and after the restart position in the fetch slot. If the restart position occurs in the same fetch slot and no branches are present after the restart position, the lghist (as well as the shift distance bits) needs to be updated to incorporate any branches before the restart. To prevent future updates to the shift distance for the remainder of the fetch slot that has no conditional branches, the checkpoint table also sets the no shift bit. It is this bit that is used to determine if the shift distance bits needs to be updated for the current fetch slot. Note that the no shift bit is applicable only for the first fetch slot as a restart on an exception always results in only one slot to be fetched. Even though all prediction tables are common to all threads, the lghist, shift distance and no shift bit are maintained on a per-thread basis. 3.6.1.1.2 Prediction Tables As mentioned before, the branch predictor logically consists of four tables: GO, Gl, BM and CH. However, this is implemented as one array where each word line in a bank is made up of the four different components. The array is further sub-divided into four single-ported banks with each bank containing 64 word lines. Even though logically each table entry contains 8 prediction bits, implementation constrains each wordline to have several 8-bit entries clustered together. The address bits for indexing the table allow us to select from among the different clusters or "columns" of 8-bit entries. The address bits that are used to access each of the tables is generated using bits <14:5> of the line predictor index (denoted as A) and bits <20:0> of the three slot old lghist (denoted as H). The lower bits <4:2> of the line predictor index are not used for the array access. These bits, which denote the entry point in a fetch slot, are used solely to discard predictions prior to the starting point. 
The address bits are as follows:

a. Word address (6 bits): This is used to access one of the 64 wordlines in the array. Since each component resides in the same wordline, these bits are common to all the tables. Moreover, since the address bits must be available at the beginning of the cycle, the address bits are generated directly without any hashing involved. The 6 address bits are: H<3:0>, A<8:7>

b. Column select address: Each wordline consists of multiple entries of G0, G1, BM and CH. These address bits choose from among the many 8-bit "columns" present in a wordline. Since 8 bits are to be chosen from 256 bits of G0, G1 and CH, each of these tables needs 5 address bits to choose the appropriate one from a 32-1 column multiplexer. Only 3 address bits are needed for selecting 8 bits from 64 bits of BM (which requires only an 8-1 column multiplexer). The column select bits for each table are as follows:

G0    H<7>⊕H<11>    H<8>⊕H<12>    H<4>⊕H<5>     A<9>⊕H<9>    H<10>⊕H<6>
G1    H<19>⊕H<12>   H<18>⊕H<11>   H<17>⊕H<10>   H<16>⊕H<4>   H<15>⊕H<20>
CH    H<7>⊕H<11>    H<8>⊕H<12>    H<5>⊕H<13>    H<4>⊕H<9>    A<9>⊕H<6>
BM    N.A.          N.A.          A<11>         A<9>⊕A<5>    A<10>⊕A<6>

3.6.1.1.3 Bank Selection

As mentioned before, the predictor arrays are sub-divided into four banks, each of which has only one read port. Since the branch predictor must be able to predict two fetch slots every cycle, it is necessary that the two slots do not access the same bank in a given cycle. To achieve this, the bank identifier is constructed in such a way that no two consecutive slots access the same bank. Since the bank identifier must be available at the beginning of the cycle in which the array access is performed, it is not possible to use any information from the current two slots to generate the bank identifier. For this reason, we use bits 5 and 6 of the line predictor index and the bank identifiers of the previous slots.

Assume that, in the current cycle, the predictor array is being accessed for slots N-2 and N-1 with bank identifiers B(N-2) and B(N-1) used for the access. Also, let Z(N-2) and Z(N-1) be bits 5 and 6 of the line predictor index used to access slots N-2 and N-1 respectively. The generation of bank identifiers for slots N and N+1 for accessing the array in the following cycle is done as follows. To generate the bank identifier for slot N, B(N), Z(N-2) is compared to B(N-1). If they match, B(N) is set to Z(N-2)+1; otherwise, it is set to Z(N-2). The generation of the bank identifier for slot N+1 is similar; the newly calculated B(N) is compared to Z(N-1). In this fashion, the bank identifiers for the subsequent two slots are generated in advance by using information that is available in the current cycle.

3.6.1.1.4 Unshuffle Network

Imagine that the branch predictor is implemented such that each entry in the array has only one prediction bit. In this case, we would hash bits <14:2> of the line predictor index with the lghist to generate an index for each table entry. In reality, however, each entry stores a set of 8 predictions for each table. Hence the low bits <4:2> of the line predictor index are not used for the access.
But these bits are eventually used: they denote the entry point in the fetch slot and, in conjunction with the instruction pre-decode bits that denote the actual conditional branches in the fetch slot, they allow us to choose only a subset of the 8 predictions. For performance reasons, however, it is desirable that the low bits are also hashed with the lghist bits when accessing the predictor arrays. Thus, the set of 8 predictions needs to be rearranged (unshuffled) to give the same set of 8 predictions that we would have gotten by indexing a 1-prediction-bit table 8 times (to span the low bits <4:2> ranging from 000 to 111).

Let f2 f1 f0 be the bits that are used for XOR-ing with the low bits of the line predictor index, and let a6 a5 a4 denote the position of a particular branch prediction bit. After the unshuffling, the prediction bit occupies the new position a6⊕f2, a5⊕f1, a4⊕f0. For instance, if f2f1f0 = 101 and the 8 prediction bits are b7 b6 b5 b4 b3 b2 b1 b0, the new arrangement after the unshuffling is b2 b3 b0 b1 b6 b7 b4 b5. (A brief code sketch of this permutation appears below, after the list in Section 3.6.1.1.5.) The hash function used here can be quite complex, as the unshuffling is performed only in the later part of the cycle, after the array access and column selection have been performed. The address bits used for the hashing include the line predictor index (denoted as A), the lghist (denoted as H), and bits 5 and 6 of the index used to access the previous slot (denoted as Z). The functions used for the different tables are listed below:

Table   Unshuffle bits <2:0>
G0      A<9>⊕A<12>⊕A<13>⊕H<5>⊕H<8>⊕H<11>⊕Z<5>
        A<5>⊕A<11>⊕H<9>⊕H<10>⊕H<12>⊕Z<6>
        A<6>⊕A<10>⊕A<14>⊕H<4>⊕H<6>⊕H<7>
G1      A<6>⊕A<11>⊕A<14>⊕H<4>⊕H<6>⊕H<9>⊕H<14>⊕H<15>⊕H<16>⊕Z<6>
        A<10>⊕A<13>⊕H<5>⊕H<11>⊕H<13>⊕H<18>⊕H<19>⊕H<20>⊕Z<5>
        A<5>⊕A<9>⊕H<4>⊕H<7>⊕H<10>⊕H<12>⊕H<13>⊕H<14>⊕H<17>
CH      A<5>⊕A<10>⊕H<7>⊕H<10>⊕H<13>⊕H<14>⊕Z<5>
        A<6>⊕A<12>⊕A<14>⊕H<4>⊕H<6>⊕H<8>⊕H<14>
        A<9>⊕A<11>⊕A<13>⊕H<5>⊕H<9>⊕H<11>⊕H<12>⊕Z<6>
BM      NONE
        Z<6>
        Z<5>

3.6.1.1.5 Backend logic and checkpoint information

The final set of 8 branch predictions for each fetch slot is available after the chooser is used to decide between the majority of G0, G1, BM and the BM predictions. However, not all of the final 8 predictions may be useful, for the following reasons:

1. The instruction for which a branch was predicted taken may not be a conditional branch instruction.
2. The entry point in the fetch slot may not be the first instruction.
3. Not all instructions may be executed in the fetch slot, due to a taken prediction for a conditional branch or the presence of an unconditional exit point in the fetch slot (for instance, a jump or a return instruction).
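As noted above, the unshuffle is simply a bit-position permutation by XOR. The following minimal sketch moves the prediction bit at position a to position a XOR f and reproduces the f = 101 example from the text; representing the 8 predictions as a byte is an assumption for illustration.

    #include <stdint.h>

    /* Unshuffle: the prediction bit at old position a moves to position a ^ f,
     * which is the same as saying new[p] = old[p ^ f]. Bit i of 'preds'
     * corresponds to position i; f is a 3-bit value. */
    static uint8_t unshuffle(uint8_t preds, unsigned f)
    {
        uint8_t out = 0;
        for (unsigned p = 0; p < 8; p++)
            out |= (uint8_t)(((preds >> (p ^ f)) & 1u) << p);
        return out;
    }

    /* With f = 5 (binary 101), position 7 receives b2, position 6 receives b3,
     * position 5 receives b0, and so on: the b2 b3 b0 b1 b6 b7 b4 b5 ordering
     * given in the text. */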
Furthermore, on restarts, branches after the restart point would have to be considered if the restart occurs in the same fetch slot. If there are branches prior to the restart position but none after, then it must be incorporated in the restored lghist and shift distance. Hence, we also checkpoint the conditional branch attributes that spans all instructions until the first unconditional exit point. Finally, the 8 bits read from each of the predictor tables (GO, Gl, CH and BM) are also stored in the checkpoint table for training the branch predictor at retire time. This avoids reading the single-ported predictor array at training time as doing so may result in conflict with the accesses made during fetch time. 3.6.1.2 Branch Training The validity of the branch predictions is known when the branches are executed in the Ebox. A misprediction causes the branch predictor states such as lghist to be restored. However, the actual branch training does not take place until the Pbox retires the instruction. As mentioned earlier, the predictor makes use of hysteresis tables to prevent modification of the prediction bits on transient branch behavior. Each prediction table has a corresponding hysteresis table with sizes of the individual tables for GO, G1, CH and BM being 32K, 64K, 32K and l 6K bits respectively. Note that the sizes of GO and CH hysteresis tables are half the size of the corresponding predictor tables. This results in two entries in the predictor table to share an entry in the hysteresis table. As with the predictor tables, the hysteresis tables are implemented as a single array that is interleaved between four single-ported banks. The only difference is the size of the wordline that results due to the reduction in the sizes of GO and CH tables. Consequently, one partition in the wordline contains 256 bits that comprise of 128 bits each of GO and CH while the other partition contains 320 bits that consists of 256 bits of G 1 and 64 bits of BM. The address bits used to access the wordline, column select and for performing the ''unshuffle" are the same as that for the predictor tables except that the high order bit for the column select is no longer applicable for GO and CH due to their reduced sizes. When a map chunk is retired, the checkpoint table produces the relevant information regarding the fetch slots comprising the map chunk. This includes information on whether a branch was mispredicted for the fetch slot and if so, the mispredicted position. To avoid reading the single-ported predictor array during training, the predictions that were read from each table at fetch time is also available from the checkpoint table. Using this information, both the prediction and the actual outcome for the fetch slot can be reconstructed. 3.6.1.2.1 Predictor Training The predictor tables need not be updated on a correct prediction. On an incorrect prediction, only one of the fetch slots that is retired would have an incorrect prediction. This implies that only one of four predictor banks would be accessed for writing the Compaq Confidentia I 3-24 Instruction Fetch Unit - the lbox 5 Jc1m1ary 2001 - Subject To Change Control Flow Prediction Unit training information. However, the write to the predictor table may conflict with a read access performed at fetch time. To minimize this conflict, each of the banks has a oneentry write buffer to hold the write data whenever it conflicts with a read to the bank. 
However, this may not be sufficient when there are back to back retirement of map chunks with a mispredicted branch. Dropping one of the writes to the predictor array is not preferred as it may impair the performance of the predictor. To accommodate this situation, the predictor bank-conflict detection mechanism keeps track of pending writes to each bank. If necessary, a bubble is inserted during the fetch stage to put future reads on hold so as to allow a write pending in the buffer to be cleared. The predictor tables are trained using the following rules: 1. Nothing to be done on a correct prediction 2. If either the majority or BM is correct, update chooser to the correct state provided the hysteresis is weak 3. For each of GO, Gl and BM, modify entry when the table's prediction is incorrect and its hysteresis is weak provided also that: a. Neither the majority or BM is correct b. Either the majority or BM is correct but the chooser continues to point to the wrong predictor after the update (i.e. the chooser had a strong hysteresis) 3.6.1.2.2 Hysteresis Training Unlike the predictor tables, the hysteresis tables would have to be updated for both correct and incorrect predictions. For correct predictions, the hysteresis tables can be written without being read as they are always strengthened. However, for incorrect predictions, we need to perform a read-modify-write of the hysteresis bits for the fetch slot with the mispredicted branch. As with the predictor arrays, a write buffer holds pending data for each bank. This still does not prevent bank conflicts due to read and writes occurring at the same time. Overall, there are three different types of accesses to the hysteresis array that may lead to bank conflicts: 1. Writes for a fetch slot with a mispredicted branch 2. Read table for a mispredicted branch 3. Writes to strengthen the hysteresis bits for fetch slots correctly predicted branches Unlike the predictor training where a bubble inserted at the fetch point permits reads to be put on hold, we cannot stall retires to avoid hysteresis bank conflicts. Hence, we prioritize accesses and drop the access with the lower priority in favor of the higher one. For the three types of accesses mentioned above, type (i) has the highest priority followed by (ii) and finally (iii). The ordering is such that no training done for a mispredicted branch is dropped. If a bank's write buffer holds a mispredicted hysteresis write with another mispredicted write to the same bank to follow, the read is disabled and a weak hysteresis is assumed as the default read value. If, on the other hand, the incoming write is for a correctly predicted branch, it is dropped in favor of the read access. The hysteresis tables are trained using the following set of rules: 1. Incorrect prediction a. If the majority and BM differ, the chooser hysteresis is weakened Compaq Confidential 5 January 2001 - Subject To Change Instruction Fetch Unit - the lbox 3-25 Control Flow Prediction Unit b. For the GO, G 1 and BM hysteresis, strengthen if table prediction is correct. If the table prediction is incorrect, then weaken the hysteresis, provided: Neither the majority or BM is correct Either the majority or BM is correct but the chooser continues to point to the wrong predictor after the update (i.e. the chooser had a strong hysteresis) 2. Correct prediction a. If GO, Gl and BM are all correct, hysteresis is unchanged for all tables b. Strengthen BM if it is correct c. 
Strengthen G0 or G1 if it is correct and the majority was used by the chooser for the prediction

3.6.1.3 PAL mode

In PAL mode, the predecode bits for conditional branches are not set by the instruction fill unit. This implies that the branch predictor is not utilized during PAL code and all branches are predicted as not taken. Since branches in PAL mode are rare, this has little effect on performance. Moreover, we do not want the application-specific branch history (lghist) to be corrupted by PAL code branches.

3.6.2 Jump Target Predictor

The Jump Target Predictor is responsible for predicting the targets of Alpha's computed jump instructions: JMP and JSR. The Jump Target Predictor keeps track of partial addresses from the last four jump target predictions, called jghist. It hashes those partial addresses together to form an index into a 512-entry target table. The target table is trained with the real computed targets from the execution units. A jghist is maintained for each of the four TPUs, but the 512-entry target table is shared among the threads. The jump predictor can predict one jump per cycle. If both fetch blocks that are fetched in a cycle contain a jump instruction, the first one is processed and the second fetch block is squashed (see PC Unit).

Figure 3-4 Jump Predictor Block Diagram

Indexing into the jump predictor table is the result of hashing the most recent predicted jump targets as follows. Assume the four most recent jump targets, from most recent to least recent, are D<51,2>, C<51,2>, B<51,2>, A<51,2>. The 9-bit index into the 512-entry jump target table for the next predicted jump target is:

    D<19,11> XOR D<10,2>
    XOR CONCAT(C<18,11>, 0)   XOR CONCAT(C<9,2>, 0)
    XOR CONCAT(B<17,11>, 00)  XOR CONCAT(B<8,2>, 00)
    XOR CONCAT(A<16,11>, 000) XOR CONCAT(A<7,2>, 000)

The hash was chosen to ensure position independence of the prior targets (e.g., so that the target history A, B, A, B hashes to a different table location than B, A, B, A, since the first should predict target A and the second should predict target B). Zeros are shifted into the older targets to ensure that older targets count less in the hash. The jghist registers are checkpointed to facilitate restarting the jump target predictor in case of an exception. When the machine is restarted, the appropriate last four targets overwrite the jghist state that had progressed since the instruction that caused the exception. Jump mispredictions are trained by writing the correct target into the jump address predictor's table when the mispredicting jump retires. The Checkpoint Unit receives the correct jump target from the Ebox when the jump executes. The Checkpoint Unit detects a jump mispredict at that time and keeps a record of the correct target to facilitate training once the jump retires.

3.6.3 Return Address Stack

The Return Address Stack is responsible for predicting the targets of returning instructions. The return address predictor is affected by instructions that jump to subroutines and those that return from subroutines. There are several calling instructions:

BSR   Branch Subroutine
JSR   Jump Subroutine
CPL   Call Pal
JCR   Jump Co-Routine

There are also multiple returning instructions:

RET   Return
JCR   Jump Co-Routine

In order to predict return addresses, we use the simple concept of a stack.
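Before turning to the stack mechanism, the jump-table index hash of Section 3.6.2 can be written out concretely. This is only an illustrative sketch in C; the helper name extract_bits and the packing of the four prior targets into 64-bit integers are assumptions made for the example, not the hardware interface.

#include <stdint.h>

/* Return bits <hi:lo> of v, inclusive, right-justified. */
static uint32_t extract_bits(uint64_t v, int hi, int lo)
{
    return (uint32_t)((v >> lo) & ((1ull << (hi - lo + 1)) - 1));
}

/* d is the most recent predicted jump target and a the least recent.
 * Each CONCAT term above appends zeros on the right, which is a left
 * shift here, so older targets contribute progressively fewer bits.  */
static uint32_t jump_table_index(uint64_t d, uint64_t c, uint64_t b, uint64_t a)
{
    uint32_t idx = extract_bits(d, 19, 11) ^ extract_bits(d, 10, 2);
    idx ^= (extract_bits(c, 18, 11) << 1) ^ (extract_bits(c, 9, 2) << 1);
    idx ^= (extract_bits(b, 17, 11) << 2) ^ (extract_bits(b, 8, 2) << 2);
    idx ^= (extract_bits(a, 16, 11) << 3) ^ (extract_bits(a, 7, 2) << 3);
    return idx & 0x1ff;                 /* 9-bit index, 512-entry table */
}

Written this way it is easy to see why the history A, B, A, B and the history B, A, B, A land in different entries: the shift applied to a target depends only on its recency, not on its value.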
When a calling instruction is fetched, the PC following the calling instruction is pushed onto a stack. When a return instruction is fetched, the stack is popped, and the PC is redirected to the popped value. The stack holds 64 return PCs per thread. Compaq Confidential 5 January 2001 ~·Subject To Change Instruction Fetch Unit- the lbox 3-27 PC Unit The return stack must be check-pointed. Upon a abort (branch mispredict, load/store order trap, etc), any pushing or popping that has been done to the stack by instructions on the badpath must be undone to restore the stack to a coherent state. In order to facilitate a fully checkpointed return address stack, we are implementing a structure that behaves like a linked list. We have an array of elements. Each element consists of a PC, and a previous top of stack pointer (PTOS_PTR). Externally we access the array with a top of stack pointer (TOS_PTR). In order to pop the "stack" the array is accessed at the address specified by the TOS_PTR. The PC that is read out is the return target PC, or PopPC, and the PTOS_PTR that is read out corresponds to the next top of stack. When performing a pop, the PTOS_PTR that is read out is written into the TOS_PTR latch. In order to push a value onto the "stack" we need another pointer into the array, which is the next element of the array to be allocated (NALLOC). On a push the array is written at the location specified by NALLOC. The PC written is computed by incrementing the address of the pushing instruction (PushPC) and the current TOS_PTR is written into the PTOS_PTR component of the array element. Then NALLOC is written into the TOS_PTR latch, and NALLOC itself is incremented by 1 (modulo the size of the array). Checkpointing is performed by storing the array pointer corresponding to the current top of stack element in the Checkpoint Unit (See Checkpoint Unit Documentation) for each instruction chunk. The current NALLOC pointer is also stored into the Checkpoint Unit for each instruction chunk in order to reclaim space used by badpath pushes and pops. The stack state is restored by restoring the TOS_PTR and the NALLOC pointer that were stored when the instruction causing the abort was fetched. Since the 21464 is a multithreaded machine, we need to have a return address predictor that can accommodate multiple code paths without getting confused. In the interest of simplicity, we have decided to simply replicate the return stack array itself. Each of the four (one per TPU) return stack arrays contains 64 entries. In the 21464, we fetch up to two 8-instruction chunks from the !cache each cycle. Each of those chunks has can contain an instruction that manipulates the return stack. The return stack cannot, however, handle any combinations of pushes and pops in one cycle. It can handle: • • • • One Push One Pop One JCR (Pop followed by a Push) Pop in slotO, Push in slotl In the event that two !cache blocks are fetched that do not correspond to one of the four scenarios above, the second block is squashed. (See Documentation for squashing in the PC Unit PC Calculation section). 3.7 PC Unit 3.7.1 PC Calculation The Program Counter(PC) is a register which holds the address of the instructions to fetch next. In the 21464, there are four TPUs, each of which has an independent instruction stream. To keep track of all the TPUs' instruction addresses, the Ibox maintains four PCs. 
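Completing the return-address-stack description of Section 3.6.3, a minimal software sketch of the per-TPU checkpointed linked-list structure follows. The entry count and the pointer names (TOS_PTR, PTOS_PTR, NALLOC) come from the text; the C representation and the +4 return-PC computation (one Alpha instruction) are illustrative assumptions, not the hardware design.

#include <stdint.h>

#define RAS_ENTRIES 64              /* 64 return PCs per thread */

typedef struct {
    uint64_t pc;                    /* stored return PC                */
    uint8_t  ptos_ptr;              /* previous top-of-stack pointer   */
} ras_entry_t;

typedef struct {
    ras_entry_t e[RAS_ENTRIES];
    uint8_t tos_ptr;                /* current top of stack (TOS_PTR)  */
    uint8_t nalloc;                 /* next element to allocate        */
} ras_t;

/* Push: allocate at NALLOC, link back to the old top, advance NALLOC. */
static void ras_push(ras_t *r, uint64_t push_pc)
{
    r->e[r->nalloc].pc       = push_pc + 4;     /* PC following the call */
    r->e[r->nalloc].ptos_ptr = r->tos_ptr;
    r->tos_ptr = r->nalloc;
    r->nalloc  = (uint8_t)((r->nalloc + 1) % RAS_ENTRIES);
}

/* Pop: read the predicted return PC and follow the back link. */
static uint64_t ras_pop(ras_t *r)
{
    uint64_t pop_pc = r->e[r->tos_ptr].pc;
    r->tos_ptr = r->e[r->tos_ptr].ptos_ptr;
    return pop_pc;
}

/* Checkpoint/restore: only the two pointers are saved per instruction chunk. */
typedef struct { uint8_t tos_ptr, nalloc; } ras_ckpt_t;
static ras_ckpt_t ras_checkpoint(const ras_t *r) { return (ras_ckpt_t){ r->tos_ptr, r->nalloc }; }
static void ras_restore(ras_t *r, ras_ckpt_t c)  { r->tos_ptr = c.tos_ptr; r->nalloc = c.nalloc; }

Because entries are never moved, restoring the two pointers after an abort is enough to make badpath pushes and pops invisible, which is the point of the linked-list organization.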
The PC changes based upon either sequential fetching of the code, or based Compaq Confidential 3-28 Instruction Fetch Unit - the lbox 5 January 2001 ··· Subject To Cfumge PC Unit upon PC changing instructions such as branches, jumps and returns. The computation of a new PC must occur ever time instructions are fetched; this computation is refered to as PC Calculation. As mentioned above, the Ibox fetches up to two non-contiguous fetch-blocks of instructions each cycle. A fetch block begins with the PC of the first instruction, and all subsequent instructions' PCs in the fetch block must be a sequential increment to the first. Between fetch-blocks, however, the PC's sequential stream can be broken. Since the ibox can fetch up to two non-contiguous fetch blocks in the same cycle, a PC must be generated for each of the fetch blocks. A PC is needed for each fetch block to compare with the fetched !cache blocks' tags for hit determination, and check that the !cache indicies produced by the line predictor were correct and pertained to the correct !cache way. The transition from one fetch-block to another is governed by the exiting condition of the first fetch-block. The list of potential exits to a fetch block are listed in Table 3-6. Table 3-6 Fetch-Block Exit Conditions Last Instruction of a 32B Cache Line (Sequential) Predicted Taken Conditional Branch Instruction (CBR) Unconditional Branch Instruction (BR, BSR) Jump Instruction(JSR, JMP, JSR_COROUTINE) Return Instruction (RET) IFETCHB Instruction (Halt Fetching until Retirement of IFETCHB) Call PAL Instruction (CALL_PAL) Starting with the PC (PCO) of the beginning of the first fetch block, the starting PC (PCl) of the second fetch block is determined by the exiting condition of the first fetch block: Table 3-7 PC1 Calculation Something Something Sequential No PC Changing instructions in the first block PCl = CONCAT(PC0<51:5> I 00000) + 1 fetch block (32B) Taken Branch Predicted Taken CBR, BSR, or BR PCl = CONCAT(PC0<51:5> I 00000) +Branch Instruction Position Offset in Fetch Block+ Branch Displacement Jump Predicts JSR,JMP PCl =Output of Jump Predictor Stack Pops RET, JSR_COROUTINE PCl =Pop of Return Stack Call Pal Compaq Confidential 5 Jam.1ary 2001 --· Subject To Change Instruction Fetch Unit - the lbox 3-29 PC Unit Table 3-7 PC1 Calculation Something Something PC 1 =PAL Base Address + Trap Vector Offset IFETCHB PC 1 =PC of the IFETCHB + 4 In order to calculate the PCs, the fetch block exits must be known, as well as the branch predictions, jump target prediction, and the current return address on the top of the return stack. The fetch block exits come from the instructions themselves, ie the !cache data array, and the predictors operate ahead of time to ensure that all the information for PC Calculation is produced as soon as possible. For speed of PC calculation on taken branches, the lower bits of the taken address are pre-calculated and stored in the offset field of the instruction text in the !cache. This happens at fill time. This means that the lower bits: <21 :2> of the target pc are not calculated, but simply read from the !cache with the instruction. The higher bits <51 :22> need to be calculated. They could be incremented by one, decremented by one, or not change at all based on whether the offset of the taken branch was positive or negative, and whether it caused a carry or borrow (we refer to both as "overflow") above bit 21. The overflow bit and sign bit are stored with the offset in the instruction text at !cache fill time. 
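The high-bit adjustment just described can be summarized in a short sketch. The names are illustrative; the stored low target bits, the sign bit, and the overflow bit are assumed to come from the Icache data array as described, with overflow taken to mean that a carry or borrow crossed bit 21 at fill time (see Section 3.8.2.3).

#include <stdint.h>

/* Reconstruct a taken-branch target at PC-calculation time.  pc is the
 * branch's own PC; target_low holds the precomputed target bits <21:2>
 * read from the Icache; sign is the displacement sign and overflow the
 * stored carry/borrow indication above bit 21.                          */
static uint64_t taken_target(uint64_t pc, uint32_t target_low,
                             unsigned sign, unsigned overflow)
{
    uint64_t high = (pc >> 22) & ((1ull << 30) - 1);        /* PC<51:22> */

    if (overflow)                    /* carry (positive) or borrow (negative) */
        high = (high + (sign ? -1 : 1)) & ((1ull << 30) - 1);

    return (high << 22) | ((uint64_t)(target_low & 0xfffff) << 2);
}

The complementary computation happens at Icache fill time, where target<21:2> and the overflow predecode bit are produced and stored in place of the displacement (Section 3.8.2.3).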
Two PCs are calculated per cycle. At the beginning of the cycle the current PC is "PCO", which pertains to the start of the slot 0 fetch block. The two PCs that are calculated are "PC 1", which pertains to the start of the slot 1 fetch block, and "NextPCO" which pertains to the start of next cycle's slot 0 fetch block. In effect NextPCO becomes PCO for the next cycle. In order to maintain two fetched blocks every cycle both PCl and NextPCO are calculated together. The table above showed how exit condition of slot 0 determined the calculation of PC 1. In order to determine the calculation for NextPCO, both the slot 0 and slot 1 exit conditions need to be considered. This is because the computation of NextPCO must start with PCO, and not PCl, which is being computed simultaneously. Considering all of the possible exit combinations, that is 6x6 cross products, is quite a large task. Several restrictions on combinations of slot 0 and slot 1 fetch chunk exits reduce this considerably. Some of the restrictions are imposed due to hardware limitations (eg, the jump predictor can only handle one jump per cycle, so the slot 0 and slot 1 fetch blocks cannot both end in jmp or jsr). Others were imposed to make the PC calculation logic feasible. The first time the Ibox attempts to fetch two fetch-blocks in the same cycle that violate one of the restrictions, the PC comparison logic will abort the fetch of the second block, and cause a three cycle restart penalty. Thereafter the Line Predictor will remember that the two fetch blocks are incompatible and only the first fetch block will be accessed in that cycle. The second fetch block will be fetched in the following cycle Compaq Confidential 3-30 Instruction Fetch Unit-the lbox 5 Jc11mary 2001 ···Subject To Chtmge PC Unit and it can be combined with a subsequent fetch block. The term for this is squashing. The following table specifies the cases when the line predictor will learn to squash the natural occurance of a slot 1. Table 3-8 Conditions that Sqaush the Second Fetch Chunk Both fetch chunks are to the same !cache bank (of 8). Both fetch chunks end in a JMP or JSR or JSR-COROUTINE. Both fetch chunks end in a RETor JSR-COROUTINE. The first fetch chunk ends in JSR, BSR, and the second in RET The first or second fetch chunk ends in a CALL_PAL The PC cannot cross a 4Mb virtual address space region delimiter going from fetch chunk 0 to fetch chunk 1 A JSR, JSR-COROUTINE, or BSR is the last instruction of a 4Mb virtual address space region for slot 0 or slot 1 In the hardware, PC calculation is broken down into three components: Table 3-9 Hardware PC Calculation Components Component Bits The high bits <51:22> The middle bits <21:5> The low bits <4:0> For full functionality, PCl is always calculated correctly (the Ibox will never squash slot 0, only slot 1). NextPCO calculation is governed by the squash rules above. The matrixes in Table 3-10 show the three components for the calculation for NextPCO, given those rules. 
Table 3-10 Matrix Legend Matrix Description SEQ Slot exited sequentially, ie no PC changing instruction TBR Slot exited with a taken branch - taken CBR or BR or BSR JPR Slot exited with a JMP or JSR RET Slot exited with a RET CPL Slot exited with a CALL_PAL PCO Input from the original PCO PCO+l PC0<21:5>+ 1 PC0+2 PC0<21:5>+2 JP Input from Jump Predictor JP+l Input from Jump Predictor <21:5>+1 RP nput from top of Return Stack Compaq Confidential 5 January 2001 --· Subject To Change Instruction Fetch Unit - the lbox 3-31 PC Unit Table 3-1 o Matrix Legend Matrix Description RP+l Input from top of Return Stack <21:5>+1 OFO Input from the computed branch target <21 :5> stored in the Icache Data Array for slot 0 OFO+l Input from the computed branch target <21:5> stored in the Icache Data Array+ 1 for slot 0 OFl Input from the computed branch target <21 :5> stored in the Icache Data Array for slot 1 xxx Not a legal combination of slot exits, output comes from the computed PCl, indicated in Table 3-9 Table 3-11 NextPC o Calculation Matrix S1_Exit SO_Exit SEQ TBR JPR RET CPL SEQ PCO PCO JP RP TBR PCO PCO JP RP JPR JP JP xxx RET RP RP JP CPL xxx xxx xxx xxx xxx xxx xxx xxx xxx xxx xxx SEQ PC0+2 OFl JP RP TBR OFO+l OFl JP RP JPR JP+l OFl xxx RET RP+l OFl JP CPL xxx xxx xxx xxx xxx xxx SEQ 0 OFl JP RP TBR 0 OFl JP RP JPR 0 OFl xxx RET 0 OFl JP CPL xxx xxx xxx xxx xxx xxx PC0<51 :22> PC0<21 :5> xxx xxx xxx xxx xxx PC0<4:0> xxx xxx xxx xxx xxx 3.7.2 PC Compare The PC Comparison Logic uses the newly calculated PCs to determine the following: • • • • If the slot 0 and slot 1 predicted !cache indices were correct If the slot 0 and slot 1 cache accesses were hits . If there was an instruction access violation If slot 1 should be squashed. (see PC Cale section above) Compaq Confidentia I 3-32 Instruction Fetch Unit - the lbox 5 January 2001 ··· Subject To Change PC Unit • • If slot 0 and slot 1 accessed the correct way in the Icache When to make a fill request for an Icache miss 3.7 .2.1 Index Mis predicts Each cycle, the line predictor produces up to two Icache indices, which are necessary to maintain a fully pipelined instruction fetch engine. The actual PCs needed to determine the correct next Icache locations to access are not available for a cycle (slotO) or two(slotl) after they are really needed. So the predicted indices that are generated by the line predictor are checked by the real calculated PCs later in the pipeline. The indices produced by the line predictor are bits <14:2> of the anticipated PC. These bits are compared directly with bits <14:2> of the calculated PCs. If the bits match, the index was predicted correctly. If not, an index mispredict is signaled. The slot 0 index is compared in pipe-stage I3 below and the slot 1 index is compared in 14: 11 12 I3 14 Index Pred !cache Access PCOIDXComp PCl IDXComp The Icache can be accessed with the correct index the cycle following an index mispredict. So, for a slot 0 index mispredict there is a 2 cycle penalty, and a 3 cycle penalty for a slot 1 index mispredict. 3.7.2.2 lcache Hit Determination Whether an Icache access hits or misses is also determined by PC comparison. The Icache tag array produces the tag contents of the accessed cache block. The tag contains several components including virtual and physical tags, as described in Section 3.5.2. The Ibox supports two methods of hitting in the Icache for non-PALcode instructions: 1. 
Virtual tag hit- occurs when: The virtual tag matches the bits of the calculated PC <51 :15> AND the ASN of the tag matches the Current ASN of the accessing TPU's process context OR The ASM bit in the tag is set AND the accessing TPU's tpu group matches the tags TPU group valid designation 2. Physical tag hit - occurs when: PC matches the Micro Translation Buffer's (Micro TB) virtual address <51:13> AND the Micro TB 's valid bit is set AND the Micro TB's physical address matches the Tags Physical address<47:13> AND the Micro TB' s ASN matches the Current ASN of the accessing TPU' s context OR The ASM bit in the Micro TB is set AND the tag is valid for any TPU group Virtual tag hits are expected to be the normal way of hitting in the Icache. Essentially, the virtual tag matches and the address space number is correct, or the address space match bit is set and the valid bit is set for that TPU's group. Physical tag hits are supported in the Ibox to facilitate sharing common code among different TPUs. Basically, two programs running on different TPUs should be able to share instructions in the Icache if they map to the same physical address. To facilitate this sharing, the Icache tag Compaq Confidential 5 January 2001 ···Subject To Change Instruction Fetch Unit - the lbox 3-33 PC Unit array holds the physical as well as the virtual tags for all Icache blocks. Since the PC is virtual, a fast virtual to physical address translation also needs to be done to compare with the physical tags coming out of the Icache. The Micro Translation Buffer (Micro TB), holds just one page table entry (PfE) per TPU, and so is inexpensive and provides very fast translation for the newly computed PCs. The Micro TB holds the virtual and physical tags as well as the ASN, ASM and TPU group valid bits of the last block that was fetched from the lcache and was a virtual tag hit. Effectively, its a cache of the tag of the most recent virtual Icache hit. If two TPUs address memory at the same physical address, and use the same virtual index to access the lcache, they can share lcache blocks. The first time a TPU attempts to fetch from a page of Icache blocks that are shared by another TPU, it will miss because the ASN or TPU group valid bits will not match for a virtual cache hit, and the MicroTB will be out of date since this is the first access to a new virtual page. But when this first block is brought into the Icache, it will result in a virtual Icache hit, and the PfE information from the newly fetched block's tag will be written into the microTB so that subsequent lstream accesses to that physical page will physically hit in the cache. PALcode uses a slightly different mechanism to hit in the lcache. All PALcode instruction blocks are mapped physically, so ASN and ASM are not relevant. The virtual tag in the lcache will contain the actual physical address of the instruction block. When PALcode is fetched into the lcache, a bit in the Icache tag is set indicating that the block was physically filled. In order to access physically filled blocks in the Icache, the TPU must be operating in PALmode. An Icache hit occurs in PALmode if the PC matches the virtual tag or physical tag and the TPU and the block was physically filled. lcache miss determination occurs roughly the same time as index mispredict determination in the Ibox pipeline. Once the PCs have been calculated, they are compared with cache tags to determine if there is an lcache miss. 
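A sketch of the two hit conditions of Section 3.7.2.2 follows. The exact grouping of the ASN/ASM/TPU-group terms is one reading of the prose above, and the structure fields are invented for the example; treat this as an approximation of the tag-compare conditions, not the logic itself.

#include <stdbool.h>
#include <stdint.h>

typedef struct {                 /* relevant fields of an Icache tag entry */
    uint64_t vtag;               /* virtual tag, PC<51:15>                 */
    uint64_t ptag;               /* physical tag, PA<47:13>                */
    uint8_t  asn;
    bool     asm_bit;
    uint8_t  tg_valid;           /* per-TPU-group valid bits               */
} itag_t;

typedef struct {                 /* the one-entry Micro TB for this TPU    */
    uint64_t va;                 /* VA<51:13>                              */
    uint64_t pa;                 /* PA<47:13>                              */
    uint8_t  asn;
    bool     asm_bit, valid;
} microtb_t;

static bool virtual_tag_hit(const itag_t *t, uint64_t pc,
                            uint8_t cur_asn, uint8_t cur_tg_mask)
{
    bool vtag_match = (t->vtag == (pc >> 15));
    return vtag_match &&
           ((t->asn == cur_asn) ||
            (t->asm_bit && (t->tg_valid & cur_tg_mask)));
}

static bool physical_tag_hit(const itag_t *t, const microtb_t *u, uint64_t pc,
                             uint8_t cur_asn)
{
    bool utb_match = u->valid && (u->va == (pc >> 13)) &&
                     ((u->asn == cur_asn) || u->asm_bit);
    return utb_match && (u->pa == t->ptag);
}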
If there is an lcache miss, the pipeline stages prior to the cycle that the Icache miss has been determined are aborted, the fill unit is informed of the newly requested address. The Fetch Thread Chooser is also informed so that it will not choose TPUs with lcache fill requests in progress. 3.7.2.3 lcache Access Violation: An Icache access violation occurs when an lcache block is fetched and is a hit, but the context of the running process does not have privileges to access that particular block. Each block in the Icache has one of the four privileges: U - user read enable S - supervisor read enable E - executive read enable K - kernel read enable The USEK bits are set in the PTE entry for a particular block, and are filled into the Icache Tag Array during a normal Icache fill flow. The current process context for a TPU also has a designated USEK privilege level. An access violation interrupt is initiated when the process context USEK for a TPU does not match the USEK designation written into the tag array for an lcache block that is a hit. Compaq Confidential 3-34 Instruction Fetch Unit - the lbox 5 January 2001 - Subject To Clumge PC Unit 3.7.2.4 lcache Way Mispredict Determination: The !cache is pseudo-2way set-associative. It is 2way set-associative because instruction blocks can map into two different indexed locations in the !cache. In a standard 2way set-associative cache, both potential block locations are read out, the both sets of tags are compared and if either of the tags matches the accessing address, a hit is signaled and the appropriate block is selected. In the 21464's "pseudo" 2way set-associative !cache, instead of the simultaneous access method used in a standard 2way setassociative cache, one way is predicted and that block is read out. If that block is the wrong one, the other block is read out subsequently. This avoids extending an already critical path in the !cache access path, and keeps the processor's cycle time short. The PC Compare logic is responsible for determining if an !cache access was to the correct way. If the index generated by the line predictor is correct (ie, bits <15:2> of the index match the PC), but the tag does not match, there is a potential way mispredict. Each blocks tag in the !cache Tag Array stores a subset of the tag that was last filled in the alternate way (See !cache Tag Array subsection in the Instruction Unit section). If those alternate tag virtual address bits <23:15> match those of the accessing PC, a way misprediction is signaled. The Line Predictor, which is responsible for predicting the !cache way, is trained to predict the alternate way next time. On a way mispredict the Ibox pipeline is aborted for the faulting TPU and then restarted accessing the alternate way in the !cache. Since not all the alternate tags virtual address bits were matched, the second access is not guaranteed to be an !cache hit. It could result in an !cache miss if the upper bits of the virtual address tag did not end up matching. It could also result in another way mispredict, if bits <23:15> of the originally accessed way also match the PC, but the upper bits do not match. This can result in a deadlock, where the two Icache locations ping pong back and forth, each time resulting in a way mispredict. To avoid deadlock, the Line Predictor remembers if we already suffered a way mispredict while trying to access the current PC. The second time, an !cache miss will be signaled instead. 
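The way-mispredict test described above reduces to a comparison of the alternate way's partial tag with the PC, plus the deadlock guard. A hedged sketch with invented names:

#include <stdbool.h>
#include <stdint.h>

/* Called when the line-predictor index was correct but the accessed way's
 * tag did not match.  alt_partial_tag is the alternate way's stored
 * VA<23:15>.  If this PC already suffered one way mispredict, the Line
 * Predictor signals a miss instead, to avoid ping-ponging between ways.  */
static bool signal_way_mispredict(uint64_t pc, uint32_t alt_partial_tag,
                                  bool already_retried_this_pc)
{
    uint32_t pc_partial = (uint32_t)((pc >> 15) & 0x1ff);   /* PC<23:15> */
    return !already_retried_this_pc && (pc_partial == alt_partial_tag);
}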
!cache way mispredicts are signaled in the cycle after index mispredicts are normally signaled: Table 3-12 lcache Mispredict Signalling I1 12 I3 14 WayPred IC Access PCOIDXCmp PCOWayMisp PC 1 Index Cmp I5 PCl WayMisp 3.7.2.5 Instruction Cache Fill Request: When a correctly indexed !cache access is not a hit and not a way mispredict, a fill request is signaled to fetch the instruction block from lower level memory. Since the 21464's second level cache is physically indexed and tagged, the Ibox must send a physical address along with the fill request to receive the appropriate data. The Ibox has two sources for translating the virtual PC into a physical address: • The MicroTB (See Section 3.7.2.2.) • The 128 Entry Instruction Translation Buffer (ITB) (See Section 3.8.) Compaq Confidential 5 Janw1ry 2001 ··· Subject To Change Instruction Fetch Unit - the lbox 3-35 Fm Unit In order to save power, the main 128 Entry ITB is not accessible every cycle. Furthermore, it is only necessary to access the main ITB when the MicroTB does not contain the proper PTE. If there is an !cache miss and the MicroTB VA tag matches the upper PC bits and the address space comparison matches, the PA found in the PTE is the correct missing physical page frame number. The page frame number <47:13> is the upper portion of the physical address. It is concatinated with the page offset <12:6> to form the complete physical address of the missing !cache block. If the correct PTE is not found in the 1 entry MicroTB, the main ITB must be accessed to retrieve the page translation. The ITB cannot be accessed immediately because it was not operating to reduce power consumption. The PC Compare logic causes an Ibox pipeline abort and restart, and sends a signal to the Line Predictor indicating that the main ITB needs to be enabled for the next fetch attempt. The next time the missing TPU is chosen by the Index Unit, the Line Predictor will send a signal to the main ITB, which prepares it to be accessed. The next time the !cache miss is detected the ITB will lookup the PC's VA and use the page frame number found in the matching entry to generate the physical address for the fill request. 3.8 Fill Unit 3.8.1 Instruction Translation Buffer The 21464 has a virtually addressed instruction cache. All memory references outside the CPU core (including the Scache and off-chip memory) are physically addressed. In the event of an !cache miss, the translation buffer's main task is to determine, as quickly as possible, the physical address of the cache line in which the miss occurred so that it can be fetched by the Cbox. The ITB contains only a subset of all possible address translations, called page table entries (PfEs). Because the ITB itself is a 'cache' of PTEs, it is possible that when an !cache miss occurs and the virtual address is given to the ITB for translation, the PTE is not found. In this case, a trap causes a PALcode routine to lookup the correct PTE from a software table and use an IPR to write the translation into the ITB. So far, this operation is consistent with the 21264 ITB. However, unlike the 21264, the 21464 includes hardware support for simultaneous multithreading, which has the following implications for the ITB: • When an !cache miss occurs, only one TPU is affected. It is important that performing the PTE lookup and doing the !cache fill does not stall the pipeline so that other TPUs can continue execution. 
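The physical address handed to the fill unit is simply the PFN concatenated with the untranslated low PC bits. A one-line sketch follows; the function name is illustrative, and the zeroed bits <5:0> reflect the block-aligned fill requests implied by the <12:6> offset.

#include <stdint.h>

/* pfn is PA<47:13> from the MicroTB or main ITB; pc supplies the page
 * offset bits <12:6>.  Bits <5:0> are zero because fill requests are
 * made at Icache fill-block granularity.                               */
static uint64_t fill_physical_address(uint64_t pfn, uint64_t pc)
{
    return (pfn << 13) | (pc & 0x1fc0);
}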
• Because ITB fills are completely independant of ITB lookups, care must be taken to reduce the possibility of one TPU's writes interfering with another TPU's read. • TPUs operating independantly of each other in separate thread groups (TGs) can access the same physical page. To prevent the !cache from storing the same data twice (and thus requesting two ITB lookups), a new mechanism is needed to detect physical address sharing between TPUs. For operating system code, sharing already occurs between processes within a TG (identified by a distinct ASN) by using the ASM mechanism. Compaq Confidential 3-36 Instruction Fetch Unit-the lbox 5 January 2001 ···Subject To Clumge Fm Unit • Because each TG is an independant entity to which TPUs can belong (like separate CPUs), PfEs belong to exactly one TG. The ITB stores four one-hot valid bits that indicate to which TG the PTE belongs. TGs do not share PfE entries. The !cache, however, does not signal a miss when there is physical address match, thus preventing the ITB lookup. A PTE hit is determined as follows: pte_VA<51:13> == current_PC<51 :13> AND pte_TG<3:0> == current_TG<3:0> AND (pte_ASN<7:0> == current_ASN<7:0> OR pte_ASM == '1') An !cache miss penalty is significantly reduced because the 21464 includes an onboard, second-level cache (Scache). Thus, time taken for address translation becomes a significant part of the !cache miss penalty, and it is important that the ITB provides a physical address for the fill as soon as possible. As in the 21264, 8k page sizes are supported. The 21464 can additionally support 64k pages sizes. Granularity hinting is allowed on 64k pages to provide up to 512MB effectively sized pages. 3.8.1.1 Architecture For the first time in an Alpha implementation, the ITB consists of a pseudo two-level 'cache' of PfEs: a first-level micro ITB (uITB) and a second-level main ITB (the ITB). The first-level uITB is a single PfE entry for each TPU. It effectively contains the last good address translation that the !cache accessed for each TPU; the uITB is updated any time a TPU virtually hits in the !cache. The PTE information for the update comes from the !cache tag and not from the main ITB array. For ease of implementation, only the first fetch can update the ulTB (the 21464 fetches twice per cycle). When there is an !cache miss, the ulTB is quickly checked to see if it contains the correct PfE for the fill. If the PfE is good, it is sent to the fill unit and a cache miss is signaled. The physical address is available at the fill unit just two cycles after the miss was detected. The second-level ITB is a 128-entry fully associative 'cache'. Writes are organized as round-robin by using simple head/tail pointer logic. Simultaneous read/write is not possible, so read scheduling is important. Reads are pipelined across one and a half cycles as follows: i3b i4a i4b i5a Cam ASN/ASM and Group Valid Cam VA Read PIE Send PIE to fill unit Compaq Confidential 5 Janu~1ry 2001 ·- Subject To Change Instruction Fetch Unit - the lbox 3-37 Fill Unit To save power, the main ITB is only activated for lookup operations upon a uITB miss (cache miss is implied). This causes a penalty of at least six cycles between cache miss detection and when the physical address is available to the fill unit. For simplicity, the non-index restart mechanism in pipe control is used to enable main ITB lookup. What happens is as follows: Cycle Event 0 (I3) Cache miss detection, uITB determined to be wrong. 
1 (I4) Non-index fault is signaled, causing the PC to be replayed in the pipe. Icache miss is NOT signaled.
2 (I1) Index is sent to the Icache.
3 (I2) Icache is read.
4 (I3) Main ITB is enabled. Icache miss detection.
5 (I4) Icache miss signaled. Main ITB lookup in progress.
6 (I5) PTE sent to fill unit.

It is possible for this penalty to be longer than six cycles if the TPU is not selected immediately after the non-index fault. The main ITB is very similar to the 21264 ITB. Superpage detection and invalidation operate the same way. Additionally, a new invalidate, TBIAG, invalidates all entries in all TGs. Superpages are supported in the main ITB as follows:

Table 3-13 Superpage Support in the Main ITB

Superpage    Description
Superpage0   Direct maps one quarter of WindowsNT's 32-bit address space. The kernel code is kept in this area of memory. It is believed that 64-bit NT will use the Unix superpage mode.
Superpage1   Direct maps the least significant 41 bits of the physical address space (bits <47:41> sign extended) to support older versions of UNIX and VMS. This superpage is consistent with the 43-bit virtual address supported by EV4 and EV5 and the size of the 3-level VPTEs used in Digital UNIX (see SRM Digital UNIX II-B, Section 3.1.1).
Superpage2   Direct maps the whole of the physical address space for more recent versions of UNIX and VMS, which may use four-level PTEs.

If I_CTL[SPE<2>] = 1 AND VA<51:50> = "10" Then PA<47:13> = VA<47:13>, USEK = "0001"
If I_CTL[SPE<1>] = 1 AND VA<51:40> = "111111111100" Then PA<47:13> = CONCAT("0000000", VA<40:13>), USEK = "0001"
If I_CTL[SPE<1>] = 1 AND VA<51:40> = "111111111101" Then PA<47:13> = CONCAT("1111111", VA<40:13>), USEK = "0001"
If I_CTL[SPE<0>] = 1 AND VA<51:30> = "1111111111111111111110" Then PA<47:13> = CONCAT("000000000000000000", VA<29:13>), USEK = "0001"

Address Space Match (ASM) is supported as in the 21264. When an entry in the main ITB has the ASM bit set, matches against the ASN are ignored when determining a TB hit. The uITB also includes this support for hit detection. Both the main ITB and the uITB utilize the full physical address space permitted, <47:13>. PFNs are limited to 32 bits by software, which yields a 64K-page PFN of <47:16> and an 8K-page PFN of <44:13>. When in 8K-page mode, the main ITB sign extends the 45-bit physical address up to bit <47> when filling PTEs, and in 64K-page mode the ITB bypasses VA bits <15:13> from the PC into physical address bits <15:13> when reading. To correctly match entries in 64K-page mode, the main ITB ignores VA bits <15:13> because they are not part of the PFN. The uITB also ignores VA bits <15:13> when performing a VA match in 64K-page mode. Granularity Hint (GH) is handled in the main ITB in the same manner as in the 21264. Special CAM structures on the VA bits affected by GH can disable miss detection on those bits, giving the appearance of an ITB hit on a seemingly larger page. Upon reading the PA out of the main ITB, the affected VA bits are muxed into the corresponding bits of the PA to return the physical PFN. Note that the ITB always returns PTEs for base-size (8K or 64K) pages. Essentially, GH allows an ITB entry to cover multiple contiguous base-size page translations. Here are the different GH mappings:

Table 3-14 Granularity Hint (GH) Mapping
Page Mode       gh<1:0> == 00         gh<1:0> == 01         gh<1:0> == 10          gh<1:0> == 11
8k page size    TB entry covers 8K    TB entry covers 64K   TB entry covers 512K   TB entry covers 4M
64k page size   TB entry covers 64K   TB entry covers 2M    TB entry covers 64M    TB entry covers 512M

The uITB does not support granularity hinting or superpages explicitly. This is taken care of because the uITB just contains an explicit page translation that comes from the main ITB in a roundabout fashion. For example, the first request in a superpage region results in a main ITB read. The result of this read returns the hardwired superpage PA for that VA. The fill unit fetches the required Icache blocks and writes the hardwired PA into the Icache tags. The next time that block is fetched from the Icache successfully, the uITB is updated to contain the hardwired superpage PA. Granularity-hinted pages are only stored as a base-page-size translation in the uITB. Jumps outside of a base page cause a uITB miss, although the main ITB will hit on the same translation that filled the uITB.

3.8.1.2 IPRs That Affect the ITB

Table 3-15 IPRs that Affect the ITB

IPR        Effect on the ITB
ITB_TAG    This IPR contains the VA used for filling the main ITB and also for performing invalidate operations. There is one for each TPU.
ITB_PTE    This IPR contains the PTE used for filling the main ITB. Retirement of the MTPR to this IPR causes the ITB_TAG and ITB_PTE contents to be written into the main ITB. There is one for each TPU.
ITB_IASN   When a MTPR to this IPR retires, all entries for the current ASN and TG are invalidated. The Icache must be invalidated for the current TG and the uITB invalidated. There is one for each TPU.
ITB_IA     When a MTPR to this IPR retires, all entries in the current TPU's thread group are invalidated. The Icache must be invalidated for the TG of the current TPU. The uITB must also be invalidated for the TPU. There is one for each TPU.
ITB_IS     When a MTPR to this IPR retires, any entry that matches the VA in ITB_TAG and matches the current TPU's ASN and TG is invalidated. The Icache must be flushed and the uITB invalidated. There is one for each TPU.
ITB_IAP    When a MTPR to this IPR retires, any entry that is valid for the current TG and whose ASM bit is not set is invalidated. The Icache must be flushed and the uITB invalidated. There is one for each TPU.
I_CTL      When a MTPR to this IPR retires, the SPE<2:0> bits in this IPR enable the three superpage modes. Each TPU has its own I_CTL IPR.
PCTX       When a MTPR to this IPR retires, the ASN<7:0> bits in this IPR indicate which ASN is assigned to the TPU. Also, the TPU_GRP<3:0> bits in this IPR indicate which thread group the TPU belongs to. Each TPU has its own PCTX IPR.
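As an illustration of the I_CTL[SPE] modes enabled through the I_CTL IPR, the following sketch applies the three superpage mappings to a virtual address. The bit patterns follow the equations in Section 3.8.1 as reconstructed above, so treat this as a best-effort reading rather than a definitive statement of the hardware; in all three cases USEK is forced to kernel-only ("0001"), which is omitted here.

#include <stdbool.h>
#include <stdint.h>

/* Superpage translation per I_CTL[SPE<2:0>].  spe is the 3-bit enable
 * field; on a match, *pa_47_13 receives PA<47:13> and true is returned.
 * Names are invented for the example.                                   */
static bool superpage_translate(uint64_t va, unsigned spe, uint64_t *pa_47_13)
{
    if ((spe & 4) && ((va >> 50) & 3) == 2) {               /* VA<51:50> == 10 */
        *pa_47_13 = (va >> 13) & ((1ull << 35) - 1);        /* PA<47:13> = VA<47:13> */
        return true;
    }
    if ((spe & 2) && ((va >> 41) & 0x7ff) == 0x7fe) {       /* VA<51:41> == 11111111110 */
        uint64_t lo = (va >> 13) & ((1ull << 28) - 1);      /* VA<40:13> */
        uint64_t hi = ((va >> 40) & 1) ? 0x7f : 0x00;       /* sign-extend VA<40> into PA<47:41> */
        *pa_47_13 = (hi << 28) | lo;
        return true;
    }
    if ((spe & 1) && ((va >> 30) & 0x3fffff) == 0x3ffffe) { /* VA<51:30> == 1111111111111111111110 */
        *pa_47_13 = (va >> 13) & ((1ull << 17) - 1);        /* PA<47:30> = 0, PA<29:13> = VA<29:13> */
        return true;
    }
    return false;
}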
• The MTPR to the ITB _PfE register must be followed by an IFETCHB to ensure the main ITB state is updated before it is used. The uITB gets written by a much more ciruitous route. It starts with a cache miss which requires the main uITB to supply the fill unit with a PTE (PA, USEK bits, and the ASM bit) for the page being requested. When the fill data returns, the fill unit supplies this Compaq Confidential 3-40 Instruction Fetch Unit - the lbox 5 Jc1nuary 2001 - Subject To Change Fm Unit PTE to the Icache Tags for writing. The final step to writing the uITB requires that the Icache virtually hits on an entry containing the this PTE. The virutal hit causes the uITB to be written with PTEfrom the Icache Tags. 3.8.1.3.2 Reads Reads are explained previously. 3.8.1.3.3 Invalidates There are four invalidate operations for the ITB. Table 3-16 ITB Invalidate Operations Operation Description Invalidate All Invalidates all entries within the current TPU's TG. Requires only a MTPR to the TB_IA IPR. The actual invalidate is performed upon retire of the MTPR. Additionally, the Icache must be invalidated for this TG. An IFETCHB must follow the MTPR to ensure the ITB is up to date before it is accessed again. Invalidate ASN specific Invalidates all entries in the ITB that match ASN and TG of the current TPU. Requires only a MTPR to the TB_IASN IPR. The actual invalidate is performed upon retire of the MTPR. Additionally, the lcache must be invalidated for the current TG. An IFETCHB must follow to ensure the ITB is up to date before it is accessed again. Invalidate Single Invalidates a single entry specified by VA, ASN, and TG. Two MTPRs are required. The MTPR to the ITB _TAG IPR must ocurr before the MTPR to the TB_IS IPR. Upon retireation of the TB_IS MTPR, the ITB_TAG VA, the current ASN, and the TG of the current TPU will cam against the contents of the main ITB. Any matching entries will be invalidated. Additionally, the Icache must be invalidated for this TG. An IFETCHB must follow to ensure the ITB is up to date before it is accessed again. Invalidate Process Specific Invalidates all entries within the current TPU's TG which do not have the ASM bit set. The actual invalidate is performed upon retire of the MTPR. Additionally, the lcache must be invalidated for this TG. An IFETCHB must follow the MTPR to ensure the ITB is up to date before it is accessed again. 3.8.2 Instruction Fill Unit The Instruction Fill Unit (IFU) is responsible for fetching instructions when an !Cache miss occurs. It consists of two sections: Request and Fill. The Request section itself is made up of two subsections: Demand and Prefetch. The Demand subsection handles !Cache misses detected by the Ibox pipeline, while also recording and sending all Ibox memory requests to the Mbox preMAF for servicing. The Prefetch subsection generates a fixed number of consecutive memory requests ahead of the original miss and routes them to the Demand unit for Mbox handling. The Mbox preMAF funnels together both Instruction stream and Data stream requests and delivers them to the Cbox for fetching. As the I stream requests are satisfied, the resultant instructions are sent from the Cbox to the Fill section of the IFU for predecoding and loading into the ICache. The following simplified block diagram shows the IFU and its Request and Fill sections in relation to the !Cache, lbox Pipeline, Mbox preMAF, and Cbox return logic. 
Figure 3-5 Instruction Fill Unit (IFU) Request and Fill Sections

A fundamental assumption in the design of the IFU is that memory requests are never cancelled once they have been sent to the Mbox. Dropping an unneeded request would be dangerous because a remote part of the IFU might simultaneously decide that the request was required after all. Furthermore, the minor benefit of cancelling certain requests would not be worth the additional hardware cost of tracking dropped requests.

3.8.2.1 Demand Misses

The Demand subsection of the Request portion of the IFU is responsible for sending ICache miss requests to the Mbox preMAF for servicing by the SCache and/or memory. A simplified block diagram of the Demand subsection appears in Figure 3-6. The physical address (PA) and pte_valid input signals come from the ITB, while all of the others are from the Ibox pipeline. The fill_request signal serves as the valid bit for demand requests from that pipeline.

Figure 3-6 Instruction Fill Unit (IFU) Demand Subsection

The Index CAM Array is used to determine if there is another fill request currently outstanding to the same ICache index and way as the current valid miss. For a 64 kB instruction cache, fill_VA<14:6> is the index, while fill_VA<15> is the way bit unless the flip_way signal from the Thrash Detector indicates that bit should be inverted. The Pretag Array stores fill_VA<51:6> and ITB information for each request, all of which is retrieved when the fill instructions return from the Cbox. Together, these two arrays are referred to as the Entry Arrays; simulation studies have shown that 32 entries are appropriate. The Freelist is a stack whose top indicates the next available free location in the Entry Arrays. Because there can exist at most one demand per TPU, the Demand Array contains four pointers (with valid bits) that indicate where a given TPU's demand resides in the Entry Arrays. Each Demand Array entry also contains a "piggyback" bit, detailed below. The Stall Logic is used to record which TPUs attempted to use the IFU when it was full, or when either the Mbox preMAF or Cbox MAF were full.

3.8.2.1.1 Demand case: simple

The first case to consider is a simple demand. The fill_request signal goes high at the start of cycle I5, indicating a valid miss. The pte_valid signal is true for this simple case, while both PMF_full and flip_way are not. The Index CAM Array is probed: in a simple demand, there is no match with any valid entry. Starting in I6, the output of the Index CAM Array therefore indicates that this is a legitimate miss, so Ifetch_vld goes high. The Mbox preMAF uses this signal to validate the Ifetch_pa, consisting of PA<47:13> from the ITB and fill_VA<12:6> from the Ipipe, and the Ifetch_ptr, which is a token used to uniquely identify this request. Simultaneous with the request shipment to the Mbox in I6, the Entry Arrays are written with data at the index indicated by Ifetch_ptr; in the following phase, the Freelist stack is popped.
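The Demand-subsection bookkeeping named so far can be pictured with a few data declarations. This is a software caricature with invented field names, using the 32-entry sizing from the text; it is meant only to show how the Index CAM/Pretag entries, the Freelist, and the per-TPU Demand Array relate to one another.

#include <stdbool.h>
#include <stdint.h>

#define IFU_ENTRIES 32                 /* Entry Arrays depth (from simulation studies) */
#define NUM_TPUS    4

typedef struct {
    bool     valid;
    uint16_t index_way;                /* fill_VA<15:6>: Icache index plus way bit     */
    uint64_t va_51_6;                  /* Pretag: fill_VA<51:6>                        */
    uint64_t pa_47_13;                 /* Pretag: PA<47:13> from the ITB               */
    uint8_t  asn;
    uint16_t hit_conditionals;         /* 11 bits: physical_fill, console,             */
                                       /* tg_valid<3:0>, ASM, USEK<3:0>                */
} ifu_entry_t;

typedef struct {
    ifu_entry_t entries[IFU_ENTRIES];  /* Index CAM Array + Pretag Array (Entry Arrays)*/
    uint8_t     freelist[IFU_ENTRIES]; /* stack of free entry indices                  */
    int         free_top;
    struct {                           /* Demand Array: at most one demand per TPU     */
        bool    valid;
        bool    piggyback;             /* set when riding an existing request          */
        uint8_t entry;                 /* index into the Entry Arrays                  */
    } demand[NUM_TPUS];
    uint8_t     stall_bits;            /* TPUs turned away while resources were full   */
} ifu_demand_state_t;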
The Pretag Array caches the following signals: fill_ VA<51:15> and ASN<7:0> from the Ipipe, PA<47:13> from the ITB, and 11 bits of control called the hit conditionals. These latter bits consist of physical_fill, console, and tg_valid<3 :0> from the Ipipe, along with ASM and USEK<3:0> from the ITB. Finally, the fill_ way bit is appended to fill_VA<14:6> and stored in both of the Entry Arrays. The Demand Array is written in I6 with the lfetch_ptr at the location indicated by the TPU number. 3.8.2.1.2 Demand case: index and way match of active request: "piggybacking" In order to avoid potential livelock cases, the IFU allows only one outstanding memory request to a given !Cache index and way at any time. A novel technique, denoted "piggybacking", is used to handle the request if such a match occurs. Following the simple case above, the input control signals are the same, but here, the Index CAM Array indicates a match. This forces the lfetch_vld signal to become false to prevent the fetch from occurring. The Demand Array is written at the requesting TPU with the pointer to the entry that matched the request, and the piggyback bit for the same TPU is set. When the instructions from the original request arrive from the Cbox, the Fetch Thread Chooser (FTC) is notified to restart the corresponding TPU. A few cycles later, the FTC is allowed to restart any other TPU that piggybacked onto that request. The use of the term "piggybacking" for this method is now apparent, because any subsequent demand misses that match an active request ride along with that request. Recall that only the ICache index and way are checked for piggybacking, not the higher-order VA bits. Simulation studies have shown that these bits often match as well. Most often, this is caused when a redirected goodpath restart requests a block already desired during badpath execution, or when a demand miss contains a short forward branch to code being prefetched. If the higher-order VA bits to not match, the TPU restart of any piggybacked entry results in a new miss. 3.8.2.1.3 Demand case: flip_way active The ICache way into which a given miss will fill is determined at miss time. The majority of requests will fill into their "natural" way, which is fill_VA<l5> in a 64 kB ICache. The Thrash Detector determines under which circumstances flip_ way goes high, indicating that the complement of fill_ VA<l5> (known as the "alternate" way) should be used. The decision whether or not a demand miss must piggyback is a function of what the fill_way is determined to be. Consequently, the Thrash Detector output must be read and the fill_ way altered before the Index CAM Array is probed for a match. Otherwise, a miss that doesn't match an active request in its natural way might have its way toggled and have an alternate way match that is not detected. Compaq Confidential 3-44 Instruction Fetch Unit - the lbox 5 January 2001 -~ Subject To Change Fm Unit 3.8.2.1.4 Demand case: capacity stall There is finite storage for handling memory requests, both in the IFU (in the Entry Arrays) and beyond (in the Mbox preMAF and the Cbox MAF). A full_resource signal is raised when any or all of these storage areas are full, taking into account any in-flight delays. If a demand miss arrives when full_resource is high, a stall bit corresponding to the TPU number of the request is set. When the full_resource line falls, the stall bits are sent to FTC, indicating which threads must retry their requests. 
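Putting the demand cases together, the per-miss decision looks roughly like the sketch below. The enum and signal names are invented, and the Thrash Detector's way flip is assumed to have already been applied to the index before the Index CAM probe, as Section 3.8.2.1.3 requires.

#include <stdbool.h>
#include <stdint.h>

typedef enum { REQ_SEND, REQ_PIGGYBACK, REQ_STALL } demand_action_t;

/* Decision made for a confirmed demand miss from a given TPU. */
static demand_action_t handle_demand(bool full_resource,
                                     bool index_cam_match,   /* outstanding request to same index/way */
                                     unsigned tpu,
                                     uint8_t *stall_bits)
{
    if (full_resource) {               /* IFU entries, preMAF or MAF full            */
        *stall_bits |= (uint8_t)(1u << tpu);  /* remember which TPU must retry (FTC) */
        return REQ_STALL;
    }
    if (index_cam_match)               /* ride along with the outstanding request:   */
        return REQ_PIGGYBACK;          /* set the piggyback bit and point the Demand */
                                       /* Array at the matching Entry Arrays index   */
    return REQ_SEND;                   /* allocate an entry and send to the Mbox     */
}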
3.8.2.2 Prefetching Once a demand miss has been confirmed by the Ibox pipeline, the Prefetch subsection can generate memory requests. Because the IFU interface to the Mbox preMAF can accept one memory request per cycle, the Prefetch subsection generates a single request per cycle and routes it to the Mbox through the Demand subsection interface. The prefetch requests are for consecutive !Cache blocks beyond the confirmed miss address. The maximum number of such prefetch requests that are generated for a given miss is determined by a per-TPU IPR value. The actual number sent to the Mbox may be less than the maximum due to filtering. A simplified block diagram of the Prefetch subsection appears in the figure below. There are four 101-bit Capture Registers (one per TPU), each of which saves all of the required information about a confirmed Demand miss that will be needed to generated prefetch requests (specifically, fill_ VA<51 :6>, ASN<7:0>, and fill_way from the Demand subsection, PA<47:13> from the ITB, and the 11-bit hit conditionals from both). The Filter Array is essentially a copy of the ICache Tag array in that it contains 2 sets of 512 entries each (for a 64 kB !Cache), but this copy stores a hashed version of the tags in order save chip area. The Index CAM Probe determines if another active request shares the same !Cache index and way as the feedback_VA. It is shared with the Index CAM Array in the Demand subsection. Compaq Confidential 5 January 2001 ··· Subject To Change Instruction Fetch Unit - the lbox 3-45 Fm Unit Figure 3-7 Instruction Fill Unit (IFU) Prefetch Subsection -r~'.it:~f :;A.r~~:f:: raaan?... ....... Xn-~~l::'>iri.a~i: . ::::;.··:::··:~:·:·::·:·.·::· .. :·<:·::·:\.::; ·i;;.:...,....,..,.llli'I <.":'. ):J?#. ~~;"¥:~::, .. .... ' ·:· >·:i'AT.J: ~Mi> An unusual characteristic of the Prefetch Subsection is that it conceptually exists in two different time domains. Simulation studies have shown that demand misses should always proceed ahead of prefetch requests, so a portion of the design uses recirculating I latches between clock stages to "freeze" the prefetch state while a demand miss is sent to the Mbox. Yet certain inputs arrive from the Demand and Fill Subsections that cannot be frozen without data loss, so they must be handled immediately. More specifically, the inputs to the Capture Registers and the Filter Array are stored as soon as they are valid. The different time domain "worlds" in the prefetcher are distinguished by a dashed line in the figure. 3.8.2.2.1 Prefetch case: simple Activity in the prefetcher begins when an ICache miss is confirmed by fill_request in the Demand subsection. When that is using the Index CAM Array to check the index and way of the demand miss, the prefetcher feeds the VA through the Source Mux to look up the hashed tag, while the same fill_ VA is both incremented and hashed in parallel. Ideally, an incremented fill_VA would be used to index the Filter Array, but there is insufficient time to do both the increment and the lookup in a single cycle; instead, the index to which the hashed version of the banked_VA is written is decremented before writing the array. ICache_miss confirmation also initializes the per-TPU Range Counter in IS. Compaq Confide11tia I 3-46 Instruction Fetch Unit - the lbox 5 J,1nuary 2001 -· Subject To Clumge Fm Unit The appropriate Ipipe contents are latched into one of the Capture Registers in 16. 
This also triggers the Tag Compare of the hashed Filter Array tags with the hashed fill_ VA: in the simple case, there are no matches. The Index CAM Array probe also occurs in 16, as long as the Demand subsection does not need the shared hardware. The Capture Register data is then used to construct a request that is sent to the Mbox via the port in the Demand Subsection. Because this prefetch pipeline is one cycle longer than the demand one, the first prefetch can be sent to the Mbox in the cycle after the demand has been sent. This is critical, because the prefetched blocks most likely to be needed for execution are those most near the demand miss. Once it is confirmed that this prefetch request has been accepted, the Confirm box sends a decrement signal to the per-TPU Range Counter, which stops the generation of new prefetch requests when the range becomes zero. Until then, the Source Mux will select the feedback_ VA as its input, which is the fill_ VA from the previous cycle incremented by the !Cache block address. Simulation studies have shown that the optimal number of consecutive blocks to fetch ahead of the demand miss is usually between 2 and 4. The Range Counters are therefore 3-bit counters, allowing a maximum fetchahead distance of 7 !Cache blocks. 3.8.2.2.2 Prefetch cases: tag match or page boundary crossing A variety of conditions may keep the number of prefetch requests sent to the Mbox below the value specified by initial Range Counter value. First, if the Tag Compare unit detects that a hashed tag in the Filter Array matches a hashed version of the Source Mux VA, it is highly likely that the stream of prefetch requests will be for instructions already (or soon to be) resident in the !Cache. In order to preserve memory bandwidth, prefetching is squashed (stopped) by zeroing the proper Range Counter and invalidating the matching request before it is sent to the Mbox. Any request that crosses an 8K page boundary is also squashed because its PA would require an ITB translation different from that stored in its Capture Register (superpage handling is TBD). 3.8.2.2.3 Prefetch case: Index CAM match If the Index CAM Probe reports a match without a Tag Compare match or page cross- ing, the given request is skipped (by forcing Ifetch_vld false) but prefetching is not squashed. Recall that an Index CAM match indicates that there is another currently-outstanding request to the same index and way as the probe. Because prefetching is inherently speculative, it is considered too risky to have a prefetch request displace another !Cache request, particularly if the other is a demand miss. 3.8.2.2.4 Prefetch case: alternate TPU demand during prefetching One TPU may produce a demand miss and start prefetching when another TPU also confirms a miss. When this happens, the recirculating latches "freeze" the state of the prefetcher while the new demand miss is sent to the Mbox and the new demand state is captured in the proper Capture Register. The prefetcher then resumes running in the following cycles until the appropriate Range Counter is zero. The New Start logic then notices that the new demand state for the alternate TPU is ready, so prefetching proceeds for that TPU. More than one Capture Register may have valid state ready for prefetching. This requires the New Start logic to implement a picker to select amongst multiple ready TPUs. Simulation has shown that this is a very rare occurrence, so any simple picking algorithm is acceptable. 
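The prefetch-generation rules of Section 3.8.2.2 amount to a bounded loop with two stopping conditions and one skip condition. The sketch below assumes a 64-byte fill block and uses invented predicate names standing in for the Filter Array hash compare and the shared Index CAM probe.

#include <stdbool.h>
#include <stdint.h>

#define ICACHE_BLOCK_BYTES 64   /* assumed fill-block size for the example */

/* Generate up to 'range' sequential prefetch addresses beyond a confirmed
 * demand miss, writing them to out_va[] and returning how many were sent. */
static int generate_prefetches(uint64_t miss_va, unsigned range,
                               bool (*filter_hit)(uint64_t va),
                               bool (*index_cam_hit)(uint64_t va),
                               uint64_t out_va[])
{
    int sent = 0;
    uint64_t va = miss_va;

    for (unsigned i = 0; i < range; i++) {
        va += ICACHE_BLOCK_BYTES;                 /* next sequential Icache block    */
        if ((va >> 13) != (miss_va >> 13))        /* crossed the 8K page: squash     */
            break;
        if (filter_hit(va))                       /* likely already in the Icache:   */
            break;                                /* squash remaining prefetches     */
        if (index_cam_hit(va))                    /* outstanding request to the same */
            continue;                             /* index/way: skip but do not      */
                                                  /* squash                          */
        out_va[sent++] = va;                      /* hand to the Demand interface    */
    }
    return sent;
}

The two "break" conditions mirror zeroing the Range Counter, while the "continue" mirrors forcing Ifetch_vld false for a single request.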
3.8.2.2.5 Prefetch case: badpath indication during prefetching

Again, once a request has been sent to the Mbox, it cannot be cancelled. However, unsent requests in the Prefetch Subsection pipeline are dropped if a badpath indication is received for the same TPU as the prefetch requests. This allows new prefetch requests on the goodpath to proceed in the Mbox and/or Cbox without having to stall behind the badpath ones. If a badpath indication is received for a TPU having valid Capture Register state, that state is invalidated.

3.8.2.3 Fill

The Fill Section of the IFU contains the circuitry between the Cbox and the ICache for the predecoding and parity generation of instructions returning from the SCache or memory system. A simplified block diagram of the Fill Section appears in Figure 3-8. The Cbox initiates the transfer of instructions to the IFU by supplying the early_warning_ptr corresponding to the request for those instructions. This pointer is used for probing two arrays in the Demand Subsection. The Demand Array Probe determines if the returning instructions are non-piggybacked demand requests for any of the TPUs, while the Entry Array Probe looks up the VA and tag data stored earlier for this request. This data, combined with the returning instructions themselves, is fed into the Predecode, Br_Offset Gen, and Parity Gen boxes to determine predecode bits, branch target offsets, and parity. The aggregation of these bits, combined with the tag and instruction bits, is called the Fill Packet.

Figure 3-8 Instruction Fill Unit (IFU) Fill Section

The branch offset calculations compute a portion of the target address for both conditional and unconditional branches. Carry predecodes are generated during branch target precalculation. If an integer or floating-point conditional branch or an unconditional branch is detected, the branch target is precalculated and the displacement field is replaced with the target as follows. The overflow predecode bit is calculated as follows, where the circumflex (^) represents an XOR operation:

    [(PC<21:2> + 1) + I<19:0>]<20> ^ I<20>

The increment predecode bit is calculated as follows, where the increment is to the next address that falls on an 8-instruction boundary:

    [(PC<21:2> + 1) + I<19:0>] + 8

Because the displacement is overwritten by the target address when the branch is stored, and because the displacement field is 21 bits long, only the lower 20 bits of the result are calculated. By leaving the sign bit <20> intact and including the overflow bit in the Icache to hold the carry-out, the rest of the addition can be performed when the PCs are calculated. Because the sign and overflow bits can both be 0 or 1, the high bits of the target can be incremented, decremented, or left unchanged. The predecodes are split between those needed in the Ibox and those passed further down the pipeline.

3.8.2.3.1 Predecode Bit Generation

Table 3-17 shows an overview of the predecode bits that are generated in the Instruction Fill Unit (IFU). In the table, the last column shows the Pbox predecode bits as A, B, C, and D, and the Ibox predecode bits as UE, CB, P2, P3, CM, and MA.
Table 3-17 is sorted according to opcode value. In the table: • The first column lists the instructions. • The second column lists each of the 23 possible Pbox-assigned instruction types. • The third column lists the opcode for each instruction or group of instructions. • The fourth column lists the function field bits in the instruction that the IFU uses to determine the instruction type. • The fifth column lists the predecodes that the IFU generates. The Pbox assigns the instruction type (column two) according to the EDCBA predecode bits. Similarly, the lbox assigns the other encoding bits for the control flow instructions, described in Table 3-18. Table 3-17 Predecode Bits Defined by the lbox Instruction Fill Unit Function Field Bits: 2 Predecode Bits: 3 151413121110 9 8 7 6 5 p EDCBA UE CB P2 P3 CM MA Instruction Type1 Opcode Bits: 31-26 CALL PAL XXP 000000 - - - - - - - - - - - - 00010 1 1 1 1 0 0 RES xxx xxx 000001 - - - - - - - - - - - - 01001 0 0 0 0 0 0 000010 - - - - - - - - - - - - 01001 0 0 0 0 0 0 RES Compaq Confidential 5 January 2001 ~·Subject To Change Instruction Fetch Unit - the lbox 3-49 Fm Unit Table 3-17 Predecode Bits Defined by the lbox Instruction Fill Unit (Continued) Function Field Bits:2 Predecode Bits:3 151413121110 9 8 7 6 5 p EDCBA UE CB P2 P3 CM MA 01001 0 0 0 0 0 0 01001 0 0 0 0 0 0 01001 0 0 0 0 0 0 01001 0 0 0 0 0 0 01001 0 0 0 0 0 0 11011 0 0 0 0 0 0 001001 ------------------------------------------------------------------------------ 11011 0 0 0 0 0 0 Sii 001010 ------------ 10000 0 0 0 0 0 0 LDQ_U SUI 001011 11001 0 0 0 0 0 0 LDWU Sii 001100 ----------------------- 10000 0 0 0 0 0 0 STW IIS 001101 - - - - - - - - - - - - 11111 0 0 0 0 0 0 STB IIS 001110 11111 0 0 0 0 0 0 STQ_U IIS 001111 ----------------------- 11111 0 0 0 0 0 0 INTA II I 010000 ---0-------- 00100 0 0 0 0 0 0 INTA IXI 010000 ---1-------- 00101 0 0 0 0 0 0 INTL III 010001 ---0----0--- 00100 0 0 0 0 0 0 INTL IXI 010001 ---1----0--- 00101 0 0 0 0 0 0 INTL II I 010001 - - - 0 - - - 1 1- - - 00100 0 0 0 0 0 0 INTL IXI 010001 - - - 1- - - 1 1- - - 00101 0 0 0 0 0 0 CMOVx II I 010001 ---0---01--- 00100 0 0 0 0 1 0 CMOVx IXI 010001 ---1---01--- 00101 0 0 0 0 1 0 MA III 010001 ---010-00--- 00100 0 0 0 0 0 1 MA IXI 010001 - - - 1 10 - 0 0 - - - 00101 0 0 0 0 0 1 INOP II I 010001 ---001-00--- 00100 0 0 0 0 0 0 INOP IXI 010001 - - - 10 1- 0 0 - - - 00101 0 0 0 0 0 0 INTS II I 010010 ---0-------- 00100 0 0 0 0 0 0 INTS IXI 010010 ---1-------- 00101 0 0 0 0 0 0 INTM II I 010011 ---0-------- 00100 0 0 0 0 0 0 INTM IXI 010011 ---1-------- 00101 0 0 0 0 0 0 FLTS FFF 010100 -------1---- 11100 0 0 0 0 0 0 ITOFx IXF 010100 -------0---- 00111 0 0 0 0 0 0 FLTV FFF 010101 ------------ 11100 0 0 0 0 0 0 FLTI FFF 010110 ------------ 11100 0 0 0 0 0 0 3-50 Instruction Fetch Unit - the lbox Opcode Bits: 31-26 Instruction Type1 RES P.xxxx4 xxx xxx xxx xxx xxx LDA XII 001000 LDAH XI I LDBU RES RES RES 000011 000100 000101 000110 000111 Compaq Confidential 5 January 2001 ··· Subject To Change Fm Unit Table 3-17 Predecode Bits Defined by the lbox Instruction Fill Unit (Continued) Function Field Bits:2 Predecode Bits:3 31-26 151413121110 9 8 7 6 5 p EDCBA UE CB P2 P3 CM MA FFF 010111 ------01---- 11100 0 0 0 0 1 0 CPYSx FFF 010111 ------000--- 11100 0 0 0 0 0 0 MT_FPCR FFC 010111 ------001-0- 11110 0 0 0 0 0 0 MF_FPCR FFF 010111 - - - - - - 0 0 1- 1- 11100 0 0 0 0 0 0 CVTxx FFF 010111 ------1----- 11100 0 0 0 0 0 0 FNOP FFF 010111 ------00000- 11100 0 0 0 0 0 0 TRAPB 011000 00---0------ 01001 0 0 0 0 0 0 011000 
00---1------ 01001 0 0 0 0 0 0 011000 01--00------ 01001 0 0 0 0 0 0 (MB) xxx xxx xxx xxx 011000 0 1- - 1- - - - - - - 01001 0 0 0 0 0 0 WMB IIX 011000 0 1- - 0 1- - - - - - 01111 0 0 0 0 0 0 FETCH 011000 100--------- 01001 0 0 0 0 0 0 FETCH_M xxx xxx 011000 1 0 10 - - - - - - - - 01001 0 0 0 0 0 0 RPCC XIY 011000 1 1 0 - - - - - - - - - 01011 0 0 0 0 0 0 Rx XXN 011000 1 1 1-0------- 00011 0 0 0 0 0 0 xCB IIX 011000 1 1 10 1- - - - - - - 01111 0 0 0 0 0 0 WH64x IIX 011000 1 1 1 1 1- - - - - - - 01111 0 0 0 0 0 0 LDx_ARM SII 011000 10 1 1 0 - - - - - - - 10000 0 0 0 0 0 0 QUIESCE IIX 011000 1 0 1 1 1- - - - - - - 01111 0 0 0 0 0 0 HW_MFPR RXI 011001 - - - - - - - - - - - - 00110 0 0 0 0 0 0 JMP JMP XII XII 011010 011010 00---------0 00---------1 11011 11011 1 0 0 0 0 0 0 0 0 0 0 0 RET XII 011010 10---------- 11011 1 0 1 0 0 0 JSR XII 011010 01---------- 11011 1 1 0 0 0 0 JCR XII 011010 1 1- - - - - - - - - - 11011 1 1 1 0 0 0 HW_LD S II 011011 - - - - - - - - - - - - 10000 0 0 0 0 0 0 INTV II I 011100 ---00------- 00100 0 0 0 0 0 0 INTV IXI 011100 ---10------- 00101 0 0 0 0 0 0 INTV II I 011100 ---010------ 00100 0 0 0 0 0 0 INTV IXI 011100 - - - 1 1 0 - - - - - - 00101 0 0 0 0 0 0 INTV II I 011100 - - - 0 1 1 0 - - - - - 00100 0 0 0 0 0 0 INTV IXI 011100 - - - 1 1 10 - - - - - 00101 0 0 0 0 0 0 FTOix FXI 011100 - - - - 1 11----- 01110 0 0 0 0 0 0 HW_MTPR RIW 011101 ------------ 10110 0 0 0 0 0 0 Opcode Bits: Instruction Type1 FCMOVx EXCB MB Compaq Confidea1tial 5 January 2001 ·-Subject To Change Instruction Fetch Unit - the lbox 3-51 Fm Unit Table 3-17 Predecode Bits Defined by the lbox Instruction Fill Unit (Continued) Function Field Bits:2 Predecode Bits: 3 15141312111098765 p EDCBA UE CB P2 P3 CM MA 01001 1 0 1 1 0 0 11111 0 0 0 0 0 0 11000 0 0 0 0 0 0 11000 0 0 0 0 0 0 100010 -------------------------------------------------------- 11000 0 0 0 0 0 0 SIF 100011 - - - - - - - - - - - - 11000 0 0 0 0 0 0 STF FIS 100100 ------------ 11101 0 0 0 0 0 0 STG FIS 100101 ------------ 11101 0 0 0 0 0 0 STS FIS 100110 11101 0 0 0 0 0 0 STT FIS 100111 11101 0 0 0 0 0 0 LDL S II 101000 ---------------------------------- 10000 0 0 0 0 0 0 LDQ S II 101001 - - - - - - - - - - - - 10000 0 0 0 0 0 0 LDL_L S II 101010 ------------ 10000 0 0 0 0 0 0 LDQ_L S II 101011 - - - - - - - - - - - - 10000 0 0 0 0 0 0 STL IIS 101100 - - - - - - - - - - - - 11111 0 0 0 0 0 0 STQ IIS 101101 ------------ 11111 0 0 0 0 0 0 STL_C IIL 101110 - - - - - - - - - - - - 00000 0 0 0 0 0 0 STQ_C IIL 101111 0 0 0 0 0 0 BR XXI 110000 - - - - - - - - - - - - 00000 - - - - - - - - - - - - 00001 1 0 0 1 0 0 FBEQ FX:X 110001 -----------0 01100 0 1 0 1 0 0 FBEQ FXX 110001 -----------1 01100 0 0 0 0 0 0 FBLT FXX 110010 -----------0 01100 0 1 0 1 0 0 FBLT FXX 110010 -----------1 01100 0 0 0 0 0 0 FBLE FXX 110011 -----------0 01100 0 1 0 1 0 0 FBLE FXX 110011 -----------1 01100 0 0 0 0 0 0 BSR XXI 110100 - - - - - - - - - - - - 00001 1 1 0 1 0 0 FBNE FXX 110101 -----------0 01100 0 1 0 1 0 0 FBNE FXX 110101 -----------1 01100 0 0 0 0 0 0 FBGE FXX 110110 -----------0 01100 0 1 0 1 0 0 FBGE FXX 110110 -----------1 01100 0 0 0 0 0 0 FBGf FXX 110111 -----------0 01100 0 1 0 1 0 0 FBGf FXX 110111 -----------1 01100 0 0 0 0 0 0 BLBC IXX 111000 -----------0 01000 0 1 0 1 0 0 Instruction Type1 Opcode Bits: 31-26 IFETCHB xxx 011110 HW_ST IIS 011111 LDF SIF 100000 LDG SIF 100001 LDS SIF LDT Compaq Confidential 3-52 lnstructio~ Fetch Unit-the lbox 5 J<1nuary 2001 m Subject To Change Fm Unit Table 3-17 Predecode Bits Defined by the lbox Instruction 
Fill Unit (Continued) Type1 Opcode Bits: 31-26 Function Field Bits:2 Predecode Bits:3 Instruction 1514131211 10 9 8 7 6 5 p EDCBA UE CB P2 P3 CM MA BLBC IXX 111000 . -----------1 01000 0 0 0 0 0 0 BBQ IXX 111001 -----------0 01000 0 1 0 1 0 0 BBQ IXX 111001 -----------1 01000 0 0 0 0 0 0 BLT IXX 111010 -----------0 01000 0 1 0 1 0 0 BLT IXX 111010 -----------1 01000 0 0 0 0 0 0 BLB IXX 111011 -----------0 01000 0 1 0 1 0 0 BLB IXX 111011 -----------1 01000 0 0 0 0 0 0 BLBS IXX 111100 -----------0 01000 0 1 0 1 0 0 BLBS IXX 111100 -----------1 01000 0 0 0 0 0 0 BNB IXX 111101 -----------0 01000 0 1 0 1 0 0 BNB IXX 111101 -----------1 01000 0 0 0 0 0 0 BGB IXX 111110 -----------0 01000 0 1 0 1 0 0 BGB IXX 111110 -----------1 01000 0 0 0 0 0 0 BGf IXX 111111 -----------0 01000 0 1 0 1 0 0 BGf IXX 111111 -----------1 01000 0 0 0 0 0 0 1 2 3 4 The predecode type (or logic group) is described in Section A.2. In the function field bit listing, P represents the physical bit, described below. See Table 3-18 for information about predecode bits other than BDCBA. Paired single-precision floating-point instructions. 3.8.2.3.2 Predecode Bits for Control Flow Instructions Table 3-18 describes the meaning for those predecode bits that are generated by the IFU for control flow instruction processing. In the table: • UE is an unconditional exit and CB is a conditional branch. The UE and CB predecodes are used by the branch predictor to quickly determine the exit point of the two fetch slots. P2 is popstack and normally means to pop the return stack. P3 (or branch) normally means Bxx. P2 and P3 are used to determine how the return stack and jump predictor outputs are used. The following attributes can be determined for all 16 fetched instructions during the A phase of I3 when the branch predictor is determining the exit point: JPRBD POP PUSH TBR CPL IFBTCHB = UB& !P2& !P3 = P2 & !P3 = UB&CB = !P2 &P3 = CB &P2&P3 = !CB &P2&P3 Compaq Confidential 5 January 2001 ··· Subject To Change Instruction Fetch Unit - the lbox 3-53 Fm Unit For detailed information, see Section 3.7 • For "legacy" CMOV/FCMOVinstructions (see Section 2.11.2), a set CM bit causes the Collapsing Buffer to create a CMOV2 instruction by making a new instruction chunk. Legacy CMOV/FCMOV instructions are always the first instruction in the map chunk. • The MA bit is set when an XOR (11.40) instruction with destination R31 is detected as the final instruction in either half-block of eight instructions received by the IFU. When MA is set and the chunk is fetched from the !cache, the Collapsing Buffer starts a new map chunk that begins with the current fetch chunk. See also Section 2.11.3 for more information. • The physical bit, P, in the function field bits column indicates that no address translation was performed when fetching instructions. When set, the VA field in the TAG represents the actual PA from which the instructions were fetched, and not a translation. 
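The exit-attribute equations listed above arrive flattened in this copy of the document. The following sketch shows one reading of them, checked for consistency against the UE/CB/P2/P3 encodings given in Table 3-17 and Table 3-18; it is illustrative only, and the type and function names are invented for the example rather than taken from the design database.

    #include <stdbool.h>

    struct exit_attr {
        bool jpred;    /* use the jump predictor                     */
        bool pop;      /* pop the return stack                       */
        bool push;     /* push the return stack                      */
        bool tbr;      /* PC-relative taken-branch target            */
        bool cpal;     /* CALL_PAL                                   */
        bool ifetchb;  /* IFETCHB (stop thread, next PC = PC + 4)    */
    };

    struct exit_attr decode_exit(bool ue, bool cb, bool p2, bool p3)
    {
        struct exit_attr a;
        a.jpred   =  ue && !p2 && !p3;   /* JMP and JSR                         */
        a.pop     =  p2 && !p3;          /* RET and JSR_COROUTINE               */
        a.push    =  ue &&  cb;          /* BSR, JSR, JSR_COROUTINE, CALL_PAL   */
        a.tbr     = !p2 &&  p3;          /* BR, BSR, taken conditional branch   */
        a.cpal    =  cb &&  p2 &&  p3;   /* CALL_PAL                            */
        a.ifetchb = !cb &&  p2 &&  p3;   /* IFETCHB                             */
        return a;
    }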
Table 3-18 lbox Predecode Bit Summary UE CB P2 P3 CM MA Meaning 0 0 0 0 0 0 Fall through (integer and floating-point conditional branch in PALmode, physical bit = 1) 0 0 0 0 1 0 Fall through and Collapsing Buffer starts a new map chunk to begin with a CMOV2 instruction x x x x x 1 Collapsing Buffer starts a new map chunk at current fetch chunk 1 0 0 0 Integer and flaoting-point conditional branch (physical bit = 0) 0 0 Jump 0 0 Unconditional branch 0 0 0 Return (pops the return stack) 1 0 1 1 0 0 IFETCHB, (stops thread, next PC= PC + 4) 1 1 0 0 0 0 JSR (pushes return stack) 1 0 1 0 0 BSR (pushes return stack) 1 0 0 0 JSR_COROUTINE (pops and pushes return stack) 1 1 1 1 0 0 CALL_PAL (pushes return stack) 0 0 0 1 x x x x x x x x x x x x Not used - do not care 0 1 0 0 0 1 0 0 1 0 0 0 1 0 0 0 1 0 0 0 0 1 0 0 1 1 Not used - do not care Not used- do not care Not used- do not care Not used - do not care Not used - do not care Compaq Confidential 3-54 Instruction Fetch Unit - the lbox 5 January 2001 --· Subject To Cl~ange Checkpoint Unit 3.8.2.3.3 Fill Data Routing A few simple rules govern the routing of information in the Fill Section. First, the Cbox returns its data in 16-instruction chunks, whether or not the request hit the Scache. Next, if the Demand Array Probe indicates that the returning instructions are for a nonpiggybacked demand miss, the wake_tpu signal for the corresponding thread is activated, which is read by the FTC. Finally, the !Cache is designed to give writes priority over reads, so buffering of writes is not necessary. The early warning signals sent in cycle ClO are latched in cycle IX. The array probes occur in IY, and the fill_inst instruction bus latches its C12 signals on the IZ edge. The discard_fill and dbl_ecc_err error signals are used in IZ. If the former is true, all processing for this fill is terminated, and all IFU state will behave as if the early_waming_ptr had never been active for this fill. At some later time, the Cbox will again try to complete the fill for this request. The discard_fill signal covers a number of late-kill cases, including single-bit error detection, that occur too late to affect sending of the early_waming_ptr. If discard_fill is false, but dbl_ecc_err is true, the fill proceeds as normal, but with the resultant Fill Packet being written into the !Cache with its ecc_uncor bit set. Any reads of this ICache line will force a late exception. This method allows for fast processing of double-bit errors when the machine is in PAL mode, because the normal handling via Cbox interrupt is not possible in PALmode. 3.9 Checkpoint Unit The checkpoint table, as the name implies, serves as a repository for important information flowing through the pipeline every clock cycle. This information is later used for (a) restoring state when restarting on an exception and (b) for training predictors in the Ibox. The checkpoint table plays a pivotal role in restarting the pipeline for all exceptions except those that are specific to the Ibox such as those caused due to line or way mispredictions. Specifically, the checkpoint table handles only restarts for instructions that have been mapped and assigned an INum by the Pbox. The class of exceptions that is handled by the checkpoint table is also known as the "post-map" exceptions. Another important role played by the checkpoint table is to provide sufficient information at instruction retire time for training the branch and jump predictor. 
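As a concrete illustration of this reverse mapping, the sketch below shows how an INum might be turned back into an instruction address from post-map table state of the kind described in the following sections (per-slot PC, slot mask, and map-chunk position information). All names, field widths, and the exact address arithmetic here are assumptions made for the example, not the actual table format.

    #include <stdint.h>

    #define MAP_CHUNK_INSTS 8          /* instructions per map chunk */

    struct postmap_entry {
        uint64_t slot_pc[2];           /* per-slot PC (example: assumed complete here)   */
        uint8_t  slot_mask;            /* bit i set => instruction i came from slot B    */
        uint8_t  slot_start[2];        /* starting position of each slot in the chunk    */
    };

    /* Reconstruct the address of instruction 'inum' from its map-chunk entry. */
    uint64_t reverse_map_pc(const struct postmap_entry *e, unsigned inum)
    {
        unsigned pos  = inum % MAP_CHUNK_INSTS;        /* position within the map chunk */
        unsigned slot = (e->slot_mask >> pos) & 1;     /* slot A (0) or slot B (1)      */
        unsigned offs = pos - e->slot_start[slot];     /* offset within that fetch slot */
        return e->slot_pc[slot] + 4 * offs;            /* 4 bytes per instruction       */
    }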
The information stored in the checkpoint table is also leveraged to allow mispredicted jumps to be identified as well as to generate the return address whenever a subroutine call is made. Effectively, the checkpoint table acts as a link between the pre-mapped and postmapped world of instructions. Before the mapping is performed, an instruction is identified using its address. Once the instructions are mapped and dispatched from the Ibox to the Pbox, the INum associated with the instruction becomes its sole identifier. However, the address of an instruction may be needed occasionally during its lifetime. This is especially true when an instruction restarts on an exception or a return address is needed by the Ebox to be pushed into a stack register when a subroutine call is executed. The checkpoint table enables such operations to be performed with its ability to reverse-map the INum of an instruction to its address using the information stored in it. Compaq Confidential 5 Jam.uiry 2001 - Subject To Change Instruction Fetch Unit - the lbox 3-55 Checkpoint Unit It must be noted that the amount of information that needs to be checkpointed for restarts and training is non-trivial. However, due to area constraints on the die, the information cannot be stored in a naive fashion. Hence, several optimizations are performed to condense the information for reduced storage without losing any details. The following sections provide additional details of the checkpoint table. 3.9.1 Checkpoint Table Components The checkpoint table consists of a pre-map and post-map table. The pre-map table reflects the instruction buffer and stores information on a fetch-slot basis while the post-map table stores information on a map-chunk basis. The post-map table forms the core of the check-pointing mechanism. The pre-map table acts only as a temporary store to hold data until the collapsing buffer creates a new map-chunk from fetch slots in the instruction buffer. Once this operation is completed, the information for the collapsed fetch slots flows from the pre-map table to the post-map table and is stored in a collapsed form to reduce storage requirements. As with the instruction buffer, the pre-map table consists of 16 entries per thread for a total of 64 entries. Information corresponding to each of the two slots that may be fetched every cycle is written into the pre-map table using the same index that is used to write into the instruction buffer. Table 3-19 lists the different fields that are stored for each slot along with the producer of that information. Since a fetch slot cannot have a valid jump as well as a return instruction, the jump and return predictions share single field. The appropriate data is written based on the type of the exit instruction. For more information on each specific field, please refer to the appropriate producer section. Table 3-19 Fields in a Pre-Map Table Entry Producer Information PC calc logic PC<51 :5>, PC<O> (palmode bit) Jump predictor Jghist<35:0>, Jump or Return predictor Jump or return prediction<51:2> Branch Predictor Lghist<23:0>. 
Shift distance<2:0>, No shift Branch Predictor Bank<6:5>, Next Bank<6:5>, Next to Next Bank <6:5> Branch Predictor Previous index <6:5>, Next to Previous index <6:5> Branch Predictor Prediction entries: G0<7:0>, G1<7:0>, CH<7:0>, BM<7:0> Branch Predictor Jump, Push, Pop exit attributes and sequential exit flag Branch Predictor Conditional branch attributes<7:0> Return Predictor Nalloc<5:0>, Tos<5:0>, Ptos<5:0> The post-map table contains 32 entries, each of which corresponds to an in-flight map chunk. The table is indexed using the map chunk INum. Most of the information is stored on a map-chunk basis though some information needs to be stored on a per-slot or per-instruction basis. The information stored in the post-map table mostly originates either from the pre-map table or from the collapsing buffer. Since up to two fetch slots may be collapsed to create a map-chunk, the information stored in one entry of the post- Compaq Confidential 3-56 Instruction Fetch Unit - the lbox 5 January 2001 - Subject To Change Checkpoint Unit map table spans the information from two adjacent entries of the pre-map table. However, the information for the two fetch slots is not stored as such. Instead, it is collapsed such that the storage space is vastly reduced without losing any information. Most of the fields in a post-map table entry are written during map time. However, there are a few fields that are not created until instruction execution time or when the Pbox signals a kill due to some exception. We now list the different fields for an entry in the post-map table. Table 3-20 lists those fields that store a collapsed form of the fields read from the pre-map table for two adjacent slots: slot A and slot B. Table 3-21 lists those fields that are stored in the same format (for each slot) as they are read from the pre-map table. Table 3-22 lists the remaining fields that do not use any pre-map table entries and are written directly using information provided by the collapsing buffer. Finally, Table 3-23 lists the fields that are created during execution or kill time. We also provide a brief description for some of the fields that include details on how the collapsing is performed. Table 3-20 Collapsed fields Stored Into a Post-map Table Entry at Map Time ID Data from Pre-Map Table Collapsed fields in Post-Map Table Entry 1 Slot A LGhist<23:0> Slot B LGhist<23:0> LGhist<24:0> 2 Slot A ShiftDist<2:0> Slot B ShiftDist<2:0> ShiftDist<3:0> 3 Slot A Bank<6:5> Slot B Bank<l:O> Bank<6:5> Slot A Bank_next<6:5> Slot B Bank_next<6:5> Bank_next<6:5> Slot A Bank_next_next<6:5> Slot B Bank_next_next<6:5> Bank_next_next<6:5> B ank_next_next_next<6:5> 4 Slot A prev_index<6:5> Slot B prev _index<6:5> Prev_index<6:5> Slot A prev_index_next<6:5> Slot B prev_index_next<6:5> Prev_index_next<6:5> Prev_index_next_next<6:5> 5 Slot A Nalloc<5:0> Slot A Tos<5:0> Slot A Ptos<5 :0> Slot B Nalloc<5:0> Slot B Tos<5:0> Slot B Ptos<5:0> Slot A Nalloc<5:0> Slot B Nalloc<5:0> Slot A Tos<5 :0> Slot B Tos<5:0> Slot B Ptos<5:0> 6 Slot A noshift Slot B noshift Slot A noshift Table 3-21 shows the fields in the post-map table entry that is maintained for each slot in a map-chunk and is directly transferred from the pre-map table at map time. 
Table 3-21 Post-Map Table Entry Fields ID Fields in the Post-Map Table entry 7 PC<51:5>, PC<O> (palmode bit), PC+4<15:5> 8 Jump, Push, Pop exit attributes and sequential exit flag 9 Branch prediction entries: G0<7, 0>, G1<7, 0>, CH<7, 0>, BM<7, 0> 10 Conditional branch attributes<7:0> 11 Jghist<35:0>, Jump prediction<51:2> Compaq Confidential 5 January 2001 -· Subject To Change Instruction Fetch Unit - the lbox 3-57 Checkpoint Unit Table 3-22 Fields that are Available from Collapsing Buffer at Map Time ID Fields in the Post-Map Table entry 12 Alternate PC<21 :0> (8 in all; 1 for each map-chunk instruction) 13 Store Set ID<4:0> 14 Slot Mask<7:0> 15 Map Chunk information: length<2:0>, slot 0 start position<2:0>, slot 1 start position<2:0>, slot 0 length<2: 0> (8 in all; 1 for each map-chunk instruction) Table 3-23 Fields in Post-Map Table Entry That are Created During Execute (E) and Kill Time (K) ID Fields in the Post-Map Table entry 16 Jump Target<51:0>, Jump Target Valid (Execute) 17 Kill location <2:0> Kill Valid (Kill) Notes for Tables 3-20 through 3-23: For some of the fields mentioned above, we give a brief description that includes details on how information is collapsed before it is written into the post-map table. Note that we use the ID in the above tables to describe the corresponding field. • Lghist for Slot B can have at most one new bit added to it with the rest of the bits overlapping with that of Slot A. To determine if there was indeed a new bit added to slot B's lghist, we use the newest Shift Distance bit for Slot B. The collapsed Lghist is created as follows: If Slot B Shif tDist<O> LGhist<24:0> = CONCAT (Slot A LGhist<23:0>, Slot A Lghist<O>) Else Lghist<24:0> = CONCAT (0, Slot A Lghist<23:0>) • The Shift Distance for Slot B has one new bit while the other two bits overlap with that of Slot A. • ShiftDist<3:0> = CONCAT (Slot A ShiftDist<2:0>, Slot B ShiftDist<O>) • The two successors to Slot A Bank are exactly the same as Slot B Bank and its successor. Hence, we need to store only 4 out of the 6 bank identifier fields. • As with the bank identifiers, the next prev_index of Slot A is the same as that of Slot B prev_index. So we store only 3 out of the 4 fields. • Ptos (previous top of stack) is used solely when restarting after an instruction that pops the return stack. If Slot A had such an instruction, Slot B's top of stack would indeed be slot N s Ptos. Hence, there is no need to store the Pros for slot A. Compaq Confidential 3-58 Instruction Fetch Unit-the lbox 5 J~·muary 2001 ···Subject To Cha.ngE~ Checkpoint Unit • The no shift bit, which prevents the shift distance bits from being modified more than once for the same fetch slot when it is restarted (on an exception), is relevant only for the first slot (see branch predictor section for more details). • The low PC bits <4:2> are created only on a need basis for a particular instruction in one of the fetch slots comprising the map chunk. The PC+4 field is not present in the pre-map table and is created on the fly from the corresponding PC bits. Pre-calculating this field is necessary for restarting the line predictor latches with a new index as fast as possible. • For conditional branches that are predicted as not taken, we need to store the alternate address (alternate PC) to handle mispredicts. This address would be used on a restart from a mispredicted not-taken branch. Since a branch instruction can occur in any position of the map chunk, provision must be for storing up to 8 alternate addresses. 
• As with branches, a load or store instruction may occur in any position in the map chunk. Hence, we need to provide storage for all instruction positions in the map chunk.
• The slot mask specifies whether a particular instruction originated from slot A or slot B.
• Slot 0 length is not directly available from the collapsing buffer. It is calculated using the slot mask that is provided by the collapsing buffer.
• The Jump Target Valid bit enables two jump instructions, one belonging to slot A and one to slot B, to share the same location for storing the actual target on a jump misprediction. The following section provides more details on the sharing mechanism.
• To ease implementation, both the pre-map and post-map tables are partitioned such that a particular partition resides close to the checkpointed component. For instance, the partition residing close to the branch predictor needs to store only those fields that are relevant to the branch predictor, such as Lghist, shift distance, no shift, and prediction entries, while fields such as store set identifiers and jump predictions need not be stored there.

3.9.1.1 Checkpoint Table Functions

As mentioned earlier, the fields in the checkpoint table are written not only at instruction map time but also during the execution phase, as well as when instructions are killed due to an exception. When the Ebox executes a jump instruction, it forwards the actual target of the jump to the checkpoint table so as to validate the jump prediction. The checkpoint table accesses the corresponding entry in the post-map table, using the INum that is provided by the Ebox, to obtain the predicted jump address. If a mismatch occurs between the true target and the predicted address, the checkpoint table signals a jump mispredict to the Ebox. At the same time, it stores the true target into the table. This target value will eventually be used for restarting the pipeline as well as for training the jump predictor. The jump valid bit is also set on a jump mispredict when the correct target is stored. Since an earlier exception overrides a younger exception, a mispredicted jump in slot A can always store its true target, while a mispredicted jump in slot B may do so only when a jump instruction in slot A has not already mispredicted.

Occasionally, the Ebox requires the return address that needs to be saved in the stack register when executing a subroutine call. The checkpoint table uses the INum provided to find the associated PC of the subroutine call instruction and sends the PC of the subsequent instruction (PC+4) to the Ebox. When the Ebox eventually executes the "return" instruction in the subroutine, execution is redirected to the address that was provided by the checkpoint table. Note that the return address also needs to be validated. The description given for jumps for signaling mispredicts also applies to "return" instructions. This is because the addresses predicted by the return and jump predictors share the same field, as only a jump or a return instruction can be valid in a fetch slot.

3.9.1.1.1 Restarting on an exception

The checkpoint table is responsible for restarting the pipeline on an exception by providing the line predictor and PC calc logic with the new address. An exception may appear through the exception funnel (E-funnel) from the Pbox or on the fast-path used for early signaling of branch mispredictions.
The E-funnel exceptions take priority over the fast-path exceptions. The information available to the checkpoint table from the E-funnel includes the type of exception and the exception INum. The checkpoint uses the exception type and the INum to access the post-map table to get the appropriate restart address. The slot mask (Table 3-22) lets us determine the slot in which the misprediction occurred. With this information, we can choose the appropriate address from a set of addresses that is stored on a fetch-slot basis (PC). The low bits of the exception INum helps us to choose an address from a set of 8 addresses stored on an instruction basis in the map-chunk (Alternate PC). Remember that the low bits<4:2> of the PC are not stored in the post-map table. However, by using the map chunk information (Table 3-22) and the position of the instruction in the map chunk, the low bits of the restart address can be easily determined. The restart may cause control to be transferred to PAL code in which case the checkpoint table also needs to provide the address to which control has to resume after return from PAL code. The PAL starting address itself is created by adding the offset provided through the exception funnel to the base address that is read from a PAL base register. If no exceptions are present in the E-funnel, the fast-path, which is used for early reso- lution of conditional branch mispredictions, is checked for the presence of an exception. Information on whether the conditional branch instruction was a mispredicted taken or not-taken type as well as its INum is also available on the fast path. Table 3-24 lists the different restart scenarios that are handled by the checkpoint table. The different types of restart addresses mentioned for the non-PAL exceptions are available in the post-map table. Note that the complete restart address is needed only by the PC calc logic while just the low bits of the restart address <14:2> are needed by the line predictor latches but a cycle earlier than PC calc. Due to timing constraints in the implementation, the low bits<14:5> of the incremented PC (PC+4) are stored apriori in the post-map table. This would be used whenever the restart address is PC+4 rather than calculating the value at the time of restart. Compaq Confidential 3-60 Instruction Fetch Unit - the lbox 5 J,1nuary 2001 m Subject To Clumge Checkpoint Unit Table 3-24 Exception Types and Restart Address Exception Restart Address Return Address for PAL Mispredicted not-taken Conditional branch, IFETCHB PC+4 N.A Mispredicted taken Conditional branch Alternate PC N.A Mispredicted jumps Jump Target N.A Replay, Load Store order violation PC N.A DTB Miss PALbase + offset PC Unalign, Write FPCR, Integer/FF Trap PALbase +offset PC+4 3.9.1.1.2 Restoring Predictor States In addition to providing the restart address, the checkpoint table also needs to restore the states of the different predictors in the Ibox namely, the branch, jump and return predictors. Due to the complex nature of the branch history bits, the control for restoring the state is non-trivial. Table 3-25 shows how we create the initial lghist and shift distance from the post-map entry based on whether the restart occurs in slot A or slot B. Table 3-26 details the complete restoration process. 
Table 3-25 Creating Slot-Based Predictor States From Mapped Information in the Post-Map Table LGHIST SHIFT DISTANCE Slot_A if (MappedShiftDist<O>) Slot_Ghist = MappedGhist<24: 1> else Slot_Ghist = MappedGhist<23:0> Slot_ShiftDist = MappedShiftDist<3:1> Slot_B Slot_Ghist = MappedGhist<23:0> Slot_ShiftDist = MappedShiftDist<2:0> Table 3-26 shows ..... Table 3-26 Restoring Predictor States on a Restart Type of Excepting instruction Conditional Branch (Bxx) LGHIST, SHIFT DISTANCE, NOSHIFT JG HIST NALLOC/TOS Restart in 1st half of slot? Taken? Ghist = Slot_Ghist,l; ShiftDist =Slot_ShiftDist,l; NoShift = 0 Not taken & No valid Bxx insn after? Ghist = Slot_Ghist,O; ShiftDist =Slot_ShiftDist, 1; NoShift = 1 Restart in 2nd half? Taken? Ghist = Slot_Ghist,O; ShiftDist = Slot_ShiftDist,l; NoShift = 0 Not taken & No valid Bxx insn after? Ghist = Slot_Ghist,1; ShiftDist = Slot_ShiftDist, 1 NoShift = 1 (=0 if slot ends i.e PC_low<4:2> == Ox7) PUSH(BSR) If valid insn before? If restart in 1st half II no valid Bxx insn in 2nd half? Ghist = Slot_Ghist,O; ShiftDist = Slot_ShiftDist, 1; NoShift = 0 else /* restart in 2nd half & valid Bxx insn in 2nd half */ Ghist = Slot_Ghist,l; ShiftDist = Slot_ShiftDist, 1; NoShift = 0 Jump (JMP) *** Same as for Push(BSR) *** Nalloc = Slot_Nalloc + 1 Tos = Slot_Nalloc JGhist = Slot_JGhist<26:0>, (Jtarget<19:11> "Jtarget<l0:2>) Compaq Confidential 5 January 2001 ··· Subject To Change Instruction Fetch Unit - the lbox 3-61 lbox Interfaces Table 3-26 Restoring Predictor States on a Restart Type of Excepting instruction LGHIST, SHIFT DISTANCE, NOSHIFT JGHIST NALLOCITOS Push+ Jump (JSR) ***Same as for Push(BSR) *** JGhist = Slot_JGhist<26:0>, (JTarget<19:11> "Jtarget<10:2>) Nalloc =Slot Nalloc + 1 Tos = Slot_NaJloc Pop (RET) ***Same as for Push(BSR) *** Nalloc = Slot_Nalloc if (Slot A restart) Tos = Slot B Tos else Tos = Slot B Ptos Pop + Push ***Same as for Push(BSR) *** (JSR_COROlITINE) Nalloc = Slot_Nalloc + 1 Tos = Slot_Nalloc Any other instruction If no valid Bxx insn after & valid insn before? (restart would be "at" If restart in 1st half II no valid Bxx insn in 2nd half? this instruction) Ghist = Slot_Ghist,O; ShiftDist = Slot_ShiftDist, 1; NoShift = 1 else /* restart in 2nd half & valid Bxx insn in 2nd half * Ghist = Slot_Ghist,1; ShiftDist = Slot_ShiftDist, 1; NoShift = 1 Default State Ghist =Slot Ghist ShiftDist =-Slot_ShiftDist if (Slot A restart) NoShift =.NoShift_old (from post-map table) else NoShift= 0 JGhist = Slot_JGhist Nalloc =Slot Nalloc Tos = Slot_Tos - 3.9.1.1.3 Predictor Training The checkpoint table is also used for training the branch and jump predictors. The jump predictor is trained only on a misprediction while the branch predictor is trained on both correct and incorrect predictions. The mispredict information is available in the kill field of the post-map table (Table 3-23). The following state information is provided to the branch predictor for training each slot in the map chunk: lghist, shift distance, bank, previous index and prediction bits (Table 3-20). In addition, using the kill position and the map-chunk information (Table 3-22), the actual instructions retired in each slot are also provided. This includes information on whether there was a mispredict in any of the slot as well as the position in the map-chunk where the mispredict occurred. For more details on how the training is done, please refer to the branch predictor section. As for the jump predictor training, the checkpoint table provides the true target to the jump predictor. 
It also uses the slot Jghist to calculate the index into the jump predictor array. The hash function for the index calculation is mentioned in the jump predictor section.

3.10 Ibox Interfaces
3.10.1 Pbox Interface
3.10.2 Qbox Interface
3.10.3 Ebox Interface
3.10.4 Mbox Interface
3.10.5 Cbox Interface

4 Dependency Mapper Unit - the Pbox

The Pbox processes instructions that are fetched by the Ibox. The Pbox assigns INums (instruction numbers) to the instructions, analyzes the data dependencies between instructions, and maps their architectural source and destination values into physical registers. The Pbox also maintains data structures that allow recovery of all relevant processor state that corresponds to the architectural state of the machine prior to any unretired instruction. This allows the processor to perform rapid trap recovery in the presence of branch mispredicts or other exception conditions. The Pbox passes the renamed instructions to the Qbox for scheduling and dispatch.

Figure 4-1 Pbox Block Diagram

The Pbox consists of the following components:

Table 4-1 Pbox Components (Name, Mnemonic, Description, Described in Section)

Bid/Grant Exception Logic (BEL): Chooses which of the pending kills from all TPUs should be broadcast to the rest of the chip. Section 4.3.10.
Instruction Decoder (IDC): Decodes each of the eight instructions that arrive in a cycle. The decoder is placed early in the pipe to aid slotting decisions and to provide inputs to the load/store flow control mechanisms and to the IPR interlock mechanisms. Section 4.3.6.
INum Allocator (INA): Allocates INums to new map blocks sent down by the Ibox. The INA also contains the Map Thread Chooser (see Section 4.3.3.3), which picks the next thread that will map instruction blocks and informs the Ibox. Section 4.3.3.
INum Mapper (IMP): Responsible for mapping source operand registers (VReg) into the INum of the last writer for the source operand. Section 4.3.1.
Load/Store Serial Number Allocator (LSN): Associates a sequential identifier with each load instruction and a second identifier with each store instruction. These LNums and SNums are used to prevent deadlock and manage flow control into the Mbox load and store queues. Section 4.3.7.
Mapper Exception Logic (MEX): Rolls the IMP, PMP, LSN, and RIF state back to the trap point when the MEX is notified by the BEL of an exception. Section 4.3.4.
Memory Queue Allocation (MQA): Governs the allocation and deallocation of load queue (LQ) and store queue (SQ) chunks to memory instructions. Also controls the High-Water Mark (HWM) that is sent to the Qbox to regulate the issuing of loads and stores. Section 4.3.5.
Physical Register Map (PMP): Allocates physical destination registers to each dispatched instruction.
This table is also used to map virtual register operands into the corresponding physical registers. Section 4.3.2.
Post-Map Skid Buffer (PSB): Holds a silo of the last few map blocks that have passed through the Pbox forward path. Section 4.3.8.
RC/RS Interrupt Flag Widget (RIF): Maintains state necessary to implement the RC/RS instructions. Section 4.3.9.
Retire/Kill Unit (RKU): Communicates the identity of retired and/or killed instructions to all concerned boxes by way of the Retire/Kill bus. Section 4.3.11.

4.1 Dependency Analysis: General Concepts

Previous "out of order" processors detected dependencies between instructions in different ways. The key goal is to recognize real dependencies between instructions (i.e. true read-after-write (RAW) dependencies) while "untangling" dependencies that are an artifact of the processor architecture (like write-after-write (WAW) or write-after-read (WAR) dependencies). For example, take a look at the following chunk of C code:

    a = b + c;
    d = a * a;
    a = e + f;

Note that there is a RAW dependency (a = b + c must be computed before d = a * a), a WAR dependency (d must be computed before the result of a = e + f is written), and an apparent WAW dependency if the compiler chooses to use the same register for the first value of a as for the second. Let's look at the macro for this C code. (Again, all macro programs are stylized and not meant to reflect actual Alpha assembler code.)

    ; A is in R1, B in R2 ...
    001  ADDL R2,R3 -> R1
    002  MULL R1,R1 -> R4
    003  ADDL R5,R6 -> R1

As you already know, the key to out of order execution is to recognize that R1 in this case has many different lifetimes in the course of a program. The lifetime of R1 in lines 1 and 2 is separate and distinct from the lifetime of R1 in lines 3 and thereafter. If the processor architecture provided a bazillion registers, the compiler would use a new register for each lifetime of a value. That is, it would create a new name for each lifetime of the variable a. Let's pretend that the C code was compiled into such an instruction set:

    001  ADDL R2,R3 -> R1
    002  MULL R1,R1 -> R4
    003  ADDL R5,R6 -> R11

In this case the second lifetime of a is stored in R11. This removes the WAW and WAR spurious dependencies. Now a suitably intelligent scheduler can recognize that instruction 003 can be executed in parallel with (or even before!) instruction 001 or 002.

Alas, we don't have an infinite (or even very large) number of architectural registers, so sooner or later a compiler that creates a new name for every register lifetime would run out of new names. Fortunately, there are lots of ways to create these new lifetime names at execution time in hardware, rather than at compile time. We believe that execution-time mechanisms offer the best opportunities to squeeze the last bit of performance from a program. The two most frequently encountered renaming approaches are:

• Rename each destination register (in the architectural register space) into a physical register (in a larger physical register space). If two different in-flight instructions (that is, instructions that have been fetched, have entered the scheduling unit, and have not yet retired) write architectural register R1, then each lifetime of R1 will be assigned to a different physical register. This mechanism removes WAW and WAR dependencies. In some cases it is also used to detect RAW dependencies. (The 21264 detects RAW dependencies by comparing physical register names.)

• Rename each destination register into a serial number. Each in-flight instruction has a unique serial number. This instruction serial number (or INum) can be used for RAW dependency detection. Unfortunately, it cannot be used to eliminate WAW or WAR dependencies unless (as in the case of machines using a re-order buffer) the microarchitecture provides a separate architectural register file. (Since the INum space is finite, each INum is reused fairly often. If instruction 51 writes R1 at time t1 and then writes R5 the next time INum 51 is re-allocated at t2, then we have no way of referring to the lifetime of the R1 that was written at t1. If the write at t1 was the last time R1 was written, then
(The 21264 uses detects RAW dependencies by comparing physical register names.) Rename each destination register into a serial number. Each in-flight instruction has a unique serial number. This instruction serial number (or INum) can be used for RAW dependency detection. Unfortunately, it cannot be used to eliminate WAW or WAR dependencies unless (as in the case of machines using a re-order buffer) the microarchitecture provides a separate architectural register file. (Since the INum space is finite, each INum is reused fairly often. If instruction 51 writes Rl at time t1 and then writes R5 the next time INum 51 is re-allocated at t2, then we have no way of referring to the lifetime of the Rl that was written at t1. If the write at t1 was the last time Rl was written, then Compaq Confidential 5 January 2001 ... Subject To Change Dependency Mapper Unit - the Pbox 4-3 INum Space we have lost its state. The solution to this problem is to copy Rl 's value to the architectural register file sometime before t2. This copy operation is the reason we don't use a classic re-order buffer organization in the 21464 Qbox.) The 21464 Pbox renames incoming register operands into an INum space to facilitate the scheduling decision. We rename incoming register operands into a physical register space to eliminate WAW and WAR dependencies. 4.2 INum Space Similarly to the 21264, 21464 uses instruction numbers (INums) to uniquely identify in-flight instructions. All TPUs share a single INum space, so INums are unique across TPUs. It is the INum Allocator (see Section 4.3.3) that allocates IN urns and the Qbox Completion Unit (see Section 5.2.16) that frees them upon retirement. We use INums in the range 0 to 511. We consider the INum space to be cyclic, so after 511 we wrap back to 0. One can visualize the space as a circle (as in the diagram below) that increases in the clockwise direction, except where we wrap from 511 back to 0. We allocate INums within a TPU in an increasing order (i.e. clockwise). Therefore, within a TPU, younger instructions have larger INums, except in the case of a wrap. In the diagram, INum A is younger than INum B. Imagine that the space between A and B shows the total range of INums in use. Then A represents the insert pointer (the youngest INum is use) while B represents the retire pointer (the next INum to retire). Figure 4-2 The INum Circle Why do we have 512 INurns? The architecture group did a number of studies and determined that we need to support a scheduling window of 128 entries, and we need to allow at most 256 in-flight instructions at any given time. Therefore, we need to choose 4-4 Compaq Confidential Dependency Mapper Unit -the Pbox 5 January 2001 ~·Subject To Change i1 ~ INum Space an INum space containing at least 256 INums to uniquely identify all in-flight instructions. In addition to uniquely identifying instructions, we need to be able to compare INums of the same TPU to determine which of two instructions is older. With only 256 INums we cannot accomplish this without additional information, namely which INum represents the youngest in-flight instruction for a given TPU. However, by increasing the INum space to 512 values - tacking a 9th wrap bit onto the lower 8 bits - we can. 4.2.1 INum Age Comparison A TPU's allocated (i.e. in-flight) INums will never cover more than a contiguous half of the INum circle, 256 of the 512 possible values. This is a very important point; it is this fact that allows us to determine which of two instructions in that TPU is older. 
In fact there is a simple, robust method for making this determination. First of all, note that we can interpret the INum space as consisting of 9-bit 2's complement signed numbers rather than unsigned values; i.e. the wrap bit becomes a sign bit. The diagram below visualizes the INum circle using this interpretation. Given this fact, the rule for determining the relative age of INums A and B is as follows:

    if (A - B > 0)
        A is younger than B
    else if (A - B < 0)
        A is older than B

where A-B is a 9-bit 2's complement subtraction. Note that the outcome A-B == 0 is not possible because of the constraint that in-flight INums cover no more than a contiguous half of the space and are therefore, by definition, unique.

To understand why this algorithm works, consider the diagram below. Let A and B be the youngest and oldest in-flight INums, respectively. The constraint on the distance between oldest and youngest means that the relative values of A and B break down into four cases, illustrated below. Note again that in every instance, A is younger than B.

Table 4-2 INum Age Relationship

Case  Sign of A  Sign of B  Magnitude Relationship  Sign of A-B   Sign of B-A
1     +          +          |A| > |B|               +             -
2     -          +          256 < |A| + |B| < 512   + (overflow)  - (overflow)
3     -          -          |A| < |B|               +             -
4     +          -          0 < |A| + |B| < 256     +             -

Table 4-2 shows the relationship in greater detail. The first four columns merely transcribe what is evident from the illustration, while the last two show that the algorithm gives the correct result for each case. Case 1 is very straightforward; A-B subtracts a positive number from a larger positive one, yielding a positive result. Case 3 is the dual of Case 1, with the signs and relative magnitudes of A and B reversed. In Case 4, A-B subtracts a negative number from a positive one where |A| and |B| add up to a maximum of 255, so the result is positive and within the range of 9-bit 2's complement representation [-256,255]. For all of these cases, B-A is simply the negation of A-B. Case 2 is a little less intuitive. A-B subtracts a positive number from a negative one, but since |A|+|B| > 256 the result is negative yet out of the range of 9-bit representation - which means that it wraps around to the positive side of the circle. Likewise, B-A subtracts a negative number from a positive one, giving a positive, out-of-range result - which therefore maps to a negative value in 9 bits. Notice that |A|+|B| < 512, which means that neither A-B nor B-A can wrap all the way around from positive to positive or negative to negative. Thus 9-bit 2's complement subtraction is sufficient to determine the relative age of any two INums.

In places where we use INums as unique identifiers, and do not need to do age comparisons, we need not store the 9th bit of the INum. Dependency detection is one situation where uniqueness is sufficient. Therefore, in most places in the Instruction Queue, the 21464 stores only the lower 8 bits of the INum.

4.3 Component Details

4.3.1 INum Mapper (IMP)

4.3.1.1 Design Considerations

The central problem in scheduling for out-of-order-issue processors is the identification of dependencies between instructions. Each instruction that reads results from a register file depends on the instruction that last wrote the required result to the register file.
Before the issue mechanism can decide that an instruction X is ready to issue, it must know what other instructions produce the data that X requires. (These instructions are the parents of X.) As an example, consider the following code fragment:

    I1: LD  R3 <- (R4)
    I2: CLR R2
    I3: ADD R5 <- R3 + R2

(All code fragments in this report are stylized and not meant to be in the form of legitimate Alpha assembler notation.)

Assume for the moment that R4 was loaded by an instruction that executed a very long time ago. I1 then is data ready when it is fetched and passed from the Ibox to the Qbox. It has no known parents. Similarly, I2 doesn't read any input operands. It is data ready when it arrives at the Qbox. I3, on the other hand, requires inputs generated by I1 and I2. I3 has two parents, (I1, I2). Until I1 and I2 are issued, I3 is not ready. As it turns out, I1 is a load, so it has a latency of two cycles; thus I3 can't be issued any earlier than two cycles AFTER I1 has issued.

We can determine the dependencies between instructions via several different mechanisms. The 21464 has chosen to use INum mapping. In this scheme, a mapper remembers the INum of the last instruction to write each register. At map time, we rename each input register for each instruction from its original virtual register name to the INum of the last writer for that register. This mapping operation maps dependencies from the (limited) virtual register name space, with all its spurious write-after-read and write-after-write dependencies, into the INum space, which is free of these false dependencies.

4.3.1.2 Design Architecture

The INum Mapper (IMP) processes each map chunk (8 instructions) in parallel. It maps the source register specifier for each instruction from the 6-bit virtual register space (31 integer registers, 31 floating-point registers, 2 PAL permanent registers) into INum space (8 bits), each source virtual register being replaced with the INum of the in-flight
The IMP must also maintain a list of "last writers" for each of the registers for every mapped instruction. This is called the back map or BMAP. Given that we support up to 256 in-flight instructions, the BMAP could require as much as 256 * 64 bytes or l 6Kbytes of storage. A table organized in the obvious way (indexed in one dimension by INum, in the other by VReg) would be very hard to maintain, as we'd have to write up to eight rows per tic into the BMAP. As it turns out, the BMAP is indexed in one dimension by the fetch block number (this is the INum of the first instruction in the block divided by eight) and in the other dimension by VReg number. Each cell (B,VReg) in the BMAP contains two bytes. The first byte (LAST_WRITER) contains the INum of the last writer of VReg BEFORE fetch block B was processed. The second byte M contains a mask such that if M is set, then INum =B *8 + k wrote register VReg. On a trap, the Mapper Exception logic directs the BMAP to read the map state from the column of cells that corresponding to the trap point and load this state into the CMAP. This restore operation is done in parallel for each of the 64 entries in the CMAP like this: FOR i = 0 to 64 DO IF BMAP(TrapINurn<7:3>,i) .M<7:0> == 0 THEN CMAP(i) = BMAP(TrapINum<7:3>,i).LAST_WRITER ELSE CMAP(i) = BMAP(TrapINum<7:3>,i).LAST_WRITER = 0 to TrapINurn<2: O> - 1 DO IF BMAP(TrapINum<7:3>,i) .M == 1 FOR j THEN CMAP(i) = 8 * TrapINum<7:0> + j END 4-8 Compaq Confidential Dependency Mapper Unit - the Pbox 5 Janw~ry 2001 ~· Subject To Change Component Details END END END If an entry BMAP(B,VReg).M<7:0> is zero, then the CMAP(VReg) is loaded with BMAP(B,VReg).LAST_WRITER. Otherwise we scan the writer mask for the backmap cell up to the trap point if any instructions that are before the trap point and in this map block have written VReg, then the CMAP is loaded with the INum of the last such instruction. Otherwise it is loaded with the LAST_WRITER INum. 4.3.1.3 Map Predecode Bits from the lbox The predecode value bits communicate the type of the source and destination dependencies for each instruction category, as shown in the table below. In the table: • • • • Source A and Source B correspond to the Ra and Rb fields respectively A Null value indicates that the corresponding field does not contain a valid source SourceIPRclass and WriterIPRclass refer to Internal Processor Register scoreboarding classes. ShadowRegl is a PAL shadow mode register number 1; its use is implicit in the CALL_PALL instruction opcode. Note that this table is a somewhat coarse approximation of how opcodes are allocated to predecode categories. The actual assignment for a particular opcode is not always obvious. Consult Section 3.8.2.3.1 for the exact mapping. Table 4-3 lists the predecode value meaning for the predecode bits received from the Ibox. 
Table 4-3 Predecode Value Meaning for 1%MAP_INST_I4A_H[7:0]<35:32> Predecode Value 1 Source A Dependency Type Source B Dependency Type Destination Dependency Type Destination Dependency Field 00101 Integer Null Integer Re Integer operates with Rb=immediate 01110 Floating-point Null Integer Re Ftolx 00011 Null Null Integer Ra Unconditional branch 01001 Null Null Null Null MB and other special instructions 00111 Integer Null Floating-point Re ltoFx 01100 Floating-point Null Null Null Floating-point conditional branch 00010 Null Null ShadowRegl Implicit CALL_PAL 01000 Integer Null Null Null Integer conditional branch 00100 Integer Integer Integer Re Integer operates and store-conditional Instructions Compaq Confidential 5 January 2001 ··· Subject To Change Dependency Mapper Unit - the Pbox 4-9 Component Details Table 4-3 Predecode Value Meaning for 1%MAP_INST_I4A_H[7:0]<35:32> (Continued) Predecode Value 1 Source A Dependency Type Source B Destination Dependency Dependency Type Type Destination Dependency Field 11110 Floating-point Floating-point Floating-point Re Floating-point operates, MF_FPCR, and MT_FPCR 11011 Null Integer Integer Ra Integer loads 11000 Null Integer Floating-point Ra Floating-point loads 00110 SourceIPRclass Null Integer Re HW_MFPR(See Sec. Section 17 .2) 11101 Floating-point Integer Null Null Floating-point stores 10110 SourceIPRclass Integer WriterIPRclass Re HW_MTPR (See Section 17.2) 11111 Integer Integer Null Null Integer stores Instructions 1 These values do not necessarily represent the types of the operands themselves. The Pbox uses the pre- decode bits to distinguish between floating-point, integer, IPR, and PAL shadow register dependencies (since the same virtual register bits can have different semantics depending on the instruction type), and to detect the absence of dependencies (i.e. the Null cases). The predecode values allow dependency analysis to proceed without waiting for a full decoding of the instruction opcode and function fields. See also Table A-2 for more mapping information. 4.3.2 Physical Register Map (PMP) 4.3.2.1 Design Considerations While the IMP has mapped each virtual register operand from its virtual register number into the INum of the last instruction to write the register (or INum =NULL if the last writer has already retired), we still need to find out where to store each destination and where in the physical register file the latest lifetime of each virtual register resides. (That is, the IMP told us who last wrote register X, but we also need to know where the writer put the data - which physical register contains the latest lifetime of each virtual register.) Why did we bother to remap to INum space if we're only going to map again into a physical register space? The whole thing has to do with our nearly paralytic fear of freelists. In a 21264-like scheme for register mapping, we would need to pick eight good free registers out of a pool of 512. That is perceived as being very hard to do. And so, you will notice the INum mapper has no giant free list. The next INum is peeled off sequentially. (This is almost true, see Section 4.3.3.) Because of that, the backup map is much simpler than it might otherwise be, and trap recovery is simpler. Unfortunately it forces us into another remapping, since some input operands were written so long ago that the INum that was formerly associated with them has retired and been re-allocated to a new destination register. 
(An INum is only a good "rename" for a virtual register while the INum is in flight.) Again, we are in mortal fear of building a "gimme eight good ones" free list mechanism. So, after a whole lot of collaboration, the 21464 team came up with a really neat scheme for mapping from INums to physical registers that doesn't use a free list.

4.3.2.2 Design Architecture

The general scheme that we intend to use for physical register renaming is to translate each virtual register specifier in an instruction into the INum that last wrote it (or NULL if the last writer has retired) and then translate the last-writer INum into the physical register that it wrote. If the last-writer INum is NULL, then we look up the physical register name in the Architectural State Table. (A register that was last written by a retired instruction is referred to as being part of the architectural or in-order state of the machine. We will use the term architectural state here.) The IMP has done the first half of the rename task for us. The second half can best be described with a program chunklet. We need to rename both source register operands and destination register operands. Assume that the source virtual register is SVReg, the INum it was last written by is SLastWriterINum, and we are looking for the register's current physical home SPReg.

    if (SLastWriterINum == NULL) {
        /* the source register is part of the architectural state */
        SPReg = ArchRegTable[SVReg];
    } else {
        /* the source register was last written or will be written by an
           instruction that is currently in flight.  What is that
           instruction's next destination register? */
        SPReg = NextDest[SLastWriterINum];
    }

Note that the operation of mapping from source virtual register to source physical register required nothing more complicated than one or two reads from a thing that looks like a register file. Mapping destination registers is a little more complicated. Assume that the destination virtual register is DVReg, the INum that the destination was last written by was DLastWriterINum, and the physical destination register will be DPReg. The INum of the instruction we are remapping is WriterINum.

    if (DLastWriterINum == NULL) {
        /* The last writer retired a while ago.  The DVReg is currently
           in architectural state. */
        LastDest[WriterINum] = ArchRegTable[DVReg];
    } else {
        LastDest[WriterINum] = NextDest[DLastWriterINum];
    }
    DPReg = NextDest[WriterINum];

Let's start from the bottom. Note that there is an array called "NextDest" that is indexed by the INum of the instruction we are currently mapping. This array contains the destination register for each in-flight INum. If a physical register number appears once in the NextDest array, it won't appear twice. That is, each in-flight instruction has a unique destination register. (Up to 256 in-flight instructions - 512 registers; not a coincidence.) All we need to do in translating from a virtual destination register to a physical destination register is look the instruction's INum up in the NextDest array. At the same time, we find out where the previous physical home for the DVReg was. We load this physical register number into a second array (LastDest) that contains the physical destination register that WriterINum will use at some point in the future.
The idea here is that if an instruction writes R5, then the previous home for R5 will be a "free" register when this instruction retires. This is a really knotty point. Read the paragraph again. It usually takes folks a few times through before they fully appreciate the elegant simplicity of this approach. Now notice that if we don't retire WriterINum, then the old home for DVReg (that is, physical register DPReg) will not become a free register. (If an instruction doesn't retire, it is as if we never wrote its destination operand.) On the other hand, since WriterINum never retired, we can re-use the same physical register stored in the NextDest array when we re-allocate WriterINum next time. If the instruction does retire, then we need to prepare WriterINum's entry in the NextDest array for the next time around. We can't leave NextDest[WriterINum] unchanged, or we'll over-write what might well be architectural state. So first we need to update the architectural state table:

    ArchRegTable[DVReg] = NextDest[WriterINum];

Then, we simply do this:

    NextDest[WriterINum] = LastDest[WriterINum];

Note also that on a trap (i.e. when we abandon a group of INums) we don't do anything at all in the PMP. The whole scheme looks pretty good. The downside here is that the obvious implementation requires two read ports into the NextDest array for the read operands, plus one read port for the write operand, plus one read port for writing the last writer from NextDest to LastDest, plus one write port for the LastDest-to-NextDest update, for each of eight instructions. In addition, since we may retire as many as 16 instructions per cycle, we need 16 read ports to copy from the NextDest table to the Architectural State Table. That's 48 read ports and one write port into the NextDest array, plus the read and write ports into LastDest and NextDest for instruction retiring and so forth. The first refinement is to turn the copy of LastDest to NextDest into a lateral copy that doesn't actually use any read ports at all. That makes the retire operation rather inexpensive in the PMP. The second refinement is to use group reads of contiguous entries (i.e. one read port is only needed to read out one block of 8 entries) for the 8 read ports for the write operand and the 16 read ports for updating the Architectural State Table. Therefore, we reduce the number of read ports of the NextDest array to 27. One obvious approach to reduce the port count further is to replicate the NextDest and LastDest arrays. The Qbox/Pbox team has developed an approach that replicates the NextDest array by a factor of three, reducing the requirement to a register-file-like array that is eight bits wide, 256 entries deep, and has nine read ports and one write port. This appears well within the bounds of practical implementation.
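To make the data flow through NextDest, LastDest, and the Architectural State Table concrete, here is a minimal C sketch of the rename and retire steps just described, using the same conventions as the chunklets above. The array widths, the MapSource/MapDest/RetireWriter names, and the NULL_INUM encoding are illustrative assumptions only; the real PMP is a multiported array structure, not software.

    #include <stdint.h>

    #define NUM_INUMS  256          /* in-flight instruction limit                 */
    #define NUM_VREGS  64           /* virtual (architectural) register specifiers */
    #define NULL_INUM  0xFFFF       /* hypothetical "last writer has retired" code */

    static uint16_t NextDest[NUM_INUMS];     /* phys reg each in-flight INum will write */
    static uint16_t LastDest[NUM_INUMS];     /* previous home of that INum's dest VReg  */
    static uint16_t ArchRegTable[NUM_VREGS]; /* phys reg holding architectural state    */

    /* Map one source operand: last-writer INum -> physical register. */
    static uint16_t MapSource(uint16_t SVReg, uint16_t SLastWriterINum)
    {
        if (SLastWriterINum == NULL_INUM)
            return ArchRegTable[SVReg];      /* value is architectural state       */
        return NextDest[SLastWriterINum];    /* value produced by in-flight writer */
    }

    /* Map one destination operand and remember the previous home of DVReg. */
    static uint16_t MapDest(uint16_t DVReg, uint16_t DLastWriterINum, uint16_t WriterINum)
    {
        if (DLastWriterINum == NULL_INUM)
            LastDest[WriterINum] = ArchRegTable[DVReg];
        else
            LastDest[WriterINum] = NextDest[DLastWriterINum];
        return NextDest[WriterINum];         /* DPReg */
    }

    /* At retire time the old home is freed by the lateral swap described above. */
    static void RetireWriter(uint16_t DVReg, uint16_t WriterINum)
    {
        ArchRegTable[DVReg]  = NextDest[WriterINum];
        NextDest[WriterINum] = LastDest[WriterINum];
    }

If WriterINum is killed instead of retired, nothing is done: NextDest[WriterINum] still holds a register that no in-flight or architectural value uses, so it is simply re-used the next time that INum is allocated.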
4.3.3 INum Allocator (INA)

4.3.3.1 Design Considerations

Each incoming instruction must be assigned an INum. INums are defined in INum Space (Section 4.2). The 21464 has a space of 512 possible INums, of which at most 256 are in use at any given time - the same as the maximum number of in-flight instructions. The INum is used for RAW dependency detection, branch mispredict recovery, and general exception identification. Dependency detection is discussed in Section 4.1.

If a branch instruction is executed and the Ebox/Fbox/Qbox determines that the branch was mispredicted, the executing unit must send the branch instruction's INum to the Ibox. The Ibox uses the INum as an index into a table containing the alternate branch address. The INum is also used to establish the priority of exceptions. If two exceptions are signalled to the Ibox at the same time, the Ibox will compare the INums responsible for the exceptions and choose the earliest INum. To address dependency detection, the assigned INums need only be "unique"; that is, two in-flight instructions can't ever have the same INum. The requirements imposed by exception processing are more stringent, however. In order to identify the "oldest" of a pair of in-flight instructions, their INums have to be assigned such that we can always determine whether instruction A is older than instruction B or not. This issue is described in Section 4.2.1. A more interesting problem for the INum Allocator is deciding how to divide INums amongst the different threads. The INum Allocator also subsumes the related but distinct functionality of the Map Thread Chooser (Section 4.3.3.3), which decides from cycle to cycle which TPU should map and pass an instruction block along to the Qbox.

4.3.3.2 Design Architecture

In single thread mode, INum allocation is simple; all we need to do is allocate INums in the range 0 to 255 and toggle a "wrap bit" that becomes INum<8> each time we pass through 0. Further, we need only make sure that all the in-flight INums are on the same half of the 512-entry circle. Multithread mode is a little more complicated. In this case, we need to maintain the ordering relationship between INums within a thread only. (A comparison between INum A from thread 0 and INum B from thread 1 is meaningless, or rather, doesn't need to have any significance. Exception priority resolution - a major reason for having INums - is done independently in each thread.) This ordering can be maintained by ensuring, as in the single thread case, that the insert pointer and retire pointer stay within the same 256-value range. The problem with INum allocation in multithread mode is ensuring an optimal allocation of INums to active threads. Perfect allocation requires knowledge of what each thread will do in the future. We don't have that knowledge, so we have to settle for "good enough" allocation. We considered dividing the space from 0 to 255 into four equal-sized chunks and allocating each chunk to one and only one of the four threads. We called this scheme "hard partitioning". This approach has one really big problem: it implies that when only a single thread is active, that thread can only have 64 instructions in flight. In this case, the 21464 takes a 20% performance hit. This is unacceptable for two reasons.

First, we want multithreading to be modeless: when there is only one active thread we are in single thread mode, and when there is more than one active thread we are in multithread mode. The transition from one to the other should not require great honking masses of machinery to reconfigure themselves. Second, there is some suspicion that many applications pass through serial sections of code where all the child threads are waiting on a single parent thread to do a particular task. In the hard partitioned scheme, that parent has no access to the idle resources for which the children have no need.
(The children are quiescent - asleep - waiting for the parent to do its thing.) This approach was too slow.

At the other end of the solution space, we considered completely free allocation of INums from a pool. We hate the idea of free-list choosers. This approach was too hard.

The approach we have chosen is a compromise. We divide the range 0 to 255 into blocks of 8 INums and interleave them among the four threads: Thread 0 owns blocks 0, 4, 8, 12, and so on up to block 28; Thread 1 owns blocks 1, 5, 9, 13, and so on. When a thread needs a new INum block, it looks forward from its insert pointer over the next four blocks for the first free block. When a thread is quiescent, it returns all of its "own" blocks to a "shared pool" as each block retires. (If some of a quiescent thread's blocks are not in flight, they enter the shared pool when the thread quiesces.) When a thread wakes up, it claims all of its own blocks from the "shared pool". When a thread X goes looking for a new block, it looks in the shared pool and in the list of idle (not in flight) blocks belonging to thread X. This has the advantage of making idle resources available to active threads, while avoiding complex free-list schemes. This approach was just right.

4.3.3.3 Map Thread Chooser (MTC)

Apart from being responsible for INum allocation proper, the INum allocator also contains the Map Thread Chooser (MTC), which is responsible for picking a valid thread to map each cycle. A valid thread is one that is not quiesced and either has instructions in the Ibox collapsing buffer or has its fetch valid bit set. The latter signal indicates that this thread's instructions are currently being fetched and thus can bypass the instruction buffer. No threads can or will map if the post-map skid buffer gets full or the INA runs out of INums. Sometimes one or both of these events occurs - or the MTC chooses a thread which turns out to be invalid - while a fetch block is en route from the Ibox to the Pbox. To enable recovery in these events, the Ibox retains the last fetch block sent until the Pbox indicates that the collapsing buffer can update itself. The MTC will assert the update signal any cycle it has a valid map thread choice and has INums available (since this implies the last block was successfully mapped). The MTC will choose the thread with the fewest consumed INum chunks that has valid instructions to map. In the event that two or more valid threads have the same least number of consumed INum chunks, the tie is broken using a round-robin algorithm. In order to monitor the number of INum chunks in flight, the MTC maintains four counters, one per thread. A thread's counter is incremented when one of its fetch blocks is passed to the Pbox from the Ibox's collapsing buffer and successfully mapped, and decremented when one of its map blocks is retired by the Qbox. The MTC determines which threads have valid instructions to map by monitoring the number of fetch blocks in flight that could be mapped by the time the map choice is accepted by the collapsing buffer stage. After making a map choice, the MTC forwards its selection to the Ibox. The MTC may choose a thread that cannot actually map because it had a line mispredict, set mispredict, Icache miss, ITB miss, etc. If any of these Ibox mishaps occurs, the Ibox will send a "slot 0 invalid" signal to the MTC. If the MTC sees this signal, it knows that the chosen thread cannot map this cycle. Therefore it will not update its counters, but will restart its map choice on the next cycle. It will also clear all of the bits in the Map TPU signal, indicating that the current map thread choice is invalid.
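As an illustration of the choice rule just described, the following C sketch picks the valid thread with the fewest consumed INum chunks and breaks ties round-robin. The TpuState structure, the four-entry loop, and the rotation-based tie break are assumptions for illustration only; the actual MTC is combinational logic, not software.

    #include <stdbool.h>

    #define NUM_TPUS 4

    typedef struct {
        bool valid;            /* not quiesced and has mappable instructions */
        int  chunks_in_flight; /* per-thread consumed INum chunk counter     */
    } TpuState;

    /* Returns the TPU to map this cycle, or -1 if no valid thread exists.
     * rr_ptr is the round-robin pointer; scanning starts just after the
     * previous winner so that threads tied on chunks_in_flight take turns. */
    int ChooseMapThread(const TpuState tpu[NUM_TPUS], int *rr_ptr)
    {
        int winner = -1;

        for (int i = 0; i < NUM_TPUS; i++) {
            int t = (*rr_ptr + 1 + i) % NUM_TPUS;
            if (!tpu[t].valid)
                continue;
            if (winner < 0 || tpu[t].chunks_in_flight < tpu[winner].chunks_in_flight)
                winner = t;     /* strictly fewer chunks wins; ties keep the
                                   earlier thread in round-robin scan order  */
        }

        if (winner >= 0)
            *rr_ptr = winner;
        return winner;
    }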
4.3.4 Mapper Exception Logic (MEX)

4.3.4.1 Design Considerations

When the Ibox signals that it has fielded an exception, the IMP, PMP, LSN, and RIF must restore their state to the time the trapping instruction was mapped. The INum allocator insert pointer is unchanged, but all INums between the trap point and the current insert point must be abandoned.

4.3.4.2 Design Architecture

The Mapper Exception logic (MEX) takes as inputs the thread id and the exception INum of an exception (from the BEL) and accesses the appropriate column in the backmap (BMAP) of each affected section (IMP, PMP, LSN, and RIF) to roll the thread's context back to the trap point. In addition, the MEX clears all relevant BMAP state between the trap and insert points. (The "writers" mask in each cell of the BMAP between the trap and insert point is cleared if the corresponding INum has been abandoned. This way, when the state of the mapper is restored, all registers whose last writer is either retired or abandoned will be marked as being "ready to read".)

4.3.5 Memory Queue Allocation Unit (MQA)

The Memory Queue Allocation Unit (MQA) governs the allocation and deallocation of load queue (LQ) and store queue (SQ) chunks to memory instructions. The Mbox accepts and implements the allocation decisions of the MQA. However, the Mbox initiates the deallocation process, with the exception of killed instructions, whose LQ/SQ chunks are aggressively reclaimed by the MQA. The MQA also controls the Highwater Mark (HWM) that is sent to the Qbox to regulate the issuing of loads and stores.

Note: We will attempt to explain the operation of the MQA in terms applicable to both load and store queue functionality. However, we will default to using the store queue operation when a description in generic terms becomes too cumbersome. Differences between the load and store queue allocation logic will be noted as necessary.

4.3.5.1 Allocation

Allocation is based on the per-TPU demand for load/store queue chunks (LSChunks). The MQA assigns LSChunks only to active (i.e. not quiescent) TPUs which exhibit demand by mapping load and store instructions - with one twist. There is a delay of several cycles between when the HWM is elevated and when the Qbox recognizes that an instruction that was above the HWM is now below it and may issue. To mitigate this latency, the MQA artificially inflates demand for each active TPU, so that even without mapping any memory instructions, a TPU has demand for some number of LSChunks. This number is currently thought to be 2, but that is the subject of ongoing performance model experiments. This inflated demand effectively results in the preallocation of LSChunks to a TPU, and a corresponding elevation of the HWM, such that data-ready memory instructions have a better chance of issuing immediately upon entering the instruction queue (IQ).

4.3.5.2 Background and Terminology

To make it easier to discuss the MQA algorithm, we define the following terms:

YLSNum     The Youngest LSNum allocated to mapped instructions by the LSN.
ADLSNum    The Artificial Demand LSNum - i.e. YLSNum plus artificial demand; provided to the MQA by the LSN.

In addition, there are a few important background facts to keep in mind when considering MQA operation:

• LSChunks contain groups of 4 LSNums.
• The HWM is maintained on an LSChunk granularity (i.e. it does NOT change on the granularity of individual LSNums).
• ADLSNum is also maintained on an LSChunk granularity.
• Only instructions with LSNums strictly below the HWM may issue; it follows that the HWM is always immediately above the last allocated LSChunk.
• Each of the two load queues and the one store queue has a completely distinct, independently managed LSNum/HWM space; so do the TPUs within a given load/store queue.
• LSNums, unlike INums, are allocated continuously; there is no necessary connection between the boundaries of INum blocks and LSChunks. This fact comes into play most significantly in dealing with kills and retires.

4.3.5.3 Basic Allocation Loop

We will use store queue operation to illustrate the basic allocation loop. On every cycle, each TPU compares its store ADLSNum (from the LSN) to the current store HWM. If ADLSNum >= HWM, this TPU has demand for store chunks and submits a bid to TPU arbitration. If there are one or more free store chunks available, the TPU Arbitration unit selects a winner from the active TPUs which have demand. The winner is the TPU from the set of bidders which was least recently allocated an LSChunk; the TPU which wins in this cycle goes to the back of the line. Arbitration declares no winner if there are no active TPUs with demand and/or no free chunks. The MQA allocates an LSChunk to a TPU which wins arbitration in a given cycle. Conceptually, when an LSChunk is allocated, it is set aside for the stores whose SNums are in the range of the ADLSNum value for which the bid was generated. For example, if the bidder's ADLSNum was 80, the allocated LSChunk will contain the stores with SNums 80, 81, 82, and 83.

A successful arbitration leads to a number of events, both internal and external. Internally to the MQA, the winning TPU updates its store HWM by adding one LSChunk. The Allocation Picker chooses which of the free LSChunks will be allocated, and the ADLSNum value is written into that LSChunk's entry in the Tag Array. The match enable (MATCH_EN) bit for that entry is also set. Finally, the allocated chunk is removed from the Available Vector and assigned to the Inflight Vector of the winning TPU. As for externally visible events, the MQA sends the winning TPU ID, ADLSNum, encoded LSChunk ID, and a valid bit to the Mbox, which writes the TPU and ADLSNum into the appropriate SQ entry. When stores with LSNums in the range of this ADLSNum issue, their state will be assigned to this entry. In addition, the MQA sends the updated HWM to the Qbox. Going back to the beginning of the process, if on a given cycle ADLSNum < HWM for a given TPU, then it has no additional demand for store chunks. It does not submit a bid to arbitration nor update its HWM.

4.3.5.4 Reset

During reset, the HWM for each TPU is set to 0, and LSNum allocation also starts from 0. However, the LSN initializes the ADLSNum for each TPU to a positive value, ensuring that all TPUs come out of reset with demand. At their first opportunity, the TPUs will bid and arbitrate to raise their HWMs to the point where ADLSNum < HWM - possibly before any memory instructions have been mapped or even fetched.
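The per-cycle bid/arbitrate/allocate sequence of Section 4.3.5.3 can be summarized in a short C sketch. The structure and function names here are illustrative assumptions, wrap-around of the LSNum space is ignored, and only the store-queue side is shown; the hardware does this with per-TPU comparators and a picker, not a software loop.

    #include <stdbool.h>
    #include <stdint.h>

    #define NUM_TPUS          4
    #define LSNUMS_PER_CHUNK  4

    typedef struct {
        bool     active;      /* TPU is not quiescent                          */
        uint16_t adlsnum;     /* artificial-demand LSNum supplied by the LSN   */
        uint16_t hwm;         /* store high-water mark, chunk granularity      */
        uint32_t last_grant;  /* age stamp for least-recently-allocated choice */
    } TpuDemand;

    /* One cycle of store-chunk allocation.  free_chunks is the number of
     * chunks in the Available Vector; returns the winning TPU or -1.      */
    int StoreChunkAllocCycle(TpuDemand tpu[NUM_TPUS], int free_chunks, uint32_t now)
    {
        int winner = -1;

        if (free_chunks <= 0)
            return -1;                                 /* nothing to hand out */

        for (int t = 0; t < NUM_TPUS; t++) {
            if (!tpu[t].active || tpu[t].adlsnum < tpu[t].hwm)
                continue;                              /* no demand, no bid   */
            if (winner < 0 || tpu[t].last_grant < tpu[winner].last_grant)
                winner = t;                            /* least recently granted wins */
        }

        if (winner >= 0) {
            tpu[winner].hwm       += LSNUMS_PER_CHUNK; /* raise HWM by one chunk      */
            tpu[winner].last_grant = now;              /* go to the back of the line  */
            /* The real MQA also picks a free chunk, writes the ADLSNum into its
             * Tag Array entry, sets MATCH_EN, moves the chunk from the Available
             * Vector to the winner's Inflight Vector, and notifies the Mbox and
             * Qbox; those side effects are omitted here.                         */
        }
        return winner;
    }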
4.3.5.5 Deallocation

The following sections cover how the MQA responds to kill and retire events. However, at this point it is important to address the non-obvious role of the Mbox in deallocation. It is easy to grasp that LSChunks mapping to instructions in the shadow of a kill can deallocate, although there are some subtleties we will discuss shortly. One would also tend to think that LSChunks in the shadow of a retire can be immediately reclaimed, but this is not the case. Both stores and loads may surrender their INums before they are ready to actually leave the SQ or LQ. Stores maintain state in the SQ until they are copied out of the merge buffer, which may happen not only long after they become retireable, but out of INum order. In the LDQ, prefetches never lead to a retry and thus retire early, possibly long before they have executed. For these reasons, the MQA may only deallocate LSChunks for retired instructions once the Mbox says it is safe to do so. This is achieved via a fully-decoded deallocation signal per LQ/SQ. The MQA response to this signal is to remove the designated LSChunks from the Inflight Vectors and place them in the Available Vector.

4.3.5.6 Kills

When a kill occurs, the MQA has to reclaim any LSChunks in the shadow of the kill. In the interest of performance, the MQA tries to reclaim any LSChunks that map onto killed instructions. This is achieved by comparing the ADLSNum corresponding to the killed instruction with all LSChunk tags for the kill TPU. All tags where LSChunk >= kill ADLSNum and the MATCH_EN bit is set generate a match signal, which leads to their removal from the TPU's Inflight Vector and addition to the Available Vector. The MATCH_EN bit is also cleared for all matching tags, making them ineligible for future comparisons - this avoids aliasing problems when the LSNum space wraps. Note that we kill from the ADLSNum corresponding to the kill INum - a value obtained from the LSN Backmap - not the YLSNum which maps to the kill INum. This means that the blocks between the YLSNum and ADLSNum - i.e. most of the blocks allocated due to artificial demand - remain allocated to the TPU after a kill, a handy optimization. This also has the side effect of solving the problem of partially-killed LSChunks. The effective kill point is always at the beginning of some LSChunk after the true kill point, so we don't need to worry about whether the kill occurs in the middle of a chunk or not. Also note that the mapping from a kill INum to the corresponding YLSNum (and ADLSNum) is subtle. If the kill INum is that of a store instruction, for example, then the corresponding kill YLSNum is the SNum of that store. But if the kill is for some other type of instruction, the kill YLSNum is the SNum of the last store allocated prior to this instruction. The aggressive reclamation of killed LSChunks by the MQA has an important implication for the Mbox. To avoid a hazard, the Mbox must not send a deallocation signal for LSChunks that are completely in the shadow of a kill. Otherwise, the MQA could receive spurious deallocation signals for LSChunks that it has just reclaimed and then allocated. Making killed LSChunks available for reallocation is only part of the task. The MQA also needs to check the relative position of the kill and the current HWM. If kill ADLSNum >= HWM, the HWM has not yet caught up with (or is just equal to) demand, and there is no problem.
However, if HWM > kill ADLSNum, then the High-Water Mark is above the point where we have allocated LSChunks for this TPU and must be lowered. In this case, the MQA sets HWM = kill ADLSNum. Note how this means that the TPU comes out of a kill with a positive demand for LSChunks. As a final note, the Mbox must make sure on kills that it is truly good and done with LQ/SQ entries before the MQA has a chance to reallocate them. The pipeline would appear to provide more than enough time for this, but we need to make sure.

4.3.5.7 Retires

We have already discussed how retirement of LSChunks is decoupled from deallocation, due to the tendency of stores and prefetches to linger in their queues past retirement. The immediate action that the MQA must take on retirement is to disable the retired instructions from matching against future kills (and retires). The mechanism for handling this operates as follows: on a Retire Block event, the LSN reads its Backmap and supplies the YLSNum corresponding to the retiring block. Note that this must be the YLSNum, not the ADLSNum, since the latter corresponds to instructions still in flight! The LSChunk tags compare against the retire YLSNum, and all LSChunks for the TPU that are in the shadow of the retire (i.e. older and with MATCH_EN bits set) clear their MATCH_EN bits. There are actually two different cases:

1. If the retire YLSNum ends in binary 11 (i.e. YLSNum<1:0> == 11), then clear MATCH_EN for all tags where LSChunk <= retire YLSNum.
2. If the retire YLSNum does not end in binary 11, then clear MATCH_EN for all tags where LSChunk < retire YLSNum.

In other words, if the retire YLSNum corresponds to the youngest entry in an LSChunk, the chunk is fully retired and we may clear MATCH_EN for it and all older chunks. If the retire YLSNum is not the youngest one in a chunk, then the chunk is only partially retired, and only the MATCH_EN for the older chunks may be cleared. LSChunks will typically retire well before they are deallocated. However, the MQA retires on a granularity of Retire Block events (i.e. INum blocks), whereas the Mbox retires at the resolution of Next-to-Retire events (i.e. individual instructions). For this reason, the MQA may see a block deallocate before it has retired - for instance, if a Next-to-Retire of a store allows the Mbox to release an SQ chunk before the INum block containing that store can retire. This means that the block in question will reside in the Available Vector and no longer in the TPU's Inflight Vector, but still have its MATCH_EN set. This is not a problem, since the fact that the chunk doesn't belong to any TPU means that its tag won't match on anything; the tag state will be overwritten when the chunk reallocates.

4.3.5.8 Quiesce

[NOTE: This is just a sketch - we'll need to decide if things actually work this way or not - Peter]

In the interest of performance, a TPU going into Quiesce needs to release all of its LSChunks to the free pool. The Mbox signals Quiesce to the MQA only after the Quiesce trap has been taken, and after all LQ and SQ instructions for the TPU have completed. When the MQA sees the Quiesce signal, it removes all allocated LSChunks from the Inflight Vector of the TPU and places them into the Available Vector. This avoids forcing the Mbox to send out deallocate signals for any partially-utilized LSChunks belonging to the TPU going into Quiesce. The MQA knows that when it sees the Quiesce signal, it is safe to reclaim all chunks allocated to the TPU. Coming out of Quiesce is relatively straightforward; for the TPU in question, it looks very much like coming out of reset. LSNum allocation restarts at 0, and the HWM also starts at 0, arbitrating its way up to the level called for by the ADLSNum in the first few cycles after waking up.
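The kill and retire comparisons of Sections 4.3.5.6 and 4.3.5.7 both amount to a match across the LSChunk Tag Array. The following C sketch is one possible reading of those rules: it compares at chunk granularity, ignores wrap-around of the LSNum circle, and the array size and names are assumptions for illustration only.

    #include <stdbool.h>
    #include <stdint.h>

    #define NUM_LSCHUNKS      16   /* assumed number of tag entries per queue, per TPU */
    #define LSNUMS_PER_CHUNK   4

    typedef struct {
        uint16_t tag;       /* ADLSNum written at allocation time (chunk-aligned) */
        bool     match_en;  /* eligible for kill/retire matching                  */
        bool     inflight;  /* belongs to this TPU's Inflight Vector              */
    } LSChunkTag;

    /* Kill: reclaim every matching chunk at or above the kill ADLSNum. */
    void KillMatch(LSChunkTag tags[NUM_LSCHUNKS], uint16_t kill_adlsnum)
    {
        for (int c = 0; c < NUM_LSCHUNKS; c++) {
            if (tags[c].match_en && tags[c].tag >= kill_adlsnum) {
                tags[c].inflight = false;   /* back to the Available Vector   */
                tags[c].match_en = false;   /* avoid aliasing after LSNum wrap */
            }
        }
    }

    /* Retire: only disable future matching; deallocation waits for the Mbox. */
    void RetireMatch(LSChunkTag tags[NUM_LSCHUNKS], uint16_t retire_ylsnum)
    {
        bool     fully_retired = ((retire_ylsnum & 0x3) == 0x3); /* youngest in chunk */
        uint16_t retire_chunk  = retire_ylsnum / LSNUMS_PER_CHUNK;

        for (int c = 0; c < NUM_LSCHUNKS; c++) {
            uint16_t chunk = tags[c].tag / LSNUMS_PER_CHUNK;
            if (!tags[c].match_en)
                continue;
            if (fully_retired ? (chunk <= retire_chunk) : (chunk < retire_chunk))
                tags[c].match_en = false;   /* partially retired chunks keep matching */
        }
    }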
4.3.5.9 Merge Buffer Purging

[This is a placeholder for whatever policy we decide to implement. - Peter]

4.3.6 Instruction Decoder (IDC)

4.3.6.1 Design Considerations

In the traditional microprocessors of yore, there was an instruction decoder (just one). It sat at the front end of the processor, parsing instructions as they came into the machine and telling the functional unit(s) exactly what to do for each operand: which registers or memory to get data from, how to operate on the data, where to put it, etc. The 21464 has many more registers, functional units, and operations than that Jurassic CPU, and by virtue of this additional complexity - and size - is far less centrally controlled. There is also a complex register renaming process and out-of-order scheduling operation between the processor front end and the functional units, so this classical model of operation is not practical. Still, there are certain things we would like to know about the instructions within the Pbox and Qbox before we ship them off to their ultimate destinations.

4.3.6.2 Design Architecture

The Pbox Instruction Decoder (IDC) acts similarly to the instruction decoder in a traditional textbook microprocessor, with a few important distinctions. First of all, the IDC operates on up to 8 instructions in parallel. Secondly, the output of the IDC does not directly or exclusively drive the functional units which execute the instructions - the Ebox, Fbox, and Mbox, which also see the opcode and function bits, perform local decoding to determine (for the most part) how to execute a given instruction. Rather, the IDC outputs are signals which highlight particular properties of the instructions, in the manner of scoreboarding, mode, predecode, or valid signals, for example. Some of the signals go to the Load/Store Serial Number Allocator or the RC/RS Interrupt Flag Widget to condition their behavior. The remainder go through the Post-Map Skid Buffer to the Mbox, or to the Qbox, where they either influence scheduling decisions and/or get cached in the Payload Arrays for distribution at issue time. One of the most important specific functions of the IDC is to provide slotting information to the Instruction Queue.

4.3.7 Load/Store Serial Number Allocator (LSN)

4.3.7.1 Design Considerations

The 21464 issues loads and stores out of order. This can lead to deadlock situations in the Mbox if things aren't managed properly. For instance, imagine we executed the following chunk of code:

    001: ST R3 -> (R5)
    002: LD (R5) -> R2
    003: LD (R9) -> R2

Note that these three instructions must execute in order if we are to get the "right" answer. Now imagine that we have an Mbox with a combined load/store queue. That is, all stores and loads enter a single queue. Further, suppose that the queue had just two entries. Instructions enter the queue at issue time, and leave the queue when it is known that they will retire. Suppose R3 does not become data ready until well after R5 is data ready. In that case, the two load instructions will issue before the store instruction. They will consume both entries in the LD/ST queue.
Neither, however, can leave the queue, since we don't know whether either will be able to retire until instruction 001 issues. (Note that we don't need to wait until they retire to eject them; we just need to make sure they won't need to be replayed. Without seeing all "earlier" stores, we can't make that decision.) The LD/ST queue is now deadlocked. Yes, this example is contrived - we won't build a combined queue of just two entries. However, the deadlock behavior is inherent in the design and is inevitable if the number of LD/ST queue entries is less than the in-flight instruction limit. (Or perhaps the limit is the size of the instruction window - it depends on when instructions are allowed to leave the IQ window.)

4.3.7.2 Design Architecture

To solve this problem, we assign an ascending serial number to each load and store instruction. (Each load will get an LNum and each store will get an SNum.) The Mbox has separate load and store queues. Each has sixty-four entries. We will allocate LNums and SNums in the range 0 to 255, which accommodates the pathological case where all 128 instructions in the queue are loads. Each TPU has its own independent LNum and SNum space. Memory barrier instructions will get an SNum. (A memory barrier instruction will never be dispatched to a functional unit - it is a NOP. But a marker for the MB will be sent to the Mbox when the MB passes through the Post-Map Skid Buffer.) The Mbox can always deduce the next LNum or SNum to be allocated and the position in the load and store queues of each decoded barrier instruction.

As the Mbox processes entries in the LS queues, it frees up space in the queues. On each cycle the Mbox will send two values per TPU - the load and store high-water marks, MAXLNum<8:0> and MAXSNum<8:0> - to a box in the IQ. This box is called the Load/Store Number High-water Marker (HWM). Each load/store/barrier instruction in the IQ stores its L/S number in the HWM. Each load instruction compares its LNum to the current load high-water mark for its TPU. If it is "less than" (or "below") the high-water mark, then the load instruction may issue as soon as it is data ready. Otherwise the load instruction must wait. Stores behave similarly. Note that the comparison is actually "less than" but within the same half of the 512-point LSNum "circle".

Finally, note that the LNums and SNums must be reclaimed when instructions are abandoned due to a trap. This recovery behavior differs from the IMP mechanism; otherwise, the LSN is almost identical to the IMP. In this case, the backup map array for the LSN is indexed as LSN_BMAP(B,IsLoad); that is, there is a column for each block of INums, and one row for loads and one row for stores. The LSN_CMAP is a pair of counters that are incremented each cycle by the number of loads/stores that were allocated in that cycle.
On recovery, the LSN_CMAP is restored like this:

    FOR i = LOAD TO STORE DO
        IF LSN_BMAP(TrapINum<7:3>,i).M<7:0> == 0 THEN
            LSN_CMAP(i) = LSN_BMAP(TrapINum<7:3>,i).LAST_NUM
        ELSE
            LSN_CMAP(i) = LSN_BMAP(TrapINum<7:3>,i).LAST_NUM
            FOR j = 0 TO TrapINum<2:0> - 1 DO
                IF LSN_BMAP(TrapINum<7:3>,i).M<j> == 1 THEN
                    LSN_CMAP(i) = LSN_CMAP(i) + 1
                END
            END
        END
    END

That is, the LSN_CMAP entry for loads is set to the last LNum allocated before the trap point. If the trap point is in the middle of a block, we check to see if there were loads before the trap point but within the trap block. If so, we increment the LAST_NUM by the number of loads we find. Stores are treated in the same way.

4.3.8 Post-Map Skid Buffer (PSB)

4.3.8.1 Design Considerations

Because there are fewer instruction queue entries available than the in-flight instruction limit, the IQ sometimes fills. Because of pipeline delays, the Pbox may not realize that the IQ is full until well after it has filled, and several instructions have been dropped on the floor. In this circumstance, we don't want to signal a trap all the way back to the Ibox, because this wastes time and (perhaps more importantly) wastes INums. (We do not reclaim bad-path INums when a trap occurs, only after their retirement.) It is much better to buffer up the excess instructions until the Ibox can be informed to stop fetching in a tidy manner.

4.3.8.2 Design Architecture

As each instruction passes out of the main Pbox forward path components (INA, IMP, PMP, IDC, LSN, and RIF) on its way to the Qbox, it is copied into the post-map skid buffer circular queue. When an IQ full stall is signalled, the skid buffer sets the front pointer of the queue to point to the instruction block after the last one that was successfully allocated into the IQ. Which one this is can be determined in advance from the pipeline timing. When the IQ full condition is cleared, the skid buffer replays the stored map blocks down to the IQ. Control of the Ibox is achieved by sending an INHIBIT_ISTREAM signal to the INA, which incorporates it into the "INums available" information that it sends to the Ibox, thus signalling that there are NO available INums. Rather than simply telling the Ibox to send new instructions every time the Qbox signals that the IQ is not full, we prefer to empty out the PSB so that we can bypass the new data coming down the pipe to the Qbox. To do this, we make sure that the PSB is empty enough that by the time the new instructions get to the PSB, we are able to bypass the data to the Qbox and not write it into the PSB only to be read out sometime later. The skid buffer contains all the muxing logic necessary to select between the silo outputs and the unbuffered forward path. The post-map skid buffer (PSB) is approximately 3 entries deep; the actual depth is dependent on the control pipeline timing from when the Qbox signals that the IQ is full to when we can tell the Ibox to stop sending the Pbox new instructions, and on the data pipeline timing from the Ibox collapsing buffer to the IQ. Each entry stores the TPU, the instruction valid vector, and all the data that goes to the Qbox for a given instruction. The PSB is built in 4 separate storage arrays - one for data produced in P1A, another for P3A data, and two for P2A data since these signals span a layout partition boundary. The P2A and P3A arrays use the control piped from the P1A array.
The control logic itself uses no state machines apart from the read and write pointers, which are one-hot and either advance or maintain their current position on every cycle. The decisions of whether or not to move the pointers, and whether to bypass incoming blocks or read out of the buffer, are based on four binary variables: the adjacency (or not) of the pointers, the (in)validity of the incoming block, the current state of the output mux, and the fullness (or not) of the IQ. On a kill, the trap TPU needs to be compared to the stored TPUs. If there is a match, then the instruction valid vector needs to be set to all zeros. If there are any valid entries in the PSB after a trap (e.g. there are multiple threads running, at least two threads have data in the PSB, and one of the threads traps), the Pbox will send the invalidated data to the Qbox in subsequent cycles when there is room in the IQ to accept instructions. In other words, we won't collapse out invalid instructions if there are any valid instructions in the PSB. We do cause a performance loss by sending invalidated map blocks to the Qbox in the case when a trap occurred to a thread that had blocks in the PSB. In the case when the machine is running a single thread, we would not need to send the invalidated map blocks, since we could tell that the skid buffer is empty. To determine that it is empty, we need to OR across the instruction valid vectors. If all instruction-valid vectors are 0, then the PSB is empty!

4.3.9 RC/RS Interrupt Flag Widget (RIF)

4.3.9.1 Design Considerations

The Alpha SRM specifies two instructions that are to be used by code that translates or emulates VAX instructions. The two instructions are RC (for Read and Clear interrupt flag) and RS (for Read and Set interrupt flag). In real life on an in-order machine, RS Rx sets a bit called the interrupt flag and writes the previous state of the bit to register Rx. (RC Rx, as you might guess, clears the interrupt flag and saves the previous state to register Rx.) The SRM further says that this bit is cleared whenever a PALCALL_REI instruction is executed. PALCALL_REI is the PAL instruction sequence that is called whenever an interrupt service routine decides to return to the process that got the interrupt. Imagine the following code sequence:

    Retry:  RS R9
            ... do something that is probably wrong if an interrupt occurs ...
    End:    RC R9
            BLBC R9, Retry

If an interrupt occurs and is serviced between Retry and End, then the interrupt flag will get cleared by the PALCALL_REI routine that returns back to our program segment. This is not really all that hard on an in-order machine. The 21464 is not an in-order machine. The interrupt flag must be maintained as in-order state. The 21464 only knows about program order up to the point at which instructions are sent to the Qbox. (And then again, in the completion unit: we map dependencies in order, and retire instructions in order - everything else is higgledy-piggledy.) So it seems natural that we'd try to do the interrupt flag processing in the Pbox at map time (or soon thereafter). If this paragraph looks familiar, it is because I copied it from the description of the Store/Conditional Failure Widget (which no longer exists).

4.3.9.2 Design Architecture

The interrupt flag (we'll call it the INTerrupt Flag, or INTF) is computed based on the speculated Istream.
For this reason, the state of the flag must be "rolled back" when the Ibox erroneously predicts a path between an RS and an RC instruction (or, for that matter, between two RS's or two RC's).

Let INTF_CURRENT represent the current value of the interrupt flag. This value must be maintained on a per-thread basis. We also maintain an interrupt flag table, INTF_T, with an entry for each INum. INTF_T can be shared among threads, since the INum space is partitioned between threads in a non-overlapping manner. Here are the rules for generating and maintaining the flag. Assume for now that we process instructions coming from the Ibox one at a time.

• At reset, clear INTF_CURRENT[3..0] (i.e. the current value for every thread).
• If instruction X in TPU Y is an RS, copy INTF_CURRENT[Y] to the INTF bit in the RS's payload. Set INTF_T[X] and INTF_CURRENT[Y].
• If instruction X is an RC, copy INTF_CURRENT[Y] to the INTF bit in the RC's payload. Clear INTF_T[X] and INTF_CURRENT[Y].
• For all other instructions X, copy INTF_CURRENT[Y] to INTF_T[X].
• On a trap where INum T in TPU Z is the trap point, copy INTF_T[T] to INTF_CURRENT[Z].

You will notice that we didn't mention PALCALL_REI. That's because PALCALL_REI is implemented as a stream of PAL instructions. Before the PALCALL_REI actually returns to the interrupted program, it will perform an RC R31 to clear the INTF bit as per the SRM.

4.3.10 Bid/Grant Exception Logic (BEL)

4.3.10.1 Design Considerations

The Pbox shares the responsibility with the Ibox for deciding which of all pending disruptions (exceptions, traps, interrupts, etc.) should be fielded. The function of the Bid/Grant Exception Logic (BEL) is to choose one execution-time or retire-time disruption per cycle out of the pool to trigger a chip-wide kill and send its INum to the Retire/Kill Unit (discussed below), which controls the Retire/Kill Bus (RK Bus). The BEL must also inform the Ibox so that it can make the final decision as to whether the kill will actually occur. Keeping track of execution-time disruptions for every in-flight instruction, and arbitrating between disruptions from different TPUs, are among the challenges in implementing the BEL. This discussion is fairly localized to the Pbox and may not shed much light on the chip-wide framework and arbitration mechanism for disruptions. For a better understanding of this context, see Section 22.1.

4.3.10.2 Design Architecture

The work of the BEL is simplified by the fact that the Qbox Completion Unit (CU) maintains the information associated with retire-time disruptions. The CU passes along no more than one candidate at a time - the next instruction eligible to retire, if it has any associated disruptions. The BEL must keep track of the execution-time disruptions for every in-flight instruction, and decide on a per-TPU basis which one is the oldest. Only the oldest in each TPU matters, since any younger disruptions are by definition on the bad path and will be killed off. This sorting by age is done in a decoded INum space. Disruptions on one TPU have no effect on another, since they represent independently executing programs at the hardware level. Nevertheless, the 21464 can only broadcast one kill at a time, which requires an arbitration mechanism.
The BEL therefore hops in a round-robin fashion between those TPUs which are signalling any execution-time disruptions. If the CU is signalling a retire-time disruption on a given cycle, that disruption trumps the execution-time ones, since it is, by definition, the oldest disruption in the machine. The winner of this arbitration process has its INum and TPU passed on to the Ibox and the Retire/Kill Unit. There is an additional dimension to kills: namely, whether the kill affects only the instructions younger than the one corresponding to the disruption INum, or includes that particular instruction as well. This "kill at"/"kill after" distinction is also passed on by the BEL.

4.3.11 Retire/Kill Unit (RKU)

4.3.11.1 Design Considerations

The Pbox owns the responsibility of driving the highest priority Retire INum or Kill INum to all the boxes on the chip that have state affected by instructions being killed and/or retired. The Retire/Kill Unit (RKU) controls the Retire/Kill Bus (RK Bus), which is the medium for communication of these events for the entire CPU. The structure of the RK Bus is such that only one kill or retire INum may be broadcast per cycle, which requires some means of arbitration. Also, since the Ibox is involved in the arbitration process, there are delays between when the BEL informs the Ibox of its choice for a kill and when the kill can be driven onto the RK Bus.

4.3.11.2 Design Architecture

Since the 21464 retires all instructions in order, there is only one possible candidate for retirement at any given point in time, winnowing down the choices for what to broadcast on the RK Bus. Further, as described above, the BEL funnels all pending kills down to one candidate per cycle. As the final arbitration mechanism, the RKU prioritizes kills over retires. If there is a valid kill and a valid retire in the same cycle, the kill will win and be broadcast on the RK Bus, causing the retire to be stalled and to try again the next cycle. The RKU has a stall pipeline that queues retire requests from the Qbox Completion Unit (CU). A kill only becomes valid if the Ibox signals to the RKU that it may proceed. A pipeline in the RKU manages the latency between a kill being passed from the BEL and the valid/invalid indication returning from the Ibox. Retire-time exceptions (RTEs) are handled in a special way. The RKU first broadcasts them as a next-to-retire INum event, after which they are passed along to the BEL to be re-broadcast in the form of a kill. The RK Bus sends the retire or kill INum and TPU, and whether the event is a kill or retire. The RK Bus also broadcasts whether a kill should occur at or after the retire/kill INum, or, in the case of a retire, whether it applies just to the INum in question or to the entire INum block. For a detailed description of the RK Bus signalling protocol, please consult Driving the Retire/Kill Bus (RK Bus).

5 Instruction Issue and Retire Unit - the Qbox

The Qbox processes instructions that are renamed by the Pbox and determines an appropriate schedule for those instructions.
When all input operands for an instruction x have been produced, or will be produced by an instruction y already in the execution pipeline, we say that instruction x is "data ready". The Qbox selects the eight best "data ready" instructions for execution in eight integer pipeline units and four floating-point pipeline units. In addition, the Qbox selects up to four data-ready branch instructions for resolution in each cycle. It also retires all eligible instructions, committing them to architectural state. The Qbox consists of the following components:

Table 5-1 Qbox Component Summary

Queue Chunk Allocator/Deallocator (ALC) - Manages the 32 instruction queue allocation chunks. Picks the two chunks to be allocated to the next group of eight instructions. (Section 5.2.14)

Instruction Queue (IQ) - The queue from which instructions are picked for execution. (Section 5.2.1)

Queue Entry Table (QET) - Translates INum dependencies delivered from the Pbox IMP stage into queue entry number dependencies. It also sets the No Live Dependency (NLD) bits, which are set, for instance, when an instruction enters the queue data ready. (Section 5.2.2)

Dependency Arrays (DAs) - Hold an identifier for the producer of each operand for each instruction in the queue. (Section 5.2.3)

Picker Arrays (PKs) - On each cycle, choose the oldest data ready instruction for each execution pipeline. (Section 5.2.4)

Bid Enable Logic (BID) - Prevents otherwise ready instructions from bidding in pipes that cannot service them, either because of a slotting decision or because of a non-data-related resource conflict. (Section 5.2.5)

Completion Unit (CMP) - Keeps track of which instructions have issued, which have passed their trap points, which are I/O instructions, and which have retired. (Section 5.2.16)

Destination Register Number Array (DRN) - Holds the destination register specifiers for each instruction. This array is separately located from the SRN because it is not on any performance-critical paths. (Section 5.2.9)

Exception Kill Logic (EKC) - Is responsible for removing from the Instruction Queue any instructions that have been killed due to an exception. (Section 5.2.18)

In-Flight Table (IFx) - Keeps track of instructions that have issued and feeds INums which have passed their trap points to the Completion Unit. (Section 5.2.15)

Load/Poison Re-arm Widget (LPR) - Handles notification of load/miss events from the Mbox and ensures that all instructions that depend on a missed load will replay at some later time. The LPR also determines when individual instructions are eligible to be deallocated. (Section 5.2.11)

Load/Store Number High-Water Marker (HWM) - Disables load and store instructions whose LSNums indicate that there may not be space available for them in the Mbox load/store queues. Also contains the logic for preserving the consistency of the DTB on misses. (Section 5.2.10)

Oldest CBR Selector (OCS) - Identifies the oldest conditional branch issuing in the current cycle (that is, the one most likely to cause a misprediction). (Section 5.2.13)

Payload Array (PAY) - Contains all the instructions and the register file addresses of all operands. (Section 5.2.17)

Post-Issue Logic (PIL) - Gathers bubble requests and routes them to the appropriate pipelines. The PIL is also responsible for sequencing completion signals for the floating-point pipelines. (Section 5.2.12)

FPCR Control (FCR) - Controls the update of the FPCR in the Fbox. (Section 5.2.6)

Profile-Me Data Collection (PFM) - Collects the following instruction-time-oriented performance data for the two in-flight profile-me instructions: data ready, bid, issue, deallocation, and queue chunk deallocation. (Section 5.2.7)

Source Register Number Arrays (SRNs) - Contain the indices of the physical registers assigned to each source operand of each instruction. These arrays (there are two) are kept close to the dependence/bid/grant logic, as the launch of the input physical register specifiers may be a critical path. (Section 5.2.8)

5.1 Scheduling Decisions - General Concepts

Our goal in the Qbox is to choose the "best" 8 instructions to execute for each tic of the clock. The Qbox chooses these instructions from a "window" of 128 candidates. Each of the eight scheduling pipelines can handle a subset of the 128 candidate instructions. Alas, the subset can contain (in some cases) up to half of the instructions in the window. So, the Qbox includes "pickers" that choose the best instruction out of a set of 64 candidates. Scheduling is a four-step process:

1. Identify all data ready instructions.
2. For each pipe, select the "oldest" data-ready instruction enabled for execution in that pipe.
3. Assert the result-ready signal that corresponds to each selected instruction, so that all instructions that are stored in the queue can see that the chosen instructions have been issued.
4. For each instruction in the queue, test the result-ready signal for each operand. (The IMP in the Pbox has renamed each source virtual register into the INum of its last writer. The QET renames these dependencies from INum space into queue entry space.)
(More recent studies show that, when we consider all the other nitty gritty details, the point of diminishing returns is a little larger than 128.) The Qbox chooses 8 instructions on every tick from a pool of 128 instructions in the scheduling window. The problem is complicated by the fact that the functional units of the Ebox and Fbox are divided into multiple clusters. Results from Ebox cluster 0 incur a one tic delay before they are available to clusters 1, 2, 3~5, 6, and 7. (The results are available immediately in clusters 0 ~d)t.) The scheduling unit must take this into account without imposing "spurious" delays. If instruction Y depends on X, and both can execute in pipe 0, then X and Y should execute back-to-back in the absence of contention that might otherwise delay Y. 5.2.1.2 Design Architecture This section has an implied familiarity with the Overview and Scheduling Decisions sections. The 21464's solution to the scheduling problem did not spring forth all at once. It evolved over a few years of tinkering, experimenting, and brainstorming. Our solution features at is core a "persistent decoded space dependence array". The dependence array is the widget that keeps trace of RAW dependencies for the instructions in the queue. No matter how you cut it, this is a CAM. Each non-ready read operand for each instruction must CAM against the "result ready" signals for all of the issued instructions that are still in the instruction queue. When both operands for an instruction have seen a CAM match, the instruction will send its "bid" for a picker grant slot to the Compaq Confidential 5 January 2001 -~ Subject To Change Instruction Issue and Retire Unit - the Qbox 5-3 Component Details "picker" units. In the 21464's decoded-space CAM one result-ready wire is associated with each instruction in the queue. When an instruction is granted (issues), it asserts its result-ready signal. In the 21464's persistent array, this signal stays asserted until the relevant instruction has left the queue. Such decoded space arrangements normally consume lots of wire tracks. Our scheme makes use of some clever encoding tricks to reduce the width of the dependence array to a manageable size. A description of the tricks is beyond the scope here, but they are well documented in the "q_dax_arx" RTL code and in the detailed dependence array block diagrams. (See the Qbox implementation leader.) Our initial suspicion (which has been borne out in all of our subsequent circuit feasibility studies) was that the CAM-bid-grant (or more concisely, "bid-grant") loop was going to be very tight. This meant that we had to keep the width of the dependence array as small as possible. The first insight was to split the logic that looks at all of the "bids" from the data ready instructions and generates the eight "grants" to the issued instructions into eight pieces. This reduces the problem from a "pick 8 of 128" to 8 problems of "pick 1 of 128". This looked like a good idea. But a pick 1 of 128 picker is much slower than a pick 1 of 64 relative to our cycle time. (A 10% hit in cycle time is not offset by the difference in performance between 8 pick 1of128 pickers and our scheme.) So, we divided the queue into two halves. The west half of the queue contains all the in-flight instructions with even INurns, the east half of the queue contains the odd instructions. Each half of the queue picks instructions for four pipes. Each pipe has its own picker. 
So there are eight pickers, each picking the oldest instruction out of 64 in its half of the queue. Each picker picks instructions for just one functional unit. Because of the one-to-one association between pickers and functional units, each picker considers bids from just those instructions that may execute in the corresponding pipe, but each picker sees CAM results (data ready signals) for all 64 instructions in its half of the queue. However, the picker's outputs -- the result-ready wires -- are routed to the dependence arrays in both halves of the queue.
Figure 5-1 shows a simplified view of one half of the instruction queue. The top picker and dependence arrays are in the "west" or even half of the queue, the bottom picker and dependence arrays are in the "east" or odd half. The CAM entry for an instruction in the dependence array is loaded when the instruction is allocated into the queue. When the CAM entry detects that both of the instruction's operands have matched against the result ready wires, the entry sends a bid to the attached picker.
Figure 5-1 Simplified View of One-Half of the Instruction Queue
[Figure: bid enable and picker, dependence array, CAM cells, and wire-OR of the result-ready lines.]
For now, let's assume that the 21464 only has two pipelines. All the odd instructions will be sent to the bottom or "odd" half of the IQ, and all the even instructions will be sent to the top half. (This picture is rotated relative to the actual chip layout -- we often refer to the even half as the "west" half and the odd half as the "east" half.) Assume that we have the following code segment:
INum  EntryNum  Op Code  Operands
80    20        SUB      R3,#5 -> R5
81    21        ADD      R5,R5 -> R9
82    22        ST       R5,(R9)
Note that we re-map from the INum space to Entry number space to make the comparison logic smaller. The SUB instruction is in the even half, the ADD in the odd half. The ADD instruction must wait until it sees that the instruction at entry number 20 has been issued.
So, let's assume that the SUB instruction is data ready. It sends its data ready signals to the bid-enable/picker logic. SUB instructions always bid as soon as they are ready, so the picker scans all outstanding bids and picks the bid from the "oldest" instruction. Eventually the SUB instruction wins the bid. The picker asserts the "grant" signal for entry 20, which asserts the "even result ready" signal for entry 20. This result ready indication stays asserted until the SUB instruction is released from the queue. The indication is sent to the dependence array blocks labeled EE and OE in the diagram.
In the odd half of the queue, the ADD instruction has been waiting for entry number 20 to signal that it has been granted. Entry 21 in Array OE was loaded (when the ADD instruction entered the queue) with a 128-bit mask for each of the instruction's two input operands. One and only one bit in each mask is set, corresponding to the entry number of the "parent" instruction for that operand. When the result of ANDing the mask with all the "result ready" signals is non-zero, the operand is data ready. Dependence array OE is responsible for the even bits in the mask for each odd-numbered instruction. (OO holds the odd bits for odd instructions, and so on.)
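As a rough sketch of that mask-and-result-ready check (plain C; the structure and helper names are illustrative, and the real array is a CAM split across the EE/EO/OE/OO blocks, not software):

    #include <stdint.h>
    #include <stdbool.h>

    /* One 128-bit vector, kept as two 64-bit words (even/odd halves in hardware). */
    typedef struct { uint64_t w[2]; } Vec128;

    /* Per-entry dependence state: one parent mask per source operand. */
    typedef struct { Vec128 srca_mask, srcb_mask; } DepEntry;

    /* An operand is data ready when its parent's result-ready bit is set. */
    static bool operand_ready(const Vec128 *mask, const Vec128 *result_ready)
    {
        return ((mask->w[0] & result_ready->w[0]) |
                (mask->w[1] & result_ready->w[1])) != 0;
    }

    /* Entry 21 (the ADD) becomes data ready, and may bid, once bit 20 of the
     * result-ready vector is asserted by the grant of entry 20 (the SUB).    */
    static bool entry_data_ready(const DepEntry *e, const Vec128 *result_ready)
    {
        return operand_ready(&e->srca_mask, result_ready) &&
               operand_ready(&e->srcb_mask, result_ready);
    }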
One cycle after the SUB instruction was granted (or issued), the ADD instruction will become data ready, bid, and be granted if it is the oldest bidding instruction from the odd half of the queue.
Finally, the Store instruction has been waiting for both the ADD and the SUB to issue. Its A operand became data ready at the start of cycle 2 when the SUB was issued. The EE dependence array noted that the SUB had issued, and sent the operand A data ready signal into the picker. The B operand became data ready in cycle 3 when the ADD was issued. This was noted by the EO dependence array. The STore then issues in cycle 3 when it is picked by the even picker.
The 21464, of course, has eight pipelines. For the moment, assume that each instruction is assigned to just one pipeline (and, thus, one picker) by the Pbox instruction decode logic. The figure below shows all eight pickers and the sixteen dependence arrays that make up the core of the 21464's instruction queue.
To illustrate, consider our SUB instruction 20 that has been granted by a picker and has a consumer -- ADD instruction 21. When 20 issues, every instruction in the queue tests to see if its operands have become data ready. As a result of this test, the entry containing the ADD will notice that both its operands will be data ready as a result of the SUB. So, returning to our example, instruction 20 might be picked by the picker for pipe 7, while instruction 21 can only execute in pipe 1. How does the grant information from picker 7 get to picker 1? Via a global result ready signal, of course. When the SUB instruction issues, it will assert its "global result ready" signal one cycle after it asserts its "local result ready" signal.
Figure 5-2 Simplified View of Full Instruction Queue
[Figure: the eight pickers and sixteen dependence arrays that make up the IQ core.]
Note that because the SUB and ADD are executing in different pipelines we incur a one-cycle penalty (similar to the EV-6 "cross cluster" penalty). In this scheme, the ADD won't issue until cycle 3 even though the result from the SUB may be available early enough to allow the ADD to issue in cycle 2. This penalty was seen as a big deal. So we explored some other ideas.
One way to mitigate this problem is to group pairs of pickers together. This way, each picker in the pair sees the CAM results from each entry's match against its own granted INum and the INum granted by the other picker in the pair. This works pretty nicely. Now instead of having eight opportunities to incur the cross-picker penalty, we only have 4 opportunities. If the SUB is "slotted" to picker 7 and the ADD is slotted for picker 3, we would encounter no "cross-cluster" penalty.
A second way to mitigate the problem got the name "follow me" picking. To illustrate: imagine that Y depends on X, that X must execute in pipeline 7, and that Y could execute in either pipeline 7 or pipeline 1. We need to be careful about where Y executes, however. Imagine it became data ready in pickers 1 and 7 at the same time. Then it might issue from both at the same time. We'd probably get the correct answer, but at the very least, we'd waste an issue cycle in one of the pipelines. So, our instruction slotter picks a pipeline in which we'd prefer to execute Y. Assume for now that the slotter picked pipeline 1. With this arrangement we'd still incur the cross-picker penalty. But notice that Y becomes data ready in picker 7 one full tick before it becomes data ready in picker 1. If we allow Y to issue from picker 7 ONLY in the first tick after its operand became data ready from an instruction issued by picker 7, then there is no chance that it would be issued by pickers 7 and 1 at the same time. (Since picker 1 won't know about Y being data ready until after picker 1 samples the global data ready signal.)
Up until now, our discussion has ignored the fact that load instructions have a three-cycle latency -- that is, their result data is not available until the start of the third cycle after the start of the load's Execute (E) cycle. This means that if a LD instruction is issued in cycle 0, its dependents can't be issued until cycle 3. For this reason, when a LOAD instruction is issued, it does not assert its "result ready" wire until three ticks have passed. It will assert the "global result ready" wire at the same time it asserts the "local result ready" wire. Why? Well, we know that the load data arrives at the inputs to all of the functional units in the Fbox and Ebox at the same time. Therefore, by asserting global data ready one tick earlier than it normally would be asserted, the other functional units can grab load data as soon as it is ready.
We should note that the Mbox supports three load ports. Because of our odd/even distribution of instructions into the pickers, we need to find some way of allocating the extra load port. On even ticks, load instructions in some even positions in the map block may issue to this "weak" load port. On odd ticks, loads in certain odd positions may issue to the "weak" port.
Entries in the queue are allocated in groups of 4 instructions called "queue chunks". The IQ is full if either the even half or the odd half of the queue has no available queue chunks. An instruction stays in the IQ until we are certain that no instructions in its chunk will need to be replayed as the result of a load-miss or other mishap.
5.2.2 Queue Entry Table (QET) and Reallocation Logic (RAL)
5.2.2.1 Design Considerations
The Pbox produces dependence information based on the INum of the instruction that produced each input operand for a mapped instruction.
That is, an instruction like ADDQ R3,R2,R1 will pass through the Ibox and have its input registers (R3 and R2) remapped into INum space. With our decoded-space dependence array, it would be both unnecessary and slow to represent the entire INum range in the instruction queue itself. After all, only 128 instructions can be in the queue at the same time; why do we need to reflect completion for the entire 256 (or 512) instruction range? So, we need to transform INum-based dependencies into a more compact form, as the speed of a decoded-space scheme is dependent on keeping the dependency checking logic as small as possible.
5.2.2.2 Design Architecture
5.2.2.2.1 Algorithm
The Queue Entry Table transforms INum dependencies into EntryNum dependencies. When an instruction passes through the QET, the INum of each operand's parent is used to index a table that indicates the position of the parent instruction in the instruction queue. If the parent instruction is no longer in the instruction queue, we know that the associated operand is now data-ready. The lookup is most easily described as

    for (i = 0; i < 8; i++) {
        inst[i].src_a_entry_num<6:0> = EntryTable[inst[i].src_a_inum];
        inst[i].src_b_entry_num<6:0> = EntryTable[inst[i].src_b_inum];
    }

As it turns out, the table does not need to be quite so large. We know the low three bits of the entry num for any parent INum: they are identical.

    Enum<2:0> = INum<2:0>

Further, we want the entry number in decoded form to make things convenient for the IQ core. Finally, we need to send all the ODD parent dependencies to the ODD dependency arrays, while the EVEN dependencies go to the EVEN arrays. (We split the QET and Dependency arrays into EVEN and ODD parent dependencies to speed up the bid/grant loop.) So, we're really just translating the INum BLOCK bits from INum space to entry number.

    for (i = 0; i < 8; i++) {
        if (inst[i].src_a_inum<0>) {
            inst[i].odd_src_a_entry_msk<15:0>  = OddEntryTable[inst[i].src_a_inum<7:3>];
            inst[i].even_src_a_entry_msk<15:0> = 0;
        } else {
            inst[i].odd_src_a_entry_msk<15:0>  = 0;
            inst[i].even_src_a_entry_msk<15:0> = EvenEntryTable[inst[i].src_a_inum<7:3>];
        }
        if (inst[i].src_b_inum<0>) {
            inst[i].odd_src_b_entry_msk<15:0>  = OddEntryTable[inst[i].src_b_inum<7:3>];
            inst[i].even_src_b_entry_msk<15:0> = 0;
        } else {
            inst[i].odd_src_b_entry_msk<15:0>  = 0;
            inst[i].even_src_b_entry_msk<15:0> = EvenEntryTable[inst[i].src_b_inum<7:3>];
        }
    }

For the most part, this is a rather simple operation: a RAM lookup. But there are two problems that must be addressed.
1. What if instruction 51 depends on instruction 48? They are both in the same map block, so when 51 arrives, neither it nor 48 has yet been entered into the queue.
2. What if instruction 51 depends on instruction 2, which is leaving the queue just as instruction 51 arrives at the QET?
We solve the first problem by updating the QET map (the table that maps INums to ENums) during cycle Q0, while we don't actually translate the instructions that will be mapped until cycle Q1.
The solution to the second problem is not so simple. We solve it by adding an extra bit of information to the parent information for each operand. This extra bit is called the "No Live Dependency" or NLD bit. Each "chunk" of INums in the map (4 entries comprise a chunk in the IQ) has a stored NLD bit.
This is read each time we translate an INum parent into an entry num parent. SRCx_NLD is set for the x (x is A or B) operand if its parent is no longer in the instruction queue. Note that each entry in the entry tables has an associated NLD bit that indicates that all INums in this chunk have already issued and left the queue. The NLD bit is set for INum<7:3,0> when the entry for that INum chunk in the IQ is re-allocated to a new chunk of instructions. (So we don't actually set the NLD bit when an instruction is "done", but rather when we need the space that it formerly occupied.) We clear the NLD bit for a chunk of INums when that chunk is loaded into the IQ.
Deriving the signals necessary to clear the NLD bit on re-allocation is entwined in the details of queue chunk allocation and deallocation. When a queue chunk is re-allocated (that is, when the "write_chunk" signal is asserted for the chunk), the RAL section in the IQ core sends the INum block number associated with the re-allocated chunk back to the QET. The QET then sets the NLD bit associated with that INum block.
5.2.2.3 Physical Organization
The QET is built from a block that is replicated eight times. Each block decodes parent INums for HALF of the instructions entering the queue. (The west blocks decode parent INums for even instructions, and the east blocks decode parent INums for odd instructions.) Each block corresponds to one issue pipeline in the instruction queue core. There are four instances of the QET schematic section. Each section contains two blocks (even and odd) and translates instructions bound for an associated Qx_DAE and Qx_DAO section as shown in the next figure. All four instances are identical with identical inputs and outputs. The replication is an aid to routing and is required to limit the transit time through the QET stage.
Figure 5-3 Simplified Diagram of QET and Pickers for Two Pipelines
[Figure: EVEN and ODD Entry Tables for each of the two pipelines (fQN_QET_AE0/AO0 and fQN_QET_AE1/AO1), each pair feeding its associated DAE/DAO dependence arrays and picker, with the grant signals shown.]
The RAL (reallocation logic) is contained in the IQ core itself.
5.2.3 Dependency Arrays (DAs)
5.2.3.1 Design Considerations
Once an instruction has entered the IQ it must wait until both of its operands have become data ready. We need to detect that an operand's parent has issued and, to support the followme scheduling technique, we need to know whether the parent issued in a local pipeline or in a different execution cluster. The detection scheme must also take into account the different instruction latencies for loads, integer operations, and floating point operations.
5.2.3.2 Design Architecture
Our strategy is to represent the dependence of an operand on a parent instruction as a 128-bit vector. Bit X in the vector is set if and only if the operand is produced by the instruction at entry number X in the instruction queue. When an instruction is issued we assert a "result_ready" wire corresponding to the instruction's entry position in the IQ. There are actually two result_ready signals for each instruction in each dependence array.
One result_ready wire (local_result_ready) indicates that entry X has issued and dependents may now issue in the same Ebox/Fbox cluster that X issued to. The second result_ready wire (global_result_ready) indicates that dependents on X may issue in any cluster (i.e. the cross cluster penalty has been paid). The algorithm can be depicted as

    for (e = 0; e < 128; e++) {
        entry[e].srca_lcl_rdy = ((local_result_ready<127:0>  & entry[e].srca_entry_mask<127:0>) != 0);
        entry[e].srcb_lcl_rdy = ((local_result_ready<127:0>  & entry[e].srcb_entry_mask<127:0>) != 0);
        entry[e].srca_glb_rdy = ((global_result_ready<127:0> & entry[e].srca_entry_mask<127:0>) != 0);
        entry[e].srcb_glb_rdy = ((global_result_ready<127:0> & entry[e].srcb_entry_mask<127:0>) != 0);
    }

But again, we use the fact that INum<2:0> is the same as ENum<2:0> to save some wires when we load the dependence array, so that we need not store a 128-bit entry mask. Further, we divide the dependence array entry for a given position in the queue into two halves. The EVEN dependence array (shown on the floor plan as Qx_DAE) checks for dependencies on instructions in the EVEN half of the IQ core. The ODD array checks for dependencies on ODD instructions. This arrangement mirrors the division of the QET tables into even/odd halves. So, the actual algorithm looks like this:

    for (e = 0; e < 128; e++) {
        entry[e].srca_lcl_rdy_odd  = ((local_result_ready<127:1:2>  & entry[e].odd_srca_entry_mask<63:0>)  != 0);
        entry[e].srcb_lcl_rdy_odd  = ((local_result_ready<127:1:2>  & entry[e].odd_srcb_entry_mask<63:0>)  != 0);
        entry[e].srca_glb_rdy_odd  = ((global_result_ready<127:1:2> & entry[e].odd_srca_entry_mask<63:0>)  != 0);
        entry[e].srcb_glb_rdy_odd  = ((global_result_ready<127:1:2> & entry[e].odd_srcb_entry_mask<63:0>)  != 0);
        entry[e].srca_lcl_rdy_even = ((local_result_ready<126:0:2>  & entry[e].even_srca_entry_mask<63:0>) != 0);
        entry[e].srcb_lcl_rdy_even = ((local_result_ready<126:0:2>  & entry[e].even_srcb_entry_mask<63:0>) != 0);
        entry[e].srca_glb_rdy_even = ((global_result_ready<126:0:2> & entry[e].even_srca_entry_mask<63:0>) != 0);
        entry[e].srcb_glb_rdy_even = ((global_result_ready<126:0:2> & entry[e].even_srcb_entry_mask<63:0>) != 0);
    }

5.2.3.3 Physical Organization
The dependence arrays are built from a block that is replicated sixteen times. Each execution pipeline picker is connected to an EVEN half dependence array and an ODD half dependence array. (See the floor plan.)
5.2.4 Picker Arrays (PKs)
5.2.4.1 Design Considerations
Given that we've found a set of instructions that are ready to issue, we need to choose the "best" from the set to send to an execution unit. As described in the overview section, we divide the instruction queue into eight blocks. Each block keeps track of instruction dependencies for all instructions in the queue and chooses one data-ready instruction on each tic to send to a particular execution pipeline. The core of this decision process is called the "bid/grant loop". The bid/grant loop is implemented, for the most part, in the dependence arrays and the picker arrays. Choosing the "best" instruction is an optimization problem. It is most likely not computable in bounded time. For this reason, we adopt a simple heuristic: we choose the oldest data-ready instruction for a pipeline on each tic. Choosing the oldest is a simple algorithm, and we've got a simple implementation.
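A minimal C sketch of that oldest-wins pick, using the per-entry bidding tokens described in the next section (the 64-entry half-queue and 4-entry chunks are from the text; the helper names and the sequential scan are illustrative, since the hardware evaluates all entries in parallel):

    #include <stdint.h>

    #define ENTRIES_PER_HALF 64
    #define CHUNK_SIZE       4

    typedef struct {
        uint64_t token[ENTRIES_PER_HALF];   /* age token for each queue entry */
    } PickerHalf;

    /* A new chunk y has been written: its four entries are younger than
     * everything already resident, so every resident entry clears those bits. */
    static void picker_note_new_chunk(PickerHalf *p, unsigned y)
    {
        uint64_t chunk_bits = 0xFULL << (CHUNK_SIZE * y);
        for (unsigned e = 0; e < ENTRIES_PER_HALF; e++)
            p->token[e] &= ~chunk_bits;
    }

    /* Entry z is allocated: all bits start set except z itself and the
     * earlier slots of z's own chunk (as described in the next section).      */
    static void picker_alloc_entry(PickerHalf *p, unsigned z)
    {
        uint64_t earlier_in_chunk = ((1ULL << (z % CHUNK_SIZE)) - 1) << (z - z % CHUNK_SIZE);
        p->token[z] = ~((1ULL << z) | earlier_in_chunk);
    }

    /* One pick: grant the bidder whose token ANDed with the bid vector is
     * zero, i.e. no entry its token marks as older is also bidding.           */
    static int picker_pick(const PickerHalf *p, uint64_t bids)
    {
        for (unsigned z = 0; z < ENTRIES_PER_HALF; z++)
            if (((bids >> z) & 1) && (p->token[z] & bids) == 0)
                return (int)z;              /* grant entry z */
        return -1;                          /* nothing is bidding */
    }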
5.2.4.2 Design Architecture As an instruction Z enters the instruction queue it is given a bidding token. The token is sixty-four bits wide and has all its bits set to 1 except for the bits corresponding to the Z's entry number and all the instructions in Z's chunk that are before instruction Z. Each time a new chunk Y is written into the queue, every instruction will clear bits <4*Y+3:4*Y> in its bidding token. When an instruction Z bids, it performs a bitwise AND between Z's bidding token and all other bids in this picker. If the result is zero, then Z wins this round of bidding. This mechanism guarantees that the oldest bidding instruction (age being determined by time-of-entry into the instruction queue) will win any bid. 5 January 2001 - Subject To Change Compaq Confidential Instruction Issue and Retire Unit-the Qbox 5-13 Component Details 5.2.5 Bid Enable Logic (BID) 5.2.5.1 Design Considerations Even though both of an instruction's operands are data ready, the instruction may not be able to bid for a given pipeline. First, the resource required to serve the instruction may not be available. (For example, floating point square root and divide operations are not pipelined. Additionally, resources in the Mbox are limited, so some load and store instructions may be prevented from bidding until they fall under the "high water mark" -- see the "High Water Mark" widget.) Further, an instruction that bids in one of its "followme" pickers, may, unfortunately, bid in its preferred picker on the very next cycle. The solution to this particular problem is still under discussion. 5.2.5.2 Design Architecture The bid enable logic keeps track of issue preconditions that are separate from the datareadiness of an instruction's operands. In particular, the bid-enable logic monitors the results of high-water-mark comparisons, the occupancy of the non-pipelined floating point pipes, and, for resources that "ping-pong" between halves of the issue queue (e.g. the weak load pipe is shared between two pickers -- on "odd" cycles it can be used by a picker that chooses among odd instructions, on "even" cycles it can be used by an even instruction picker) such as loads, JSR, and some MTPR operations. 5.2.5.3 Physical Organization The bid enable logic is replicated in two copies. One serves the four pipelines in the south of the instruction queue, the other serves the north pickers. The logic is identical and replicated for electrical reasons. 5.2.6 FPCR Control Unit (FCR) The floating-point control register (FPCR) control unit controls the update of the FPCR in the Fbox. FPCR is implemented as a speculative-committed pair of registers. First, the FCR ensures that only the oldest in-flight MT_FPCR instruction in a TPU will update the speculative FPCR. Second, the FCR updates the committed FPCR from the speculative FPCR when the oldest in-flight MT_FPCR instruction in a TPU becomes retirable. This mechanism, along with the native mode FPCR trap and PAL mode fetch barrier, guarantees the correct architecture (in-order) behavior of writing and reading the FPCR register. 5.2. 7 Profile-Me Data Collection (PRM) The IQ Profile-me data collection unit collects performance data for the two in-flight profile-me instructions. The data collected in this section include instruction data ready time, instruction bid time, instruction issue time, instruction de-allocation time and the instruction queue chunk de-allocation time. 
To be more specific, the real data collection storage for the data collected in this section is in the Ibox. PFM is mainly responsible for generating the control signals for the Ibox to capture the cycle time information so that the profile-me software can calculate all those data mentioned above. 5-14 Compaq Confidentia I Instruction Issue and Retire Unit-the Qbox 5 January 2001 ·-Subject To Clumge Component Details 5.2.8 Source Register Number Arrays {SRNs) 5.2.8.1 Design Considerations The 21464 is blessed with a very large Register File which has a non-trivial access time. When an instruction issues, its source register IDs need to be sent to the register file to begin looking up the values as early as possible. The Register File also runs extremely hot; anything that can be done to save power by averting unnecessary lookups is a boon to the chip. 5.2.8.2 Design Architecture The Source Register Number Arrays (SRNs) store the renamed source physical register (PReg) IDs for each instruction. There are actually two SRN sections; one covers the PRegA and PRegB values for instructions in even map block positions, and the other covers the ones in odd map block positions. Associated with each PReg ID is a valid bit. Register numbers with their valid bit deasserted do not cause a lookup in the Register File (or the Ebox register cache), thus saving power. These bits are deasserted in cases where there is no valid instruction in that issue block position, the source does not exist for that instruction (e.g. LDQ has no valid PRegA), or the source value is zero (e.g. Ra == R31 ). Since the Ebox does its own instruction decoding, it knows the difference between these cases - i.e. when to simply ignore the value returned by the Register File and when to substitute in a vector of zeroes. The values of these bits are determined by the Pbox. Because of their critical timing, the SRNs sit in the IQ core itself rather than in the "late" IQ, and send out their data to the Register File and Ebox as soon after instruction grant as is physically possible, before any of the other information associated with an instruction leaves the box. 5.2.9 Destination Register Number Array {ORN) 5.2.9.1 Design Considerations The register file needs physical register indices for instruction destinations, not just sources (see the Source Register Number Arrays description), in order to store operation results. The timing is somewhat more relaxed as result writeback happens later in the pipeline than source reads. Similar power issues apply, but the urgency is reduced by the fact that each instruction has a maximum of two source operands but at most one destination. A more pressing problem is the fact that superfluous reads are merely wasteful, but spurious writes are destructive. 5.2.9.2 Design Architecture There is a single Destination Register Number Array (DRN) section which stores the renamed destination physical register (PRegD) IDs for all instructions in the IQ. It sits in the "late" IQ partition, to the side of the IQ core, as its timing is aligned with the bulk of the information going from the Qbox to the execution boxes. When an instruction issues, its PRegD value is forwarded to the Register File and Ebox to address the register file and register cache, respectively. 5 January 2001 -- Subject To Change Compaq Confidential Instruction Issue and Retire Unit-the Qbox 5-15 Component Details Each PRegD ID has an accompanying valid bit. 
If this bit is deasserted, it means either that the instruction has no destination register (as in the case of an ordinary store), or that the destination register is R31. In either case, the Register File and Ebox will not write the register file/cache for that instruction. They rely entirely on the Pbox to correctly assert these bits to avoid spurious or dropped writes. 5.2.10 Load/Store Number High-Water Marker {HWM) 5.2.10.1 Design Considerations As documented in the description of the Pbox Load/Store Serial Number Allocator, the aggressive, out-of-order execution of the Mbox, combined with a finite number of load and store queue entries, can lead to deadlock, unless there is some way to guarantee dependencies are resolved before the queues fill up. The Pbox addresses this by assigning a 8-bit serial number - an LSNum - to all memory operations. Each operation type has its own serial number class: loads get LNums, and stores get SNums. Each TPU also has its own LNum and SNum spaces. The Mbox keeps track of the fullness of its load and store queues on every cycle, and sends the Qbox a per-TPU, per type (load or store) "high-water mark" value. The Load/ Store Number High-Water Marker (HWM) must insure that only memory operations below the applicable high-water mark can issue. On a different but related subject, there are also a number of challenges associated with maintaining the consistency of virtual to physical memory mappings while a DTB miss is in the midst of being processed. This problem and our general policy for solving it are at length in How the 21464 Does DTB Fills. To summarize the important points, DTB misses lead to a PAL flow which services the miss. The flow contains a so-called "DTB writer block", a group of instructions which put the new translation into the "speculative" DTB entry (there is one per TPU). Each TPU sees its own speculative entry as well as the common, committed DTB state. When the DTB writer block retires, the contents of its TPU's speculative entry are written into the committed state. While this new translation is being written into the DTB, any memory operations which depend on this new translation must not be allowed to issue since they will lead to more misses, or, even worse, potentially erroneous behavior. This process is further complicated by the fact that there are actually two copies of the DTB - one accessed through each of the two Mbox strong load ports - which must be kept coherent. The DTB writer logic in the HWM must prevent memory instructions which are "DTB-dependent" from issuing, and handle the situation when bad path code generates spurious DTB writer blocks. 5.2.10.2 Design Architecture The HWM affects the issue behavior of memory instructions through the "load/store bid enable" (LDST_BIDEN) signals - one per instruction in the queue - which it passes along to the Bid Enable Logic. Queue entries with their LDST_BIDEN signals asserted are free to bid, provided the other conditions enforced by the Bid Enable Logic (e.g. data readiness) are satisfied. Entries with deasserted values may not bid, with the exception of their very first cycle in the queue. Computing the correct LDST_BIDEN values takes a cycle. 
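As a rough sketch of that per-entry LDST_BIDEN computation (plain C; four TPUs are assumed, the field names are illustrative, the DTB-dependence term is reduced to a single flag, and serial-number wraparound is ignored):

    #include <stdint.h>
    #include <stdbool.h>

    typedef struct {
        bool     is_load, is_store;    /* ISLOAD / ISSTORE from the Pbox          */
        bool     dtb_dependent;        /* allocated behind an active DTB writer   */
        bool     below_hwm_sticky;     /* set once below the mark, until realloc  */
        uint8_t  lsnum;                /* LNum or SNum for this operation         */
        unsigned tpu;                  /* TPU of the owning chunk                 */
    } IqEntry;

    /* Per-TPU high-water marks, updated by the Mbox every cycle. */
    typedef struct { uint8_t load_hwm[4], store_hwm[4]; } HwmState;

    /* Recompute one entry's load/store bid enable. */
    static bool ldst_biden(IqEntry *e, const HwmState *h)
    {
        if (!e->is_load && !e->is_store)
            return true;                        /* non-memory ops always enabled */

        if (e->dtb_dependent)
            return false;                       /* wait for NO_REISSUE from LPR  */

        uint8_t mark = e->is_load ? h->load_hwm[e->tpu] : h->store_hwm[e->tpu];
        if (e->lsnum < mark)
            e->below_hwm_sticky = true;         /* stays set until chunk realloc */

        return e->below_hwm_sticky;
    }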
Since most queue entries are not memory ops, and since it turns out that the majority of incoming memory ops are eligible to bid on entry into the queue, the Bid Enable Logic speculatively assumes that the LDST_BIDEN signal for 5-16 Compaq Confidential Instruction Issue and Retire Unit-the Qbox 5 J~1nuary 2001 ~·Subject To Change Component Details each entry is true on allocation. If that assumption should prove false, and the instruction gets granted in the meantime, there is enough time to shoot it down before it leaves the IQ. The LDST_BIDEN signals are always asserted for instructions which are neither loads nor stores; the HWM knows which these are by virtue of ISLOAD and IS STORE signals conveyed by the Pbox. These are maintained on a per-entry basis, while the TPU ID is stored on a per-chunk basis. Note that ISLOAD and ISSTORE are asserted for any instruction that executes in an Mbox load queue or store queue, respectively, and has a valid LSNum. Not all memory operations are loads or stores in the conventional sense (e.g. WH64). LDST_BIDEN is deasserted for loads and stores which are below their high-water mark, as determined by a comparison of the the LSNum of the individual memory operation in a given entry and the high-water mark for its chunk TPU. The HWM receives a high-water mark update from the Mbox for each TPU on every cycle. Once a memory operation falls below its high-water mark, its LDST_BIDEN signal remains asserted until that chunk is reallocated. Finally, LDST_BIDEN is deasserted for any memory operation which is DTB-dependent. For implementation purposes, we use a somewhat coarse definition of this concept: any memory operation which is allocated after a valid, active DTB writer block in the same TPU is considered DTB-dependent on that block. This includes, unfortunately, operations which do not actually use the translation being modified by the DTB writer block. But in the average case, the memory operations that are mapped after a DTB miss will be dependent on same page as the reference that missed. When it is known, in the due course of time, that every instruction in the entire DTB writer block - in both halves of the IQ - is not poisoned (i.e. not the victim of a missed load), the DTB-dependent instructions are free to bid and be granted. The Load/Poison Re-arm Widget (LPR) indicates via its NO_REISSUE signals that a DTB writer instruction has passed its poison point. (For a general understanding of the concept of "poison", consult An Overview of the Poison Mechanism in the 21464.) The dependent instructions will read from the speculative entry for their TPU until such time as the DTB writer block retires and is written to the committed state of the DTB. Throughout this entire process, memory operations in other TPUs are free to bid and issue, as are memory ops in the same TPU which are older than the DTB writer. Non-memory operations are completely unaffected. The DTB logic in the HWM tracks DTB dependencies throughout the IQ in each TPU, and deasserts the LDST_BIDEN signal of every DTB-dependent entry until it sees every applicable NO_REISSUE signal. To make this task easier, it receives "DTB writer" (DTBWRT) flags from the Pbox indicating which instructions are members of a DTB writer block. It also handles the important and dangerous scenario of spurious DTB writers entering the queue when a valid one is already active. There is only one signal per TPU indicating that there is a valid DTB writer in the queue. 
If a spurious DTB writer were to come in after a valid one and seize control of that signal, it is possible that older instructions could become dependent on this younger DTB writer, leading to deadlock. To avoid this situation, the HWM DTB logic ignores DTB writers entering the queue while there is already an active one in the same TPU, since these are known a prioi to be on a bad path. 5 January 2001 --·Subject To Change Compaq Confidential Instruction Issue and Retire Unit-the Qbox 5-17 Component Details 5.2.11 Load/Poison Re-Arm Widget (LPR) 5.2.11.1 Design Considerations Consider Figure 5-4. Figure 5-4 Tracking Data-Ready Instructions Remember the fundamental rule for hyper-complex super-scalar deep-pipelined hellbent-for-leather microprocessor design: if you aren't sure, guess. So, when the Qbox issues a load instruction we always guess that the load will hit in the first level cache. This is a pretty good guess, as it is correct far more often than it is wrong. But if we're going to guess, we need some way of backing out of incorrect guesses. The Load/Poison Re-arm Widget (LPR) keeps track of all instructions that became data ready because we guessed that a load would hit. In Figure 5-4, instruction 1 is a load. The Qbox scheduled its first dependent, instruction 2, in tic 3 assuming that the load would return data at the beginning of instruction 2's execute phase. Instruction 2 has a latency of 1 cycle and caused instruction 4 to become data ready in cycle 3, though it didn't issue until cycle 5. When instruction 4 issued it caused instruction 5 to become data ready in cycle 5 and then issue (broadcast its INum) in cycle 6. When instruction 1 missed in the cache, all the instructions that depended on the load (or whose ancestors depended on the load) operated on "bad" data. Fortunately, because of register renaming and all kinds of neat stuff that happens in the Mbox, we don't need to worry a whole lot about the bad data that the load shadow instructions write. It will be overwritten later or it will be ignored completely. We do need to replay the instructions however. 5-18 Compaq Confidential Instruction Issue and Retire Unit - the Qbox 5 Jc1nuary 2001 -· Subject To Change Component Details In order to replay the instructions in the load miss shadow we need to identify them. The 21264 declared that all instructions that issued in the shadow of a load miss would be "replayed". Using this approach, instructions 2, 3, 4, and 5 would all be re-issued at the appropriate time. This works well for the 21264 becaue they don't issue a whole lot of instructions in the shadow of a load. But notice that instruction 3 didn't depend on the missed load at all; the 21264 would replay this instruction unnecessarily. But for a short load shadow the cost is relatively low. The 21464, on the other hand, could issue lots of instructions in the shadow of a load, many of which aren't related to the load. (Some might not even be in the same TPU). Replaying all of them would simply be too expensive. Instead, we replay only those instructions that are actual dependents of a missed load. In order to be replayed, instructions either have to stay in the queue up to the point they are re-armed or, if they are allowed to deallocate on issue, be injected back into the queue. The latter alternative is essentially tantamount to a trap, which is too expensive an event for something that happens as relatively frequently as a load miss. 
But not only is queue space is limited, but it is allocated and deallocated on a chunk granularity, and our performance is very sensitive to how long a chunk stays in the IQ. Therefore, we need to remove instructions from the IQ as soon as we know that they may no longer be victimized by a missed load. The Load/Poison Re-arm Widget (LPR) tells the IQ core which entries contain instructions that are victims of missed loads and need to be rearmed, and which entries are free to be reallocated. 5.2.11.2 Design Architecture The problem of how to identify which instructions are actual victims of a missed load is solved by ''poison". To learn more about poison across the chip, consult An Overview of the Poison Mechanism in the 21464. The basic idea is that the property of a load missing in the cache is propagated from the load to all instructions that issue and consume its data, to all of their issuing descendents, and so on, and so forth - just like the data itself. Instructions that are poisoned need to be re-armed and replayed. This is achieved by resetting their data readiness to its original state and then reissuing the culprit load via a bubble. When the load's result becomes ready once again, its dependents will become data ready, bid, and eventually issue, and so on down the line. Note that this cycle may repeat several times - for instance, if the load misses in both the first-level and second-level caches. Each failed attempt returns a poisoned value, which leads to a chain of poisonings, re-arming, a new bubble, and so on, until the load finally comes back poison-free. The Load/Poison Re-arm Widget receives the poison information from the Ebox and passes it to the Dependency Arrays in the form of REARM_A and REARM_B signals. These indicate if an instruction's A or B operand, respectively, has been poisoned. The LPR additionally sends out a "reset result ready" (RESET_RES_RDY) signal to each poisoned instruction, indicating that it should clear the "result ready" (RES _RD Y) state signaling to the rest of the IQ that it has produced a valid result. The RES_RDY bits will be set again as a consequence of instruction replay. The Mbox also communicates with the LPR, indicating when a load has missed via a WILL_RETRY signal. The LPR asserts the RESET_RES_RDY signal for load which misses, poisoned or not. Missed loads will set their RES_RDY state again when they are reissued via a bubble, triggering a chain of reissues. Compaq Confidential 5 January 2001 -· Subject To Change Instruction Issue and Retire Unit - the Qbox S-19 Component Details The LPR is additionally responsible for telling the Queue Chunk Allocator/Deallocator (ALC) when individual instructions are eligible for deallocation. This is a function of both poison and instruction type. Single-cycle operations, for example, can deallocate immediately after their "poison point" - i.e. the point in the pipeline where the Ebox returns poison information. Loads, by contrast, must wait until the Mbox has indicated whether they have hit or missed in the cache. Long-latency operations must wait until either their poison point or the time they bubble, whichever comes later. Multicycle operations must linger in the queue an extra L-1 cycles (where Lis instruction latency) after producing a result - enough time for any poisoned descendents to be replayed and see the new RES_RDY signal. The ALC asserts an "okay to deallocate" (OK_DEALC) signal to the ALC for each instruction that may leave the queue. 
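A rough sketch of those eligibility rules as the LPR might present them to the ALC (illustrative C; the operation classes, flag names, and cycle bookkeeping are not the real signal encodings):

    #include <stdbool.h>

    typedef enum { OP_SINGLE_CYCLE, OP_LOAD, OP_MULTICYCLE, OP_LONG_LATENCY } OpClass;

    typedef struct {
        OpClass  cls;
        unsigned latency;             /* L, in cycles                         */
        bool     past_poison_point;   /* Ebox has returned poison status      */
        bool     poisoned;            /* victim of a missed load: must re-arm */
        bool     mbox_hit_known;      /* loads: hit/miss indication received  */
        bool     mbox_hit;
        bool     bubbled;             /* long-latency ops: completion bubble  */
        unsigned cycles_since_result;
    } IssuedOp;

    /* LPR -> ALC: may this instruction's entry be handed back for reallocation? */
    static bool ok_to_deallocate(const IssuedOp *op)
    {
        if (op->poisoned)
            return false;                               /* must stay and replay  */

        switch (op->cls) {
        case OP_SINGLE_CYCLE:
            return op->past_poison_point;
        case OP_LOAD:
            return op->mbox_hit_known && op->mbox_hit;  /* misses stay to re-fire */
        case OP_LONG_LATENCY:
            return op->past_poison_point && op->bubbled;
        case OP_MULTICYCLE:
            /* linger an extra L-1 cycles after the result so replayed
             * dependents can still observe the fresh RES_RDY indication      */
            return op->past_poison_point &&
                   op->cycles_since_result >= op->latency - 1;
        }
        return false;
    }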
Finally, the DTB logic in the Load/Store Number High-Water Marker (HWM) needs to know when a DTB writer block had passed its poison point in order to free any DTBdependent chunks (see the HWM description for more information). The LPR sends NO_REISSUE signals to the HWM to indicate instructions that have passed their poison points and thus will not reissue; the HWM checks this information against its record of which instructions are elements of a DTB writer block to determine when to release dependent memory operations. 5.2.12 Post-Issue Logic (PIL) 5.2.12.1 Design Considerations So what happens after a load has missed and the Mbox eventually gets the correct data from the second level cache or the system? How does the Qbox re-fire all of the load's dependent instructions? (They were re-armed by the Load/Poison Re-Arm Widget.) For that matter, how do we handle long latency operations like square root or divide? These operations produce results many cycles after other concurrently issuing instructions are ready to write theirs back to the register file. We need to let fast operations deliver their results quickly - it would be absurd to stage out the results of 1-cycle integer adds to line up with those of 14-cycle floating-point divides. But when long-latency ops eventually do complete, they need to have exclusive access to a result bus and register file port. (Building separate buses and ports just for long-latency ops is far too costly.) This means that the issue slot for the instruction that would otherwise consume those resources at that point in time must be empty. That is, there needs to be a "bubble" in the pipeline into which the completing long-latency operation can slip its result. 5.2.12.2 Design Architecture In both cases the IQ provides a "bubble request widget", which sits in the Post-Issue Logic (PIL) for operations that need to signal late completion. In the case of load misses, bubble requests are fielded by the strong load pick rs in the same bank of the queue where the instruction originated. The bubble request will cause the load to set its "result ready" (RES_RDY) bit and allow all its dependents to become data ready. Note that load miss bubble requests can still result in a cache miss, so the Load/Poison Rearm Widget (LPR) must be prepared to kill dependents of a load bubble request. The Mbox first gives early warning of an impending bubble by asserting the WILL_RETRY signal to the issuing picker, causing the LPR to signal to the Dependency Arrays that the corresponding RES_RDY state must be cleared. Later, the Mbox sends along the actual bubble request in the form of the IQ entry number of the load Compaq Confidential 5-20 Instruction Issue and Retire Unit - the Qbox 5 Jc1nuary 2001 - Subject To Change Component Details which needs to bubble. This goes to the strong load picker in that bank of the IQ. The Mbox always stashes away the queue entry number of loads on issue in the event of a bubble - this saves the Qbox the time and logic needed to do an INum-to-queue-entry lookup. Note that two loads may be issued concurrently in the same bank - one in the strong load pipe and one in the weak load pipe - and may both miss and need to bubble. But bubbled loads always reissue on the strong load pickers only. So while the WILL_RETRY values are always asserted at a fixed time relative to issue, the bubble requests themselves may be queued up and are therefore accompanied by a "retry valid" bit. 
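A small sketch of how such queued, validated bubble requests might be modeled (hypothetical C structure and depth; the real interface is a set of wires from the Mbox to the strong load picker of the originating bank):

    #include <stdint.h>
    #include <stdbool.h>

    #define BUBBLE_Q_DEPTH 4    /* illustrative depth, not the real value */

    typedef struct {
        uint8_t entry_num;      /* IQ entry of the load that must re-fire */
        bool    valid;          /* the "retry valid" bit                  */
    } BubbleReq;

    typedef struct {
        BubbleReq q[BUBBLE_Q_DEPTH];
        unsigned  head, tail, count;
    } StrongLoadBubbleQueue;

    /* Mbox -> PIL: enqueue a bubble request for a missed load in this bank. */
    static bool bubble_push(StrongLoadBubbleQueue *bq, uint8_t entry_num)
    {
        if (bq->count == BUBBLE_Q_DEPTH)
            return false;
        bq->q[bq->tail] = (BubbleReq){ .entry_num = entry_num, .valid = true };
        bq->tail = (bq->tail + 1) % BUBBLE_Q_DEPTH;
        bq->count++;
        return true;
    }

    /* Strong load picker: if a bubble request is pending it wins the slot,
     * the load's RES_RDY bit is set again, and its dependents wake up.      */
    static bool bubble_pop(StrongLoadBubbleQueue *bq, uint8_t *entry_num)
    {
        if (bq->count == 0 || !bq->q[bq->head].valid)
            return false;
        *entry_num = bq->q[bq->head].entry_num;
        bq->q[bq->head].valid = false;
        bq->head = (bq->head + 1) % BUBBLE_Q_DEPTH;
        bq->count--;
        return true;
    }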
In the case of long latency Fbox ops, bubble requests are routed to the picker that originated the operation. When the instruction first issues, all "operand ready" matches against its IN urn are discarded. (The dependent instructions knew that they were dependent on a long latency op - they ignored the non-bubble requested broadcast of their parent's INum.) When the bubble request is honored, the INum is rebroadcast and all dependents then become data ready. Note that the Fbox does not generate its own bubbles; counters in the PIL keep track of when floating-point divide and square root operations are due to complete and send the requests to the appropriate picker. The PIL is also responsible for telling the Fbox when a floating-point instruction has been killed off as the result of a disruption, since the Fbox does not monitor the Retire/Kill Bus. Should something untoward happen to a long-latency Fbox op (such as a division by zero), the PIL will find out only after the retire-time exception trickles through the disruption logic and appears as a kill on the RKBus. 5.2.13 Oldest CBR Selector (OCS) 5.2.13.1 Design Considerations Mispredicted conditional branches need to be resolved as quickly as possible to avoid a major performance penalty. When a branch misprediction does occur, the Ibox needs to know the INum of the culprit in the shortest possible time so that it can re-fetch it and start the replay process. If more than one branch mispredicts, we want to handle the oldest one first, since any younger instructions in the thread are on a bad path and are wasting processor resources. This is especially important for multiple mispredicting branches in the same thread - younger branches are rendered moot by the older branches' misprediction, so handling the oldest branch first minimizes the chances of replaying bad path code. The problem is that quickly sorting out which branch instruction caused the mispredict - and if there is more than one mispredicting branch, which one is oldest - requires an exacting amount of logic. Certain cases, such as multiple mispredicting branches in one cycle, are also very uncommon. This puts the necessity of building the vast infrastructure required to find the correct answer immediately into question. 5.2.13.2 Design Architecture What if we were to always assume initially - in keeping with the speculative philosphy of 21464- that the oldest conditonal branch to issue in a given cycle is also the oldest mispredicting branch? It turns out that most of the time, according to benchmark simulations, this is a very astute guess. So this is what the OCS does: it identifies the Compaq Confidential 5 January 2001 ··· Subject To Change Instruction Issue and Retire Unit - the Qbox 5-21 Component Details oldest conditional branch instruction to issue in a given tic and forwards its INum, TPU, Pipe ID, and Predicted Taken/Not Taken Bit to the Ibox. The Ibox uses this data to query the altenate PC table in preparation for a replay. If it turns out that no branches mispredicted, the Ibox ignores this information. If we discover upon Ebox branch resolution that some other branch (not the oldest) mispredicted, we undo the mispredict trap, and add the INum(s) of the actual culprit(s) to the trap pool. Between the Pipe ID and the misprediction signals from the Ebox, the Ibox infers that the INum we gave them was for the wrong branch without any need for a special signal from the Qbox (unlike what was orginally thought). 
Finally, if the branch we selected did mispredict but was the victim of a load miss, the instruction needs to be replayed after the data comes back from main memory. In these two latter cases, the Ibox fakes a line mispredict, falling back on pre-existing error-handling mechanisms, which keeps it from causing any further mischief. For more details on the branch resolution process, take a look at the document Branches, and How To Resolve Them. The OCS maintains internal state which records which of the instructions in the IQ are CB Rs, and the Predicted Taken/Not Taken bits for each one, information it obtains from the Pbox instruction decoder. INums are obtained from the Exception Kill Logic since the OCS does not store INums. Note that the implementation of this widget is in many respects similar to that of a picker. However, since it does not acutally influence the issuing of instructions from the IQ- and also operates one cycle behind the pickers - calling it a picker (as was originally proposed) would be somewhat misleading. Note also that this mechanism only applies to integer conditional branch instructions. Floating-point CBRs go into the standard trap pool and are processed like generic disruptions. 5.2.14 Queue Chunk Allocator/Deallocator (ALC) 5.2.14.1 Design Considerations The instruction queue is divided up into two banks, west and east. Each bank is further divided into sixteen chunks of four entries each. When a map block arrives at the IQ it is written into one chunk in each half of the IQ. We need to allocate these chunks efficiently, since if we run out of entries in the IQ, we need to stall at the Pbox Post-Map Skid Buffer (PSB) and signal back to the Ibox to stop sending map blocks. Needless to say, we don't want to do this very often. Along with allocation comes the complementary problem of deallocation. Our queue isn't big enough to hold all in-flight instructions, so we need to re-use chunks for new instructions, evicting old ones from the IQ as soon as they are able to leave. It turns out that our performance is very sensitive to how long chunks stay in the queue, so this is a significant issue. 5.2.14.2 Design Architecture The Qbox does not cause IQ full traps. Ever. Not ever. We don't do that sort of thing; in fact, we don't even support it. Instead, we do the next worst thing, we stall. Compaq Confidential 5-22 Instruction Issue and Retire Unit - the Qbox 5 Jc1nua1ry 2001 -- Subject To Clumge Component Details A queue chunk is free only if all of its instruction are eligible to be re-allocated. This is an important point; because of our chunk granularity, 16 "problem" instructions could conceivably fill one bank and thus the entire IQ. The exact conditions under which a given instruction may be deallocated are a function of instruction type and poison status, and are described in more detail in the Load/Poison Re-arm Widget (LPR) description. The LPR tells the ALC when a given instruction may be deallocated. In the simplest, most common case, valid single-cycle instructions may be deallocated as soon as they are known to not be the victims of a missed load. The queue chunk allocator keeps track of which chunks are free. It then selects one chunk from each of the two queue banks to be written on the next cycle. If one or both banks are completely full, then the allocator signals an IQ full condition to the PSB and to the Map Thread Chooser in the INum Allocator. 
The Thread Chooser tells the Ibox that no threads can accept instructions until the IQ full condition is resolved. This is timed in such a way that the first freshly fetched and mapped blocks will arrive in the IQ immediately after the last buffered block is passed along from the PSB.
5.2.15 In-Flight Table (IFx)
5.2.15.1 Design Considerations
The issue queue does not keep information about an instruction that has issued. Several cycles after an instruction issues, its entry in the issue queue is marked as free and can be reused. This reduces the occupancy of the queue and reduces IQ full stalls. However, certain information - such as the issued instruction's INum and destination register - is needed for later stages in the instruction's life (completion and writeback). The In-Flight Table is responsible for maintaining this state. Most importantly, the In-Flight Table checks - for every issued but not completed instruction - whether the instruction has raised an exception. If an instruction has raised an exception, its INum is marked accordingly in the In-Flight Table, and this information follows it to the Completion Unit. This is done to prevent the completion unit from thinking that the excepting instruction can retire.
5.2.15.2 Design Architecture
Basically, the In-Flight Table is a bunch of registers which mirror the 21464 pipeline from issue to writeback. Logically, the In-Flight Table sits between the IQ and the Completion Unit. Every cycle, the (up to) 8 issued INums enter 8 different staging pipelines - one per picker - in the In-Flight Table. Each cycle, the 8 staging pipelines "shift down." The staging pipe for pipeline 0 is shown below.
5.2.16 Completion Unit (CMP)
5.2.16.1 Design Considerations
The 21464, like other out-of-order processors, issues instructions out-of-order but commits (retires) them to architectural state in program order. Here, architectural state refers to software-visible state: registers, memory, and IPRs. The completion unit is responsible for this reordering. Logically, the completion unit determines which INum is the "next to retire" for each TPU. This INum is driven to the Pbox RKU, where it is eventually driven out on the RK bus. It is important to retire instructions as early as possible, because many critical resources (INums, physical registers, load/store queue entries) are freed at retire time.
5.2.16.2 Design Architecture
The CMP maintains a vector ranging from INum 0 to INum 255. The state associated with each entry (INum) is as follows:
State       Description
C bit       INum is past the point at which it itself may raise an exception
IO bit      INum has been tagged by the Mbox as an IO load instruction
RTE bit     INum has a retire time exception associated with it
K bit       INum has been killed
Etype<5:0>  Retire time exception code
The following sections describe completion, killing, retirement, and Mbox processing in turn.
5.2.16.2.1 Completion
Instructions are issued from the instruction queue out of program order. They enter the In-Flight Table, and are checked against all possible exceptions. If the instruction makes it through the In-Flight Table without getting an exception, it sets the C bit for its INum.
If the instruction gets hit with a retire-time exception while in the In-Flight Table, it sets the RTE bit in the completion unit and stores the exception type in the corresponding Etype field. If the instruction gets hit with an execution-time exception, such as a branch mispredict, the INum is removed from the In-Flight Table, and the CMP state is unaltered.
5.2.16.2.2 Kills
The 21464 indicates kills by posting a kill INum on the Retire/Kill bus and asserting the Kill signal. The completion unit receives this kill INum and simply marks all instructions younger than the kill INum (and in the same TPU) as complete, by setting the C bit. Also, for all killed instructions, we set the K bit to indicate that the instruction was completed due to a kill. When an instruction associated with an RTE is the "next-to-retire" instruction, the CMP first sends its INum and TPU ID to the Pbox Retire/Kill Unit (RKU) as a "next-to-retire INum". The CMP then passes the INum, TPU ID, and RTE type information to the RKU again as a kill. Please consult Driving the Retire/Kill Bus (RK Bus) for more detailed information.
5.2.16.2.3 Retirement
To retire an instruction Y in TPU X, instruction Y must not cause an exception, and all older instructions for TPU X must also not raise an exception. Said another way, Y must be complete and all older instructions must also be complete. The CMP works by finding the oldest uncompleted instruction A. INum A is then driven out of the CMP to the Pbox as the next-to-retire INum. This indicates that all INums older than A can retire. The CMP then waits for INum A to complete. Until it does, TPU X has nothing to retire, since we must retire in order. Finally, when A completes, we search for the oldest uncompleted instruction in A's INum block (call it block B). If all instructions in block B are complete, the CMP indicates that the entire block B can retire, and we now advance to the next INum block for TPU X and search for the first uncompleted instruction.
The 21464 retires from only 1 TPU per cycle. Therefore, we need an arbitration mechanism to decide which TPU to retire from in a given cycle. We use a retire chooser, which simply chooses round-robin among those TPUs that actually have something new ready to retire. The 21464 uses a shared bus for driving retires and kills. Therefore, it is possible that in a given cycle, both a retire and a kill request access to the shared RK bus. The policy for resolving this conflict is that kills always take precedence over retires. Therefore, the CMP supports a stall mechanism to maintain state in the presence of simultaneous kills and retires.
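A rough sketch of that retire-pointer scan for a single TPU (simplified C; the 256-INum vector comes from the text, while the 8-instruction block size, names, and wrap handling are illustrative):

    #include <stdbool.h>

    #define INUMS      256
    #define BLOCK_SIZE 8       /* one map block of INums (assumed size) */

    typedef struct {
        bool     c[INUMS];     /* C bit: INum can no longer raise an exception */
        unsigned retire_ptr;   /* oldest uncompleted INum for this TPU         */
    } TpuCompletion;

    /* Advance the retire pointer as far as the C bits allow and return the
     * next-to-retire INum to drive to the Pbox RKU; everything older than
     * the returned INum is retirable.                                        */
    static unsigned cmp_next_to_retire(TpuCompletion *t)
    {
        for (unsigned n = 0; n < INUMS && t->c[t->retire_ptr]; n++)
            t->retire_ptr = (t->retire_ptr + 1) % INUMS;
        return t->retire_ptr;
    }

    /* A whole block can retire only when every instruction in it is complete. */
    static bool cmp_block_retirable(const TpuCompletion *t, unsigned block)
    {
        for (unsigned i = 0; i < BLOCK_SIZE; i++)
            if (!t->c[block * BLOCK_SIZE + i])
                return false;
        return true;
    }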
To handle this, the Mbox has a "stall retire and zap" interface to the CMP. The stall retire interface freezes the CMP's retire pointer at the current instruction; it cannot advance. The zap interface in essence "uncompletes" an instruction. The C bit of the violating instruction is forced to 0, thereby ensuring that the retire pointer cannot advance past the trapping instruction.

5.2.17 Payload Array (PAY)

5.2.17.1 Design Considerations

From the Qbox core's point of view, the only significant features of an instruction are its dependencies, its latency, and anything else that might complicate scheduling, like being part of a DTB writer block. Certain other generic instruction attributes, namely INums and TPU IDs, are stored in the Exception Kill Logic, and the physical register numbers have structures devoted to them. But what about all of the other minor details that make up an instruction, such as its opcode? Where do they go?

5.2.17.2 Design Architecture

The Payload Array (PAY) contains all of the information about an instruction in which the Qbox has no direct interest, including opcode and function/displacement fields, and other flags and attributes derived by the Pbox (e.g. LSNums, the IS_JUMP bits which flag indirect jumps) or passed through from the Ibox (e.g. the MAP_PAL_MODE bit which flags PALcode blocks). All of this information associated with a given instruction pops out of the PAY upon issue and is forwarded to the executing box. The PAY sits physically in the so-called "late" portion of the Qbox, off to the side of the IQ core with its sensitive timing paths.

5.2.18 Exception Kill Logic (EKC)

5.2.18.1 Design Considerations

When an exception occurs, there may be instructions on that code path which have been allocated space in the instruction queue. These instructions must be removed from the queue, in the interests of correct program execution and the conservation of scarce queue space and other processor resources.

5.2.18.2 Design Architecture

The Exception Kill Logic eliminates from the queue any younger instructions with the same TPU as the excepting instruction. It also has the incidental function of storing the INums and TPU IDs of every instruction in the queue and passing them along to the Ebox and Mbox at issue time.

6 Integer Execution Unit - the Ebox

The 21464 microprocessor is organized into several major processing sections called boxes. The Ibox handles instruction fetching and program flow prediction, the Qbox schedules, often out of order, the instructions fetched by the Ibox, and the Ebox executes most of the non-floating-point Alpha instructions scheduled by the Qbox.

The Ebox contains multiple copies of its various processing elements so the Qbox can schedule as many instructions per cycle as possible. The upper limit is eight instructions issued simultaneously. Structurally, the Ebox processing elements are organized into twelve functional units, but not all units are alike. In an attempt to keep the functional units small, fast, and tightly coupled to each other, each unit executes a predefined subset of the instruction set.
For example, of the eight integer units, only two can execute store instructions, and only two units can handle any multimedia instruction. See the instruction breakdown table for a complete list.

6.1 Major Components

The major Ebox components are:

Table 6-1 Ebox Major Component Summary
Component             Description
Integer Units (8)     The integer functional units execute the traditional integer arithmetic and logical instructions as well as performing the address generation and data formatting of memory instructions.
Multimedia Units (4)  The multimedia units execute the newer integer instructions targeted at accelerating multimedia operations and also perform integer multiplication.
Register Caches (4)   The register caches store recently written register values, allowing dependent instructions to issue before the register file is updated.

Figure 6-1 Ebox Block Diagram

6.1.1 Datapath

The key to understanding the architecture of the Ebox is an understanding of how operands and results flow through the Ebox. Shown below are the major datapaths and a rough representation of their layout. The rest of this document will mostly discuss the details of the elements within the Ebox and how they interact with these datapaths or the rest of the 21464. Detailed descriptions of the high-speed circuits used to implement the adders, shifters or multipliers can be found in the implementation documents.

Figure 6-2 Ebox Datapath Block Diagram

The Ebox bypass busses supply the Ra and Rb input operands to each functional unit. Values are driven onto the bypass busses from the register file, the register caches or directly from the functional units based on controls from the operand steering unit.

The Ebox generates three types of results; the adders, shifters and logic units produce results in a single cycle, load instructions need three cycles to produce a result and multimedia instructions take five cycles to produce a result. To be available as operands to future instructions, all results are distributed throughout the Ebox and written into the register caches.

6.1.2 Timing

Cycle mnemonics are used throughout this document to identify the relative timing of signals.
Table 6-2 identifies the cycle relationships assumed by this document.

Table 6-2 Interbox Timing Relationships
Box        Cycle mnemonics
Qbox       Q1 Q2 Q3 Q4 Q5 Q6 Q7
Reg. File  R0 R1 R2 R3 Rw
Ebox       E0 E1 E2 E3 E4 E5
Fbox       F0 F1 F2 F3 F4 F5
Mbox       M0 M1 M2

Each cycle is further subdivided into two phases: the first half of a cycle is the 'A' phase, the second half is the 'B' phase. A timing specification E0A refers to the first phase of cycle E0. All timing references in this spec refer to the latch that launches the data; when significant transit time may be involved, that time or an expected arrival time will be stated separately.

6.2 Integer Clusters

An integer functional unit is a logical collection of processing elements that collectively execute a specific set of Alpha instructions. The 21464 has eight integer functional units organized as four clusters of two units each. The cluster grouping is significant because single-cycle results from previously executed instructions can be utilized immediately only by elements within the same cluster, but require a cycle of propagation delay before they are available for use as operands to instructions executing in other clusters.

The eight units are not identical but contain a predefined mix of processing elements. To ease implementation, the four clusters will be implemented as identical copies. Functions not needed in a cluster will be left unconnected. Table 6-3 identifies the major sections within an integer cluster and which functional units contain copies of those elements.

Table 6-3 Integer Cluster Sections
Section Name                     Mnemonic  In Units  Description
Adder                            Ep_ADx    0-7       A full 64-bit signed integer adder that produces a complete result each cycle.
Shifter                          Ep_SHx    0-3       A full 64-bit shifter that produces a complete result each cycle.
Logic Box                        Ep_LGx    0-7       Performs logical and arithmetic operations.
Register File Operand Interface  Ep_RFx    0-7       Interfaces the operands from the register file to the Ebox opbusses. Also bypasses literals onto the opbusses.
Virtual Address Generator        Ep_VAx    4-7       Computes the 16-bit displacement add and factors in the big/little endian control to form a correct virtual memory address.
Load Data Interface              Ep_LDx    4-7       Interfaces the data returned from the Mbox to the functional units and register caches.
Multimedia Operand Interface     Ep_MOx    4-7       Forwards the instruction operands from the corresponding integer functional unit to the multimedia clusters. Each multimedia cluster is associated with the lower integer functional unit in a cluster and derives its operands from that functional unit.
Register File Result Pipe        Ep_RPx    0-3       Handles staging of different result latencies, floating load format conversion and forwarding of results to the register file.
Cross Cluster Result Interface   Ep_XCx    0-7       Receives one-cycle results from the other functional units, bypasses the data onto the operand busses if needed immediately and latches the data for writing into the local register cache.
Global Control                   Ep_GCx    0-7       Decodes the instruction information sent by the Qbox and coordinates the various processing elements within a functional unit.
Store Data Interface             Ep_STI              Interfaces the Store Data buses to the Mbox.
This unit is not actually part of the integer clusters but resides in a separate partition to the right of the integer clusters.

Figure 6-3 shows how the sections are organized within a cluster. Elements that consume operands are generally positioned together in either the upper or lower part of the cluster. This also maps to physical position, with the exception of the register file and multimedia interfaces. The operand busses span the full length of the cluster to allow elements in one half of the cluster to bypass results directly to any other unit in the cluster.

Figure 6-3 Cluster Section Organization

6.2.1 Adder

The adder unit completes a full 64-bit signed add/subtract operation in a single cycle. The inputs to the adder are adjusted - swapped, complemented, sign-extended, or shifted as necessary - to allow the core adder to handle the various combinations of add and subtract operations. The instructions serviced by the adder are:

Table 6-4 Instructions Serviced by the Ebox Adder Unit
Type     Instructions
Add      ADDL, ADDL/V, ADDQ, ADDQ/V, S4ADDL, S8ADDL, S4ADDQ, S8ADDQ
Sub      SUBL, SUBL/V, SUBQ, SUBQ/V, S4SUBL, S8SUBL, S4SUBQ, S8SUBQ
Compare  CMPBGE, CMPULT, CMPEQ, CMPULE, CMPLT, CMPLE
Other    LDAH, LDA, RS, RC

Subtractions are handled by twos-complementing the Rb operand and setting the carry-in bit to the adder. The S4 and S8 variants simply require the Ra operand to be shifted left by two or three bits. For LDA and LDAH the register file interface has placed the displacement value onto the A operand bus. For RS and RC instructions, the register file interface placed the intr_flag passed by the Qbox in INST_INFO<0> onto the B operand bus. For all these instructions, the adder unit simply performs the equivalent of an ADDQ instruction.

The overflow trap signal is computed during E0A and passed to the EQ partition, where it is latched and driven to the Qbox from an E1A latch.

The Adder section also allows direct bypassing of its result onto any or all of the four source-operand busses in the cluster. This allows dependent instructions to execute the following cycle. The operand steering unit detects the local bypass cases and drives the enable mask to the adder. If the Adder is active, it bypasses the result to all enabled busses.

6.2.2 Shifter

The shifter unit handles the arithmetic and logical shift instructions as well as the byte insert, extract, mask and zap instructions. All operations complete in a single cycle. The instructions serviced by the shifter are:

Table 6-5 Instructions Serviced by the Ebox Shifter Unit
Type     Instructions
Shift    SRL, SLL, SRA
Mask     MSKBL, MSKWL, MSKLL, MSKQL, MSKWH, MSKLH, MSKQH
Extract  EXTBL, EXTWL, EXTLL, EXTQL, EXTWH, EXTLH, EXTQH
Insert   INSBL, INSWL, INSLL, INSQL, INSWH, INSLH, INSQH
Zap      ZAP, ZAPNOT

The shifter receives the opcode and other instruction information from an EYA latch in the GCx section. The shifter decodes the opcode/function and, if one of the above instructions is detected, controls and clocks are sent to the datapath to enable execution. When no match is detected, suppression of the clocks prevents any further action by the shifter.

When active, the shifter latches the operands at E0A and signals the opbus precharge logic. Results are computed in E0 and can be directly bypassed onto any or all of the four local operand busses in the cluster for use the next cycle. The results are also latched at E1A and driven onto a shared result bus to both the register cache and cross-cluster interfaces.

For big-endian threads, the shifter reverses the byte mask when computing MSKxxx, EXTxxx and INSxxx instructions.

The Shifter also forwards the operands to the multimedia cluster whenever an instruction handled by either the multimedia cluster or the store interface unit is issued. Multimedia instructions require both Ra and Rb; store instructions only use Ra. The Shifter must not latch Rb for store instructions because the virtual address unit will be using Rb to compute the target VA.

Due to size and wiring constraints, only four instances of the shifter are implemented in the 21464, one in the upper pipe of each integer cluster.

6.2.3 Logic Box

The logic box handles the logical and conditional instructions, producing all results, including the conditional branch mispredict flag, in a single phase. The instructions serviced by the logic box are:

Table 6-6 Instructions Serviced by the Ebox Logic Box Unit
Type     Instructions
Cmove    CMOVLBS, CMOVLBC, CMOVNE, CMOVLT, CMOVGE, CMOVLE, CMOVGT
Branch   BLBC, BEQ, BLT, BLE, BLBS, BNE, BGE, BGT
Logical  AND, BIC, BIS, ORNOT, XOR, EQV
Special  AMASK, IMPLVER, SEXTB, SEXTW

For conditional branch instructions, the result is compared to the result predicted by the Ibox. Mispredicts are execution-time traps which are reported directly to the Qbox and Pbox for corrective action. To minimize the penalty of a mispredicted branch, the Qbox has identified the oldest CBR issued this cycle and has prepared the Ibox for quick recovery. The logic box separately signals the Ibox if the CBR instruction mispredicted and was also the oldest executing this cycle.

6.2.4 Register File Operand Interface

The register file operand interface places operands supplied by the register file onto the Ebox operand busses. The register file supplies operands whenever the parent instruction's results are not in bypasses or the register caches (i.e. the parent issued more than eight cycles before the child). In the case of integer operate instructions that use a literal field as the Rb operand, the operand will be marked as invalid by the Qbox and the OSU will default to enabling the register file interface as the supplier. The register file interface detects integer operate instructions that use a literal and drives the literal value onto the bottom byte of the opbusses. The literal is zero-extended in the register file interface datapath.
For LDA and LDAH instructions, a 16-bit displacement is forced onto the Ra operand by the register file interface. To support these instructions, 16 bits are extracted from the instruction word and driven across the datapath. Additional multiplexing in the datapath places the 16-bit displacement on Ra<31:16> for LDAH or Ra<15:0> for LDA. The displacement is sign-extended in the register file interface datapath.

For RS and RC instructions, the flag passed by the Qbox is the instruction result. The flag bit is zero-extended onto the literal field and forced onto Rb. Since Ra must be invalid (i.e. forced to zero), almost any functional unit could drive the result. The current plan is to execute an RS or RC instruction as an ADDQ.

The other special case instructions are AMASK and IMPLVER. For these instructions a CPU-specific constant is needed. The AMASK constant is driven onto Ra using the 16 bits needed by LDA/LDAH. The IMPLVER constant is driven onto Rb as a literal. For both instructions, the logic box performs the (Rc = Rb & !Ra) operation.

To correctly handle the propagation of poison even for Fbox instructions, and to service FTOI and Fstore instructions that will receive their Fa operand from the register file, the register file interface will drive the opbusses for any active (TPU != 0) instruction whose operands did not hit in the OSU, even if it is not handled by the Ebox.

6.2.5 Virtual Address Generator

The virtual address generator is a specialized adder that computes the virtual address for instructions that reference memory. Virtual address generation involves adding a signed displacement to Rb and adjusting the low bits to account for endian and alignment constraints. Although the main adder could have been extended to handle this function, as was done in previous Alpha chips, feasibility studies showed that a combined adder with the additional input multiplexing and output control was too slow for the 21464. Because the Mbox acts as the conduit for addresses passed to the Ibox, this unit also decodes JMP instructions and passes the target PC to the Ibox via the Mbox. There are three basic equations:

va = Rb + SEXT(disp<15:0>)    LDx, STx
va = Rb + SEXT(disp<10:0>)    HW_LD, HW_ST
va = Rb                       JMP, ECB, FETCHx, WH64, HW_MTPR, QUIESCE

The instructions serviced by the VAx section are:

Table 6-7 Instructions Serviced by the Ebox Virtual Address Generator Unit
Type     Instructions
Load     LDL, LDQ, LDQ_U, LDL_L, LDQ_L, LDBU, LDWU, LDG, LDS, LDT, LDF
Store    STL, STQ, STQ_U, STL_C, STQ_C, STB, STW, STG, STS, STT, STF
Jump     JMP, JSR, RET, JSR_COROUTINE
Special  TRAPB, EXCB, MB, WMB, ECB, FETCH, FETCH_M, WH64, HW_LD, HW_ST, HW_MTPR, LDx_ARM, QUIESCE

The virtual address generator is implemented identically in each integer cluster, but not all of the above instructions can be issued to all clusters. The Mbox has a limit of three load instructions, two store instructions, or a combined maximum of four per cycle. The Ibox can only accept one jump per cycle. Slotting restrictions in the Qbox will guarantee that the instructions issue to the correct pipelines. Alignment and overflow/underflow errors are detected by the generator and reported to the Mbox. These errors are retire-time traps that the Mbox prioritizes with other traps before reporting to the Qbox.
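As a rough illustration of the three equations above (a sketch only - the real datapath is a dedicated displacement adder; the field widths are taken from the equations as written):

    #include <stdint.h>

    /* Sign-extend the low 'bits' bits of x to 64 bits. */
    static int64_t sext(uint64_t x, unsigned bits)
    {
        uint64_t m = 1ull << (bits - 1);
        x &= (1ull << bits) - 1;
        return (int64_t)((x ^ m) - m);
    }

    /* LDx, STx: 16-bit displacement from the instruction word. */
    static uint64_t va_ldx_stx(uint64_t rb, uint32_t inst)
    {
        return rb + (uint64_t)sext(inst & 0xffff, 16);
    }

    /* HW_LD, HW_ST: 11-bit displacement, per disp<10:0> above. */
    static uint64_t va_hw_ld_st(uint64_t rb, uint32_t inst)
    {
        return rb + (uint64_t)sext(inst & 0x7ff, 11);
    }

    /* JMP, ECB, FETCHx, WH64, HW_MTPR, QUIESCE: va is simply Rb. */
    static uint64_t va_rb_only(uint64_t rb)
    {
        return rb;
    }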
The address generator receives the opcode and other instruction information from an EYA latch in the GCx section. The opcode/function is decoded and, if one of the above instructions is detected, controls and clocks are sent to the datapath to enable execution. When no match is detected, suppression of the clocks prevents any further action.

When active, the Rb operand is latched at E0A and the opbus precharge logic is signaled. For non-store instructions, the address generator also activates the opbus precharge drivers for Ra. For store instructions, the SHx section captures the store data and handles Ra precharging, and only Rb is latched and precharged by this section.

The address generator does not produce a result value that must be stored in the register caches or register file. The virtual address is sent directly to the Mbox from an early E0B latch, and exception flags and poison status flags follow from an E1A latch. For load operations, the LDx section will eventually receive the load data and handle forwarding it to the register cache and register file.

The address and exception information is not passed directly to the Mbox from the VA1 section, but goes through the VA2 interface block at the bottom of the Ebox where the single-ended ADDR and the differential INDX busses are formed. For clusters EC and ED, the addresses are alternately driven onto the weak-load (P2) and store-only (P3) busses to the Mbox. To keep the ability to replicate the cluster, the VA units in the EA and EB clusters will also contain the logic to drive one of two busses to the Mbox, but connections will only be made to a single bus.

6.2.6 Load Data Interface

The instructions serviced by the load data interface unit are:

Table 6-8 Instructions Serviced by the Ebox Load Data Interface Unit
Type     Instructions
Load     LDL, LDQ, LDQ_U, LDL_L, LDQ_L, LDBU, LDWU, LDG, LDS, LDT, LDF
Special  HW_LD, STx_C

6.2.7 Multimedia Interface

The multimedia operand interface forwards instruction decode and payload information to both the MM cluster and store interface. Since the MM unit only handles instructions from opcodes 13 (MULx), 14 (I2F), and 1C (multimedia), and the operand interface needed to perform the opcode decode to correctly latch the operands, pre-decodes are forwarded to the MM unit instead of the opcode and TPU values. The function code (inst_info<6:0>) is forwarded so the MM unit can complete the specific instruction decoding.

To avoid the need to pass the opcode, TPU or inst_info fields over to the store interface block, the multimedia operand interface unit generates the specific control signals needed by the store interface to control latching, muxing and format conversion. For floating store or FtoI instructions that source their operands from the register file, the Ebox provides the conduit to the store interface unit through the multimedia operand interface unit. The instructions serviced by the multimedia unit or store interface unit are:

Table 6-9 Instructions Serviced by the Ebox Multimedia Interface Unit
Type        Instructions
Multiply    MULL, MULL/V, MULQ, MULQ/V, UMULH
Multimedia  Opcode 1C.*, except SEXTB, SEXTW
Store       STL, STQ, STQ_U, STL_C, STQ_C, STB, STW, STG, STS, STT, STF
Special     ITOFF, ITOFS, ITOFT, HW_ST

6.2.8 Global Control

The global control section decodes the instruction issued to the pipeline and detects valid single-cycle instructions as well as all illegal instructions. The OP_1CYCLE status flag is used by the cross-cluster interfaces and register caches to control distribution and updating of results. The illegal instruction decode is combined with overflow information produced by the adder and multiplier to produce the exception status vector sent to the Qbox.

6.2.9 Store Data Interface

The instructions serviced by the Store Data Interface unit are:

Table 6-10 Instructions Serviced by the Ebox Store Data Interface Unit
Type     Instructions
Store    STL, STQ, STQ_U, STL_C, STQ_C, STB, STW, STG, STS, STT, STF
Special  ITOFS, ITOFF, ITOFT, FTOIS, FTOIT

Figure 6-4 shows the ITOFx and FTOIx instruction store data paths.

Figure 6-4 Ebox ITOFx and FTOIx Floating-Point Store Data Paths (I2F, F2I and floating store data paths)

6.3 Operand Steering

The operand steering unit tracks the physical register numbers of all instructions that have issued in the past eight cycles and performs compares against the physical register numbers of the four source operands issued to each cluster each cycle. The destination physical register numbers are staged to match the result staging, and the write pointer into the OSU CAM is identical to the write pointer used to write the register cache. This structure generates match lines that directly equate to bypass enable signals for the interface units and register cache.

6.4 Register Caches

The register caches locally store copies of recently generated results, allowing instructions which depend on these results to execute sooner than if the results needed to be written back to the main register file and subsequently re-read. Without the register caches, the parent-to-child issue delay on the 21464 would have been at least three cycles longer. The register caches also equalize the issue-to-result latency and therefore eliminate the contention for register file write ports that the varying Ebox and Fbox instruction latencies would have created.

Logically, the register cache can be thought of as a shift register. Results are entered based on their execution latency, shift out to the register file in E4, and finally out of the register cache in E7, after which the register file will source the value.

Figure 6-5 Ebox Register Cache Block Diagram

The logical representation above only shows how a single functional unit can access its own results; in reality the result multiplexing is much more complex and allows a functional unit to use any result produced by any other functional unit. Drawing a picture to represent that level of multiplexing is an exercise left for the reader.

Although easy to conceptualize, physically building a register cache out of latches as diagrammed above would waste both area and power. The Ebox register cache is built with multi-port static RAM cells. Instead of moving the data through a fixed FIFO, the RAM version keeps the data in place and moves the read and write pointers.

Figure 6-6 Ebox Register Cache Multiport Static RAM Block Diagram

To provide the necessary locality to meet timing goals, each integer cluster contains a private copy of the register cache. The copies are identical, each containing the full set of available results from every instruction the Ebox executed in the past seven cycles.

The term available is actually a key point. Single-cycle instruction results are not stable until late in the cycle. The results can be locally driven onto the opbus wires, but there is insufficient time to write the register cache or send the results to any of the other clusters. The E1 cycle is used to transport results to the other clusters, but the transport delay also consumes much of a cycle, leaving only enough time to bypass onto the remote opbusses. The register caches are actually written the second cycle after the results are produced. Table 6-11 and Figure 6-7 show the single-cycle result flow.

Table 6-11 Ebox Register Cache Single-Cycle Result Flow
Cycle  E0A      E0B           E1A                 E1B               E2            E3
Ebox   Execute  Local bypass  Transmit X-cluster  X-cluster bypass  Write Rcache  Read Rcache

Figure 6-7 Ebox Register Cache Single-Cycle Result Flow

Multi-cycle instruction results are produced outside the integer clusters and broadcast to all clusters simultaneously. All multi-cycle instructions have either a three-cycle latency (loads) or a five-cycle latency (multimedia, FtoI, jumps, IPR reads). Each cluster independently bypasses these results if needed immediately, then writes the register cache the following cycle.

Table 6-12 Ebox Register Cache Multi-Cycle Result Flow
Ebox 3-cycle  E2: finish execute, bypass load        E3: write Rcache  then read Rcache
Ebox 5-cycle  E4: drive result, bypass multimedia    E5: write Rcache  then read Rcache

Figure 6-8 Ebox Register Cache Multi-Cycle Result Flow

Combining the single and multi-cycle cases shows each register cache receiving up to 15 results per cycle: eight single-cycle latency results (one from each functional unit), three three-cycle latency results from the memory load interfaces coupled to functional units 4, 5, 6 and 7, and four five-cycle latency results from the multimedia clusters also associated with functional units 4, 5, 6 and 7.

Each register cache will source up to four operands, two to each of the functional units in the cluster. Because a result can be used as either or both inputs to any number of future instructions, every register cache entry can drive all four of the operand busses within the cluster.
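A minimal software sketch of the idea described above (entry count and names here are assumptions for illustration, not the real structure): rather than shifting data every cycle, the write pointer advances and each entry simply ages in place until the register file takes over as the source.

    #include <stdint.h>

    #define RCACHE_ENTRIES 8           /* assumed depth for this sketch only */

    struct rcache_model {
        uint64_t data[RCACHE_ENTRIES];
        unsigned wptr;                 /* round-robin allocation pointer */
    };

    /* Once per cycle: advance the pointer; the entry it now selects is old
     * enough that the register file sources the value, so it is free for
     * reallocation. */
    static void rcache_tick(struct rcache_model *rc)
    {
        rc->wptr = (rc->wptr + 1) % RCACHE_ENTRIES;
    }

    /* Writing a result overwrites the entry selected by a latency-skewed
     * pointer; no older entry has to move. */
    static void rcache_write(struct rcache_model *rc, unsigned latency_skew,
                             uint64_t value)
    {
        rc->data[(rc->wptr + latency_skew) % RCACHE_ENTRIES] = value;
    }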
The register cache entry organization is:

0  7 entries, 4 read ports, 1 write port
1  7 entries, 4 read ports, 1 write port
2  7 entries, 4 read ports, 1 write port
3  7 entries, 4 read ports, 1 write port
4  7 entries, 4 read ports, 3 write ports
5  7 entries, 4 read ports, 3 write ports
6  7 entries, 4 read ports, 3 write ports
7  7 entries, 4 read ports, 3 write ports

6.4.1 Writing the Rcache

The fixed interval between instruction issue and register file update eliminates the need for any form of busy or free status to be associated with register cache entries. Entries are assigned in a round-robin fashion and are guaranteed to be free for reallocation on the next pass. The assignment sequence is a simple modulo "cache depth" counter implemented as a 1-bit, one-hot shift register. The single-bit allocation pointer is directly combined with the instruction latency information to produce the enable signals for each of the write ports.

    en_wr0_E1[n] = 1cycle_OP_E1 && ptr[n];
    en_wr1_E2[n] = 3cycle_OP_E2 && ptr[(n+1) % RCACHE_ENTRIES];
    en_wr2_E4[n] = 5cycle_OP_E4 && ptr[(n+3) % RCACHE_ENTRIES];

Remember, single-cycle results are not written until E2, three-cycle results are written in E3, and five-cycle results are written in E5; because the allocation pointer is just a shift register, skewing the indices is equivalent to a time delay.

Figure 6-9 Writing Entries in the Ebox Register Cache

6.4.2 Reading the Rcache

Each register cache entry has four read ports, one to each opbus in the cluster. Read control information for each opbus is driven from the Operand Steering Unit. OSU CAM matches to upper-pipe results can be used directly as read enable signals to the register cache. Matches to lower-pipe results are more complex because of the ambiguity between a load result and a single-cycle operation result that occurred two cycles earlier. To resolve the ambiguity, the OSU also drives a set of bypass-active signals for each of the four lower pipes. If a load or multimedia bypass is active, the register cache should not be read.

The cycle timing of the operand control information is shown below. The OSU performs the CAM operation in cycle EYB, the match lines are distributed as bypass enables in EZA, and the register cache is read in EZB.

Table 6-13 Ebox Cycle Timing of Operand Control Information

6.5 Multimedia Unit

The Multimedia Unit consists of three major sections, shown below. The Control Logic occupies the left side of the unit. The computational logic is divided into two sections: the first section handles integer multiply instructions; the other section handles the MVI instructions.

Figure 6-10 Ebox Multimedia Unit Block Diagram
MVI Ir structions ~ Instruction Decode I Control + ~ + ...... "II Integer Multiplier . RMUX El2nJ<63:00> ... RMUX_El2n+ 11<63:00> 6.5.1 Inputs and Outputs The opcodes arrive from below in a wiring channel from the integer execution units. The operands arrive from the bottom and are shared with the Ebox lower units. The result bus exits from the bottom of the box and goes to the register caches. 6.5.2 Signal Nomenclature All signals belong to the E box and the Media partition. Signal names start with the EM prefix. The three section prefixes are CTL for the control section, MUL for the Multiplier section, and MVI for the MVI section. Thus, the three valid prefixes for multimedia unit signals are EM_CTL, EM_MUL, and EM_MVI. 6.5.3 Timing Figure 6-11 Ebox Multimedia Unit Pipeline Timing R2 A l EO B Transport & decode 6-18 A Data xport E1 B A I E2 B A J E3 B Execute pipeline A l E4 B E5 A B Drive result cluster X- A B Write Read rcache rcache Compaq Confidential Integer Execution Unit -the Ebox 5 Jc1m.1c1ry 2001 ···Subject To Change Multimedia Unit The Op Code and Function Code are clocked in the E box on R3A. The operands and final control signals are latched on EOB. Execution begins on EOB. The longest operation completes by E4A. All instructions are delayed until E4A before being driven on the Result Bus. 6.5.4 Instruction Decode/Control Section The Instruction Decoder looks at the OpCode and Function Code to determine the operation to be performed. From this information, it extracts the fallowing fields: • Arithmetic/Logic Function • • Byte/Word/Longword Signed/Unsigned The OpCode and Function Codes are decoded into instruction names and latched on EO. Each signal is named Ep_CTL%"inst.name"_EOA_H. There are 8 opcode decodes for the Multiply section. There are 24 opcode decodes for the MVI section. The signal (Ep_CTL%quiece_EOA_H) is asserted if no instruction is recognized. A 2 bit code represents the Byte/Word/Longword state. For the normal IMUL opcodes (MULL, MULL/ V, MULQ, MULQN, and UMULH, the data types are implicit in the opcode and are not included in the Byte/Word/Longword decoded state. The code is defined in the table below: Value State 00 01 lx Byte Word Longword Signed/Unsigned is represented by a 2 bit code Ep_CTL%SGN_EOA_H<l :0>. This is defined in the table below: Value State 11 00 10 Signed Unsigned Signed * Unsigned (TMUL) 6.5.5 MVI Section The MVI section accepts instructions on EOA and operands on EOB. It produces results on E4A. The block diagram of the MVI section is shown below: Compaq Confidential 5 January 2001 -- Subject To Change Integer Execution Unit - the Ebox 6-19 Multimedia Unit Figure 6-12 Ebox Multimedia Unit MVI Section Block Diagram Result Shifter Pack Min /Max L ALU Ra Ra Rb Rb 6.5.6 ALU The ALU serves a number of instructions. It computes the magnitude of (Ra-Rb) for the TABSERR and the TSQERR instructions. It also performs the additions and subtractions for the TADD, TSUB, PADD, and PSUB instructions. In addition, it performs the first level of compares for the MINMAX instruction, the MIN instruction, and the MAX instruction. Finally, it performs the compares for the CMPWGE instruction. The block diagram is shown below: Figure 6-13 Ebox Multimedia Unit Arithmetic Logic Unit Diff Aop MulDiff Bop Mux cmp[1 :0]<15:08> cmp[1 :0]<07:00> Ra Rb 6-20 Compaq Confidentia I Integer Execution Unit-the Ebox 5 Janw~ry 2001 ···Subject To Change Multimedia Unit The Pre-MIN/MAX Mux shuffles the bytes appropriate bytes to the two adders for each instruction. 
It generates 4 busses; the a and b inputs to Add/Sub X and the a and b inputs to Add/Sub Y. The X adders in Ra and Rb to present the performs a+b or a-b on bytes, words, or longwords .. The Y adders perform b-a on bytes, words, or longwords. CMPLT and OVFLO for each byte from both adders is brought out for control. 6.5.6.1 TADD, TSUB PADD, PSUB, CMPWGE, MIN, MAX Instructions The Add/Sub X block gets all the Ra inputs on it's a inputs and all the Rb inputs on its b inputs in normal byte order. The MUX passes the sum. Signed/Unsigned does not matter to ALU block. For byte operations, each byte grows to 9 bits, which is passed through the Mux to the Tree Adder (TADD) or the saturation logic (PADD). For Word operations, each word grows to 17 bits, which is passed on to the Tree Adder or Saturation Logic. The remaining instructions send the sign bits to the control logic. 6.5.6.2 TABSERR Instruction Both Add/Sub blocks get the same data as the TADD, TSUB instructions. However, the X adder performs a-b and the Y adder performs b-a. The sign bits of the X adder are used to control the MUX. The MUX selects the positive result for each byte or word. This is passed on to the Tree adder. 6.5.6.3 TSQERR Instruction A separate 8 or 16 bit subtract computes the difference between A and B and passes the result to the multipliers. This is done in one phase so the multipliers can start one phase sooner than they could if the other ALU structure were used. 6.5.6.4 Min/Max Instruction The min/max instruction uses 12 of the 16 adders to perform the first level of comparison for finding the min and max. This divides a register into 2 groups of 4 bytes and compares the bytes in each group as shown below: Figure 6-14 Ebox Multimedia Unit Computation of the Min/Max Instruction Byte 7 Byte 6 The byte reshuffling is defined in Table 6-14. Compaq Confidential 5 January 2001 ··· Subject To Change Integer Execution Unit - the Ebox 6-21 Multimedia Unit Table 6-14 Ebox Multimedia Unit Min/Max Instruction Byte Reshuffling Bytes Byte4 Byte3 Byte2 Byte 1 ByteO Xa RaO Ra 1 Rao Ra2 Ral Rao Xb Ra 1 Ra2 Ra2 Ra3 Ra3 Ra3 Bus Byte7 Byte6 Ya Ra5 Ra7 Ra7 Ra7 Ra6 Ra6 Yb Ra4 Ra6 Ra5 Ra4 Ra5 Ra4 The CMPLT and OVFLO bits for each byte are sent to control logic which generates signals to control the remainder of the MINMAX logic further down the pipeline. 6.5. 7 Multiplier Array The multiplier is used for PMUL, TMUL, TSQERR instructions. It takes inputs from the ALU section and is configured as 8 8X8 multipliers or 4 l 6Xl 6 multipliers. It can handle signed* signed, unsigned* unsigned, or signed* unsigned input operands. It selects inputs either from the A and B bus (PMUL, TMUL) or the multdiff output from the ALU (TSQERR). It passes either 4 32 bit results or 8 16 bit results on to the tree adder. It is configured as 2 bit Booth coded stages followed by an array of carry save adders. The l 6Xl 6 /dual 8X8 multiplier structure is shown in the figure below. Four copies are required for the full multiplier box. Compaq Confidential 6-22 Integer Execution Unit - the Ebox 5 Jc1nuary 2001 ·-Subject To Change Multimedia Unit Figure 6-15 Ebox Multimedia Unit Multiplier Array Block Diagram Di ff Aop MulDiff Bop B Operand Mux Booth Encode Partial P oduct Mux The two data paths show 8X16 multipliers. For word operations, the TADD Mux shifts the right data path 8 bits to the right before adding to the left data path. This is required to form a 16Xl6 multiply from two 8Xl6 multiplies. 
For Byte operations, the Partial Product muxes sign extend each byte into bits< 15 :08>. The TADD Mux does not shift the right data path. The two CSA* blocks are used as the first stage of the Tree Adder for byte operations. (The tree adder combines 8->4, 4->2, 2-> 1 for bytes. For words, it only combines 4->2, 2-> 1 The Tree Add Muxes also bring in the data for byte Tree Operations that do not use the multiplier (i.e. Tree Add, Tree Sub, Tree ABS Val) so the CSA* blocks can be used as the first tree adder stage. 5 January 2001 --· Subject To Change Compaq Confidential Integer Execution Unit - the Ebox 6-23 Multimedia Unit Figure 6-16 Ebox Multimedia Unit Multiplier Array Tree Adder Multiplier 2 Multiplier 3 Multiplier 1 Multiplier 0 CSA CSA CSA CSA CSA CSA Tree I NonTree Mux Full Adder The results from the 4 multiplier blocks are combined in the tree adder as shown below: The Tree/NonTree mux selects the output from the tree adder for all operations except PMULH and PMULL. For those operations, the high or low 16 bits of product from each multiplier are selected. The full adder combines the sums and carries from the carry save adder array to form the final result. It must be multiplexed with the other sources of final results before being sent to the Register Cache. 6.5.8 Count Logic The Count Logic is used to support the CTPOP, CTLZ, and CTIZ instructions. CTPOP counts the number of bits that are "1" in Rb. CTLZ counts the number of leading zeros in Rb. CTTZ counts the number of trailing zeros in Rb. These were implemented in the 21264 by building logic to look at each bit pair and indicate whether 0 to 8 items to be counted are present. This information is then fed to the tree adder, which produces the final tally. The implementation in this section is done the same way. 6-24 Compaq Confidential Integer Execution Unit-the Ebox 5 Jam.1c1ry 2001 ···Subject To Change Multimedia Unit 6.5.9 Compare Word, Saturation, and the 21264 Min Max From this point on, all logic blocks take inputs from the ALU and produce results that will eventually be multiplexed with the Tree Adder block. The results from the rest of the logic blocks must be delayed to line up with the Tree Adder output. The first step in this delay is to latch the inputs to this block and drive the latched data to the remaining logic blocks. When the Compare Word instruction is executed, the ALU does 4 unsigned word subtracts and sends the sign and carry bits to the control logic. The Compare Word logic gets the control bits and forms 8 bits to be output in bits<7:0>. Saturation is required for the PADD and PSUB operations. The ALU performs the appropriate add or sub for bytes/words, signed/unsigned and sends the compare bits to control logic. It sends the arithmetic result down the data bus allowing it to overflow. The Saturation logic either passes the result or forces the appropriate saturation result based on the control bits. The 21264 Min and Max instructions select the minimum or maximum between A and Bon a byte by byte or word by word basis. The appropriate add or subtract is performed in the ALU, the compare bits are sent to the control logic. The 21264 MIN MAX logic receives the A and B bus with control bits from the control logic. It multiplexes between the A and B inputs to select each Min or Max byte or word. These three functions are implemented with a multiplexer controlled by bits derived from the ALU Sign and Carry bits. 6.5.10 MinMax Logic The MinMax logic performs the second stage of the new MINMAX instruction. 
The first stage was performed by the ALU, which generated 12 compare results for bytes, 6 compare results for words, and one compare result for longwords. The first step for the second stage is to take the compares generated by the first stage and assemble the minimums and maximums from the A bus inputs. For bytes, the comparisons generated two sets of minimums and maximums; one for the first low 4 bytes and one for the high 4 bytes. For words and longwords, one minimum and maximum are selected. Next, these must be compared with the previously found minimums and maximums in the B register. This is done with partial difference circuits. The actual difference is not needed, just the results of the comparison. This information is then used to control a second multiplexer which selects the minimum and maximum from the three (bytes) or two (words or longwords) candidates. The results are sign extended to longwords and sent to the result bus. The block diagram is shown below: 5 January 2001 -· Subject To Change Compaq Confidential Integer Execution Unit - the Ebox 6-25 Multimedia Unit Figure 6-17 Ebox Multimedia Unit Min/Max Logic Block Diagram Aop Bop Minimum Selection Mux Byte Min mparison Byte Max omparison Final Max Selection Mux Pack and Sign Extend ---"""""'E3A Result 6.5.11 Pack, Unpack, Permute Byte The Pack, Unpack, and Permute Byte logic generate control signals for the Shifter logic. Pack must detect when word->byte or longword->word overflow. The Shifter logic will be commanded to saturate. Unpack must sign extend the result for signed data types. The Permute Byte instruction decodes the B register inputs and commands the Shifter Logic to reorder bytes from the A input, force zeros, force 1 's, sign extend from the result byte to the right, or select bytes from the high longword of the B register. 6.5.12 Shifter The shifter has a horizontal bus that permits each byte to select any other byte from the A register, any of the 4 high bytes of the B register, or various littorals the support saturation, sign extension, force to ones, or force to zeros. The total bus structure and one byte slice are shown below: 6-26 Compaq Confidential Integer Execution Unit - the Ebox 5 Jc1nuary 2001 ··· Subject To Cf1ange Multimedia Unit Figure 6-18 Ebox Multimedia Unit Shifter A<07:00> A<15:08> A<23:16> A<31:24> A<39:32> A<47:40> A<55:48> A<63:56> 8<39:32> 8<47:40> 8<55:48> 8<63:56> Sign Extend Saturate Force Ones Force Zeros ~ ~ ~ ~ 16 Way Byte Mux ._ ,.- • Result<63:56> l _y J ~ [_ ~ ~ 16 Way Byte Mux -....---- . I _y J ............... ........................ Result<55 :48> 6.5.13 Delay The Delay block aligns all the different instructions in time so that one result bus may be shared. There are 3 different busses carrying results with different timing. The result bus from the Tree adder is the longest latency bus. It is available to be latched on E4A and is sent to the result bus with no delay. The next longest delay is the result coming from the MINMAX logic which is available to be latched on E3A. It is delayed one clock and sent to the result bus. The third bus is available to be latched on E2A . It is delayed two clock cycles and sent to the result bus. No bypassing is performed during these delays. 6.5.14 Integer Multiplier The integer multiplier handles the MULL, MULL/V, MULQ, MULQ/V, and UMULH instructions. The integer multiplier is implemented as a 2 bit Booth encoded signed multiply. It must support unsigned multiplies for the UMULH instruction. 
The block diagram is shown below: 5 January 20()1 ··· Subject To Change Compaq Confidential Integer Execution Unit - the Ebox 6-27 Multimedia Unit Figure 6-19 Ebox Multimedia Unit Integer Multiplier Ra In ut Rb Input Multiplier Multiplicand Partial roduct Mux s (8) 2-bit Booth Recode Logic Partial Product Muxes (8) Sign Extend Logic CSA Array, T read 1 Sign Extend Logic LSB Logic CSA Unsigned CSA Full Adder Mux Result The Ra inputs are encoded into 32 2 bit Booth partial products. Each Booth encoder looks at its own 2 bits plus one bit to the right. The three bits are encoded as shown in the following table. 6-28 Code Partial Product 000 001 xO xl 010 011 100 101 110 111 xl x2 (-l)x2 (-l)xl (-l)xl xO Compaq Confidential Integer Execution Unit-the Ebox 5 Jc1mJc1ry 2001 - Subject To Change Debug Features Each partial product is created with a multiplexor the selects Ox, lx, or 2x the multiplicand and inverted or uninverted outputs. The minus is formed by the inversion and a carry that is inserted in an open position in the Carry Save Adders. The Carry Save Adder arrays are divided into two "threads". Multiplier<! :0> , <4:5>, <9:8> ... partial products are summed in one thread. Multiplier <3:2>, <7:6>, <11 :10> ... are summed in the other. This significantly reduces the number of levels of gate propagation required for the final product. The two threads are summed together in two more ranks of CSAs. One more CSA is required to support unsigned multiplies. This also serves as the place to put the last carry in for the minus case of the highest order partial product. The LSB logic begins the process of propagating the carry and producing the low order bits of the product while the higher order bits are working their way through the CSA array. The Full Adder combines the sums and carries from the Carry Save Adder array and from the LSB logic to produce the final product. The multiplexor selects either the low order bits or the high order bits depending on the instructions that was decoded. 6.6 Debug Features Debug features in the 21464 come in several flavors: • • Error detection logic to halt trap to PAL or halt trace collection • A trace bus to collect internal state . CYA bits to disable performance features or select simpler algorithms . There are currently no CYA bits defined for the Ebox. The Ebox will be able to signal a trap based on a programmable decoder in the global control section. A value and bitmask for pipeline, tpu, opcode and function will be compared to each valid instruction issued and a signal will be sent to the global debug handler whenever a match is detected. The decoding will not be limited to Ebox instructions but will not be able to detect the NOP and MB instructions retired immediately by the Qbox. For observability, the Ebox is considering allowing collection of the following signals onto the debug trace bus: • • • Pipeline active flags which indicate the latency of the instruction issued to the pipe cbr_mispred and opx_poison status flags tpu, opcode, function bits for a specified pipe . The objective is to incorporate all debug logic into the EQ partition and take advantage of the fact that most interesting control wires flow over the top of the EQ partition. One of the most interesting problems is how to write the IPR bits necessary to control this logic. Some hack where the bits are actually stored in the Ibox or Mbox and captured in the Ebox on an IPR read operation might make the most sence. 
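As a sketch of the programmable value/bitmask trap decoder described in the Debug Features section (field widths, names, and the struct grouping here are illustrative assumptions):

    #include <stdbool.h>
    #include <stdint.h>

    /* Fields of an issued instruction visible to the debug decoder. */
    struct issue_fields {
        uint8_t  pipe;
        uint8_t  tpu;
        uint8_t  opcode;
        uint16_t function;
    };

    /* Programmable value and bitmask for each field. */
    struct debug_match {
        struct issue_fields value;
        struct issue_fields mask;
    };

    /* Signal the global debug handler when every masked field matches. */
    static bool debug_trap(const struct debug_match *m, const struct issue_fields *i)
    {
        return ((i->pipe     ^ m->value.pipe)     & m->mask.pipe)     == 0 &&
               ((i->tpu      ^ m->value.tpu)      & m->mask.tpu)      == 0 &&
               ((i->opcode   ^ m->value.opcode)   & m->mask.opcode)   == 0 &&
               ((i->function ^ m->value.function) & m->mask.function) == 0;
    }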
Compaq Confidential 5 J~muary 2001 -~ Subject To Change Integer Execution Unit - the Ebox 6-29 Testability Features 6.7 Testability Features The Ebox is considering a boundary SCAN based methodology for manufacturing fault detection. Scan latches would be implemented in the EQ partition where virtually all control inputs from the Qbox enter the Ebox. Operation of all functional elements in the Ebox including the register caches can be achieved through this interface. To allow depositing and examination of results, the scan chains will be extended across the top of the Ebox through the latches that hold the operands and results flowing to and from the Register File. With this level of control and observability, the only major structures not covered would be the virtual address generation and load data interface blocks that interact with the Mbox. The current belief is that the scan-based features are adequate to test the register caches and BiSTengines will not be required in the Ebox. 6.8 External Interfaces: lbox, Qbox, Pbox, Mbox, Register File, Fbox 6.8.1 lbox The Ebox needs to communicate instruction flow information with the Ibox. The instructions that control program flow are Branches and Jumps. For conditional branches, the Ibox has already predicted an execution path and needs to be notified if it chose the wrong path. Since up to eight conditional branch instructions can be executed by the Ebox at once, the Ibox requires INUMs to identify which branches mispredicted and which predicted correctly. The Ebox does not have access to INUMs so it returns a set of branch mispredict flags to the Pbox. The Pbox then associates the flags with INUMs and notifies the Ibox of the oldest mispredicted branch. The Ebox drives the mispredict flags to the Pbox exception funnel from an ElA latch in the EQ partition. The mispredict flags are not conditioned with poison so the Pbox must correctly handle branches that mispredict due to poisoned data. To speed-up the branch mispredict path, the Qbox pre-determines the oldest issued conditional branch and guesses it will mispredict. The Ibox prefetches the PC of this branch and the Ebox sends a single bit to the Ibox indicating if that branch actually mispredicted. If it did, we are several cycles into recovery, if not, the Ibox must wait for the Pbox to figure out which CBR (if any) was the oldest mispredicted branch. The target virtual address of a Jump instruction is sourced from register Rb. When executing a jump instruction, the Ebox forwards register Rb to the lbox for comparison against the predicted target PC. Since the Qbox only schedules one Jump instruction per cycle and only into Functional Units 4 or 5, the multiplexed weak load address bus is used to transmit the target address to the Ibox. The Ibox was told the jump would issue in Q5 and drives the return PC in time to be returned to the Ebox over the IPR_RD data bus in cycle E3. Return PC's from jump or unconditional branch instructions flow through the Ebox multimedia result path. 6-30 Compaq Confidential Integer Execution Unit-the Ebox 5 Janwtry 2001 ···Subject To Cfumge External Interfaces: lbox, Qbox, Pbox~ Mbox, Register File, Fbox The Ebox also uses these paths to execute the Ibox HW_MFPR and HW_MTPR Internal Processor Register (IPR) read and write instructions. IPR write data is sent on the weak-load address bus, IPR read data is returned with the same timing as a jump return PC through the IPR_RD data bus. 
The only other signals the Ebox receives directly from the Ibox are the KERNEL_MODE and PP_ENABLE status vectors. If a thread attempts to execute a privileged CALL_PAL instruction when its KERNEL_MODE bit is not set, the Ebox will report an illegal instruction exception. If a thread attempts to execute any floating point instruction when the PP_ENABLE bit is not set, the Ebox will report an illegal instruction exception. These signals lack a timing specifier because the pipeline must be flushed before the bits can change. Because of the flush, the signal is stable for many cycles before the next instruction reaches the Ebox. 6.8.2 Qbox Instructions are passed to the Ebox from the Qbox. With each valid instruction, the Qbox sends most of the original instruction longword, source and dest operand pointers and some control information. Exception information is the only information the Ebox returns to the Qbox. Instruction information like opcode and function code is sent to the Ebox through the payload array in the Qbox. The opcode bits are transmitted in tact, but the rest of the original instruction longword is packed based the instruction format. For CALL_PAL instructions, the upper 11 bits of the 26-bit function code are ORed together and packed into the 16-bit info field. Table 6-15 Instruction Information From the Qbox to the Ebox Format Field Instruction Bits INST_INFO Memory Displacement I Function 15:0 15:0 Branch/Jump None None None Operate Literal & Function 20:5 15:0 Floating Function 15:5 10:0 RS/RC Function + Intr_flag 15: 1, intr_flag 15:1, 0 CALL_PAL PALcode Function OR(25: 15), 14:0 15, 14:0 MFPR/MTPR Index & Class Pal, 24:21, 3:0, 11:5 15, 14: 11, 10:7, 6:0 The data is read out of the Qbox payload in cycle Q4B, transmitted to the Ebox from a Q5A latch and received by the Ebox in an EYA latch for decoding. Since the operand data for an instruction can be found on result busses, in the register cache or in the main register file, the Ebox needs to determine where the source data is located. The operand steering unit in the Ebox keeps track of instruction results and generates the control signals necessary to drive the correct source operand select lines. The Ebox compares the instruction source operand pointers to the destination pointers from the recently issued instructions. A match is a bypass or register cache read, a miss is a register file access. Compaq Confidential 5 January 2001 - Subject To Change Integer Execution Unit - the Ebox 6-31 External Interfaces: lbox, Qbox, Pbox, Mbox, Register File, Fbox The thread processor unit is needed by the Ebox to select the correct IPR bits. The Mbox provides a per-TPU copy of the B_ENDIAN IPR bit to the Ebox for use when computing byte shifts or when generating memory addresses. The Ibox drives a perTPU copy of the KERNEL_MODE and FP_ENABLE context bits for use decoding illegal instructions. The thread processor unit is a one-hot structure indicating the thread associated with this instruction. If no thread ID bits are set, the pipe is defined to be inactive. The PAL_MODE bit is used to report illegal (due to insufficient privilege) instruction errors only, it does not effect Ebox processing of instructions. See the exception handling section for more information on Ebox exception processing. 6.8.3 Pbox The Pbox handles prioritization and notification of conditional branch mispredict information to the Ibox. After issuing a set of instructions, the Qbox speculates the oldest issued branch will mispredict. 
True or False, the Pbox must still scan the mispredict vector for any other branches that missed. The conditional branch mispredict signals sent to the Pbox are not conditioned with poison. The Pbox exception logic must ignore all exceptions resulting from poisoned operands as neither the Ebox or Fbox factor poison into any exception reporting. 6.8.4 Mbox The Ebox interface to the Mbox is primarily used to resolve instructions that reference or manipulate the memory system. The load path is a super tight timing path. The Ebox needs to computes the virtual address and send it to the Mbox. The Mbox then accesses the Dcache and returns the data to the Ebox all within three cycles. The goal is for the Ebox to compute the address early in cycle EOA and begin transmitting it to the Mbox. The assumption is that transmission delay will account for most of EOB and that the Mbox will latch the address and begin processing in ElA. The Ebox intends to create adders specially tuned for the 16-bit displacement-add. The 21464 architecture limits the parallelism to no more than three load type instructions, two store type instructions and a maximum of four memory instructions total per cycle. The load data arrives in the Ebox late in cycle E2B and the Ebox will drive store data early in E2A expecting the data will be available in the Mbox by the end of cycle E2. For STx_C instructions, the lock flag must be returned to the Ebox and stored in the destination register. The Mbox will bubble STx_C instructions and drive this flag in cycle E2B relative to the bubble. IPR writes will be performed through the LD/ST interfaces to the Mbox. For Ibox HW_MTRP instructions the Ebox will send the data (Rb) as the address on the weakload address port in cycle EOB and the Mbox will forward the data along to the Ibox. Mbox HW_MTPR instructions will issue to the strong-load pipes and can issue up to two per cycle. All HW_MFPR reads issue to the weak-load pipes, one per cycle. Both the Ibox and Mbox return data on the IPR_RD bus in cycle E3A. Compaq Confidential 6-32 Integer Execution Unit - the Ebox 5 Janwiry 2001 - Subject To Cfumge External Interfaces: lbox, Qbox, Pbox~ Mbox, Register File, Fbox The Ebox drives the pipe 0 (PO) signals from partition EA, the Pl signals from partition EB and the P2 and P3 signals alternately from partitions EC and ED. The P2 and P3 signals are therefore outputs of the EY partition indicating there were multiple driving partitions in the Ebox. 6.8.5 Register File The Register File sources operands that are not currently in the register cache. Each instruction issued can take up to two operands and eight instructions can be issued at once for a total of 16 operands per cycle. Because of the way CMOV instructions are split into two instructions, each operand is actually 66 bits instead of the expected 64 bits; one extra bit is used to store the intermediate result for CMOV instructions. In addition to the CMOV condition bit, poison status is stored in the register file with the data. Poison is only sent to and received from the Ebox. Although the Ebox bas more than eight functional units and instructions can complete in different amounts of time, the register caches equalize the instruction latencies so no more than eight results will ever be generated to the Register File in any given cycle. 6.8.6 Fbox The Ebox and Fbox directly exchange data relating to floating store, ltoF, and FtoI operations. 
For floating store operations, the Fbox sources the store data whenever the operand is resident in the Fbox register caches. Since the Ebox owns the final multiplexing of store data to the Mbox, it is responsible for forwarding floating store data located in the register file through the same datapaths used to send integer store data. This eliminates the need to route the store pipe operand from the Register File to the Fbox. FtoI operations work just like floating stores to the Fbox. Instead of sending the data to the Mbox, the Ebox pushes the value back through the multi-media result busses into the Ebox register cache. FtoI format conversion is handled by the same logic that converts floating store data sent to memory. ItoF data is format converted by the Ebox and sent to the Fbox register caches. The data is also pushed back through the multi-media result busses, into the Ebox register caches and eventually to the register file. This is necessary since the Fbox does not have a result path back to the main register file for these functional units. 6.8.7 Global The intention is to clock the Ebox primarily off GCLK+2. Minimal thought has been put into reset requirements and there have minimal discussions about test requirements for the Ebox. Compaq Confidential 5 January 2001 --· Subject To Change Integer Execution Unit - the Ebox 6-33 IPRs 6.9 IPRs The Ebox needs access to three IPR fields. The B_ENDIAN bit of the Mbox VA_CTL IPR and the KERNEL_MODE and FP_ENABLE fields of the Ibox Process Context IPR. The B_ENDIAN bit is used by the Ebox in computing the virtual address for load and store instructions as well controlling the byte extract, insert and mask instructions. The Mbox will supply a per-TPU vector shadowing the committed state. The KERNEL_MODE field is used to detect threads with insufficient privilege to execute a CALL_PAL instruction and flag these instructions as illegal. The Ibox will decode the current process context IPR and drive a per-TPU structure that shadows the committed state. The FP_ENABLE field is used to force a trap whenever a floating-point instruction is executed. Software uses this bit during process context switches to detect the need to save or restore the floating-point registers. 6.10 Exceptions The Ebox reports instruction status back to the Qbox for each executing pipeline. Prioritization, reporting and any other exception based actions are left to the Qbox. In general the Ebox does not stall or take any special action in the presence of an exception event. There are several types of exceptions reported by the Ebox: Table 6-16 Exceptions Reported by the Ebox Exception Description EQ%ADD_OVERFLOW_ElA_H<7:0> Integer add/subtract operation overflowed Ep%MUL_OVERFLOW_E4A_H Integer multiply operation overflowed EQ%ILLEGAL_INST_E1A_H<7:0> Illegal opcode or function code issued Ep%Px_BAD_VA_ALIGN_E1A_H Address Alignment error Ep%Px_LD_PAR_ERROR_E4A_H A parity error was detected on a memory load from the Dcache. EQ%CBR_MISPREDICT_E1A_H<7:0> Branch prediction was incorrect EQ%0P{A,B }_POISON_E1A_H<7:0> The operand to this instruction was poisoned EQ%DRAINT_INST_E1A_H<5:4> EQ%MTFPCR_INST_ElA_H<5:4> An IFETCHB or a non-PAL-mode MT_FPCR instruction was issued to the pipe. The Ebox also decodes each instruction and detects the cases where exception status is either known or guaranteed to be available early and the instruction can be retired early. If late status can occur, like with MULL/V or many Fbox instructions, the Qbox must delay retirement. 
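A minimal sketch of that early/late split, using only the cycle suffixes of the exception signals in Table 6-16 (E1A versus E4A), is shown below. The enum and function are invented for illustration; they are not part of the design.

/*
 * Illustrative sketch only: classify an issued Ebox instruction as eligible
 * for early retirement based on when its exception status can be known,
 * following the cycle suffixes in Table 6-16 (E1A vs. E4A signals).
 * The enum and function names are invented for this example.
 */
#include <stdbool.h>

enum ebox_op {
    OP_ADD_SUB,      /* add/subtract overflow known at E1A                 */
    OP_CBR,          /* mispredict known at E1A                            */
    OP_LOGICAL,      /* no late exceptions                                 */
    OP_MULL_V,       /* multiply overflow not known until E4A              */
    OP_LOAD          /* Dcache parity error not known until E4A            */
};

/* true  -> the Qbox may retire as soon as the E1A status arrives
 * false -> the Qbox must hold retirement until the E4A status arrives     */
bool early_retire_ok(enum ebox_op op)
{
    switch (op) {
    case OP_MULL_V:
    case OP_LOAD:
        return false;          /* late status possible: delay retirement   */
    default:
        return true;           /* status known (or guaranteed) early       */
    }
}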
Early retirement frees-up resources in the Qbox allowing more instructions to enter the queue earlier. 6-34 Compaq Confide11tial Integer Execution Unit - the Ebox 5 Jc1nw~ry 2001 ·- Subject To Change Poisoned Data There are two classes of illegal instructions, reserved opcode/function combinations and insufficient privilege. The Ebox decodes the following cases as reserved opcode exceptions: Table 6-17 Ebox Reserved Opcode Exceptions Opcode Function 00 00 - 3F and not in Kernel Mode 40 - 7F > BF 01 -06 All 07 Codes not defined by SIMD FP extension All when FP_ENABLE is not set. 14 Codes not defined in SRM V7.0 Codes 14.xx8 through 14.xxF when FP_ENABLE is not set. 15 All when FP_ENABLE is not set. 16 Code<5:4> = 012 OR Code<8:5> = 11012 All when FP_ENABLE is not set. 17 All when FP_ENABLE is not set. lC Codes not defined in SRM V7.0 or the 21464MVI extensions 19,lB,lD,lE,lF All when not in PAL mode 20-27 All when FP_ENABLE is not set. The Fbox exception information although more complex, is also driven in cycle E4 (F4). Multiplexing E and F box exceptions is a task left to the P/Qbox. 6.11 Poisoned Data Poisoned is the term given to a value that is the product of a load-miss. With each load operation, the Mbox returns a status bit indicating if the address hit in the cache or a queue. Data returned for operations that do not 'hit' is garbage and care must be taken to ensure that this data does not alter program state. The method used is to tag each data word with a poison bit. Poison is contagious so any future product of poisoned data is also poisoned. This includes instruction results, CBR mispredict signals, load addresses, store data, target PCs of jump instructions, etc. Eventually, all instructions issued in the shadow of a load miss will be replayed and the results of those instructions will be overwritten in the register file. Poisoning store data and load addresses protects bad data from entering the memory system and factoring poison into jump addresses or CBR mispredict signals prevents the predictors in the Ibox from training against false data. To the Ebox, maintaining poison state simply involves ORing the poison status bits from the instruction inputs. The process is complicated slightly because the poison status bit is returned from the Mbox later than the load data, but it is still early enough to catch-up to any inflight instruction. 5 January 2001 ··· Subject To Change Compaq Confidential Integer Execution Unit - the Ebox 6-35 Format Conversions 6.12 Format Conversions Traditionally the only type of data formatting the Ebox handled with was sign or zero extension of data loaded from memory. In the 21464, the Mbox performs the sign/zero extensions, but the Fbox does not have a path to the register file for data loaded from memory, so the Ebox must handle conversion of floating-point load data. The recently added FTOI and ITOF instructions also define data format conversions. These instructions allow for direct movement of data between Integer and Floating-point registers. 
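As one concrete example of such a conversion, the C sketch below shows the single-precision exponent extension from 8 to 11 bits that a floating load or ITOF of S-format data implies. The bit-level rule shown is the standard Alpha S-to-register-format mapping and is an assumption here; this chapter does not spell it out, and denormal operands are simply collapsed to zero since the Fbox does not process denormals.

/*
 * Sketch of the single-precision (S-format) exponent extension implied by
 * the ItoF and floating-load conversions above: the 8-bit memory-format
 * exponent becomes the 11-bit register-format exponent. This follows the
 * standard Alpha S-to-register mapping; it is an assumption here, since the
 * spec does not spell out the bit-level rule, and denormals are not handled.
 */
#include <stdint.h>

uint64_t s_to_register_format(uint32_t s)           /* IEEE single bit pattern */
{
    uint64_t sign = (uint64_t)(s >> 31) & 1;
    uint32_t exp8 = (s >> 23) & 0xFF;
    uint64_t frac = s & 0x7FFFFF;                    /* 23 fraction bits        */
    uint64_t exp11;

    if (exp8 == 0xFF)      exp11 = 0x7FF;            /* Inf/NaN: all ones       */
    else if (exp8 == 0)    exp11 = 0;                /* zero (denormals ignored)*/
    else if (exp8 & 0x80)  exp11 = 0x400 | (exp8 & 0x7F);
    else                   exp11 = 0x380 | (exp8 & 0x7F);

    /* 64-bit register format: sign, 11-bit exponent, fraction left-justified */
    return (sign << 63) | (exp11 << 52) | (frac << 29);
}

Table 6-18 below lists which unit owns each conversion path.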
Table 6-18 Ebox/Fbox/Mbox Data Conversion Matrix

SRC    DST    Description
Mbox   Ebox   Integer Load
Mbox   Fbox   Floating Load
Ebox   Mbox   Integer Store
Fbox   Mbox   Floating Store
Ebox   Fbox   ITOF Instruction
Fbox   Ebox   FTOI Instruction

7 Register File

Although the Alpha architecture only defines 64 registers, the 21464 is a multithreaded, out-of-order machine that requires many more than just 64 registers to keep its pipelines full. The four independent threads require 64 registers each, and an additional 256 temporary registers are used to rename the registers of inflight instructions to eliminate write-after-read and write-after-write conflicts. At 65 bits per entry, 512 entries amount to a 4KB register file. Eight parallel execution units can consume up to 16 source operands and can produce up to eight results per cycle. Although implementing 32K 'not-so-little' RAM cells with 16 read and 8 write ports each is not trivial, defining a register file with fewer than 16 read or 8 write ports would create many other problems. The Qbox would either be forced to issue instructions based on the number of operands needed from the register file, or trap whenever the set of issued instructions needed more than the available number of ports. Brute force was deemed preferable to further complicating instruction picking or sacrificing performance to traps, so the current Register File has a full 16 read and 8 write ports.

Internally, the Register File is structured as two identical 512-entry register groups, each with eight read and eight write ports. Each group services half the operands needed. Coherency is maintained by writing both groups at the same time. To keep the physical structures more controllable, each group is further partitioned into two 256-entry banks where the high-order address bit serves as a bank select.

Figure 7-1 Register File Block Diagram (four banks, BANK 0 through BANK 3, each 256 entries x 65 bits, with result selection between Ebox and Fbox results on the write side and operand distribution to the Ebox and Fbox on the read side)

7.1 Test Structures

7.1.1 Timing

Cycle mnemonics are used throughout this document to identify the relative timing of signals. The following table identifies the cycle relationships assumed by this document.

Qbox:       Q1  Q2  Q3  Q4  Q5  Q6  Q7  Q8
Reg. File:              R0  R1  R2  R3  R4  R5  R6  R7  R8  Rw  R1
Ebox:                                   E0  E1  E2  E3  E4
Fbox:                                   F0  F1  F2  F3  F4
Mbox:                                       M0  M1  M2

Each cycle is further subdivided into two phases; the first half of a cycle is the 'A' phase and the second half of a cycle is the 'B' phase. A timing specification R1A refers to the first phase of cycle R1. All timing references in this spec refer to the latch that launched the data; when significant transit time may be involved, that time or an expected arrival time will be separately stated.

7.1.2 Read Timing

Table 7-1 shows the Register File read timing.

Table 7-1 Register File Read Timing
  Q3          Qbox: lookup source
  Q4 (R0)     Qbox: drive source pointers (Q4A latch)
  Q5 (R1)     Register File: receive and decode
  Q6 (R2)     Register File: read and bank mux
  Q7 (R3)     Register File: drive operands to Ebox and Fbox (A phase); Ebox/Fbox: bypass onto operand bus (B phase)
  Q8 (E0/F0)  Ebox/Fbox: execute

7.1.3 Write/Read Timing

Table 7-2 shows the Register File write/read timing.
Table 7-2 Register File Write/Read Timing Q13 R9 E4 F4 A Registry File l B Q13 Rw E5 F5 A R1 E6 F6 B Internally distrib- Decode/ Write ute mux & write Mux control A Decode R2 E7 F7 8 A Read Bank Mux R3 B B A Drive to Ebox & Fbox Drive Results Bypass Opbus Drive Results Bypass Opbus Ebox Fbox 7.2 External Interfaces 7.2.1 Qbox to Register File Interface The Register File is controlled completely by the Qbox. As soon as the set of instructions to execute next is known, the Qbox sends the set of source operand pointers and the Register File begins the lookup process. Since there are often situations where fewer Compaq Confidential 5 January 2001 ·- Subject To Change Register File 7-3 External Interfaces than eight instructions are picked, some of the instructions need fewer than two operands or some of the operands map to architectural registers R31 or F3 l, a valid bit is also passed with each source operand pointer. Operand source pointers and valid flags are driven from a Q4A latch in the Qbox, spend a cycle in transit and are received by the Register File in an Rl A latch. Most instructions that issue eventually return result to the Register File. The Qbox supplies a set of destination physical register numbers so the Register File knows where to write results. The Register File must also know 'if' it should write a result. When fewer than eight instructions are issued or instructions are issued that either do not write a result or write registers R31 or F3 l, the Register File must be prevented from trashing valid physical register contents. A valid bit is also provided with each destination pointer to disable updates. To kick-off the decoding as early as possible, the Register File needs the write control information before the actual result data. The Qbox sends the write control signals from a Q5A latch; a cycle is spent in transit before being received by the Register File into an R2A latch. The Register File then pipes the data along for seven cycles before decoding for the write. Placing the FIFO in the Register File was convenient for the Qbox and allows the Register File to push the distribution delay back into the FIFO stages. The Register File receives separate result busses from the Ebox and Fbox and merges them into a common result stream since the instruction picking and result caching guarantees that there are no conflicts. The Qbox supplies a control bit for each of the four Fbox result pipelines that are shared with Ebox results to indicate which result is valid. This bit is sent with the other write control information from a Q5A latch. Since the Register File can do no wrong, there is no need for any status or return information. 7.2.2 Ebox to Register File Interface The Ebox has eight execution pipes; each pipe requires two input operands and produces a single result. The operand and result vectors are directly mapped such that the A operand for picker N is simply the Ebox_OPA[N] and the B operand is Ebox_OPB [N]. 7 .2.3 Fbox to Register File Interface The Fbox only has four execution pipelines corresponding to pickers 0, 1, 2 and 3. The Fbox also has store pipes on pickers 4 and 5, but the Ebox forwards floating-store data to the Mbox whenever the operand is in the Register File eliminating the need to forward the extra two operands to the Fbox. The Register File latches the corresponding Ebox operands and sends them to the Fbox from an R3A latch. 7.2.4 Global Register File Interface The intention is to clock the Register File primarily off MAC+2. 
Minimal thought has been put into reset requirements.

8 Floating-Point Execution Units - the Fbox

The Fbox executes all Alpha floating-point instructions, in addition to the new paired single-precision instructions. It receives instructions from the Qbox via the Ebox, and operands from the Register File, the Load Data buses (up to three), or its own Register Caches. The Fbox returns floating-point results to the Register File and floating-point store data to the Mbox, again via the Ebox. The Fbox returns exception information to the Qbox.

The Fbox is organized as four identical clusters, each cluster consisting of one execution pipeline. The four pipelines, referred to as F_P0, F_P1, F_P2, and F_P3, allow up to four floating-point operate instructions to be issued each cycle. Two copies of a register cache, one for each set of two pipelines, are included to allow the results of recently completed instructions to be used with minimal delay. Each pipeline contains the functional units needed to execute the various floating-point instructions. The functional units, their latencies, and the instructions they execute are shown in Table 8-1. Figure 8-1 shows a high-level Fbox block diagram.

Table 8-1 Fbox Pipeline Functional Units, Instructions, and Latencies

Functional Unit        Instructions                                              Latency
Graphics ADD: F_GAD    Paired SP except PMUL, PARCPL, and PARSQRT                4 cycles
Graphics MUL: F_GML    Paired SP MUL type instructions: PMUL, PARCPL, PARSQRT    3 cycles
Mul unit: F_MUL        MUL                                                       3 cycles
Divider: F_DIV         DIV                                                       13 cycles - double precision
                                                                                 8 cycles - single precision
Square root: F_SQR     SQRT                                                      33 cycles - double precision
                                                                                 18 cycles - single precision
Add pipe 1: F_AP1      ADD, SUB, CMP                                             3 cycles
Add pipe 2: F_AP2      ADD/SUB (align > 1), CVTff, CVTfq, CVTqf, CVTql, CVTlq    3 cycles
Short pipe: F_SHP      CPYSx, FCMOV, FBxx                                        1 cycle
                       Special operands (zeros, denormal OPD, NaNs, INF,
                       RES.OPD), input exceptions, Mx_FPCR                       3 cycles

NOTE: The F_SHP unit can supply a result for CPYSx, FCMOV, and FBxx instructions in one cycle. This pipeline is also used to compute results for all non-finite operands such as denormals, NaNs, and infinity, as well as zero operands. The F_SHP pipeline also detects all input exceptions and supplies the appropriate result.

Figure 8-1 Fbox Organization (block diagram: four clusters, F_P0 through F_P3, each containing F_GAD, F_GML, F_MUL, F_DIV, F_SQR, F_AP1, F_AP2, and F_SHP (+FPCR) functional units; two register caches, one shared by each pair of clusters; RF_RD[7:0]<64:0> operand buses and RF_WR[3:0]<64:0> result buses to and from the Register File; LD_DAT[2:0]<63:0> load data buses; Px_OP[1:0]<64:0> pipeline operand buses; and xclstr cross-cluster buses between clusters)
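The special-operand cases that Table 8-1 and the note above assign to F_SHP amount to classifying each incoming operand by its exponent and fraction fields. The following C sketch shows that classification for IEEE T-format data; it is an illustration of the idea, not the F_SHP decode logic, and VAX reserved-operand detection is omitted.

/*
 * Sketch of the operand classification implied by the F_SHP entry in
 * Table 8-1: zeros, denormals, NaNs, and infinities are routed to the short
 * pipe. The classification uses the IEEE T-format encoding; it is an
 * illustration only, not the actual F_SHP decode.
 */
#include <stdint.h>

enum fp_class { FP_NORMAL, FP_ZERO, FP_DENORMAL, FP_INFINITY, FP_NAN };

enum fp_class classify_t_format(uint64_t v)
{
    uint64_t exp  = (v >> 52) & 0x7FF;
    uint64_t frac = v & 0xFFFFFFFFFFFFFull;          /* 52 fraction bits */

    if (exp == 0)     return frac ? FP_DENORMAL : FP_ZERO;
    if (exp == 0x7FF) return frac ? FP_NAN      : FP_INFINITY;
    return FP_NORMAL;
}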
8.1 Major Sections

The Fbox consists of an interface section and four pipelines organized as four clusters. Each of the pipelines has several functional units. The following sections describe each of these units, and the last section describes the instruction flows for each of the floating-point instructions.

8.2 Interface Section

The Interface section is responsible for communications between the Fbox and the rest of the chip, and for internal communications between the four Fbox pipelines and two register caches.

8.2.1 External Interface

The Fbox Interface can receive incoming operand data from the Register File or the Mbox, and instructions from the Qbox, both of which it transmits to any of the four Fbox pipelines (F_P0, F_P1, F_P2, and F_P3). The instructions and load operands from the Mbox are piped through the Ebox before reaching the Fbox. The interface is subdivided into the following three subsections:

• Register Cache (F_RGC) - contains staging logic and static RAM which latch and hold recently generated result data of the Fbox pipelines as well as copies of incoming floating-point loads. The result data is eventually dispatched to the Register File. However, this result and load data can be used in subsequent floating-point operations without incurring the transit time delay in returning data from the Register File.

• Operand Steering Unit (F_OSU) - performs comparisons against incoming Physical Register (Preg) numbers to determine the source of input operands to the Fbox pipelines.

• Interface Control (F_INT) - performs a partial decode of the opcode, function code, and thread processor unit (TPU) to determine if a valid floating-point instruction has been issued. It also contains logic which allows direct access to internal operand buses from Register File operand buses, and logic to dispatch floating-point store data to the Ebox from either the result data of Fbox pipelines F_P0 and F_P1, or from the register cache.

8.2.2 Qbox Timing to Fbox

Floating-point instructions are issued by the Qbox, which transmits the opcode, function code, thread select information, and source and destination physical register numbers to the Fbox Interface. The Preg numbers go to the Operand Steering Unit (OSU) to control operand bypassing and reads from the register cache. The thread select information controls updates of the Floating-Point Control Registers (FPCR), and is used by the Interface Control to determine if a valid instruction has been issued.

Source Preg numbers are transmitted from the Qbox to the Ebox in Q4 (same as FW), are latched and travel through the Ebox in Q5 (FX), and reach the Fbox and are latched in Q6 (FY), to begin comparisons in the F_OSU in that B phase. The destination Preg numbers are dispatched by the Qbox a cycle later in Q5, travel through the Ebox in Q6, and are latched in the F_OSU in Q7 (FZ). The opcode, function code, and thread select information leave the Qbox in cycle Q5 and travel through the Ebox to be latched near the Fbox in cycle Q6 (FY). This allows approximately 1 1/2 cycles for internal Fbox routing and decoding prior to execution in cycle F0.

The Fbox VCU returns exception information back to the Qbox. The Fbox VCU returns branch mispredict signals in F0B, to arrive at the Pbox in F2A.
The following diagram illustrates the timing of the Fbox pipelines. Tables describing the Fbox /Qbox interfaces are also shown. 8.2.3 Fbox Pipeline Timing Table 8-2 shows the operation of a single Fbox pipe with all operands coming from the Table 8-2 Operation of a Single Fbox Pipe Cycle 1 Q5 Q6 Rl R2 FX FY SRC OPC PREG INFO 2 3 all Operands From Register File 4 5 6 7 8 9 10 QO Ql Q2 Q3 Q4 Q5 Q6 RW Rl R2 R3 FZ 11 12 13 R3 EO El E2 E3 E4 EO El FO Fl F2 F3 XMT FO Fl MO Ml M2 RF Fbox EXECU'IE DATA RES BACK RES TORF Register File. 8.2.4 Register File/Operand Bus Input operands for issued floating-point instructions can be supplied without delay from another functional unit in the same pipe, with a one cycle delay from a functional unit in another cluster, from the register cache, or from the Register File. The Register File supplies up to eight operands per cycle, corresponding to the maximum issue rate of four floating-point instructions per cycle. Up to four results are returned to the Register File per cycle, one for each Fbox pipe. Operands from the Register File are input to the Fbox pipelines on differential, lowswing, operand buses. These busses are also used for bypassing results from other functional units within the same cluster, results from other clusters, and incoming load data. They are also used to transfer operands to the functional units from the register cache. The operand buses begin evaluation at the start of the B-phase (FZB). The rising edge of clock at the start of the following A-phase is used to sense the differential data on the operand bus, while the operand buses pre-charge in the same A-phase in preparation for a new transaction in the following B-phase. Source operands from either the register cache or the Register File must be valid at the input to the Fbox pipelines by phase R3B (FZB ), one phase before instruction execution begins in FOA. The output result buses are sent back to the Register File early in phase F4A. 8-4 Compaq Confidential Floating-Point Execution Units - the Fbox 5 Jc1nuary 2001 m Subject To Change Interface Section 8.2.5 Loads/Stores to/from Fbox Fbox doesn't have direct access to store or load data moving to and from the memory hierarchy and the Register File. The Ebox is responsible transmitting floating-point load data to the Fbox from Mbox. The Mbox is expected to re-align the load such that the sign bit and exponent are contained in the most significant byte of the data quadword for VAX style floating point formats, as is already the case for IEEE style formats. The only portion of the floating point format the Fbox is responsible for is extending the eight bit exponent of single precision data to eleven bits. Three dedicated load busses are used to transfer up to three loads to the Fbox each cycle. The load data can be bypassed directly onto the operand buses, and they are also written to the register cache for later use. Floating point load operands are generated and transmitted to the Fbox with a four cycle latency, one cycle longer than for the Ebox, and arrive near the Fbox Interface at the beginning of phase F3A, to complete formatting for a potential bypass to an internal operand bus in phase F3B. When a load misses in the cache or the Mbox queue, the data returned to the Fbox is not correct, and the operation using the load data itself is retried at a later time when the load data is ready. 
To insure that the load data as a result of miss does not corrupt rest of the processor state, the Mbox supplies a bit called poison bit a few gate delays after the load data. This poison state for floating point loads is maintained in the Ebox. The logical state of the poison bit in a resulting operation is maintained by a logical OR of the poison bits of both input operands. If the source of floating-point store data originates from the Fbox, it is forwarded to the Ebox, which is responsible for formatting the store data and transmitting it to the Mbox. The Fbox can deliver up to two store results per cycle, each on a dedicated bus. Each store bus can source data directly from the result buses of one of the four Fbox pipes, either F_PO or F_Pl, or the register cache. A previous result from any of the four Fbox pipes or three floating-point load pipes located in the register cache can be the data source on either store bus. There are also direct connections between the load and store buses in the Fbox. A floating-point load that is the source operand for a store the following cycle is allowable without first having to access the register cache. The floating point store data buses are driven to the Ebox in FOA. The Ebox dispatches the data for integer-to-floating-point convert instructions [ITOFx] to the Fbox as if it were a floating-point load, over the weak load data bus. Fbox is responsible sign extension of the exponent as in the case of a normal floating-point load operation, before use in the Fbox pipelines. Ebox is responsible for sign extension of the exponent before transmitting ItoF data to the Register File. The Qbox can issue only a single ItoF instruction per cycle. Floating-point-to-integer (FtoI) convert instructions [FTOix] operate in a similar manner to floating-point stores. Two FtoI instructions can be issued per cycle by the Qbox, and only in place of stores. Therefore, Ebox asserts control information to Fbox to indicate a floating-point store, whether it is a store or a FtoI instruction, and the Fbox Interface is not aware of the distinction. As with stores, Ebox is responsible for conversion of the data. 5 January 2001 -· Subject To Change Compaq Confidential Floating-Point Execution Units - the Fbox 8-5 Interface Section Table 8-3 shows a timing diagram for load data. Table 8-3 Timing for Load Data Cycle 1 Q5 Q6 OPC INFO 4 5 6 7 8 9 10 QO Ql Q2 Q3 Q4 Q5 Q6 EO El E2 EO El FOLD Fl F2 F3 FO MO Ml M2 OPC INFO LOAD DATA 3 2 RF DATA Fl Floating-point instruction using LD data Table 8-4 Pipeline Stages of Fbox Register Cache Cycle 1 2 3 4 5 6 7 8 9 10 11 12 13 Q5 Q6 Q7 QS Q9 QO Ql Q2 Q3 Q4 Q5 Q6 Q7 QS Rl R2 R3 R4 R5 R6 R7 RS R9 Rwrt Rl R2 R3 R4 EX EY EZ EO El E2 E3 E4 E5 E6 E7 ES EZ EO FX FY FZ FO Fl F2 F3 F4 F5 F6 F7 F8 FZ FO Frgc Stages lCyc Latency Byp Stgl Stg2 Stg3 Stg4/ RcO Rel Rc2 Rc3 Rc4 RgFl Byp 3Cyc Latency Byp Stg3 Stg4/ RcO Rel Rc2 Rc3 Rc4 RgFl Byp 4Cyc Latency Byp Stg4/ RcO Rel Rc2 Rc3 Rc4 RgFl Byp 8.2.6 Register Cache {F_RGC) The Fbox Register Cache is used to store local copies of recent results from all four Fbox pipes, as well as incoming loads forwarded from the Ebox. The stored data can be used as source operands in subsequent floating-point operations without waiting for the data to traverse the round trip to and from the Register File. The register cache is subdivided into two separate structures. 
First the staging logic is used to equalize the cycle latency of the result data to match the longest execution time of the functional units in the Fbox pipes. The execution times of the functional units in each Fbox pipe are one, three, and four cycles, with a result bus required for each different latency. The staging logic serializes the results from three data buses to one after four cycles. This single result is forwarded to the Register File and to the static RAM, the final data structure of the Register Cache. There are seven sets of staging logic in each copy of the Register Cache, one set for each of four Fbox pipes and three floating-point load pipes. A single SRAM structure per copy of the Register Cache is organized physically into an array of 35 rows or entries, each 65 bits wide, consisting of 64 data bits and a single predicate bit for the FCMOVx instructions. Logically the SRAM is organized into seven banks of five entries each for the four Fbox pipelines and three floating load pipelines. 8-6 Compaq Confidential Floating-Point Execution Units - the Fbox 5 Jc1mJc1ry 2001 ··· Subject To Change Interface Section The Register Cache will hold result data for a window in time that varies from five to eight clock cycles, depending on the execution time of the floating-point operation. After this the result data is available in the Register File and it is dropped from the Register Cache. Floating-point load data is held in the Register Cache for five cycles. There are two copies of the register cache, one each for two pipes (or clusters) in the Fbox. Each copy of the register cache is an exact duplicate of the other. The staging logic in the register cache consists of level sensitive d-type latches and 2-to-1 multiplexers. It can be viewed logically as a shift register, with the ability to transfer result data from each stage to the internal operand buses for use in subsequent operations. Each entry in the SRAM is also capable of transferring stored result data to the operand bus. The result of a floating-point operation with a one cycle execution time in an Fbox pipe must be latched and held in the Register Cache for eight cycles until the result data is available in the Register File, as shown in the timing diagram above. The staging logic holds this result for four cycles (Stgl, Stg2, Stg3, and Stg4), and the SRAM holds the result for five cycles (RcO, Rel, Rc2, Rc3, and Rc4), with one overlapping stage between the staging logic and the SRAM (Stg4 and RcO). This A-phase latch in the staging logic is used chiefly to hold the result data valid to ensure it is written successfully to the SRAM, but it also helps alleviate a critical timing path. In the event that result data is accessed or read from the Register Cache the same cycle it is written into the SRAM, the hold latch in the staging logic is used to transfer the result to the operand bus instead of the SRAM. This prevents a read after write access to the same SRAM entry within a single cycle. The result of a three cycle floating-point operation is held in the Register Cache for six cycles before it is dropped, two cycles in the staging logic (Stg3 and Stg4), and again five cycle in the SRAM, with one overlapping stage. A result of a four cycle flop is held in the Register Cache for five cycles, which is also the case for floating-point load data. Figure 2 is a diagram of one copy of a Fbox Register Cache. 
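The latency-equalization behavior described above can be summarized in a short behavioral sketch: results with 1-, 3-, or 4-cycle latency enter the staging chain at Stg1, Stg3, or Stg4 respectively, so every result reaches Stg4 the same number of cycles after issue and the SRAM sees a single serialized write stream per pipe. The C model below is illustrative only (no phases, no read ports, invented names); collisions cannot occur because a pipe issues at most one instruction per cycle.

/*
 * Behavioral sketch (not RTL) of the staging logic described above: results
 * produced with 1-, 3-, or 4-cycle latency enter the shift chain at
 * different stages so that every result leaves Stg4 a fixed number of
 * cycles after issue, giving the SRAM a single serialized write per pipe.
 * Structure and names are illustrative.
 */
#include <stdint.h>
#include <stdbool.h>

typedef struct { uint64_t data; bool valid; } stage_t;

typedef struct { stage_t stg[4]; } staging_t;        /* Stg1..Stg4 for one pipe */

/* Advance one cycle: returns the value leaving Stg4 (to the SRAM and the
 * Register File).  Because one pipe issues at most one instruction per
 * cycle, a newly arriving 3- or 4-cycle result and a value shifting in from
 * the previous stage can never both be valid in the same cycle. */
stage_t staging_tick(staging_t *s,
                     stage_t res_1cyc,               /* enters at Stg1 */
                     stage_t res_3cyc,               /* enters at Stg3 */
                     stage_t res_4cyc)               /* enters at Stg4 */
{
    stage_t out = s->stg[3];

    s->stg[3] = res_4cyc.valid ? res_4cyc : s->stg[2];
    s->stg[2] = res_3cyc.valid ? res_3cyc : s->stg[1];
    s->stg[1] = s->stg[0];
    s->stg[0] = res_1cyc;

    return out;                                      /* at most one valid result */
}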
Each of the four Fbox pipelines can produce a maximum of three results per cycle, one for each flop execution latency, plus up to three floating-point loads must be stored each cycle. Therefore, each copy of the Register Cache must latch and hold up to fifteen results and loads per cycle. Since this data first enters the staging logic, which serializes the data to one result from each of seven pipes, the SRAM receiving this data requires only a single write port, though with seven separate inputs. Each Register Cache can source up to five operands, requiring five read ports in both the SRAM and staging logic. The pair of Fbox pipelines associated with each Register Cache copy require two operands each for a total of four, while the fifth read port sources one of two floating-point store buses driven to the Ebox. Each Register Cache independently supplies store data, thereby supporting two floating-point stores per cycle.

Register Cache entries are read in the B-phase, and they directly drive the operand bus in the same phase. Writes occur in the A-phase of the cycle immediately following the arrival of the results from the functional units (also driven during the B-phase). Because results from each pipe are written into both copies of the register cache, the register cache serves as the transfer point for cross-cluster data between pipes (clusters) in the Fbox. A one-cycle delay penalty is imposed for bypassing results of one cluster to any other cluster.

Figure 8-2 Register Cache (one copy: per-pipe staging logic for the 1-, 3-, and 4-cycle results of pipes P0 through P3 and for the floating-point load pipes, feeding a 35-entry, 7-bank SRAM with a single write port and five read ports; the read ports drive the OP A and OP B operand buses and the ST_DATA0/ST_DATA1 store buses, and the serialized results drive RF_WR[3:0] back to the Register File)

8.2.7 The Operand Steering Unit (F_OSU)

The structure of the Operand Steering Unit is analogous to the register cache described above. It consists of a ten-bit control datapath which stores the destination Physical Register (Preg) numbers, and is organized into shift registers that feed into a content addressable memory (CAM). This corresponds to the staging logic and static RAM of the register cache. There are two copies of the OSU, one for each copy of the Register Cache. Each copy of the OSU contains seven sets of shift registers, one set for each of the four Fbox pipes or three floating-point load pipes. Each set of OSU shift registers controlling read access to a set of staging logic in the Register Cache for an Fbox pipe contains four stages of registers, and in some cases a fifth stage to handle local bypasses of
Each set of OSU shift registers controlling read access to a set of staging logic in the Register Cache for an Fbox pipe contains four stages of registers, and in some cases a fifth stage to handle local bypasses of 8-8 Compaq Confidential Floating-Point Execution Units - the Fbox 5 Jc1nuc1ry 2001 m Subject To Change Interface Section result data in an Fbox pipe. These pipeline stages are named as Byp, Stgl, Stg2, Stg3, and Stg4 in the above timing diagram and correspond to cycles FO, Fl, F2, F3, and F4 in the Fbox pipeline. Only two stages of shift registers are necessary for controlling the staging logic of a floating load pipe in the Register Cache. Each copy of the OSU also contains a single CAM physically organized into one array of 35 entries that are 10 bits wide. Logically the CAM is organized into seven banks of five entries each, one ban:k for each of the seven Fbox and floating-point load pipes. The structure is identical to the SRAM in the Register Cache. At each shift register stage and entry of the OSU a nine-bit exclusive or (XOR) is performed to compare incoming source Physical Register numbers to the stored destination Preg numbers every cycle. The tenth bit or valid bit of both source and destination Preg numbers is logically anded to validate the comparison. Typically there are five XOR's per stage of shift registers or entry in the CAM, one for each read port of the Register Cache (the bypass stage has only three). A match in an XOR of a shift register stage or CAM entry of the OSU indicates that the equivalent stage of the staging logic or SRAM entry of the Register Cache is the source for an input operand to one of the Fbox pipes or store buses. As a consequence of the hit in the OSU, result or load data is transferred to the appropriate operand bus in the Register Cache. A hit in the XOR of the bypass shift register stage of the OSU indicates the source of an input operand to an Fbox pipe is the result of a functional unit in the same Fbox pipe. The result data is bypassed locally to the operand bus, without incurring any delay passing through the Register Cache. If there is no hit in the shift registers or CAM of the OSU, source operands to the Fbox pipelines are supplied from the Register File. The implementation of the internal Fbox operands as low-swing differential buses constitute a large distributed multiplexer with connections to one Register Cache copy, two Fbox pipes, and a store bus. If the Qbox should issue identical destination Physical Register numbers too frequently, these are loaded into the F _OSU and would cause multiple XOR matches in this logic. The consequence of this is multiple sources driving the operand buses, causing invalid data and indeterminate results in the Fbox pipelines. This problem must be prevented at the architectural level by ensuring the Qbox can issue a destination Preg number only once within a nine cycle window of time that it would remain in the F_OSU. In other words, once the Qbox issues a destination Preg number to the F_OSU, it can't be issued again for another ten cycles. The valid bits of source or destination Physical Register numbers have several uses in the F_OSU (the most significant bit). If a source Preg number is not valid in an otherwise valid floating-point operation, this indicates that architectural register F31 is intended as the source operand, and the internal operand bus is grounded. 
Should the Qbox issue an non-pipelined or bubble operation to an Fbox pipe such as a divide or square root, it is presently required to issue control information (opcode, function code, tpu, Pregs) to the Fbox twice, the second time once the operation is nearly completed in the Fbox. To ensure that the architectural constraint described above is met, the first time the Qbox issues an non-pipelined operation to the Fbox, the destination Preg number must be invalidated. The second time the instruction for the same bubble operation is issued, the source Preg numbers should be invalidated. 5 January 2001 ···Subject To Change Compaq Confidential Floating-Point Execution Units - the Fbox 8-9 Interface Section 8.2.8 Interface Control (F_INT) The Interface Control does a partial decode of the opcode, function code and thread processor unit (tpu) passed to it each cycle by the Ebox. It performs this function to determine if a valid floating-point operation (flop) has been issued, and ignores any integer instructions that have been issued. (The Ebox indicates via separate control signals to F_INT whether any floating-point loads or stores have been issued). The instruction decode is also used to determine the execution time of the flop, either one, three, or four cycles, and also to detect non-pipelined or bubble operations, such as floating-point divide or square root operations. This section also takes in control information from the F _OSU indicating whether any successful comparisons have occurred between source and destination Preg numbers. If not, data from the Register File operand buses is transferred directly to the internal operand buses of the Fbox. The F_INT also contains multiplexers to transfer result data from an Fbox pipe to the store bus in the event of a store bypass. This would be indicated by a match in an XOR of the bypass shift register stage of the F_OSU. Only Fbox pipelines F_PO and F_Pl of bypassing result data directly to a store bus. Result data from pipes F _P2 and F_P3 can only reach a store bus from the Register Cache, and incur a delay of at least one cycle from the completion of an operation. 8.2.9 Divide and SQRT - Qbox interface The Divide or the Square Root units in the Fbox pipelines are not pipelined and require multiple cycles to finish the operation. The latencies of the operations are shown in Table 1. The divide unit computes the fraction result and uses the multiplier for exponent result and the multiplier result bus to write the results. The square root unit computes the fraction results and relies on the add pipe (F_AP2) to calculate the exponent results, final rounding, and for exception detection. For this reason, at the end of a square root operation, the square root unit sends the results to the F_AP2 pipeline in the second stage of the pipeline. After computing the final result the F_AP2 pipeline actually writes the result. This also eliminates extra write ports in the register cache and drivers for the operand bus. In order to reinsert the divide or square root unit results in the Mui or add pipe, issue of all other instructions have to be stopped and a 'bubble' has to be created by the Qbox. The Qbox keeps track of the divide and square root completion and inserts the bubble appropriately. The following time diagrams show the relationship between done and bubble signals. It is possible to have a divide and square root to request a bubble at the same time. 
This can be seen in the timing diagrams below, by lining up a divide and a square root which are issued at different times. This problem is solved as follows: for divides and square roots, the Qbox detects a collision and always delays the square root by one cycle and inserts two bubbles - one for the divide followed by another for the square root. The sequencer in the square root unit detects this condition and delays the result transfer to the F_AP2 pipeline.

Table 8-5 FDIV_SP (9 cycles), FDIV_DP (14 cycles) (timing diagram; the rows show the FDIV resident in the divider for the single- and double-precision cases, the bubble request raised to the Qbox, the bubble arriving at the Qbox, the result bypass, and the cycle in which exceptions are driven)

Table 8-6 FSQRT_SP (12 cycles), FSQRT_DP (28 cycles) (timing diagram; the rows show the FSQRT resident in the SQRT unit for the single- and double-precision cases, the bubble request to the Qbox, the bubble arriving at the Qbox, the point at which the square root result is reinjected into and completed by F_AP2, the result bypass, and the cycle in which exceptions are driven)

8.2.10 Fbox Exceptions

The Fbox detects the following arithmetic exceptions.

Table 8-7 Arithmetic Exceptions

Exception             Description
Integer overflow      Detected and generated for CVTfQ and CVTQL instructions.
Invalid operation     In addition to illegal operations and invalid operands, VAX reserved operands are included.
Floating overflow     Generated during operate instructions.
Floating underflow    Generated during operate instructions.
Inexact result        Generated during operate instructions.
DIV by zero           Generated by divides, and approximate reciprocal and square root instructions.

The Ibox detects reserved opcodes and generates traps. The Ebox detects reserved values in the function fields of valid Fbox opcodes and generates a RESOPC trap.
However, the Fbox signals exception status and traps at one fixed point, F4 cycle of the pipeline as shown in the diagram. Since the Fbox processes instructions in SMT mode, up to four threads, it also sends back the thread ID along with the exceptions to match the trap with the instruction. In addition, when a trap occurs, the software completion flag encoded in the opcode has to be transferred to the trap handler through the operating system. For this purpose it sends the 'IS' bit along with the exception information. The PALCODE corresponding to the arithmetic exceptions also updates the IPR - EXCEPTION SUMMARY REGISTER. The mechanism used for updating the FPCR is detailed in the FPCR section. Table 8-8 Fbox Exception Signaling Timing RO Rl R2 R3 EO OPCODE VALID RESULT BYPASS EXCEPTIONS DRIVEN El E2 E3 E4 E5 FO Fl F2 F3 F4 vv --VV lCYC --VV --VV 3CYC 4CYC --VV The Fbox encodes the exception information, taking multiple exceptions into account and to signal traps as one vector exc_enc. Table 8-9 shows the legend for Table 8-10, which lists the various combinations: 8-12 Compaq Confidential Floating-Point Execution Units - the Fbox 5 Januc1ry 2001 ·- Subject To Change Interface Section Table 8-9 FPCR Update/Floating-Point Arithmetic Trap Legend Symbol Meaning DZE Division by zero INE Inexact result INV Invalid operation IOV Integer overflow OVF Floating-point overflow SW Software completion flag UNF Floating-point underflow Table 8-10 Fbox Retire-Time Exception (RTE) Encodings Encoding Retire-Time Disruption Encoding Retire-Time Disruption ooxxxx No Fbox exception 100111 INVDZE OlXXXX FPCR Update 101000 INVUNFINE 010000 IOVINE 101001 INVOVFINE 010001 INV 101010 INVINE 010110 DZE 101011 DZEUNFINE 010011 UNFINE 101100 Reserved 010100 OVFINE 101101 Reserved 010101 INE 101110 UNFOVFINE 010110 IOV INEINV 101111 Reserved 010111 INVDZE llxxxx FP Arith Trap (SW= 1) 011000 INVUNFINE 110000 IOVINE 011001 INVOVFINE 110001 INV 011010 INV INE 110110 DZE 011011 DZEUNFINE 110011 UNFINE 011100 Reserved 110100 OVFINE 011101 Reserved 110101 INE 011110 UNFOVFINE 110110 IOVINEINV 011111 Reserved 110111 INVDZE lOxxxx FP Arith Trap (SW = 0) 111000 INVUNFINE 100000 IOVINE 111001 INVOVFINE 100001 INV 111010 INVINE 100010 DZE 111011 DZEUNFINE Compaq Confidential 5 January 2001 ··· Subject To Change Floating-Point Execution Units - the Fbox 8-13 Fbox Floating... Point Control Register {FPCR) Table 8-10 Fbox Retire-Time Exception (RTE) Encodings (Continued) Encoding Retire-Time Disruption Encoding Retire-Time Disruption 100011 UNFINE 111100 Reserved 100100 OVFINE 111101 Reserved 100101 INE 111110 UNFOVFINE 100110 IOV INEINV 111111 Reserved 8.3 Fbox Floating-Point Control Register {FPCR) The FPCR contains rounding information and trap disable bits used by the floatingpoint operate instructions, and exception status information from floating-point operate instructions. The FPCR is read from and written to the floating-point registers by the MF_FPCR and MT_FPCR instructions. In addition, all operate instructions use the dynamic rounding mode bits to round the results and the trap disable bits to signal traps when an exception is detected. The Fbox implements all bits specified by the Alpha architecture except the Denormal operand exception disable bit (DNOD). The Fbox does not implement Denormal operand processing. The FPCR format is shown below. Since the 21464 issues the floating-point instructions out of order, a mechanism to correctly read (for both implicit and explicit readers) and write the FPCR is used. 
In addition, in SMT mode there can be four threads, each with its own FPCR. The floating-point instructions from each thread can be issued to any pipeline in the Fbox. In order to support these features, the Fbox implements two sets (copies) of FPCRs, one for each group of two pipelines. Each set of FPCRs contains four FPCRs, one FPCR per thread. The thread ID is used to access the correct FPCR. Each FPCR has two elements, a 'committed state' and a 'speculative state'. In order to avoid scoreboarding of the registers, the Fbox uses PALcode and the trap mechanism to synchronize the updates.

8.3.1 FPCR Format

Table 8-11 shows the format for the Floating-Point Control Register.

Table 8-11 Floating-Point Control Register Format (the control and status fields occupy bits <63:47>; bits <46:0> are RAZ/IGN)

The FPCR is read and written as follows, shown in Figure 8-3:

1. The FPCR needs to be updated for two reasons:
   a. As a result of a MT_FPCR instruction.
   b. To update the status information from each operate instruction.
   The data in the FPCR needs to be read as a result of MF_FPCR, and for the Fbox to use the round mode information and disable bits when executing a floating-point operate instruction (implicit read).

2. Whenever a MT_FPCR instruction is issued, the Qbox compares the INUM of the last instruction that changed the FPCR (corresponding to the FPCR in the 'speculative register') with the INUM of the current instruction. If the current instruction is older, it signals 'update speculative register' and the data is written to the speculative register; otherwise the data is ignored. When the MT_FPCR is retired, a trap to PALcode is taken where an IFETCHB instruction is executed. At this point, the Qbox sends a 'commit FPCR' signal to the Fbox and the speculative register is copied to the committed register. Whenever an FPCR status update is required, the Fbox signals a trap to the PALcode and supplies the exception information on the exc_enc signals.

3. This exc_enc information is written to the exception summary register by the hardware. When this trap occurs, all instructions younger than the trigger instruction are invalidated. The PALcode reads the exception summary register and executes one HW_MT_FPCR to write the status to the FPCR as mentioned before, followed by an IFETCHB instruction, and exits. The HW_MT_FPCR instruction is executed by the Fbox (in the F_SHP pipeline) and the data is written to the speculative FPCR.

4. Since the status bits (IOV, INV, DZE, OVF, UNF, INE) are sticky bits, whenever one of these bits needs to be set, the Fbox checks if the old bit is already one. If it was a one, the Fbox does not request a trap to update the FPCR. Otherwise it transmits exc_enc, the encoded exception information. This can produce up to six trap requests - one for each status bit - for a program. Once the corresponding sticky bit is set, no further traps occur. The Fbox uses the speculative FPCR status information for this purpose.

5. The Qbox sends a commit_FPCR signal to the Fbox as soon as the instruction that triggered the FPCR change is retired.

6. The Fbox uses the committed FPCR DYN_RM and disable bits to round the results and to signal traps.

7. The FPCR is implemented in the F_SHP units of the pipelines corresponding to P0 and P3.
Whenever a MT_FPCR is executed in either of these two pipes, the second speculative register is copied using the cross-cluster bus for the 3-cycle result.

Figure 8-3 FPCR Update Mechanism (per-cluster FPCR copies in F_P0 and F_P3: an FPCR issue or new exception status updates the speculative FPCR after a compare, the commit FPCR signal copies the speculative state to the committed state, the committed round mode and disable bits are distributed for F_P0/P1 and F_P3/P2 use, and the xclstr03/xclstr30 cross-cluster buses keep the two copies consistent)

8.4 Fbox Multiplier Unit - F_MUL and F_GML

The floating-point multiplier executes the following instructions: MULF, MULG, MULT, MULS, PMUL, PMULL, PMULH, PARCPL, PARCPLH, PARCPLL, PARSQRT, PARSQRTH, PARSQRTL.

8.4.1 FMUL Operation

The 21464 FMUL is fundamentally different from previous Alpha implementations. Unlike previous processors, which used an odd-even array multiplier, the 21464 uses a Wallace tree. This one feature has many far-reaching implications:

• 3-cycle FMULs are possible
• The array datapath is 106 bits wide
• Radix-4 Booth recode will be used instead of Radix-8
• The least significant bits of the product are not known early

In phase F0A, the two source operands are read off the operand busses with sense amps. The B operand goes to f_mul_mpc, which does the swizzle and drives the multiplicand. Because radix-4 Booth recoding is used, the 3x multiple of the multiplicand that radix-8 recoding would normally require is not needed. The swizzle is there to support PMUL instructions. Specifically, the PMULL and PMULH instructions require that the low (or high) B operand be used for both multiplies. To accomplish this, either the low or the high operand may need to be moved to the opposite location.

Also in phase F0A, the A operand is Booth recoded for radix 4. No swizzle is needed, but a 3-bit shift is required for PMUL operations. After recoding, the 53-bit fraction results in Booth control signals which produce 27 partial products. One extra partial product is also generated which corrects for deficiencies in the array (that is, the array doesn't fully sign extend all partial products or add the +1 term needed for two's complement arithmetic).

In phase F0B, Wallace compression begins. To sum up the 28 partial products, 7 stages of CSAs are required. The number of CSAs in each stage is: stage 1 - 9, stage 2 - 6, stage 3 - 4, stage 4 - 3, stage 5 - 2, stage 6 - 1, stage 7 - 1; total - 26. Both phases F0B and F1A are required for CSA stages 1-6. Stage 7 will be performed in phase F1B.

The stage 1 CSA has some extra logic incorporated which can conditionally force zeros into the array. This is done to support PMULs. Ordinarily the array would compute:

Ah*Bh<<52 + Ah*Bl<<29 + Al*Bh<<29 + Al*Bl<<0

This would be gibberish for a PMUL, but by selectively introducing zeros into the array the correct result can be had:

Ah*Bh<<52 + Ah*0<<29 + Al*0<<29 + Al*Bl = Ah*Bh<<52 + Al*Bl

Notice that Al*Bl can never result in a number that is big enough to affect the Ah*Bh sum, which is sitting 52 bits to the left.

Phases F1B and F2A are used for rounding. Two round adders will be built. The first only handles double precision multiplies; this will be the most critical. The second adder handles single precision multiplies, including PMULS, and an add required for the approximate instructions. Because a single precision add is inherently faster than double precision, this adder can be a degenerate copy of the double precision version. Having two round adders significantly simplifies the design and speeds up the hard double precision add. The additional area required for this scheme is approximately 1K cdus.

These round adders differ from the 21264 in two respects. First, the carry in from the least significant bits of the product is not known ahead of time. Instead, it has to be computed at the same time the add of the high bits is being done. The second complication is that the sticky bit is also not known ahead of time. It is possible to compute sticky early, but it requires a trailing zero count and an add, and because of the PMUL instructions this logic would be doubled. By making the round adder tolerant of a late sticky bit, a fair amount of hardware and complexity can be saved.

The final F2B phase is used to route the result back to the operand drivers and then drive the operand bus.
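The partial-product reduction figures quoted above are easy to check: starting from 28 rows, each stage of 3:2 carry-save adders converts every group of three rows into two. The small C program below reproduces the per-stage CSA counts (9, 6, 4, 3, 2, 1, 1) and the total of 26 over 7 stages; it is arithmetic only, not a model of the array.

/*
 * Small check of the Wallace-tree numbers quoted above: reducing 28 partial
 * products with 3:2 carry-save adders.  A stage with n rows uses n/3 CSAs
 * and leaves 2*(n/3) + n%3 rows for the next stage.
 */
#include <stdio.h>

int main(void)
{
    int rows = 28, stage = 0, total_csas = 0;

    while (rows > 2) {
        int csas = rows / 3;                 /* 3:2 compressors this stage   */
        rows = 2 * csas + rows % 3;          /* rows surviving to next stage */
        total_csas += csas;
        printf("Stage %d: %d CSAs, %d rows remain\n", ++stage, csas, rows);
    }
    printf("Total: %d stages, %d CSAs\n", stage, total_csas);
    /* Prints 9, 6, 4, 3, 2, 1, 1 CSAs over 7 stages, 26 CSAs in all,
     * matching the per-stage counts given in the text. */
    return 0;
}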
Phases F1B and F2A are used for rounding. Two round adders are built. The first handles only double precision multiplies; this is the most critical. The second adder handles single precision multiplies, including PMULS, and an add required for the approximate instructions. Because a single precision add is inherently faster than a double precision add, this adder can be a degenerate copy of the double precision version. Having two round adders significantly simplifies the design and speeds up the hard double precision add. The additional area required for this scheme is approximately 1K cdu's.

These round adders differ from the 21264's in two respects. First, the carry in from the least significant bits of the product is not known ahead of time; it has to be computed at the same time the add of the high bits is being done. The second complication is that the sticky bit is also not known ahead of time. It is possible to compute sticky early, but doing so requires a trailing-zero count and an add, and because of the PMUL instructions this logic would be doubled. By making the round adder tolerant of a late sticky bit, a fair amount of hardware and complexity can be saved.

The final F2B phase is used to route the result back to the operand drivers and then drive the operand bus.

Two parallel exponent additions are required prior to F2A. There is plenty of time for these, so they don't merit further discussion.

The approximate instructions are not handled in the main multiplier array. Instead, a ROM is used plus 3 PP muxes and 2 CSAs. The most significant 6 bits of the source fraction (or, for 1/sqrt, 5 fraction bits plus the LSB of the exponent) are used to index into a 64-entry ROM. Three numbers are retrieved: slope (9 bits), slope*3 (10 bits), and offset (18 bits). While the ROM lookup is happening, the next less significant 8 bits of the operand are radix-8 Booth recoded. The slope and slope*3 signals act as a multiplicand to the 3 PP muxes. These 3 partial products plus the offset are then compressed to 2 numbers with the help of 2 CSAs. The 2 numbers are muxed into the single precision round adder. The ROM is 64x2x2x19 = 4864 bits. The error of the result is less than 1 part in 2^14.

8.5 Fbox Add Pipeline

The Fbox ADD pipeline executes the ADD, SUB, CMP, and CVTxx instructions, and completes the DIV and SQRT instructions. The Add pipeline is divided into two pipelines, F_AP1 and F_AP2. The F_AP1 pipeline is used for the CMP instruction and for effective subtract operations with an exponent difference of 0 or 1. The F_AP2 pipeline executes all the other instructions. The partitioning of the add pipeline is based on the requirements of the effective subtract operation and is described below.

The steps required to implement an effective subtract operation in a straightforward manner follow:

1. Find the exponent difference of the two operands to align the fractions.
2. Align the smaller operand by shifting the smaller fraction right by the absolute difference of the exponents. This alignment shift can be very large, and a 54-bit shifter is required.
3. Subtract the aligned operands (the smaller operand from the other). When the operands are close, the result can have many leading zeroes if it is positive, or many leading ones if it is negative.
4. Find the leading-zero or leading-one position in the result to normalize. When the exponent difference is zero, the result of the subtraction can be negative; in this case the position of the leading zero is needed to shift left.
5. Normalize the result by the amount indicated by the leading 1/0 position. When the operands are very close, many leading bits may be canceled and a large left shift may be required.
6. Round the result.

These steps can be minimized by separating the operation into two domains: 1) exponent difference of 0 or 1, labeled the 'near domain', and 2) exponent difference greater than 1, labeled the 'far domain'. This separation uses the following observations:

1. In the near domain, the alignment shift is at most 1, which can be accomplished by a mux. Thus, there is no need for a huge alignment shifter.
2. In the near domain, if any normalization is performed there is no need for rounding. The reason is that, since the alignment is at most 1, only the round bit (the bit below the LSB) can be one, and when a normalization is performed this bit shifts back into the fraction. Hence no rounding is required.
3. In the far domain, the maximum normalization required is 1. Since the original operands are normalized (>= 1.0 for IEEE, >= 0.5 for VAX), and the aligned operand with a right shift >= 1 has a value < 0.5 for IEEE, or < 0.25 for VAX, the result of the subtract has to be >= 0.5 for IEEE, or >= 0.25 for VAX.

With those observations, the effective subtract can be performed using the following steps:

Near Domain (Exp difference = 0, 1)                  Far Domain (Exp difference > 1)
1N. Predict the exponent difference and align.       1F. Determine the exponent difference.
    Determine the leading-1/0 position using
    the input operands.
2N. Subtract the smaller operand.                    2F. Align the smaller operand.
3N. Normalize the result.                            3F. Subtract the smaller operand, with rounding.
(Observations 2, 3)                                  (Observation 1)

In step 1N above, the two least significant bits of the exponent are used to predict an exponent difference of 0 or 1. If the actual exponent difference turns out to be > 1, the far domain computes the result. After step 2N, the most significant bit of the result can be used to determine whether normalization is required. By observation 2 above, if no normalization is required, rounding may be necessary; in this case we can switch to step 3F. If normalization is required there is no need for rounding, and the operation can be completed in step 3N.

With the above principles, F_AP1 implements the 'near domain' and F_AP2 implements the 'far domain'. Since the COMPARE instruction is similar to an effective subtract with an exponent difference of 0, it is implemented in the F_AP1 pipe. The far domain pipe (F_AP2) implements all the other instructions.

8.6 Fbox Add Pipe1 - F_AP1

The Fbox add pipe is a 3-cycle pipe. F_AP1 is used for the effective subtract (Ediff = 0, 1 case) and for the compare instructions. The data into the add pipe is assumed to be binary vectors coming from the register file, the register cache, or the other pipes. The input operands are always assumed to be non-zero. The F_AP1 pipe does not handle the cases where one or both operands is a true zero, NaN, infinity, or denormal, or where it sees reserved operands or dirty zeros; the short pipe handles these exceptions, and the output latch that drives the F_AP1 result bus is disabled in these cases. The add pipe also does not do rounding. Addition is done in an earlier stage, taking advantage of the fact that rounding is not required for Ediff = 0, or for Ediff = 1 with normalization. Control is transferred to F_AP2 if normalization is not required and rounding may or may not be required. The add pipe F_AP1 has a fraction data path, an exponent data path, and control.
The fraction data path is 55 bits wide, including the sign bit, hidden bit, and round bit. The exponent data path is 11 bits wide. Figure 8-4 shows the basic outline of the add pipe F_AP1 and its main functional blocks, namely Ediff Predict, LXD, LXS, LXE, the adder, the left shifter, compare_blk, the exponent adder, and the underflow detect. A more detailed version of the block diagram is also available.

The two least significant bits of the exponent are used in 'Ediff predict' to determine whether exponent1 is equal to, less than, or greater than exponent2. This data is used as control signals to select the vectors A and B from the fractions F1, F2, F1/2, and F2/2. The fraction adder does an effective subtract on the two vectors A and B. The leading 1/0 in the A-B vector is partly detected in LXD, which determines all the 1's in the vector. LXS completes the leading 1 (0) detect by stripping all 1's (0's) except the first 1 (0) from the output of LXD, and drives the shifter control that normalizes the output of the adder. The output of LXS - in other words, the normalization amount - is encoded into a 6-bit vector (ELXD). ELXD is later subtracted from the result exponent (Er). This operation is done in the exponent adder, the output of which is the final exponent.

Compare instructions for A =, <=, < B are handled by compare_blk, which examines the signs, exponents, and fractions (in that order). In the event that the vectors A and B have the same sign and exponent, the difference fracA - fracB is examined to generate the compare results.

Underflow detection is done in two stages, namely threshold and unf_detect. Threshold determines whether there is an underflow, no underflow, or a possible underflow. In the case of a possible underflow, Unf_ctrl1 and Unf_ctrl2 produce the controls that select the appropriate fraction and exponent.

Figure 8-4 F_AP1 Block Diagram (fraction and exponent datapaths: inputs F1/F2 and Ea/Eb, the ediff-predict eq/lt/gt controls, LXD and LXS, the fraction adder producing A-B and B-A, the left shifter, the compare block, the exponent adder, and the FOUT, EOUT, UNF, and SIGN outputs)

8.6.1 Operation

8.6.1.0.1 Phase F0A

The operands are passed through differential sense amplifiers in this phase. The sense amps are assumed to be B latches followed by A latches; hence the data is read from the sense amps after the rising edge of the clock, with some delay due to the sense amps. We are assuming that the sense amps will introduce a delay of 300 ps. The output of the sense amps is a 52-bit vector to which the round bit is concatenated at the LSB position. The two 53-bit vectors, one for operand fa and the other for operand fb, are sent to the multiplexer, which selects either fa, fa/2, fb, or fb/2. The select lines of the multiplexer come from the ediff predict unit. Ediff predict is a 2-bit predict logic that uses bits 0 and 1 of the exponent to determine whether ea = eb, ea > eb, or ea < eb.
Table 8-12 Exponent Difference Estimation

Ea<1:0>  Eb<1:0>  Potential Exponent Difference  Fraction Operation Performed(1)
00       00       0                              Fa - Fb
00       01       -1                             Fb - Fa/2
00       10       >1                             x
00       11       +1                             Fa - Fb/2
01       00       +1                             Fa - Fb/2
01       01       0                              Fa - Fb
01       10       -1                             Fb - Fa/2
01       11       >1                             x
10       00       >1                             x
10       01       +1                             Fa - Fb/2
10       10       0                              Fa - Fb
10       11       -1                             Fb - Fa/2
11       00       -1                             Fb - Fa/2
11       01       >1                             x
11       10       +1                             Fa - Fb/2
11       11       0                              Fa - Fb

(1) Fa and Fb are the fraction parts of the operands.

The outputs of the mux are sent to LXD and the Adder in the next phase, along with the hidden and sign bits.

8.6.1.0.2 Phase F0B

LXD is the leading 1/0 detect logic. It computes f_ap1_m1%A - f_ap1_m1%B and does the leading 1/0 detect. It generates a vector with a '1' in the leftmost position corresponding to the leading 1, and possibly several 1's to the right. These spurious 1's to the right of the leading 1 are stripped in LXS. LXD is needed to know the number of bits by which the result of the effective subtract must be shifted for normalization. Leading 1 detect is needed if f_ap1_m1%A - f_ap1_m1%B > 0, and leading 0 detect is needed if f_ap1_m1%A - f_ap1_m1%B < 0.

The Adder for the effective subtract and compare instructions is started in the same phase as LXD. The adder (FAD) performs A-B and B-A based on the following logic equations:

F_AP1_M1%A - F_AP1_M1%B = F_AP1_M1%A + not F_AP1_M1%B + 1
F_AP1_M1%B - F_AP1_M1%A = F_AP1_M1%B + not F_AP1_M1%A + 1
                        = not ( F_AP1_M1%A + not F_AP1_M1%B + 0 )

8.6.1.0.3 Phase F1A

LXS is the leading 1/0 strip logic. It keeps only the first '1' and strips all the remaining '1's from the leading 1 detect's output. This is used to directly drive the shifter control lines. The input to LXS is the output of LXD, with some modification to the round bit. Since it is possible for the LXD output to have no 1's at all (the zero result that can occur when ea = eb), the round bit is modified as follows: if ea_eq_eb is true, the round bit (the LSB) is forced to a logic high; otherwise the original value of the R bit is maintained. This ensures that LXS never produces an all-zero result, which is necessary for the shifter control lines.

The remaining part of the addition in the Adder block is also completed in this phase. One of the outputs of the adder is selected in this phase by the mux M2. A-B is selected if the adder result is positive or if the ediff predict predicted ea_lt_eb or ea_gt_eb. If, on the other hand, ea = eb, there is the possibility that the adder result is negative, in which case the B-A result must be selected. The output of the mux M2 is checked for the hidden bit. If the hidden bit is a '1', normalization is not required. This signal is sent to the add pipe F_AP2; once it gets this signal, F_AP2 does rounding if necessary and drives the output bus. F_AP1 does not drive the output bus in this case.

The mux M3 selects the correct exponent, ea or eb. Logic equations for M3:

F_ap1_m3%Exp = f_ap1_sa%ea   if f_ap1_sa%ea - f_ap1_sa%eb >= 0
             = f_ap1_sa%eb   if f_ap1_sa%ea - f_ap1_sa%eb < 0

For normalizing the adder result, the left shifter shifts it by the number of bit positions indicated in the output of LXS. LXE encodes the LXS result into 6 bits. The encoded LXS is used by the exponent adder to generate the final exponent.

8.6.1.0.4 Phase F1B

The exponent adder is used to generate the result exponent and the result exponent + 1. The exponent adder used here is a 13-bit exponent adder.
The 13th bit, in the MSB position, is '0' and is added to the exponent 'exp' to keep track of overflow and sign. The extra 6 MSB bits added to elxd are also all zeros.

F_ap1_ead%Res_exp   = f_ap1_m3%exp - f_ap1_lxe%elxd
F_ap1_ead%Res_expp1 = f_ap1_m3%exp - f_ap1_lxe%elxd + 1

Res_expp1 is the result exponent + 1. This is used when a right shift is done on the left-shifter output to correct the over-estimated LXD.

The compare_blk is used to execute the compare instructions. It looks at the signs of the two operands, their exponents, and the sign of the adder fraction result, in that order. Depending on which compare instruction is to be executed, it determines whether a =, <, or > b. Each of the compare conditions is calculated as follows:

F_ap1%Cmp_eq = '1' if ((sa=sb) & (ea=eb) & (fa=fb))
F_ap1%Cmp_gt = '1' if ((sa='0' & sb='1') | (sa=sb & ea>eb) | (sa=sb & ea=eb & fa>fb))
F_ap1%Cmp_lt = '1' if ((sa='1' & sb='0') | (sa=sb & ea<eb) | (sa=sb & ea=eb & fa<fb))

The compare instruction decision flow is shown in the following figure. AZ/BZ indicate that A/B is zero; AN/BN indicate the sign of A/B.

Figure 8-5 CMP Instruction Logic (decision flow: if AZ=BZ=1 then OPA=OPB - for IEEE, +/-0 compare equal; otherwise the operand signs decide (OPA > OPB if AN=0, OPA < OPB if AN=1); for equal signs the exponent comparison (EN) decides; for equal exponents the sign of the fraction difference (FN) decides)

Sign-detect detects the sign of the result. The logic equation for sign-detect follows:

Sign = ((ea > eb) & sign of opd a) or ((ea < eb) & sign of opd b) or ((ea = eb) & sign of fraction difference)

8.6.1.0.5 Phase F2A

The output of LXD may be off by 1 bit; hence the output of the left shifter may have to be shifted right by one bit position to correct for the overestimate in LXD. F_ap1_ls%FS_* represents the adder result left-shifted by the amount LXD and LXS indicated. F_ap1_ls%FSR_* is f_ap1_ls%FS_* right-shifted by one bit position. One of these two vectors is selected by the mux M4, and one of the vectors res_exp or res_expp1 is selected by the mux M5, as follows:

F_ap1_ls%FS<54> (bit A0) = '0':  Fraction output = F_ap1_ls%FS;  Exponent output = F_ap1_ead%res_exp
F_ap1_ls%FS<54> (bit A0) = '1':  Fraction output = F_ap1_ls%FSR; Exponent output = F_ap1_ead%res_exp + 1

The underflow detect is done in two stages, namely threshold and unf_detect:

• Threshold looks at the result exponent from the previous phase and the data type (for example S, T, G, or F) and, by doing range checking, determines whether there is a definite underflow, a definite no-underflow, or a predicted underflow.
• Unf_detect then determines whether there is an underflow when threshold has indicated a predicted underflow.
It does so based on the following equations. Underflow occurs if:

• Threshold gave a definite underflow, or
• Threshold gave a predicted underflow & fraction result != 0 & bit A0 of the fraction = 0

UNF1 = def_unf or (pred_unf & FS != 0 & FS<54> (bit A0) = '0')
UNF2 = def_unf or (pred_unf & FSR != 0 & FSR<54> (bit A0) = '0')

No underflow occurs if:

• Threshold gave a definite no-underflow, or
• Threshold gave a predicted underflow & bit A0 of the fraction = 1, or
• Threshold gave a predicted underflow & fraction result = 0

NO_UNF1 = def_nounf or (pred_unf & FS = 0) or (pred_unf & FS<54> (bit A0) = '1')
NO_UNF2 = def_nounf or (pred_unf & FSR = 0) or (pred_unf & FSR<54> (bit A0) = '1')

Mux M4 selects one of the two fraction results, FS or FSR, as given in the following equations:

FOUT = FS   if A0 = '0'
     = FSR  if A0 = '1'
     = 0    if LXS_H<0> = '1' (zero detect) or UNF1 or UNF2

Similarly,

EOUT = exp - elxd      if A0 = '0'
     = exp - elxd + 1  if A0 = '1'
     = 0               if LXS_H<0> = '1' (zero detect) or UNF1 or UNF2

The sign, exponent, and fraction outputs are all zeros if there is an underflow or if the input operands are equal. Otherwise the sign, exponent, and fraction results are driven out on the output bus in phase F2B.

8.6.1.0.6 Phase F2B

This is the output bypass phase. The outputs are driven on the result or operand bus at the beginning of this phase.

8.7 Fbox Add Pipe2 - F_AP2

F_AP2 is responsible for eff. ADD, eff. SUB with ediff > 1, CVTLQ/QL, CVTqf, CVTff, and CVTfq. In the case of eff. SUB with ediff = 1 and no normalization, rounding is required because one bit was shifted out of the datapath and was not brought back by normalization; that bit must be accounted for by rounding. Since F_AP1 does not have a round adder, F_AP2 outputs the result instead of F_AP1; in this case F_AP1 sends a signal to F_AP2. The exponent and rounding of DIV/SQRT are also handled by F_AP2; the multiplier handles its own rounding and exponent. The latency of the F_AP2 pipeline is three cycles. The datapath can be divided into three sections: fraction, exponent, and control. The fraction datapath is 64 bits plus the R bit, and the exponent datapath is 12 bits. At the top, the data is driven by sense amps which are fired by the rising edge of fclk. In the case of unary instructions, OPB is used and OPA is ignored. Special operands are handled by F_SP. The following sections describe the operation of F_AP2.

8.7.1 Cycle 1 Operation

8.7.1.1 Fraction

Floating-point operands: 1-bit Sgn, 11-bit Exp, 52-bit Frac. The sign goes to the control section and the exponent goes to the exponent datapath. The fraction part is dropped into [B01..B52] bit by bit; [B00] is the hidden 1, and [A10..A00] are forced to 0.

Integer operands: 64 bits. The operand is passed as [A10..B52].

This transformation happens in "format_a" and "format_b". OPB can be a signed-magnitude floating-point number or a 64-bit 2's complement number. OPA is always a floating-point number in signed-magnitude format. They are transformed, according to data type and opcode, to fit into the fraction datapath. The interpretation of the datapath format depends on the opcode. Note that for IEEE, the hidden 1 is located at A00; the exponent datapath doesn't need to do anything for this because the 1 will go back to A00 eventually. For eff. SUB, the result out of the round adder will be either 0.1xxx or 0.01xxx, which differs from the 1.xxx or 0.1xxx produced by eff. ADD.
To simplify rounding overflow detection and exponent calculation, both eff. SUB operands are always shifted left by 1 to align with eff. ADD. This is done here too. Note that the exponent logic needs to take these situations into account.

For CVTQL, bits <31:30> are copied to bits <34:33>; later, the shifter left-shifts the operand by 29 bits. For CVTLQ, <63:62> are moved to <60:59> and <63:62> are sign extended (by copying <63> to <62>); later, the shifter right-shifts the operand by 29 bits. For D-floating, the fraction is moved right by 3 bits and chopped; bits 1 and 0 are lost for D format.

SHF_MUX and PASS_MUX determine which operand is sent to the shifter. In the case of CVTxx, OPB is selected by default. For ADD/SUB, the fraction of the smaller operand is sent to the shifter for alignment, and the larger operand is sent to the Rounding CSA in F1A. Since Ediff >= 1 is always true, the result of the round adder is always positive; therefore no magnitude comparator is needed here. The control signal to both muxes is EN, which is the sign of Ea - Eb.

SHF_MUX also handles the first step of negation. If the instruction is an effective SUB, or a CVTQf/CVTfQ with a negative operand, the operand is inverted bitwise before shifting. However, 1 must still be added to the LSB to complete the 2's complement. This is done with the help of TRZ, and the one is combined with the sticky bit. For instance, suppose a number to be negated is shifted right by 10 bits. There are then 10 bits out of the datapath, and a 1 needs to be added at B52+10, which doesn't exist in the datapath. For that 1 to propagate to B52 inside the datapath, all 10 bits shifted out must be 1's; otherwise the 1 is killed somewhere and can be ignored. Therefore, to see a 1 coming in at B52, TRZ must be larger than 10. Although the 1 for the 2's complement can be killed, it may still change the sticky bit.

The output of SHF_MUX is sent to Lo_mux and Hi_mux, controlled by shf_setup. These cells set up for a left shift or a right shift. The shifter needs a 128-bit operand: in the case of a left shift, the operand is sent to LO_word with HI_word filled with 0 or 1; in the case of a right shift, the operand is sent to HI_word with LO_word filled with 0 or 1. Left/right shift is determined by the sign of elxd. The filling of the extension word is determined by the instruction, as shown in the following table.

Table 8-13 Filling of Extension Word for F_AP2 Instructions

Instruction           Hi_Word    Lo_Word
Add                   0          operand        (Rsh only)
Sub                   1          -operand       (Rsh only)
CVTQF (Rsh, posQ)     0          operand
CVTQF (Rsh, negQ)     0          -operand
CVTQF (Lsh, posQ)     --         operand (inverted if the operand is negative)
CVTQF (Lsh, negQ)     1          -operand
CVTFQ (Lsh, posF)     0          operand
CVTFQ (Lsh, negF)     1          -operand
CVTFQ (Lsh, posF)     operand    0
CVTFQ (Lsh, negF)     -operand   1
CVTLQ (Rsh)           Sign_ext   operand
CVTQL (Lsh)           operand    0
CVTFF                 --         --

Note: The inversion of the operand is done by SHF_MUX.

TRZE(A) and TRZE(B) count and encode the number of trailing zeros in A and B respectively. It is also possible to save one encoder by putting TRZ_mux before the encoder. TRZ_mux picks the eTRZ of the smaller operand, which is used to calculate the sticky bit. It may be useful to do TRZ on B_bar for 2's complement negation. TRZ_MUX is also controlled by EN.

CVTQF_FLE detects the position of the leading 0/1 for CVTQF only.
It strips all lower order 1's or O's and leaves only the leading 0/1 in output like LXS. Physically, the stripping is done in encoded domain so that encoding and stripping are done in one step. The output is then decoded to control shifter. For detailed description of shifter control, see 2.2. If leading one is in [B53 .. BO], the encoded output is (53 .. 0). If leading one is in [AO ..A9], the output is encoded as (-01..-10) for exponent calculation. The sign also indicates the direction of shifting. FLE will also detect if there is a 1 in the upper LW, whcih will cause an integer overflow in CVTQL. Note the calculation of the sticky bit needs 3 elements : etrz, ediff or elxd, and a constant. This will be explained later. 8.7.1.2 Exponent In the first phase, in order to compute absolute value of exponent difference(ediff), two 11-bit adders calculate Ea-Eb and Eb-Ea concurrently. On top of the adders, muxes are used to force value of Ea and Eb for specific instructions. The results of adders determine which fraction to shift and exact_ediff. The sign of Eb-Ea and exact_ediff>l is sent to F_APl because F_APl doesn't have the adders. Ediff is used to determine shift amount of Add/Sub alignment, calculate sticky bit, and control many muxes. In the second phase, if F_AP2 will handle exponents for DIV/SQRT, their exponents(Eb-Ea) will be frozen in a latch, div/sqrt frz, and wait until the fraction part is almost done. ermux_l picks a constant for certain instructions like CVTQF. SHR_3 is forCVTDG. Er_mux determine the result to be used in final exponent calculation. In the case of eff. ADD/SUB, the exponent of larger operand is picked. For DIV/SQRT, the exponent is from DIV/SQRT frz. For CVTxx instructions, a constant is supplied. 8.7.1.3 Control Exp_mux chooses among Constants, Ea-Eb, and Eb-Ea for the ediff to be used in driving shifter and sticky bit calculation. Since ediff is encoded, it needs to be decoded before driving the shifter. The decoding is done in 2 steps here but this may change with physical implementation. CVT_mux, CSA&CASC-LAT, and PGK is the first step of sticky bit calculation. To calculate sticky bit, 3 numbers are involved: etrz, ediff (elxd, in the case of CVTQF), and a constant specified by instruction. Ediff/elxd selection is done by CVT_mux. Then 8-28 Compaq Confidential Floating-Point Execution Units - the Fbox 5 Jc1nwiry 2001 -· Subject To Change Fbox Add Pipe2 - F____ AP2 a CSA reduce 3 numbers to 2 numbers so they can be used to drive PGK. To save time, CSA is combined with a cascode header to work as a latch. The width of this datapath is 6-bit. 8.7.2 Cycle 2 Operation 8.7 .2.1 Fraction L/R shifter handles the alignment for ADD/SUB and normalization for CVTxx instructions. For alignments, it is always a right shift. For normalizations, both left and right shifts are possible. To handle all conditions fast, both operand and control are arranged accordingly before the clock edge. Therefore, we can squeeze in the rounding CSA in F2A. There are 65 control lines which are also arranged specifically for different instructions. The 65 control lines are coded as 00-64. The shifter works as a 65-1 mux. 00 selects A10-B53 of INPUT_LOW; 01 selects B52 of INPUT_HI and A10-B52 of INPUT_LOW; 02 selects B51-B52 of INPUT_HI and A10-B51 of INPUT_LOW, and so forth. In the case of CVTQF, if leading one is in [A9 .. AO] (AlO is sign), shifter sets up to do right shift and LXD(A9 .. AO) is mapped to control lines (01..10). For leading one's in [BOO .. 
B53], shifter sets up to do left shift and LXD(B53 .. BO) is mapped to control lines (11 .. 64) respectively. For alignments, shifter sets up for right shift only. Ediff(00 .. 63) is mapped to control lines (0 .. 63). Rounding CSA compresses two operands and sticky bit and rounding constant to 2 operands for PGK. Note the sticky bit is only one bit, so we can encode more informaiton into this number. Say STICKY=l and another 1 is required for negation, STICKY is forced to be 2 (lOb). For different data format, the 1 of 2's complement and the sticky bit need to be inserted in different bit position. Round Adder starts in F2B and takes one cycle. 8.7.2.2 Exponent/Control Exponent adder calculates (er,er+ 1) for eff. Add and (er,er-1) for eff. Sub. The selection is based on AO and BO of round_adder result. Threshold logic is to detect conditions which may become overflow or underflow by the result of rounding adder. In other words, it detects if the instruction is on the verge of OVF/UNF. It is basically a ROM taking inputs from Exp_adder and Op_code decoder. Note that in register file, all single precision exponents are extended to fit into double precision fields by adding 896d=380h. OVF/UNF Detect is tightly coupled with Threshold logic to detect definite OVF/UNF conditions and force exponents to Emin or Emax accordingly. The sticky bit calculation is done in Fl A with the following equations: eff. sub & fs eff. sub & gt eff. add & fs eff. add & gt cvtfq &-en s= etrz - (edif+27) < 0 s= etrz - (edif -2) < 0 s= etrz - (edif+28) < 0 s= etrz - (edif -1) < 0 s= etrz - (edif -1) < 0 right shift 5 January 2001 ··· Subject To Change Compaq Confidentia I Floating-Point Execution Units - the Fbox 8-29 Fbox Add Pipe2 - F___.AP2 cvtqf & FS cvtqf & gt cvtts cvtfq & en s= etrz - (29-elxd) < 0 [53 - elxd - etrz >24 or 53] s= etrz - (-elxd) < 0 s= etrz - (edif+28) < 0 s= etrz - (edif+63) < 0 left shift Note thats is a Boolean function, so we only need a carry chain here to determine sign. The exact sum is unnecessary. 8.7.3 Cycle 3 Operation 8.7 .3.1 Fraction Round Adder takes 3/4 cycle to complete. Logic is built into the adder to detect potential overflow/underflow. Fraction mux handles possible one bit shift of fraction and pick the right exponent and merge it into 64-bit datapath for output. There is a special caese for F _AP2 to handle. For ediff=l, it's possible that there is no need for normalization and thus rounding is necessary. Since F_APl has no rounding capability, the rounding is done in F_AP2. In F3B, F_AP2 drives operand buses. 8.7 .3.2 Exponent/Control If exponent has the potential to ovf/unf, one of the exponent is forced to Emax/Emin. Compaq Confidential 8-30 Floating-Point Execution Units - the Fbox 5 Jc1nuary 2001 ··· Subject To Change Fbox Short Pipe - F___.SHP Figure 8-6 F_AP2 Block Diagram OPA<0:63~- l i I Format Format l OPB<0:63> :y- I TRZ_A I TRZ_A I lMllx ~ FLE .l B A 1 11 11- Shifter [ MUX l rrl<0,5> Constants EfRZ<0:5> MUX Ed.iffMUX DIV/SQRT le L OP_SHF<0:61 OP_PASS<0:52l_ Shifter Setup ~ Left/Right Shifter l ~ ElxdMUX _y. EDIFF2<0:~ J l ] .... comtant l_j, l_ L rY' Exponent result MUX ..... .....- Shifter Control Sticky Logic ..:i:. * Exponent Adder l<:>P_APAS<0:63> OP_AS H<0:631 J + EDIFF<0:5> Div/Sqrt MUX Eb- ~ Ea-~b ELXD<0:6> EfRZ Pass ED~ ED l_ _! 
8.8 Fbox Short Pipe - F_SHP

The short pipeline F_SHP implements the single-cycle-latency FCMOVxx, CPYSx, and FBRxx instructions. It also implements special or unusual operand handling and the FPCR in two of the pipelines; these operations have full 3-cycle latency. The F_SHP also supplies rounding information to the other pipelines and functional units from the instruction and the FPCR. It collects all the exception information and communicates it to the Qbox. The F_SHP consists mainly of three distinct sections, sharing instruction decode and output drivers but little else.

8.8.1 Short Instructions

These ten instructions require little processing - in six cases a zero compare, and in the remaining four just selection of operand bits via a mux. They could be executed in a single cycle, but wiring constraints may favour a three-cycle unit (a single-cycle unit requires its own output and bypass busses, whereas a three-cycle unit can share the busses used by the other three-cycle units - add, multiply, and so on). Any performance impact of this is likely to be dominated by FCMOVxx latency: FCMOVxx is the most commonly used of the short instructions and passes through the F_SHP twice, so lengthening the pipe adds four cycles to FCMOVxx latency.

8.8.1.1 CPYS, CPYSN, CPYSE

These three instructions merely copy fields of the input operands to the result, as shown below. The fields copied are controlled purely by the instruction - there is no data dependence.

Instruction  Pin  Sout  Eout  Fout  Description
CPYS         x    SA    EB    FB    Copy Sign
CPYSN        x    !SA   EB    FB    Copy Sign Negated
CPYSE        x    SA    EA    FB    Copy Sign and Exponent
FCMOVxx1     x    SB    EB    FB    Conditional Move part 1
FCMOV2       0    SA    EA    FA    Conditional Move part 2
FCMOV2       1    SB    EB    FB    Conditional Move part 2

8.8.1.2 FCMOVEQ, FCMOVGE, FCMOVGT, FCMOVLE, FCMOVLT, FCMOVNE

As specified in Section 2.11.2.3, FCMOVxx copies the second source operand to the destination if the first source operand passes the test specified in the instruction, and otherwise leaves the destination unchanged.
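In C terms, the architectural effect of an FCMOVxx (FCMOVEQ is shown as one example) is simply a data-dependent select; the sketch below illustrates the behavior only, not the two-pass F_SHP implementation.

    /* FCMOVEQ fa, fb, fc: if fa compares equal to zero, fc receives fb;
     * otherwise fc keeps its old value.  +0 and -0 both count as zero. */
    static double fcmoveq(double fa, double fb, double fc_old)
    {
        return (fa == 0.0) ? fb : fc_old;
    }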
The F_SHP must detect unusual input operands in all floating-point types used by the Fbox - IEEE single and double precision, VAX single and double, and packed graphics (two IEEE single format values packed into bits 63:32 and bits 31:0 of the bus). 8.8.2.1 Unusual Cases • If Fb is a NaN, propagate the quietened NaN Else if Fa is a NaN and Fa is used, propagate the quietened NaN (and throw an INV exception if FPCR<INVD> is clear and either was an SNaN). • If either operand is denormal then if FPCR<DNZ> is set, treat the operand as zero else throw an INV exception. • If both operands are usual numbers (non-zero finite numbers) the F_SHP does nothing, and the result is generated by the arithmetic unit. • If either operand is non-usual the result is generated as shown in the following tables. For each instruction the result to select is defined by the type of the two operands A and B, and can be one of the two operands, a true zero, IEEE infinity or the canonical quiet NaN (sign bit, all exponent and fraction MSB set, remaining bits of fraction cleared). • In the case of VAX or IEEE single or double format unusual result the F_SHP asserts SUPPRESSLOW_Hand SUPPRESSHIGH_H to force the active arithmetic unit to release the output bus, and drives the value shown below. The sign bit is treated as a special case (it may have to copied from A, B, a constant, not B or A xor B), the fraction and exponent are copied from A, Bora constant (0, Inf or CQNaN). 1 Actually two signals SUPPRESSLOW_H and SUPPRESSHIGH_H to handle graphics instructions where the graphics unit or multiplier may have to drive one half of the output whilst the F_SHP drives the other. For non-graphics instructions both will be asserted. 5 January 2001 ··· Subject To Change Compaq Confidential Floating-Point Execution Units - the Fbox 8-33 Fbox Short Pipe - F____SHP • In the case of graphics format data there are two independent values in each operand, one on the upper half of the bus, one on the lower. The two halves are handled independently, each as described above. It is possible that one half of the result needs to be driven by the F_SHP whilst the other needs to be driven by the arithmetic unit. SUPPRESSLOW_Hand SUPPRESSHIGH_H are driven independently to signal the arithmetic unit which bits it needs to drive. 8.8.2.2 IEEE Data 8.8.2.2.1 ADDS, ADDT Normal A +Inf -Inf B A CQNaN A A - - - CQNaN A A A B B B +Inf -Inf 0 Normal 0 A B B A (driven by addpipe) 8.8.2.2.2 DIVS, DIVT A Inf B CQNaN Inf (DZE) A 0 0 CQNaN(DZE) A Inf (DZE) (driven by AP2) Inf 0 Normal A CQNaN A CQNaN A A B B (driven by mulpipe) Inf 0 Normal 0 Normal - - 8.8.2.2.3 MULS, MULT Sout =Sa xor Sb in all cases A B Inf 0 Normal Compaq Confidential 8-34 Floating-Point Execution Units - the Fbox 5 Jc1nuary 2001 -· Subject To Change Fbox Short Pipe - F___.SHP 8.8.2.2.4 SQRTS, SQRTT Operand Result 0 0 Inf -ve Normal Inf CQNaN(INV) (driven by AP2) 8.8.2.2.5 SUBS, SUBT A +Inf -Inf 0 Normal B - - CQNaN A A A A CQNaN A A - +Inf -Inf -B -B -B -B A A (driven by addpipe) 0 Normal -B 8.8.3 Floating-Point Control Register (FPCR) The FPCR contains dynamic rounding information, trap disable bits and exception status. Two of the four F_SHPs include a copy of the FPCR. The north-west F_SHP broadcasts rounding mode information to and handles exceptions for the two western pipes whilst the north-east F_SHP handles the two eastern pipes. The two southern F_SHPs contain no FPCR state, but do contain some FPCR related logic. 
8.8.2.2.4 SQRTS, SQRTT

Operand       Result
0             0
Inf           Inf
-ve Normal    CQNaN (INV)
(positive normal operands are driven by AP2)

8.8.2.2.5 SUBS, SUBT

A \ B     +Inf    -Inf    0    Normal
+Inf      CQNaN   A       A    A
-Inf      A       CQNaN   A    A
0         -B      -B      0    -B
Normal    -B      -B      A    (driven by addpipe)

8.8.3 Floating-Point Control Register (FPCR)

The FPCR contains dynamic rounding information, trap disable bits, and exception status. Two of the four F_SHPs include a copy of the FPCR: the north-west F_SHP broadcasts rounding mode information to, and handles exceptions for, the two western pipes, whilst the north-east F_SHP handles the two eastern pipes. The two southern F_SHPs contain no FPCR state, but do contain some FPCR-related logic.

Each of the two northern F_SHPs contains four FPCRs, one FPCR for each thread. Each FPCR consists of two registers, one containing speculative state and the other committed machine state.

Figure 8-7 Fbox Floating-Point Control Registers (block diagram: the four F_SHPs in pipes F_P0-F_P3; the two northern F_SHPs each hold per-thread Speculative 0-3 and Committed 0-3 registers)

8.8.3.1 Reading the FPCR

The FPCR is read explicitly by the MF_FPCR instruction. In response to this instruction, the F_SHP reads the FPCR for the current thread and writes the result bus. As the two copies of the FPCR are defined to be identical, this instruction can be issued to either of the two northern pipes. Any implicit execution barriers needed are performed by the PALcode routine, so the current committed value of the FPCR can be returned immediately.

The FPCR (committed) is read implicitly by every dynamically rounded floating-point instruction (DYN_RM bits) and by every possible arithmetic trap (trap disable bits).

8.8.3.2 Dynamic Rounding

Every arithmetic instruction includes explicit rounding bits <12:11>:

Table 8-14 Arithmetic Instruction Explicit Rounding Bits

Bits  Meaning
00    Round chop
01    Round to -infinity
10    Normal (nearest/even) rounding
11    Dynamic rounding

In the case of dynamic rounding, the instruction instead uses the rounding mode specified by bits <59:58> of the FPCR for that thread, as follows:

Table 8-15 FPCR Dynamic Rounding Bits

Bits  Meaning
00    Round chop
01    Round to -infinity
10    Normal (nearest/even) rounding
11    Round to +infinity

This is handled by each of the four F_SHPs rewriting the rounding mode of each instruction issued to its pipe. This means each arithmetic unit can use the rounding mode it receives directly, without needing to handle the details of dynamic rounding. As any change to the dynamic rounding mode bits must be isolated by execution barriers, the committed state of the FPCR of the appropriate thread can be used directly. This functionality must also be provided by the southern F_SHPs, so the northern F_SHPs must drive the dynamic rounding mode of each thread to their southern equivalent (eight wires).

8.8.3.3 Exceptions

There are five maskable arithmetic exceptions:

Table 8-16 Maskable Exceptions

Exception  Meaning
INE        Inexact
OVF        Overflow
UNF        Underflow
DZE        Divide by zero
INV        Invalid operation

If FPCR<DNZ> is not set, then any attempt to use a denormal operand throws an INV exception, even if FPCR<INVD> is set. If FPCR<DNZ> is set, all denormal values are treated as true zero. INV and DZE are generated only by the F_SHP, so they can be handled by the unusual operand handling logic. Because the F_SHP can handle any operation with a zero operand (is this true? TBS), it can handle forcing denormals to zero without involving the arithmetic unit.

An arithmetic unit that produces an overflow, underflow, or inexact result must assert unit_UNF, unit_OVF, or unit_INE respectively. In the case of an overflow or inexact result the unit must drive the IEEE-correct result; in the case of an underflow the unit must drive true zero. The F_SHP throws the appropriate exception if it is unmasked and if the instruction enables the exception (for INE, if bits 15:13 = 111; for UNF, if bit 13 = 1).
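The two-level gating just described - the instruction's trap-enable bits and the FPCR disable bit must both permit the exception before the F_SHP reports it - can be modelled with a few lines of C. This is a sketch only; the field names and encodings are assumptions made for illustration, not the spec's signal names.

    #include <stdbool.h>

    typedef struct {
        unsigned trp;        /* instruction bits <15:13> */
        bool     fpcr_ined;  /* FPCR inexact-disable bit */
        bool     fpcr_unfd;  /* FPCR underflow-disable bit */
    } fp_trap_ctl_t;

    /* INE is reported only when the instruction's full enable is present (bits 15:13 = 111). */
    static bool report_ine(const fp_trap_ctl_t *c, bool unit_ine)
    {
        return unit_ine && c->trp == 0x7 && !c->fpcr_ined;
    }

    /* UNF is reported when the underflow-enable bit (bit 13, the LSB of the field) is set. */
    static bool report_unf(const fp_trap_ctl_t *c, bool unit_unf)
    {
        return unit_unf && (c->trp & 0x1) && !c->fpcr_unfd;
    }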
All exceptions are thrown in cycle F3 over a dedicated 8-bit bus to the Ibox (also used for requesting traps to PALcode for setting FPCR exception flags). (Exception bus coding TBD.) The denormal-to-zero bits must be available to the southern F_SHPs, so the northern F_SHP drives the four DNZ bits to the southern F_SHP (four wires). The southern F_SHP has no direct access to the exception masks, so it drives exceptions to its northern counterpart, where they are masked and sent to the Qbox.

(F_SHP block diagram: NaN/zero/denormal/infinity detection for double, single, and graphics formats; instruction decode; unusual operand control; the FPCR; 6:1 output muxes for the low and high words; and exception masking.)

8.9 Fbox Divider - F_DIV

8.9.1 Divider Description

The floating-point divider executes the following instructions:

• DIVS
• DIVT
• DIVF
• DIVG

The 21464 FDIV is the PCA57 divider. The only difference between PCA57 and the 21464 is that the 21464 inserts the divide result into the add pipe before the round adder. This costs 2 cycles of additional latency. This divider uses a split remainder algorithm which delivers 6 bits/cycle with very little overhead. The timing breakdown is as follows:

Table 8-17 F_DIV Timing Sequence

Cycle     Action
0a        operand transit
0b-10a    divider array (10 passes are needed)
10b       sticky detect/add
11a       add/mux
11b       transit for result
12a/1a    mux in add pipe
13b/2b    bypass

As you can see, the total latency is 14 cycles for double precision. For single precision, 5 cycles are removed from the array, resulting in a latency of 9. The algorithm is unique (we are patenting it) and as such isn't available in a textbook. Basically, a split remainder divider does an SRT-type operation: the true remainder is never exactly known, but the uncertainty is kept low enough, and bounded, to still make quotient decisions.

8.9.2 The Divider in Detail

The first phase is used to transport the A and B operands to the divider. Everything should be set up for dynamic logic at the f0b edge. Because EV8 doesn't support denormals, both the divisor and dividend fractions are correctly normalized and can be thought of as always being between 1.0 and 2.0 (1.0 <= operand fraction < 2.0). The logic to compute the exponent is done in the add pipe and is uninteresting.

For the next 10 cycles the actual divide occurs. The divider array consists of 6 stages and a recirculating mux. Because each stage retires one quotient bit, the divider ends up computing 6*10 = 60 bits, which is more than sufficient since only 56 are needed to correctly round. All 6 stages are substantially identical to each other. The only exception is that two stages have a ghost latch built into them. This results in virtually zero latch overhead for the divider.

Each divider stage is composed of two different pieces. The low 49 bits of the remainder are kept in a redundant sum/carry format; carry save adders (CSA) are used for the divisor add on this portion of the remainder.
All the remaining high remainder bits are kept in an exact, fully encoded form. Because there is a 1-bit left shift after each stage, signals spill over out of the CSA array. These spillover signals have the numerical values 1/2, 1/4, 1/4. The spillover signals skip a stage before they are incorporated into the state machines, so by the time they are incorporated they have numerical values of 1, 1/2, 1/2. The total uncertainty of the divider is 3 1/2. This number is calculated as follows:

CSA portion           (1/8 + 1/16 + ...) * 2     1/2
Spillover Stage -1    (1/2 + 1/4 + 1/4)          1
Spillover Stage -2    (1 + 1/2 + 1/2)            2
Total                                            3 1/2

To compensate for the rather large uncertainty of 3 1/2, an over-redundant digit selection is used. The range of allowed digits is {-2, -1, 0, 1, 2}. This requires the exact remainder to be bounded by +-4*divisor (R < (R - 2*divisor)*2 -> R < 4*divisor). This has the effect of changing the mux in the CSA array from 3 to 5 inputs. Luckily, 2 times the divisor is available by looking 1 bit to the right.

The dynamic state machine circuit is a 4-high N stack. One transistor is needed for the input, one for a mux to handle the various divisors, and 2 to do the spillover add. There are three signals coming from the CSA array as part of the spillover add. Since two of these come from the same CSA, they were condensed into a single 4-bit vector (2 vectors -> {0,1}{0,1/2} become one vector -> {0, 1/2, 1, 1 1/2}).

The NMOS devices in the state machine that handle the various divisor combinations were problematic for speed. There are a total of 8 different divisor combinations (1.000, 1.001, 1.010, 1.011, ...) the state machine must handle. To speed up this logic, the state machine was broken into 2 pieces: one state machine handles divisors between 1.0 and 1.5, and the other handles divisors between 1.5 and 2.0. This reduced the size of the muxing logic, which meant much less source/drain capacitance.

When the divide is finished, information is needed about the remainder. There are 8 different ranges that the remainder could be in:

R = -4d
-4d < R < -2d
R = -2d
-2d < R < 0
R = 0
0 < R < +2d
R = +2d
+2d < R < +4d
The zero detect uses logic similar to the leading zero detect logic used elsewhere in the box. The zero detect is faster than the sign detect. The one case in the table above that the hardware can not handle is the R=-4d case. This is a degenerate case that can only occur when the divisor is precisely 1.0000 ... 0 (exponent can be anything so it is really when the divisor is a precise power of 2) When this case occurs with the hardware described above, the round state machine will add +2d to the remainder. The resulting sum will then be -2d which is nonzero and negative making the downstream logic think the -2d<R<+2d case was encountered. When doing infinity rounding the final answer will be 1 LSB greater than the correct answer. For chop rounding mode the answer would be correct, however, it would appear to be inexact. An existing zero detect in the add pipe tells the divider that the divisor is 1.0000... 0 When this is known the divider forces the rounding mode to be chopped. Because dividing a number by 1.0000... 0 is always exact, the rounding mode is irrelevant. The logic also masks the inexact flag. 8.9.3 Over-Redundant Digits to Binary and Rounding The quotient digits coming from the divider array are in the form of {-2,-1,0,l,2} and must be converted to a binary number with correct rounding and normalization. I included a figure showing this logic. This conversion is done in two parts. First, {-2,l,O,l,2} must be changed into {-2,0,2}. This conversion is done with two stages of logic and relies on the fact that I can add 1 to a digit as long as I subtract 2 from the digit to its right. Likewise, I can subtract 1 and add 2. The first stage looks at each digit and determines whether it is negative or positive and also if it is even or odd(even=-2,0,2 odd=-1,+1) The second stage then does the following (in C syntax): switch(This digit) CASE -2: if (left is odd) then return(O); CASE -l:if (right is positive AND left is odd) then return(+2); i f (right is positive AND left is even) then return(O}; i f (right is negative AND left is odd) then return(O); i f (right is negative AND left is even) then return(-2); CASE 0: CASE if (left is odd) then return(-2); 1: if (right is positive AND left is odd) then return(O); 5 January 2001 ·-Subject To Change Compaq Confidential Floating-Point Execution Units - the Fbox 8-41 Fbox Divider - F____ DIV i f (right is positive AND left is even) then return(2); i f (right is negative AND left is odd) then return(-2); i f (right is negative AND left is even) then return(O); CASE 2: if (left is odd) then return(O); No carry propagation is required for this. The next conversion is from {-2,0,2} to {0,1,2}. This is accomplished by shifting things left 1 bit so I now have {-1,0,1 }. This can be converted a standard sum carry format by a simple mapping of -1->0,0 0-> 0,1 1->l,l. This conversion is so simple no additional logic stages were needed. The ORL block in the figure generates sum carry outputs. Because every bit needs to know the sign of the bit to right, a complication is introduced. The least significant digit (stage 6) can't be converted because it needs to know the sign of stage 1 of the next pass. This is handled by pipelining the stage 6 digit one cycle so that it gets converted with the following pass. Its sign is still needed in the current pass for the stage 5 digit. Rounding for the divider is virtually free. Two observations about division make this possible. 
First, whether the quotient needs a one bit normalization shift can be determined with out ever doing a divide. Simply if the dividend fraction is greater than or equal to the divisor fraction no normalization shift will be needed. This is independent of rounding modes. Second, the infinitely precise quotient can never be exactly half way between two representable numbers. This means that the IEEE round-to-even case never occurs. Here's why these two observations are so powerful. To do round to nearest all that is necessary is to add~ LSB to the quotient. And for infinity rounding, 1 LSB should be added to the quotient with a -1 added to the smallest possible digit. I can do all of this because I know where the LSB is. In the figure the rounding is accomplished with the CSA blocks while the quotient is still in a redundant form. The 'magic rounding vector' is computed by control logic based on the rounding mode, the datatype precision, and the normalization shift result. The total overhead for rounding in this divider is a dynamic CSA delay. Because a CSA shifts the carry one bit left I end up with a seven bit vector. This is corrected by pipelining the most significant carry until the next pass. The result of the CSA stage is two 6 bit vectors ready to be added. One approach to generate the final binary quotient from the sum carry vectors would have been to accumulate the vectors until the divide finished. A 52 bit carry propagate add would then be performed to yield the binary quotient. This method requires building a fast adder in the datapath. The method PCA employed is different. The add is performed serially six bits at a time. This removed the fast adder from the datapath. It's also faster to build a 6 bit adder than a 52 bit. The hardware to do the serial add is located with the over-redundant logic in the control section. Basicly 2 six bit adders were built. These two adders compute the sum of the sum carry vectors with a carryin of 0 and of 1. In addition, I detect the case where the sum of the two vectors would result in a carry out. This is the generate term. The prop signal is asserted when the carryout of the block is equal to the carryin. The two 6 bit sums (quo<5:0>,quo_plus_one<5:0>) are then routed to the datapath. These are muxed into two registers(QO,Ql). After the first pass the 6 most significant bits of the QO,Ql registers receive the sums. On the second pass, the six less significant bits get loaded. This continues until the divide finishes and both registers have been loaded. The QO register gets loaded with the quo<5:0> signal and the Ql with quo_plus_one<5:0> signal. The loading mechanism on the diagram is accomplished with the Ml,M2 muxes. 8-42 Compaq Confidential Floating-Point Execution Units - the Fbox 5 January 2001 ···Subject To Change Fbox Divider - F....DIV You'll notice that once the bits of the register are loaded they are recirculated with some extra muxes(M3,M4) in the path. These muxes propagate carries across bits and work this way: If the current block doesn't have PROP or GENERATE asserted then the more signifi- cant bits already in the registers can never get a carry from this block. For this case I force QO=QO and Ql=QO. For the case where this block does assert GENERATE then I force QO=Ql and Ql=Ql. The last case is where the block assert PROP then I don't know whether the more significant bits already loaded will receive a carry because it determine by a the carry out of a future block. For this case I force QO=QO and Q 1=Q1. 
This keeps the status quo until I eventually encounter a block that I definitely know the carryout status. I switch the entire register with the muxes. The high bits that have their carryin determined are unaffected by this muxing action because for them QO is equal to Ql. The bits of the current block are also unaffected because I placed the loading mux (Ml,M2) after M3 and M4. The bits to the right of the current block are don't cares because I will eventually overwrite their values with a load operation. As you can see a fast adder was replaced with 2 extra lactched and 2 extra muxes plus a 6 bit adder in the control section. A good way to think about the quotient registers is that Ql=QO+k. The infinitely precise result is always bounded between these two registers. Every pass through the divider increases the precision of QO and Ql by 6 bits. Put another way, for every pass through the divider, k gets reduced by 64(2A6). By the time the divide is finished k is less than one LSB. The MO mux at the top of the diagram is used to pick whether the final rounded quotient should the QO or Ql register. The M5 and M6 muxes perform the one bit right normalization shift if necessary. Back in the control section all of the signals that had to be pipelined one pass are now needed for rounding since there won't be another pass. These are fed into a PLA along with the rounding mode in effect and the remainder add selection from the round state machine. The output of the PLA is whether QO or Q 1 should be used for three different cases. The three cases are if the final remainder is less than zero, greater than zero, or plain zero. The result of the sign and zero detect in the datapath swing this 3 to 1 mux which then drives the final MO mux at the top of the quotient registers. This result is then written into the register file and forward for use the next phase. 5 January 2001 ·- Subject To Change Compaq Confidential Floating-Point Execution Units - the Fbox 8-43 Fbox Square.. Root Unit - F____ SQR Figure 8-9 F_DIV Block Diagram DIVIDEND<50:52> 0 DIVIDEND<0:49> 0 5 5 5 5 5 SM1 SM2 SM3 SM4 SM5 5 ROUND CSA 3 ROUND SM ZERO DETECT SIGN DETECT 8.10 Fbox Square-Root Unit- F_SQR The F_SQR unit is responsible for computing the square root of the fraction of the both VAX and IEEE SQRT instructions. The square root unit does not have an exponent processor. It receives the input operand from the MUL unit and returns the result to the divide unit. The divider unit selects either the divide result or if no divide result need to be transferred the square root result. This result is sent to the F_AP2 pipeline for rounding. The square root unit computes the sticky bit for rounding and sends it along with the result. Since the square unit uses the F_AP2 pipeline for rounding, a bubble needs to be inserted in the F_AP2. For this purpose the square unit sequencer sends a 'square root done' signal to the Qbox <tbs> cycles ahead so that Qbox stops issuing a new instruction to the pipeline. In addtion, it is possible to have a divide and a square collide for using the F_AP2. To prevent this square root receives a signal from the divide unit which is used to delay the bubble request. There is no exception checking in the square root unit. The square root unit assumes its operand is a non zero operand and injects the hidden bit. The input exceptions including the zero case is handled by the F_SHP unit. It is possible to abort the square root unit during a square root by asserting the flush signal. 
Operation

The square root unit uses an SRT-type algorithm and computes two root bits per cycle. As shown in the block diagram, the square root unit consists of two identical cascaded sections, one section per bit, and a sequencer. The fraction part consists of an RTB_IN_MUX that selects the input operand in the first cycle and the second row's partial remainder in subsequent cycles. The RTB_CORR_MUX provides the correction to be added to the partial remainder; its select controls are based on the previous remainder's range. The correction term and the partial remainder are added in the signed-digit-plus-binary adder to compute the redundant new partial remainder. The index register is a shift register that keeps track of the insertion point of the root bits. The root register and its logic serve two purposes: saving the new root bits and converting the root into a binary value so that it can be used for the next iteration. The sequencer, depending on the data type, sequences the operand input, bubble requests, and result transfers.

Figure 8-10 F_SQR Block Diagram (blocks: Radicand register, RTB_IN_MUX, RTB_CORR_MUX, RTB_INDEX_REG, RTA_INDEX_REG, RTA_ROOT_REG, SEQUENCER; signals: FLUSH_SQRT, SQRT_ISSUE, SQRT_DATATYPE, DIV_BUBBLE_REQ, CLOCK, SEL, SQRT_DONE)

8.11 Fbox Graphics Pipeline

A graphics instruction set has been added to the Alpha architecture (ECO 118). The following is a brief summary of the ECO for ready reference. The next section provides a brief description of the implementation.

The Fbox implements the new paired single-precision instruction set. These paired SP instructions are intended to accelerate the front end of the 3D graphics pipeline - object physics, geometry transformations, clipping, and lighting calculations. The proposed instruction set also includes several instructions that aid in calculations involving complex numbers. The instructions use the now-popular single-instruction, multiple-data approach: two single-precision operands are packed into a 64-bit register, and each instruction operates on two sets of operands, thus doubling the performance of these operations. They use the existing floating-point register file. The graphics ISA is general enough for many single-precision applications to gain a significant performance improvement.

There are 36 new instructions in this proposal. The proposal is to use one new opcode (07 hex) for the graphics instructions. Since the graphics ISA uses only paired SP/LW integer data types, we will use the source datatype field (2 bits) to expand the function field to 6 bits. All other bits of the FP operate instruction remain the same.

8.11.1 Paired SP Floating-point Operate Instruction Format

Table 8-18 Paired SP Floating-point Operate Instruction Format

Bits:     <31:26>  <25:21>  <20:16>  <15:13>  <12:11>  <10:5>  <4:0>
Contents: Opcode   Fa       Fb       Trp      Rnd      Fnc     Fc

Bits <10:9> used to be the SRC field; they are now part of the FNC field. The two data types used by the new instructions are shown below.

8.11.2 Register and Memory Formats

Table 8-19 Paired Single-Precision Format

Operand High:  <63> Sign   <62:55> Exponent   <54:32> Fraction
Operand Low:   <31> Sign   <30:23> Exponent   <22:0> Fraction

8.11.3 Rounding Modes

All four IEEE rounding modes are supported.
For PARCPLx and PARSQRTx instructions, only chopped rounding mode is available and the round mode bits are ignored. 8.11.4 Exceptions All exceptions as defined by the IEEE standard are generated individually on each half and the exceptions are ORed together to report. It is possible to get two different exceptions. The results written for each half follow the existing rules specified by the Alpha Architecture. Note that all 64 bits of the register are always written. It is possible to get no exception result on one half and an exception result on the other. The FPCR status flags are updated per the existing rules for the regular floating-point instructions. The regular floating-point trap control mechanism is also used for the graphics instruction set. 8-46 Compaq Confidential Floating-Point Execution Units - the Fbox 5 Jc1m.1c1ry 2001 -- Subject To Change Fbox Graphics Pipeline 8.11.5 Paired Single-Precision Instructions Table 8-20 lists the paired single-precision instructions. In the table, II means register concatena- tion. Table 8-20 Paired Single-Precision Instructions Instruction Opcode Operation PADD fa,fb,fc 07.10 fcH f- f aH + fbH,fcL f- faL + fbL PARCPH fb,fc 07.09 fcH f- 1/fbH, fcL f- 0 PARCPL fb,fc 07.0A fcH f- 1/fbH, fcL f- 1/fbL PARCPLLfb,fc 07.0A fcH f- 0, fcL f- 1/fbL PARSQRT fb,fc 07.0C fcH f- 1/SQRT(fbH), fcL f-1/SQRT(fbL) PARSQRTHfb,fc 07.0D fcH f- 1/SQRT(fbH), fcL f-0 PARSQRTLfb,fc 07.0E fcH f- 0, fcL f- llSQRT(fbL) PCADD fa,fb,fc 07.11 PCMPEQ fa,fb,fc 07.28 PCMPLE fa,fb,fc 07.2D PCMPLT fa,fb,fc 07.2C fcH f- f aH + fbL,fcL f- faL + fbH IF (faH .xx. fbH) THEN fcH f- IEEE 1.0 ELSE fcH f- 0 (true zero) IF (faL .xx. fbL) THEN fcL f- IEEE 1.0 ELSE fcL f- 0 (true zero) N01E3,4 IF (faH .xx. fbH) THEN fcH f- IEEE 1.0 ELSE fcH f- 0 (true zero) IF (faL .xx. fbL) THEN fcL f- IEEE 1.0 ELSE fcL f- 0 (true zero) NOTE 3,4 IF (faH .xx. fbH) THEN fcH f- IEEE 1.0 ELSE fcH f- 0 (true zero) IF (faL .xx. fbL) THEN fcL f- IEEE 1.0 ELSE fcL f- 0 (true zero) N01E3,4 IF (faH .xx. fbH) THEN fcH f- IEEE 1.0 ELSE fcH f- 0 (true zero) IF (faL .xx. fbL) THEN fcL f- IEEE 1.0 ELSE fcL f- 0 (true zero) N01E3,4 IF (faH .xx. fbH) THEN fcH f- IEEE 1.0 ELSE fcH f- 0 (true zero) IF (faL .xx. 
fbL) THEN fcL f-IEEE 1.0 ELSE fcL f- 0 (true zero) N01E3,4 PCMPNEQ fa,fb,fc 07.29 PCMPUN fa,fb,fc 07.2A PCPYS fa,fb,fc 07.20 N01El NOTE 1 fcH f- faH<s> II fbH<exp.frac>,fcL f- faL<s> II fbL<exp.frac> NOTE7 PCPYSE fa,fb,fc 07.22 fcH f- faH<s.exp> II fbH<frac>, fcL f- faL<s.exp> II fbL<frac> NOTE7 PCPYSN fa,fb,fc 07.21 fcH f- NOT.faH<s> II fbH<exp.frac>, fcL f- NOT.faL<s> II fbL<exp.frac> PCVTFI fb,fc 07.39 N01E7 FcH f- cvt.integer(fbH), FcL f- cvt.integer(fbL PCVTSP fa,fb,fc 07.30 fcH f- {CVT 64b SP to 32b SP of fa}, fcL f- (CVT 64b SP to 32b SP of fb) PEXTH fb,fc 07.3C fc f- {CVT 3 2b SP of fbH to 64 SP) PEXTL fb,fc 07.3A fc f- {CVT 3 2b SP of fbL to 64 SP) PFMAX fa,fb,fc 07.2E fcH f- MAX(faH,fbH), fcL f-MAX(faL, fbL) PFMIN fa,fb,fc 07.2F fcH f- MIN(faH,fbH), fcL f-MIN(faL,fbL) 5 January 2001 ··· Subject To Change NOTE5,7 NOTE6.7 Compaq Confidential Floating-Point Execution Units - the Fbox 8-47 Fbox Graphics Pipeline Table 8-20 Paired Single-Precision Instructions (Continued) PHADD fa,fb,fc 07.12 fcH ~ faH + faL, fcL ~ fbH + fbL PHSUB fa,fb,fc 07.16 fcH ~ faH - faL, fcL ~ fbH - fbL PHSUBRfa,fb,fc 07.lF fcH ~ faL - faH, fcL ~ fbL -fbH PMOVHH fa,fb,fc 07.lB fcH II fcL ~ faH II fbH PMOVHL fa,fb,fc 07.lA fcH II fcL ~ faH II fbL PMOVLH fa,fb,fc 07.19 fcH II fcL ~ faL II fbH PMOVLL fa,fb,fc 07.18 fcH II fcL ~ faL II fbL PMUL fa,fb,fc 07.00 fcH ~ faH * fbH, fcL ~ faL * fbL PMULH fa,fb,fc 07.02 fcH ~ faH * fbH, fcL ~ faL * fbH PMULHN fa,fb,fc 07.06 fcH ~ -(faH * fbH), fcL ~faL * fbH PMULL fa,fb,fc 07.01 fcH ~ faH * fbL, fcL ~ faL * fbL PMULLN fa,fb,fc 07.05 fcH ~ faH * fbL, fcL ~ -(faL * fbL) PSUB fa,fb,fc 07.14 fcH ~ faH - fbH,fcL ~ faL - fbL PSUBC fa,fb,fc 07.15 fcH ~ faH - fbL,fcL ~ faL - fbH NOTE2,7 NOTES NOTES Notes from Table 8-20: 1. The result for these two instructions is accurate to a minimum of 14 bits of precision only. For PARCPLx and PARSQRTx instructions, only chopped rounding mode is available and the round mode bits are ignored. For PARCPLH, PARCPLL, PARSQRTH, and PARSQRTL instructions, no checking is done on the unused operand and no exceptions are generated. 2. With fb = f31, the lower half of the result can be cleared. Similarly high half can be cleared with fa = f3 l. Operands can be swapped or duplicated with fa =fb. 3. There are no separate graphics branch instructions. With these compares one needs to use the regular floating-point branches. There was a suggestion for a specialized Branch for clip test - branch when either is negative. This can be accomplished as PCMPLT fa, f31, fc; test faH/faL LT 0, fc = 0 only if both positive FBREQ fc,X 4. Paired single-precision compares write the destination if condition is TRUE with a value of IEEE 1.0, if the condition is FALSE, a true zero. This behavior is different from normal floating-point compares. A combination of CMP and multiply instructions can be used to conditionally clear a register, for normal operands. PCMPxx fl, f2,f3 PMUL fl ,f3 ,f4 ;Each half of f3 gets 1.0 if true, 0.0 otherwise ;Each half of f 4 gets fl (H/L) if condition is true, 0 otherwise. 5. PCVTSP instruction converts two 64 bit single-precision floating- point operands (with 11 bit exponents) to two 32 bit SP (with 8 bit exponents) and packs them as paired single-precision operands. Dropping three bits (61 :59) and ignoring (28:0) does the conversion. No checking is done and there are no arithmetic exceptions. {The operation is similar to a STS instruction.} 6. 8-48 PEXTx instruction converts a 32 bit SP in paired format to a 64-bit SP number. 
The operation uses MAP_S exponent mapping (see SRM Chapter 2 Table 2-2) similar to a LDS instruction. No checking of bits is done. Compaq Confidential Floating-Point Execution Units - the Fbox 5 J~1nu(1ry 2001 -~ Subject To Change Fbox Graphics Pipeline 7. PMOVxy, PCPYSx, PCVTSP, PEXTx are bit manipulation instructions and generate no arithmetic exceptions. 8. For this instruction the product is rounded first and then the appropriate half (specified by the instruction) of the result is negated. The graphics instruction set is implemented in two pipelines in the Fbox. The PMULxx, PARCPLx, and PARSQRTx instructions are implemented in the F_MUL pipeline. All the other instructions are implemented in a separate pipeline F_GAD. The implementation details for the PMULxx, PARCPLx, and PARSQRTx instructions are described under the F_MUL section. In the next section the implementation details for the F_GAD are given. 5 January 2001 -~ Subject To Change Compaq Confidential Floating-Point Execution Units - the Fbox 8-49 Fbox Graphics Pipeline Figure 8-11 F_GAD Block Diagram for One-Half of the Pair MUX er IC elxd ticky LEFT SHIFTER EXP_RES_ADD er Fl2 er+1 VF/UNF DETECT RND_CSA B ADD (32b) Fr msb MUX Er 8.11.5.1 Graphics Add Pipeline: F_GAD A block diagram of the graphics add pipeline is shown in Figure 8-11. The F_GAD pipeline has two identical units - one for each half of the paired data. In the block diagram only the high half (F_GHx) of the F _GAD pipeline is shown. The low half (F_ GLx) is identical to F_GHx and it processes operands in the low half (31 :0 ) of the pair. The F_GAD has been leveraged from the 21264 Fbox. 8-50 Compaq Confidential Floating-Point Execution Units - the Fbox 5 Jc1nuc1ry 2001 ··· Subject To Change Fbox Graphics Pipeline The F_GAD is a 4 cycle latency pipeline. The input operands to the pipeline are driven by the Multiplier pipeline instead of the interface section, to minimize the length of the low swing operand wires. The input operands for the GAD pipeline are single precision floating point operands with 23b fractions and 8b exponents . Hence at the beginning of the pipeline the fraction datapath is narrower. To accommodate the convert from floating point to 32b integer instructions the shifter and the final adder are 32b wide. The implementation of add/sub instructions in the GAD pipeline is slightly different compared to the main Fbox add pipeline. Referring to the section Fbox add pipeline, in the 'near domain', instead of subtracting the smaller operand at the very beginning of the pipe (step 2N) using an adder, the two operands are first normalized (left shifted) removing the bits that produce the leading zeros. This requires an additional left shifter in the data path. Once the two operands are normalized, the subtraction is done in the final adder/rounder. Since there is only one adder in the path, for the ediff = 0 case, the final subtraction must produce a positive result to conform to the sign-magnitude representation of the result. To ensure this, the smaller operand is to be always subtracted from the other. This is accomplished by first comparing the two fraction operands (the exponents are same) to determine the smaller operand. In addition to the regular floating-point instructions, the GAD pipeline needs to implement several variations of add/sub instructions, Moves, floating MAX/MIN instructions. The add/sub instructions and the MOVE instructions require selection of different halves of the two operands. 
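Since every one of these operations begins by selecting halves of the packed 64-bit operands, a small software model of the paired single-precision packing may be helpful. The sketch below is a minimal illustration, assuming the high operand simply mirrors the low operand's layout in bits 63:32 of the register; the function names are illustrative and not part of the design.

#include <stdint.h>
#include <string.h>

/* Pack two IEEE single-precision values into one 64-bit register image:
 * low operand in bits 31:0, high operand in bits 63:32. */
static uint64_t pack_pair(float lo, float hi)
{
    uint32_t lo_bits, hi_bits;
    memcpy(&lo_bits, &lo, sizeof lo_bits);
    memcpy(&hi_bits, &hi, sizeof hi_bits);
    return ((uint64_t)hi_bits << 32) | lo_bits;
}

/* Unpack the two halves again. */
static void unpack_pair(uint64_t reg, float *lo, float *hi)
{
    uint32_t lo_bits = (uint32_t)reg;
    uint32_t hi_bits = (uint32_t)(reg >> 32);
    memcpy(lo, &lo_bits, sizeof lo_bits);
    memcpy(hi, &hi_bits, sizeof hi_bits);
}

/* PADD modeled in software: fcH <- faH + fbH, fcL <- faL + fbL. */
static uint64_t padd_model(uint64_t fa, uint64_t fb)
{
    float al, ah, bl, bh;
    unpack_pair(fa, &al, &ah);
    unpack_pair(fb, &bl, &bh);
    return pack_pair(al + bl, ah + bh);
}

A variant such as PADDC or PHADD differs only in which halves are routed to each adder; that routing is exactly the OP_MUX selection described in the next section.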
The FMAX/FMIN instructions are similar to the compare instructions. The operation and the implementation of the GAD pipeline are described below using the high half of the pipeline, as shown in the block diagram; the description applies equally to the low half.

8.11.5.2 Fraction Datapath

8.11.5.2.1 OP_MUX

The OP_MUX selects the two operands for each half of the pipeline from the four possible operands (the two packed halves of each source). For the high half of the pipeline the final operands are f1h and f2h on the fraction side and e1h and e2h for the exponents. For the normal PADD/PSUB instructions, f1h = fah and f2h = fbh; for PADDC/PSUBC instructions, f1h = fah and f2h = fbl; for PHADD/PHSUB, f1h = fah and f2h = fal; and so on. For the PMOVxx instructions the OP_MUX selects the operand indicated by the instruction onto f2h and zero onto f1h, and similarly for the exponent and sign parts of the operands. This allows the selected operand to pass through unchanged on the high half and on the low half before the two halves are packed into a 64-bit result. For the PCVTSP instruction, three bits are dropped from the exponent parts of the Fa and Fb registers, and the fraction, sign, and exponent are selected onto f2h/f2l, s2h/s2l, and e2h/e2l respectively. The output of the mux is available at the end of F0A.

8.11.5.2.2 FTA, FTB

The FTA and FTB blocks count the number of trailing zeros in the two operands f1h and f2h for calculating the sticky bit. The results of this block are the two signals etz1 and etz2. For effective add/subtract operations, since it is not known which operand is smaller until the exponent subtract is done, trailing zeros are counted for both operands. The sign of the exponent difference indicates the smaller operand, and this is used in calculating the sticky bit. If ediff is the exponent difference, which indicates the right alignment shift, the sticky bit can be calculated by comparing ediff and etz: if etz < ediff, then a '1' must have shifted out and hence sticky is 1.

8.11.5.2.3 FGT

The FGT block compares the two fractions f1h and f2h and produces a signal f1h > f2h ('agtb'). This signal is used for the effective subtract with ediff = 0 case, PCMP, and PFMAX/PFMIN instructions. During an effective subtract operation when ediff = 0, the smaller operand has to be subtracted from the other operand; this signal is used to correctly complement the smaller operand in the FI2 mux. For the compare instructions, this signal is used to determine whether the indicated condition is TRUE or FALSE based on the fraction compare. During the PFMAX/PFMIN operations, this signal is used to pick the correct fraction to be passed to the result in the FI2 mux.

8.11.5.2.4 LXD and EXP PRED

The LXD block calculates the leading 1/0 position using the input operands f1h and f2h for the exponent-difference < 2 cases. It computes the LXD for three cases: ediff = 0 (f1h - f2h or f2h - f1h), ediff = 1 (f1h - f2h/2), and ediff = -1 (f2h - f1h/2). The EXP PRED block at the same time examines the two least significant bits of the two exponents and predicts the exponent difference using the logic presented in Table II. One of the three possible LXD vectors is selected based on this prediction. This result vector contains a '1' for all possible positions of the leading 1, where the very first '1' in the vector is the correct position. To be able to drive the left shifter for normalization, this vector needs to be stripped of the unnecessary 1s.
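The FTA/FTB sticky test above (sticky is 1 exactly when the trailing-zero count is smaller than the alignment shift) is easy to state directly in software. The following C sketch is a minimal illustration under that reading, with hypothetical names; the real hardware compares the quantities with a CSA and a carry chain rather than by shifting.

#include <stdint.h>

/* Count trailing zeros of a fraction (the 'etz' of the FTA/FTB blocks). */
static unsigned trailing_zeros(uint64_t frac)
{
    unsigned n = 0;
    if (frac == 0)
        return 64;              /* nothing but zeros can ever shift out */
    while ((frac & 1) == 0) {
        frac >>= 1;
        n++;
    }
    return n;
}

/* sticky = 1 iff a '1' bit would be lost when frac is right-shifted by ediff. */
static int sticky_bit(uint64_t frac, unsigned ediff)
{
    return trailing_zeros(frac) < ediff;
}

Section 8.13 revisits the same idea for the main Fbox datapath, where the constant offsets in Table 8-24 account for the different operand formats.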
8.11.5.2.5 LXS and LXE The LXS block strips the extraneous '1 's after the leading 1 from the LXD vector from the previous stage. This vector, containing a '1' in the leading 1 position and zeros everywhere else, is wired to the left shift control to shift the input operands left. The LXS vector is also encoded in the LXE block so that the left shift amount can be subtracted from the exponent result. 8.11.5.2.6 Fl1/Fl2 MUX and the LEFT/LR Shifters The Fll/F12 muxes select the input operands for the two shifters. The left shifter is used only during the effective subtract operation to shift the flh operand left to remove the leading zeros before they are input to the adder in the final stage. For all other operations the left shifter simply passes the input operand. The L/R shifter which is capable of shifting the operand left or right is used in all operations. Since conversion from floating to integer operationrequires a 32b result, the L/R shifter is 32b wide, whereas the left shifter is only 24b wide. The left shifter takes control directly from the LXS 8-52 Compaq Confidential Floating-Point Execution Units - the Fbox 5 Jc1nuc1ry 2001 ··· Subject To Change Fbox Graphics Pipeline vector. The L/R shifter is controlled by LXS and by the ediff from the exponent data path. The following table lists the operand selected by the Fil, FI2 muxes, control and the conditions. The operands are shown as frac_l or fract_2 etc. Table 8-21 Fl1/Fl2 Shifter Operand/Control Selection Control LSHF LRSHF Left Shifter Input High Input exp_l >= exp_2 L(O) R(ediff) frac_l 0 frac_2 exp_l < exp_2 L(O) R(ediff) frac_2 0 frac_l exp_l >= exp_2+1 L(O) R(ediff) frac_l * 2 0 frac_2 * 2 exp_l+l < exp_2 L(O) R(ediff) frac_2 * 2 0 frac_l * 2 exp_l == exp_2 L(lxd) L(lxd) frac_l * 2 frac_2 * 2 0 exp_l == exp_2+1 L(lxd) L(lxd) frac_l * 2 frac_2 0 exp_l+l == exp_2 L(lxd) L(lxd) frac_2 * 2 frac_l 0 exp_2 - bias>= 22 L(O) L(ediff) 0 frac_2 0 exp_2 - bias < 22 L(O) L(ediff) 0 0 frac_2 PCVTFF L(O) R(O) 0 0 frac_2 PCPYSX L(O) R(O) 0 0 frac_2 L(O) R(O) 0 0 frac_2 NaN x L(O) R(O) 0 0 frac_l NaN NaN L(O) R(O) 0 0 frac_2 L(O) R(O) 0 0 0 AGfB L(O) R(O) 0 frac - 1 0 BGfA L(O) R(O) 0 frac_2 0 AGfB L(O) R(O) 0 frac_2 0 BGfA L(O) R(O) 0 frac_l 0 Oper. Condition Low Input PADDx PSUBx Effective add: Effective sub: PCVTFI NaNs: where xis not a Nan Fl F2 x NaN PCM PX PFMAX PFMIN 8.11.5.2.7 RND CSA and ADDER The RND CSA and the final ADDER perform rounding and the final addition/subtraction. The round CSA enables rounding and add in one step. The rounding CSA combines the two fraction outputs from the shifters and a rounding constant and prepares the two inputs for the adder. The rounding constant depends on the rounding mode, Compaq Confidential 5 January 2001 ~· Subject To Change Floating-Point Execution Units - the Fbox 8-53 Fbox Graphics Pipeline sticky bit from the sticky bit logic. The rounding constant is added in the CSA in the least significant bit positions. Since there are more than 2 inputs, a CSA in these positions enables adding them. In the high order bit positions a half adder is used. The adder computes two results - one assuming the MSB = 0 and the other assuming that the MSB = 1. If MSB = 1, the fraction needs to be shifted down and the exponent has to be adjusted. Note that effective sub, when the ediff > 1, actually may need a 1 bit normalization. In order to avoid looking at two bits (hidden bit and the bit below it), the two input operands have been shifted left by 1 in the shifter input muxes as shown in the Table. 
This preshift is also used for the effective sub ediff < 2 cases also to move the 1 bit uncertainty in the LXD logic so that it can be detected as an overflow and corrected. 8.11.5.3 Exponent Data Path The exponent data paths is simpler than the fraction data path and deals with only 8b exponent parts of the operands. The exponent section receives the exponent parts of the final selected operands from the OPD_MUX for each half of the pair. For floating point arithmetic operation it computes the exponent results and for bit manipulating instructions it simply passes one of the exponents. 8.11.5.3.1 EDIFF ADDER The ediff adder computes the absolute exponent difference for add/sub instructions and the length of the integer portion in the floating-point operand for the convert to integer operation. To calculate the absolute exponent difference, the ediff adder actually contains two adders - one that computes A-Band the other B-A. Based on the sign of the first adder (En), the positive result is selected and is used to drive the L/R shifter control. For the convert floating to integer operation, the bias needs to be subtracted from the input exponent to determine the integer portion of the floating-point value. This is done in the B-A adder. For PEXTx instructions the bias is subtracted in the B-A adder and DP bias is added back later to complete the conversion. For PCMPx and PFMAX/ PFMIN, the A-Badder provides the comparison of exponents 8.11.5.3.2 EDIFF DETECT The Ediff detect logic determines exponent ranges and if the exponent difference is 0,1, GfR 1, or GTR 25, etc. The exponent range is used to classify the input operands into denormals, infinities, NaNs, and zero operands. The ediff detects are used for choosing the near domain vs far domain, out of range in add/sub instructions, to determine if the exponent comparison for CMP type operation. 8.11.5.3.3 ER MUX The ER MUX selects the intermediate result for the exponent. For add/sub instructions it picks the MAX(el,e2), for convert type instructions the result of B-A adder, and for CPYSX instructions one of the input operand exponents. For PFMAX?PFMIN, depending on the instruction, MAX(el,e2) or MIN(el ,e2) is selected. The intermediate result 'er' is driven to the next stage. 8.11.5.3.4 EXP _RES_ADD The exp_res_add computes the two final results of the exponent. It computes exponent results Er and Er+ 1 - in case a fraction overflow occurs and the fraction has to be shifted down. The MSB from the fraction is used to determine which result. During the Compaq Confidential 8-54 Floating-Point Execution Units - the Fbox 5 Janwiry 2001 - Subject To Cfumge G....AD Control effective sub operations when the ediff <2, the 'elxd' - the normalization amount has to be subtracted from the intermediate exponent. For this, the mux selects the elxd from the fraction data path. Since the LXD logic can overestimate the left shift by upto 1 bit, it is possible to shift the fraction result right by one bit The Er+ 1 result is used in this case. During the PEXTx instructions, er represents the the unbiased exponent result. The DP bias is selected and and is added back to the 'er'. For all other operations, a zero is added to the intermediate 'er' and the MSB from the fraction data path is used to select the final result exponent. 8.12 G_AD Control The G_AD control pipeline reveves the opcode information during the FY cycle. In addition, it receives the rounding mode information from the F_SHP pipeline. 
The control pipeline (not shown in the block diagram) decodes the opcode and controls the fraction and exponent data path. The control logic also detects the various exception conditions and signals the exceptions to the F_SHP pipe to communicate to the Qbox using the trap disable bits from the FPCR. 8.12.1 Fraction Data Path The bits in the Fbox fraction data path are numbered as follows: Table 8-22 Fraction Data Path 25 Binary point Hidden bit for floating-point numbers Bits <AlO:AO> are used for Integer operands and D-type operands. The exponent data path, E<l 2:0> is 13 bits wide, with E<l 2>representing the exponent sign bit. The pipeline also maintains a sign (N) and a zero (Z) bit for each operand. Operand data is formatted onto the fraction and exponent data paths depending on the data type as follows: Table 8-23 Operand Data Fraction and Exponenet Data Paths BITS F/S/GfT D a L AlO:A7 0 0 0P<63:60> 0P<63>R=3 A6 0 0 0P<59> 0P<62> A5:AO 0 0 0P<58:53> 0P<58:53> BO l(NZ) l(NZ) 0P<52> OP<52> Bl:B3 0P<51:49> 0P<54:52> OP<51:49> 0P<51:49> B4:B52 0P<48:0> 0P<51:3> 0P<48:0> 0P<48:0> B53(R) 0 OP<2> 0 0 Compaq Confidential 5 January 2001 -~ Subject To Change Floating-Point Execution Units - the Fbox 8-55 Sticky Bit Calculation Note: • In D format, OP<l :0> are lost. OP<2> is used as R-bit for rounding in CVTDG. • In L format, operand is sign extended. The predicate(p) bit and sign bits are taken directly from RF<64:63>. 8.13 Sticky Bit Calculation The sticky bit represents the aggregated information of the bits shifted out of datapath in the process of alignment. With the help of the stick bit and round bit, we know whether the infinite precision result is above midway point or below midway point in round stage and the finite precision result can be rounded with specific rounding mode. Essentially, the sticky bit checks every bit shifted out of datapath to see if there is any '1'. If there is 'l', the sticky bit is set to 1; otherwise, it's 0. The most straight-forward way is to do a zero detect on the string of bits shifted out. However, this approach needs a 128 bit datapath in the worst case. A smarter approach is to calculate the number of trailing zeros in the operand that is going to be aligned. If the number of trailing zeros is more the the amount of right shift, we can be sure that bits lost by right shifting are all zeros. Hence it is concluded that the sticky bit is 0. Otherwise, there must be a one somewhere in the lost bits, which makes the sticky bit 1. Note that we don't care about the precise value of the lost bits. All we need to know is the relation between precise value and midway point. For instance, the number of trailing zeros (etrz) is 6. If the right shift amount (ediff for elxd) is more than 6, then Sticky bit is 1. If ediff, or elxd, is less than or equal o 6, then the sticky bit is 0. In physical implementation, we need to consider many more factors than the simple example. First, the R bit is in the datapath. Second, all formats are sharing a single 65 bit datapath, so we need to take their differet lengths into account. Besides, we need to handle integer and floating-point numbers differently. Third, the amount of right shift can be from ediff or elxd. In addition, the equations are affected by how we encode ediff and elxd, and the sign of them. Table 8-24 Equations of Sticky Bit Calculation 8-56 Condition Sticky eff. sub & fs: etrz - (edif+27) < 0 eff. sub & gt: etrz - (edif -2) < 0 eff. add & fs: etrz - (edif+28) < 0 eff. 
add & gt or CV1FQ & -en: etrz - (edif -1) < 0 cvtqf & FS: etrz - (29-elxd) < 0 cvtqf & gt: etrz - (-elxd) < 0 cvtts: etrz - (edif+28) < 0 cvtfq & en: etrz - (edif+63) < 0 right shift left shift Compaq Confidential Floating-Point Execution Units - the Fbox 5 Jc1m1ary 2001 ·- Subject To Change Sticky Bit Calculation In summary, these equations are composed of three elements: etrz, ediff(or elxd), and a constant to account for different situations. In physical implementations, sticky is computed with a CSA to compress the 3 numbers and a carry chain to detect the sign, which determine the value of sticky. Compaq Confidential 5 January 2001 -· Subject To Change Floating-Point Execution Units - the Fbox 8-57 Sticky Bit Calculation 8-58 Compaq Confidentia I Floating-Point Execution Units - the Fbox 5 J<1nw~ry 2001 ···Subject To Change 9 Memory Instruction Execution Unit - the Mbox The 21464 Mbox is responsible for executing Alpha memory reference instructions, including integer and floating point load and store, memory barrier, prefetch, write-hint, load-locked, and store-conditional. The Mbox processes several instructions per cycle, out of order. Each cycle, the Mbox can accept as many as three load instructions, and as many as two store instructions, for a maximum of four operations. Unlike the other function units, it is responsible for keeping track of memory reference instructions which have issued but not retired, and for ensuring that the final effect of memory reference instructions is equivalent to sequential execution of the thread, within the Alpha SRM definition of equivalence. The Mbox also receives fill data from the Cbox and, to maintain cache coherence, processes probes that the Cbox receives from the rest of the system. The Mbox has four instruction ports to handle loads, stores and prefetches. Three of these ports can support returning data from the Mbox. Thus, the maximum number of loads able to be issued to the Mbox each cycle is three. Two of input ports can perform loads and prefetches; one can perform Loads, Stores and Prefetches, and one port performs only Stores. There are two data input busses, each is associated with a Store port. The major components of the Mbox are: Table 9-1 Mbox Major Components Components Description Dcache 64KB of data storage, with a write-allocate, write-through write-policy Dtags lK entries of tag storage, arranged as 2-way set-associative with 4 read ports and 1 write port Load Queue 64-entry queue that holds issued, but not-retired Load addresses. Handles Load ordering traps and re-issuing of Loads Merge Buffer 16-entry buffer that accumulates Store data before writing it into the Dcache and Cbox. Pre-MAP Sixteen-entry buffer that holds the addresses of loads that have missed in the Dcache and need further activity in the Cbox. Store Queue 64-entry queue that holds Store addresses & data before Stores have retired. Used to satisfy Load requests to addresses with uncompleted Stores Translation Buffers 128-entry, fully-associative with 4 read ports to perform the virtual-to-physical address transactions Compaq Confidential 5 January 2001 ··· Subject To Change Memory Instruction Execution Unit - the Mbox 9-1 Major Inputs & Outputs 9.1 Major Inputs & Outputs 9.1.1 Inputs • TBS 9.1.2 Outputs • TBS Figure 9-1 Address and Data Path 9.2 Dcache The Data Cache, or Dcache, is a 64K-byte 2-way associative onchip data storage. 
The data is organized in 64-byte blocks, divided into 32 4-byte banks (address bits 5-2 indicate which bank contains which longword). Each bank can accept one read and one write a cycle. The reads share physical resources for accessing the Dcache. Three Address ports are input to the Dcache from the Ebox; three Data ports are output back to the Ebox. 9-2 Compaq Confidential Memory Instruction Execution Unit-the Mbox 5 J(1nuary 2001 ···Subject To Clumge Dtags There are three data ports to the Ebox upon which read data are transferred from the Dcache. Conflicts can arise if Loads request data from the same Dcache bank (address bits 5-2 of each Load address are identical). If this conflict occurs, a load is blocked from completing this cycle, and is retried again as soon as possible. Stores write into the Dcache during the second half of each cycle. The virtual address field of bits 14-6 indexes into the Dcache. The two quadwords of data (one from each way) that are addressed by this bit field and by bits 5 through 3 begin driving their data. At the output of the Dcache, a selector drives only one of the two quadwords out onto the Load data bus. This selection is provided by the Tag Store, which uses the rest of the upper address bits to decide which quadword is actually being requested. The Dcache also has data parity bits stored with the block data. 9.3 Dtags The Dtags, or tag stores, are address arrays organized similarly to the Dcache. There are four Dtag arrays, one for every load port to the Mbox, and one for the back-end operations (Stores, Fills, Invalidates, Victims). Each Dtag array is maintained as an exact copy of the others. Each holds lK tag entries; every tag entry corresponds to a physical block of Dcache. Nine bits of virtual address (bits 14-6) are used to index a set of two tag store entries. The Dtag dedicated to the back-end has an extra capability to access eight tag store entries (when bits 14 & 13 are not used) so that Invalidates (which use only physical addresses) can access the Dcache. Each tag entry has a physical tag field (bits 47-13), control bits that designate the cache state (valid, owned, and so forth), and it will likely have a parity bit across the tag field. Each set of two tag entries will also have an allocation bit. This bit signifies which entry way is the next to be allocated. The upper part of the physical address (bits 47-13) is compared to the tags being stored at each of the two accessed entries. If a tag matches the address, and the cache entry is determined to be in the proper state, a hit signal is generated for the operation. This hit information is used to generate the hit signal that is sent back to the Qbox. The hit information is also used to control which of the selected quadwords of Dcache data are to drive the Load data bus. 9.4 Load Queue The Load Queue, abbreviated as LQ, holds unretired Loads that have been issued to the Mbox. The LQ entries are allocated to the Loads in program order, by thread. The LQ is used to maintain ordering of Loads related to Stores and Memory Barriers. It also is used to re-issue Loads from the Mbox itself, when a Load cannot complete successfully when it is first issued from the Qbox. The LQ has 64 entries and is partitioned equally between threads at run-time. Thus, when a single thread is run ning, all 64 entries are allocated for that thread; if two threads are running, each is allocated 32 entries (the LQ being separated at its midpoint). 
When four threads are running, each thread is allocated 16 entries. 5 January 2001 ···Subject To Change Compaq Confidential Memory Instruction Execution Unit-the Mbox 9-3 Merge Buffer Each LQ entry contains bits 14-13 of the virtual address, the physical address, opcode, INum, Qbox information, a done bit, and retry bits. The LQ is allocated in program order (by thread) by the Qbox, which assigns load-serial (LNum) numbers for all Loads during the Mapstage. If the Load completes successfully, it is marked in the LQ during the Ml stage. Other- wise, the Load is marked to be retried. The Load may retry due to a cache miss, a bank conflict or may be a class of Load that can only complete at retirement (i.e., 1/0 Loads which cannot be done speculatively). Every cycle, the retry logic in the LQ scans all the entries and finds the oldest ready entry (in a given thread). Readiness is defined differently for each type of retry, but generally refers to when the Load can make further progress. The retry logic then sends the INum of the Load and other stored information to the Qbox. Retry candidates are chosen from different threads in a round-robin fashion. The LQ facilitates speculative execution of Loads by allowing Stores to check if a Load younger than it, in program order, may have completed (i.e., the Load returned data before the correct Store data had been sent to the Mbox). When the Store address operation dispatches from the Qbox, it checks the LQ. If a match is found, the oldest Load that matches the Store address is forced to trap (i.e., the Load INum is read out and sent to the Qbox). Note that this check is relevant only for Loads and Stores within the same thread. The LQ also facilitates speculative execution of Loads past Memory Barriers. This is made possible by allowing Stores from the Merge Buffer to check the LQ for possible address matches. In this case, a Store needs to trap a Load from another thread. Note that in this case, up to three Loads can match the Store's address and signal a trap simultaneously. LQ entries are deallocated once the Load is past the retire point. The Qbox sends the LQ an INum for each thread that corresponds to the youngest operation that is being retired. All LQ entries that are older than this INum are marked as being deallocated. The LQ drives a signal to the Qbox every cycle that specifies the youngest Load that may issue out of the Qbox. This signal is based on the youngest Load that is being deallocated and the number of available entries in the LQ. 9.5 Merge Buffer The Merge Buffer, abbreviated as MGB, holds Stores from the store queue after they have retired and before they have updated the Dcache. The MGB helps to accumulate Stores to the same cache block by allowing data from multiple Stores to fill up the same MGB entry. This accumulation reduces the number of unique write operations needed to send to the Dcache, thus reducing the Merge Buffer's bandwidth requirements on the back-end bus. The Merge Buffer has 16 entries. Two input ports, each providing a data and address path, are driven from the SQA and the SQD, and up to two entries can be allocated each cycle. The addresses compare against the existing MGB entries. Each entry contains a physical address, 64 bytes of data, and a 64-bit mask to indicate which bytes of the block have been written into the MGB. 
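A minimal sketch of one MGB entry and the byte-level merge is shown below; the field and function names are hypothetical, and the real entry also carries state not modeled here.

#include <stdint.h>
#include <string.h>

#define BLOCK_BYTES 64

/* One Merge Buffer entry: a block physical address, 64 bytes of data, and a
 * 64-bit mask recording which bytes of the block have been written. */
struct mgb_entry {
    uint64_t pa;                  /* physical address of the 64-byte block */
    uint8_t  data[BLOCK_BYTES];
    uint64_t byte_mask;           /* bit i set => data[i] holds store data */
};

/* Merge a retired store (offset/len within the block) into the entry. */
static void mgb_merge(struct mgb_entry *e, unsigned offset, unsigned len,
                      const uint8_t *store_data)
{
    memcpy(&e->data[offset], store_data, len);
    for (unsigned i = 0; i < len; i++)
        e->byte_mask |= 1ull << (offset + i);
}

/* A fill or a probe must observe the bytes the MGB already owns. */
static int mgb_byte_valid(const struct mgb_entry *e, unsigned i)
{
    return (e->byte_mask >> i) & 1;
}

When all 64 mask bits become set, the entry can ask for block ownership instead of a fill, as described below.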
9-4 Compaq Confidential Memory Instruction Execution Unit - the Mbox 5 Jc1nuary 2001 -· Subject To Change Pre...MAF If a Store address matches one already in the Merge Buffer, the Store's data are loaded into that entry, merging with the data there. The entry's byte mask is updated to reflect the new bytes that are being written. The Merge Buffer arbitrates with the Cbox for the DC_Data Bus. Once the MGB wins arbitration, it looks up the Dtags. If the cache block is writeable, the data block is updated in the Dcache and the Dtags are updated accordingly. If the block doesn't exist in the Dcache, a miss request is made to the pre-MAF. If all 64 bytes if the MGB entry have been written, a block ownership request is launched; other- wise a block fill is initiated. Fills may need to merge data from the Merge Buffer before it updates the Dcache. Hence, Fill addresses search the Merge Buffer for a match at the same time that the Dtags are looked up. If a match is found, the Merge Buffer drives the valid bytes onto the DC_Data Bus, effectively merging with the fill data. Probes that hit on the Dcache also need to check the MGB. Probes check the Merge Buffer at the same time that they check the Dtags. If a probe hits on a Merge Buffer entry, the Merge Buffer reads out the valid data bytes and drives them to the Cbox. 9.6 Pre-MAF TBS 9.7 Store Queue (SQA and SQD) The store queue acts as a 64-entry reorder buffer for all Store instructions and a number of cache movement instructions. The store queue is comprised of two parts: the SQA that stores all information except the store's data, and the SQD that stores the data and a duplicate virtual address. The store queue holds information for a store from the time it issues until that store can be written into either the Dcache or the Merge Buffer. This movement of information and simultaneous store queue deallocation cannot take place until the processor retires (or commits to) the store instruction. While a store instruction is resident in the store queue, it will attempt to supply data to appropriate younger loads in the same thread. Issuing load addresses compare against STA contents. If all of the requesting data are in one STD entry, the STD will override the Dcache's drive of the Load data bus and drive the data itself. The store queue input ports can accept two stores per cycle. In addition, the store queue can process up to three loads (providing data) and two deallocations per cycle. 9.8 Translation Buffers The Translation Buffers, or DTBs, are used to perform fast virtual address-to-physical address translation. There are four copies of the DTB, one for each Mbox input port. The DTBs each have 128 entries and are fully associative. Virtual addresses are sent to the DTBs when Loads and Stores are issued to the Mbox. Each address checks for a comparison against all of the DTB entries to see if the translation for its virtual page number (bits 51-13) has already been loaded into the DTB. If there is a match, the DTB drives back the physical page number (bits 47-13). These address bits are the ones used to compare against the Dtag address. It is also loaded into the appropriate queue (LQ or Compaq Confidential 5 Jam.mry 2001 -~ Subject To Change Memory Instruction Execution Unit - the Mbox 9-5 Back End Bus SQ) and driven down to the pre-MAF in case the operation requires miss processing. If the virtual address bits do not match anything in the DTB, a TB Miss Trap is indicated. 
This trap kicks off a PALcode routine that reaches into the operating system's page tables and generates the appropriate physical address for the current operation. The physical page number and the page's associated control bits are all loaded into the DTB via an Internal Processor Register (IPR) write. When the PALcode routine exits, control is returned to the program at the instruction that caused the TB Miss Trap. When the operation makes it to the Mbox next time, the translated address will be in the DTBs. The allocation policy is round-robin. 9.9 Back End Bus TBS 9.1 0 Operations 9.10.1 Read Requests Loads typically issue out of the Qbox (in the Q stage of the pipeline). After reading the register file, it computes the Load address (in the E stage). The full virtual address is sent to the Mbox at the beginning of MO. The Dtags are looked up in MO, using the index bits (virtual address bits 14-6), while the DTB transates the virtual address to the physical address. The translated (physical) address from the DTB is compared with the tag address from the Dtags. The Dtags are 2-way set associative and the tag comparison is done on the two tags simultaneously. Only one of these tags can compare with a given address. Loads write into their assigned LQ entry starting in the MO stage. Both indexed blocks of the Dcache are retrieved at the same time as the Dtags and DTB. If the block is present in the Dcache, the hit indication is used to drive only one of the retrieved quadwords from the Dcache. The Load compares against SQ addresses in parallel with the Dcache access. If the SQ indicates an address match, the Dcache drive is inhibited and the SQ drives the data. If the Load is not satisfied by either the Dcache or the SQ, a miss request is launched. The Load is retried from the LQ once the missing block is fetched by the Cbox. If a Load that is dispatched on port 2 is to the same bank as a Load on port 1, the Load on port 2 is marked as having a bank conflict and must be retried. If a Load is to I/O space (physical address bit 47 equals 1), then the Load cannot dis- patch through the memory system until it is known for certain that the Load will not abort (by an exception or a trap). Once a Load address is found to be in I/O space, after the DTB lookup, the Load is retried and completed only when the Qbox signals thatthe Load is next to be retired. After a Load has completed successfully (e.g., sent data to the Ebox) but before it is retired (and deallocated from the LQ), it may receive a trap signal. In such cases, the Load INum is sent to the Ibox. 9.10.2 Prefetches TBS 9-6 Compaq Confidential Memory Instruction Execution Unit - the Mbox 5 January 2001 ~· Subject To Change Operations 9.10.3 Write Requests Stores are issued from the Qbox on one of two ports. Stores perform a DTB lookup and load an STA and STD entry. The translated address from the DTB is loaded into the STA, as well as the virtual address and the other related instruction information. The store's virtual address also compares against the valid entries in the LQ. This is done to ensure that if a Load has been executed out of order to the same address as the Store, it can be replay-trapped. When a Store retires, the store data are written to the Merge Buffer. When a Merge Buffer entry is selected for writing to the Dcache, the Dtag is accessed to ensure that the block is writeable. When the Merge Buffer entry completes its write to the Dcache, the SQA entry is invalidated. 
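The load and store flows above all slice an address into the same fields: bank select (address bits 5:2), quadword select (bits 5:3), Dcache/Dtag index (bits 14:6), and physical tag (bits 47:13). A minimal sketch of that decomposition follows, with illustrative helper names, using the bit ranges given in the Dcache and Dtag sections.

#include <stdint.h>

static unsigned dcache_bank(uint64_t va)  { return (va >> 2)  & 0xf;   }  /* bits  5:2  */
static unsigned dcache_qw(uint64_t va)    { return (va >> 3)  & 0x7;   }  /* bits  5:3  */
static unsigned dcache_index(uint64_t va) { return (va >> 6)  & 0x1ff; }  /* bits 14:6  */
static uint64_t dtag_tag(uint64_t pa)     { return (pa >> 13) & ((1ull << 35) - 1); }  /* bits 47:13 */

/* Two loads issued in the same cycle conflict when they address the same bank;
 * one of them is marked with a bank conflict and is retried from the Load
 * Queue, as described in the Retries section. */
static int bank_conflict(uint64_t va_a, uint64_t va_b)
{
    return dcache_bank(va_a) == dcache_bank(va_b);
}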
I/O Stores cannot be completed speculatively; they cannot be cached. There is also a strict restriction of how subsequent 1/0 Stores can merge into one system request. The Dtags look-ups are irrelevant, because there will never be an 1/0 address in the Dcache on which an incoming Store can hit. But, since the translation occurs simultaneously to the first Dtag access, the Dtag access will continue until the tag comparison. At that point, when the physical address is determined to be in 1/0 space, the Dcache miss signal will be inhibited, and no miss request will be sent to the MAF. Still TBD is how I/O Stores that have made it to the Merge Buffer are coordinated with any I/O Loads in the MAF. I/O operations within a thread must beserviced in program order. 9.10.4 Retries A Load which issues from the Qbox may not be able to complete in the Mbox for various reasons. It can have a Dcache miss (and SQ miss), have a bank conflict, or be a Load to I/O space which is not ready to be retired. In each of these cases, the Load is marked to be retried. Every cycle, the retry logic in the LQ scans all the entries and finds the two oldest ready entries (in a given thread). Readiness is defined differently for each type of retry, but generally refers to when the Load is again able to make further progress. It then sends the information about the load to the pre-MAF. Retry candidates are chosen from different threads in a round-robin fashion. For a Dcache miss, a retry is marked ready only after the Cbox has signaled to the Mbox that the data return is imminent. The retry readiness is detected by the LQ when the MAF number for the completing Fill (which is driven by the Cbox) matches the MAF number stored in the particular LQ entry. All Loads that have this MAF number will signal readiness to the retry logic. These Loads will have their retries serviced in INumorder. Bank conflict retries are marked ready immediately after they occur. As soon as the retried Load is the oldest ready retry in the LQ, it will be sent to the Dcache again. Retried Loads are guaranteed not to get a bank conflict. When the Mbox receives the signal from the Qbox that an 1/0 Load is the next instruction to retire in a particular thread, the LQ marks the operation as ready to retry. 1/0 Loads will be retried in correct order due to the fact that they become ready in INum order and they are selected in the retry logic in INum order. 5 January 2001 ··· Subject To Change Compaq Confidentia I Memory Instruction Execution Unit - the Mbox 9-7 Operations 9.10.5 Dcache Misses If a Load operation is found to miss in the Mbox, based on the Dtag and SQ look-ups, then the data are sought from the Cbox. The Cbox will first look for the data in the Scache. If they aren't found there, it will seek the data from the external memory system. Cbox activity is initiated by loading a MAF entry with the physical address of the missing Load. The physical address is driven to the MAF at the same time it is driven to the LQ. The lower portion of the physical address is driven from the input ports (12:0) and the upper portion comes from the DTB (47:13). The addresses for the issued Loads (a maximum of three per cycle) are held at the input of the MAF until the hit/miss is determined. Loads that have missed are queued. The addresses of the first two entries in this queue are driven to the MAF to compare with the addresses already in the MAF. 
If a miss address is found to be in an address block of an entry already in the MAF, the address may be mergeable with the existing MAF entry, based on the merging rules (TBS). If the address doesn't match an already existent MAF entry, a new MAF entry will be allocated based on the MAF allocation policy (TBS). The MAF entry number where the miss is loaded (or merged) is sent back to the LQ and stored there. If there are no free MAF entries, though, the Load miss cannot continue yet; a retry is marked in the LQ entry for the operation. The Cbox arbitrates between the Loads, Stores and I-stream requests that are in the MAF for its next Scache and system operations. The Cbox forwards the Fill data some time later to the Mbox and indicates which MAF entry the data are for. That MAF number is driven to the LQ. Any Load operations that are waiting for these returning data, will match and signal that one or more retry is required. The data are driven by the Cbox into the Mbox, which steers them into the FRD buffers, via the Back End Bus. When the Load retry (or retries) have made it back through the pipeline, the appropriate FRD buffer is addressed (either by a CAM or by an index, TBD) which drives the Load data bus. Stores have to check the Dtags as they write to the Dcache. The Dcache state is most relevant to the Store right at the time it can commit its data to the memory system (when its data become system coherent). This look-up is necessary, because the Dcache state could have changed between when the Store first accessed the Dtags and when it is actually committing its data. A dirty Dcache block could have been victimized by the time the Store made it to retirement. If so, the Store needs to initiate a miss request, via the MAF. This time, though, the address is coming from the Merge Buffer. The returning data from the Cbox are sent to the Fill Buffer, and there is some interaction between the Fill Buffer and the Merge Buffer so that the newest bytes of the block make it into the Dcache, superceding the older bytes of the data block. 9.10.6 Load Locked/Store Conditional Load-Locked and Store-Conditional operations are used by the Alpha architecture to facilitate data sharing by multiple processes in the machine. Generally, a processor attempts to perform the Load-Locked (or LDx_L) and Store-Conditional (or STx_C) as a pair of operations, atomically, both to the same address. The address represents a block of data in memory that is a shared resource. A processor attempts to read from and then write back to the shared resource before any other processor in the machine has written to that resource. The LDx_L issues first, loads data from the shared memory address, and performs an internal operation to denote that the operation has been per9-8 Compaq Confidential Memory Instruction Execution Unit - the Mbox 5 J~·muary 2001 ·- Subject To Cl1ange Operations formed. The STx_C is only allowed to complete successfully if no other processor has written to this block. If it can be successfully completed, the STx_C performs the write, clears any internal state set up by the LDx_L, then returns a value to the source register denoting success. If the STx_C fails, no write is performed, the internal state set up by the LDx_L is cleared, and a return value denoting failure is written into the source register. The 21464 implementation of LDx_L/STx_C is somewhat complicated by the fact that Loads and Stores can be issued out of order to the Mbox. 
Thus, the Mbox needs to wait until the retire point of the operations before performing the locking operation, as a way of guaranteeing that the processor has committed to the operation, and to maintain the proper ordering of events. At its retire point, the LDx_L operation loads a lock register with its address. When the STx_C retires, it compares its address with what is in the lock register. If there is a match, the STx_C can complete successfully if and when the processor has exclusive ownership of the block. The STx_C retire is delayed until this exclusive ownership is acquired. If the processor already has exclusive ownership, this retirement delay is not very long. When the STx_C is ready to complete, whether successfully or with a failure, it signals a retry to the Qbox, driving the STx_C INum back to the Qbox. In response, the Qbox "bubbles" the pipeline; which means that it allocates a Load port for this operation, but doesn't initiate a new operation to the Mbox. When this bubble reaches the point in the pipeline where cache data are usually driven for a Load operation, the Mbox will drive the success/failure bit onto the least significant data bit. The Qbox then writes these data into the correct register for the STx_C. There are several situations that can cause an STx_C to fail. One of these situations is if the system is not be able to give exclusive ownership to the ownership request generated by the STx_C. Another case is when a Store operation to the same memory block, from another processor, beats the STx_C to the system memory. This competing Store will enter the processor as an invalidate, which will clear the lock register. A third way of failing a STx_C is if the Ibox receives an interrupt, a trap, any other control flow change (e.g., taken branches) or another memory opera ion between the time it has detected the LDx_L and when it has detected the STx_C, a STx_C failure must occur. Because this last case is detected in the Ibox, a special bit must travel with the STx_C through the Qbox and then out to the Mbox. The Mbox stores the bit with the STx_C and upon its retiring, will use it as a condition for success or failure. If the bit indicates a failure, no further processing in the Mbox is done. The STx_C, regardless of whether it can complete successfully, always clears the lock register at its retire time. 9.10.7 Traps A trap may be signaled while an operation is in its initial dispatch pipeline in the Mbox, or it could be signaled later, after the operation has completed but before it has been retired. Traps clear all processor state relating to that instruction and all other instructions younger than it (in that thread). Traps are classified as either replay traps or exceptions. 5 January 2001 -~ Subject To Change Compaq Confide11tial Memory Instruction Execution Unit - the Mbox 9-9 Interfaces Replay traps can be signaled due to the following: a Store address operation (front-end Store) dispatching through the Mbox finds a matching Load that has completed; a Store data dispatch (back-end Store) finds a matching Load in the LQ that belongs to another thread and the Load has completed; a Probe finds a matching Load in the LQ; SQ supplies incorrect data to a Load; Load gets a correctable ECC error. In each of these cases, the Load is restarted from the front-end of the pipeline. 
Exceptions may be caused by the following: a memory operation (Load, Store, or Prefetch) finds a TB miss or a violation (non-existent page, invalid operation); a Load encounters a non-correctable ECC error; a memory operation encounters a non-existent-memory error. A trap indication, along with the INum, is sent on a special kill-bus to the Qbox. The Qbox arbitrates all the exceptions it is receiving this cycle and redirects the front end suitably.

9.10.8 Invalidates/Probes
• Sources/Reasons
• Flow of Events

9.10.9 Memory Barriers
• General Concept
• Issues with Speculation
• Issues with Multi-threading

9.10.10 Multi-threading
• Support in Mbox for MT
• Implications of MT on Mbox Events

9.11 Interfaces

9.11.1 Pipeline Legend

[Pipeline stage diagrams: Load/Store Issue pipeline; Load/Store Queue Dealloc pipeline (mark entries to dealloc, pick retire block, update high-water mark); Store Copy-Out pipeline (read SQ entry, retire INum compare, pick retire block, send address to Merge Buffer, read SQD entry, send data to Merge Buffer); Merge Buffer pipeline (allocate Merge Buffer entry, assert NAK to Store Queue, pick entry to write through); Back End Bus pipeline (drive DC_data); Scache write-through pipeline (pick Merge Buffer entry, bid for Scache, drive to Scache, complete Scache write-through processing, send done to Cbox).]

9.12 Data address Translation buffer (DTB)

Point to Mbox Contract section for introductory material. Point to Interfaces.

9.12.1 Timing

Table 9-2 Memory Operation (Launch)
E0A: Issue from Qbox; receive OP
E0B: Ebox drives LD address; CAM ASN and TPU
M0A: Launch VA into TB, Tag, Stq
M0B: Read PA from TB; determine TB miss
M1A: Compare PAs; determine DC_Hit

Table 9-3 HW_MTPR TB Invalidate, TAG or PTE Issue
E0A: Issue from Qbox; receive OP
E0B: Ebox drives LD address
M0/M1: Latch VA

Table 9-4 HW_MTPR TB Invalidate or PTE Retire
V5-V7: Receive retire from Qbox; wait for bubble grant (about 18 cycles); send Disable_Memop bubble INum to Mbox retry/trap logic

Table 9-5 HW_MTPR TB PTE Retire Bubble
E0 of bubble: Receive bubble from staging register; IPR drives Tag address, write ASN/TPUGRP
M0/M1: Write PTE Tag into TB; write PA into TB

Table 9-6 HW_MTPR TB Invalidate Retire Bubble
E0 of bubble: Receive bubble from staging register; CAM ASN (IASN/IS)
M0/M1: CAM saved VA (IS only); clear TPU_Valid bits
The protection mask is ten bits, consisting of read and write permissions for each of the kernel, executive, supervisor and user modes, along with fault-on-read and fault-on-write. The retry/trap logic uses these bits, along with the current mode and opcode, to determine whether an access violation trap is required.

The VA comes from the Ebox in E0B. The TPU ID comes from the Qbox in E0A and is used to locate the TPU group, ASN and mode information in the IPR section. The ASN match is pre-evaluated one phase earlier, in E0B, and added to the VA CAM in M0A to make the DTB VA CAM match lines shorter. The Ebox will signal a BAD_VA trap if the VA is not correctly sign extended, i.e., if VA<63:52> is not the sign extension of VA<51>.

For each DTB entry the following information is stored for comparison with the current VA:
ASN<7:0> - Address Space Number / process ID
ASM - Address Space Match
TPUGRP_VALID<3:0> - TPU Group Valid bits
VA<51:13> - Virtual Address Tag
VA_DONTCARE<24:13> - Decoded Granularity hint bits

Every application process has its own virtual address space; therefore each process has its own Address Space Number. The ASN is used to specify which process the PTE is associated with. The operating system can allow multiple processes to share PTEs; the Address Space Match bit allows this. If the ASM bit is set, the ASN is not compared on a DTB lookup.

To allow the 21464's four TPUs to be used to create two or more independent virtual CPUs, a new mechanism allows each PTE to be made specific to a subset of the four TPUs. Four TPUGRP_VALID bits indicate which TPU group each entry is valid for.

9.12.2.1 The TPU Group

The TPU Group is a mechanism to control the sharing of address spaces and address space numbers among TPUs. TPU group membership is defined by the TPUGRP IPR and affects which lines can match on a DTB lookup. The match equations are described later; for now, here is the high-level description of what these groupings mean.
• The TPU Group delimits a space where TPUs in different groups have distinct spaces of ASN values, ASM bits, superpages and address mappings. TPUs in different groups have the same relation to each other as different processors within an SMP machine; they could even be running different operating systems.
• The ASN delimits a space where TPUs share ASM entries and superpages, but have distinct mappings for non-ASM pages. TPUs in the same group, but with different ASNs, are running different processes within the same instance of an operating system.
• TPUs whose group and ASN both match share their entire address space. They will be running different threads within one process and one operating system.

The operating systems people have referred to two different modes of operation, which correspond to different TPU group assignments. (These mode names are informational only, and do not affect the design of the DTB.)
• Mode 2 (expected to be used by VMS) is the SMP-like mode, where every TPU is in a different group.
• Mode 3 (expected to be used by Unix) is the full multithread mode, where all TPUs are in one group.

To implement the TPU groups, each DTB entry has four valid bits, indicating which group the entry pertains to. At most one valid bit may be set for a given entry. Having all four bits clear indicates that the entry is invalid for all groups.
For a DTB entry to match when doing a compare, the following condition must be met:

    VA<51:13> == current_VA<51:13>
    AND TPUGRP_VALID<current_TPU_group> == '1'
    AND (ASN<7:0> == current_ASN<7:0> OR ASM == '1')

9.12.2.2 Granularity Hints

This condition becomes a bit more complicated with the addition of Granularity Hint (GH) bits. GH is a mechanism that allows contiguous pages to be treated as one larger page. The GH bits allow recognition of pages of 8x, 64x, and 512x the base size. The 2-bit GH encoding is interpreted in the following manner:

Table 9-7 Granularity Hint Encoding
GH   Multiple   Page Size (8KB base)   Page Size (64KB base)   VA Compare
00   normal     8K page                64K page                compare VA<51:13>
01   8x         64K page               2M page                 compare VA<51:16>
10   64x        512K page              64M page                compare VA<51:19>
11   512x       4096K page             512M page               compare VA<51:22>

The condition then becomes:

    VA<51:22> == current_VA<51:22>
    AND (VA<21:19> == current_VA<21:19> OR GH == '11')
    AND (VA<18:16> == current_VA<18:16> OR GH >= '10')
    AND (VA<15:13> == current_VA<15:13> OR GH >= '01')
    AND TPUGRP_VALID<current_TPU_group> == '1'
    AND (ASN<7:0> == current_ASN<7:0> OR ASM == '1')

If this condition is not met for any of the DTB entries, a DTB miss will occur. (A C-style sketch of this combined check appears at the end of this section.)

9.12.3 64K Pages

The 21464 supports a 52-bit virtual address and a 48-bit physical address. These widths lead to several complications when used with the 8K page size standard with the Alpha architecture. Chief among these is that the page table entry stores the physical page frame number as the upper longword of the quadword entry. A 32-bit page frame number and a 13-bit page offset combine to permit only a 45-bit physical address. For this reason, the 21464 has a 64K-page mode, with a 16-bit page offset, permitting the 48-bit physical address space. A second benefit of the 64K-page mode is that the full 52-bit virtual address space can also be accessed using a 3-level page table, instead of the 4-level table that would be needed with 8K pages.

The 64K page mode is implemented by a bit stored (on a per-TPU basis) in the VA_CTL IPR. When set, this bit has the following effects:
• All granularity hint values are increased by one, so the VA bits covered by the hint are:

GH   8K pages      64K pages
00   (none)        VA<15:13>
01   VA<15:13>     VA<18:13>
10   VA<18:13>     VA<21:13>
11   VA<21:13>     VA<24:13>

• The PPFN stored in DTB_PTE<63:32> corresponds to PA<47:16>.
• The VPTE offset in VA_FORM<38:3> corresponds to VA<51:16>.

From a hardware perspective, the 64K mode inserts another set of conditions into the granularity hint logic, and a 3-bit shift into the PTE part of the DTB and into the VA input into VA_FORM.

9.12.4 Hit Determination

The DTB is also in charge of determining whether a particular load access hits in the cache. This function is located at the DTB to minimize the load bus timing, even though it is not strictly a matter of address translation. Hit determination is done by driving both sets' tag addresses from the corresponding tag array and comparing them with the translated physical address. A match sets the hit bit and steers the set select to the matching set. For test purposes, the DC_CTL IPR can force the cache to always hit in one set. In addition, the Store Logic can force a load to hit, if the Store Logic can supply data, or to miss, if it should supply data but is unable to.
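Returning to the match condition of Section 9.12.2.2, the following is a minimal C sketch of the combined VA, granularity-hint, TPU-group and ASN/ASM check for a single DTB entry. The struct layout and helper names are illustrative assumptions for exposition, not the actual array organization.

#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint64_t va_tag;        /* VA<51:13> of the entry                    */
    uint8_t  gh;            /* granularity hint, 0..3                    */
    uint8_t  asn;           /* ASN<7:0>                                  */
    bool     asm_bit;       /* Address Space Match                       */
    uint8_t  tpugrp_valid;  /* TPUGRP_VALID<3:0>, one-hot or zero        */
} dtb_entry_t;

/* Extract VA<hi:lo> from a full virtual address. */
static uint64_t bits(uint64_t va, int hi, int lo)
{
    return (va >> lo) & ((1ULL << (hi - lo + 1)) - 1);
}

/* One-entry version of the DTB compare (8K base page size assumed). */
bool dtb_entry_match(const dtb_entry_t *e, uint64_t cur_va,
                     uint8_t cur_asn, int cur_tpu_group)
{
    uint64_t entry_va = e->va_tag << 13;   /* re-expand tag for field compares */

    bool va_ok =
        bits(entry_va, 51, 22) == bits(cur_va, 51, 22) &&
        (bits(entry_va, 21, 19) == bits(cur_va, 21, 19) || e->gh == 3) &&
        (bits(entry_va, 18, 16) == bits(cur_va, 18, 16) || e->gh >= 2) &&
        (bits(entry_va, 15, 13) == bits(cur_va, 15, 13) || e->gh >= 1);

    bool group_ok = (e->tpugrp_valid >> cur_tpu_group) & 1;
    bool asn_ok   = e->asm_bit || (e->asn == cur_asn);

    return va_ok && group_ok && asn_ok;    /* a miss occurs if no entry passes */
}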
The Fill Buffer can likewise force a load to hit if it is supplying data for an I/O load. These override conditions go into the hit logic and override the results of the tag comparison. The hit indication is returned to the Mbox retry/trap logic, which, in the event of a miss, marks the operation for retry and poisons its load data. The output of this logic is the DC_HIT signal, which provides the overall indication of whether the operation generated valid data, along with drive enables to the set drivers, Store Logic and Fill Buffer to steer the proper data onto the M%LD_DATA_M2A<63:0> bus.

9.12.5 Returned Status

In the most general terms, either of two things can happen when a memory operation comes in. First, the DTB translation could succeed. In this case, the DTB returns a physical address, along with the Dcache hit and set select. Second, the DTB translation could trap. In this case, the DTB returns a trap reason in M1A to the Mbox retry/trap logic. After a little while, when this gets back to the Qbox, the operation in question will be killed. Some time thereafter, PALcode will deal with the trap. Among other things, this means that the address and hit indications generated are irrelevant and are permitted to be garbage.

Trap processing is governed by the following rules:
• If the Ebox sent a poisoned address, all traps are inhibited. Poison indicates that the operation in question is the dependent of a load miss, and thus is garbage. The Qbox will reissue such operations after the load retry.
• Otherwise, if the Ebox sent BAD_VA, indicating that the sign extension check failed, we signal a BAD_VA trap. PALcode will need to emulate the instruction (pointed to by the EXC_ADDR IPR) to find out what the failing address was.
• Otherwise, if no entry in the DTB matched the address, and the opcode is a LD_VPTE, we signal a DTB_MISS_DOUBLE trap. PALcode will need to use the double miss flow to find the correct PTE.
• Otherwise, if no entry in the DTB matched the address, and the opcode is not a LD_VPTE, we signal a DTB_MISS_SINGLE trap. PALcode will need to use the single miss flow to find the correct PTE.
• Otherwise, if an entry in the DTB matches the address, but the PTE has protection bits inconsistent with the access requested, the Mbox retry/trap logic signals an ACV trap. The specific equations are as follows:

    mode = (OP == HW_LD/Alt || OP == HW_ST/Alt) ? DTB_ALTMODE : IER_CM/CM
    prot = switch (mode):
        case KERNEL: (KRE, KWE);
        case EXEC:   (ERE, EWE);
        case SUPER:  (SRE, SWE);
        case USER:   (URE, UWE);
    ACV  = (RD & (FOR | !prot[0])) | (WR & (FOW | !prot[1]))

9.12.6 Effects of a DTB Miss

As mentioned earlier, a DTB miss causes a DTB miss trap. The DTB miss trap is delivered to the Mbox central trap handler. The central trap logic checks whether the associated opcode is a LD_VPTE, which requires double-miss handling, or anything else, which is a single miss, and requests that the Qbox kill the faulting instruction and dispatch to the appropriate PALcode flow.

The PALcode trap handler determines the appropriate PTE from the operating system's page tables. It then fills this PTE into the DTB by writing to the IPRs DTB_TAG and DTB_PTE. All communication with the DTB from the PALcode routine is done through writes to IPR registers, which are discussed later. The HW_MTPR DTB_TAG0 must issue before the HW_MTPR DTB_PTE0 and, similarly, DTB_TAG1 before DTB_PTE1.
This is ensured by the PALcode restriction that TAG0 must come before PTE0 and be in the same picker, and that TAG1 must come before PTE1 and both must be in the opposite picker from TAG0 and PTE0. Having the MTPRs to TAG0, TAG1, PTE0 and PTE1 immediately adjacent and in that order satisfies this requirement. In addition, the MFPR VA must be before, and in the same picker as, the LD_VPTE. (Being before and in the same picker as the MFPR VA_FORM is also acceptable.)

When the PTE has been written to the IPRs, it is not copied from the IPR into the DTB until the instructions that write the IPRs have retired. To ensure consistency, in any flow containing writes to the DTB TAG or PTE IPRs, either the complete set of four MTPRs must retire, or none of the writers may retire. So that all four TPUs can be working on TB miss flows at once without colliding, there must be four sets of physical IPRs, with one set visible to each TPU. A thread must not complete another DTB fill flow while a prior fill flow in the same thread is unretired. This is done by scoreboarding the HW_MTPR TAG0 at the beginning of the DTB fill flow against the retire of the HW_MTPR PTE1 at the end of the previous DTB fill in the Qbox, as documented by the Qbox.

A new DTB entry is written from the IPR set into the DTB array when the HW_MTPR DTB_TAG1 retires. When this retire occurs, the DTB requests that the Mbox retry logic bubble the HW_MTPR INum back to the Qbox, which inserts a bubble into all four load/store pipes and also releases any waiting DTB writes for that TPU. The DTB entry is written when the bubble arrives. In the meantime, memory operations in the same TPU as the writer of a retired but unwritten DTB entry can continue to use the copy of the DTB entry stored in the IPR.

9.12.6.1 Speculative and Duplicate DTB Entries

Because the single DTB miss flow is performance-critical, DTB entries must be usable even before the DTB writer retires. At the same time, if the DTB writer turns out to be on a bad path, it must not have affected any good-path instructions. In addition, since the DTB is a shared resource, this restriction also applies to other TPUs. To permit all this, a speculative DTB entry stored in a DTB_TAG/DTB_PTE IPR set may be used for a memory operation if all of the following conditions are met:
• The DTB entry is not the result of poisoned data.
• The DTB entry has not been invalidated as a duplicate.
• The DTB entry was written by the same TPU as the memory operation.
• The DTB entry is older than the memory operation.
(A C-style sketch of this usability check appears at the end of this subsection.)

Because, among other reasons, two TPUs could execute a DTB fill almost simultaneously, it is possible for the speculative PTEs of two TPUs to translate the same address. These will never be used simultaneously, by the rules above. However, if one or both PTEs retire and are written to the DTB array, multiple DTB entries could be activated on a single operation. This has unpleasant electrical and logical consequences. The following rules ensure that an entry in the DTB array can never duplicate another entry in the array or a speculative PTE:
• A HW_MTPR TAG performs a CAM of the DTB when issued, and invalidates itself if a hit occurs.
• A HW_MTPR PTE performs a CAM of all of the speculative TAGs when it retires, and invalidates all matching speculative TAG/PTEs.
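As a summary of the four usability conditions above, here is a minimal C sketch of the check a memory operation might apply to a speculative DTB_TAG/DTB_PTE IPR set. The field names (poisoned, dup_invalidated, the INum ordering convention) are illustrative assumptions rather than the real IPR state bits.

#include <stdbool.h>
#include <stdint.h>

/* Hypothetical state kept with a speculative DTB_TAG/DTB_PTE IPR pair. */
typedef struct {
    bool     valid;            /* a fill flow has written this set        */
    bool     poisoned;         /* written with poisoned (load-miss) data  */
    bool     dup_invalidated;  /* invalidated as a duplicate              */
    uint8_t  writer_tpu;       /* TPU that executed the fill flow         */
    uint64_t writer_inum;      /* age of the MTPR writer                  */
} spec_pte_t;

/* May this memory operation use the speculative entry for translation? */
bool spec_entry_usable(const spec_pte_t *s, uint8_t op_tpu, uint64_t op_inum)
{
    return s->valid &&
           !s->poisoned &&              /* not the result of poisoned data */
           !s->dup_invalidated &&       /* not invalidated as a duplicate  */
           s->writer_tpu == op_tpu &&   /* written by the same TPU         */
           s->writer_inum < op_inum;    /* entry is older than the op      */
}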
9.12.7 Data Storage in the PTE

The DTB uses a RAM array to store all data to be retrieved on a DTB read:
PA<47:13> - Physical address bits of the 8K page.
UWE, SWE, EWE, KWE - User, Supervisor, Executive and Kernel Write Enable bits.
URE, SRE, ERE, KRE - User, Supervisor, Executive and Kernel Read Enable bits.
FOW, FOR - Fault On Write, Fault On Read.

When a PA is written to the DTB, each bit is either XORed with the respective VA bit or forced low. When the PTE is read from the DTB, each PA bit is again XORed with the VA. This restores the PA bit for any bit that was XORed on the way in; if a bit was forced low on the write, the read instead inserts the VA bit. This provides a mechanism for supporting GH bits and superpages (see later).

The PA bits are written in the following manner:
PA_write<47:22> - XORed with VA_write<47:22>
PA_write<21:19> - XORed with VA_write<21:19> if GH < '11'; otherwise write "000"
PA_write<18:16> - XORed with VA_write<18:16> if GH < '10'; otherwise write "000"
PA_write<15:13> - XORed with VA_write<15:13> if GH < '01'; otherwise write "000"

The PA bits are read in the following manner:
PA_read<47:13> is XORed with VA_matched<47:13>

The protection and fault bits are compared with the access requested, and the Mbox retry/trap logic sends back an ACV trap if they are inconsistent. The IPR section will also latch the VA and access bits in the MM_STAT and EXC_ADDR IPRs. The IPR logic tracks Mbox faults as they happen, together with the P%RK_KILL_V5A bus, to ensure that these IPRs reflect the last good-path faulting instruction.

9.12.8 IPRs That Affect the Contents or Behavior of the DTB

DTB_TAG0, DTB_TAG1
The VA is written to the DTB_TAG register by use of the MTPR PALcode instruction. Four registers of each type exist (one per TPU), although only one is visible to the user at any given time.

DTB_PTE0, DTB_PTE1
The PA and protection bits are written to the DTB_PTE register by use of the MTPR PALcode instruction. Four registers of each type exist (one per TPU), although only one is visible to the user at any given time.

DTB_IA (Invalidate All)
When a write to this IPR retires, all PTEs in the DTB are invalidated for TPUs in the writer's TPU group. PALcode must include an IFETCHB after this IPR is written and before any memory operation.

DTB_IAP (Invalidate All Process Specific PTEs)
When a write to this IPR retires, all process-specific (ASM==0) PTEs are invalidated for the writer's TPU group. PALcode must include an IFETCHB after this IPR is written and before any memory operation.

DTB_IASN (Invalidate Address Space Number) (Proposed)
When a write to this IPR retires, all process-specific (ASM==0) PTEs with the ASN of the current TPU are invalidated for the writer's TPU group. This has been requested by Unix, and may be implemented if they provide a justification and SRM change. PALcode must include an IFETCHB after this IPR is written and before any memory operation.

DTB_IS (Invalidate Single)
When a write to this IPR retires, any entry in the DTB that matches the VA provided in the IPR (and the current ASN of the TPU if ASM==0) is invalidated for the writer's TPU group. PALcode must include an IFETCHB after this IPR is written and before any memory operation.

M_CTL (Mbox Control Status Register)
When a write to this IPR retires, bits SPE<2:0> in this IPR enable the 3 superpage modes. Superpage enables only affect the group of TPUs belonging to the writer of M_CTL.

TPUGRP (Thread Processing Unit Grouping definition)
When a write to this IPR retires, its contents define how TPUs are grouped. When a new PTE is written to the DTB, the PTE can be made valid only for the group of TPUs to which the writer belongs. Whenever a TPUGRP is reused, the DTB must be flushed for that TPUGRP.

DTB_ALTMODE (Alternate Access Check Mode)
This is the mode used when a memory reference comes in by way of a HW_LD or HW_ST instruction with ALTMODE set. This is used to implement various probe operations, and must exist on a per-thread basis. In such cases, the Mbox retry/trap logic uses the access mode (kernel, executive, supervisor or user) stored here, rather than the one stored in the CM IPR.

9.12.9 Superpages

Superpages are an extension to VA->PA translations. These mappings provide a translation outside the DTB for three regions of VA space. The translations are all set to permit access in kernel mode only.

Superpage0 is used to direct-map one quarter of Windows NT's 32-bit address space. The kernel code is kept in this area of memory. It is believed that 64-bit NT will use the Unix superpage mode.

Superpage1 is used to direct-map the least significant 41 bits of the physical address space (bits <47:41> sign extended) to support older versions of UNIX and VMS. This superpage is consistent with the 43-bit virtual address supported by EV4 and EV5 and with the size of the 3-level VPTEs used in Digital UNIX (see SRM Digital UNIX II-B section 3.1.1).

Superpage2 is used to direct-map the whole of the physical address space for more recent versions of UNIX and VMS, which may use four-level PTEs.

In hardware, we simply need a set of special match lines in the CAM which detect the following conditions (a C sketch of these checks appears at the end of this section):

If M_CTL[SPE<2>] == '1' AND VA<51:50> == "10" then PA<47:13> = VA<47:13>, USEK = "0001"
If M_CTL[SPE<1>] == '1' AND VA<51:40> == "111111111101" then PA<47:13> = "1111111", VA<40:13>, USEK = "0001"
If M_CTL[SPE<1>] == '1' AND VA<51:40> == "111111111100" then PA<47:13> = "0000000", VA<40:13>, USEK = "0001"
If M_CTL[SPE<0>] == '1' AND VA<51:30> == "1111111111111111111110" then PA<47:13> = 0 (for PA<47:30>), VA<29:13>, USEK = "0001"

M_CTL is an IPR. SPE<2:0> are the 3 superpage enable bits within the IPR. The operating system must ensure that no valid PTE in the DTB array will ever conflict with an active superpage, by never including the superpage region in the regular page tables.

When one of these conditions is detected, a special, hardwired PA entry is read from the PA array. Since all PA bits are XORed with VA bits as they are read from the array, the hardwired PA entries are as follows:

                          PA<47:42>   PA<41>   PA<40:31>   PA<30:13>
Superpage 2               All 0s      0        All 0s      All 0s
Superpage 1 & VA<40>=1    All 0s      1        All 0s      All 0s
Superpage 1 & VA<40>=0    All 1s      0        All 0s      All 0s
Superpage 0               All 1s      1        All 1s      All 0s

Since superpages are enabled separately for different TPU groups, a full four-bit mask of TPUGRP_VALID bits is stored for each superpage mode. For a particular superpage comparator, only the TPUGRP_VALID bits corresponding to the writer's TPU group are affected when M_CTL is written (bits outside the group should be left as they are and not cleared). The superpage TPUGRP_VALID bits are not affected by IA, IAP or IS. They do have a reset mechanism so that the whole DTB can be flushed during power-on reset and when TPU groups are modified by a write to the TPUGRP IPR.
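The three superpage checks reduce to a few literal compares on the upper VA bits. Here is a hedged C sketch of the translation they imply; the bit patterns follow the conditions above, while the function name, return convention and omission of the kernel-only access check are simplifications for illustration.

#include <stdbool.h>
#include <stdint.h>

static uint64_t va_bits(uint64_t va, int hi, int lo)
{
    return (va >> lo) & ((1ULL << (hi - lo + 1)) - 1);
}

/* Apply the superpage mappings of Section 9.12.9.  spe is M_CTL[SPE<2:0>].
 * Returns true and fills *pa (page-frame bits <47:13>) when a superpage
 * matches; otherwise translation falls back to the DTB. */
bool superpage_translate(uint64_t va, unsigned spe, uint64_t *pa)
{
    if ((spe & 0x4) && va_bits(va, 51, 50) == 0x2) {       /* superpage 2 */
        *pa = va & 0x0000FFFFFFFFE000ULL;                  /* PA<47:13> = VA<47:13> */
        return true;
    }
    if ((spe & 0x2) && va_bits(va, 51, 40) == 0xFFD) {     /* superpage 1, VA<40>=1 */
        *pa = (0x7FULL << 41) | (va & ((1ULL << 41) - 1) & ~0x1FFFULL);
        return true;
    }
    if ((spe & 0x2) && va_bits(va, 51, 40) == 0xFFC) {     /* superpage 1, VA<40>=0 */
        *pa = va & ((1ULL << 41) - 1) & ~0x1FFFULL;        /* PA<47:41> = 0 */
        return true;
    }
    if ((spe & 0x1) && va_bits(va, 51, 30) == 0x3FFFFE) {  /* superpage 0 */
        *pa = va & 0x3FFFFFFFULL & ~0x1FFFULL;             /* PA<47:30> = 0 */
        return true;
    }
    return false;
}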
9.12.10 Possible Support for Generic Superpages

Bruce mentioned the idea of using generic superpages. Instead of hardcoding the superpages as additional DTB entries, there could be several generic superpage entries. This certainly seems possible from an implementation standpoint, with the following implications.

9.12.10.1 Page Table Array (PTA) Implementation

Since the superpages would no longer be hardcoded, the translations would need to be filled into the DTB instead of simply enabled using the SPE<2:0> bits. The XOR scheme described above would still allow generic superpage translation for the existing superpage modes.

9.12.10.2 Virtual Address Array (VAA) Implementation

The DTB uses CAM cells to compare the incoming VA with the VA stored for each entry. For the superpages the CAM cells are removed and the superpage code is hardcoded into the array. The superpages vary in size and VA space mappings. Generic superpages will require the addition of a 3-state CAM cell into the CAM array, where the third state is "don't care". This would allow the CAM structure to compare only a subset of VA bits. The EV6 DTB used two CAM cells to build a 3-state CAM cell for the GH bits. Something similar could be done for the generic superpage entries.

9.12.11 Replacement Policy

Least Recently Used (LRU) vs. Not Last Used (NLU) vs. Round Robin? The 21264 used a round-robin replacement policy. What will the 21464 use? Some quick experiments indicate that (on the SQL benchmark), compared to Round Robin, NLU improved miss rates by 1 per 1k instructions, and LRU by 8 per 1k instructions. Given the performance differences, the complexity of implementation, and the impact of anything other than round robin on verification, the DTB will use round-robin replacement.

9.12.12 DTB Size

It has been observed that a 128-entry DTB is incapable of mapping a 2MB Scache using default 8K pages. One could build a 2-set 256-entry DTB to do this; it would essentially be two copies of the current TB, with the choice of set done by looking at VA<13>. A preliminary experiment (using SQL again) indicates a reduction of 13 TB misses per 1k instructions with this design. Based on the above, the DTB will have 128 entries.

9.12.13 ITB Usage

The core of the DTB design will also be the core of the ITB design. Notable differences include the following:
• There is only one ITB, but there are four DTBs.
• The ITB only stores 5 protection bits, vs. 10 in the DTB.
• The ITB does not have speculative entries and will use that logic to handle the micro-ITB instead.
• The ITB does not have hit determination logic.

9.12.14 Reset and Testability

All IPRs, DTB Valid bits and the DTB Write Pointer must be cleared on reset, except that the Dcache set enable bits in the DC_CTL IPR reset to one. If this proves impractical, a fallback position is to require that reset PALcode execute a DTB_IAG to reset the write pointer and a DTB_IA to invalidate the array for each TPU group, and to write M_CTL and VA_CTL appropriately, before any virtual loads and stores are issued.
For test purposes, the test logic requires access to the virtual address (lookup), fill virtual address, PTE read data, PTE fill data and write pointer. The virtual address lookup path is critical; the test port should probably be included as a leg of the retry address mux.

Some caution is in order regarding nomenclature. The DTB is a CAM/RAM structure whose data elements are addresses. Thus, viewed from the perspective of the test logic, the virtual address is the data part of the CAM, the physical address is the data part of the RAM, and the write pointer is the address part of the CAM.

9.12.15 Issues

1. Unix changes PTEs without doing TB invalidates. Unix toggles the protection bits without invalidating the PTE. The duplicate prevention logic must prevent ill effects from resulting.
2. The combination of GH=11 and 64K pages means that GH cells have to go up to VA<24>. EV7 seems to be doing this.
3. Enforcing issue ordering between MTPR TAG and PTE on DTB fill. This can probably be done by slotting the PAL flow correctly. If not, it could be done by using another IPR reader class.
4. Is the Ebox doing the BAD_VA trap? Chris has volunteered that the Ebox could signal the BAD_VA trap, since they are already signalling Unaligned.
5. Is there a separate IPR bus? The Ebox feels that a separate bus for MFPR, which acts like a multimedia result bus, is the cleanest way.
6. Should we make all explicitly written IPRs readable? This is for testability.
7. Should generic superpages be added? What is the mechanism? The concern is the number of 3-state CAM cells.
8. Bit assignments and encodings for Trap Reason and IPRs.
9. Will the Pbox ensure that we get the Kill INum before freshly issued post-kill instructions? (They have to do this for all boxes, not just us.) The current timing is V7=I2 of the good path flow.
10. Duplicate suppression on bad paths requires handling the following conditions. A spurious writer with a GH greater than that of the overlapping entry in the main array won't hit the main array unless we bump up the GH value. If we CAM the array at MTPR TAG issue, the GH bits and ASM are not available then. A CAM at retire could hit a half-issued (TAG only) speculative entry, which doesn't have any GH bits or ASM. We could make speculative entries work only with GH=00.
11. Approval to add DTB_IASN. Having it present still maintains backward compatibility.
12. PALcode restrictions.
13. EV6 and EV7 compatibility. EV7, in particular, is considering changes to 64K mode and GH semantics.
14. Any issues relating to use of the DTB as the ITB.

9.13 Store Logic

9.13.1 Overview

The SQ, SQD, and SQC (collectively known as the Store Logic or STL) work together to form a 64-entry reorder buffer for all store instructions and a number of "cache movement instructions". These instructions are listed below:

Store Instructions: STB, STW, STL, STQ, STQ_U, HW_ST; STF, STG, STS, STT; STL_C, STQ_C; QUIESCE
Cache Movement Instructions: WH64, ECB, CCB, WMB

The SQ buffers nearly all information for an STL instruction (except an actual store's data), and contains the allocation and deallocation functions. The SQD buffers the data portion of stores, supplies that data to loads that require it, and supplies that data to the back-end process that copies stores into the cache/memory system. In order to perform these tasks with the required timing, the SQD duplicates some of the functions of the SQ.
The SQC contains additional control and sequencing for more complicated store instructions and MB instructions.

The STL supports three major instruction pipeline flows: Store Issue, Load Issue, and Store Copy-Out. Each of these flows takes place on a different port, although there may be some sharing of wires between the Store Issue and Load Issue ports. The STL supports two Store Issues and two Store Copy-Outs per cycle. In addition, the STL can process up to three Load Issues per cycle, although only two loads are permitted if there are two Store Issues in the same cycle.

The STL divides the 64 entries into 16 blocks of 4 entries, where each block will contain four consecutive stores (in program order) for a specific TPU. The number of blocks allocated to each TPU depends on the number of active (non-quiesced) TPUs. A fifth pipeline flow controls the reallocation of a block to the same or a new TPU.

In order to keep the Qbox issue logic from overflowing the STL buffering capability, the Pbox assigns a store number (SNUM) to each instruction that will be processed by the STL. The SNUM is a sequentially increasing identifier for STL instructions in program order from a single TPU. The SNUM is unique among instructions for a given TPU, but may be duplicated for instructions in another TPU. The STL specifies to the Qbox issue logic a SNUM high-water mark, which is the largest SNUM that the STL has buffering for. The Qbox will not issue STL instructions whose SNUM is greater than the high-water mark, which prevents STL overflow. When an STL block is added to a TPU, or one is ready for reuse, the STL will increase the high-water mark for that TPU, and the Qbox will enable the next four STL instructions to issue.

The STL holds information for an STL instruction from the time it issues until it retires and other structures are updated. If the instruction refers to an address that is represented in the Dcache, then the STL entry will not be deallocated until the Dcache is updated. If the instruction refers to an address that is not in the Dcache, then it can be deallocated as soon as the instruction has been copied into the MGB.

While a store instruction is buffered in the STL, it will attempt to supply data to appropriate younger loads in the same TPU. The STL will supply data, SNUM, and a status for each load, specifying what the load should do:

Status      Action
STL_MISS    Use Dcache data
STL_HIT     Use SQ data
STL_RETRY   Use SQ data, but the SQ can't supply it now

If STL_RETRY is signaled, the LQ simply retries as soon as possible.

In order to help the STL find the appropriate store instruction for a given load, the STL computes an active range of INums for each STL entry. This range is a conservative estimate of the load INums that should use this entry, and is initialized by the Store Issue for the STL entry. Subsequent Store Address Issues may reduce the end of the range if the incoming STL instruction is in the current range and the SQD address comparison would match the original entry. An incoming Load Address Issue causes the SQD to read entries whose range includes the INum of the load, and whose SQD address overlaps the load. The STL logic guarantees that there will be either zero or one entries that meet these criteria. At the same time, the SQ is computing the appropriate status for the load access.
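As a summary of the status table above, the status the STL returns for a load lookup might be computed roughly as follows. This is a hedged C sketch; the entry fields and the notion of a single candidate entry are simplifications of the SQ/SQD behavior described above, not the actual signal names.

typedef enum { STL_MISS, STL_HIT, STL_RETRY } stl_status_t;

/* Hypothetical summary of one STL (store queue) entry as seen by a load. */
typedef struct {
    int addr_overlaps_load;    /* SQD address overlaps the load's bytes      */
    int inum_in_active_range;  /* load INum falls in the entry's range       */
    int data_available;        /* store data has been written to the SQD     */
    int covers_all_bytes;      /* entry can supply every byte the load wants */
} stl_entry_view_t;

/* Status returned to the load for the single candidate entry (or none). */
stl_status_t stl_load_status(const stl_entry_view_t *e)
{
    if (e == 0 || !e->addr_overlaps_load || !e->inum_in_active_range)
        return STL_MISS;                      /* use Dcache data           */
    if (e->data_available && e->covers_all_bytes)
        return STL_HIT;                       /* use SQ data               */
    return STL_RETRY;                         /* SQ owes data; retry later */
}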
In most cases, STL_MISS or STL_HIT will be returned enabling the load to complete as far as the STL is concerned. In a very few cases, the STL logic will be incapable of supplying the correct data, and the SQ will return STL_RETRY. The load is then retried continuously until the STL condition is cleared (usually by some entry being deallocated). When a STL instruction retires its SQ status is changed to retired, which causes it start requesting service from the Store Copy-Out logic. The Copy-Out logic processes retired STL instructions in program order, and can handle two STL instructions from a given TPU in a single cycle. The Copy-Out process sends the STL instructions to the MGB (merge buffer) which responds with a positive or negative acknowledgement. If the MGB supplies a negative acknowledge, then the STL instructions will be sent again, when that TPU is selected. Eventually the MGB sends a deallocate message to the STL, which changes the SQ status to deallocate, which enables the block containing the entry to be processed by the reallocation pipeline. Compaq Confidential 9-24 Memory Instruction Execution Unit - the Mbox 5 January 2001 m Subject To Change Store Logic 9.13.2 Store Issue Flow The Store Issue loads the information about an STL instruction into the SQ and SQD. The entry that is to be loaded is determined by the SQ by matching the incoming instructions TPU and SNUM against the TPU and SNUM stored in each block of the SQ. The block TPU and SNUM are initialized by the reallocation process and described later. Each STL entry computes an active range of INums within which this STL entry will be visible to Load Issues. This range is computed in a conservative fashion so that for a given Load Issue, only zero or one entries of the SQD will be read. This range is required because a number of STL instructions to the same address can be resident in the STL at a given time. The SQ entry stores the virtual address and physical address for STL instructions, however only the virtual address is duplicated in the SQD. Two STL instructions are defined to "overlap" if the they are on the same TPU, the quadword SQD index is the same, and that some addressed bytes are shared by the two instructions. The lower end of the active range is simply the INum of the STL instruction until it retires, at which point it is changed to be "negative infinity" (older than any instruction to be issued). The end INum of the range (or EINum) is initialized to be "positive infinity" (younger than any instruction to be issued). When the EINum of an entry is retired, it is changed to be negative infinity, which effectively disables this entry of the STL from supplying data to loads. For each subsequent Store Issue, each entry in the SQ updates its EINum if appropriate. If the incoming Store Issue's INum is younger than the entry's INum and older than the entry's current EINum, then the entry's EINum will be set to be the incoming Store Address Issue's INum. 9.13.3 Load Issue Flow The three load address ports are used to look up the STL to determine if some data for this load should come from the STL. The TPU, INum and virtual address of the load are used to find a single entry that is enabled to supply data to this load. The TPU, INum and physical address of the load are then used to determine if the correct entry was selected using the virtual address. 
If this second comparison determines that either the wrong entry was selected, or that there were more than one entry required for the load to complete, it signals this fact using STL_RETRY. 9.13.4 Store Copy-Out Flow When a STL instruction retires its SQ status is changed to retired, which causes it start bidding to be copied into the MGB (merge buffer) and DCACHE. The Store Copy-Out process selects a TPU that has some STL instruction bidding to be copied-out. It then selects the oldest block that contains a bidding STL instruction, and copies out the two oldest bidders from that block (if there is more than one). The SQ commands the SQD to read the data for those STL instructions, and examines their opcodes and physical addresses. If there is no overlap in the physical addresses and the operations are compatible, both are sent along with the data from the SQD to the MGB. After a fixed delay, the SQ gets an acknowledgement from the MGB telling the SQ that each STL instruction was either accepted or rejected, and the MGB index it was merged into if accepted. The SQ stores this information to control deallocation of the SQ/SQD entry. If the entry was not accepted, it becomes enabled to bid again. In some instances, the STL instrucCompaq Confidential 5 January 2001 -~ Subject To Change Memory Instruction Execution Unit - the Mbox 9-25 Merge Buffer tion requires special handling by the Copy-Out process (WMB, STx_C, I/O Stores). In this case, the Copy-Out process stops copying out STL instructions for that TPU until the operation is done. Because there is a window of a couple of cycles between the sending of an instruction and the acknowledgment by the MGB, it would be possible to have a number of instructions inflight toward the MGB. In order to process WMB instructions, the retire ports will not send any instructions after the WMB until the MGB sends acknowledgment for instructions preceeding the WMB. In addition the retire ports wait until all merge buffers for this TPU are made coherent before sending any new instructions. The retire ports process STx_C like ordinary stores, except that STx_C holds up the retire point. This means that no other stores can be processed after the S Tx_C in this TPU. When the STx_C is sent to the MGB, it will reply with a STx_C_success signal in addition to the not_accepted signal. 9.13.5 Block Allocate Flow {TBD) 9.13.6 Things Not Done Store conditional processing (in control section) IO processing signals 9.14 Merge Buffer 9.14.1 Overview The Merge Buffer holds Stores from the SQ after they have retired and before they have updated the Dcache and the Scache. The MGB helps to accumulate Stores to the same cache block by allowing data from multiple Stores to fill up the same MGB entry. Accumulating store data for a whole cache block, conserves system bandwidth by avoiding data transfers for cache blocks that are completely dirtied; the CBox does a CtoD transaction on the system. Furthermore, since the Scache can be written at a minimum granularity of a quad-word, the MGB acts as a holding place for stores which write less than a quad-word thus avoiding Scache fills (in order to merge byte/word/ lword write from the processor). The Merge Buffer has 16 entries. Two input ports, each providing a data and address path, are driven from the SQ, and up to two entries can be allocated each cycle. The addresses compare against the existing MGB entries. 
Each entry contains a physical address, 64 bytes of data, a 64-bit mask to indicate which bytes of the block need to be written into the Dcache, and another (TBS) mask to indicate which data needs to be written to the Scache. If a Store address matches one already in the Merge Buffer, the Store's data are loaded into that entry, merging with the data there. The entry's byte mask is updated to reflect the new bytes that are being written.

The Merge Buffer arbitrates with the Fill Buffer (Cbox fill returns) for the Back End Bus. Once the MGB wins arbitration, it looks up the Dtags. If the cache block is writeable, the data block is updated in the Dcache and the Dtags are updated accordingly. If the block doesn't exist in the Dcache, a miss request is made to the MAF. If all 64 bytes of the MGB entry have been written, a block ownership request is launched; otherwise a block fill is initiated.

Fills may need to merge data from the Merge Buffer before the fill updates the Dcache. Hence, Fill addresses search the Merge Buffer for a match at the same time that the Btags are looked up. If a match is found, the Merge Buffer drives the valid bytes onto the DC_Data Bus, effectively merging with the fill data. Probes too need to check the MGB. Probes check the Merge Buffer at the same time that they check the Dtags. If a probe hits on a Merge Buffer entry, the Merge Buffer reads out the valid data bytes and sends them to the Cbox.

9.14.2 Merge Buffer Allocation

Every clock, the Store Queue sends up to two store addresses over to the Merge Buffer. The incoming store may merge with an existing Merge Buffer entry if:
• the physical address (PA<47:6>) and VA<14:13> match;
• both are pure store ops (not IO, WMB, STC, etc.);
• if they are to different TPUs, the block is owned.

If the new address cannot be merged or allocated in the Merge Buffer, then a NAK is sent to the SQA, and the Store Queue needs to resend this store later. In this case, the merge buffer also sets write_to_Dcache (unless it is already set) on the entry with which it could not be merged. It is also possible that an address matches and the state is owned, but neither Dcache nor Scache processing is pending (this is signified by the free_to_allocate bit being set). In this case, the thread_id[1:0] is not valid and need not match; merging is legal, and the thread_id[1:0] is set to the new thread.

PA Match && VA Match   TPU Match   State           Merge
Yes                    Yes         x               Yes
Yes                    No          State == Owned  Yes
Yes                    No          State != Owned  No
No                     x           x               No

If the store is accepted from the Store Queue, the merge buffer index that the store merges with, or is freshly allocated to, is sent over to the Store Queue. If a store does not merge with an existing entry, a new entry is allocated only if one is available (has its free_to_allocate bit set). If there is no free entry around (and the incoming store does not merge), the merge buffer NAKs the store copy-out. The Store Queue stalls (and continues to make the same request) until either the Store Queue asserts Purge_mgb (the Store Queue starts filling up) or an entry becomes available (because its age counter times out).
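The allocation table above can be condensed into a small predicate. The following C sketch is illustrative only; the entry fields (owned, free_to_allocate, tpu) are stand-ins for the real MGB state described in this section.

#include <stdbool.h>
#include <stdint.h>

typedef struct {
    bool     valid;
    bool     free_to_allocate;  /* no Dcache/Scache processing pending    */
    bool     owned;             /* block is owned (exclusive) by this CPU */
    uint8_t  tpu;               /* thread that currently owns the entry   */
    uint64_t pa_block;          /* PA<47:6> of the entry                  */
    uint8_t  va_idx;            /* VA<14:13>, kept for Dcache indexing    */
} mgb_entry_t;

/* May an incoming pure store (not IO/WMB/STx_C) merge into this entry? */
bool mgb_may_merge(const mgb_entry_t *e, uint64_t st_pa_block,
                   uint8_t st_va_idx, uint8_t st_tpu)
{
    if (!e->valid || e->pa_block != st_pa_block || e->va_idx != st_va_idx)
        return false;                       /* address must match           */
    if (e->free_to_allocate)
        return true;                        /* idle entry: adopt new thread */
    if (e->tpu == st_tpu)
        return true;                        /* same TPU always merges       */
    return e->owned;                        /* cross-TPU merge needs owned  */
}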
9.14.2.1 Boundary Case If the SQ sends a request such that it tries to set the dcache_dirty[64] bits at the same time it is being cleared from a previous write dispatch to the Dcache (in DC_data stage of the Back End bus pipe), then the setting of dcache_dirty takes precedence over clearing. 5 January 2001 ··· Subject To Change Compaq Confidea1tia I Memory Instruction Execution Unit - the Mbox 9-27 Merge Buffer 9.14.3 Merge Buffer Writes to Dcache A Merge Buffer entry will be marked for write_to_Dcache once it satisfies either of the conditions: • entry is not an IO-store andif • Line is fully dirty (all dcache_dirty bits are set) or if • Its timer has timed out or if • Purge_mgb is active and the entry's TPU matches Purge_mgb_TPU[3:0] or if • A STx_C op is allocated to this entry Purge_mgb_TPU[3:0] is sent by the Store Queue indicating that a given TPU needs to have its Merge Buffer entries flushed, in order to make room in the Store Queue. When Purge_mgb_TPU[3:0] is active, a request is made to the Back End Bus controller, which then disallows the PreMAF from sending a request to the CBox in the following clock. If in the Scache tag launch stage, the Back End Bus receives an ACK implying that the Scache is not running a cycle this clock, then the Back End Bus may be granted to the Merge Buffer. The Store Queue also asserts Purge_mgb_TPU[3:0] when it starts to process a WMB. A picker picks a Merge Buffer entry (to write to the Dcache) when the Back End Bus controller indicates that the following cycle may be used by the Merge Buffer (to make a request on the Back End Bus controller). The picker may pick an entry only if its write_to_Dcache status is set. The picker may be simply a priority encode of the write_to_dcache bits. The picked entry is read and the address sent to the Back End Bus (in the same clock). During the Dtag Read stage of the Back End Bus transaction, the Btag responds with the Dcache state (cache_state= {valid, shared, owned}). If the line is owned, the Scache set (scache_set[l :O]) is sent to the MGB. If the Dcache does not have write permission (dcache_state!=owned), it aborts the data portion of the transaction by de-asserting DC_data_valid. Else, the Merge Buffer assumes that the data is going to be successfully written to the Dcache and therefore it is safe to ask the Store Queue to deallocate the entry(ies) that corresponded to the Merge Buffer entry. The Merge Buffer data is accompanied by the 64 dcache_dirty bits. After the data is sent over the Back End Bus, the dcache_dirty bits are reset (if DC_data_valid is asserted and the Cbox indicated that there was no ECC error). If the Btag read revealed that the line is not valid in the Dcache (state!=valid), the data phase of the Back End Bus is aborted (DC_data_valid=O). However, in this case, the Merge Buffer assert MGB_id_dealloc<3:0>, to allow the Store Queue to deallocate. If 9-28 Compaq Confidential Memory Instruction Execution Unit - the Mbox 5 J<1nuary 2001 ~· Subject To Change Merge Buffer the line does not have the correct state (state is invalid or line is not owned by the processor) or if line_fill_needed bit is set, a request is made to the MAF via the PreMAF. The Merge Buffer may send one of 3 types of requests to the MAF (PreMAF): • ItoD: if all bytes are dirty, but the state-=valid • CtoD: if all bytes are dirty, but the state==shared • ReadMod: if line_fill_needed is set It is possible, that the Cbox may refuse this request (send NAK in Scache tag launch stage). 
In general it is expected that the probability of this happening is low (bank conflicts due to internal Cbox cycles); hence, if a request is NAK'ed, the write_to_dcache bit is once again set and the request is routed via the Back End Bus to the MAF. Redoing the entire Back End Bus cycle may seem wasteful, but this avoids SILO'ing the request and re-issuing it to the MAF, when the CBox detects an occasional collision with an internal (Cbox) request. Note that the bytes that were successfully written into the Dcache, the first time around, are not re-written again; the MGB_id_deallocate[3:0] that is sent out corresponds to the new bytes that are written this cycle (if any). If no bytes are written into the Dcache during this repeat cycle, STQ_dealloc (which is used to qualify MGB_id_deallocate[3:0]) is set to 0. 9.14.4 Scache Writes An entry needs to be written through to the Scache if there are qwords ready to be written. The following figure illustrates the various states of the Scache write through process. 5 Jatwary 2001 ·- Subject To Change Compaq Confide11tial Memory Instruction Execution Unit - the Mbox 9-29 Merge Buffer Figure 9-2 Scache Write-Through Process state== owned && !line_fill_needed tim e:r! =4 Each entry has the state machine shown above. Thus, until an entry is ready to be written out to the Scache, it stays in the NoRequest state. An entry may request the Scache picker to write through, to the Scache only after the entry reaches the B tag Write stage of the Back End Bus pipeline. Past the Btag Write stage, the Merge Buffer entry checks the TBS bit and transitions to RequestScache state, provided line_fill_needed bit is not set (line_fill_needed is set in the B tag Write pipe stage, if the cache block needs to be fetched, because all the byte_dirty bits in a quadword are not set). TBS is set if any byte_dirty bits are set in the octaword. Once an entry is allowed to drive its data to the Scache, it transitions to the ScacheGrant state; the grant signal is used to read the (PA) address (to be sent to the CBox). It is presently expected that it will take 4 cycles to receive the ack signal from the Cbox (implying that the data was accepted). Hence a 2-bit timer needs to be loaded to count down the time it takes to receive the ack signal (alternately, one could stage this state for 4 cycles). At the end of the delay, the ack is sampled. If ack is asserted, then it transitions to NoRequest state else it transitions to the RequestScache state, in order to repeat the request (to the Cbox). Compaq Confidential 9-30 Memory Instruction Execution Unit - the Mbox 5 January 2001 - Subject To Change Merge Buffer 9.14.5 Probe handling in the Merge Buffer Probes arbitrate for the Back End Bus (just like fills) and check the Btags once granted. The merge buffer is checked at the same time and if the address matches and if dcache_state == owned, merge buffer responds with mgb_hit . If there is a hit, the Merge Buffer writes the VDB_idx[5:0] which is sent by the Back End Bus (from the CBox). If a probe (with invalidate) hits the Merge Buffer, line_fill_needed is not allowed to be set (unless it is already set); this is different from other Back End Bus transactions (in all cases, line_fill_needed is set if needed). Note that data evicted by probes (unlike Scache write-throughs), are not full quad words. Hence 16 byte_dirty bits are read (along with the octaword data) and sent to the Victim Data buffer (in the CBox). 
A separate signal WriteThruDone is sent, once the ack for each of the octawords that were sent (to the CBox) are received back (no octaword_dirty bits are left set). The CB ox closes the Victim Data buffer entry once it receives WriteThruDone. 9.14.6 Line fill and Merge Buffer After the Btag Read stage of the Back End Bus pipeline, the Merge Buffer entry may decide to launch a FetchLineMod request to the MAF (via the PreMAF) if any quadword does not have all the byte_dirty bits set (although atleast one byte_dirty is set) and if the line_valid bit is not set (implying that the complete line is not valid). This status is latched in the line_fill_needed bit for the entry. During the Btag Read stage (of the Back End Bus pipeline), the Merge Buffer is CAMed (along with the Load Queue). If there is an address match, the Merge Buffer asserts mgb_hit. Based on cache_state (cache_state=Btag_statelMGB_statelCbox_state), following are the actions taken by the Merge Buffer. • cache_state==owned. This is the case either when we initiated a fill from the Scache since we didn't have a qword full of data to send to the Scache (line_fill_needed is set), or a fill initiated by a load request, is returning from the Scache or system. The Merge Buffer indicates that it will drive the DC_data bus corresponding to the byte_dirty bits that are set. The merge buffer reads out the bytes corresponding to byte_dirty and sources the data onto the DC_data bus. The merge buffer receives the entire DC_data bus and in the cycle that the DC_data bus is being driven (towards the Dcache) it also writes itself (effectively merging the fill data with the existing dirty data). The dcache_dirty bits are read out and sent along with the DC_data (to enable writing to the Dcache, only the bytes that need to be updated). 2 cycles following the data return (from the Cbox), the ECC status is returned. If no errors are reported, the Back End Bus is responsible for setting the valid bit on the Btags; else if a correctable ECC error is reported, the valid bit is left unset. If no (ECC) errors are reported, the line_fill_needed bit is reset, all the byte_dirty bits are reset and the line_valid bit is set. If a correctable ECC error is reported, the line_fill_needed and byte_dirty bits are left unchanged; the line_valid bit not set . If the Dcache accepts the Back End Bus request (Back End Bus indicates DC_data_valid), the dcache_dirty bits will be reset. Compaq Confidential 5 January 2001 - Subject To Change Memory Instruction Execution Unit - the Mbox 9-31 Merge Buffer The mgb_state of the Merge Buffer entry is set to cache_state . Note that if the cache block does not exist in the Dcache (Btag state==invalid), the Back End Bus controller is responsible for asserting all of the dcache_dirty bits. • (cache_state==validllshared) This is the case when a fill (initiated by a load) returns on the Back End Bus but the final state is shared (not owned). The data is merged (as above) and written into the merge buffer. The Merge Buffer indicates to the Back End Bus that the transaction needs to be aborted (abort_fill) i.e the Dcache will not be written (DC_data_valid is set). Load retries triggered due to the data fill will be satisfied, since the merged data will still be driven onto the load data bus. • Line comes with ownership, but no data. This is the case when a CtoD was sent in the past. The Back End Bus transaction proceeds as normal (using the dcache_dirty bits as set by the merge buffer). 
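The byte-granular merge performed when a fill hits a Merge Buffer entry can be pictured with a simple mask loop. This is an illustrative C sketch only; byte_dirty and the 64-byte block layout follow the description above, and everything else is assumed.

#include <stdint.h>

#define BLOCK_BYTES 64

/* Merge returning fill data with the dirty bytes held in an MGB entry.
 * Dirty bytes win, so the line driven to the Dcache (and written back
 * into the entry) reflects the newer store data. */
void mgb_merge_fill(uint8_t merged[BLOCK_BYTES],
                    const uint8_t fill_data[BLOCK_BYTES],
                    const uint8_t mgb_data[BLOCK_BYTES],
                    uint64_t byte_dirty)
{
    for (int i = 0; i < BLOCK_BYTES; i++)
        merged[i] = ((byte_dirty >> i) & 1) ? mgb_data[i] : fill_data[i];
}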
9.14.7 10 Stores IO stores do not set the write_to_dcache bit; instead, when the CBox is ready to send the IO store to the system, it sends the IO store address on the Back End Bus. Once the Merge Buffer entry (corresponding to the IO store) matches the address, it is forced to write through to the Scache. The entry is deallocated after the Scache write through is complete. For a complete description of I 0 handling refer to the IO handling document. 9.14.8 Store Conditional Support STx_C ops may not merge with existing entries in the Merge Buffer. A fresh entry needs to be allocated. However, unlike other store ops, STx_C does not send the ack to the Store Queue until it has obtained ownership (or been refused ownership) of the cache block. If ownership is obtained, the Merge Buffer acknowledges the Store Queue request and at the same time writes the STx_C into the Dcache. If on the other hand ownership is not obtained, the Merge Buffer acknowledges the Store Queue, but fails the STx_C and consequently does not update the Dcache. Thus, write_to_dcache bit is set as soon as the STx_C is allocated into the Merge Buffer. Past the Btag Read stage of the Back End Bus pipeline, the STx_C is ack'ed if the block is owned. If not, a CtoD STx_C request is routed to the PreMAF. Once the CtoDStXC request completes, the STx_C is known to pass or fail (in the Btag Read stage). The Merge Buffer receives lock_TPU<3:0> from the Load Queue. The lock_TPU<3 :0> signifies that the TPU (indicated by the corresponding bit) owns the lock register (set by a previous LoadLock instruction); if at any time, the TPU deasserts the lock, the Merge Buffer is obliged to fail the STx_C. Compaq Confidential 9-32 Memory Instruction Execution Unit - the Mbox 5 Janwiry 2001 -- Subject To Change Merge Buffer If a previously initiated CtoDStXC operation returns (on the Back End Bus) and does not find a matching Merge Buffer entry, the CtoDStXC is aborted (DC_data_valid=O) on the Back End Bus. 9.14.9 MB and WMB Processing Every clock, the merge buffer will send out a 4-bit vector TPU {0, 1,2,3} _coherent . Each bit in the vector is set if all entries relating to that TPU (bit 0 signifies thread 0) in the merge buffer have dcache_state = owned. Thus each Merge Buffer entry has the TPU{0,1,2,3 }_owned vector; this vector is derived from mgb_state[2:0]. TPU{0,1,2,3 }_coherent is essentially a wired-AND of all the TPU{O,l,2,3 }_owned bits. When a MB is in flight, the Store Queue should assert Purge_mgb along with Purge_mgb_thd[3:0] to assure prompt purging of merge buffer entries for that thread. The Store Queue will use TPU{0,1,2,3 }_coherent to decide to proceed with retiring an MB. WMB is retired at issue time, but the Store Queue needs to ensure that no stores (for the thread in which the WMB is present) are sent to the Merge Buffer until Thd{0,1,2,3}_coherent is active (for the thread). 9.14.10 MAF request The Merge Buffer may make one of the following requests to load the MAF (via the pre-MAF): ltoD: the line is completely dirty in the Merge Buffer, but the line is not present in the dcache (dcache_state != valid). CtoD: line is completely dirty, but does not have ownership (dcache_state !=owned) CtoDSTxC: this is sent out for Store Conditional operations FetchLineMod: line does not exist in cache and the line is not completely dirtied in the Merge Buffer Inval: this is caused due to the Evict instruction. The effect is to do an internal probe with invalidate. 
The PreMAF gives the highest priority to Merge Buffer requests; however, the CBox may reject the request 3 cycles later. Therefore, the write_to_dcache status is staged and may be set back, if the CBox rejected the request. 9.14.11 Cache Movement ops (WH64, Evict) TBS Evict is sent from the Store Queue. Neither the byte_dirty nor the dcache_dirty bits are set. Evict sends an Inval request to the MAF. 9.14.12 Merge Buffer States Figure 9-3 is a logical state diagram that illustrates the states through which a Merge Buffer entry progresses. Each of the states are represented as follows: Free Merging .. .- free_to_allocate; -free_to_allocate && -write_to_Dcache && .... mgb_hit; Compaq Confidential 5 January 2001 ··· Subject To Change Memory Instruction Execution Unit - the Mbox 9-33 Merge Buffer Dcache_write LineFill ScacheWrite ...- mgb_hit; line_fill_needed; ScacheRequest II ScacheGrant; Figure 9-3 Merge Buffer Entry States ~e_to_a11x.ate Ffte !ScacheRequest ~ !S ca.cheG:tard: ~ fiee to allocate )~ !:6:ee_to_alhca.te !wi:ite_to_Dca.c:he && !mgb_hit wJ:ite_to_Dcaclte 11 :rrgb_hit ! l.ine_fill_needed 9.14.13 Data Array The data array slice consists of: • 9-34 512 bits of data Compaq Confidential Memory Instruction Execution Unit - the Mbox 5 Jc1m1ary 2001 ~- Subject To Change load Queue • 64 byte_dirty bits for each byte of data byte_dirty is set when: store data (from the Store Queue) is being written per P{O,l }be[7:0] and PA[5:3] It is reset when: the write through to the Scache (CBox) is complete or if a line fill merges with an existing entry The data is written as shown by the Store Queue. It is also written by the Back End Bus. The data array has 3 write ports (2 write ports from the Store Queue and another write port for the fill data from the Back End Bus). There are 2 read ports (1 for Dcache writes and another for Scache writes). 9.14.14 Address Array Following are the fields in the Address array: TBS 9.14.15 Control Section TBS 9.15 Load Queue VIRTUALLY IDENTICAL TO Section 9.4 ..... The Load Queue, abbreviated as LQ, holds unretired Loads that have been issued to the Mbox. The LQ entries are allocated to the Loads in program order, by thread. The LQ is used to maintain ordering of Loads related to Stores and Memory Barriers. It also is used to re-issue Loads from the Mbox itself, when a Load cannot complete successfully when it is issued from the Qbox. The LQ has 64 entries and is partitioned equally between threads at run-time. Thus, when a single thread is running, all 64 entries are allocated for that thread; if two threads are running, each is allocated 32 entries. When four threads are running, each thread is allocated 16 entries. When a thread quiesces, it gives up its load queue entries to the other active threads. Each LQ entry contains the physical address, opcode, INum, a done bit, retry bits and a TBS. The LQ is allocated in program order (by thread) by the Qbox, which assigns LQ numbers (LNums) for all Loads during the Map stage. If the Load completes successfully, it is marked as done in the LQ. Otherwise, the Load is marked to be retried. The Load may retry due to a cache miss, a bank conflict or may be a class of Load that can only complete at retirement (i.e., I/O Loads which cannot be done speculatively). Compaq Confidential 5 January 2001 ··· Subject To Change Memory Instruction Execution Unit - the Mbox 9-35 load Queue Every cycle, the retry logic in the LQ scans all the entries and finds the oldest ready entry (in a given thread). 
Readiness is defined differently for each type of retry, but generally refers to when the Load can make further progress. The retry logic then sends the Load to the (DIFFERENT) pre-MAF. Retry candidates are chosen from different threads in a round-robin fashion. The LQ facilitates speculative execution of Loads by allowing Stores to check if a Load younger than it, in program order, may have completed (i.e., the Load returned data before the correct Store data had been sent to the Mbox). When the Store address operation dispatches from the Qbox, it checks the LQ. If a match is found, the oldest Load that matches the Store address is forced to trap. Note that this check is relevant only for Loads and Stores within the same thread. The LQ also facilitates speculative execution of Loads past Memory Barriers. This is made possible by allowing Stores from the Merge Buffer to check the LQ for possible address matches. In this case, a Store needs to trap a Load from another thread. Note that in this case, up to three Loads can match the Store's address and signal a trap simultaneously. Probes (invalidates) also may force a previously completed load to trap; in this case, upto 4 loads (belonging to different threads) may trap. LQ entries are deallocated once the Load is past the retire point. The Pbox sends the LQ an INum for each thread that corresponds to the youngest operation that is being retired. All LQ entries that are older than this INum are marked as being deallocated. The LQ drives a signal to the Qbox every cycle that specifies the youngest Load that may issue out of the Qbox. This signal is based on the youngest Load that is being deallocated and the number of available entries in the LQ. 9.15.1 Load Queue Allocation The Load Queue bas 64 entries that are shared between the currently active threads. Load Queue entries are arranged in blocks of 4 entries (referred to as load blocks). Entries within a block are physically contiguous and are in strict INum order (the oldest is allocated to entry 0 while the youngest to entry 3). Blocks may be dynamically reassigned from one thread to another in order to achieve a balanced sharing. In normal mode of operation, the number of load queue entries allocated to a given thread is shown. During retries and traps, we need to know the age of the load with respect to other loads in order to pick optimally. Each load block bas an young_vector associated with itself, which gives the location (bits correspond to the physical location of other entries) of loads that are older than itself. When a thread is activated (or if a previously quiesced thread wakes up), it starts to take away load blocks from other threads as those threads release their entries (via retirement). Similarly, when a thread quiesces, it releases its entries which are then assigned equally among the threads that are active. 9.15.2 (Age) Young Vector generation TBS 9.15.3 Load Queue Limit and Block Allocation TBS 9-36 Compaq Confidential Memory Instruction Execution Unit-the Mbox 5 Jc1nuary 2001 ·-Subject To Change load Queue Case 1: New Thread in Machine (or a Previously Quiesced Thread Waking Up) TBS Case 2: A Thread Quiesces TBS 9.15.4 Thread Choosing TBS 9.15.5 Block Assignment TBS 9.15.6 Load Issue TBS 9.15.7 Load Retries Loads may not complete (return data) after they have been issued. They end up retrying due to the following reasons: • The block does not exist in the Dcache (dcache miss). • The block does not exist in the Scache (scache miss). 
• Dcache bank conflict • The data exists in the Store Queue, but the data has not arrived yet. • The load is to IO space and thus must issue (to the system) only when the load has retired (at retire). When the load encounters any of the above conditions, it is marked for retry. The load stalls in the Load Queue until its stall condition (retrying condition) has gone away i.e a load stalled on dcache miss, is allowed to retry once the data from the Scache is imminent. However, since there may be more loads ready to retry than there are retry ports (currently ports 0 and 1 are reserved for retries), a picker picks the oldest 2 loads in a TPU (and goes round robin between TPUs). It is possible that a load may get multiple reasons to retry (bank conflict as well as Store Queue data not available at the same time). However, only 1 retry reason is recorded in the Load Queue. The retry reasons are prioritized in the following order: 1. At retire 2. Scache miss 3. Dcache miss 4. Bank conflict 5. SQA immediate retry When a load is asked to retry, it sets the retry_status register to record the reason for retry. Compaq Confidential 5 January 2001 -- Subject To Change Memory Instruction Execution Unit - the Mbox 9-37 Load Queue 9.15.8 Dcache Miss When a load issuing from the Qbox, misses in the Dcache, it sets its retry_status=dcache_miss. It also sets its retry_ready bit, thus preparing to retry immediately. A picker in the Load Queue picks the oldest 2 loads from the Load Queue and sends the physical address (PA[47:6]) to the Pre MAF. The Pre MAF arbitrates between requests arriving from the Load Queue, the IBox as well the Merge Buffer; once, the CBox accepts the retry request, a bubble request is sent over to the Qbox. Using the Load Queue to determine the picking in the MAF helps as follows : • The Load Queue already has a picker that picks the oldest load in a TPU goes round-robin between the different TPUs. The MAF looses any notion of program order and thus the MAF can never pick the oldest waiting load. Furthermore, we need have only 1 picker (not 1 in the Load Queue and another in the MAF). • We cannot allocate 3 load misses per cycle in the MAF. Therefore the Load Queue acts as a queue into the MAF. • The Scache loop is shorter than the retry loop. Hence if the Load Queue waited until the MAF picked an address to send through the Scache and then initiated the retry process, it would unduly increase the latency of the load (as shown in the pipelines below). 9.15.8.1 MAF Pick TBS 9.15.8.2 Load Queue Pick TBS 9.15.9 Scache Line Miss The Load Queue schedules the dcache-miss retry, assuming it will hit the Scache. If it turns out that the line does not exist in the Scache, then it sets the retry _status=scache_miss and waits in the Load Queue. When the CB ox receives the data from the system, it sends the fill address and subsequently the fill data to the Mbox. The fill address is allocated in the Fill Buffer. Once the Back End Bus grants arbitration to the Fill Buffer, the fill address is sent to the Dtags, the Merge Buffer, the Load Queue as well as the Fill Buffer. If a fill dispatching on the Back End Bus matches a Load Queue entry that is stalled on a scache miss (retry_status==scache_miss), then its corresponding retry_ready bit is set. Compaq Confidential 9-38 Memory Instruction Execution Unit - the Mbox 5 January 2001 - Subject To Cfumge Load Queue 9.15.10 Load Queue retry - Bank Conflict Bank conflicts are detected in the A phase of MO. The retry status is written in Ml, phaseB. 
The retry pipeline for a bank-conflict retry is:

Stage 0: Write Load Queue; set retry_status=bank conflict.
Stage 1: Pick oldest retry (retry_code==bank conflict); read Load Queue.
Stage 2: Send Bubble request to QBox.

The load may be picked (to retry) as early as M2 and may start to retry (send the bubble request) 2 cycles later.

Add SQA immediate retry.

9.15.11 Retry at Retirement

TBS

9.15.12 Retry Block

TBS

9.15.12.1 Pick Oldest Retry

TBS

9.15.12.2 Oldest and Next Oldest Retry Chooser

Each bank (of the Load Queue) drives out the silo_id and the INum of its oldest block, as well as an indication (more_than_1_retry_rdy) that the block has more than 1 entry ready for retry. The 2 INums are then compared. If the older block (as determined by the INum comparison) has more_than_1_retry_rdy asserted, then the oldest and next_oldest are assigned to the same block.

Once the oldest and next_oldest entries are selected, retry_grant_bank is driven to the banks. If the oldest and next_oldest are from separate banks, retry_grant_bank is driven to both banks; otherwise it is driven only to the one bank that is selected. The selection of the oldest and next_oldest and the sending back of the grant (to disable the Load Queue entry from bidding again) must happen in 1 phase (as shown above). In the next phase, the INums of the oldest_retry and next_oldest_retry are read out (using the silo_id read out in the previous phase).

9.15.12.3 Thread Chooser

A 4-bit vector last_thd_chosen[3:0] records the last thread chosen (to retry). Priority encode thread_mask[3:0], starting at last_thd_chosen[3:0], to generate thread_choose[3:0].

The Pre-MAF may send block_rty_TPU[3:0] to disable retries from being chosen in a given TPU. This is used in conjunction with TPU_mask[3:0] to choose the next TPU to retry.

9.15.13 Prefetches

Following are the forms of prefetch instructions that arrive on the load port(s):

• LDL r31,(r): if Dcache miss, fetch from Scache/system and install in Dcache (Pref)
• LDQ r31,(r): if Dcache miss, fetch from Scache/system and set LRU to evict (PrefEvict)
• LDB r31,(r): if Dcache miss, fetch from Scache/system in shared state (PrefShared)
• LDW r31,(r)
• LDF r31,(r)
• LDG r31,(r): if Dcache miss, fetch from Scache/system but don't cache in Bcache (PrefDC)
• LDS r31,(r): if Dcache miss, fetch from Scache/system in owned state (PrefMod)
• LDT r31,(r): if Dcache miss, fetch from Scache/system but do not cache in either Bcache or Dcache

Pref: Send out a FetchLine command to the Cbox.
PrefEvict: Send out a FetchLineEvict command. The Cbox treats it the same as a FetchLine; however, when the block returns to the MBox, the Dtag sets the LRU bit to evict.
PrefOnce: Send out a PrefOnce command to the Cbox. The Cbox sends out a FetchLine command but does not cache the block in Scache (invalidates
PrefDC: Send out a PrefDC command to the Cbox. The

9.16 Load Traps

TBS

9.16.1 DTB Trap

9.16.1.1 Load/Store Order Trap

TBS

9.16.1.2 Inval Trap (Traps Due to Probe-Invalidates)

TBS

9.16.1.3 MGB Trap (Traps Due to Merge Buffer Dispatches on Back End Bus)

TBS

9.16.1.4 Trap Summary

Table 9-8 shows the trap summary.
Table 9-8 Trap Summary Trap Status Bit (Trap_status) Signalled in Pipe Stage Trap Condition DTBtrap dtb_trap Ml TB_not_present II TB_access_vio Load-store order conflict order_trap M2 TBS Parity/non-correctable error Machine_check M3 (P{ 0, 1,2} ECC_check_result == error_corrected) II P{O,l,2}dcache_parity_error ECC correctable error correctable_error M3 Inval trap (also MGB trap) in val_trap Load Queue not available Back End Bus Tag Read stage (DC_addr[47:6]==LdQ.PA[47:6])&& ((DC_op==store) && (DC_thread_id != Ldq.thread_id) II (DC_op==inval)) MO load_queue_not_avail 9.16.2 Trap Resolution Up to 4 threads may have their trap bits set in thread_trap[3 :0] but we can process only 1 thread at a time. In order to prevent the Completion Unit from advancing the retire point past the trap point, we assert stall_retire_thd[3:0] for each of the threads that have a possible trap. The Load Queue examines the trap status of the Load Queue blocks belonging to the thread chosen. It finds the oldest Load Queue entry that has its trap bit set. The block is oldest if N AND(younger_vector[i] ,block_trap) is true. In M4 phase B, the lnum for the oldest block in each bank is compared. Drive grant_trap_bank in phase B, to the bank that is older. The oldest entry in the block that is granted resets its trap bit (so as not to bid again). The inum of the oldest load is read out of the Load Queue and sent to the Retire/Kill unit. 9.16.3 Thread chooser Each port records the thread ID in M3 (a 4-bit vector thread_trap[3:0]) if it has a potential trap. thread_trap[3:0] may also be loaded directly by the Load Queue (when an probe invalidation or store dispatch from the Merge Buffer finds a hit in the Load Queue). The Store Queue sends its trap status (for each of the ports). If either of the threads (on the 2 store ports) match a thread in the thread_trap register, then choose the matching thread. If both threads match, choose one. If there is no thread match between the store ports and the thread_trap register, then choose (in a round robin fashion) a thread from among the bits set in the thread_trap register. Send the thread ID to the Trap Resolution block in M3 (i.e send the chosen thread at the same time the ECC error is being latched at the Load Queue). 5 January 2001 - Subject To Change Compaq Confidential Memory Instruction Execution Unit-the Mbox 9-41 DcacheTags If the Store Queue has traps on 2 different threads, while the Load Queue has a trap on yet another thread, then allow the Store Queue to report its traps (the Load Queue loses its bid to report its trap that cycle). All this is because, the Mbox can report only 2 trap inums per clock. 9.16.4 Kill Bus The Kill bus (kill_valid,kill_inum[7 :0], kill_thread_id[l :0], trap_type[?? :O]) is sent to the Retire/Kill unit. 9.16.5 Litmus 1 Handling If a probe-inval or a store from the Merge Buffer sets the inval_trap, it asserts stall_retire . Once the Retire/Kill unit acknowledges (for that thread) that there are no more pending retired instructions in the pipe, the Load Queue initiates trap processing on that thread. This ensures that a load queue entry may trap only if it hasn't advanced past the retire point. Exact interface signal names are not known at this time. 9.17 Dcache Tags The MB ox contains four tag arrays. These tags describe the contents and state of every line in the DCache. Three of the arrays are at the front end, and are connected to the load ports. One array is at the back end, and is tied to the Back-End Bus. 
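Purely as an organizational sketch (not the actual arrays), the four tag structures described above can be pictured as follows. The entry fields and depth used here are placeholders; the real geometries, fields, and port counts are defined in the subsections that follow.

#include <stdint.h>
#include <stdbool.h>

#define NUM_LOAD_PORTS 3
#define TAG_ENTRIES    512               /* placeholder depth per array copy */

typedef struct {
    uint64_t phys_tag;                   /* physical tag bits                */
    bool     valid;
    bool     fill_in_progress;
    bool     parity;                     /* parity protection per entry      */
} dtag_entry_t;

typedef struct {
    /* Three identical front-end copies, one per load port, used only to
     * decide whether an incoming load hits.                                 */
    dtag_entry_t front_end[NUM_LOAD_PORTS][TAG_ENTRIES];
    /* One back-end array, tied to the Back-End Bus, handling stores,
     * synonyms, probes, and fills.                                          */
    dtag_entry_t back_end[TAG_ENTRIES];
} mbox_dcache_tags_t;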
9.17.1 Front End Tags The front end tags serve one and only one purpose, namely to indicate whether a load coming in on its associated port is a hit. Tag launch is slated for MOA, and hit determination for MlA. In accordance with the DCache structure, the tag is 2-way set associative, with 512 entries per set indexed by VA<14:13> and PA<l2:6> (or, equivalently, by VA<14:6>). The data stored in the front end tags are the physical tag, PA<47:13> and valid and fillin-progress bits. One parity bit protects the tag entry. Parity errors have the effect of forcing a trap. The fill-in-progress bit indicates that the associated tag entry is not yet valid, but will soon be. This allows the Load Queue to initiate a retry for that entry immediately. The front-end tags have three read ports and one write port. Physically, there are three copies of the front-end tag to provide the necessary number of ports. To support multiple synonym invalidation, valid bits for all four combinations of VA<14:13> for a given physical index PA<12:6> need to be able to be rewritten simultaneously. It may be convenient to pull the valid bits out of the main tag array to permit this operation. 9-42 Compaq Confidential Memory Instruction Execution Unit - the Mbox 5 Jc1nuary 2001 m Subject To Change DcacheTags 9.17.1.1 Timing Table 9-9 show the Dcache front-end tag timing. Table 9-9 Dcache Front-End Tag Timing EO MO M1 A B A B A Receive OP Issue from Qbox Ebox drives LD Addr Launch VA into TB, Tag, Stq Read PA's from tags ComparePA's with TB; determine DC_Hit B 9.17.1.2 Tag Operations • Incoming loads on the three load ports are looked up in the tag, using VA<14:6> as the index, compared with the output of the DTB, and return a hit indication, a set selection and a fill-in-progress indication. • The back end tag may send a new tag entry to be written. Some flight time delay is acceptable, as long as the new tag arrives before the entry is set valid. • The back end tag may send a new set of valid bits to be written . The tag needs to support both a read and a write in the same cycle. 9.17 .2 Back End Tag This tag handles stores, synonyms, probes and fills. Because this uses physical addresses, this tag is 8-way set associative, indexed by PA<12:6>. VA<14:13> are concatenated to the set number. The back end tag must contain, in addition to PA<47:13>, valid, shared and owned bits, and the SCache set in which the cache line resides. Each pair of entries also must contain a set allocation bit, indicating the destination set of the next DCache fill. The tag entry is parity protected. 9.17.2.0.1 Tag Operations • Start fill. When the Cbox MAF accepts a PreMAF miss request, the PreMAF requests a back end tag launch, using VA<14:6>, and reads the set allocation bit. The entry so indexed is then written with the new tag value, including the appropriate ownership and SCache set values. The valid bit is cleared and the fill-in-progress bit set. The set allocation bit is flipped, unless directed otherwise (Prefetch Evict Next). With the exception of the valid bit, these data need not be written immediately, as long as the write occurs before the End Fill operation. • Endfill. When all fill data are transferred from the Cbox to the DCache array, the PreMAF requests that the valid bit in the tag be set and the fill-in-progress bit cleared. • Probe. The CBox requests a tag launch with PA<l2:6>. 
If any of the 8 entries so indexed hit, a write cycle is initiated clearing the matching valid and fill-in-progress bits, of which there may be as many as four. 5 January 2001 ··· Subject To Change Compaq Confidentia I Memory Instruction Execution Unit - the Mbox 9-43 Dcache Array • Store. The Merge buffer requests a tag launch with PA<12:6> and also supplies VA<14:13> at the time that the PreMAF accepts a merge buffer evict request. For the two entries indexed by VA<14:13>, return a hit indication. For the other six entries, invalidate entries that hit. 9.17.3 IPRs Tag operation is controlled by the DC_CTL IPR. Relevant fields include the following: DC_CTL Field Description SET_EN[l:O] Gates the match lines for the respective sets. F_HIT Forces the DC_HIT line. FLUSH Clears all the valid bits. F_BAD_TPAR Forces bad parity on tag writes. DCTAG_PAR_EN Gates parity checking. 9.18 Dcache Array The Data Cache, or Dcache, is a 64K-byte on-chip data storage. The data are organized in 64-byte blocks, divided into 32 4-byte (longword) banks (virtual address bits 6-2 are used to address the banks). Each bank can accept a read and a write per cycle. Three Address ports are input to the Dcache from the Ebox; three Data ports are output back to the Ebox. There are three data ports to the Ebox upon which read data are transferred from the Dcache. Since only 1 read is permitted per clock, in the event of a bank conflict, the older load on port 0 and 1 is given priority, followed by the younger load on port 0 and 1, followed by the load on port 2. In the event of a bank conflict, the load is retried out of the Load Queue. Cache block fills originating either from the CBox or stores in the Merge Buffer, may write upto an entire cache-block (512 bits) per cycle into the Dcache. The write data is accompanied by 64-parity bits, which are stored in the array, as well as 64 dcache_dirty bits, which control the write to the appropriate byte-bank. During a write, the cache index bits are sent from the Back-End Bus, while the set bit is sent from the Dtags. During a read operation, 3 indexes are presented to the Dcache. After prioritizing between the 3 ports, each bank is selected to drive the load data bus on the corresponding port. Both sets are read out of the Dcache, formatted and sign-extended. Once the Dtag compare as well as the Store Queue check (to see if the Store Queue may drive the data, instead of the Dcache) is done, the appropriate set is selected and sourced onto the Load Data bus. The parity bits (for each byte) are read out and sent along with the Load Data bus. The EBox is responsible for signalling parity errors. Compaq Confidential 9-44 Memory Instruction Execution Unit - the Mbox 5 Jc1nuary 2001 ··· Subject To CJumge Pre...MAF 9.18.1 Read Dcache The Dcache row index (lndx<l2:0>) comprises of VA<14:13> and PA<l2:2>. The row index arrives at the MBox in early MO phase A (or late MY phase B). During phase A, the row address is decoded and possible bank conflicts checked. The array is read in phase B of MO. The data is formatted in early phase A of Ml; the set select (as well as Store Queue hit) is known in phase A of Ml. The data is selected and driven to the EB ox. Back-End Bus Pipeline/Dcache Write Pipeline/Load Pipeline TBS. 9.18.2 Write Dcache Write pipeline is shown above. Write data may originate either from the the Merge Buffer, directly from the CB ox (via a latch in the Mbox) or from one of the 4 Fill Buffer entries (for IO and partial fill data). 
The fill data and the fill address (VA[14:13], PA[46:6]) busses are sent to the Dcache during the Drv DC_data phase of the Back End Bus pipeline. The data is written during phase A of the Dcache Write phase of the Back End Bus pipeline.

9.18.3 Bypass Fill Data

Presently it is thought that the fill data bus (write data) will mux onto the read lines in order to bypass the data to the Load Data bus. This needs to be examined.

9.18.4 Structure

The array is physically structured in 2 halves: left and right. The right half provides the even longwords (0, 2, 4, and so on) while the left half provides the odd longwords (1, 3, and so on). Each half contains both sets A and B. The bits for each set are interspersed (bit 0 of set A and set B are adjacent). The LSBs for each of the 4 bytes are grouped together (0, 8, 16, 24), and the corresponding bit positions of each of the 8 longwords are kept together as follows: (0,32) / (8,40) / (16,48) / (24,56) / (1,33) / (9,41), and so on.

The 32 banks are arranged as 8 sets; each set comprises 4 banks, which are interleaved between the right half and the left half. Thus bank 0 (and bank 2) is on the right array, while bank 1 (and bank 3) is on the left array. 6 sets (3 ports * 2 ways) of differential bit wires (12 wires), or global bit lines, enter the DRV section (sense amp). Following the DRV section is the SWP section, which does format conversion (byte, word, longword, quadword) as well as sign extension. The formatted data for each of the 2 sets (ways) are then muxed with the Store Queue data path and onto the Load Data bus. The channel in between the 2 arrays (left and right halves) is used to route Index<12:0>.

9.19 Pre-MAF

Figure 9-4 shows the pre-MAF queue.

Figure 9-4 Pre-MAF Queue (I-stream Queue, D-stream Queue, Bypass Buffer, Retry SILO/Fill Buffer, port 0/1 bypass paths, PA[47:6] to the CBox, and Retry_P{0,1}_Inum[8:0] to the QBox)

The Pre-MAF queue can accept requests every clock from the following sources:

• I-stream
• Merge buffer
• 2 load retries from the load queue

The Merge Buffer requests have the highest priority and are always forwarded to the Cbox (MAF) in the cycle that the request arrives. The load retries are written into the D-stream Queue, while the I-stream requests are written into the I-stream Queue. If the D-stream request queue is empty, newly issued loads (from ports 0, 1, or 2) may be bypassed directly to the CBox, without going through the extra stages of writing into the Load Queue and then being scheduled from it. The bubble requests are conditioned upon the CBox acknowledging that the request is indeed being sent to the Scache.

Retry requests from the Load Queue which do not need Scache access bypass the D-stream Queue and are allowed to send their bypass request directly. The bubble request is arbitrated 3 cycles prior to being sent.

9.19.1 Merge Buffer Requests

Merge Buffer requests are sent via the Back End Bus. Merge Buffer requests are TBS.
Merge Buffer requests are sent to the MAF, without any queuing delays in the Pre MAF. The acknowledgement (ack) is sent directly to the Merge Buffer. If the request was not accepted by the MAF, the Merge Buffer needs to resend the request. 9.19.2 D-stream Queue The purpose of the Dstream Queue is to buffer requests from the Load Queue enroute to the CBox. Since it takes many cycles to re-read the Load Queue in case the request doesn't get access to the MAF, the Dstream Queue buffers 16 retry requests destined for the CBox MAF. Requests emanating from the Load Queue have a status bit send_to_scache, implying that the request needs to be sent to the CBox. These are the requests that are enqueued in the Dstream Queue. All others proceed directly via the bypass path shown (Port 0 & 1 bypass), to send the bubble request to the Qbox. Retry requests coming in, CAM the Fill Buffer (Retry SILO) as well as the Bypass Buffer. If a match is found, the send_to_Scache bit is reset (implying that the request should not be forwarded to the CBox). Requests entered into the Dstream Queue are allowed to send their bubble request only after all preceding entries in the Dstream Queue have sent their bubble request. By disallowing requests which hit in the Retry SILO (Fill Buffer), from proceeding to the MAF, allows the request to send its bubble request paired with another. It also preserves Scache bandwidth. One exception to this are IO requests, which keeps its send_to_Scache bit set, even if a hit is found (in the Retry SILO). Three cycles after a request is sent to the CBox, an ack is received which implies that the Cbox accepted the request; if the ack is not received, the request needs to be resent. At the time, the ack is received, the PA of the request that was just sent is used to CAM the Dstream Queue; if any entry finds a match, it's send_to_Scache bit is reset (except if its IO). Compaq Confidential 5 January 2001 -· Subject To Change Memory Instruction Execution Unit - the Mbox 9-47 Mbox Back End Bus 9.19.3 Killing Retries The D-stream queue also acts as a staging latch (silo) for retries waiting for Scache access. The kill bus is routed to the D-stream queue and all retry entries need to compare their inum to check if the retry needs to be aborted; if so the entry sets retry_abort. Once the tail_ptr comes to an entry whose retry_abort is set, it suppresses the bubble request. However, the MAF request will still be made. Entries whose retry_abort bit is set, do not assert block_retry. 9.19.4 I-stream Queue The I-stream Queue is constructed as a separate queue primarily because of the following reasons: • !stream requests do not need to CAM previous requests in order to suppress requests to the Scache • There is no bubble widget for !stream requests. The I-stream queue is a 16-entry-deep FIFO. When the !stream queue is almost full, the pre-MAF asserts pmf_full to prevent additional entries from being sent. 9.20 Mbox Back End Bus TBS 9.21 Internal Processor Registers The Mbox Internal Processor Registers (IPRs) provide visibility and control for processor-specific operations in the Mbox. These include handling Translation Buffer misses, along with various kinds of Dstream faults, and enabling and disabling various parts of the box, generally for test and reset use. Mbox IPRs are described in Section 16.3. All IPRs, with the exception of DC_CTL, exist on a per-TPU basis. That is, any read to a readable IPR returns data specific to that TPU. 
Any explicit write to a writable IPR takes effect when the write retires in that TPU. However, many IPRs control chip-wide state. Thus, retiring an IPR write can affect state that is visible to another TPU. For example, the DCache and DTB are shared resources. DC_CTL is a chip-wide IPR, it is used to control the shared DCache. In addition to the general 21464 treatment of IPRs, the Mbox applies the following specific rules to its IPRs; • TB entries must be usable speculatively. For further information, see the Translation Buffer document. IPRs that write the DTB, including invalidates, exist as a pair of IPRs (such as DTB_PTEO and DTB_PfEl). To perform a write, both members of the pair must be written in adjacent slots in the same map block, slotted to the strong load ports, with the IPRl following the IPRO. The IPRl operation is a long latency operation which causes a bubble back, allowing the QBox to release subsequent DTB writers. Compaq Confidentia I 9-48 Memory Instruction Execution Unit - the Mbox 5 J(1nuary 2001 ·- Subject To Change Internal Processor Registers • All other MBox IPRs only take effect on retire, and must be protected with an IFETCHB instruction. The MBox IPR write port is connected to the LD_ADDR[0]<63:0> bus. In order to write an MBox IPR, two identical HW_MTPR instructions must be issued in adjacent slots in the same map block, slotted to the strong load ports. This ensures that one of the two writes will travel down the LD_ADDR[O] bus. The other will go down LD_ADDR[l] and will be ignored. • The MBox IPR read port is connected to the Ebox IPR read bus, which operates as a 5-cycle multimedia instruction slotted through a weak load picker. Thus, all MBox HW_MFPR instructions must be issued on the weak load port. In addition to the MBox IPRs, several other box IPRs also interact with the MBox. They are handled as follows: • The CBox IPRs are mapped into physical memory. Access to them is via IO Loads and Stores. • The IBox IPRs are accessed through the load address and data busses they same way JSR call and return addresses are. In particular, the JSR call and HW_MTPR path is connected to the LD_ADDR[2]<63:0> bus. This connection is made at the Ebox end of the bus, adjacent to the drivers. These instructions must issue on the weak load port. The JSR return and HW_MFPR path is connected to the Ebox IPR read bus. These operations must be slotted as 5-cycle multimedia operations on the weak load port. 9.21.1 Implicitly Written IPRs There are two groups of Mbox IPRs that are written implicitly, that is, by other than an HW_MTPR instruction. Implicitly written IPRs require special and careful handling, as documented by the Qbox. The first group consists of the DC_STAT IPR, which is written when any of several asynchronous events happen on a DCache fill. DC_STAT is an implicitly event-written IPR, in that we do not attempt to associate its writing with any particular instruction. Also, events set bits in DC_STAT which are not cleared even if the instruction (indirectly) leading to that event turns out to have been killed. The second, and more complex, group consists of the VA, VA_FORM and MM_STAT IPRs, which, for simplicity, will be referred to as the MM_STAT IPR set, as all three share the same update criteria. The MM_STAT IPRs are implicitly written by any instruction causing a Dstream fault leading to any DTB_MISS or DFAULT PALcode entry point. 
Furthermore, the set of MM_STAT IPRs read by a particular entry to PALcode must correspond to the instruction that generated the disruption leading to that PALcode entry. This means that if an older disruption overshadows a younger one, the older disruption must overwrite the MM_STAT IPRs. Note that poisoned instructions must never generate faults. This may be expressed, in a TPU-centric view, as assigning an INum to each (TPU-specific) MM_STAT IPR set, and only allowing a faulting instruction to update the set if all of the following conditions hold: • The instruction is older than all faulting instructions for the same TPU issued in the same cycle in other memory pipes. Compaq Confidential 5 January 2001 -· Subject To Change Memory Instruction Execution Unit - the Mbox 9-49 Internal Processor Registers • The instruction is older than any instruction in this TPU that decided in the previous cycle that it will write MM_STAT. • The instruction is older than the INum, if any, currently associated with this TPU's MM_STAT IPR set. If this is the case, the instruction's particulars and INum are written to the MM_STAT IPR set. The MM_STAT IPR set INum is cleared whenever that INum or an older INum is killed. A special case condition is that the LD _ VPTE instruction, which is only executed within the DTB single miss flow, writes only its INum, and not its particulars, to the MM_STAT IPR set. The reason is that the LD _VPIE disruption handler deals with correctly fixing up the underlying memory operation that caused the DTB single miss immediately preceding the LD_VPIE, rather than fixing up the LD_VPIE itself, which was merely the first attempt to deal with the DTB miss, and not anything interesting of itself. Thus, the MM_STAT particulars from the original disruption must be preserved. The INum of the LD _ VPIE must be written to the MM_STAT set to ensure that we do not speculate all the way through a DTB single miss and into another Dstream fault while a LD_VPTE double miss or trap is pending. The particulars of the original singlemiss entry will still be preserved at the time the LD_VPIE traps, as all subsequent memory operations are dependent on the DTB writer block issue, which is in tum data dependent on the LD_VPTE. The Mbox must detect LD_VPIE faults before this protective window expires to avoid having younger memory operations overwrite the MM_STAT particulars before the LD_ VPTE disruption has a chance to write its INum to the MM_STAT set. 9-50 Compaq Confidential Memory Instruction Execution Unit - the Mbox 5 Jc1nuary 2001 -~ Subject To Cl1ange 10 Internal Ring Bus This chapter is to connect the Cbox, Rbox, and Zbox chapters. Compaq Confidential 5 January 2001 -~Subject To Change Internal Ring Bus 10-1 Compaq Confidential 10-2 Internal Ring Bus 5 J<1nu<1ry 2001 ··· Subject To Change Cbox Overview 11 Second-Level Cache and Controller {Cbox) The Sbox and the Cbox contain the onchip 3 MB six-way set-associative second-level cache (the Sbox) and the control of this cache (the Cbox). Additionally, the Cbox, in conjunction with the Mbox and Zbox, implements the cache coherent, distributedshared memory system. 11.1 Cbox Overview The Cbox is divided into two logical and physical partitions: • CS - the "Scache controller" partition. The CS manages the Scache pipeline. • CF - the "fill datapath" partition. CF handles data flowing to and from the Cbox The following sections comprise the CS partition. 
Name (Mnemonic): Description

Internal probe queue (IPQ): Sixty-four-entry FIFO for holding MAF indexes that require internal probe processing.
Miss address file (MAF): Holds requests from the local processor until satisfied, and holds probes and forwards from remote processors.
Probe queue (PRQ): Thirty-two-entry FIFO for holding probes and non-block responses that are waiting for access to the Scache pipeline.
Response queue (RSQ): FIFO to hold VAF indices that have not yet been delivered to the system.
Retry queue (RTQ): Sixty-four-entry FIFO for holding MAF indexes of Scache transactions that must execute through the Scache pipe again due to an error or bank conflict.
System interface (SYS): Connects the Cbox to the Rbox and Zbox via the 21464's internal ring bus.
System request queue (SRQ): Stores MAF indices requiring a system request. Tracks the number of outstanding system requests for a given Scache index.
Test structures (TIQ)
Victim address file (VAF): Holds responses being sent back to the system, either as displacement victims or in response to system probes.
Config and status registers (CSR): Holds the Cbox CSRs.

The following sections comprise the CF partition.

Name (Mnemonic): Description

Data buffer muxes (CF_DBM)
Fill data buffer (CF_FDB): Buffer that holds fill data from the Zbox, destined for the Rbox or Cbox.
Fill data logic, ECC (CF_FBE)
Rambus input (CF_RBI)
Rambus output (CF_RBO)
Victim data buffer (CF_VDB): Buffer for victim data from the Cbox, destined for the Zbox or Rbox.

When the processor (Ibox or Mbox) needs access to a block of data that it does not have, it makes a request to the Cbox. If there is a copy of the requested block in the second-level cache (Scache), the Cbox returns that copy. Otherwise, the Cbox makes a request to the system to get a copy of the requested block.

In multiprocessor systems, to keep memory coherent, the Mbox must be sure that this processor is the only processor with a copy of a block (exclusive) before it writes to it. If the block the Mbox wants to write resides in the Dcache or the Scache but is marked as shared, then there may be other processors with copies of that same block, and the Cbox must make a request to the system to obtain an exclusive copy.

When a store retires, it is first written into the Mbox merge buffer. The merge buffer merges stores to the same block before requesting an exclusive copy of the cache block from the Dcache, Scache, or system. Once an exclusive copy has been obtained, the store data in the merge buffer merges with fill data returned from the Dcache, Scache, or system. The complete cache block is then written through to the Dcache and Scache simultaneously.

The system responds to requests for cache blocks by returning a copy of the block, with state indicating whether the requester has a shared or exclusive copy and whether the block is dirty. The returned block is filled into the Scache and also sent back to the requester (Ibox or Mbox). Filling a block can cause an existing Scache block to be displaced and sent back to the system as a victim.

To process a request for a copy of a cache block, the system must be able to determine where copies exist. A directory is used to hold this information (see ?). The system ensures that the requester gets the most up-to-date copy of the block, and if an exclusive copy is requested, the system initiates the invalidation of copies residing in other processors. To do this, the system sends probes to processors holding copies of the block.
The probes ask the processor to forward its copy to the requester and to mark its copy as shared and/or invalidate it.

In summary, the major Cbox features are:

• Up to six outstanding requests to the same Scache index. (By comparison, the 21364 can have one outstanding request to a given Scache index.)
• 64-entry MAF
• 64-entry VAF
• Fills an entire cache line (512 bits) to the Mbox and Ibox per cycle. (By comparison, the 21364 fills 128 bits per cycle.)

11.2 Sbox Overview

The Scache has the following features:

• 3 MB, six-way set associative
• Physically indexed, physically tagged
• 16 banks, with one read/write port per bank
• Quadword (64-bit) writeable; single-bit ECC correction on tags and data; double-bit ECC detection on tags and data
• LRU replacement

11.3 Scache Control - the CS Partition

The following sections describe the Scache control logic - the CS partition.

Table 11-1 Cbox Pipeline Stages. For each transaction type and its source (MAF misses, retries from the RTQ, internal probes from the IPQ, forwards and S2D* responses from the PRQ, LRUevict/BLK* fills, EarlyWarn, and SharedInval), the table lists the action taken in each Scache pipeline stage (Arb, MAF, and C1 through C16), covering tag launch, tag read, tag compare, tag ECC detect and correct, set select, data read, tag write, the fill bus transfer, the address/tag/command sent to the Mbox, and the VAF/VDB/RSQ writes.

Notes to Table 11-1:
1. Block responses and I2D responses require two Scache transactions. The first transaction is the LRUEvict, which extracts the victim data, and the second transaction is the fill. These happen atomically. The timing shown assumes that the extracted LRU victim is (a) coherent, and (b) does not need the MGB to write through.
2. The only retry action that updates the MAF state is STODFAIL. A retrying STODFAIL can set MAF.need_sys_req.
3. If the retry is due to a data ECC error, the Scache tag has already been updated and the retry must not change the tag state again.
4. The Cbox sends the cache block to the Mbox if the Scache is to keep a copy of the cache block Shared after the probe.
5. EarlyWarn does not set MAF.sc_inflight.

11.3.2 Miss Address File - the MAF

11.3.2.1 Overview

The miss address file, or MAF, is the major control structure in the Cbox and is responsible for tracking outstanding miss requests to the system. The MAF is a 64-entry associative memory, with control logic for managing the Scache pipeline. Note that the number of MAF entries (64) corresponds to the number of misses (where a miss may be a d-stream cache-block request, an i-stream cache-block request, an ownership-only request, etc.) that a processor may have in-flight at any one time.
Allowing several simultaneous outstanding misses is critical to keeping a wide-issue superscalar machine like the 21464 fed. By way of comparison, the previous-generation Alpha processor (the 21364) has 16 MAF entries. 11.3.2.2 Principle of Operation <block diagram needed here> MAF operation is initiated by 3 major classes of operations: 1. Requests from the core 2. Fills/responses from the system 3. Probes from other processors 11.3.2.2.1 Requests from the Core The Mbox PreMAF and Mbox MGB may deliver a total of 1 request per cycle to the MAF. The Mbox delivers the physical address and other request state to the MAF. The MAF firstCAMs the PA against existing MAF entries to check if there already is a MAF entry with thisphysical address. If there is, the MAF attempts to merge the new request into the existing MAF entry. If there is no CAM match, a new MAF entry is created. Note that along with merging or creating a new MAF entry, the request is launched into the Scache pipeline. 11.3.2.2.2 Fills/Responses from the System If a miss request from the core does not find the data (or the required cache state) in the Scache, a request is launched to the system. The system will respond with the data and the new cache state. This response from the system carries with it the MAF index number of the corresponding request. When the fill arrives from the system, the MAF index is used to access the correctMAF entry, extract the relevent information, and perform the fill. 11.3.2.2.3 Probes From Other Processors Probes (Forwards, lnvals) Probes, like Mbox miss requests, may or may not find that the MAF already has an entry with the probe physical address. Therefore, an incoming probe first CAMs the MAF. If a MAF entry already exists, the probe uses this MAF entry, and launchs into the Scache pipe. If a MAF entry does not exist for this physical address, one is created, and the probe launches into the Scache pipe. 11.3.2.3 MAF Pipeline Timing Diagram and Pipeline Overview Table 11-2 shows the MAF pipeline timing diagram. Table 11-2 MAF Pipeline Timing Diagram CZ co C1 C2 C3 PreArb Arb RD/CAM PA Sc ache pipe CTRL Write new MAFstate C4 cs C6 C7 Fill C1RL Compaq Confidential 5 January 2001 -·Subject To Change Second-Level Cache and Controller (Cbox) 11-7 Scache Control - the CS Partition 11.3.2.3.1 CZ, CO: MAF Arbitration Logic The MAF arbitration logic selects one transaction each cycle to launch into the Scache pipe. This arbitration is split across two stages: CZ and CO. CZ is called the PreArb stage and CO is the Arb stage. Transactions arbitrate in the PreArb stage with the following priority: 1. Retries from RTQ 2. Internal probes from IPQ 3. Probes/non-block responses from PRQ 4. Mgb requests from Mbox The winner from this stage is latched and sent to the CO arbitration stage, where the Cbox arbitrates with the following priority: 1. System fill from SYS 2. PreArb winner 3. Mbox PreMAF request from Mbox Note that the logic is designed to give the Mbox PreMAF requests (such as the L1 Load Miss) as low a latency as possible. 11.3.2.3.2 C1: MAF Bank Conflict Detection Logic I MAF CAM I MAF RD The Scache tag array (STAG) and Scache data array are large memory structures. The pipeline diagram in Table 11-1 shows that the arrays are read and written in different cycles, thereby requiring multiple ports. Multiporting an array as large as the tag or data array would result in a prohibitively large structure. To avoid this constraint, the arrays are banked into individual, smaller arrays. 
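Referring back to the request flow of Section 11.3.2.2.1, the following is a minimal C sketch of the CAM-then-merge-or-allocate decision: CAM the MAF on the block physical address, merge on a match, otherwise allocate a new entry. The structure layout, names, and the reject conditions shown are illustrative assumptions; the complete allocation, merge, and reject rules appear later in this section.

#include <stdint.h>
#include <stdbool.h>

#define MAF_ENTRIES 64

typedef struct {
    bool     valid;
    uint64_t pa;            /* PA<47:6>, the block address                 */
    bool     sc_inflight;   /* transaction already in the Scache pipeline  */
    /* ... remaining request and coherence state ...                       */
} maf_entry_t;

typedef enum { MAF_REJECT, MAF_MERGE, MAF_ALLOCATE } maf_action_t;

/* 'index' is written only for MAF_MERGE and MAF_ALLOCATE. */
static maf_action_t maf_lookup(const maf_entry_t maf[MAF_ENTRIES],
                               uint64_t req_pa, int *index)
{
    int free_slot = -1;

    for (int i = 0; i < MAF_ENTRIES; i++) {
        if (maf[i].valid && maf[i].pa == req_pa) {
            *index = i;
            /* A matching entry that is already in flight in the Scache
             * pipe rejects (NACKs) the new request; otherwise merge.      */
            return maf[i].sc_inflight ? MAF_REJECT : MAF_MERGE;
        }
        if (!maf[i].valid && free_slot < 0)
            free_slot = i;
    }
    if (free_slot < 0)
        return MAF_REJECT;       /* MAF full                               */
    *index = free_slot;
    return MAF_ALLOCATE;         /* new entry; request launches into pipe  */
}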
Specifically, there are 16 banks for the Scache tags and 16 banks for the Scache data, with each bank having a single read/write port. The low four bits of the physical address (PA<9:6>) give the bank number. Although single transactions are launched into the Scache pipe each cycle, bank conflicts can still arise, and are managed with the following logic. Consider the Scache tag array first (STAG). The STAGs are read in cycle C3 of the Cbox pipe and are written in cycle C5. Recall that the STAG array has only a single merged read/write port per bank. Table 11-3 illustrates how a bank conflict can occur: Table 11-3 Scache Tag Array Bank Conflicts Cycle Number ~ Transaction .J, 0 Sharedlnval RD tag MissReql 2 1 WR tag RD tag MissReq2 RD tag In cycle 3, the Sharedlnval transaction is trying to write the STAGs at the same time the MissReq2 transaction is trying to read them. If the Sharedlnval and the MissReq2 have the same PA<9:6>, then we have a bank conflict. This situation also applies to the Scache data array. 11-8 Compaq Confidential Second-Level Cache and Controller (Cbox) 5 Januc1ry 2001 m Subject To Change Scache Control - the CS Partition The bank conflict detection logic in the CS_MAF is responsible for detecting situations like the one above, and rejecting the later arriving transaction to prevent the bank conflict. (Exceptions to this are described in Section 11.3.2.3.3.) This is accomplished as follows: For every transaction that enters the pipe, we note when it wants to access the STAG (C3, C5) and/or SDATA (CS, C15) arrays. We check this against what is already in the pipe. For instance, if the particular transaction wants to read the STAGs (C3), we check if we currently have a transaction in the C3 stage that is going to write the STAGs in C5. If so, the incoming transaction is either NACKed (Mbox) or placed in the RTQ (RTQ, IPQ, PRQ). { SHOULD A TABLE BE PLACED HERE OF EACH TRANSACTION TYPE ALONG WITH WHEN THEY READ/WRITE TAGS/DATA ? } 11.3.2.3.3 Exceptions Fills (and LRUevicts) are special transactions. They are never stalled, rejected, or retried (but see section on hiccup). If the BCL detects a bank conflict for a fill or LRUevict in Cl, the transaction which IS ALREADY IN THE PIPE AND IS CAUSING THE BANK CONFLICT is "preempted". The unlucky transaction is placed in the RTQ for later execution, and the fill procedes merrily along. 11.3.2.3.4 C1: MAF CAM I MAF RD In the C 1 pipe stage of the Cbox, we either CAM the MAF with the incoming transaction's physical address or we read the MAF entry specified by the incoming transaction. The state at the corresponding CAMed or read entry is read out and is delivered to the C2 stage of the pipe. There is a single CAM port on the PA<47:6> stored in the MAF, which is shared between: • Mbox requests Incoming Mbox requests CAM against the entries in the MAF. If there is a MAF entry with a matching PA, we attempt to merge the incoming Mbox transaction with the matching MAF entry. • Probes (such as Forwards and Sharedlnvals) These CAM the MAF to discover if there is an outstanding request to the same address. • Victims Victims on their way to the VAF CAM the MAF to determine if they are coherent. 11.3.2.3.5 c2: MAF logic The C2 stage of the MAF is the most complex. Here, based on the incoming transaction and the state read from the corresponding MAF entry, if any, we compute the "next state" of this MAF entry, and required outputs for subsequent stages. 
The logic involved is too complex to be discussed in detail here; it will be covered later in this chapter when we discuss the flows for each transaction type. Compaq Confidential 5 January 2001 ·-Subject To Change Second-Level Cache and Controller (Cbox) 11-9 Scache Control - the CS Partition 11.3.2.3.6 C3-C6: Scache Tag Access In these cycles, the physical address is delivered to the Sbox and the tag state is looked up. These cycles technically belong to the Sbox. No MAF logic runs here. 11.3.2.3.7 C7: Fill Pipe Control In cycle C7, the result of the STAG lookup is latched in the MAF, and, based on the tag state and the transaction state, various commands are delivered to the Mbox and to the VAF. Again, the details of this logic will be discussed later. 11.3.2.4 Contents of MAF Entries Table 11-4 shows the contents of each MAF entry. Table 11-4 Contents of Each MAF Entry Contents Description #Bits Ports I. Physical address bits miss_pa<47 :6> 42 1RD/WR,1RD,1 CAM I/O space:PA<47> =1. miss_tpu<3 :0> 4 1 WR,2 RD Thread processing unit, merged miss_ifill_ptr<4: O> 5 1WR,1 RD I fill buffer index. mgw_closed 1 lCAM, lWR I/O merge window is closed. io_mask<7:0> 0 1/0 byte mask. io_size< 1:0> 0 Stored in a separate table indexed by the TPU QW_addr 0 II. I/O request fields I/O size. RdIO orWrIO Ill. Request type i_fill_ena 1 2RD,2WR I-stream fetches (I-demand/1-prefetch). dc_req_type<4:0> 5 2RD, lWR Request type ( LD, ST, STx_C, Prefetch Scache, prefetch mod) valid 1 Multi-port MAF entry is valid. sc_inflight 1 2Rd, 2WR Has inflight miss/fill/probe in the Scache pipeline. sys_cmd<2:0> 3 3RD,1 .... 2 WR System request command. need_sys_rqst 1 2RD, 1 WR Waiting for the system request launch. sys_inflight 1 2RD, 1 WR Has an inflight system request. IV. Request state IV. Coherence state bits Compaq Confidential 11-10 Second-Level Cache and Controller (Cbox) 5 Jc1m.1c1ry 2001 ... Subject To Change Scache Control - the CS Partition Table 11-4 Contents of Each MAF Entry Contents #Bits Ports coherent 1 cohr_cnt<5:0> 5 timer_on 1 has_int_probe 1 VA<15:14> 2 Inval_seen 1 victimize 1 vic2shr 1 Invalidate 1 evict_next 1 MB _retired<3 :0> 4 Total -83 Description lRD, lWR Coherence state Notes: Physical Address • The physical address field PA<47:6> of the MAF needs: CAM ports (1 ): Allocation and merging of Miss from Mbox. Probe processing CAM. Victim processing CAM. Since probes and Scache victims are infrequent, we share 1 CAM port for all three functionality with the priority: Victim CAM. Scache pipe (Probe CAM and Miss CAM). • Write port (1 ): MAF allocation for new Misses. • Read ports (2): Scache pipe (Blk*, Retries, ShrToDirty*, *Req, and *Forward). System request. • Write and read are mutually exclusive and we share a single port for Read & Write . • • The current proposal is to have one RD/WR port and one RD-only port. A MAF entry may be originated from more than one thread due to merging. When we merge MissReqs from Mbox, we must merge the thread processing unit of the first requester. Mbox does not merge retires and I/O requests across threads. Cbox does merge MissReqs across thread and preventing Cbox from mergeing I/O request across threads must be done in Mbox (i.e. close the merge window first). When a Shr2Dirty[STC]Req fails: Compaq Confidential 5 January 2001 -·Subject To Change Second-Level Cache and Controller (Cbox) 11-11 Scache Control - the CS Partition • • • If the MAF entry has a I-miss or LD, then send a ReadReq. If the MAF entry has a ST, then send a ReadModReq . 
If the MAF entry has only Stx_C, then send no system request. Since we can have only one 1/0 request bidding for the system request pipe at any given time, we will have a small structure to store 1/0 specific fields. The MAF.sys_cmd<2:0> is set if a system request is needed after looking up the Scache tag. • If a new Miss gets merged before the system request is sent out (i.e . MAF.need_sys_rqst = 1), then we change the system command. • If a Sharedlnval hits a ShrToDirty, we do not need to change the system command to a ReadMod since the ShrToDirty gets forwarded. • If a Sharedlnval hits a ShrToDirtySTC, we may reset the Stx_C bit . Coherent= (have_max_coh = 1 & timer_running = 0). 11.3.2.5 MAF Allocation/Merge/Retry • Overview and working assumptions The MAF accepts one Miss request per cycle from the Pre-MAF (Pre-MAF). There is ONE MAF entry for a cache block. There is one MAF entry that has the merging window open for an 1/0 block. No-cache pre-fetches from Mbox will follow the same path as for the regular loads miss but the pre-fetch block doesn't get written to the Scache. We must make sure the block is not a ExclCln or Dirty. Miss Request Inputs 1/0 requests never ask for the write permission. Ibox never asks for the write permission of a cache block. • Upon receiving a new Miss request from the PMF, the MAF determines whether to Reject the Miss request. Allocate a MAF entry for the Miss request. Merge the Miss request. MAF Full Scache Pipe Available MAFCAM Available 1 x x x 0 x x x 0 Address Match sc_inflight Action x x x Reject 0 x x x 1 1 1 1 0 1 1 1 0 Merge 0 1 1 0 x Allocate Compaq Confidential 11-12 Second-Level Cache and Controller (Cbox) 5 Jc1m.1c1ry 2001 -- Subject To Change Scache Control - the CS Partition Eviction requests and ChgToShared requests from Mbox need a VAFNDB slot. Can we just set the MAE victimize* bit for the request rather than see if VAF slot is avaialble before deciding whether to accept the request. Then take the same flow as MAP.victimize path? • Reject: The MAF asks the PMF to retry the Miss request If Cbox can't service the Miss request. Since Mbox filters multiple Miss requests to the same cache block, the merits of merging Miss request when the MAF is full seems small. Hence the MAF will rejects new Miss requests when the MAF is full. The Pre-MAF continuously retry the failing requests which minimizes possible thread starvation for access to MAF entries. • Allocate: The MAF allocates a new MAF entry for a Miss request The MAF returns the ACK along with the MAF index to Mbox so that Mbox can pull a Qbox bubble. The new Miss request enters the Scache pipe except for 1/0 requests. 1/0 requests enters the Scache pipe (or goes directly to the CRQ?) only if the merging window is closed (i.e. m%io_ok_to_send = 1). • Merge (1/0 space) 1/0 requests from the same thread gets merged in Mbox but Mbox uses the MAF CAM port for the PA compare. 1/0 requests to the same block get merged in the MAF if the merge window is open. The MAF may have more than one MAF entry for the same 1/0 block (e.g. 1/0 byte reads) but only one has its merge window open. 1/0 request whose merge window is closed (m%io_ok_to_send = 1) enters the Scache pipe (or system request pipe?). The merging windows for 1/0 requests are managed by Mbox. We may have only one 1/0 request per thread waiting for the system launch. The first four VDB entries are reserved for WrIO data block. 
Mbox does not send multiple 1/0 requests to the same block from different threads unless the merge window is closed to prevent 1/0 merging across threads. All 1/0 merging rules conforming the SRM are handled in Mbox. MAF States sc_inflight need_sys_rqst sys_inflight Action Notes 1 0 0 Reject Has in-flight miss/probe/fill in the Scache. 0 0 0 Merge Request enters the Scache. No outstanding request. 0 1 0 Bidding for the system request pipe. 0 0 1 Merge Request does not enter the Scache. 5 January 2001 -·Subject To Change nflight request in the system. Compaq Confidentia I Second-Level Cache and Controller (Cbox) 11-13 Scache Control - the CS Partition MAF States 1 0 1 Merge 1 1 1 0 Merge 1] 0 1 1 Must not happen 1 1 1 Must not happen This can happen since we delay clearing of the sc_inflight bit to give Mbox sufficient time to consume fill blocks. But to minimize the system fill latency we may send a system request if necessary before resetting the sc _inflight bit. • Merge (Memory space) Mbox filters the most of multiple Miss requests to the same cache block by CAMing the fill buffer. Filtering of multiple Misses to a cache block is to conserve the Scache bandwidth. The MAF merges multiple Miss requests to a cache block if they didn't get filtered in Mbox. For merged Miss requests, the MAF Returns the merged MAF index to the PMF. Sets the appropriate control flags for the merged MAF entry. If a non-no-cache prefetch request merges onto a no-cache prefetch, we clear the bit. - We merge STx_C across different threads but only one thread will succeed depending on the order of the retry in Mbox. Since the I-fill buffer index may get reassigned to a new miss in the Ibox, we must not fill the Ibox more than once with the same cache block. Resetting the miss_icache bit after the block has been delivered to the Ibox prevents the filling the same block twice. After Scache tag return. After System fill. 11.3.2.6 MAF Deallocation • 1/0 Read can be deallocated when the requested block is returned or NXMResp is received. • 1/0 Write can be deallocated when the WrioAck is received. When we receive the WrloAck, we also need to notify the Mbox so Mbox can retire the MB. pa<47> *_inflight_in_sc need_sys_rqst inflight_in_sys coherent 1 0 0 0 0 0 0 0 x victimize vic2shr Notes x x 0 0 IO request Compaq Confidential 11-14 Second-Level Cache and Controller (Cbox) 5 January 2001 -·Subject To Cf1ange Scache Control - the CS Partition • If we receive NXMResp, we save the PA & ... and Mbox will trap. We can de-allocated the MAF index once we notify the Mbox and save necessary information into the error status register. • If the MAF entry has a victim waiting to become coherent in the victim buffer, we must clear the blockage when the cache block becomes coherent even before we de-allocate the MAF entry. • If MAP.victimize, MAF.vct2shr, MAF.cvt2inval, MAF.cvt_inv_if_shr bit is set, then we need to perform the Scache tag update before we de-allocate the MAF entry. 11.3.3 RSQ 11.3.4 Internal Probe Queue-the IPQ The internal probe queue or IPQ is a 64-entry FIFO for holding "internal probes". An internal probe is a special Scache transaction that either invalidates or victimizes an Scache block. Principle of Operations Internal probes can be created by the following three transactions: 1. Cache manipulation instructions from the Mbox (CCB, ECB instructions). 2. Block response from system arrives and one of the following is set: MAF. victimize MAF. vict2shared MAP.invalidate 3. 
Non-block response from system arrives and one of the following is set: MAF. victimize MAF. vict2shared MAF.inval_seen Other than the cache manipulation case, internal probes arise because the network is not ordered. The following sequence illustrates the generation of an internal probe: 1. Processor A sends an ownership request to the home node. 2. The home responds by sending an exclusive copy of the block to processor A. 3. Before the exclusive copy arrives at processor A, another processor requests the same block, and the home sends a FWD message to processor A. 4. The FWD message arrives at processor A before the exclusive copy arrives. Processor A records the fact that another processor has requested ownership of this block and sets the MAP.victimize bit. When the exclusive copy finally arrives, processor A cannot simply throw the block out, because that would cause a livelock. Instead, processor A fills the block to the Mbox and the core, ensuring forward progress, and loads the IPQ with a Victim command. Thus, we satisfy the forward progress requirement, as well as the requirement that we reliquish ownership of the block. 5 January 2001 -~Subject To Change Compaq Confidential Second-Level Cache and Controller (Cbox) 11-15 Scache Control - the CS Partition If an internal probe wins arbitration but is rejected because there is another transaction in the scache pipe to the same address, the internal probe is placed at the back of the FIFO queue. If an internal probe wins arbitration but is rejected due to a bank conflict, the internal probe is placed in the retry queue. 11.3.5 Probe Queue - the PRQ The Probe Queue (PRQ) is a 32-entry FIFO, similar in design to the RTQ and IPQ. The PRQ holds probes and non-block responses from the system before being processed in the Scache pipe. The PRQ accepts one probe or one non-block response from the SYS section every cycle. Additionally, a NACKed probe or non-block response from the MAF may also need to be written into the PRQ; therefore, the PRQ has two write ports. 11.3.5.0.1 Principle of Operation The 21464 probe queue, unlike the 21363 design, holds two different classes of messages from the system: probes and non-block responses. The probes are: • • • • • • FetchFwd ReadShrFwd ReadFwd ReadModFwd InvalToDirtyFwd Sharedlnval The non-block responses are: • • • • • • • ShrToDirtySuccessCnt ShrToDirtyFail ShrToDirtyProbCnt InvalAck WrIOACK NXMResponse ERRResponse The other key difference between the 21464 and 21364 designs is that in the 21364, the PRQ is strictly ordered. Requests must be processed in FIFO order. In the 21464, no such restriction applies. Each cycle, the head of the FIFO is read out and delivered to the MAE Should this transaction be NACKed by the MAF (in C2), the transaction is placed at the tail of the FIFO and will be reissued again when it reaches the head. To prevent a transaction from being continuously NACKed, we record, at each PRQ entry, whether this probe has been rejected before. If a probe that has been rejected before is rejected again, a counter is incremented. When this counter saturates, a signal to the MAF arbitration logic asserts, giving the PRQ priority. Because we place responses and forwards in the same queue, and because forwards generate responses, we have the possibility for deadlock. 
If the PRQ were full of forwards and the response channel were full of responses, we would not be able to take a forward out of the PRQ, because processing the forward requires a response buffer; and we could not sink any responses, because the PRQ is full. To alleviate this, we reserve one PRQ entry for non-block responses. This ensures forward progress.

The PRQ receives responses and probes from the Rbox and Zbox. Should the probe queue become full, it must prevent the Rbox and Zbox from sending it more transactions. To accomplish this, the probe queue asserts backpressure signals to the Rbox and Zbox. It must assert these backpressure signals early enough that all transactions already in flight to the PRQ can still be sunk. Use the following calculation to determine when to assert the backpressure signals:

• 4 entries in the pipe that may be NACKed
• 2 probes on the way to the PRQ from SYS
• 6 probes in the ring, or that will be injected onto the ring before the backpressure signal arrives

Thirty-two minus twelve equals twenty; therefore:

• Throttle probes when 19 entries are in use.
• Throttle non-block responses when 20 entries are in use.

A sketch of this throttling logic appears below, after the VAF overview.

11.3.5.1 Probe Queue (PRQ) Contents per Entry

Table 11-5 PRQ Contents for Each Entry

Contents              #Bits  Ports      Description
valid                 1
probe_paddr<47:6>     42     1WR, 1RD
probe_cmd<4:0>        5      2WR, 1RD
transaction_id<16:0>  17     2WR, 1RD   Processor ID + Requester's MAF index
TOTAL                 ~65

11.3.6 Victim Address File - the VAF

The Scache retains most blocks until the space (i.e., the Scache set) they occupy is needed for another block. If a block is not held exclusively in the Scache at the time it is evicted, it is simply overwritten. But if a block is in an exclusive state, the directory must be notified that this cache is releasing exclusive access and, if the block is dirty, it must be written back to memory. Rather than delaying the fill that overwrites this block, the Scache moves the old contents to the victim data buffer (VDB), where the block waits for coherence before being sent to the home node. The victim address file (VAF) is a 64-entry buffer that stores addresses of Scache victims or probe responses to be sent to the memory system.

In the 64-entry VAF, four entries (one for each TPU) are reserved for WrIO requests from the Mbox, and four entries are reserved for probe responses.
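The PRQ throttling and anti-starvation behavior described in Section 11.3.5 can be summarized with a small software sketch. It is illustrative only: the names, the single shared NACK counter, and its saturation value are assumptions of the sketch, not values taken from the design.

```c
/*
 * Sketch of the PRQ occupancy throttling and NACK-starvation logic of
 * Section 11.3.5.  All names and the saturation limit are illustrative.
 */
#include <stdbool.h>
#include <stdint.h>

#define PRQ_ENTRIES     32
#define PRQ_PROBE_LIMIT 19   /* throttle probes at 19 entries in use       */
#define PRQ_RESP_LIMIT  20   /* throttle non-block responses at 20 in use  */
#define NACK_SAT        3    /* assumed saturation point of the counter    */

typedef struct {
    unsigned entries_in_use;  /* current PRQ occupancy                     */
    uint8_t  nack_count;      /* counts repeated rejections of the head    */
} prq_state_t;

/* Backpressure toward the Rbox/Zbox, asserted early enough that every
 * transaction already in flight to the PRQ can still be sunk. */
static inline bool prq_throttle_probes(const prq_state_t *p)
{
    return p->entries_in_use >= PRQ_PROBE_LIMIT;
}

static inline bool prq_throttle_responses(const prq_state_t *p)
{
    return p->entries_in_use >= PRQ_RESP_LIMIT;
}

/* Called when the transaction at the head of the PRQ is NACKed by the
 * MAF.  Returns true when the saturated counter should force the MAF
 * arbitration logic to give the PRQ priority. */
static inline bool prq_record_nack(prq_state_t *p)
{
    if (p->nack_count < NACK_SAT)
        p->nack_count++;
    return p->nack_count == NACK_SAT;
}
```

The two thresholds differ by one because a single PRQ entry is reserved for non-block responses.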
Table 11-6 VAF Commands Class Encoding Network Message Destination Data Directory State Non-block Responses CS_VAF_CMD_SPCL_INV_ACK InvalAck Requester No RemoteExcl cs_VAF_CMD_INV_ACK InvalAck Requester No RemoteExcl CS_VAF_CMD_INV_TO_DIRTY_RESP InvalToDirtyRespCnt(O) Requester No RemoteExcl CS_VAF_CMD_VICTIM_CLEAN VictimClean Home Diectory No InMemory CS_VAF_CMD_VICTIM_CLEAN_TO_SHR VictimCleanToShared Home Diectory No Shared cs_VAF_CMD_FORWARD_ACK_EXCL ForwardAck:Excl Home Diectory No RemoteExcl cs_VAF_CMD_FORWARD_ACK_SHR ForwardAckShared Home Diectory No Shared CS_VAF_CMD_FORWARD_MISS ForwardMiss Home Diectory No RemoteExcl/Shared CS_VAF_CMD_SHR_TO_DIRTY _COMPL SharedfoDirtyComplete Home Diectory No RemoteExcl cs_VAF_CMD_SHR_TO_DIRTY _RELEAS SharedToDirtyRelease Home Diectory No InMemory/Shared CS_VAF_CMD_BLK_SHR BlockShared Requester No Shared CS_VAF_CMD_BLK_INV Blocklnvalid Requester Yes Shared cs_VAF_CMD_BLK_DIRTY BlockDirty Requester Yes RemoteExcl cs_VAF_CMD_BLK_EXCL Block:ExclCnt(O) Requester Yes RemoteExcl InMemory Release Responses Block Responses Victim Block Responses cs_VAF_CMD_VICTIM Victim HomeDirectory Yes cs_VAF_CMD_VICTIM_TO_SHR VictimToShared HomeDirectory Yes Shared cs_VAF_CMD_VICTIM_ACK_SHR VictimAckShared Home Directory Yes Shared cs_VAF_CMD_VICTIM_ACK_EXCL VictimAckExcl HomeDirectory Yes RemoteExcl 11.3.6.1 Victim Address File (VAF) Contents per Entry Table 11-7 shows the VAF contents for each entry. Table 11-7 VAF Contents For Each Entry Contents #Bits Ports Description valid 1 The VAF entry has the valid response. allocated 1 The VAF entry is speculatively allocated and not available for allocation. victim_pa<46:6> 42 lWR, lRD victim_cmd0<4: 0> 5 lWR, lRD victim_cmd1<4:0> 5 lWR, lRD maf_idx<5:0> 6 lWR, lRD, lCAM Requester's MAF index Compaq Confidential 11-18 Second-Level Cache and Controller (Cbox) 5 Jc1m.1c1ry 2001 - Subject To Change ~ Scache Control - the CS Partition Table 11-7 VAF Contents For Each Entry Contents #Bits Ports Description req_node<9:0> 10 lWR, lRD Requester's node ID has_full_blk 1 lWR, lRD Has the full victim block (i.e. ECC corrected and merge buffer data has been extracted). TOTAL -81 11.3.6.2 Principle of Operation The following table outlines the main victim flow for each Cbox pipe stage. Table 11-8 Main Victim Flow for Each Cbox Pipeline Stage Stage Main Victim Flow co Cl Speculatively allocate a VAF entry based on the transaction that won the arbitration. The following transactions allocate a VAF entry in Cl: • LRUevict • BlkExclusiveProbable • futernal probes • Forwards (Sharedlnvals, etc) • ShrToDirtySuccess, ShrToDirtyProbable C2 Deallocate VAF entry if transaction is rejected - if the incoming transaction got NACKed. C3 C4 C5 C6 MAP logic computes initial VAF command based on tag from the Stag array and transaction type. This logic is covered in a later section on Cbox Flows. C7 Send initial victim command from MAP to VAF. C8 Write victim physical address to VAF, deallocate VAF entry if no victim. C9 ClO CAM MAP with victim PA. If we hit a MAP entry with the victim PA, and the MAP state indicates that the victim PA is not coherent, then we must wait for coherence before sending the victim. Cll Write victim data read from the scache into the victim data buffer (VDB). Cl2 Receive probe_hit from Mbox and compute final victim command. This signal indicates if the Mbox MGB has modified data for the victim cache block. If this is the case, we must allow the Mbox MGB to "write-thru" to the VDB before sending the victim. 
C13 Write VAF cmd, write VAF index to RSQ, if the victim is ready to be sent. The following two conditions can prevent the victim from being sent: • Victim is not coherent • Mbox MGB has modified data for this cache block, and has not yet written thru to the VDB. Compaq Confidential 5 Jam.J~1ry 2001 --·Subject To Change Second-Level Cache and Controller (Cbox) 11-19 Scache Control - the CS Partition Table 11-8 Main Victim Flow for Each Cbox Pipeline Stage Stage Main Victim Flow C14 Read head of RSQ, read VAF index specified by RSQ. C 15 Send the VAF command to SYS, deallocate VAF entry if no VDB read required. However, if the corresponding VDB entry needs to be read by either the Zbox or Rbox, the VAF entry is not deallocated until this read occurs. C 16 Send second victim command, if applicable. A victim flow can require two messages to be delivered: one to the requesting node and one to the home node. If this is the case, then we send the second victim command at stage C16. 11.3.6.3 Secondary VAF Flows As noted above, if the victim is not coherent, or if the Mbox MGB has not yet written the modified data thru to the VDB, we must not send the victim, as follows: 1. When an incoherent MAF entry receives the final lnvalAck, making it coherent, the MAF sends the MAF index that became coherent to the VAF. We CAM the VAF with this MAF index, and the hit entry is written to the RSQ (note the retiming as we go from CS to C15): CS MAF sends coherent MAF index to VAF. C15/C6 CAM VAF with MAF index Cl6 Send VAF index to RSQ C 17 Write RSQ 2. When the Mbox MGB writes thru to the VDB, the VAF entry (if coherent) is ready to be sent to the system (note retiming as we go from C12 to Cl6): C9 Mbox MGB sends write-thru VDB ClO Cll C15/C12 MAF sends wr_thru_vdb_done to VAF C16 C17 Send VAF index to RSQ C 18 Write RSQ 11.3.6.4 Reserved VAF Entries Four VAF entries, one per TPU, are available only for WRIO requests. The Mbox uses these four entries along with the four corresponding VDB entries, to store write IO data. Additionally, four VAF entries are reserved for handling responses to forwards. 11.3.7 System Interface {SYS) The System interface section (SYS) connects the Cbox to the Zbox and Rbox via the internal packet ring network. The SYS contains two 8-entry FIFOs: one for requests from the Cbox destined for either the Zbox or the Rbox, and one for responses from the Cbox destined for either the Zbox or the Rbox. Incoming packets from the Rbox are Compaq Confidential 11-20 Second-Level Cache and Controller (Cbox) 5 Jc1m.1c1ry 2001 ... Subject To Cf1ange Scache Control - the CS Partition either passed along to the Zbox, placed into the PRQ (probes and forwards), or driven into the scache pipe (block responses and I2DResponse). The SYS section also contains the Cbox CSRs. In the 21464, unlike the 21364, we victimize Scache blocks at fill time. When the new fill arrives with data, we extract the victim, write the new data to the Scache, and write the victim to the VDB. Since we never stall fills, we must be assured that when the fill arrives, we have a VAF entry into which to put the victim. The MAF maintains a counter of the number of available VAF entries. Every time the SYS sends a request, we send a signal to the MAF counter to decrement the number of free VAF entries. Furthermore, the MAF sends back to the SYS a signal that indicates,...whether there is an available VAF entry. If there is not an available VAF entry, then the SYS must not send a request. 
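A minimal sketch of this VAF credit scheme, under assumed names, is shown below; the slot cost is passed explicitly so that requests which must reserve more than one victim buffer (see the note that follows) use the same interface.

```c
/*
 * Sketch of the VAF free-entry credit counter maintained by the MAF and
 * consulted by the SYS section.  Names and the reset value are
 * illustrative, not taken from the design database.
 */
#include <stdbool.h>

#define VAF_ENTRIES 64

static unsigned vaf_free = VAF_ENTRIES;   /* maintained by the MAF */

/* SYS asks the MAF whether a request needing 'slots' victim buffers
 * may be launched.  If not, the request stays queued. */
bool sys_may_send_request(unsigned slots)
{
    return vaf_free >= slots;
}

/* Signalled by SYS each time it launches a request; the MAF decrements
 * its count of free VAF entries. */
void sys_request_sent(unsigned slots)
{
    vaf_free -= slots;
}

/* Signalled when a VAF entry is deallocated (victim sent, or no victim
 * was extracted at fill time). */
void vaf_entry_freed(void)
{
    vaf_free++;
}
```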
Note: The BlkExclusiveProb message might require two VAF slots. Therefore, when we send a ReadModSTC request from the SYS, we decrement the VAF free count by 2. 11.3.7.1 Principle of Operation: Every cycle, the SYS section must arbitrate among the 3 possible sources that want to drive the ring. It does so with the following priority: 1. Rbox packet destined for the Zbox 2. System response packet from Cbox 3. System request packet from Cbox The SYS section receives "back pressure" signals from both the Zbox and Rbox. These signals tell the Cbox whether the Zbox or Rbox can accept new packets for a particular class (responses or requests). If either the Rbox or the Zbox cannot accept responses, the CBox does not place anything on the ring. If either the Rbox or the Zbox cannot accept requests, the Cbox does not send a new request packet. Finally, to prevent Zbox starvation, if the Cbox has driven new packets out onto the ring in 15 consecutive cycles, it stalls for one cycle and does not place a new packet on the ring. When the 8-entry FIFO in SYS for Cbox requests is filled, the SYS section signals back to the SRQ, which then stops sending further system requests to SYS. Similarly, when the 8-entry FIFO in SYS for Cbox responses is filled, the SYS section signals back to the RSQ section, which then stops sending further responses to SYS. 11.3.7.1.1 Response FIFO Entry Fields A response FIFO entry contains the following fields: Table 11-9 System Interface Section Response FIFO Entry Fields Field Name Size Block Address 31 bits Home_Owner_Node 10 bits Stripe_bit 1 bit Cmd 8 bits 5 January 2001 --·Subject To Change Compaq Confidential Second-Level Cache and Controller (Cbox) 11-21 Scache Control - the CS Partition Table 11-9 System Interface Section Response FIFO Entry Fields Field Name Size Requester_Node 10 bits Requester_MAF_Idx 6 bits Requester_VAF_Idx 6 bits Request_destined_for_zbox 1 bit 11.3.7.1.2 Request FIFO Entry Fields A request FIFO entry contains the following fields: Table 11-10 System Interface Section Response FIFO Entry Fields Field Name Size Block Address 31 bits Req_Home_Node 10 bits stripe_bit 1 bit cmd 8 bits req_maf_idx (6 bits request_destined_for_zbox 1 bit iomask 8 bits qwadd 3 bits tpu_idx 2 bits 11.3.8 System Request Queue (SRQ) The SRQ is a 60-entry 1 FIFO queue that buffers requests which miss in the Scache and require a system request. The SRQ serves two main functions: • Since we can generate requests more rapidly than the paths to memory can accept them, the SRQ serves as a buffer between the MAF and the system interface: We can generate one request per cycle, but the ring interface is shared with the Rbox and Zbox, so sometimes our request will not win arbitration for the ring. • The SRQ limits the number of outstanding requests to a given Scache index. 11.3.8.1 Principle of Operation The Scache is six-way set associative, which means that at any given Scache index, we have six sets of storage, and six tags. If we were to allow more than six outstanding requests to the same Scache index, it is possible that the first six responses (fills) would arrive, write into the Scache, and then the seventh fill would arrive and victimize the first before we have actually written the fill data to the Scache. 2 1 Four MAF entries are reserved for forward probes, so the SRO needs to be only 60 entries, not 64. 
Compaq Confidential 11-22 Second-Level Cache and Controller (Cbox) 5 Janwiry 2001 ··· Subject To Cfu.mge Scache Control - the CS Partition When the MAF generates a system request - either due to a miss from the Mbox or a system response that does not fully satisfy the original request - we put the MAF index of the miss into the SRQ, along with the Scache index. Each cycle, we take the entry from the head of the SRQ and check how many outstanding requests we already have to the stored Scache index. If the number of outstanding requests is less than six (the number of Scache sets at any index), we do the following: • We use the stored MAF index to read the PA from the MAF, and we deliver the request to the C-U-Z interface unit (CS_SYS). • We increment the count of the number of outstanding requests to this Scache index. However, if the number of outstanding requests is equal to six, we leave the SRQ entry on the queue and advance to the next SRQ entry; no request is sent. When we receive a response from the system, the proper counter is decremented. As noted earlier, the Scache index consists of bits PA<18:6>. Using counters to keep track of the number of outstanding requests at each index would therefore require 8K counters. To reduce this storage requirement, we instead allow a total of six requests to a group of 128 indixes. Physical address bits <11:6> are used to specify one of 64 counters. As an example, we would allow a total of six requests to be outstanding to the following group of Scache indices at any one time: 0, 128, 256, 384, ..... This could be a performance issue for strided code. There is a debug mode in the Cbox that allows only one outstanding request per PA<ll:6>. 11.3.9 Retry Queue (RTQ) Transactions in the Scache pipeline can encounter conditions that prevent them from completing normally. In those cases, the transaction must be retried, that is, reexecuted through the Scache pipeline. The retry queue is a 64-entry FIFO that holds transactions which must be retried. 11.3.9.1 Principle of Operation The following conditions can cause an Scache transaction to be placed in the retry queue and retried: • • • • • • Scache tag array bank conflict Scache data array bank conflict Single-bit ECC Scache tag error Single-bit ECC Scache data error Ifetch request from Mbox hits data in Mbox MGB System response arrives and finds MAF entry with MAF.i_fill_ena asserted 2 The current Scache pipeline is such that with six sets, this situation can not occur. Since the pipeline and the number of sets may change between now and tapeout, the SRQ remains in the design. 5 January 2001 --·Subject To Change Compaq Confidential Second-Level Cache and Controller (Cbox) 11-23 Fm Datapath - the CF Partition An entry in the RTQ consists of the following: • MAF index associated with the transaction (6 bits) • VAF index associated with the transaction (6 bits) • Retry type (2 bits) • Retry command (4 bits) • Retry due to data error ( 1 bit) • This retry has been nacked (1 bit) • This transaction originated with an IO processor (1 bit) If a transaction is to be retried, the MAF.sc_inflight bit for the MAF entry associated with the retry is kept asserted until the retry successfully retries and clears the Scache pipe. This ensures that no other transaction to the same address (with fills being the only exception) may enter the Scache pipeline until the retry has been processed. 
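The effect of keeping MAF.sc_inflight asserted while a retry is pending can be modeled as a simple admission check. The sketch below is illustrative; the types, the helper name, and the reduction of the full MAF accept/merge/reject tables to a single predicate are simplifications.

```c
/*
 * Sketch of the admission rule kept while a retry is pending: the MAF
 * entry's sc_inflight bit stays set, and any later transaction to the
 * same cache block (fills excepted) is rejected until the retry clears
 * the Scache pipe.  The pending retry itself re-enters via the RTQ
 * path, not through this check.
 */
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    bool     valid;
    bool     sc_inflight;   /* miss/fill/probe in flight in the Scache pipe */
    uint64_t pa_block;      /* PA<47:6> */
} maf_entry_t;

typedef enum { XACT_FILL, XACT_MISS, XACT_PROBE } xact_kind_t;

/* Returns true when the incoming transaction may enter the Scache
 * pipeline; false means it must be NACKed. */
bool scache_admit(const maf_entry_t *hit, xact_kind_t kind)
{
    if (hit == NULL || !hit->valid)
        return true;              /* no MAF entry for this block          */
    if (kind == XACT_FILL)
        return true;              /* fills are the only exception         */
    return !hit->sc_inflight;     /* blocked while the retry is in flight */
}
```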
The RTQ must have the same number of entries as the MAF (64) because we could have a situation where every system fill returns an exclusive block to the Cbox, and each MAF entry associated with these fills has MAF.i_fill_ena asserted (!stream request). This requires that each !fetch be retried from the RTQ. Each cycle, the RTQ reads the entry at the head of the FIFO, delivers it to the MAF, and deallocates the RTQ entry. If the MAF signals an ACK 3 cycles later, all is well. If instead, the MAF signals a NACK back to the RTQ (transaction was rejected), the RTQ allocates a new entry and pushes it into the back of the FIFO. Situations can arise whereby a transaction in the retry queue is continuously denied access to the Scache pipeline. The RTQ has logic to detect this situation. When the RTQ detects that an entry is being denied access, a signal is asserted to the MAF arbitration logic. This signal forces the MAF to reject all requests that have a lower priority than the RTQ. Please refer to the section on Cbox livelock/starvation avoidance for more details. 11.3.10 TTQ 11.4 Fill Datapath - the CF Partition 11.4.1 FBE 11.4.2 VDB 11.4.3 FOB Compaq Confidential 11-24 Second-Level Cache and Controller (Cbox) 5 Jam.1c1ry 2001 - Subject To Change Scache Tag Array - the ST Partition 11.4.4 DBM 11.4.5 RBI 11.4.6 RBO 11.5 Scache Tag Array - the ST Partition The Scache tag array (STAG) stores cache states of blocks in the secondary cache (Scache). 11.5.0.1 Principle of Operation In response to a Scache tag request command from the Scache control, the Scache tag array is responsible for the following: • Look up the Scache tag and send the CURRENT cache block state to the Scache control. • • Update the cache state and LRU if necessary. • In case of a single bit tag ECC error, correct and store the corrected tag in the tag ECC register and signal it to the Cbox. The Cbox (i.e. Scache control) retries the request. The tag ECC register gets cleared - After the retry reads the corrected tag from the register. - A probe or a system fill to the same Scache index since they may displace or victmize the cache block which is in the tag ECC register. Compaq Confidential 5 January 2001 -~Subject To Change Second-Level Cache and Controller (Cbox) 11-25 Scache Tag Array - the ST Partition 11.5.0.2 Pipeline Stages Table 11-11 shows the pipeline stages for the Scache tag array. Table 11-11 Scache Tag Array Pipeline Stages $2 $3 $4 $6 1. Decode the Scache index for RD. 2. Generate the tag ECC [1]. 1. Read the Scache tag/LRU. 2. CAM the ECC tag register with the MAFindex. 1. Tag compare and · 1. Send the set num- 1. Send the response Set select. ber to the Cbox and (VSD) to the Cbox. 2. Syndrom genera- SC data array. tion and Error detec- 2. Write the Scache tag and/or LRU if tion. 3. Decode the necessary. Scache index for 3. Single bit error correction and load WR. 4. Decode the LRU the corrected tag into and look up the stale the ECC register if a retry is required [2]. fill table if LRU_RD_ENA is 4. Clear the tag ECC register if necessary. asserted. 5. Fix the tag ECC 5. Write the stale fill bits if necessary [ 1]. table if C_ST_CMD_LRUE VICT. $5 Notes: • [l]: The ECC bits are function of the physical address and the cache state. Yet we only need tag ECC bits which depend only on the physical address for the tag compare. Tag ECC bits which depend on the cache state are generated before the tag is written back. 
• [2]: A retry is required if Scache miss and a single bit tag ECC error in any set OR Scache hit and a single bit ECC error in the same set. 11.5.0.3 State Transition Table 11-12 shows the Scache tag state transition table. Table 11-12 Scache Tag State Transition Table Current Tag State Command (cs%st_cmd_c3a) Invalid Shared Exel Dirty Write the Stale Fill Inval. LRURD LRUWR Tag RD Tag WR Table ECC1 C_ST_CMD_NOOP Invalid Shared ExclCln Dirty No C_ST_CMD_MISS C_ST_CMD_SEfDIITTY Invalid Invalid Shared Shared ExclCln Dirty Dirty Dirty No No No No No No Yes 2 Yes No No No Yes 2 Yes Yes No No No Compaq Confidential 11-26 Second-Level Cache and Controller (Cbox) 5 Jc1nuc1ry 2001 ... Subject To Change Scache Tag Array - the ST Partition Table 11-12 Scache Tag State Transition Table (Continued) Current Tag State Command (cs%st_cmd_c3a) Invalid Shared C_ST_CMD_LRUEVICT3 Exel Dirty Invalid LRURD LRUWR Tag RD Tag WR Write the Stale Fill Inval. Table ECC 1 Yes No Yes No Yes No C_ST_CMD_INVAL Invalid Invalid Invalid Invalid No Yes 4 Yes Yes C_ST_CMD_CTOS Invalid Shared Shared Shared No Yes 2 Yes Yes No Yes 2 Yes Yes No Yes 2 Yes Yes No Yes4 No Yes No Yes 2 No Yes No Yes 2 No Yes No Yes 2 No Yes No No No No C_ST_CMD_STOE ]Invalid C_ST_CMD_STOD Invalid C_ST_CMD_BLKINV Invalid C_ST_CMD_BLKSHR ExclCln Must not happen. Dirty Must not happen Shared C_ST_CMD_BLKEXCL ExclCln C_ST_CMD_BLKDIRTY Dirty C_ST_CMD_RCVRY Invalid Shared ExclCln Dirty Yes [Yes] Yes C_ST_CMD_STOI C_ST_CMD_ETOS_DTOI 1 2 Invalidate ECC register in the index match Make the set most recently used. 3 To prevent the LRUEvict from evicting a stale fill block, we check the stale fill block table at the same time as we read out the LRU. The stale fill block table stores the set numbers that may have stale fill blocks. Those sets in the stale fill table must not get evicted. The stale fill block table is a simple FIFO that stores the last N bits ( 4 - 8 bits ) of Scache index and the set number of system fills in-flight in the Scache. 4 Make the set least recently used. • Stag Read/Write conflict Because the Scache bank conflict can't be detected early enough it is possible to have a read/write conflict to the same Scache bank. For a Scache bank conflict, the earlier transaction takes precedence over the later one, which means the WRITE must proceed. One exception is when we try to evict a block to make a room for the following fill (i.e. LRU_RD_ENA is asserted), the WRITE is discarded in favor of the READ. The Scache control (Cbox) is responsible for replaying the rejected transaction. ST_RD_ENA_C2A LRU_RD_ENA_C2A ST_WR_ENA_C4A LRU_WR_ENA_C4A ACTION 0 0 0 NoOp/Fill x x 0 Must not happen 0 0 0 0 0 0 5 January 2001 ~·Subject To Change 0 0 WriteLRU Write tag and LRU x x Must not happen 0 0 Read tag Compaq Confidential Second-Level Cache and Controller (Cbox) 11-27 Scache Tag Array - the ST Partition ST_RD_ENA_C2A LRU_RD_ENA_C2A ST_WR_ENA_C4A LRU_WR_ENA_C4A 0 0 ACTION Read tag and write LRU 0 Bank conflict: Write tag and write LRU x x Bank conflict: Read tag andreadLRU 11.5.0.4 Stale Fill Table The SFf (Stale Fill Table) stores the set number and the Scache index of the system fill in progess in the Scache to prevent from victimizing the set which has the stale data in the Scache. Table 11-13 Stale Fill Table (SFT) C_ST_CMD_LRUEVICT(O) c_sT_CMD_BLK*(O) C4 CS [1] [2] C6 C7 C8 C9 C10 C11 C12 C13 C14 C15 C16 Notes [4] [3] C_ST_CMD_LRUEVICT(5) [1] [2] Victimizing the block before the block is written into theScache. 
Notes: • [l] LRU decode and set select • [2] Scache data array read • [3] Scache tag array write • [4] Scache data array write • If we have less than 6 sets, we must not send more than (N-1) system requests to the same Scache index, where N is the number of sets. 11.5.0.5 The 21464 Scache Least Recently Used (LRU) Scheme The 21464 Scache is 6-way set associative cache. To minimize the probability of the Scache bank conflict, the 21464 LRU scheme is designed such that only a LRU victim needs to read the LRU. All other Scache transaction does not require reading of the LRU array. The proposed 21464 Scache LRU uses an approximate tree-like structure. Compaq Confidential 11-28 Second-Level Cache and Controller (Cbox) 5 Jc1nuc1ry 2001 - Subject To Change Scache Tag Array - the ST Partition Table 11-14 Scache Least Recently Used (LRU) State Bits States Notes Arb The set A is more recent than the set B CrD The set C is more recent than the set D ErF The set E is more recent than the set F ABrCD The set AB is more recent than the set CD ABrEF The set AB is more recent than the set EF CDrEF The set CD is more recent than the set EF • Making a set the most recently used (MRU) set: LAU states ArB CrD ErF ABrCD ABrEF CDrEF Make the set A MRU Set No change No change Set Set No change Make the set B MRU Clear No change No change Set Set No change Make the set C MRU No change Set No change Clear No change Set Make the set D MRU No change Clear No change Clear No change Set Make the set E MRU No change No change Set No change Clear Clear Make the set F MRU No change No change Clear No change Clear Clear • Making a set the least recently used (LRU) set LAU States ArB CrD ErF ABrCD ABrEF CDrEF Make the set A LRU Clear No change No change Clear Clear No change Make the set B LRU Set No change No change Clear Clear No change Make the set C LRU No change Clear No change Set No change Clear Make the set D LRU No change Set No change Set No change Clear Make the set E LRU No change No change clear No change Set Set Make the set F LRU No change No change set No change Set Set 5 January 2001 - Subject To Change Compaq Confidential Second-Level Cache and Controller (Cbox) 11-29 Scache Tag Array - the ST Partition • Determining an LR U set from the LRU states LAU States ArB CrD ErF ABrCD ABrEF CDrEF LRU Set 0 x x x x x x 0 0 Set A is the LRU 0 0 x x 1 x x 0 Set C is the LRU 0 Set Dis the LRU x x 1 1 Set Eis the LRU 1 1 Set Fis the LRU 0 1 0 Illegal state 1 0 1 Illegal state 1 x x x x x x 0 1 x x x x 0 1 x x 1 Set Bis the LRU Notes: • We assign an arbitrary set when an illegal state is detected . 11.5.0.6 Scache Tag ECC Code The 21464 Scache tag (paddr<46:19> + vsd<2:0>) array is protected by a 32b ECC code. The C"s in the table below indicate which bits contribute to the corresponding check bit. The corresponding check bit is the parity of the contributing bits and the syndrom bit is the parity of the corresponding check bit and the contributing bits. Check bits 6 and 4-0 are calculated with odd parity, meaning that the total number of ones in the contributing data bits and the check bit will be odd when stored in memory. Check bit 5 is calculated with even parity, to ensure that the cases of all-zero and allones returned from the memory are reported as uncorrectable errors. When the tag is invalidated, a complete entry will be written, with zeros in all the data bits; this will require that the check bits be written with Ox5F so that it can be read without error. 
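For illustration, the check-bit generation just described can be modeled in software using the per-check-bit participation masks listed below. The packing of paddr<46:19> and the VSD bits into one 32-bit word is an assumption of the sketch rather than a statement about the array layout; note that an all-zero tag produces check bits of 0x5F, matching the invalidation case above.

```c
/*
 * Sketch of the Scache tag check-bit/syndrome computation.  The 31
 * protected bits (paddr<46:19> plus VSD) are assumed packed into the
 * low bits of a 32-bit word.
 */
#include <stdint.h>

static const uint32_t ecc_mask[7] = {
    0x75549522,  /* check 0 */
    0x4CB24C91,  /* check 1 */
    0x238E2388,  /* check 2 */
    0x1F81E078,  /* check 3 */
    0x007FE007,  /* check 4 */
    0x00001FFF,  /* check 5 (even parity) */
    0x7A691A44   /* check 6 */
};
/* bit i set -> check bit i uses odd parity (bits 6 and 4..0) */
static const uint8_t odd_parity = 0x5F;

static inline uint32_t parity32(uint32_t v)
{
    v ^= v >> 16; v ^= v >> 8; v ^= v >> 4; v ^= v >> 2; v ^= v >> 1;
    return v & 1u;
}

/* Compute the 7 check bits for a tag value (paddr<46:19> + VSD). */
uint8_t tag_check_bits(uint32_t tag)
{
    uint8_t cb = 0;
    for (int i = 0; i <= 6; i++) {
        uint32_t p = parity32(tag & ecc_mask[i]);
        if (odd_parity & (1u << i))
            p ^= 1u;     /* odd parity: stored check bit makes the total odd */
        cb |= (uint8_t)(p << i);
    }
    return cb;
}

/* The syndrome is the XOR of the stored check bits with the check bits
 * recomputed from the stored tag; a zero syndrome means no error. */
uint8_t tag_syndrome(uint32_t stored_tag, uint8_t stored_cb)
{
    return (uint8_t)(tag_check_bits(stored_tag) ^ stored_cb) & 0x7F;
}
```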
All valid syndromes (in the sense that they represent correctable errors) have either 1 or 3 ones. A syndrome with only 1 one represents an error in a check bit; a syndrome with 3 ones represents an error in a data bit. Any syndrome with an even number of ones, or with 4 or more ones, represents an uncorrectable error. Customary practice is to ignore the cases of 5 or 7 ones, and to report an uncorrectable error when there is an even number of ones in the syndrome.

Table 11-15 Scache TAG Syndrome Bits

(Matrix of check/syndrome bits <6:0> against the protected bits, paddr<46:19> and VSD; a C marks each bit that participates in the corresponding check bit, and the participation pattern corresponds to the masks listed below. C/S means Check/Syndrome; S means Syndrome.)

• Masks to select data participating in each check bit:
  - check 6 mask = 0x7A691A44
  - check 5 mask = 0x00001FFF
  - check 4 mask = 0x007FE007
  - check 3 mask = 0x1F81E078
  - check 2 mask = 0x238E2388
  - check 1 mask = 0x4CB24C91
  - check 0 mask = 0x75549522

• Tag compare valid
  - paddr<46:19>
  - CB<5:3>

• Error detection

  EVEN PARITY(SYN<6:0>)   OR(SYN<5:0>)   Meaning
  Even                    Zero           Good data
  Even                    Non-zero       Double or even number of errors
  Odd                     Zero           Triple or odd number of errors
  Odd                     Non-zero       Single bit error

  If the memory returns all zeros in the data and check bits, the resulting syndrome will be 0x5F. All ones will result in a syndrome of 0x12. Both syndromes have an even number of ones, so both will be reported as uncorrectable errors.

• Error correction

  SYN<5:0> indicates the location of the bit error; the erroneous bit simply gets flipped:

  Corrected bit = (Uncorrected bit) XOR (DECODE(SYN<5:0>))

11.6 Scache Data Array - the SG Partition

This is mentioned in Tables 10-2 and 10-3.

11.7 Flows

11.7.1 Overall Pipeline Flow

Fills, Probes, and Misses all access the Scache using the same pipeline. Each cycle one operation is picked for launch:

• FillExcl, FillShared from system; displace a victim (2-cycle operation, granted on even cycles; highest priority)
• Transfers between Router and Rambus
• Replays (Tag ECC error, Data ECC error, resource or ordering conflict)
• ChgToExclMark, ReadMark, Probes (Inval, Forward, ForwardInval), and VictimAcks
• Local memory accesses
• Misses and Evict requests from the MAF
• Early returns from system

(To avoid deadlocks the system priorities are: Fills -> Forwards -> Probes -> Requests.)

11.7.1.1 Pipe Operation

Operations are launched into the pipe and either complete successfully, fail because of an error in the Tag or Data ECC code, or fail because of a resource conflict with a previously launched operation. Failed operations queue to be replayed. Corrected tag or data for ECC failures is temporarily saved in a bypass register. When the operation is retried, it accesses the bypass register.
The bypass register is cleared by Tag or Data writes to the same index.

(Flow table: per-stage Scache pipeline actions for Fill_1, Fill_2, CTE_1 (change), CTE_2 (change), CTE_1 (fill from victim), CTE_2 (fill from victim), Probe (Inval), Probe (Forward), Probe (ForwardInval), Local Probes, and Miss. The legible entries include: grant if cycle is even; read PA from MAF, check MAF (CTE [3]), update MAF state; CAM MAF, merge or allocate entry, detect hold conditions [2], hold in MAF; detect conflicts [1]; row decode tag ram; read tag ram; check VAF - if a shared block is found follow the "fill from victim" flow, if a victim is there mark it and do not try reloading but still update the tags, check and conditionally invalidate the VAF [5]; compare PA to tags, determine set number, generate tag syndromes; select victim set (LRU), generate victim tag syndrome, bypass victim set number from Fill_1; correct tags, Tag ECC replay [4], invalidate the Tag_ECC register if its index matches the fill or CTE index; write tag and VSD, set Excl and LRU (if match), write tag Invalid (if match), update tag (Excl -> DirtyShared, DirtyShared -> Shared), write tag LRU (if match); read or write the data ram; send the victim, CTE, or fill PA to the Mbox; write the victim PA to the VAF and the victim data to the VDB; set select the fill, victim, or forward data; if the block was Excl send a probe to the Mbox, send InvalAck, Mbox writes through to SData and the VDB; generate the data syndrome, correct data, Data ECC replay [3]; if not already there write the forward data to the VDB and send fill data from the VDB to the Mbox.)

Table 11-17 Resource and Order Conflicts

Probe (not Inval)
  Resource conflicts: Same tag bank as a Fill or CTE 3 cycles earlier. Same data bank as a Fill or CTE 7 cycles earlier. Same tag bank as a Probe or Miss 2 cycles earlier.
  Order conflicts: In-flight Fill, CTE, Forward, or ForwardInval to the same index which may extract a victim. In-flight Fill or CTE to the same index about to write the data that the Probe is reading.

Miss
  Resource conflicts: As Probe.
  Order conflicts: In-flight Fill, CTE, Forward, or ForwardInval to the same index which may extract a victim. In-flight Fill or CTE to the same block about to write the data that the Miss is reading.

Write Through
  Resource conflicts: Same tag bank as .. Same data bank as ..

Notes:
[2] MAF holds a CTE or Probe when:
  - It matches a MAF entry for the same block which has received a ReadMark or CTEMark but is not yet coherent.
  - A Probe to the same block is in flight and might replay. (For CTE: an in-flight Forward, for the cases FillExcl; Forward; CTEMark.)
[3] CTE cases:
  - FillExcl checks the MAF. If we already have the Shared block, the Fill is dropped and the ReadMark is converted into a CTE. (or vice versa?)
  - CTEMark checks the MAF. If we already have the Excl block, nothing needs to be done.
[4] ECC replays:
  - CTE (change), Inval, Forward, ForwardInval, or Miss may replay because of a Tag ECC error.
  - CTE (change) or Miss may replay because of a Data ECC error.
[5] Invalidating VAF entries:
  - Probe Inval invalidates and deallocates shared blocks in the VAF.
  - Probe Inval marks Excl blocks which have not sent a victim request as "no victim req needed".

(Flow table: per-stage Scache pipeline actions for Blk*, I2DResp [4], S2D*, Retry, Misses, WRIOAck, SharedInval, and Forwards, against the stages Arb, C1 MAF, C2 Tag launch, C3 Rd Tag, C4 Tag Compare, C5 Rd Data, C6 Set Select, C7 Drive Data, Fill Bus, C8, C9 WR1 through C14 WR6, C15, and C16. The legible entries include: arbitrate for the Scache pipe; read the PA and MAF states, allocate or merge a MAF entry, bank conflict check, allocate a VAF entry; compute and write the new MAF state, send ACK/MAF index to the Mbox; read the LRU/tags, error detection, set select, victim set select, tag ECC correct, update the Stag/SC tag [3], read SC data, victim tag ECC; send the fill or victim address/tag (and VAF index) to the Mbox, CAM the VAF, send WRIOAck to the Mbox, write the VAF, write the LRU/tag; fill or victim data on the fill bus, victim data correction, ECC correction, write the VDB (C11B); write the SC data array.)
11.7.1.4 Scache Bank Conflict Check Table 11-20 shows the Scache bank conflict timing Table 11-20 Scache Bank Conflict Timing MissO $2 $3 ~~ .a :1~ ~~ ~ S' ~] iE "C 5 U'.l 8 Miss2 H ] ~ ~~ 5'~ a Cl) !a S' 0 u Victim3 ~ i ] ~ $7 $8 $9 $10 $11 $12 $13 $14 $15 Notes = H ~ $6 ~ ~ 5' ~~ $5 ~ iE "C 5 U'.l Cl)] Missl 54 !+:I "C 5 U'.l g~ .....:iU'.l ~~ 8"~ 1 ~ S' 8 ~ = ] !+:I Cl) U'.l ~ i..: "C ~ ~ :::'.. !+:I "8 Cl) "8 U'.l U'.l Cl) cJ .a ~· a s '+J ~~ ~ ·~ :>' §~ ~ ,t::i· .... .~ ~.s ,_:i ..... "C ,.c: ti U'.l ul:l > Cl) :s: ~ ...... 5 8 Fill4 g~ ~~ .s Cl) $~ ~ 8" Victim6 ~ i ~ Cl) ~ ~ ] ~ ] iE ] U'.l U'.l tE 1 cJ S' E ~ 0 u ~ ·.gs :>' .a"C <l:l s ti ·;; "C 5 U'.l ~ U'.l $ ·i:: ~ ti ...t ~s OU:.. o.S f~ ,t::l ~ Compaq Confidential 11-40 Second-Level Cache and Controller (Cbox) 5 Jc1nuc1ry 2001 ~Subject To Change Flows Table 11-20 Scache Bank Conflict Timing $2 $3 $4 Miss5 SS $6 i i ] ~ ~ E"0 u $7 $8 ~ .a] "g <!) ~ ci ~ $9 $10 $11 $12 S13 $14 $15 Notes c'il ~ i.E 1E ] ] t:I) tZl <!) ~ 8' H Miss6 ! E ~ E" 8 ci rl ~~ ~] 8'~ .a] ~ 't:I 5 tZl ~ §:i ~~ 't:I ::::: o.o,.s <+::: $ ..... 't:I c: <!) <!) ~ '5-u c:IS•-< tZl ~~0 (.) ::::) Miss13 ~ "g <!) ~ E ~ E" 8 ~ §s ] ~..s ~-~ ci ~ 0§ 't:I tZl ~ ~ll.. ~~ (.) * 't:I ~ Notes: • [l]: We have a separate LRU array; hence, this does not cause a Stag array bank conflict (i.e. no miss-miss bank conflict). • • The current Scache proposal has 16 independent banks . The Scache bank conflict check prevents: Two accesses to the same Scache bank (tag & data array). The write through to the bank (index ??) that has a stale victim block (i.e. victim block in the process of being written to the VDB). The write-through to the bank (index) that has an inflight fill block. In case of a bank conflict, the request that entered the Scache pipe later gets retried. However, the system fill preempts the preceding Scache access in case of bank conflict to avoid retrying the system fill. • Do a system fill at even cycle to prevent the resource conflict with the immediately following system fill. • We must ensure that there is no bank conflict between reading of the Scache data for LRUVictim and writing of the Scache for a fill block. This means the • Scache write must be in the even cycle . • Misses(X) can have bank conflict to the Scache data bank with a preceding system fill. • In case of bank conflicts, the request that entered the Scache pipe later gets retried except for system fills which forces the conflicting request to get retried even if it entered the Scache pipe earlier. 5 January 2001 ·-Subject To Change Compaq Confidential Second-Level Cache and Controller (Cbox) 11-41 Flows 11.7.2 Fill and LRU Evict Flow 11.7.2.1 Hiccup Flow 11.7.3 Probe Flow 11.7.4 Mbox Request Flow The Cbox looks up the Scache tag and sends the tag and/or fill data to the Mbox in response to the Mbox Miss request. In case of Scache miss, the Cbox sends the system request. Miss request Scache look=up. 
Table 11-21 Miss Request Command Summary Miss Request Command Scache State Dcache Fill Command VSD to Mbox Fill Data MAF .sys_cmd<2: to Mbox 0> Ifetch oxx C_DFILL_CMD_MGBPROBE oxx No Read Shared 100/110 C_DFILL_CMD_MGBPROBE 1 100/110 No None 101 Yes None No Read C_DFILL_CMD_FILLBLK 1001101 1110 Yes None FetchLineMod CtoD oxx C_DFILL_CMD _FILLBLK2 oxx No ReadMod 110 C_DFILL_CMD _FILLBLK 110 Yes ShrToDirty 100/101 C_DFILL_CMD_FILLBLK 101 3 Yes None oxx C_DFILL_CMD _FILLBLK2 oxx No ReadMod 100/101/110 C_DFILL_CMD_FILLBLK 1001101 31110 Yes None oxx C_DFILL_CMD _FILLBLK2 oxx No ReadShr 100/101/110 C_DFILL_CMD_NOCACHE 1001101 3/110 Yes None oxx C_DFILL_CMD_FILLBLK2 No ReadShr 100/101/110 C_DFILL_CMD_NOOP No None oxx C_DFILL_CMD_FILLBLK2 oxx oxx oxx No InvToDirty 110 C_DFILL_CMD_FILLBLK 110 Yes ShrToDirty 4 lOX C_DFILL_CMD_ITODRESP 101 3 Yes None oxx C_DFILL_CMD_STCDONE2 oxx No None 110 C_DFILL_CMD_FILLBLK 110 Yes ShrToDirtySTC C_DFILL_CMD_STCDONE 101 3 Yes None FetchLine 101 C_DFILL_CMD_MGBPROBE oxx C_DFILL_CMD_FILLBLK2 100/101/110 PfetchLineMod PfetchNocache PfetchScache ltoD CtoDSTC lOX 1 1 oxx 3 Mbox send dl%probe_hit when the Ifetch hits the dirty block in the MGB. 2 Mbox invalidates the Dcache block. 3 The cache state to the Mbox will be ExclCln if the block is not coherent to prevent the merge buffer from wring the Scache. Compaq Confidential 11-42 Second-Level Cache and Controller (Cbox) 5 Janw'*rY 2001 -- Subject To Change Flows 4 In order not to replay InvToDirtyRespCnt due to Scache tag ECC error, we send a ShrToDirtyReq if we have a shared copy even if Mbox has the full cache block modified. Notes: • • The MAEsys_cmd gets set after Misses look up the Scache tag . The MAEsys_cmd can be changed before we make the system request (i.e . MAEneed_sys_rqst): A new Miss request gets merged. A probe hits a ShrToDirtyReq. If a probe hits a ShrToDirtySTCReq, we may de-allocatethe MAF entry. System request pipe arbitration • The system request pipe is arbitrated between Miss requests from the MAF and Victim from the VAE • System requests queued in the MAF are arbitrated based on the age priority. • The responses in the VAF has the priority over the reqeusts in the MAE 11.7.5 Victim Flow The sources of Scache Victim(X) are: • External probe ( *Forwards). • Internal probe. • LRU displacement by a system fill. The victim block has to be coherent before the Victim can be sent to the home node. This requires a victim to CAM the MAE The LRU evicted Victim(X) at the system fill time must: 1. Pull the victim out of the Scache. 2. Perform the victim Tag ECC correction. 3. If the victim block is shared, then the victim process completes. 4. If the victim block is Exclusive or Dirty: Write the ECC corrected victim address to the VAE Send the ECC corrected victim address to the Mbox. Put the ECC corrected victim data into the VDB. CAM the MAF to see if the victim block is coherent. If the merge buffer has modified bytes, the merge buffer writes the modified bytes to the VDB after the ECC corrected Scache victim is in the VDB. 5 January 2001 -- Subject To Change Compaq Confidential Second-Level Cache and Controller (Cbox) 11-43 Flows There exists a time window where the merge buffer does not know the block has been victimized until it receives the victim address even though the victim block has been removed from the Scache. The bank conflict check in the Scache prevent the merge buffer from writing to the displaced Scache block. Probe induced victim. 
(CHECK WEBSITE)

Table 11-22 Victim Command Summary

SC command | SC VSD | Victim command 1 (cmd1) | Victim command 0 (cmd0) | Extract MGB Data | CAM the MAF1

LRU Displacement
CS_MAF_SC_CMD_LRUEVICT | 0XX/110 | CS_VAF_CMD_NOOP | CS_VAF_CMD_NOOP | No | No
CS_MAF_SC_CMD_LRUEVICT | 100 | CS_VAF_CMD_NOOP | CS_VAF_CMD_VICTIM_CLEAN | Yes | Yes
CS_MAF_SC_CMD_LRUEVICT | 101 | CS_VAF_CMD_NOOP | CS_VAF_CMD_VICTIM | Yes | Yes

Internal Probe
CS_MAF_SC_CMD_VICTIM | 0XX/110 | CS_VAF_CMD_NOOP | CS_VAF_CMD_NOOP | No | No2
CS_MAF_SC_CMD_VICTIM | 100 | CS_VAF_CMD_NOOP | CS_VAF_CMD_VICTIM_CLEAN | Yes |
CS_MAF_SC_CMD_VICTIM | 101 | CS_VAF_CMD_NOOP | CS_VAF_CMD_VICTIM | Yes |
CS_MAF_SC_CMD_VICTOSHR | 0XX/110 | CS_VAF_CMD_NOOP | CS_VAF_CMD_NOOP | No |
CS_MAF_SC_CMD_VICTOSHR | 100 | CS_VAF_CMD_NOOP | CS_VAF_CMD_VICTIM_CLEAN_TO_SHR | Yes |
CS_MAF_SC_CMD_VICTOSHR | 101 | CS_VAF_CMD_NOOP | CS_VAF_CMD_VICTIM_TO_SHR | Yes |
CS_MAF_SC_CMD_INVAL | 0XX/110 | CS_VAF_CMD_NOOP | CS_VAF_CMD_NOOP | No |
CS_MAF_SC_CMD_INVAL | 100/101 | Must not happen | | |

External Probe
CS_MAF_SC_CMD_FETCHFWD | 0XX/110 | CS_VAF_CMD_NOOP | CS_VAF_CMD_FOWARD_MISS | No | No3
CS_MAF_SC_CMD_FETCHFWD | 100 | CS_VAF_CMD_BLK_INVAL | CS_VAF_CMD_FORWARD_ACK_SHR | Yes |
CS_MAF_SC_CMD_FETCHFWD | 101 | CS_VAF_CMD_BLK_INVAL | CS_VAF_CMD_VICTIM_ACK_SHR | Yes |
CS_MAF_SC_CMD_READSFWD | 0XX/110 | CS_VAF_CMD_NOOP | CS_VAF_CMD_FOWARD_MISS | No |
CS_MAF_SC_CMD_READSFWD | 100 | CS_VAF_CMD_BLK_SHARED | CS_VAF_CMD_FORWARD_ACK_SHR | Yes |
CS_MAF_SC_CMD_READSFWD | 101 | CS_VAF_CMD_BLK_SHARED | CS_VAF_CMD_VICTIM_ACK_SHR | Yes |
CS_MAF_SC_CMD_READFWD | 0XX/110 | CS_VAF_CMD_NOOP | CS_VAF_CMD_FOWARD_MISS | No |
CS_MAF_SC_CMD_READFWD | 100 | CS_VAF_CMD_BLK_SHARED | CS_VAF_CMD_FORWARD_ACK_SHR | Yes |
CS_MAF_SC_CMD_READFWD | 101 | CS_VAF_CMD_BLK_SHARED | CS_VAF_CMD_VICTIM_ACK_SHR | Yes |
CS_MAF_SC_CMD_READMFWD | 0XX/110 | CS_VAF_CMD_NOOP | CS_VAF_CMD_FOWARD_MISS | No |
CS_MAF_SC_CMD_READMFWD | 100 | CS_VAF_CMD_BLK_EXCL | CS_VAF_CMD_FORWARD_ACK_EXCL | Yes |
CS_MAF_SC_CMD_READMFWD | 101 | CS_VAF_CMD_BLK_DIRTY | CS_VAF_CMD_FORWARD_ACK_EXCL | Yes |
CS_MAF_SC_CMD_READMFWD (requester is IO proc.) | 0XX/110 | CS_VAF_CMD_NOOP | CS_VAF_CMD_FOWARD_MISS | No |
CS_MAF_SC_CMD_READMFWD (requester is IO proc.) | 100 | CS_VAF_CMD_BLK_EXCL | CS_VAF_CMD_FORWARD_ACK_EXCL | Yes |
CS_MAF_SC_CMD_READMFWD (requester is IO proc.) | 101 | CS_VAF_CMD_BLK_EXCL | CS_VAF_CMD_VICTIM_ACK_EXCL | Yes |
CS_MAF_SC_CMD_ITODFWD | 0XX/110 | CS_VAF_CMD_NOOP | CS_VAF_CMD_FOWARD_MISS | No |
CS_MAF_SC_CMD_ITODFWD | 100 | CS_VAF_CMD_INV_TO_DIRTY_RESP | CS_VAF_CMD_FORWARD_ACK_EXCL | No |
CS_MAF_SC_CMD_ITODFWD | 101 | CS_VAF_CMD_INV_TO_DIRTY_RESP | CS_VAF_CMD_FORWARD_ACK_EXCL | No |
CS_MAF_SC_CMD_ITODFWD (requester is IO proc.) | 0XX/110 | CS_VAF_CMD_NOOP | CS_VAF_CMD_FOWARD_MISS | No |
CS_MAF_SC_CMD_ITODFWD (requester is IO proc.) | 100 | CS_VAF_CMD_INV_TO_DIRTY_RESP | CS_VAF_CMD_FORWARD_ACK_EXCL | Yes |
CS_MAF_SC_CMD_ITODFWD (requester is IO proc.) | 101 | CS_VAF_CMD_INV_TO_DIRTY_RESP | CS_VAF_CMD_VICTIM_ACK_EXCL | Yes |
CS_MAF_SC_CMD_SHRINVAL | 0XX/110 | CS_VAF_CMD_NOOP | CS_VAF_CMD_INVAL_ACK | No |
CS_MAF_SC_CMD_SHRINVAL | 100/101 | Must not happen | | |
CS_MAF_SC_CMD_SHRINVALB | 0XX/110 | CS_VAF_CMD_NOOP | CS_VAF_CMD_SPCL_INVAL_ACK | No |
CS_MAF_SC_CMD_SHRINVALB | 100/101 | Must not happen | | |

Non-Block Responses
CS_MAF_SC_CMD_STODSUCC | 0XX | CS_VAF_CMD_NOOP | CS_VAF_CMD_VICTIM_CLEAN | |
CS_MAF_SC_CMD_STODSUCC | 110 | CS_VAF_CMD_NOOP | CS_VAF_CMD_NOOP | |
CS_MAF_SC_CMD_STODSUCC | 100/101 | Must not happen | | |
CS_MAF_SC_CMD_STODPROB | 0XX | CS_VAF_CMD_NOOP | CS_VAF_CMD_SHR_TO_DIRTY_RELEASE | |
CS_MAF_SC_CMD_STODPROB | 110 | CS_VAF_CMD_NOOP | CS_VAF_CMD_SHR_TO_DIRTY_COMPL | |
CS_MAF_SC_CMD_STODPROB | 100/101 | Must not happen | | |

1 CAM the MAF to check whether the victim block is coherent.
2 Internal probes are processed only when the block is coherent.
3 If the block is not coherent, we send ForwardMiss in response to a *Forward.
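As an illustration of the LRU-displacement rows of Table 11-22, the sketch below selects the command sent to the home node (cmd0) from the VSD state of the displaced block; this also matches the Victim column of Table 11-24. The enum and function names are assumptions for illustration only, not the RTL.

    /* Victim command selection for an LRU displacement (CS_MAF_SC_CMD_LRUEVICT):
       a Shared or invalid block owes the home nothing, an ExclClean block sends
       a VictimClean, and a Dirty block sends a Victim.                          */
    #include <stdbool.h>

    typedef enum { VSD_INVALID, VSD_SHARED, VSD_EXCL_CLEAN, VSD_DIRTY } vsd_t;
    typedef enum { VAF_CMD_NOOP, VAF_CMD_VICTIM_CLEAN, VAF_CMD_VICTIM } vaf_cmd_t;

    static vaf_cmd_t lru_evict_victim_cmd(vsd_t vsd, bool *extract_data,
                                          bool *cam_the_maf)
    {
        switch (vsd) {
        case VSD_EXCL_CLEAN:                  /* VSD = 100 */
            *extract_data = true;             /* pull the block into the VDB */
            *cam_the_maf  = true;             /* check that it is coherent   */
            return VAF_CMD_VICTIM_CLEAN;
        case VSD_DIRTY:                       /* VSD = 101 */
            *extract_data = true;
            *cam_the_maf  = true;
            return VAF_CMD_VICTIM;
        default:                              /* VSD = 0xx or 110 (Shared)   */
            *extract_data = false;
            *cam_the_maf  = false;
            return VAF_CMD_NOOP;
        }
    }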
11.7.6 Retry Flow

11.8 Special Support

11.8.1 Input - Output

An I/O request is a reference to blocks in the I/O portion of the physical address space (i.e. PA<47> = 1). In contrast to the memory address space, both I/O read and I/O write may have side effects, and data may change without having been written. Since I/O space behaves differently from the memory space, (1) I/O blocks are not cached, (2) I/O requests may not be issued speculatively, and (3) I/O requests must follow the same order given by the program order.

11.8.1.1 I/O Request Ordering and Merging

• The Mbox maintains the I/O ordering. Mbox will send a new I/O request only after the previous I/O request for the same thread is sent out to the system. Cbox is responsible for notifying Mbox when Cbox launches an I/O request to the system by sending the thread processing unit (TPU). For WRIO requests, Cbox is also responsible for notifying Mbox when the IOWrAck is received so Mbox can retire the Memory Barrier (MB).
• RDIO and WRIO from the same thread get merged in Mbox. WRIO get merged in the merge buffer while RDIO get merged by (TBD). The MAF provides Mbox with the physical address CAM to assist the I/O request merging in Mbox.

    if (m%miss_pa<47:6> == pa[i]<47:6> && io_ok_to_send[i] == 0)
        merge the new I/O request & return the merged MAF index;
    else
        allocate MAF entry & return the new MAF index;

• We may have only ONE I/O request per thread waiting for the system launch.
• I/O requests whose merging window is open do not bid for the system request.
• Mbox does not send multiple I/O requests to the same block from different threads unless the merging window is closed (i.e. no merging across threads).
• I/O requests to the same block from different threads are not merged.
• The merging window (i.e. io_ok_to_send) is managed by Mbox.
• There will be no I/O requests from the Ibox.

11.8.1.2 I/O System Request

• Both I/O read (RDIO) and I/O write (WRIO) are queued in the miss address file (MAF) waiting to be system launched.
• Since WRIO has a victim-like flow, we may consider putting WRIO requests into the VAF???
• I/O read and I/O write transfers can be variable length depending on the instruction size (Quad-word, Long-word, Byte). I/O packets contain the byte mask to accommodate variable-length transactions.
• The MAF may have a maximum of 4 I/O requests (one for each thread) to be system launched at any given time. To save the MAF width, I/O-specific flags will not be kept in the MAF. Instead we have a small 4-entry buffer to store IO_mask<7:0> and IO_size<1:0>. This is possible because once we send out an I/O request, we do not need to keep those flags and data for RDIO or WRIO. Since we have one pending I/O request per thread, we use the thread ID of the pending I/O request to address the 4-entry buffer.
• For a WRIO, we store the data block in the victim data buffer (VDB). When Mbox does a WRIO, Mbox sends the request along with the physical address to the MAF (or VAF --- TBD). At the same time Mbox sends the data to the VDB. When we are ready to launch the WRIO to the system, we read the data out of the VDB. We reserve the first 4 VAF/VDB entries for WRIO data.

11.8.1.3 Others

• We victimize a cache block at fill time and each MAF system request may create a victim. In order not to stall a fill, we need to guarantee a VAF spot at fill time.
  This is implemented by stalling a new system request when the number of empty VAF entries becomes less than or equal to the number of outstanding system requests.
• The 21464 does not provide any special hardware support for WRIO handshaking. Any synchronization of I/O is done by software using memory barriers (MB).
• Ldx_L/Stx_C is not supported in the I/O address space.

11.8.1.4 I/O Request Flow

/* Allocation */
if (m%io_fill_rqst_valid && c%maf_full)
    ask Mbox to retry;
else if (m%io_fill_rqst_valid && !c%maf_full) {
    allocate MAF entry;
    write PA & flag bits;
    allocate Pending Queue entry and arbitrate for the system launch;
}

/* I/O request system launch */
if (PA<47>[to_be_sys_launched_idx] == 1) {
    if (io_rqst_type[thread_id] == 1) {        /* read I/O */
        read PA<47:6>[to_be_sys_launched_idx];
        read IO_mask<7:0>[thread_id], IO_size<1:0>[thread_id];
        send them to the router;
    } else {                                   /* write I/O */
        read PA<47:6>[to_be_sys_launched_idx];
        read IO_mask<7:0>[thread_id], IO_size<1:0>[thread_id];
        read VDB[thread_id];
        send them to the router;
    }
}

/* MAF de-allocation, no cache coherence check is necessary */
for (i = 0; i < MAX_MAF_ENTRIES; i++) {
    /* we could follow the same routine as RdBlk request if desired */
    /* this will require the marker to be sent for RDIO as well as WRIO */
    if (RDIO && have_data[i])
        deallocate MAF and PQ entries;
    if (WRIO && have_marker[i])
        deallocate MAF, PQ, VAF, and VDB entries;
}

11.8.1.5 I/O Specific Structures/Operations

• MBOX
  - Merge Buffer.
  - Memory Barrier.
  - RDIO merge.
• Interface
• CBOX
  - MAF allocation and system request: I/O requests to the same block get merged in the MAF until Mbox closes the merging window. Then the closed MAF entry bids for the system request pipe. When the I/O request is picked, we send the physical address along with the I/O mask and the MAF index to the Router. If the request is WRIO, the data block is read from the VDB and is sent to the Router. At the same time as we send the selected I/O request to the Router, we also send the thread ID of the I/O request to the Mbox so that Mbox may send a new I/O request for the thread. We can de-allocate the MAF entry when the requested BlkIO and WrIOAck are received for RDIO and WRIO respectively.
  - I/O buffer: The small 4-entry buffer, one entry for each thread, contains I/O request-specific control flags. The control flags are loaded when a new MAF entry is allocated for an I/O request and are sent to the router along with the physical address when the I/O request is picked for the system launch.

Name | #Bits | Description
Valid | 1 | May not need this ???
IO_MASK<7:0> | 8 | Byte mask
IO_SIZE<1:0> | 2 | QW, LW, BYTE
IO_RD_WR | 1 | 0: WRIO, 1: RDIO
TOTAL | 12 |

11.8.1.6 I/O System Request Timing

An I/O system request follows the same path and timing as system requests for memory space.

C1: 1. Arbitrate.
C2: 1. Set MAF Valid. 2. WR PQ.
C3A: 1. WR MAF. 2. WR I/O buffer bit.
C3B: 1. RD MAF idx and send to the MAF.
C4A: 1. Decode MAF idx.
C4B: 1. RD PA. 2. RD I/O flags and data. 3. Drive them to Router.

11.8.1.7 I/O Request Packet Format

Table 11-23 I/O Request Packet Format

Channel | Length (phit) | Command Name | Packet Format | Description
QIO | 3 | IORdBytes | PIO | I/O read with byte mask.
QIO | 3 | IORdLWs | PIO | I/O read with longword mask.
QIO | 3 | IORdQWs | PIO | I/O read with quadword mask.
QIO | 3+15 | IOWrBytes | PIO | I/O write with byte mask.
QIO | 3+15 | IOWrLWs | PIO | I/O write with longword mask.
QIO | 3+15 | IOWrQWs | PIO | I/O write with quadword mask.

11.8.1.7.1 Read I/O (RDIO) Command

System Request: RdBytes, RdLWs, RdQWs; PA<47:5>, Mask<7:0>, MAF IDX<5:0>, Processor ID.
System Response: BlkIO or NXMResp.

NXMResp indicates the request referenced a non-existent block. This command is a possible response to RdBlk, RdBlkMod, RdBlkShared, and RdIO.

11.8.1.7.2 Write I/O (WRIO) Command

System Request: WrBytes, WrLWs, WrQWs; PA<47:5>, Mask<7:0>, MAF IDX<5:0>, Processor ID, DATA.
System Response: WrIOAck or WrIONack.

11.8.2 Memory Barriers - the MB Instruction

A memory barrier retires when all prior (program order) memory references are visible to the whole system. This is managed by Mbox. Mbox retires memory barriers after prior loads have retired and after prior stores have obtained modify permission of the cache block they require. (The store may still be in the write buffer, and if an invalidate arrives, this store may never get made.) I/O writes return a 'coherency' or 'completion' message once they have been made visible to the whole system, and this coherency message must be passed on to Mbox so it can retire following MBs. Regular stores must have been retired to the Merge Buffer, and we must have obtained the ownership marker and data for the block which they write before a following MB can retire. We do not need to wait for all the coherency markers to have been received, and the cache block to be coherent, before we retire the MB. (We need to wait for coherency before we can forward or victimize the block.) All other memory transactions have implicit completion marks which Mbox sees. Loading shared data has the data as the marker; what about loads returning exclusive? CTD has the state change.

The 21464 speculatively executes ahead of memory barriers, but any speculative loads must be replayed if another processor or thread writes the location before the MB retires. If another processor writes, we will receive an invalidate probe which will cause the load to trap and replay. Cbox sends all invalidate probes to Mbox so it can trap any speculated loads in the shadow of an MB. (Not just invalidates hitting the MAF.)

11.8.3 Load-Locked/Store-Conditional (LDx_L/STx_C) Instruction Processing

The basic LDx_L/STx_C flow is:
1. Mbox executes a LDx_L instruction at retire time and loads the lock address into a TPU-specific lock register. The Cbox no longer requires that the LDx_L be forced to miss the Dcache.
2. Mbox executes a STx_C instruction at retire time, and sends a CTODSTC(X) to the Cbox if Mbox does not have ownership of block X.
3. The Cbox may find block X shared, exclusive, or invalid in the Scache, and takes different actions for each.

CTODSTC(X) finds block X shared in the Scache of Processor 0 (P0):
1. Cbox fills the block shared to the Mbox, and sends S2DSTC(X) to home.
2. At the home:
   a. P0 is on the sharing list: home sends S2DsuccessCnt() to P0 and SharedInvals to sharers.
   b. Block X is shared or invalid, and P0 is not a sharer (including the sharing mask case where P0's group is not a sharer): home sends S2DFail to P0.
   c. Block X is exclusive at some other processor: home sends S2DFail to P0.
   d. Sharing mask bit for P0 is set: home sends S2DprobCnt() to P0. Home will send SharedInvals if it receives S2Dcomplete from P0.
3. At P0: (3a corresponds to P0's action in response to the directory message in 2a, etc.)
   a. Cbox sends CTOD to Mbox when the block is coherent.
   b. Cbox sends NOOP to Mbox (SharedInval will fail the lock).
   c. Cbox sends NOOP to Mbox (SharedInval will fail the lock).
   d. If P0 has received a SharedInval, P0 sends S2Drelease to home and NOOP to Mbox. If P0 has not received a SharedInval, Cbox sends S2Dcomplete to home and CTOD to Mbox when the block becomes coherent.

CTODSTC(X) finds block X exclusive in the Scache of P0:
1. Cbox fills the block exclusive or dirty to Mbox (depending on coherence).

CTODSTC(X) finds block X invalid in the Scache of P0:

How does this situation arise? P0's Scache must have displaced block X (if P0 had instead received an inval or forward, the Mbox would have failed the lock). We cannot just fail the lock on a displacement (due to livelock). We also run into problems if we just victimize an exclusive block that the Mbox has locked, because the home will no longer send us invals and we could incorrectly succeed the lock.

When we displace a block (due to either LRU eviction or an ECB instruction) for which the Mbox has a lock, we want the home to let us know if another processor takes ownership of the block, so we can fail the lock. Thus when a processor displaces an exclusive or dirty block, we send C_DFILL_CMD_LRUVICTIM. The Mbox should check the victim PA from the Cbox against the lock registers. If there is a match, the Mbox responds to the Cbox with victim_addr_locked; the Mbox does NOT invalidate the lock registers. When the Cbox sees victim_addr_locked asserted, it sends a VictimToShared message (instead of a Victim) to the home. This message will cause the home to add P0 to the sharing list for block X, ensuring that P0's lock will get invalidated should another processor succeed its STx_C.

1. Cbox sends ReadModSTC(X) to home.
2. At the home:
   a. P0 is on the sharing list: home sends BlkExclCnt() to P0 and SharedInvals to sharers.
   b. Block X is shared or invalid, and P0 is not a sharer (including the sharing mask case where P0's group is not a sharer): home sends S2DFail to P0.
   c. Block X is exclusive at some other processor: home sends S2DFail to P0.
   d. Sharing mask bit for P0 is set: home sends BlkExclProbCnt() to P0. Home will send SharedInvals if it receives BlkExclComplete from P0.
3. At P0:
   a. Cbox sends FILLBLK to Mbox.
   b. Cbox sends NOOP to Mbox (SharedInval will fail the lock).
   c. Cbox sends NOOP to Mbox (SharedInval will fail the lock).
   d. If P0 has received a SharedInval, P0 sends S2Drelease to home and NOOP to Mbox. If P0 has not received a SharedInval, Cbox sends BlkExclComplete to home and FILLBLK to Mbox.

11.8.3.1 Lock Register for Each Thread

• Ldx_L retires when the requested data block is received.
• The lock register is set when the requested data is received, which can be before all the coherences are received.
• Mbox is responsible for clearing the lock register for an invalidate probe, a write from other threads, and an exception.
• A write from the same thread is considered legal and does not clear the lock register.
• A new Ldx_L by the same thread overwrites the lock register, causing all previous Ldx_L/Stx_C to fail.
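A minimal C sketch of the per-thread lock register behaviour described in this section follows; the names, the four-TPU count, and the 64-byte block granularity implied by PA<47:6> are stated assumptions, not the actual Mbox implementation.

    /* Per-TPU lock register for LDx_L/STx_C, as described in Section 11.8.3.1.
       Names and layout are illustrative assumptions.                          */
    #include <stdbool.h>
    #include <stdint.h>

    #define NUM_TPUS   4                       /* one lock register per thread */
    #define BLOCK_MASK (~(uint64_t)0x3F)       /* 64-byte block (PA<47:6>)     */

    typedef struct {
        bool     valid;
        uint64_t pa;                           /* locked block address         */
    } lock_reg_t;

    static lock_reg_t lock_reg[NUM_TPUS];

    /* LDx_L retires: (re)load the thread's lock register.  A new LDx_L by the
       same thread overwrites it, failing any previous LDx_L/STx_C pair.       */
    static void ldx_l_retire(unsigned tpu, uint64_t pa)
    {
        lock_reg[tpu].valid = true;
        lock_reg[tpu].pa    = pa & BLOCK_MASK;
    }

    /* Invalidate probe, write by another thread, or exception: clear the lock
       register so that a subsequent STx_C fails.  A write by the same thread
       is legal and does not clear it.                                         */
    static void lock_clear(unsigned tpu)
    {
        lock_reg[tpu].valid = false;
    }

    /* STx_C retires: the address must still match the lock register before the
       ownership checks of Section 11.8.3.2 are applied.                       */
    static bool stx_c_address_match(unsigned tpu, uint64_t pa)
    {
        return lock_reg[tpu].valid && lock_reg[tpu].pa == (pa & BLOCK_MASK);
    }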
11.8.3.2 Stx_C Issuing

• The lock register is compared with the Stx_C address when STx_C retires. If the STx_C address matches the address in the lock register:
  - In the normal mode, if the Dcache hits then the LDx_L/STx_C succeeds. If the Dcache misses, then Mbox sends a CTDSTx_C request to the MAF (needs the Scache tag launch & system launch).
  - In the conservative mode, LDx_L/STx_C succeeds if the addresses match and we have the ownership of the block. If we do not have the ownership, we request the ownership from the system. LDx_L/STx_C succeeds if the CTDSTx_C request succeeds.
• Ldx_L/Stx_C may work using a regular CTD because the home node would send us an Invalidate when it grants the ownership to another processor, and the invalidate will clear the lock register, which will cause the Stx_C to fail. But issuing a CTD for a Stx_C can cause unnecessary ownership changes and unnecessary data communication, which can have significant performance impact or possible live-lock.

11.8.4 Prefetch/Modify

11.9 IPRs, CSRs, and Error Handling

11.9.1 Required IPRs and CSRs

Cbox CSRs, with the exception of interrupt controls, contain static values that are loaded at system initialization time and do not change while power is on. They therefore do not maintain the speculative/committed protocol that is required of many control registers; software may be required to jump through hoops to write them safely and/or read them accurately.

In general, Cbox registers are accessed as memory-mapped I/O devices, with register identifiers passed along address paths, and contents along data paths (we have yet to decide which quadword of the block).

• Interrupt requests and current level or mask
  - Interrupt mask
  - Interrupt request bits
  - Queue (Read, Delete, Append)
• ECC correction reports (Scache Data, Scache Tag, Router Ports, Memories)
  - Physical Address (not useful in Router port)
  - Syndrome
  - Corrected Block (wrong if double-error)
• Memory configuration
  - Access to presence-detect EEPROM on RIMM
  - Redundant-channel enables
  - Directory state-machine controls
  - Directory/ECC initialization
  - Fairness-scan timer
  - PLL controls
  - Datasheet constants
  - Debug stream controller
  - Select debug write mode
  - Current debug write pointer
  - Debug data read enable
• Router configuration
  - This node number, first NXM
  - Output port to each node
  - Sharing mask for each node
  - Output port for each mask bit
  - ECC correction reports
  - PLL controls
  - Virtual-channel buffer thresholds (depend on link latency)
• Diagnostic control and access
  - Force Scache hit or miss, per set
  - Examine tag
  - Examine selected MAF entry
• Debugging/performance-analysis controls and logs
• Optimization enables:
  - LDx_L can issue RdBlkMod
  - Read/InMemory returns data Exclusive
  - Migratory data prediction
  - Timeout before forwarding ownership
  - Purge controls
  - Uniprocessor (no directory accesses required)
  - Small MP (mask bits uniquely identify processors)

11.9.2 Error Handling

11.9.3 Cbox Deadlock Avoidance Mechanisms

11.10 Profiling Support

11.11 Stuff From Original Cbox Spec Not in Outline (That I can see anyway....)

Were H1's 10.6 through 10.15 and are now all H2's. Last section was 10.20.9 Cbox Mechanisms.

11.11.1 Scache Index (paddr<18:6>) Conflict

The Scache index can be a hash function of the physical address. But the hashing introduces extra delay in the critical path.
Our current thinking is that the potential performance improvement does not justify the extra complexity.

In contrast to the 21264 and the 21364, the 21464 allows the Scache to service multiple system requests to the same Scache index concurrently.

• The Scache index conflict is resolved by victimizing the LRU cache block at system fill time (i.e. Blk* or InvToDirtyRespCnt).
• The Scache fill schedules both the read and write pipelines and uses the read pipe to extract the victim just ahead of the fill data write.
• Current proposal:
  - Cbox reads the victim block out of the Scache, performs ECC correction of the victim data, and writes the corrected victim data to the VDB.
  - Cbox does not send the victim data to the Mbox.
  - If the merge buffer has the cache block modified, the merge buffer overwrites modified bytes in the VDB using the byte write capability of the VDB. The merge buffer should write the VDB after the ECC-corrected victim block has been written to the VDB.
• Cbox rejects the write-throughs to the banks that have in-flight system fills (i.e. possible stale victim) until the ECC-corrected data gets written to the VDB. This can be implemented using the bank conflict check.
• In order not to stall system fills, we need an available victim buffer slot before launching a system fill request (Read*Req or InvToDirtyReq): Outstanding system requests to the memory space < available VAF/VDB entries for memory space (4 slots are reserved for WrIO) - 4 (reserved for Probes).
• Mbox is responsible for victimizing a cache block in the Mbox.
• We decided against the proposal to victimize the LRU block at the Scache miss time as well as at the system fill time.
  - Pros: Allows the victim process to start early (no need to wait until the fill). Reduces the probability of stalling the MAF system fill request pipe.
  - Cons: If a block that has been victimized at the miss time is referred to again, then we have to send the victim data to the home and bring the data back in again, which may result in increased network traffic. A MAF entry may cause two victims, which may require 2X VAF entries. Incompatible victim/fill path.

11.11.2 ShrToDirty[STC]Req

Cbox sends a ShrToDirty[STC]Req to the home node when it needs the write permission of a Shared block for a non-speculative store (i.e. merge buffer write). Cbox does NOT send a ShrToDirtyReq for a speculative store.

When the home memory receives a ShrToDirty[STC]Req:

• If the cache block is not owned by another processor, i.e. the directory state is either InMemory or Shared, and the requester is a sharer, the directory sends ShrToDirtySuccessCnt to the requester and SharedInval to other sharers.
• If the directory state is InMemory or Shared but the requester is not a sharer, the directory sends BlkExcl to the requester and SharedInval to sharers in response to the ShrToDirtyReq. The directory sends ShrToDirtySTCFail in response to a ShrToDirtySTCReq.
• If the directory state is RemoteExcl, the directory sends the ReadModFwd to the current owner. However, the directory does NOT forward a ShrToDirtySTCReq and sends a ShrToDirtyFail to the requester.
• If the directory state is SharedMask, the directory node can't tell whether the requester is a sharer or not even though the requester is in the sharing mask:
  a. A sends ReadModReq(X).
  b. B sends ShrToDirtyReq(X).
  c. The directory receives the ReadModReq(X) from A and sends SharedInval(X) to node B.
  d. B invalidates the shared block and sends an InvalAck to node A.
  e. The directory receives Victim(X) from node A.
  f. C sends ReadShrReq to the directory.
  g. The directory processes the ReadShrReq from C. If A belongs to the same group as C, A becomes a sharing node even though it does not have the block.
  h. When the directory processes the ShrToDirtyReq from A, the directory does not know whether A is a true sharing node or not.
• For a ShrToDirtyReq when a sharing mask is used, the directory optimistically succeeds the ShrToDirtyReq and sends a ShrToDirtySuccessCnt to the requester. The requester is responsible for resolving the ShrToDirtySuccessCnt. The appropriate action for the requester, if the requester does not have the shared copy in its Scache, is to do a VictimClean followed by a ReadModReq.
• To avoid a dead-lock, for a ShrToDirtySTCReq when a sharing mask is used, the directory sends a ShrToDirtyProbCnt to the requester. If the ShrToDirtySTCReq succeeds, the requester sends a ShrToDirtyComplete to the directory and then the directory sends SharedInval to sharers. If the ShrToDirtySTCReq fails, the requester sends a ShrToDirtyRelease to the directory.
• If the directory receives a *Req from the current owner, it means that the victim is on its way to the directory. The directory must wait for the victim. After the directory receives the victim, it sends a response to the requester.
• The EV7 protocol does not forward a ShrToDirty[STC]Req to the current owner. To reduce the latency, the 21464 forwards ShrToDirtyReqs to the current owner but not ShrToDirtySTCReqs.
• If the requester is not a sharer and the directory state is not RemoteExcl, then the DIFT sends BlkExclCnt to the requester.
• If the requester is not a sharer and the directory state is exclusive, the DIFT sends a ReadModFwd to the owner.
• This proposed scheme does not break the assumption that Cbox will not receive a Blk* if the Scache has a copy of the cache block.

ShrToDirtyResp:
• ShrToDirtySuccess.
• ShrToDirtyProb.

11.11.3 Scache Tag Launch Pipe

Table 11-24 shows the Scache block state.

Table 11-24 Scache Block State

Valid | Shared | Dirty | State | Coherent | Ownership | Mbox Can Write | Victim
0 | x | x | Invalid | x | No | No | None
1 | 0 | 0 | ExclClean | x | Yes | No | VictimCln or VictimClnToShr
1 | 0 | 1 | Dirty | 0 | Yes | No | Victim or VictimToShr
1 | 0 | 1 | Dirty | 1 | Yes | Yes | Victim or VictimToShr
1 | 1 | 0 | Shared | x | No | No | None
1 | 1 | 1 | Must not happen | | | |

We have separate pipelines for the Scache tag launch and the write-through operations. In case of an Scache bank conflict, we must stall one pipeline. For performance reasons, we prefer to stall the write-through pipeline.

The Scache pipe is arbitrated with the following priority:
1. Blk*/InvToDirtyRespCnt from the system, or system fill hiccup recovery.
2. Replays (from the retry queue): Scache bank conflicts, Scache tag ECC error, Scache data ECC error.
3. Internal probes from the internal probe queue.
4. Probes from the PRQ.
5. The system fill early warning.
6. New Misses from the Pre-MAF.

System fills (Blk* and InvToDirtyRespCnt) may cause victims:
• Send an early warning (fill address) to Mbox if the early warning wins the Scache pipe arbitration. Otherwise, we do not send the early warning, incurring extra latency in the Scache miss retry pipe.
• Take 2 cycles of the Scache tag pipe.
• To minimize the cost, we'd like to avoid skidding or bypassing of the system fill:
  - The system fill consists of 2 Scache tag pipes: one for victim extraction and the other for the fill.
  - Launch a system fill at even cycles to avoid the bank conflict with the previous system fill.
  - A system fill can have a bank conflict with an Scache access which is already in the pipe (i.e. 2 cycles earlier). Then the conflicting Scache access gets preempted and must be retried.

Scache retry pipe:
• Any Scache access except for a system fill (Blk*) may get retried due to:
  - Scache tag ECC error.
  - Scache data ECC error.
  - Bank conflicts.
• The retry queue entry contains:
  - 6-bit MAF index.
  - 6-bit VAF index.
  - Command type.
• Need 2 - 3 write ports since the retry may come from an Scache bank conflict, Scache tag ECC, or Scache data ECC.
• The retry does not CAM the MAF again, but it reads the PA from the MAF.

Internal probe queue:
• 64-entry FIFO.
• The timer queue is loaded when:
  - Invalidate: received a Shared block from the system and MAF.mb_retired = 0 & MAF.inval_seen = 1.
  - Victimize: received ownership from the system, the block is coherent, and MAF.victimize = 1 | MAF.vic_to_shr = 1.
  - Received EvictBlk or CleanBlk requests from Mbox.
• If Cbox has a pending internal probe which invalidates the cache block, Mbox must not retire a MB. Cbox sends the per-thread signal (i.e. cs%mb_ret_tpu_c4a_h) to Mbox indicating whether Mbox can retire the MB.

ShrToDirty*Cnt from the PRQ:
• A ShrToDirty*Cnt can be retried due to an Scache tag ECC error, Scache data ECC error, or Scache bank conflict.
• A ShrToDirty*Cnt and a subsequent Probe to the same block must be processed in order.
  - Problem: ShrToDirty*Cnt(X) ---> Tag ECC error ---> Probe(X) ---> ShrToDirty*Cnt(X) retry.
  - Probes to a block that has an in-flight Scache transaction (MAF.sc_inflight) stall in the PRQ until the conflicting Scache transaction passes the retry point.

Misses check the Scache to see if the requested cache block is in the Scache with the required state.
• If the Scache has the requested block for requests:
  - Fill the D-fill buffer and/or I-fill buffer.
  - Update the LRU bits in the Scache tag.
  - If the Scache has a Shared block for an ownership request, then make a ShrToDirtyReq request to the system.
• If the cache block doesn't exist in the Scache, then the MAF makes a Read*Req to the system.
• Since I/O blocks are not cacheable, I/O requests do not need the Scache tag lookup.

Scache bank conflicts. Scache tag ECC error or Scache data ECC error:
• We store the ECC-corrected tag (data) in the Scache Tag (Data) ECC register and replay the request.
• The replayed request takes the tag (data) from the Scache Tag (Data) ECC register.
• The ECC registers get cleared if we modify the Scache tag in the same Scache index (i.e. probe or fill).
• We write the corrected tag back into the tag array after the retry accesses the corrected tag, using the tag write cycle.
• We do not write corrected data back to the Scache. If the error block gets referred to many times, we need software intervention or to evict the block.

Arbitration among Miss requests is done in the Pre-MAF to:
• Reduce the Dcache miss retry latency.
• Make accesses to the Scache follow the program order more closely.

Scache Tag Request Command

Table 11-25 Scache Tag Request Command

Scache pipe Commands | cs%st_req_cmd_c3a_h<3:0> | Notes
Bank conflict, WrIOAck, BlkIO | C_ST_CMD_NOOP | No need for the Scache tag.
Ifetch, FetchLine, PfetchLineMod, PrefNocache, PrefScache | C_ST_CMD_MISS | Look up the tag and update the LRU.
FetchLineMod, CtoD[STC], ItoD | C_ST_CMD_SETDIRTY | Set the dirty bit if the specified block is ExclCln.
LRUEvict | C_ST_CMD_LRUEVICT | Victimize the least recently used block from the same Scache index as the specified cache block. The victim set number is used for the fill block in the following cycle.
Victimize, ReadModFwd, InvToDirtyFwd, SharedInval | C_ST_CMD_INVAL | Invalidate the specified cache block.
VictimToShr, FetchFwd, ReadShrFwd, ReadFwd | C_ST_CMD_C1DS | Make the specified cache block Shared.
ShrToDirty*Cnt | C_ST_CMD_S1DD | Make the cache block Dirty if shared.
BlkShared | C_ST_CMD_BLKINV | Invalidate the set victimized in the previous cycle.
BlkShared | C_ST_CMD_BLKSHR |
BlkExclCln | C_ST_CMD_BLKEXCL | Fill block. Use the Scache set victimized in the previous cycle.
BlkExclCln, BlkDirty, InvToDirtyRespCnt | C_ST_CMD_BLKDIRTY |

Also see the Scache tag state transition table, Table 11-12.

11.11.4 Probe Processing in Cbox

See Section 6.5.2.
• We do not write corrected data back to the Scache. If the error block gets referred many times, we need software intervention or evict the block. Arbitration among Miss request is done in the Pre-MAF to: • Reduces the Dcache miss retry latency. • Accesses to the Scache follow the program order more closely. Scache Tag Request Command Table 11-25 Scache Tag Request Command Scache pipe Commands cso/ost_req_cmd_c3a_h<3:0> Notes Bank conflict, WrIOAck, BlkIO C_ST_CMD_NOOP No need for Scache tag. Ifecth, FetchLine, PfetchLineMod, PrefNocache, PrefScache C_ST_CMD_MISS Look up the tag and update the LRU. FetchLineMod, CtoD[STC], ltoD C_ST_CMD_SETDIRTY Set the dirty bit if the specified block is ExclCln. LRUEvict C_ST_CMD_LRUEVICT Victimize the least recently used block from the same Scache index as the specified cache block. The victim set number is used for the fill block in the following cycle. Victimize, ReadModFwd, InvToDirtyFwd, Sharedinval C_ST_CMD_INVAL Invalidate the specified cache block. VictimToShr, FetchFwd, ReadShrFwd, ReadFwd C_ST_CMD_C1DS Make the specified cache block Shared. ShrToDirty*Cnt C_ST_CMD_S1DD Make the cache block Dirty if shared. BlkShared C_ST_CMD_BLKINV Invalidate the set victimized in the previous cycle. BlkShared C_ST_CMD_BLKSHR BlkExclCln C_ST_CMD_BLKEXCL Fill block. Use the Scache set victimized in the previous cycle. BlkExclCln, BlkDirty, InvToDirtyRespCnt C_ST_CMD_BLKDIRTY Also see the Scache tag state transition table, Table 11-12. 11.11.4 Probe Processing in Cbox See Section 6.5.2 Compaq Confidential 11-60 Second-Level Cache and Controller (Cbox) 5 Jc1nuary 2001 ···Subject To Change Stuff From Original Cbox: Spec Not in Outline 11.11.5 Order Dependency Scache access to the same cache block Table 11-26 Scache Access Order to the Same Cache Block MAF states Miss from Mbox miss_inflight_in_sc probe_inflight_in_sc Reject the miss request. Reject the miss request [l]. Reject the miss request [2]. Stall the probe queue [3]. Stall the probe queue [4]. *Fwd from PRQ lnvalAck/Timer Expiration from PRQ fill_inflight_in_sc Stall the probe queue [2]. Stall the probe queue [6]. ShrToDirty*Cnt from PRQ Must not happen [5]. Blk*/12DRespCnt from system Must not happen [5] Notes: • [1]: If a ShrToDirty*Cnt (or InvToDirty*Cnt) is in the retry queue, the Miss will see the stale Scache tag and we may send a system request even though we have the ownership of the block. This will break the cache coherence protocol. • [2]: Stale fill data access: There are 6 - 8 cycle separation between the Scache tag update and the Scache data update for a system fill. A Miss(X) from Mbox gets rejected if we have a Blk*(X) in-flight in the Scache to prevent the Miss from accessing the stale fill data. A probe(X) to a stale fill block gets stalled in the PRQ. A Blk* must not get evicted before the fill data gets written into the Scache. We have a stale fill table to prevent the Scache tag control from evicting a stale fill block.. • [3]: The probe can proceed even if there is a miss request in-flight in the Scache . But to simplify the design, we will stall the probe queue. • • [4]: Probes to the same cache block must be serviced in order. [5]: The miss request gets merged and do not enter the Scache pipe if we have an outstanding system request for the cache block. 
  If Cbox gets a Load miss request from Mbox to a cache block which has an outstanding ShrToDirtyReq to the system, Cbox will not fill the shared block to Mbox until it receives a response for the ShrToDirtyReq, even though the Scache has a shared copy. This case happens if the shared block which was filled at the time when Cbox made the ShrToDirtyReq has been evicted from Mbox. In addition, the load can't retire until the store which needs the ownership gets retired, so we believe the performance impact will be minimal. Alternatively we could allow the load miss request to enter the Scache pipe even if we have an outstanding ShrToDirtyReq to the system. However, this may add significant complexity.
• [6]: This is not required to be functionally correct. But to simplify the de-allocation of a MAF entry, we do not allow changing the MAF states if the MAF entry has an in-flight Scache transaction.

11.11.6 Possible Race Conditions and Other Concerns

• For the system launch, or system fill, we read out the PA and control flags. We must not change control flags while we try to read them out:
  - Read the physical address in the B-phase.
  - Change control flags in the B-phase.
  - Read control flag bits a phase later in the A-phase.
• We must not fill the I-fill buffer twice for a MAF entry since the I-FB entry may get recycled.

11.11.7 Cbox Mechanisms

• IO write 'coherency mark' to Mbox.
• All Invalidate or ForwardInvalidate probes to LQ.

12 Cache Coherence Protocol Processing

The 21464 adopts the 21364 cache coherence protocol with small enhancements. The protocol is a directory-based CC-NUMA protocol and tolerates out-of-order channels except for the I/O channel, thereby supporting adaptive packet routing.

12.1 Introduction to the Protocol

The coherence protocol is the mechanism by which large numbers of processors maintain a consistent image of the contents of memory, as required by the Alpha SRM. Small-scale multiprocessors like Turbolaser maintain consistency by monitoring a bus, so that all caches observe all memory transactions; this approach does not work well for larger numbers of processors because the bus becomes physically too large to be fast, and because the number of external transactions that each cache must process grows with the number of processors. Mid-scale systems like Wildfire depend on a central switch to impose the same order as a bus, and require that messages be kept in order along their communication paths. The directory serves as a filter to minimize the traffic to any particular node, but maintaining order is costly and inefficient, and limits the scalability of these systems.

Larger systems, such as the 21464, try to avoid dependence on any single resource for a number of reasons, including reliability and load-distribution. Further, they use nondeterministic routing, to make the best use of available network resources. This means that two messages can take different paths and get out of order, even if they start and end at the same nodes. The protocol is designed to ensure that all processors which cause and/or observe changes in memory see those changes occur in the same apparent order, even though the messages between processors and memories may get out of order along their way.
The order observed by all processors is the order in which requests are serviced in their home memory, and in particular, in the DIFT, a control module in the memory controller. Caches communicate with the DIFT as they manipulate memory data, and the DIFT delays multiple requests for any individual block until it has coordinated previous requests with any caches affected by those requests.

There are two major states in which caches hold data (each with a number of minor variants): Shared, meaning that any number of processors can read the data, but none can write it, and Exclusive, meaning that there is exactly one processor with a valid copy, and that processor is permitted to read or write it. The so-called Invalid state means that there is no valid copy of the block in any cache; the name does not refer to the validity of the block in memory. References to Invalid are being replaced by Local, which is more accurately descriptive.

The protocol, as managed by the DIFT, is concerned with the transitions between states, and with performing the transitions in such a way that as much of the communication latency as possible is kept out of the critical paths. Whenever a block is held Exclusive by some processor (which we refer to as the owner), and another processor needs access, the protocol requires the DIFT to tell the owner to give up Exclusive state, and the owner to report to the DIFT when it has done so. Until the DIFT hears back from the owner, the DIFT does not process other requests for the block. On the other hand, when a block is held Shared by some processor(s), the DIFT can permit any number of other processors to read the block, until one needs write access. At that time, it notifies all processors which have read the block to invalidate it in their caches, and they report to the new owner, rather than the DIFT, when they have done so. The new owner must not release exclusive access until it receives acknowledgement that all previously-existing copies of the block have been invalidated.

During the transitions when Exclusive access is passed from one processor to another, there can be periods during which two processors believe that they have exclusive access; in some circumstances it is possible for the "second" processor to complete its writes and send a victim block to the memory which arrives before that sent by the first writer. This rare event is called a dual-victim race, and is sorted out by special rules in the DIFT (see Section 12.4).

The memory system is designed with the expectation that a disproportionate fraction of the memory traffic produced by any processor will be addressed to its own local memory; this is true for most multiprocessor applications, though precisely how much is highly application-dependent. We use this fact, and the on-chip communication between a cache and its local controller, to optimize references to the local memory.

The directory cache optimizes the directory accesses for requests both from local and remote processors. The on-chip directory cache stores the directory information of the most frequently used cache blocks to minimize memory accesses for directory information. Requests from the local Cbox as well as remote processors update the directory, thereby eliminating the need for the LPR.
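The per-block serialization performed by the DIFT can be pictured with the following sketch; the table size, layout, and helper name are illustrative assumptions rather than the memory controller's actual structure.

    /* Sketch of the DIFT rule that only one transaction per cache block is
       active at a time: a new request is accepted only if no in-flight entry
       matches its block address, and it is recorded until completion.        */
    #include <stdbool.h>
    #include <stdint.h>

    #define DIFT_ENTRIES 64                    /* illustrative size only */

    typedef struct {
        bool     valid;
        uint64_t block_pa;                     /* block address, PA<47:6> */
    } dift_entry_t;

    static dift_entry_t dift[DIFT_ENTRIES];

    static bool dift_accept_request(uint64_t block_pa)
    {
        for (int i = 0; i < DIFT_ENTRIES; i++)
            if (dift[i].valid && dift[i].block_pa == block_pa)
                return false;                  /* delay: block already in flight */

        for (int i = 0; i < DIFT_ENTRIES; i++) {
            if (!dift[i].valid) {
                dift[i].valid    = true;       /* record until the transaction   */
                dift[i].block_pa = block_pa;   /* completes and is removed       */
                return true;
            }
        }
        return false;                          /* DIFT full: request must wait   */
    }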
12.2 Structures that Maintain the Cache Coherence

The Cbox maintains the cache coherence with the following structures:

• Miss address file (MAF)
• System request pending queue (SRQ)
• Victim buffer
  - Victim address file (VAF)
  - Victim data buffer (VDB)
• Probe queue (PRQ)
• Directory In-Flight Table (DIFT)
• Scache tag array (Sbox) (not described)
• Scache data array (Sbox) (not described)

12.2.1 Miss Address File (MAF)

The Scache is a non-blocking cache, which means that it continues to accept new requests while waiting for responses from the system resulting from previous misses. In order to associate those responses with the original requests, and to know how to process each response, the Cbox records each request in the MAF as processing begins, and compares each request address against the addresses of all outstanding requests to detect multiple requests to the same block and avoid conflicts.

The MAF contains:
• Miss requests from Ibox/Mbox, both to local addresses and to remote addresses.
• Probes that are in-flight in the Scache pipe.
• Coherence states that are associated with the outstanding request.

12.2.2 System Request Queue (SRQ)

Requests which miss in the Scache must be serviced by memory, either local or remote, but the paths to memory cannot accept requests as rapidly as they can be generated. The system request pending queue (SRQ) stores the MAF index of requests which are waiting to be sent to memory. As buffers become available, the SRQ arbitrates among pending requests, round-robin among threads, then FIFO within a thread, to select which request to launch next.

The current proposal is to do away with the thread round-robin since Mbox does consider thread fairness for load retry. Hence the system request pending queue is a simple 64-entry FIFO queue.

12.2.3 Victim Buffer

The Scache retains most blocks until the space they occupy is needed for another block. If a block is not held exclusively in this cache at the time it is evicted, then it is simply overwritten, but if it is in exclusive state, then the directory must be notified that this cache is releasing exclusive access, and if the block is dirty, it must be written back to memory. Rather than delaying the fill which overwrites this block, the Scache moves the old contents to the victim buffer, where it waits before being sent.

The victim buffer contains:
• Victim, VictimToShr, VictimCln, and VictimClnToShr, which can be sent to the home memory. To minimize the number of *Forwards, EV8 is considering adding the Purging option to the protocol, where Cbox gives up the ownership of the cache block voluntarily and sends Victim[Cln]ToShr to the remote directory. Notice that if the cache block is in the local memory space, there is no benefit of purging. EV8 decided against the purging option since performance simulation indicated the purging provides no significant benefits.
• Responses to the home memory:
  - VictimAckShr, VictimAckExcl, FwdAckShr, or FwdAckExcl to the remote memory or to the local memory.
  - FwdMiss to a remote or local memory.
InvToDirtyRespCnt to a remote requester. InvalAcks to be sent to a remote requester. 12.2.4 Probe Queue (PRQ) The Scache must service several kinds of requests from other nodes in the system. For some of these requests, it is important that probes to the same cache block be serviced in the same order in the Scache as they are in the DIFT, so they are all stored in the probe queue, called the PRQ, while awaiting service in the Scache. The protocol does not mandate keeping the same order of the probes in the Scache as in the DIFT. The probe queue is a simple 24-entry FIFO. The PRQ contains: • *Forward: Forwards from remote and local directories. • Response without a fill data. ShrToDirtySuccessCnt. ShrToDirtyFail. ShrToDirtyProbCnt. InvalAck. - WrloAck. - WrloNack. NXMResp/ERRResp. • Since the InvalToDirtyRespCnt must allocate a Scache set, it takes the same path as the fill block (i.e. Blk*) taking two Scache cycles. Since the InvToDirtyResp does not accompany the data, the Scache has an invalid data until the merge buffer completes the write-through. It will be functionally incorrect if Ifecth grabs the data out of the Scache before the merge buffer writes the cache block. There are two proposals: Option 1 (current favorite): Have Ifecth probe the merge buffer. If the merge buffer has the block, then notify the Ibox and retry the !fetch. Option 2: always bypass the InvToDirtyResp straight to the victim buffer and have the merge buffer write the cache block to the victim buffer. This scheme prevents from filling Icache with the invalid cache line but will require victimizing the cache block and re-requesting the block. Since we often read the cache block after store, this scheme has some performance impact. • Remote requests, Local requests, and local victims no longer come to the probe queue for the new EV8 protocol. • To avoid a deadlock problem, the PRQ takes: Any message if there are more than one free entries. Only a Response from the router if there is exactly one free entry. No transaction if there is no free entry. 12-4 Compaq Confidential Cache Coherence Protocol Processing 5 J(1nuary 2001 -~Subject To Change Overview of the Cache Coherency Protocols • 2-4 VAF entries are reserved for probes. 12.2.5 DIFT The memory controller may have a large number of requests in progress at any moment, some being serviced in the local memory, others awaiting responses from remote nodes. All requests are recorded in the DIFT as soon as they arrive, and removed when the transaction is complete. For a any given cache block, there is one active transaction at any given time. In other words, the DIFTwill not process another request to the same cache block until the current transaction is complete. 12.3 Overview of the Cache Coherency Protocols 12.3.1 Comparison Between 21363 and 21464 Cache Coherence Protocols Table 12-1 summarizes the differences between the 21363 and 21464 cache coherency protocols. Table 12-1 Comparison Between 21364 and 21464 Cache Coherence Protocols Assumptions 21364 21464 Have multiple outstanding system requests to the same Scache index. No Yes A dirty block can't be sent in response to a request from an 1/0 device. Yes This requires a snarf (i.e. VictimAckExcl) when a modified cache block is to be sent to an 1/0 device. Yes Cbox may send a VictimToShr or a VictimClnToShr to its local memory. No Yes 1 A successful ReadMod, ShrTodirty[STC], or InvToDirty never results Yes in a VictimCln or a VictimClnToShr. 
The current owner sends a BlkExclCnt(0) to the requester and a VictimAckExcl to the home in response to a ReadFwd when the block is dirty. Otherwise, it must send a Shared block to the requester and a FwdAckShr to the home. | Yes | No3
The directory doesn't have the requester for a ShrToDirtyReq as a sharer and no one, including its own Cbox, owns the block exclusive. | ShrToDirtyFail --> requester. | BlkExclCnt --> requester.
To simplify the DIFT design, Cbox can't respond with a FwdAckExcl in response to a forwarded request that originated from an I/O device. | Yes | No4
Cbox never responds FwdAckExcl to any ReadForward3. | Yes | No
Send a BlkExclCnt to the requester and a VictimAckExcl to the directory in response to a ReadReq or ReadFwd if the block is modified. | Yes | No3
Local memory references update the directory state5. | No | Yes
ShrToDirtyReqs get forwarded as ReadModFwds. | No | Yes

1 In order to support the CleanCacheBlk instruction, where the processor wants to give up the ownership voluntarily but keeps a Shared copy, Cbox must be able to do a VictimToShr or a VictimClnToShr. The Scache will send VictimToShr or VictimCleanToShr if the block addressed by CCB is in the dirty or exclusive-clean state, respectively (it will send nothing if there is a miss or the block is in the shared state).
2 For the 21464, a BlkExcl is filled as BlkDirty if the MAF state indicates a merge buffer write which is non-speculative and the block is coherent. Otherwise, it is filled as an ExclClean block and Mbox is not allowed to write the cache block. Mbox will send CtoD to Cbox to gain write permission.
3 The 21464 is considering a selective migration option where an exclusive block can be passed in response to a ReadFwd. But it is unlikely that the 21464 will adopt the 21364 migration scheme. The current default is to send BlkShared to the requester and a VictimAckShared to the directory.
4 This restriction is imposed on the 21364 to avoid a read-modify-write of the directory for a VictimClean from a new owner (if the FwdAckExcl comes after the VictimClean, then the DIFT entry must do Read - Write - Read - Write, which makes the DIFT state machine complicated). Currently, the 21464 is planning to have a one-to-one correspondence between the DIFT and the fill buffer. Hence, the VictimClean case above results in Read - Write - Write, which is identical to a Victim.
5 For the 21364, local references do not update the directory, to minimize the memory access for directory updates. Instead the cache coherence is maintained by forcing all remote requests to probe the local caches. EV8 has an on-chip 256KB directory cache to store the directory states of the most frequently used cache blocks. The directory cache significantly cuts the memory access for both local and remote requests. Thanks to the directory cache, the 21464 local references update the directory, thereby eliminating local probes.

12.3.2 Onchip Directory Cache

The 21464 has an onchip directory cache:
• Unlike the 21364, local references DO change the directory states, thereby eliminating the need for local Cbox probes.
• No local victim and local request race.
• The DIFT is responsible for keeping order between requests, both local and remote. So local *Req, remote *Req, and local Victim* do not come to the probe queue.

12.3.3 Coherence Messages are Split into Three Types

• Requests (*Req)
  All requests come to the home node and the home node is responsible for either responding to the requests or forwarding the request to the current owner, including its local Cbox.
• Forwards (*Fwd) from home nodes
  Cbox receives *Forwards both from remote directories and the local directory. Cbox may not receive another Forward for the same cache block until Cbox sends a response to the previous Forward to a cache block, because the directory does not process another request to the same cache block until it receives a response from the previous owner for the *Forward. For a *Forward, Cbox forwards the cache block to the requester and sends a response to the directory. However, for the following cases, Cbox sends a FwdMiss to the directory:
  - The cache block is not coherent:
    - Cbox did not receive the Blk* (i.e. the exclusive block is on its way to us).
    - Cbox did not receive all the InvalAcks from sharers.
12.3.3 Coherence Messages are Split into Three Types • Requests (*Req) All requests come to the home node and the home node is responsible for either responding to the requests or forwarding the request to the current owner including its local Cbox. • Forwards (*Fwd) from home nodes Cbox receives *Forward both from remote directories and the local directory. Cbox may not receive another Forward for the same cache block until Cbox sends a response to the previous Forward to a cache block because the directory does not process another request to the same cache block until it receives a response from the previous owner for the *Forward. For *Forward, Cbox forwards the cache block to the requester and sends a response to the directory. However for the following cases, Cbox sends a FwdMiss to the directory: Cache block is not coherent Cbox did not receive the Blk* (i.e. The exclusive block is on its way to us). Cbox did not receive all the InvalAcks from sharers. 12-6 Compaq Confidential Cache Coherence Protocol Processing 5 Jtmuc1ry 2001 - Subject To Change Protocol Races For theses cases, Cbox victimize the cache block when the cache block becomes coherent. The Scache doesn't have the block (i.e. the victim block is on its way to the directory). • Responses 12.4 Protocol Races • Victim races Remote Victim race Cbox does not have to maintain the ordering between a request to a remote memory and a remote victim. Remote victim races are handled by the directory. If the directory receives the *Req before the Victim*, since the *Req is from the current owner, the DIFT knows that the Victim* is on its way and waits for the victim before servicing the *Req. Local victim race Since local requests update the directory, the directory knows the current owner, including the local processor, of a cache block. Cbox no longer has to maintain the ordering between local requests and local victims. Dual Victim race A Victim* from the new owner arrives before the forward acknowlegement arrives from the previous owner in response to the *Forward. The DIFT is responsible for resolving the race. • Early Forward race A forward arrives while there is an outstanding system request or not all InvalAck's have been received. The early forward race can happen due to either: The exclusive block is in its way to us and the *Forward is to the the yet-to-be received block. We victimized the block. The directory forwarded a request to us. Before we received the *Forward we send the system request asking for the cache block. Send a ForwardMiss to the home memory and victimize the cache block when the cache block arrives or becomes coherent. • Late Forward race A *Forward arrives after we victimized the cache block. We simply send a ForwardMiss to the home memory for this case. • Early InvalAck race An Invalack arrives before the *Cnt arrives. Like the 21364, EV8 has the coherence counter which can keep count of early InvalAck's. • Early Sharedlnval race A Sharedlnval arrives while we have the outstanding Read[Shr]Req. The early Sharedlnval race happens due to one of the following: 5 January 2001 ·- Subject To Change Compaq Confidentia I Cache Coherence Protocol Processing 12-7 Probe Processing The Shared block is in its way to us and the Sharedlnval is to the returning block. The Sharedlnval is to the previous copy of the block we already displaced from the Scache. The Sharedlnval is to the previous copy of the block which has been invalidated before. 
When the sharing mask is used, we can receive the Sharedlnval if another processor which shares the same sharing mask bit is invalidated even though we do not have the block. When the Shared block arrives, we must not blindly discard the block to avoid potential live-lock problem. If the fill block is after the MB retire, we must discard the block and re-send the system request. If the fill block is before the MB retire, we fill Mbox with the block and invali- date the block. • Wrong SharedToDirtySuccess race A SharedToDirtySuccess finds no shared copy in the Scache. The wrong SharedToDirtySuccess race happens due to either: The shared copy has been invalidated. But another processor which shares the sharing mask sends a Read[Shr]Req thereby setting the sharing mask bit. Hence when the directory receives a ShrToDirtyReq and the sharing mask is used, the directory doe not know whether the requester is a true sharer or not. The directory optimistically succeeds the SharedToDirtyReq and the the requester is responsible for resolving the problem by doing VictimClean first them sending a ReadMod. The shared copy has been displaced out of the Scache. We bypass the ShrToDirtyResp directly to the victim buffer and extract the merge buffer into the victim buffer thereby making forward progress. We can distinguish two cases by recording Sharedlnval in the MAF. In other words, if we receive a ShrToDirtySuccess and the MAF.inval_seen bit is set, then we know the ShrToDirtyReq was incorrectly granted. To avoid live-lock, the directory can't not optimistically succeed a SharedToDirtySTCReq when the sharing mask is used. Instead the directory sends a ShrToDirty Prob to the requester but do not send Sharedlnval to sharers. If the SharedToDirtyProb succeeds, the requester sends a ShrToDirtyComplete to the directory and the directory sends Sharedlnval to sharers. If the SharedToDirtyProb fails, then the requester send a SharedToDirtyRelease to the directory. 12.5 Probe Processing • Probe Pipeline Stages • Unlike the 21364 protocol, the directory sends a Sharedlnval to its local Cbox since remote requests (*Req) no longer come to Cbox. • Responses without a fill block from the PRQ : Changes the states of corresponding MAF entries. Updates the Scache tag. 12-8 Compaq Confidential Cache Coherence Protocol Processing 5 Jmmary 2001 ·-Subject To Change Probe Processing Unlike a Blk*, it can be replayed if it encounters a Scache tag ECC error or a Scache bank conflict. • A ShrToDirtySuccessCnt does not find the shared cache block in the scache. The previous proposal was to do a VictimClean and send a ReadModReq if the request was not StxC. If the request was for a StxC, then we would send a StxFail message to Mbox. However, this creates a potential live-lock. The current proposal If the MAF.inval_seen bit is set: Do VictimClean and send a ReadModReq if the request was not StxC. If the MAF.inval_seen bit is not set: This means the shared block has been displaced from the Scache. Send the StxCSuccess to Mbox. Extract the merge buffer entry to the VDB. Send Victim to the home node. Cbox sends a ReadShrReq if it has a pending Ifetch to the cache block. • A ShrToDirtyProbCnt does not find the Shared block in the Scache. If the MAF.inval_seen bit is set: This means the shared block was invalidated. Send a ShrToDirtyRelease to the home node. If we have pending non StxC request, send a system request. If the MAF.inval_seen bit is not set: This means the shared block has been displaced from the Scache. 
12.5 Probe Processing

• Probe Pipeline Stages
• Unlike the 21364 protocol, the directory sends a SharedInval to its local Cbox, since remote requests (*Req) no longer come to Cbox.
• Responses without a fill block from the PRQ:
Change the states of the corresponding MAF entries.
Update the Scache tag.
Unlike a Blk*, such a response can be replayed if it encounters a Scache tag ECC error or a Scache bank conflict.
• A ShrToDirtySuccessCnt does not find the shared cache block in the Scache.
The previous proposal was to do a VictimClean and send a ReadModReq if the request was not a StxC. If the request was for a StxC, we would send a StxCFail message to Mbox. However, this creates a potential live-lock. The current proposal:
If the MAF.inval_seen bit is set: do a VictimClean and send a ReadModReq if the request was not a StxC.
If the MAF.inval_seen bit is not set: the shared block has been displaced from the Scache. Send a StxCSuccess to Mbox. Extract the merge buffer entry to the VDB. Send a Victim to the home node. Cbox sends a ReadShrReq if it has a pending Ifetch to the cache block.
• A ShrToDirtyProbCnt does not find the Shared block in the Scache.
If the MAF.inval_seen bit is set: the shared block was invalidated. Send a ShrToDirtyRelease to the home node. If we have a pending non-StxC request, send a system request.
If the MAF.inval_seen bit is not set: the shared block has been displaced from the Scache. Send a StxCSuccess to Mbox. Extract the merge buffer entry to the VDB. Send a ShrToDirtyComplete and a Victim to the home node. Cbox sends a ReadShrReq if it has a pending Ifetch to the cache block.
• The only case where Cbox sends a StxCFail to Mbox is when Cbox gets a CtoDSTC from Mbox and we do not have the block in the Scache.
• Cbox does not send a StxCFail message to Mbox when Cbox receives a ShrToDirtyFail from the directory, since a SharedInval must have already failed the StxC.
• Forwards from remote directories need the Scache pipe:
To update the Scache tag.
To send a response to the requester and the directory.
• To avoid deadlock, we reserve four VAF and four MAF slots for probes.

12.6 Coherence State

Cbox implements a coherence protocol which ensures that all processors have a consistent image of memory.

Table 12-2 MAF Coherence State Bits (Name / Meaning / Victim to Home / Set By)
MAF.coherent: The exclusive cache block is coherent. Set by InvalAck, *Cnt.
MAF.coh_cnt<5:0>: The number of coherence messages received.
MAF.timer_on (1): Has a timer queue entry.
MAF.sc_inflight: Has an in-flight transaction in the Scache pipe. Set by Forward, ShrToDirty*Cnt, SharedInval.
MAF.victimize: After the ownership is received and the block becomes coherent, the cache block must be victimized (i.e. invalidate the cache block and send a Victim or a VictimCln to the directory). If a BlkShared is received, Cbox may keep the Shared copy. Victim to home: Victim or VictimCln. Set by a remote request (MemFwdMiss) or a Forward (ForwardMiss).
MAF.vct2shr: After the ownership is received and the block becomes coherent, the cache block must be changed to Shared. Cbox should send a VictimToShr or a VictimClnToShr to the remote home directory. Victim to home: VictimToShr or VictimClnToShr. Set by a Forward (ForwardMiss).
MAF.inval_seen (2, 3): A SharedInval has been received before the fill block. Victim to home: none. Set by SharedInval.

1 The timer is set due to either: 1) an exclusive fill block is received, the block is coherent, and either the MAF.victimize or the MAF.vct2shr bit is set; or 2) a shared fill block is received and the MAF.inval_seen bit is set.
2 This bit is set if we receive a SharedInval and have an outstanding system request. If a BlkShared is returned and MAF.inval_seen is set, then we do not know whether the block is before the SharedInval or after the SharedInval. To be conservative, the block has to be invalidated. If MAF.mb_retired is set, discard the fill block. If the MAF.mb_retired bit is not set, fill the Mbox with the block and, when the timer expires, invalidate the block. If a BlkExclCnt or a BlkDirty returns, then we know the block is after the SharedInval and we fill the Scache with the exclusive block.
3 If we receive a SharedToDirtySuccess or a SharedToDirtyProb and MAF.inval_seen is set, then the SharedToDirty has been incorrectly granted.

Notes:
• Can more than one bit be set for a MAF entry?
MAF.victimize and MAF.vct2shr: MAF.victimize must override MAF.vct2shr.
Any other case???

12.7 MAF Address CAM

• MAF miss
Load the probe into the MAF to store the probe while the probe is in-flight in the Scache pipe:
For a possible probe retry due to a bank conflict, a Scache tag ECC error, or a Scache data ECC error.
To maintain probe sequence to the same cache block.
• MAF hit
If the probe hits a MAF entry which has a transaction in-flight in the Scache (i.e.
MAF.sc_inflight is set), then the PRQ stalls until the conflicting transaction completes and clears the MAF.sc_inflight. Otherwise, the probe changes MAF states and probes the Scache if necessary. Table 12-3 Forwards hit MAF (Full Address Match) MAF states - .c en iS c l: .2> =e ·:1 ·:1> "i LL < Forwards (*Fwd) *Fwd FetchFwd ReadFwd Read.ShrFwd ReadModFwd InvToDirtyFwd Sharedlnval SharedInvaILeaf SharedlnvalMaster 'Ii LL < :ii :ii 1 x 1 c f Cl) .c 0 Need Scache action (,) u: < :ii MAF.sys_cmd<2:0> Response Action Stall the Probe pipe (PRQ) 1 x xxx x xxx 0 0 1 xxx [2] 0 0 0 xxx FwdMiss Set MAF.vct2shr [3] No 0 1 x Read FwdMiss Set MAF.vct2shr No 0 1 FwdMiss 0 1 x ReadShr x ReadMod/I2D/S2D/S2DSTC 0 0 0 0 0 [1] Yes No FwdMiss Set MAF.vct2shr No xxx FwdMiss Set MAP.victimize [3] No 1 x Read FwdMiss Set MAP.victimize [4] No 1 FwdMiss 0 1 x ReadShr x ReadMod/I2D/S2D/S2DSTC 0 0 0 0 1 x xxx FwdMiss xxx No Set MAF.victimize[4] No Must not happen InvalAck [5][6] Set MAF.inval_seen [7] Yes Notes: • We send a FwdMiss for a *Fwd to a non-coherent cache block. 5 January 2001 -··Subject To Change Compaq Confidentia I Cache Coherence Protocol Processing 12-11 MAF Address CAM • [1]: This case happens because we clear the MAF.sc_inflight late in order to give the Mbox enough time to consume the fill block before the block gets victimized or invalidated. The current design guarantees 15 cycles between the fill address and a probe to the fill block. Since we delay clearing the MAF.sc_inflight bit, we send a system request before clearing the sc_inflight bit. We believe it takes more than 8 cycles to receive a fill block from the system after sending a system request. Hence, we must not receive a fill block from the system while the MAF.sc_inflight bit is set. • [2]: This case happens because either: Cbox received a exclusive block and victimized the block. Then Cbox wanted the cache block back but hasn't sent a system request yet OR Cbox received the ownership of the cache block but the cache block is not coherent. • • [3]: Cbox received the ownership of the block but the cache block is not coherent. [4]: Cbox does not know whether the *Fwd is for an earlier version of the block we victimized or for new block which is on its way to us. To be conservative Cbox victimizes the block. • [5]: Cbox sends a InvalAckMaster and a InvalAckLeaf in response to a SharedlnvalMaster and a SharedinvalLeaf respectively. • [6]: We must send the InvalAck right away. If the Sharedlnval was before the current request and we do not send the InvalAck, then the owner of the cache block will wait for the InvalAck from us indefinitely and we will not receive the fill block. • [7]: If a BlkShared is returned in response to the outstanding Read[Shr]Req, then we do not know whether the block is before the Sharedlnval or after the Sharedlnval; hence, block has to be invalidated. If BlkExclCnt or BlkDirty returns, then we know the block is after the Sharedlnval and it is safe to fill the Scache with the block. Table 12-4 Response Hit MAF (MAF Index) ... .c ~I c ,, ... ;; I ca> ...I .c G> G> G> G> 1U u; LL Responses <C :i <C :i ... ;: E <C :i <C :i G> c ~I <C :i ·s: (I) ;,,; :;1 ·s:LL u.: G> N ·e ._I ;; ::J (,) .cl u.: ...G> G> G> 0) E Action Blklnval Must not happen BlkIO x x x x Send the IO block to Mbox. 0 x x x Fill the Scache as Shared. BlkShared Compaq Confidential 12-12 Cache Coherence Protocol Processing 5 Jc1nuc1ry 2001 ·- Subject To Change MAF Address CAM Table 12-4 Response Hit MAF (MAF Index) (Continued) ... 
.c 0, ...u0 , ·:;: cG) .,, u.: :a:G) <C ...il: ._I N G) ;:: :::J .tl I u.: ...en :a: E Action ... ;:: _, ·e :a:: ... 0 as G) (I) G) G) ...as .!:> .a'E G) ti) LL Responses BlkExclCnt <C :a: u ·;: G) I u.: :a: :a: 1 0 x x 1 1 x x Discard the fill block and send a system request. x x 0 0 Fill the Scache as ExclCln. Update the coherence LL <C <C <C G) Fill the Scache/Mbox and set the timer. When the timer expires, invalidate the block. counter. x 1 0 x x 0 1 Fill the Sc ache as Dirty if the block is coherent. If the block is not coherent, fill the Scache as exclusive. Update the coherence counter. x BlkDirty InvToDirtyRespCnt 1 x x x x x x x x Fill the Scache as ExclCln. Update the coherence counter. Set the timer if coherent. 1 Fill the Scache as Dirty. Update the coherence counter. Set the timer if coherent. 0 1 0 x Fill the Scache as Dirty. x Fill the Scache as Dirty. Set the timer. x Fill the Scache as dirty if coherent and as exclusive if not coherent. x x 1 x Fill the Scache as dirty if coherent and as exclusive if not coherent. Set the timer and when the timer expires vietirnize the block. ShrToDirtySuccessCnt ShrToDirtyProbCnt 0 x 0 x 1 x x x x x x Incorrectly granted ShrToDirtyReq. Do VictimClean. x ShrToDirtyFail x Enter the Scache pipe to update the Scache tag. Set the coherence timer. Update the coherence counter. 0 Must not happen. x x Cbox does not need to send ShrToDirtyFail to Mbox but Cbox must send a Read*Req if a nonStxC request has been merged. Inva1Ack2 x WrioAck/WrIONack x x NXMResp/ERRResp x x x x x x x Update the coherence counter. x Notify the Mbox. x 1 If we have a pending Ifecth, put it in the retry queue. 5 January 2001 ···Subject To Change Compaq Confidentia I Cache Coherence Protocol Processing 12-13 Scache Hit 2 The InvalAck makes the block coherent. If the MAF. victimize or MAF. vic_to_shr is set, enter the Scache pipe and victimize the block. If the MAF entry is for a merge buffer write, then change the Scache tag to dirty and send it to Mbox. Notes: • Cbox does not receive a ShrToDirtyFail in response to a ShrToDirtyReq. Instead the ShrToDirtyReq gets forwarded and Cbox must receive: ShrToDirtySuccessCnt if the Scache has a shared copy. ShrToDirtySuccessCnt, BlkExclCnt, or BlkDirty if the Scache does not have a shared copy. 12.8 Scache Hit Note that the 1, 3, 6, and 13 "footnotes" to the tables in the web spec are not called out in the tables. Is this okay? They are: • • • • Cbox sends ShrToDirtyReq if Mbox needs the ownership of the cache block. We may send BlkDirty to the requester and transition to Invalid. Unlike the 21364, the current proposal is to have extra state (migratory bit) for cache blocks. The directory, not the Scache, sends this. If a StxC request from Mbox finds no shared copy in the Scache, fail the StxC and send a STCFail message to Mbox. Tables 12-5, 12-6, and 12-7 show..... 
Table 12-5 Miss Requests from Mbox Scache State Commands !Fetch FetchLine PfetchLine PfetchLineMod FetchLineMod CtoD ltoD CtoDSTC EvictBlk ExclClean Merge Buffer Does Not Have Block Invalid SC State --> l/Mbox <- Home <- Read[Shr]Req SC State --> Invalid l/Mbox <- Home <- ReadModReq SC State --> Invalid l/Mbox <- Home <- ReadModReq SC State --> Invalid l/Mbox <- Home <- InvToDirtyReq SC State --> Invalid Dirty StxCFail FillBlkDirty l/Mbox <- Home <- SC State --> Invalid ExclClean Merge Buffer has Block Modified ExelClean FillBlkExcIClean ExelClean FillBlkExclClean Dirty Shared Dirty Shared FillBlkDirty FillBlkShared Dirty Shared FillBlkDirty FillBlkShared Shared Dirty FillBlkDirty FillBlkShared ShrToDirtyReq Dirty Shared FillBlkDirty FillBlkShared ShrToDirtyReq Shared FillBlkShared ShrToDirtySTCReq Invalid3 I Invalid Compaq Confidential 12-14 Cache Coherence Protocol Processing 5 January 2001 -~Subject To Cfumge Scache Hit Table 12-5 Miss Requests from Mbox (Continued) Scache State Commands Invalid Home CleanBlk Victimize MAE victimize* VictimToShr MAF.vct2shr* Invalidate MAF.inval_seen* ExclClean Merge Buffer Does Not Have Block <- SC State --> Home <- SC State --> Home <-- SC State --> Home <-- SC State --> Home <- ExclClean Merge Buffer has Block Modified l VictimCln Shared Dirty Victim 3 Invalid Shared VictimClnToShr J VictimToShr Invalid3 Invalid 1 Invalid JVictim VictimCln Invalid3 Shared2 VictimClnToShr 3 Invalid Shared J VictimToShr Must not happen. Invalid Must not happen This case happens because the MAP.victimize bit was set when a ReadModFwd or a InvToDirtyFwd hit the MAF entry which had outstanding ReadReq. If the BlkShared was returned, the cache block is after the ReadModFwd or the InvToDirtyFwd. Hence Cbox can keep the shared copy. However, it is also possible that this case happens because a Chg2Shared from Mbox hit the Exclusive block. Then we must invalidate the cache block. To be conservative, we invalidate the Shared copy. 2 If the merge buffer has modified data, the merge buffer writes both to the VDB and to the Scache: - Cbox sends cache block to the merge buffer. - The cache block gets merged with the merge buffer data. - The merge buffer writes to the Scache and to the VDB. 3 The block has already been victimized by a fill. Table 12-6 lists the forwards from the remote directory. 
Table 12-6 Forwards From (Remote) Directory Scache State Commands FetchFwd ReadShrFwd ReadFwd ReadModFwd Requester is a processor Invalid SC State -> Invalid Home <- FwdMiss Requester <- SC State -> Invalid Home <- FwdMiss Requester <- SC State -> Invalid Home <- FwdMiss Requester <- ExclClean Merge Buffer Does Not Have Block FwdAckShr ExclClean Merge Buffer has Block Modified Dirty Shared Shared1 Shared 1 FwdMiss VictimAckShr Blklnval FwdAckShr Shared1 Shared 1 FwdMiss VictimAckShr BlkShared FwdAckExcl BlkExclCnt(O) l FwdMiss BlkDirty Compaq Confidential 5 Janw1ry 2001 ···Subject To Change Cache Coherence Protocol Processing 12-15 Scache Hit Table 12-6 Forwards From (Remote) Directory (Continued) Scache State Commands ReadModFwd Requester is a I/O device InvalToDirtyFwd Requester is processor InvalToDirtyFwd Requester is I/O device Sharedlnval ExclClean Merge Buffer Does Not Have Block Invalid ExclClean Merge Buffer has Block Modified SC State -> Home <- Requester <- BlkExclCnt(O) SC State -> Invalid Home <-- Dirty Shared Invalid FwdMiss VictimAckExcl FwdMiss FwdMiss FwdAckExcl FwdMiss Requester <- Inva1ToDirtyRespCnt2 SC State -> Invalid Home <- FwdMiss l FwdAckExcl VictimAckExcl FwdMiss 2 Requester <- SC State -> Invalid Must not happen Invalid Requester <- InvalAck Must not happen InvalAck InvalToDirty RespCnt If the merge buffer has modified data, the merge buffer writes both to the VDB and to the Scache: Cbox sends cache block to the merge buffer. The cache block gets merged with the merge buffer data. The merge buffer writes to the Scache and to the VDB. 2 The Scache, not the directory, sends this. Table 12-7 lists the responses (fills) from the system. Table 12-7 Responses (Fills) from System Scache State Commands ExclClean Merge Buffer Does Not Have Block Invalid ExclClean Merge buffer has Block Modified Dirty Blklnval Must not happen (21464 processors do not send Fetch request.) BlkIO Mbox <- DataIO BlkShared BlkExclCnt BlkDirty InvalToDirtyRespCnt ShrToDirtySuccessCnt SC State -> Shared Must not happen l/Mbox <- FillBlkShared Must not happen SC State -> ExclClean Must not happen l/Mbox <- FillBlkExclCln Must not happen SC State -> Dirty Must not happen l/Mbox <- DataDirty Must not happen SC State -> Invalid Must not happen l/Mbox <- Victimize Must not happen SC State -> Invalid l/Mbox <- STCFail 1 Shared Must not happen ExelClean Must not happen DataExclCln Compaq Confidential 12-16 Cache Coherence Protocol Processing 5 Jc1nwiry 2001 ···Subject To Change VAF Address CAM Table 12-7 Responses (Fills) from System (Continued) Scache State Commands ExclClean Merge Buffer Does Not Have Block Invalid ShrToDirtyProbCnt ShrToDirtyFail [14] ExclClean Merge buffer has Block Modified Dirty Shared Home <- VictimCin, ReadModReq2 Must not happen SC State -> Invalid Must not happen I/Mbox <- STCFail Must not happen DataExclCin Home <- ShrToDirtyReIease3 Must not happen ShrToDirtyComplete SC State -> Invalid l/Mbox <- Home <- ExelClean Must not happen Must not happen Read*Req 4 Must not happen 1 If the response was for a StxC request. If a non-StxC and a StxC request gets merged at the MAF, Cbox send a ShrToDirtyReq. So when ShrToDirty*Cnt or ShrToDirtyFail returns, Cbox has to look at the MAF.miss_stxc bit, not the request that was sent, to determine whether the response is for a StxC. 2 Cbox does VictimCln when all the InvalAcks are received and then sends ReadModReq if non-StxC request. 3 Cbox needs to send Read.ModReq if a non-StxC request. 4 A ShrToDirtyFail does not need Scache action. 
Cbox sends a Read*Req if a non-StxC request has been merged.

Notes:
• According to the current proposal, InvalToDirtyReqs get forwarded even if they originate from processors as well as I/O devices. Hence a SharedInval must not happen if we have the Exclusive block.
• Since there can be only one outstanding request for a cache block:
We must not receive a ShrToDirty*Cnt if the Scache has the block ExclClean or Dirty.
We must not receive a Blk* if the Scache has a copy of the block, either Shared or Exclusive.
• A FwdMiss may happen because:
The cache block has been victimized.
The cache block is not coherent.
The cache block hasn't been received.

12.9 VAF Address CAM

• If there is a VAF hit, then the victim entry gets a high priority to expedite the processing (i.e. the DIFT needs the Victim before servicing the request).
• There is no local victim and local request race for the new cache coherence protocol with no LPRs.

Table 12-8 VAF Hit
Victims to home directory Probes Victim* Response to remote node *Ack* InvToDirtyRespCnt, Blk* InvalAck Miss Set the high priority bit for the VAF entry Victimize* Set the high priority bit for the VAF entry Forwards (*Fwd) Set the high priority bit for the VAF entry. Must not happen SharedInval Must not happen Must not happen

• Victim* (Victim, VictimToShr, VictimCln, VictimClnToShr):
Cbox is in the process of giving up ownership. The Victim-Request race gets resolved at the DIFT.
• *Ack* (VictimAckShr, VictimAckExcl, FwdAckShr, FwdAckExcl):
Cbox is in the process of giving up the ownership in response to a *Fwd. The DIFT has an entry waiting for the *Ack*. Subsequent requests to the block will not be serviced until the *Ack* is received by the DIFT, thereby maintaining the order.
• A VAF entry can generate two messages to the system:
A ForwardMiss and a Victim* to the home node (if a *Fwd hit a Victim*).
An *Ack* to the home node and a Blk* to the requester.
A Victim* and a ShrToDirtyComplete.

12.10 Directory Responses

Table 12-9 shows the directory state request responses.
Table 12-9 Directory State Request Responses Directory State Request Shared2 (S1 ,S2) SharedM (M) Local (lnMemory) RemoteExcl (0) Shared1 (S1 ) FetchReq Dir: ->Local Requester<- Blkfuval Dir:-> Shrl(O) Owner <- Fet.chFwd Dir:-> Shrl(Sl) Dir:-> Shr2(Sl,S2) Requester <- Blklnval Requester<- Blkfuval Dir: -> ShrM(M) Requester <- Blklnval ReadShrReq Dir: ->Shrl(R) Requester<- BlkShared Dir: -> Shr2(0,R) Owner <- ReadShrFwd Dir: ->Shr2(Sl,R) Requester<- BlkShared Dir:-> ShrM(Sl,S2,R) Requester<- BlkShared Dir: -> ShrM(M,R) Requester <- BlkShared ReadReq Dir: ->Shrl(R) Requester<- BlkExclCnt(O) Dir:-> Shr2(0,R) 1 Owner <- ReadFwd Dir:-> Shr2(Sl,R) Requester<- BlkShared Dir:-> ShrM(S 1,S2,R) Requester<- BlkShared Dir: -> ShrM(M,R) Requester <- BlkShared ReadModReq Dir:-> RemoteExcl(R) Requester<- BlkExclCnt(O) Dir: ->RemoteExcl(R) Owner<- ReadModFwd Dir: ->RemoteExcl(R) Requester<- BlkExclCnt(l) Sl <- Sharedfuval Dir: -> RemoteExcl(R) Requester<- BlkExc1Cnt(2) Sl,S2 <- Sharedlnval Dir: ->RemoteExcl(R) Requester <- BlkExclCnt(M) M <- SharedlnvaIBcast Compaq Confidential 12-18 Cache Coherence Protocol Processing 5 J~1muiry 2001 -~Subject To Change Directory Responses Table 12-9 Directory State Request Responses (Continued) Directory State Request Local (lnMemory) RemoteExcl (0) Shared1 (S1) Shared2 (S1,S2) SharedM (M) InvToDirtyReq Dir:-> RemoteExcl(R) Requester <- InvToDirtyRespCnt(O) Dir: ->RemoteExcl(R) Owner <- InvToDirtyFwd Dir: ->RemoteExcl(R) Requester <- InvToDirtyRespCnt(l) Sl <- SharedDirty Dir: ->RemoteExcl(R) Requester <- InvToDirtyRespCnt(2) Sl,S2 <- SharedDirty Dir: ->RemoteExcl(R) Requester <- InvToDirtyRespCnt(M) M <- ShrDirtyBcast ShrToDirtyReq Requester is a sharer (R= Sl) Dir:-> RemoteExcl(R) Requester<- BlkExc1Cnt(0)2 Dir:-> RemoteExcl(R) Owner <- ReadModFwd Dir: ->RemoteExcl(R) Requester <S2DSuccCnt(O) Dir:-> RemoteExcl(R) Requester<S2DSuccCnt(l) S2 <- Sharedinval Dir:-> RemoteExcl(R) Requester <S2DSuccCnt(M)3 M <- ShrinvalBcast ShrToDirtyReq Requester is not a sharer Dir:-> RemoteExcl(R) Request~r <- BlkExclCnt(O) Dir: ->RemoteExcl(R) Owner <- ReadModFwd Dir: ->RemoteExcl(R) Request'41" <- BlkExclCnt(l) Sl <- Sharedinval Dir: -> RemoteExcl(R) Requester<- BlkExclCnt(2) Sl,S2 <- Sharedinval Dir: ->RemoteExcl(R) Requester <- BlkExclCnt(M) M <- ShrinvalBcast ShrToDirtySTCReq Requester is a sharer (R= Sl) Dir:-> Local Requester <Shr2Dirty Fail Dir: ->RemoteExcl(O) Requester<- ShrToDirtyFail Dir: ->RemoteExcl(R) Requester <Shr2DirtySuccCnt(l) Dir: -> RemoteExcl(R) Requester<Shr2DirtySuccCnt(2) S2 <- Sharedinval Dir: -> ShrM(M) Requester <- ShrToDirty ProbCnt(M) ShrToDirtySTCReq Requester is not a sharer Dir: -> Local Requester <Shr2DirtyFail Dir: ->RemoteExcl(O) Requester<- ShrToDirtyFail Dir: ->Sharedl(Sl) Requester <Shr2DirtyFail Dir:-> Shared2(Sl,S2) Requester<Shr2DirtyFail Dir: -> ShrM(M) Requester <Shr2DirtyFail The directory state transitions to Shr2(0, R) if Victim[/Fwd]AckShared is received and transition to RemoteExcl(O) if Victim[/Fwd]AckExcl is received. 2 3 4 The 21364 fails the ShrToDirtyReq since the request is not a sharer. The directory may incorrectly succeed the ShrToDirtyReq when the sharing mask is used. The request is responsible for the recovery (i.e. do VictimClean and re-send the request). No corresponding footnote text Notes: • EV8 processors are not allowed to send a FetchReq. Only I/O processors may send a FetchReq. • This table is based on the new EV8 cache coherence protocol where local requests update the directory. 
Hence local requests and remote requests are treated exactly the same way.
• If the requester is the exclusive owner, this indicates that the victim block from the requester is on its way to the directory. The DIFT sends a response after it receives the victim.
• For a *Req, if the directory state is exclusive, the DIFT forwards the request and speculatively writes the directory. If the DIFT receives a ForwardAck*, then the directory does not have to write the directory again. If the DIFT receives a VictimAck*, then the directory has to write the whole cache block. However, if the new owner does a VictimClean* and the VictimClean* arrives before the ForwardAck*, the DIFT entry must do a read-modify-write of the directory. To avoid this scenario, the 21364 Cbox is not allowed to respond with a ForwardAckExcl in response to a ReadFwd. The 21364 Cbox is also not allowed to send a VictimClean* for a cache block obtained through a ReadModReq. EV8 has the same number of fill buffer entries as DIFT entries, thereby eliminating speculative directory writes.
• The 21464 does not support the "Shared3" state.
• If the directory state is "Incoherent", then the directory sends an "ERRResp" to the requester.
• The 21464 forwards a ShrToDirtyReq if the directory is certain that the requester does not have a shared copy.

12.11 System Command Opcodes

Table 12-10 lists the system command opcodes.

Table 12-10 System Command Opcodes (Type / Command Name / Opcode)

IO
RdIOQWS    CR_OP_IO_RD_QWS    0x40
RdIOLWS    CR_OP_IO_RD_LWS    0x41
RdIOBytes  CR_OP_IO_RD_BYTES  0x43
RdIOIPR    CR_OP_IO_RD_IPR    0x44
WrIOQWS    CR_OP_IO_WR_QWS    0x50
WrIOLWS    CR_OP_IO_WR_LWS    0x51
WrIOBytes  CR_OP_IO_WR_BYTES  0x53
WrIOIPR    CR_OP_IO_WR_IPR    0x54

Request
ReadReq           CR_OP_REQ_RD               0x60
ReadShrReq        CR_OP_REQ_RD_SHR           0x61
FetchReq          CR_OP_REQ_FETCH            0x62
ReadModReq        CR_OP_REQ_RD_MOD           0x64
InvToDirtyReq     CR_OP_REQ_INVAL_TO_DRTY    0x65
ShrToDirtyReq     CR_OP_REQ_SHR_TO_DRTY      0x66
ShrToDirtySTCReq  CR_OP_REQ_SHR_TO_DRTY_STC  0x67

Forwards
ReadFwd            CR_OP_FWD_RD                0x80
ReadShrFwd         CR_OP_FWD_RD_SHR            0x81
FetchFwd           CR_OP_FWD_FETCH             0x82
ReadModFwd         CR_OP_FWD_RD_MOD            0x84
InvToDirtyFwd      CR_OP_FWD_INVAL_TO_DRTY     0x85
SharedInvalSingle  CR_OP_FWD_SHR_INVAL_SINGLE  0x86
SharedInvalMask    CR_OP_FWD_SHR_INVAL_MASK    0x87

Response with Block
BlkDirty    CR_OP_RSP_BLK_DRTY      0xC0
BlkShared   CR_OP_RSP_BLK_SHR       0xC1
BlkInval    CR_OP_RSP_BLK_INVAL     0xC2
BlkExclCnt  CR_OP_RSP_BLK_EXCL_CNT  0xC4
BlkIO       CR_OP_RSP_BLK_IO        0xC5

Victim response
Victim          CR_OP_RSP_VIC         0xD8
VictimToShared  CR_OP_RSP_VIC_TO_SHR  0xD9

Responses without Block
VictimAckExcl         CR_OP_RSP_VIC_ACK_EXCL          0xDA
VictimAckShared       CR_OP_RSP_VIC_ACK_SHR           0xDB
NXMResp               CR_OP_RSP_NXM                   0xE0
ERRResp               CR_OP_RSP_ERR                   0xE1
InvalAck              CR_OP_RSP_INVAL_ACK             0xE2
ShrToDirtySuccessCnt  CR_OP_RSP_SHR_TO_DRTY_SUCC_CNT  0xE4
ShrToDirtyProbCnt     CR_OP_RSP_SHR_TO_DRTY_PROB_CNT  0xE5
ShrToDirtyFail        CR_OP_RSP_SHR_TO_DRTY_FAIL      0xE6
InvalToDirtyRespCnt   CR_OP_RSP_INVAL_TO_DRTY_CNT     0xE7
WrIOAck               CR_OP_RSP_WR_IO_ACK             0xE8
WrIONack              CR_OP_RSP_WR_IO_NACK            0xE9
InvalAckLeaf          CR_OP_RSP_INVAL_ACK_LEAF        0xEA
InvalAckMaster        CR_OP_RSP_INVAL_ACK_MASTER      0xEB

Release response
VictimClean            CR_OP_RSP_VIC_CLN          0xF0
VictimClnToShr         CR_OP_RSP_VIC_CLN_TO_SHR   0xF1
ForwardAckExcl         CR_OP_RSP_FWD_ACK_EXCL     0xF2
ForwardAckShared       CR_OP_RSP_FWD_ACK_SHR      0xF3
ForwardMiss            CR_OP_RSP_FWD_MISS         0xF4
SharedToDirtyComplete  CR_OP_RSP_SHR_TO_DRTY_COM  0xF5
SharedToDirtyRelease   CR_OP_RSP_SHR_TO_DRTY_REL  0xF6

Special
NZ-NoOp                 CR_OP_SPEC_NZNOP                0xA0
SharedInvalBcast        CR_OP_RSP_SHR_INVAL_BRD         0xB1
SharedInvalBcastLeaf    CR_OP_RSP_SHR_INVAL_BRD_LEAF    0xB2
SharedInvalBcastMaster  CR_OP_RSP_SHR_INVAL_BRD_MASTER  0xB3
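As the table suggests, the opcode value ranges follow the command classes. The following is a minimal sketch of how a handler might encode a few of these opcodes and classify them by range; the range boundaries are inferred from the table's grouping and are an assumption for illustration, not an architected decode rule.

#include <stdio.h>

/* A few opcodes transcribed from Table 12-10 (names abbreviated). */
enum cr_opcode {
    CR_OP_REQ_RD           = 0x60,   /* ReadReq     */
    CR_OP_FWD_RD_MOD       = 0x84,   /* ReadModFwd  */
    CR_OP_RSP_BLK_EXCL_CNT = 0xC4,   /* BlkExclCnt  */
    CR_OP_RSP_VIC          = 0xD8,   /* Victim      */
    CR_OP_RSP_INVAL_ACK    = 0xE2,   /* InvalAck    */
    CR_OP_RSP_VIC_CLN      = 0xF0,   /* VictimClean */
    CR_OP_SPEC_NZNOP       = 0xA0    /* NZ-NoOp     */
};

/* Classify an opcode into the table's groups; the ranges mirror the table's
 * ordering and are illustrative only. */
static const char *classify(unsigned op)
{
    if (op >= 0x40 && op <= 0x54) return "IO";
    if (op >= 0x60 && op <= 0x67) return "Request";
    if (op >= 0x80 && op <= 0x87) return "Forward";
    if (op >= 0xC0 && op <= 0xC5) return "Response with block";
    if (op >= 0xD8 && op <= 0xD9) return "Victim response";
    if (op >= 0xDA && op <= 0xEB) return "Response without block";
    if (op >= 0xF0 && op <= 0xF6) return "Release response";
    return "Special";
}

int main(void)
{
    const unsigned samples[] = { CR_OP_REQ_RD, CR_OP_FWD_RD_MOD,
                                 CR_OP_RSP_BLK_EXCL_CNT, CR_OP_SPEC_NZNOP };
    for (unsigned i = 0; i < sizeof samples / sizeof samples[0]; i++)
        printf("0x%02X -> %s\n", samples[i], classify(samples[i]));
    return 0;
}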
12.12 Protocol Message Descriptions

12.12.1 IO CHANNEL Message Details

12.12.1.1 RdBytes, RdLWs, RdQWs, RdIPR
This processor/I/O device wishes to do a load to I/O space. The request includes an address, a MAF#, a PID, and a mask indicating which parts of the block are being read. QWADD(5:3) contains the exact address bits of the first load in the block (i.e. the load with the lowest address). For RdQWs the mask indicates the merged quadword loads. For RdLWs, 32 bytes of information are expected to be returned (double-pumped into 64 bytes) and the mask indicates the merged longword loads within the given hexaword. For RdBytes no merging is allowed and the mask indicates the valid bytes/words; one or two bytes of properly aligned information are expected to be returned.
RdIPR is identical to RdQWs, except that the different opcode indicates that the reference was within the range of the processor I/O space rather than the ASIC I/O space (i.e. the address references a 21364 IPR).
The likely response to a Rd* command is BlkIO. Also possible are NXMResp and ERRResp.
Note that the 21364 does not support executing instructions directly from I/O space, so these commands can only be generated by load instructions that reference I/O space.
These commands are used on both I/O and router channels. Note that an I/O device can source one of these commands.

12.12.1.2 WrBytes, WrLWs, WrQWs, WrIPR
The processor/I/O device did a store to I/O space. The request has an address, a PID, a mask indicating which parts to write, a write I/O identifier, and the block of data. QWADD(5:3) contains the exact address bits of the first store in the block (i.e. the store with the lowest quadword address). For WrQWs and WrIPR the mask indicates the merged quadword stores. For WrLWs, 32 useful bytes of information are sent and the mask indicates the merged longword stores in the hexaword. For WrBytes no merging is allowed and the mask indicates the valid bytes/word. The opcode, mask and QWADD are always consistent. The mask is never zero.
The tables below show the useful data for I/O writes that are initiated by the 21464 processor. Table 12-11 shows the location of the useful data for fully-merged WrQW's and WrIPR's. In the table, N.L is the lower longword of quadword N, N.H is the upper longword of quadword N. X is unused data. Quadwords are merged only if they are issued in ascending address order. Noncontiguous quadwords can be merged. In the case of a WrIPR, the two longwords indicated by a single bit in the mask and by QWADD(5,3) contain all the useful information. The first longword contains bits 0-31 and the next contains bits 32-63.
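Table 12-11 below enumerates the fully-merged WrQW/WrIPR layouts. As a cross-check, the following is a minimal sketch that derives the same mask and longword layout from QWADD(5:3); the helper names are illustrative and not part of the interface.

#include <stdio.h>

/* Sketch of the fully-merged WrQW/WrIPR layout of Table 12-11: given
 * QWADD(5:3), compute the quadword mask and print which of the 16 longword
 * slots in the packet carry useful data. */
static unsigned char wrqw_mask(unsigned qwadd)       /* qwadd = QWADD(5:3) */
{
    return (unsigned char)(0xFFu << qwadd);          /* e.g. 1 -> 0xFE     */
}

static void print_layout(unsigned qwadd)
{
    unsigned char mask = wrqw_mask(qwadd);
    printf("QWADD(5:3)=%u  mask=0x%02X  longwords: ", qwadd, mask);
    for (unsigned lw = 0; lw < 16; lw++) {
        unsigned qw = lw / 2;                        /* owning quadword    */
        if (mask & (1u << qw))
            printf("%u.%c ", qw, (lw & 1) ? 'H' : 'L');
        else
            printf("X ");                            /* unused slot        */
    }
    putchar('\n');
}

int main(void)
{
    print_layout(0);   /* 0.L 0.H ... 7.H : whole block useful           */
    print_layout(5);   /* X ... X 5.L 5.H 6.L 6.H 7.L 7.H, mask 0xE0     */
    return 0;
}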
Table 12-11 Location of Useful Data for Fully-Merged WrQW's and WrIPR's
QWADD(5:3)  Mask  Ordered Block Data for WrQWs and WrIPR (in Longwords)
0  0xFF  0.L,0.H,1.L,1.H,2.L,2.H,3.L,3.H,4.L,4.H,5.L,5.H,6.L,6.H,7.L,7.H
1  0xFE  X.X,X.X,1.L,1.H,2.L,2.H,3.L,3.H,4.L,4.H,5.L,5.H,6.L,6.H,7.L,7.H
2  0xFC  X.X,X.X,X.X,X.X,2.L,2.H,3.L,3.H,4.L,4.H,5.L,5.H,6.L,6.H,7.L,7.H
3  0xF8  X.X,X.X,X.X,X.X,X.X,X.X,3.L,3.H,4.L,4.H,5.L,5.H,6.L,6.H,7.L,7.H
4  0xF0  X.X,X.X,X.X,X.X,X.X,X.X,X.X,X.X,4.L,4.H,5.L,5.H,6.L,6.H,7.L,7.H
5  0xE0  X.X,X.X,X.X,X.X,X.X,X.X,X.X,X.X,X.X,X.X,5.L,5.H,6.L,6.H,7.L,7.H
6  0xC0  X.X,X.X,X.X,X.X,X.X,X.X,X.X,X.X,X.X,X.X,X.X,X.X,6.L,6.H,7.L,7.H
7  0x80  X.X,X.X,X.X,X.X,X.X,X.X,X.X,X.X,X.X,X.X,X.X,X.X,X.X,X.X,7.L,7.H

Table 12-12 shows the location of useful data for fully-merged WrLWs. The numbers shown are the sequential longwords. Longwords are merged only if they are issued in ascending address order. Noncontiguous longwords can be merged.

Table 12-12 Location of Useful Data for Fully-Merged WrLW's
QWADD(5:3)  Mask  Ordered Block Data for WrLWs (in Longwords)
000  0xFF  0,1,2,3,4,5,6,7,X,X,X,X,X,X,X,X
000  0xFE  X,1,2,3,4,5,6,7,X,X,X,X,X,X,X,X
001  0xFC  X,X,2,3,4,5,6,7,X,X,X,X,X,X,X,X
001  0xF8  X,X,X,3,4,5,6,7,X,X,X,X,X,X,X,X
010  0xF0  X,X,X,X,4,5,6,7,X,X,X,X,X,X,X,X
010  0xE0  X,X,X,X,X,5,6,7,X,X,X,X,X,X,X,X
011  0xC0  X,X,X,X,X,X,6,7,X,X,X,X,X,X,X,X
011  0x80  X,X,X,X,X,X,X,7,X,X,X,X,X,X,X,X
100  0xFF  X,X,X,X,X,X,X,X,0,1,2,3,4,5,6,7
100  0xFE  X,X,X,X,X,X,X,X,X,1,2,3,4,5,6,7
101  0xFC  X,X,X,X,X,X,X,X,X,X,2,3,4,5,6,7
101  0xF8  X,X,X,X,X,X,X,X,X,X,X,3,4,5,6,7
110  0xF0  X,X,X,X,X,X,X,X,X,X,X,X,4,5,6,7
110  0xE0  X,X,X,X,X,X,X,X,X,X,X,X,X,5,6,7
111  0xC0  X,X,X,X,X,X,X,X,X,X,X,X,X,X,6,7
111  0x80  X,X,X,X,X,X,X,X,X,X,X,X,X,X,X,7

Here is the complete table of the useful data in the quadword specified by QWADD(5,3) of a WrByte. (The other 7 quadwords in the packet are always garbage, so they are not listed in the table.) The table byte values are listed from low-order bytes (left) to high-order bytes (right).

Table 12-13 Location of Useful Data for Quadword Specified by QWADD(5,3) of a WrByte
Mask  Quadword of Data for WrBytes (in Bytes)
0x03  0,1,X,X, X,X,X,X
0x02  X,1,X,X, X,X,X,X
0x0C  X,X,2,3, X,X,X,X
0x08  X,X,X,3, X,X,X,X
0x30  X,X,X,X, 4,5,X,X
0x20  X,X,X,X, X,5,X,X
0xC0  X,X,X,X, X,X,6,7
0x80  X,X,X,X, X,X,X,7

WrIPR is identical to WrQWs, except that the different opcode indicates that the reference was within the range of the processor I/O space rather than the ASIC I/O space (i.e. the address references a 21364 IPR). The "is for IO" bit is set for a WrIPR, though the message is really destined for the CSR master.
The only responses to a Wr*s command are WrIOAck or WrIONAck. (WrIONAck can only be returned in response to an I/O write to the RBOX_INTA IPR.)
These commands are used on both I/O and router channels. Note that an I/O device may source one of these commands.

12.12.2 REQUEST CHANNEL Message Details

12.12.2.1 ReadReq
A load miss. The block may be returned in either shared or exclusive state. The request includes an address (offset and routing information), MAF#, PID, and wrap.
The likely response to a ReadReq is BlkShared or BlkExclusiveCnt. Also possible are NXMResp and ERRResp.
This command is used only on the interprocessor channels.
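Each request subsection in this chapter enumerates which responses a requester may legally observe. A compact way to picture that pairing is a small checker table; the sketch below summarizes a few of the pairings from the descriptions in this section (12.12.2.1 and the subsections that follow) and is illustrative only.

#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Illustrative summary of legal responses per request type, taken from the
 * prose of Section 12.12.2; not an architected structure. */
struct legal_rsp { const char *req; const char *rsps[8]; };

static const struct legal_rsp table[] = {
    { "ReadReq",       { "BlkShared", "BlkExclCnt", "NXMResp", "ERRResp", 0 } },
    { "ReadSharedReq", { "BlkShared", "NXMResp", "ERRResp", 0 } },
    { "ReadModReq",    { "BlkExclCnt", "InvalAck", "NXMResp", "ERRResp", 0 } },
    { "FetchReq",      { "BlkInval", "NXMResp", "ERRResp", 0 } },
    { "InvToDirtyReq", { "InvToDirtyRespCnt", "InvalAck", "NXMResp", "ERRResp", 0 } },
};

static bool legal(const char *req, const char *rsp)
{
    for (size_t i = 0; i < sizeof table / sizeof table[0]; i++) {
        if (strcmp(table[i].req, req) != 0) continue;
        for (int j = 0; table[i].rsps[j]; j++)
            if (strcmp(table[i].rsps[j], rsp) == 0) return true;
        return false;
    }
    return false;
}

int main(void)
{
    printf("%d\n", legal("ReadReq", "BlkExclCnt"));  /* 1: allowed            */
    printf("%d\n", legal("FetchReq", "BlkShared"));  /* 0: Fetch data is not  */
    return 0;                                        /*    returned shared    */
}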
12.12.2.2 ReadSharedReq
Same as ReadReq except we must end up in the shared state. Usually generated by an instruction fill. The request includes an address (offset and routing information), MAF#, PID, and wrap.
The likely response to a ReadSharedReq is BlkShared. Also possible are NXMResp and ERRResp.
This command is used only on the interprocessor channels.

12.12.2.3 ReadModReq
A processor store miss, ordered DMA read, or a DMA write. The block may be returned in either exclusive or dirty state. (It will normally be written into the cache in the dirty state by the processor core. If generated by a DMA read request, though, data returned in the exclusive state may never be converted to the dirty state.) The request includes an address (offset and routing information), MAF#, PID, and wrap.
The likely responses to a ReadModReq are BlkExclusiveCnt and InvalAck. NXMResp and ERRResp are also possible. A BlkExclusiveCnt returned to a 21364 processor in response to a ReadModReq forces the block to be written into the cache dirty.
This command is used on both I/O and interprocessor channels. An I/O device may source this packet. In this case, the memory must always have an up-to-date copy of the block. For instance, this means that it is not acceptable to return a dirty block copy from a processor's cache directly without also updating memory with a VictimAckExcl. This restriction is so that the I/O device only has to victimize a dirty block if it actually has written to the block, minimizing bandwidth on the I/O port.

12.12.2.4 FetchReq
A no-cache load request or a (possibly unordered) DMA read. The block must be returned in invalid state. The request includes an address (offset and routing information), MAF#, PID, and wrap.
The likely response to a FetchReq is BlkInval. NXMResp and ERRResp are also possible.
This command is used on both I/O and interprocessor channels. An I/O device may source this packet. When used for DMA reads, requests to multiple outstanding blocks may be (from a coherence perspective) reordered. See Section 13.6.

12.12.2.5 SharedtoDirtyReq
This processor has/had a shared copy of this block and wishes to write it. There are two possible responses to this request: success or failure. The request includes an address (offset and routing information), MAF#, and PID.
This command should be failed if it reaches the directory and does not find the block in shared state with the corresponding sharing mask bit set, or if it does not find the processor on the sharing list (when the requesting processor is remote).
The likely response to a SharedtoDirtyReq is SharedtoDirtySuccessCnt, SharedtoDirtyFail, or InvalAck. Also possible is ERRResp.
This command is used only on the interprocessor channels.
The directory may incorrectly succeed this request, in which case the source processor must recover from the mistake. See Section 12.13.6.
Store-conditional instructions will generate SharedtoDirtyReq commands when the source processor is in STC-optimistic mode. When the source processor is in STC-conservative mode it will instead generate SharedtoDirtySTCReq. See Section 12.13.8.

12.12.2.6 SharedtoDirtySTCReq
This processor is in STC-conservative mode, has/had a shared copy of this block, and wishes to succeed a store-conditional. There are three possible responses to this request: success, probable success, or failure. This conservative request allows the requesting processor to avoid unnecessary invalidates. The request includes an address (offset and routing information), MAF#, and PID.
The possible responses to a SharedtoDirtySTCReq are SharedtoDirtyProbCnt, SharedtoDirtySuccessCnt, SharedtoDirtyFail, or InvalAck. Also possible is ERRResp. See Section 12.13.8.
SharedtoDirtyFail should be the directory's response when the requesting processor is not in the sharing mask or sharing list. SharedtoDirtyProbableCnt should be the directory's response when the processor is in the sharing mask. SharedtoDirtySuccessCnt should be the response when the processor is in the sharing list.
This command is used only on the interprocessor channels.
12.12.2.7 InvaltoDirtyReq
A full-block write request, most likely from the processor, or a DMA write. The data need not be returned. The request includes an address (offset and routing information), MAF#, and PID.
The response to an InvaltoDirtyReq is InvaltoDirtyRespCnt or InvalAck. NXMResp is also possible (only in the presence of a true software error when from the processor, or a misspeculation by a DMA engine). ERRResp is also possible.
This command is used on both I/O and interprocessor channels. An I/O device may source this packet. Note that the system guarantees that the old value of the memory location resides in memory when responding with InvaltoDirtyRespCnt to an InvaltoDirtyReq generated by a DMA device. This allows the DMA device to "speculatively" launch an InvaltoDirtyReq. See Section 13.6.

12.12.3 FORWARD CHANNEL Message Details

12.12.3.1 ReadForward, ReadSharedForward, ReadModForward, FetchForward, InvaltoDirtyForward
The corresponding requests reached the directory and found the block to be in exclusive state at a different processor. A DIFT (directory in-flight table) entry has been created, which will typically be cleared when the forward request is acked. The command includes an address (including both an offset and a PID of the directory), a MAF#, PID, wrap (except for InvaltoDirtyForward), and routing information to reach the forwarded destination. The MAF# and PID indicate the original source of the request.
There are two categories of responses (to the DIFT) to a *Forward command. The most likely category of responses is VictimAckExcl, VictimAckShared, ForwardAckExcl, or ForwardAckShared. In these cases a Blk* response is also sent to the requestor. The less likely response (to the DIFT) to a *Forward command is ForwardMiss. A ForwardMiss is accompanied by a Victim, VictimClean, VictimtoShared, or VictimCleantoShared.
If the block had been thought to be exclusive at the same processor that generated the request, the request must block rather than generate a forward. Otherwise, the block might continually be reloaded and evicted, resulting in livelock. (It doesn't make any sense to forward to yourself anyway.) If a block is exclusively held by a DMA device, forwards must always be generated, even if the exclusive owner is the requestor. A DMA device has the uppermost PID bit set. In response to the forward, a DMA device will send a ForwardMiss and eventually evict the block.
These commands are used on both the I/O and interprocessor channels.
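A small sketch of the DIFT-side closeout implied by the two response categories above (and by the forward races in Section 12.13) follows; the structure and helper names are illustrative, not the actual DIFT design.

#include <stdbool.h>
#include <stdio.h>

/* An Ack closes the DIFT entry at once; a ForwardMiss means a Victim or
 * VictimClean (possibly a toShared flavor) is also in flight, and the request
 * can only be replayed after both messages have arrived. */
typedef struct {
    bool fwd_miss_seen;   /* ForwardMiss received from the presumed owner */
    bool victim_seen;     /* the matching Victim or VictimClean arrived   */
    bool closed;          /* entry retired; next request may be serviced  */
} dift_entry;

static void replay_as_if_invalid(dift_entry *e)
{
    /* Regenerate the request as if the directory had found the block
     * invalid (or shared, for the toShared victims) in the first place. */
    puts("replay request from the DIFT");
    e->closed = true;
}

static void on_forward_ack(dift_entry *e)   /* VictimAck or ForwardAck */
{
    puts("update directory; Blk data already sent to the requester");
    e->closed = true;
}

static void on_forward_miss(dift_entry *e)
{
    e->fwd_miss_seen = true;
    if (e->victim_seen) replay_as_if_invalid(e);
}

static void on_victim(dift_entry *e)        /* Victim or VictimClean */
{
    e->victim_seen = true;
    if (e->fwd_miss_seen) replay_as_if_invalid(e);
}

int main(void)
{
    dift_entry a = {0}, b = {0};
    on_forward_ack(&a);                 /* common case                      */
    on_victim(&b); on_forward_miss(&b); /* race case: wait for both, replay */
    printf("a.closed=%d b.closed=%d\n", a.closed, b.closed);
    return 0;
}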
12.12.3.2 SharedInvalSingle
The directory sends these when a processor wishes to gain exclusive access to a block that is in shared state. SharedInvalSingle is generated when a block is in the Shared1 or Shared2 states, or if each mask bit refers to a single processor and the block is in the SharedM state. The SharedInvalSingle command contains an address (including both an offset and a PID of the directory), MAF#, PID, and routing information to reach the forwarded destination.
Note that the cache invalidate associated with a SharedInvalSingle should never be performed at the requesting PID, even though it may be included in the sharing list. On the 21364, the DIFT never launches a SharedInvalSingle to the requesting processor (and the coherence count never includes the requestor).
This command is used only on the interprocessor channels.

12.12.3.3 SharedInvalBroadcast
The directory sends these when a processor wishes to gain exclusive access to a block that is in SharedM state and each mask bit refers to more than one processor. The router is required to fan out the inval within the cluster of processors that share a mask bit, fan back in the completion, and send a single InvalAck back to the requesting processor. The SharedInvalBroadcast command contains an address (including both an offset and a PID of the directory), MAF#, PID, and routing information to reach the forwarded cluster.
Note that it is required that during the fanout at the cluster that includes the requesting PID, the inval should never be executed on the requesting PID, though it must be executed on all other processors in the cluster that share the same sharing mask bit. See Section 12.12.3.3 for more detail on the fanin/fanout operation.
This command is used only on the interprocessor channels.

12.12.4 RESPONSE CHANNEL Message Details

12.12.4.1 BlkShared
Data returned in response to a ReadReq or ReadSharedReq command. The data should be deposited into the cache in the shared state. The command header contains a MAF#, wrap, and routing information to reach the destination. The data in the packet must always be wrapped in the octa-word order specified by the corresponding request. The wrap bits must always match this wrap order.
This command can be returned in response to a ReadReq or ReadSharedReq request. This command is used only on the interprocessor channels.

12.12.4.2 BlkExclusiveCnt
Data returned in response to a ReadReq or ReadMod command. The data should be put into the cache in the exclusive-clean state in response to a ReadReq, and in the dirty state in response to a ReadMod generated by a store. The command header contains a MAF#, wrap, a coherence count, and routing information to reach the destination. The data in the packet must always be wrapped in the octa-word order specified by the corresponding request. The wrap bits must always match this wrap order.
A BlkExclusiveCnt with a non-zero count can be generated when a ReadMod finds a block in the shared state. The count is the number of InvalAcks to expect in response at the requestor. (Except when solving for the "Local CBOX Too Far Ahead" problem described below.) BlkExclusiveCnt can be returned in response to a ReadReq or ReadModReq request. A non-zero count can only occur in response to a ReadMod (except when solving for the "Local CBOX Too Far Ahead" problem described below).
This command is used on both I/O and interprocessor channels.

12.12.4.3 BlkInval
Data returned in response to a FetchReq command. The data should not be cached.
The command header contains a MAF#, wrap, and routing information to reach the destination. The data in the packet must always be wrapped in the octa-word order specified by the corresponding request. The wrap bits must always match this wrap order.
This command may be returned in response to a FetchReq request. This command is used on both I/O and interprocessor channels.

12.12.4.4 BlkIO
Data is returned in response to a read I/O command. The command header contains a MAF#, wrap, and routing information to reach the destination. The data is laned, and data bytes in the packet are positioned according to their address. The wrap bits field must be zero for BlkIO commands. The coherency count field must be zero for BlkIO commands.
Table 12-14 shows the location of the useful data in the 16 longwords contained in a BlkIO in response to a fully-merged RdQW or RdIPR. N.L is the lower longword of quadword N, N.H is the upper longword of quadword N. X is unused data. In the case of a RdIPR, the two longwords indicated by a single bit in the mask and by QWADD(5,3) contain all the useful information. The first longword contains bits 0-31 and the next contains bits 32-63.

Table 12-14 Location of Useful Data in a BlkIO in Response to a Fully-Merged RdQW or RdIPR
QWADD(5:3)  Mask  Ordered Block Data for RdQWs or RdIPR (in Longwords)
0  0xFF  0.L,0.H,1.L,1.H,2.L,2.H,3.L,3.H,4.L,4.H,5.L,5.H,6.L,6.H,7.L,7.H
1  0xFE  X.X,X.X,1.L,1.H,2.L,2.H,3.L,3.H,4.L,4.H,5.L,5.H,6.L,6.H,7.L,7.H
2  0xFC  X.X,X.X,X.X,X.X,2.L,2.H,3.L,3.H,4.L,4.H,5.L,5.H,6.L,6.H,7.L,7.H
3  0xF8  X.X,X.X,X.X,X.X,X.X,X.X,3.L,3.H,4.L,4.H,5.L,5.H,6.L,6.H,7.L,7.H
4  0xF0  X.X,X.X,X.X,X.X,X.X,X.X,X.X,X.X,4.L,4.H,5.L,5.H,6.L,6.H,7.L,7.H
5  0xE0  X.X,X.X,X.X,X.X,X.X,X.X,X.X,X.X,X.X,X.X,5.L,5.H,6.L,6.H,7.L,7.H
6  0xC0  X.X,X.X,X.X,X.X,X.X,X.X,X.X,X.X,X.X,X.X,X.X,X.X,6.L,6.H,7.L,7.H
7  0x80  X.X,X.X,X.X,X.X,X.X,X.X,X.X,X.X,X.X,X.X,X.X,X.X,X.X,X.X,7.L,7.H

Table 12-15 shows the location of the useful data in response to fully-merged RdLWs. The numbers shown are the (up to 8) merged longwords. X is unused data.

Table 12-15 Location of Useful Data in Response to Fully-Merged RdLW's
QWADD(5:3)  Mask  Ordered Block Data for RdLWs (in Longwords)
000  0xFF  0,1,2,3,4,5,6,7,X,X,X,X,X,X,X,X
000  0xFE  X,1,2,3,4,5,6,7,X,X,X,X,X,X,X,X
001  0xFC  X,X,2,3,4,5,6,7,X,X,X,X,X,X,X,X
001  0xF8  X,X,X,3,4,5,6,7,X,X,X,X,X,X,X,X
010  0xF0  X,X,X,X,4,5,6,7,X,X,X,X,X,X,X,X
010  0xE0  X,X,X,X,X,5,6,7,X,X,X,X,X,X,X,X
011  0xC0  X,X,X,X,X,X,6,7,X,X,X,X,X,X,X,X
011  0x80  X,X,X,X,X,X,X,7,X,X,X,X,X,X,X,X
100  0xFF  X,X,X,X,X,X,X,X,0,1,2,3,4,5,6,7
100  0xFE  X,X,X,X,X,X,X,X,X,1,2,3,4,5,6,7
101  0xFC  X,X,X,X,X,X,X,X,X,X,2,3,4,5,6,7
101  0xF8  X,X,X,X,X,X,X,X,X,X,X,3,4,5,6,7
110  0xF0  X,X,X,X,X,X,X,X,X,X,X,X,4,5,6,7
110  0xE0  X,X,X,X,X,X,X,X,X,X,X,X,X,5,6,7
111  0xC0  X,X,X,X,X,X,X,X,X,X,X,X,X,X,6,7
111  0x80  X,X,X,X,X,X,X,X,X,X,X,X,X,X,X,7

Table 12-16 lists all of the useful data in the quadword specified by QWADD(5,3) of a BlkIO packet that is a response to RdBytes. (The other 7 quadwords in the packet are unused, so they are not listed in the table.) The table byte values are listed from low-order bytes (left) to high-order bytes (right).
X is unused data.

Table 12-16 Location of Useful Data in Quadword Specified by QWADD(5,3) of a BlkIO Packet
Mask  Quadword of Data for RdBytes (in Bytes)
0x03  0,1,X,X, X,X,X,X
0x02  X,1,X,X, X,X,X,X
0x0C  X,X,2,3, X,X,X,X
0x08  X,X,X,3, X,X,X,X
0x30  X,X,X,X, 4,5,X,X
0x20  X,X,X,X, X,5,X,X
0xC0  X,X,X,X, X,X,6,7
0x80  X,X,X,X, X,X,X,7

This command may be returned in response to a RdBytes, RdLWs, RdQWs, or RdIPR request. This command is used on both I/O and interprocessor channels. An I/O device may source this packet.

12.12.4.5 Victim
Data written back to memory because it was dirtied. The block must have been in the exclusive state in the directory (when non-local). The command header contains an address. A Victim writes the block into memory in the invalid state, as well as CAMing the DIFT and updating DIFT state.
The Victim command is used on both I/O and interprocessor channels. An I/O device may source this packet.
This command is used only on the interprocessor channels. 12.12.4.9 lnvaltoDirtyRespCnt Response to a InvaltoDirtyReq command. The command contains the requestor MAF #,a coherence count, and routing information to reach the destination. Compaq Confidential 12-30 Cache Coherence Protocol Processing 5 Jcwwiry 2001 - Subject To Change Protocol Message Descriptions This command is a possible response to a InvaltoDirty Req request. Note that InvaltoDirtyRespCnt responses to a DMA device must ensure that memory has an up-to-date copy of the block. See Section 13.6. These commands are used on both the 1/0 and interprocessor channels. 12.12.4.1 o SharedtoDirtySuccessCnt Success response to a SharedtoDirtyReq or a SharedtoDirtySTCReq. The command contains the requestor MAF #,coherence count, and routing information to reach the destination. This command is a possible response to a SharedtoDirty Req or a SharedtoDirtySTCReq. SharedtoDirtySuccessCnt is generated in response to a SharedtoDirtyReq whenever the block is in shared state and the source processor is on the sharing list or the sharing mask bit corresponding the source processor is set. SharedtoDirtySuccessCnt is generated in response to a SharedtoDirtySTCReq only when the block is in Sharedl or Shared2 state (or in SharedM state when there is only one processor per mask bit) and the source processor is on the sharing list. The directory may incorrectly succeed a SharedtoDirtyReq, but may not incorrectly succeed a SharedtoDirtySTCReq. See Sections 12.13.6 and and 12.13.8. This command is used only on the interprocessor channels. 12.12.4.11 SharedtoDirtyProbCnt Probable success response to a SharedtoDirtySTCReq (the source processor must be the final arbiter of success). The command contains the requestor MAF #,a coherence count, and routing information to reach the destination. This is a response to a SharedtoDirtySTCReq when the block is in SharedM state and the mask bit corresponding to the source processor is set. See Sections 12.13.6 and 12.13.8. Note that Sharedinval's are not sent out together with SharedtoDirtyProbCnt; Rather, the Sharedinval's are sent out only with SharedtoDirtyComplete receipt. This command is used only on the interprocessor channels. 12.12.4.12 SharedtoDirtyFail Failure response to a SharedtoDirtyReq or SharedtoDirtySTCReq. The command contains the requestor MAF # and routing information to reach the destination. This command is a possible response to a SharedtoDirtyReq or SharedtoDirtySTCReq requests. A SharedtoDirty must fail when the block is not in shared state or if the source processor is not on the sharing list or the sharing mask bit corresponding to the source processor is not set. This command is used only on the interprocessor channels. 12.12.4.13 NXMResp Indicates the request referenced an area of memory that does not exist. The response contains the requestor MAF # and routing information to reach the destination. This command is a possible response to a ReadReq, ReadSharedReq, ReadModReq, FetchReq, InvaltoDirtyReq, or Rd* (IO read) request. Of this list, on 21364 the InvaltoDirtyReq (from a processor, not a DMA engine) is one one that cannot be generated Compaq Confidential 5 January 2001 --·Subject To Change Cache Coherence Protocol Processing 12-31 Protocol Message Descriptions speculatively, and therefore indicates a software error. Software may use NXM responses to RdBytes requests to detect the presence of 1/0 devices. For all the other requests a NXMResp is not known to be an error. 
This command is used on both the 1/0 and interprocessor channels. An 1/0 device may source this packet. 12.12.4.14 ERRResp Indicates the request referenced an area of memory that encountered a hardware error. The response contains the requestor MAF # and routing information to reach the destination. This command is a possible response to a ReadReq, ReadSharedReq, ReadModReq, FetchReq,InvaltoDirtyReq, SharedtoDirty*Req, or RdBytes request. This command is used on both the 1/0 and interprocessor channels. An 1/0 device may source this packet. 12.12.4.15 lnvalAck A command sent to the requesting processor in response to a SharedlnvalSingle or SharedlnvalBroadcast command at a (possible) sharer. The command contains a MAF # and routing information to reach the destination. When a processor wishes to write a (non-local) shared block (a ReadMod, SharedtoDirty*, InvaltoDirty, or SharedtoDirtyComplete), two things normally happen: (1) a *Cnt response is returned to the requestor, and (2) a Sharedlnval* (forward) is sent to all sharers. The coherence count is the number of InvalAcks to expect, and also the number of Sharedlnval * messages sent. Note that the router broadcasts invalidates and sends one InvalAck in response to a SharedlnvalBroadcast message. This command is a possible response to a ReadModReq, SharedtoDirtyReq, SharedtoDirtySTReq, or InvaltoDirtyReq request. Note that SharedtoDirtySTCReq invals are sometimes delayed until receipt of a SharedtoDirtyComplete to avoid unnecessary invalidates. See Sections 12.13.6 and 12.13.8. This command is used on both the 1/0 and interprocessor channels. An 1/0 device may source this packet. 12.12.4.16 WrlOAck A command sent back to the source of an 1/0 write in response to each 1/0 write command. The command contains a requesting WRIO #. This response indicates that the write is MB complete. The source processor is expected to track outstanding 1/0 write requests. This command is the response to a Wrbytes, WrLWs, WrQWs, or WrIPR request. This command is used on both the 1/0 and interprocessor channels. An 1/0 device may source this packet. 12.12.4.17 WrlONAck This command is identical to the WrIOAck command except for the following: 1. It can only be returned in response to a write of the RBOX_INTA register, Compaq Confidentia I 12-32 Cache Coherence Protocol Processing 5 Jt1nw~ry 2001 -~Subject To Change Protocol Message Descriptions 2. In this special case the WrIONAck indicates that the interrupt request was not accepted. See Section 13.5 for more information on interrupts. This command is the response to a WrIPR request. This command is used on both the 1/0 and interprocessor channels. 12.12.4.18 VictimClean A command sent to the directory to release exclusive access to a clean block (much like a victim with no data). The command contains the address to release. This command updates the directory state to invalid. It also cams the DIFT and may update DIFT state. This command is used on both the 1/0 and interprocessor channels. An 1/0 device may source this packet. 12.12.4.19 VictimCleantoShared A command sent to the directory to release exclusive access to a clean block (much like a victim with no data) but leave the block in shared state. The command contains the address to release and the sharing PID of the (prior) exclusive owner of the block. VictimCleantoShared is similar to ForwardAckShared except that a ForwardMiss may (or may not) also be in flight to the DIFT. 
ForwardAckShared is sure to find a DIFT entry waiting when it returns to the directory, whereas VictimCleantoShared is not. VictimCleantoShared is similar to VictimClean except the final block state should be shared rather than invalid. VictimCleantoShared is similar to VictimtoShared except the data need not be written back. This command is used only on the interprocessor channels. 12.12.4.20 ForwardAckExcl Directory/DIFT update as a result of a ReadModForward or InvaltoDirtyForward. The command contains an address. This command is similar to the VictimAckExcl command except that the block need not be written back. The final state of the block at the directory should be exclusive owned by the requesting PID. It is assumed that the requesting PID was stored in the DIFT at the time the DIFT entry was created and can be extracted in response to the ForwardAckExcl address cam. This command is used only on the interprocessor channels. 12.12.4.21 ForwardAckShared Directory/DIFTupdate as a result of a ReadForward or ReadSharedForward. The command contains an address and the sharing PID of the (prior) exclusive owner of the block. 5 January 2001 -~Subject To Change Compaq Confidential Cache Coherence Protocol Processing 12-33 Protocol Message Descriptions This command is similar to the VictimAckShared command except that the block is not written back. ForwardAckShared implies the final state of the block should be shared, with both the prior exclusive owner and the requestor (if the requestor is non-local) on the sharing list. It is assumed that the requesting PID was stored at the time the DIFf entry was created and can be extracted from the DIFT to update the sharing list. This command is used only on the interprocessor channels. 12.12.4.22 ForwardMiss This command indicates that a forwarded request did not find the block at the exclusive owner of the block. The command contains the address. This command is generated in the unlikely event that a forwarded request does not find the block at the exclusive owner of the block (this can only happen when the victim was/is in flight from the exclusive owner of the block to the directory). The DIFT can determine the occurence of this unlikely event when it sees a Victim, VictimClean, or ForwardMiss while a forward request is pending to the block. When this unlikely case does happen, the DIFf regenerates the request (after waiting for both the Victim* and ForwardMiss responses to return) as if the block were originally found invalid in the directory. See Sections 12.13.1 and 12.13.2. This command is used on both I/O and the interprocessor channels. An I/O device may source this packet. 12.12.4.23 SharedtoDirtyComplete This command is a response from the source processor to the DIFT indicating that the source processor agreed that the SharedtoDirty should succeed. The command contains the address. This command results when a SharedtoDirtyProbCnt was successful. The result of receipt of a SharedtoDirtyComplete is that SharedlnvalBroadcast's are launchedby the DIFT. See Sections 12.13.6 and 12.13.8 for the usage of this command. This command is used only on the interprocessor channels. 12.12.4.24 SharedtoDirtyRelease This command is a response from the source processor to the DIFT indicating that the source processor disagreed that the SharedtoDirty should succeed. The command contains the address. This command results when a SharedtoDirtyProbCnt was unsuccessful. See Sections 12.13.6 and 12.13.8 for the usage of this command. 
This command is used only on the interprocessor channels.
12.12.5 SPECIAL CHANNEL Message Details
12.12.5.1 NZNOP
This command is used to fill idle slots on the interconnect.
The ALERT wires are used for system broadcast interrupts. These signals fan out to all 5 ports (north, south, east, west, I/O). The ALERT wires also contain the SYNCH functionality. The SYNCH signals need only be used on the compass points (north, south, east, and west). Three of the four ALERT wires are currently allocated:
Table 12-17 ALERT Wire Allocation
ALERT Wire    Type of Connection
ALERT[0]      Hardware ALERT
ALERT[1]      Software ALERT
ALERT[2]      Hardware SYNCH
ALERT[3]      Unused
An ALERT is a (high-priority) broadcast interrupt. When a 21364 receives an ALERT and does not already have its ALERT bit set, it sets its ALERT bit and sends out an ALERT pulse on all five output ports. A 21364 can receive an ALERT in the following ways:
• It can receive an ALERT assertion on one of the five input ports
• It can receive an IPR write indicating that it should launch an ALERT
• It can suffer a hardware error that is expected to produce an ALERT (see Rbox Port Config IPR). (This is only true for a hardware ALERT.)
The current ALERT status is indicated in the Rbox_INT register. Software clears the (local) ALERT status (and allows the receipt of another ALERT) by a write to this register. A software ALERT is initiated by a write to the Rbox_IREQ register. This sets the local SW ALERT bit and causes the ALERT to propagate throughout the network.
A SYNCH forces the SYNCH counter(s) to remain in synchronization with the counters in the other 21364's in the system. (See the Rbox Config register for a description of the function of the SYNCH counter(s).) The following things may cause the SYNCH counter(s) to enter the SYNCH interval. Note that the period counter is re-initialized at each entry into the SYNCH interval in order to keep the counters in sync. In both of these cases the local 21364 sends out a SYNCH pulse on all four compass points if it is not already in the SYNCH interval:
• The local period has been covered (i.e., the period counter has stepped through the required number of cycles since the last SYNCH)
• A SYNCH is received on one of the four input ports
This command is used on both the I/O and interprocessor channels. An I/O device may source this packet.
12.12.5.2 SpecialInvalBroadcast
This command is used to complete a SharedInvalBroadcast. The message is broadcast among the processors that share the same sharing mask bit. The command contains the full address (DPID and offset) and the requesting processor's PID.
SpecialInvalBroadcast has many special properties that are different from other messages. See Section 12.12.3.3 for more details.
This command is used only on the interprocessor channels.
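The ALERT behavior described in Section 12.12.5.1 amounts to a simple set-and-broadcast rule. The following is a minimal sketch of that rule in C; the port names and the pulse helper are stand-ins, not actual hardware or IPR interfaces.

#include <stdbool.h>
#include <stdio.h>

enum { PORT_N, PORT_S, PORT_E, PORT_W, PORT_IO, NUM_PORTS };

static bool alert_bit;                     /* local ALERT status (visible via Rbox_INT) */

static void send_alert_pulse(int port)     /* stand-in for driving the ALERT wire */
{
    printf("ALERT pulse on port %d\n", port);
}

/* Called on an incoming ALERT pulse, an IPR write requesting an ALERT, or a
 * qualifying hardware error. */
void receive_alert(void)
{
    if (alert_bit)
        return;                            /* already alerted: do not re-broadcast */
    alert_bit = true;
    for (int p = 0; p < NUM_PORTS; p++)
        send_alert_pulse(p);               /* fan the pulse out on all five output ports */
}

/* Software re-arms the node by clearing the status with a write to Rbox_INT. */
void clear_alert(void) { alert_bit = false; }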
12.13 Protocol Race Descriptions
12.13.1 Early Forward Race
A forward arrives while there is an outstanding ReadReq or ReadModReq (for example, an exclusive block has not been returned) or if not all InvalAcks have arrived. This may be caused by either: (A) the block is returning from the directory (or another processor) and the forward is for a subsequent request, but the forward beat the block, or (B) the forward was for a prior copy of the block (the victim is on its way to the directory).
1. The forward sets the "force shared" bit ("force vic" if ReadModForward) in the MAF and sends back a ForwardMiss to the DIFT.
2. When the block returns and the coherence count is zero, the MAF entry is not deleted until it forces a Victim, VictimClean, VictimtoShared, or VictimCleantoShared, whichever is appropriate.
3. The DIFT waits for both the Victim (or VictimClean, VictimtoShared, or VictimCleantoShared) and the ForwardMiss before proceeding.
4. Then, the DIFT entry returns data to the requestor and updates the directory.
This solution conservatively victimizes the block even though in some cases (for example, case (B) above) it would not have been necessary. The final directory state should be set as if the state were invalid before the last request when Victim or VictimClean are produced. The final directory state should be set as if the state were shared, with the prior exclusive owner on the sharing list, when VictimtoShared or VictimCleantoShared are produced. If a Victim or VictimtoShared was created, the request must get the data from the Victim or VictimtoShared command, and the memory copy must be updated to this value also. Otherwise, the memory copy of the data was/is correct.
12.13.2 Late Forward Race
A forward arrives to a block that has been recently victimized. This is similar to case (B) of the above "Early Forward" race, except it is simpler in that there is no outstanding request to the block.
1. The forward sees the block is not in the cache and a ForwardMiss is returned to the directory.
2. The DIFT waits for both the Victim (or VictimClean) and the ForwardMiss before proceeding.
3. Then, the DIFT entry returns data to the requestor and updates the directory.
With a Victim, the final directory state is set as if the state was invalid before the new request. With a VictimClean, however, the final directory state is shared (due to speculative directory writes). If a Victim was created, the request must get the data from the Victim command and the memory copy must also be updated to this value. Otherwise, the memory copy of the data was correct.
12.13.3 Dual Victim Race
A Victim (or VictimClean) arrives from the requesting node before the forward acknowledgement (VictimAckExcl or ForwardAckExcl) arrives from the node the forward went to.
1. The Victim sets the "block written" and "set invalid" bits (VictimClean sets only "set invalid") as it CAMs the DIFT.
2. VictimAckExcl does not write to memory when the "block written" bit is set.
3. VictimAckExcl and ForwardAckExcl set the directory state to invalid whenever the "set invalid" bit is set.
The final directory state must be invalid. If it was a Victim command, the final memory copy of the data must be the data from the Victim command. If it was a VictimClean command and the ack was VictimAckExcl, then the final memory copy must be the data from the VictimAckExcl.
12.13.4 Early InvalAck Race
An InvalAck arrives at the requestor before the *Count response arrives.
• The CMAF must keep count of early InvalAcks (i.e., the count must be able to go negative) before the count from the *Count command is added in.
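A minimal sketch of the CMAF bookkeeping implied by this rule, with illustrative names (not the actual CMAF structure):

typedef struct {
    int coh_count;     /* signed: decremented per InvalAck, may go below zero */
    int have_cnt;      /* set once the *Cnt response has delivered its count  */
} cmaf_entry_t;

void cmaf_inval_ack(cmaf_entry_t *e)
{
    e->coh_count--;                      /* may run negative before *Cnt arrives */
}

void cmaf_cnt_response(cmaf_entry_t *e, int coherence_count)
{
    e->coh_count += coherence_count;     /* add in the expected number of acks */
    e->have_cnt = 1;
}

int cmaf_invals_complete(const cmaf_entry_t *e)
{
    return e->have_cnt && e->coh_count == 0;   /* all expected InvalAcks received */
}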
12.13.5 Early InvalShared Race
An InvalShared arrives while a ReadReq or ReadSharedReq is still outstanding (for example, the block hasn't yet been returned). This may be caused by either: (1) the block is returning from the directory and the InvalShared is for a subsequent request, but the InvalShared beat the block, or (2) the InvalShared was for a prior copy of the block (either in this processor or in another processor that shares the same sharing mask bit at the directory).
a. The InvalShared sets the "force inval" bit in the CMAF (and fails set dirties) and sends an InvalAck.
b. If the block returns shared, put the data in the cache (shared) but do not delete the MAF entry until it has invalidated the block.
This solution conservatively invalidates the block even though in some cases (for example, case (2) above) it would not be necessary. The final state of the block at this processor is invalid.
12.13.6 Wrong SharedtoDirtySuccess Race
A SharedtoDirtySuccess arrives from the directory even though the cache copy of the block has been invalidated (and probably the "force inval" bit is set?). The directory incorrectly responded success in this case. This can happen when the block was written (thus causing an InvalShared) while the set dirty was outstanding, and then another processor that shares a sharing mask bit at the directory received a shared copy of the block, and only then did the directory see the set dirty, which it responds success to since the corresponding sharing mask bit is set for the source of the set dirty.
• Fail the set dirty in this case (and make sure the cache is invalid)
• But do not release the MAF until a VictimClean is sent to the directory
Once the VictimClean reaches the directory, the final state of the block is invalid. In response to the set dirty failure, the processor will generate a new ReadModReq MAF (unless it was due to a store conditional).
12.13.7 A Note on SharedtoDirties and their Resolution
The ultimate arbiter of StoD success is always the requesting MAF entry (and the cache at that processor). If the cached copy of the block has been deleted between the time that a request is launched and resolved, the StoD fails. If the cached copy has not been invalidated, the StoD succeeds.
For local StoDs (i.e., StoDs from the local processor), success or failure of the StoD is determined at the time that the request enters the DIFT and CAMs across the local MAF. If no prior reference has invalidated the block, the StoD succeeds. For StoDs from a remote processor, success or failure is determined at the time the response from the directory is received at the requesting processor.
In some cases the directory can prove that a StoD has succeeded or failed. The request will certainly fail when the block is not in shared state, the processor is not on the sharing list, or the corresponding sharing mask bit is not set. The request will certainly succeed if the block is in Shared1 or Shared2 state and the requesting processor is in the sharing list. But when the block is in SharedM state and the corresponding sharing mask bit is set, the directory cannot be certain whether the request should succeed or fail since it may be that another processor that shares the same mask bit has the up-to-date copy of the block, and the requesting processor's copy has been invalidated. See Section 12.13.6 for more details.
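A sketch of the directory-side resolution just described. The split of the failure conditions across the Shared1/Shared2 and SharedM cases is an interpretation of the text, and the state and function names are illustrative:

typedef enum { DIR_INVALID, DIR_EXCLUSIVE, DIR_SHARED1, DIR_SHARED2, DIR_SHAREDM } dir_state_t;
typedef enum { STOD_FAIL, STOD_SUCCEED, STOD_UNCERTAIN } stod_verdict_t;

stod_verdict_t resolve_stod(dir_state_t state, int requestor_on_list, int mask_bit_set)
{
    if (state != DIR_SHARED1 && state != DIR_SHARED2 && state != DIR_SHAREDM)
        return STOD_FAIL;                      /* block is not in a shared state */
    if ((state == DIR_SHARED1 || state == DIR_SHARED2) && !requestor_on_list)
        return STOD_FAIL;                      /* requestor is not on the sharing list */
    if (state == DIR_SHAREDM && !mask_bit_set)
        return STOD_FAIL;                      /* requestor's sharing mask bit is clear */

    if (state == DIR_SHARED1 || state == DIR_SHARED2)
        return STOD_SUCCEED;                   /* exact sharer list and requestor is on it */

    return STOD_UNCERTAIN;                     /* SharedM with mask bit set: cannot decide */
}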
12.13.8 Special Store-Conditional Support
Even though the directory cannot determine the success or failure of a StoD for certain in some cases, it normally optimistically assumes that the StoD will succeed and acts accordingly. If the directory turns out to be incorrect in this assumption, the source processor corrects the error. See Section 12.13.6.
General use of this optimistic behavior could lead to a store-conditional livelock problem. The problem is that in optimistically assuming the StoD will succeed, the directory ends up performing unneeded invalidates in the cases when it was wrong. It may be possible to get into a situation where no store-conditionals could ever succeed if there are too many unneeded invalidates.
We resolve this problem by supporting an optimistic and a conservative mode for store-conditional SharedtoDirtys. When in optimistic mode, a store-conditional generates a normal (i.e., optimistic) SharedtoDirty. When in conservative mode, a store-conditional generates a SharedtoDirtySTC (i.e., a conservative StoD). When a processor detects that a StoD was incorrectly succeeded by the directory, it enters conservative mode. This conservative mode is controlled by a counter. The counter is set to seven whenever a StoD that was generated from a store-conditional incorrectly succeeds. The counter is decremented on every successful SharedtoDirtySTC. Conservative mode is in effect when the counter is not zero. When CBOX_CTL[FORCE_STXC_CONS] is set, store-conditional SharedtoDirtys are always conservative.
The directory never optimistically succeeds a StoDSTC; it only responds success when it can prove that the StoDSTC will succeed. In cases where the directory cannot determine success/fail exactly, a SharedtoDirtyProbCnt is returned to the requesting processor, the invals are not performed, and a DIFT entry is retained to wait for the success/failure determination from the source. After the source processor receives the SharedtoDirtyProbCnt, it responds to the DIFT with a SharedtoDirtyComplete or SharedtoDirtyRelease. Upon receipt of the SharedtoDirtyComplete, the directory/DIFT sends out the invals and updates the directory state. Upon receipt of a SharedtoDirtyRelease, the DIFT entry is deallocated.
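A minimal sketch of the conservative-mode counter described above; the structure and function names are illustrative, but the counter behavior and the FORCE_STXC_CONS override follow the text:

typedef struct {
    unsigned stxc_cons_count;   /* conservative while non-zero */
    unsigned force_stxc_cons;   /* CBOX_CTL[FORCE_STXC_CONS] */
} stxc_mode_t;

/* A StoD generated from a store-conditional was incorrectly succeeded by the
 * directory: enter conservative mode. */
void stxc_wrong_success(stxc_mode_t *m) { m->stxc_cons_count = 7; }

/* A SharedtoDirtySTC completed successfully: count down toward optimistic mode. */
void stxc_cons_success(stxc_mode_t *m)
{
    if (m->stxc_cons_count > 0)
        m->stxc_cons_count--;
}

/* Decide which request a store-conditional should generate. */
int stxc_use_conservative(const stxc_mode_t *m)
{
    return m->force_stxc_cons || m->stxc_cons_count != 0;
}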
12.13.9 Local CBOX Too Far Ahead
Since the directory state is generally not updated when a block is loaded into the local cache, all remote requests arriving at a destination 21364 must probe the local L2 cache tags as well as the directory state to determine where the response should come from (L2 or memory) and what the final state of the block is. As the remote request arrives it is queued in the DIFT and the CBOX probe queue. Eventually, the CBOX responds to the DIFT after checking the L2 tags. The DIFT does not perform any coherence actions other than reading the current directory state until it receives the CBOX response. Generally, the CBOX runs ahead of the DIFT since the L2 tags are faster than memory and the DIFT must wait for the CBOX response. Consider the following scenario.
In the beginning a block is owned by a remote processor. A local ReadMod is forwarded to the remote processor by the (local) DIFT. The remote processor responds with a BlkExclusiveCnt and (A) a ForwardAckExcl. The (A) ForwardAckExcl gets stalled in the network but the BlkExclusiveCnt arrives at the destination. The local CBOX evicts the block, so there is a (B) Victim being transferred to the DIFT. Meanwhile another remote request arrives, which the CBOX responds to the DIFT with (C) a ForwardMiss. At this point there are three things in flight to the DIFT: (A) the ForwardAckExcl from the original forward, (B) the Victim from the local CBOX, and (C) the ForwardMiss from the local CBOX. It is easy for the DIFT to get confused and think that the Victim applies to the original forward, leaving the DIFT waiting for the Victim pair to the (C) ForwardMiss.
The solution to this race is to prevent the local CBOX from sending the (B) Victim until after the DIFT has received the (A) ForwardAckExcl. The remote CBOX sets the coherence count for the BlkExclusiveCnt to 1. This causes the local CBOX to wait. Once the DIFT receives the ForwardAckExcl it responds to the local CBOX (internally) with an "InvalAck" that allows the CBOX to proceed with this block.
Specifically, the remote CBOX sets the coherence count to one to solve this instance whenever (1) it sends a response to a requestor resulting from a forward, (2) with a matching DPID and RPID (and the request did not originate from the I/O ASIC at the node), and (3) where the response gives the requestor exclusive access to the block. The DIFT sends an "InvalAck" to the CBOX after it receives a ForwardAckExcl or VictimAckExcl when the request source is local.
13 Router Interface - the Rbox
Introductory information about the Rbox is located in Section 2.7.2. Information about the Rbox IPRs is located in Section 16.5.
13.1 Protocol Messages
The following direction symbols are used in Tables 13-1 through 13-5:
Symbol    Meaning
T         Sent to IO7 ASIC
F         Received from IO7 ASIC - can be sent to a 21464 processor only
G         Received from IO7 ASIC - can be sent to an IO7 ASIC only
H         Received from IO7 ASIC - can be sent to a 21464 processor or an IO7 ASIC
13.1.1 Messages on the IO_CHANNEL
Table 13-1 lists messages on the IO_CHANNEL.
Table 13-1 Messages on the IO_CHANNEL Command Direction Contents RdBytes Packet Format (TBD Ticks) RdBytes T,G Route(16), Dealloc(9), Opcode(8), Block Address(32), req MAF(6), req PID(l 1), QW add(3), Mask(8) RdLWs T,G Route(16), Dealloc(9), Opcode(8), Block Address(32), req MAF(6), req PID(l 1), QW add(l), Mask(8) RdQWs T,G Route(16), Dealloc(9), Opcode(8), Block Address(32), req MAF(6), req PID(l 1), QW add(O), Mask(8) RdIPR F Route(16), Dealloc(9), Opcode(8), Block Address(32), req MAF(6), req PID(l 1), QW add(O), Mask(8) WrBytes Packet Format (19 Ticks) WrBytes T,G Route(16), Dealloc(9), Opcode(8), Block Address(32), reqWRI0(6), reqPID(ll), Mask(8), QWadd(3), Data WrLWs T,G Route(16), Dealloc(9), Opcode(8), Block Address(32), reqWRI0(6), reqPID(ll), Mask(8), QWadd(l), Data WrQWs T,G Route(16), Dealloc(9), Opcode(8), Block Address(32), reqWRI0(6), reqPID(ll), Mask(8), QWadd(O), Data WrIPR F Route(16), Dealloc(9), Opcode(8), Block Address(32), reqWRI0(6), reqPID(ll), Mask(8), QWadd(O), Data Compaq Confidential 13-2 Router Interface - the Rbox 5 Jmumry 2001 - Subject To Change Protocol Messages 13.1.2 Messages on the REQUEST_CHANNEL Table 13-2 lists messages on the REQUEST_CHANNEL. Table 13-2 Messages on the REQUEST_CHANNEL Command Direction Contents Request Packet Format (3 Ticks) ReadReq Route(16), Dealloc(9), Opcode(8), Stripe(l), Block Address(31), req MAF(6), req PID(ll), Wrap(2) ReadSharedReq Route(16), Dealloc(9), Opcode(8), Stripe(l), Block Address(31), req MAF(6), req PID(ll), Wrap(2) ReadModReq F Route(16), Dealloc(9), Opcode(8), Stripe(l), Block Address(31), req MAF(6), req PID(ll), Wrap(2) FetchReq F Route(16), Dealloc(9), Opcode(8), Stripe(l), Block Address(31), req MAF(6), req PID(ll), Wrap(2) SharedtoDirtyReq Route(16), Dealloc(9), Opcode(8), Stripe(l), Block Address(31), req MAF(6), req PID(ll) SharedtoDirtySTCReq Route(16), Dealloc(9), Opcode(8), Stripe(l), Block Address(31), req MAF(6), req PID(ll) InvaltoDirtyReq F Route(16), Dealloc(9), Opcode(8), Stripe(l), Block Address(31), req MAF(6), req PID(ll) 13.1.3 Messages on the FORWARD_CHANNEL Table 13-3 lists messages on the FORWARD_CHANNEL. Table 13-3 Messages on the FORWARD_CHANNEL Command Direction Contents Forward Packet Format (3 Ticks) ReadForward T Route(16), Dealloc(9), Opcode(8), DPID(lO), Stripe(l), Block Address(31), req MAF(6), req PID(ll), Wrap(2) ReadSharedForward T Route(16), Dealloc(9), Opcode(8), DPID(lO), Stripe(l), Block Address(31), req MAF(6), req PID(ll), Wrap(2) ReadModForward T Route(16), Dealloc(9), Opcode(8), DPID(lO), Stripe(l), Block Address(31), req MAF(6), req PID(ll), Wrap(2) FetchForward T Route(16), Dealloc(9), Opcode(8), DPID(lO), Stripe(l), Block Address(31), req MAF(6), req PID(ll), Wrap(2) InvaltoDirtyForward T Route(16), Dealloc(9), Opcode(8), DPID(lO), Stripe(l), Block Address(31), req MAF(6), req PID(ll) SharedlnvalSingle Route(16), Dealloc(9), Opcode(8), DPID(lO), Stripe(l), Block Address(31), req MAF(6), req PID(ll) SharedlnvalBroadcast Route(16), Dealloc(9), Opcode(8), DPID(lO), Stripe(l), Block Address(31), req MAF(6), req PID(ll) Compaq Confidentia I 5 Jam.1ary 2001 - Subject To Change Router Interface - the Rbox 13-3 Protocol Messages 13.1.4 Messages on the RESPONSE_CHANNEL Table 13-4 lists messages on the RESPONSE_CHANNEL. 
Table 13-4 Messages on the RESPONSE_CHANNEL Command Direction Contents Block Response Packet (18 Ticks) Route(16), Dealloc(6), Opcode(8), req MAF(6), Wrap(2), Data BlkShared BlkExclusiveCnt T Route(16), Dealloc(6), Opcode(8), req MAF(6), Wrap(2), CohCnt(5), Data Blklnval T Route(16), Dealloc(6), Opcode(8), req MAF(6), Wrap(2), Data BlkIO T,H Route(16), Dealloc(6), Opcode(8), req MAF(6), Wrap(2), Data Victim Block Response Packet (19 Ticks) Victim F Route(16), Dealloc(9), Opcode(8), Stripe(l), Block Address(31), Source PID(l 1), Data VictimtoShared Route(16), Dealloc(9), Opcode(8), Stripe(l), Block Address(31), Source PID(lO), Data VictimAckExcl Route(16), Dealloc(9), Opcode(8), Stripe(l), Block Address(31), Data VictimAckShared Route(16), Dealloc(9), Opcode(8), Stripe(l), Block Address(31), Source PID(lO), Data Compaq Confidential 13-4 Router Interface - the Rbox 5 Jc1nuc1ry 2001 ··· Subject To Change Protocol Messages Table 13-4 Messages on the RESPONSE_CHANNEL (Continued) Command Direction Contents No Block Response Packet (2 Ticks) InvaltoDirtyRespCnt T Route(16), Dealloc(6), Opcode(8), req MAF#(6), CohCnt(5) SharedtoDirtySuccessCnt Route(16), Dealloc(6), Opcode(8), req MAF#(6), CohCnt(5) Sharedto Dirty ProbCnt Route(16), Dealloc(6), Opcode(8), req MAF#(6), CohCnt(5) SharedtoDirtyFail Route(16), Dealloc(6), Opcode(8), req MAF#(6) NXMResp T,H Route(16), Dealloc(6), Opcode(8), req MAF#(6) ERRResp T,H Route(l6), Dealloc(6), Opcode(8), req MAF#(6) InvalAck T Route(16), Dealloc(6), Opcode(8), req MAF#(6) WrlOAck T,H Route(l6), Dealloc(6), Opcode(8), req WRI0#(6) WrIONAck T Route(16), Dealloc(6), Opcode(8), req WRI0#(6) Release Response Packet (3 Ticks) F VictimClean Route(16), Dealloc(9), Opcode(8), Stripe(l), Block Address(31), Source PID(l 1) VictimCleantoShared Route(16), Dealloc(9), Opcode(8), Stripe(l), Block Address(31), Source PID( 10) ForwardAckExcl Route(16), Dealloc(9), Opcode(8), Stripe(l), Block Address(31) ForwardAckShared Route(16), Dealloc(9), Opcode(8), Stripe(l), Block Address(31), Source PID( 10) F ForwardMiss Route(16), Dealloc(9), Opcode(8), Stripe(l), Block Address(31) SharedtoDirtyComplete Route(16), Dealloc(9), Opcode(8), Stripe(l), Block Address(31) SharedtoDirtyRelease Route(16), Dealloc(9), Opcode(8), Stripe(l), Block Address(31) 13.1.5 Messages on a SPECIAL_CHANNEL Table 13-5 lists messages on the SPECIAL_CHANNEL. Table 13-5 Messages on a SPECIAL_CHANNEL Command Direction Contents NZNOP T,F Opcode(8), Dealloc(6), Alert(5) SpeciallnvalBroadcast Route(16), Dealloc(9), Opcode(8), Dir PID(lO), Stripe(l), Block Address(31), req PID(ll) Compaq Confidential 5 January 2001 -- Subject To Change Router Interface - the Rbox 13-5 Message Format Details 13.2 Message Format Details Physical channels have 32 bits of information plus a 7-bit SECD ED ECC code. Packets are always sent contiguously in time. 13.2.1 Route Information This information tells where to route a message. Along the taken route it also (dynamically) determines the buffers that are used. The 16 route information bits in the first tick of each message are shown in Table 13-6. This table shows that: • • • • • Most of these bits do not change their value as a message hops. "Virtual channel" is recalculated for every corner turn. "Did adapt" inverts every time a message converts from adaptive/deadlock-free buffers (and vice-versa). "Virtual channel" is unused (but must be preserved) when "did adapt" is asserted. 
"Did adapt" need not be zero when "can adapt" is zero: A message in the IO_CHANNEL may take an initial hop into the adaptive/initial buffers. A message not in the IO_CHANNEL that has "can adapt" clear can still route in the adaptive buffers - a clear "can adapt" implies only that the message must route in a fixed path. • The 21464 uses only four of the NS value and EW value bits because the largest configuration is 16 processors in each dimension. Table 13-6 Route Information Bits Bits Bit Position(s) Meaning NS dir (1) NS direction to travel (north or south). NS value (5) No more NS travel needed when NS WHOAMI is this value. EW dir (1) EW direction to travel (east or west). EW value (5) No more EW travel needed when EW WHOAMI is this value. Can adapt (1) This message can travel in adaptive directions. Is for 1/0 (1) This message is for the 1/0 channel (or CSR master) at the node. Virtual channel ( 1) Select which of the two virtual channels in this direction. Did adapt (1) Place this message into an adaptive buffer. The opcode bits are included in the first tick of each message so that the length of the message and the exact virtual channel can be determined quickly. Compaq Confidential 13-6 Router Interface - the Rbox 5 Janwiry 2001 - Subject To Cbange Message Format Details 13.2.2 Flow Control and Dealloc Information The flow control works as follows (thought of as "credit-based"): • • • • • • Senders and receivers are paired (in one direction one is the sender and the other the receiver, and in the other direction their roles are inverted). The sender has N buffers in each class to send into (the actual N value varies for each class). Each sent packet allocates one of the buffers . The sender knows how many buffers are available. If there may be an overflow, sender stops sending until space is available. The "dealloc" encodings listed in Table 13-7 are used to free up buffer space of each class. There is one deallocation response from the receiver for each message sent. The buffer space is deallocated using 3-bit signals sent as part of the header information with each message sent along the corresponding return channel. Each tick in the control portion of the packets contains at least one 3-bit signal. The packets, including cache blocks, contain one or more 3-bit signals in each control tick. The concatenation of all the dealloc information included in all messages produces a string of 3-bit deallocation signals. The encoding is a "huffman" encoding that may require multiple 3-bit signals per deallocation. (The dealloc encoding to release a multiple-signal deallocation may span across multiple messages.) As listed in Table 13-7, deallocation of the adaptive messages requires a single 3-bit signal, while the virtual channels require two 3-bit signals. Also, the RDIO and WRIO packets are merged into one. 
Table 13-7 Dealloc 3-Bit Variable-Length Encoding (IPs) Code Meaning 0 Nop 1 Special inval broadcast complete 61 Special inval broadcast (special) 2 Request adaptive 62 Request virtual channel 0 72 Request virtual channel 1 3 Forward adaptive 63 Forward virtual channel 0 73 Forward virtual channel 1 4 Response non-block adaptive 64 Response non-block virtual channel 0 74 Response non-block virtual channel 1 5 Response block adaptive 65 Response block virtual channel 0 75 Response block virtual channel 1 Compaq Confidentia I 5 January 2001 ·-Subject To Change Router Interface - the Rbox 13-7 Message Format Details Table 13-7 Dealloc 3-Bit Variable-Length Encoding (IPs) (Continued) Code Meaning 60 Read I/O initial/adaptive 66 Read I/O virtual channel 0 67 Read I/O virtual channel 1 70 Write I/O initial/adaptive 76 Write I/O virtual channel 0 77 Write I/O virtual channel 1 The inval broadcast packets are given special buffering, separate from the rest of the traffic. There is also inval completion information that flows on the dealloc channel. See Section 13.3 for a description of the special operation of the inval broadcast packets. Table 13-8 shows the message formats that can flow on each set of buffers. Table 13-8 Buffer Message Formats Buffer Pool Size of Buffer (Ticks) Number of Entries (adap+vcO+vc 1) Formats Included Request 3 8+1+1 Request Packet Forward 3 8+1+1 Forward Packet Response non-block 3 8+1+1 No Block Response Packet Release Response Packet Interrupt Response Packet Response block 19 3+1+1 Block Response Packet Victim Block Response Packet Readl/O 3 1+2+2 RdBytes Packet Write I/O 19 1+2+2 WrBytes Packet In.val broadcast 3 8 Inval Broadcast Packet The "size of buffer" is the maximum number of ticks in each format included in that buffer pool. Compaq Confidential 13-8 Router Interface - the Rbox 5 Jc1mJc1ry 2001 - Subject To Change Message Format Details The 1/0 port has a simpler single-tick dealloc encoding as listed in Table 13-9. It is simpler because the 1/0 port does not separate the adaptive and virtual channel buffers. Table 13-9 Dealloc 3-Bit Encoding (1/0 port) Code Meaning 0 Nop 1 Unused 2 Request 3 Forward 4 Response non-block 5 Response block 6 Read I/O 7 Write I/O Table 13-10 shows the size and number of each buffer on the 1/0 port. Table 13-10 1/0 Port Buffer Size and Number Buffer Pool Size of Buffer Number In Request 3 Forward 3 Response non-block 3 8 1-8 Response block 19 4 1-8 Readl/O 3 2 1-8 Write I/O 19 2 1-8 Number Out 8 1-8 The column, Number In, indicates the number of buffers that the 21464 has on its I/O port. The number out (on the I07 ASIC) is variable as specified in the RBOX_IO_BUF register. Table 13-11 shows the size and number of each buffer for each of the Zports. Table 13-11 Zport Buffer Message Format Buffer Pool Size of Buffer Number of (ticks) Entries Formats Included Forward 3 Forward Packet Response non-block 3 Response block 1 19 8 8+1 1 4 No Block Response Packet Release Response Packet Interrupt Response Packet Block Response Packet The 1 extra is used by the broadcast inval widget Compaq Confidential 5 January 2001 -Subject To Change Router Interface - the Rbox 13-9 Message Format Details Table 13-12 shows the size and number of each buffer on the Cport. 
Table 13-12 Cport Buffer Message Format Buffer Pool Size of Buffer Number of (ticks) Entries Formats included Request 3 8 Request Packet Response nonblock 3 4+41 No Block Response Packet Release Response Packet Interrupt Response Packet Response block 19 6 Block Response Packet Read IO 3 4 Rdbytes Packet Write IO 19 4 Wrbytes Packet 1 The 4 extra are used by the broadcast inval widget Compaq Confidential 13-1 o Router Interface - the Rbox 5 J(1mi(1ry 2001 -· Subject To Change Message Format Details 13.2.3 Packet Formats Table 13-13 lists the packet format identifiers and thier contents. Table 13-13 Packet Formats Identifier Contents OP Opcode (8 bits). ADD Block address of block at DPID (offset at DPID) (32 bits). STRIPE Stripe bit. RMAF Requestor MAF (6 bits). RWRIO Requestor WRIO number (6 bits). RPID 1 Requestor PID (11 bits) - uppermost bit is IJO. DPID PID of directory for block (10 bits). SPID 1 Source PID (11or10 bits depending if it can be sourced from IJO). DESTPID 1 PID for a packet generated by the I07 ASIC (11 bits). The DESTPID indicates the destination for the packet. (An I07 ASIC can address another I07 ASIC.) DESTPID[lO] must be set for RdIPR and WrIPR operations received by the 21464 on the incoming I/O port. IOADD I/O space address bits (32 bits). QWADD Bits below cache block for IJO address (3 bits). IOMASK Bits indicating bytes/longwords/quadwords read/written (8 bits). COUNT Coherency count. DATA Data bits. ECC ECC bits. TBD For header ticks from the I07 ASIC - value depends on the opcode. WRAP Indication of which 128 bits from the block will/should arrive first. The wrap is address bits [5 :4] and specifies the order of the octaword transfers. The following table specifies the order of the longwords, where X. Y means longword Y within octaword X. Wrap Value Wrapped Longword Order in the Packet 0 0.0, 0.1, 0.2, 0.3, 1.0, 1.1, 1.2, 1.3, 2.0, 2.1, 2.2, 2.3, 3.0, 3.1, 3.2, 3.3 1 1.0, 1.1, 1.2, 1.3, 2.0, 2.1, 2.2, 2.3, 3.0, 3.1, 3.2, 3.3, 0.0, 0.1, 0.2, 0.3 2 2.0, 2.1, 2.2, 2.3, 3.0, 3.1, 3.2, 3.3, 0.0, 0.1, 0.2, 0.3, 1.0, 1.1, 1.2, 1.3 3 3.0, 3.1, 3.2, 3.3, 0.0, 0.1, 0.2, 0.3, 1.0, 1.1, 1.2, 1.3, 2.0, 2.1, 2.2, 2.3 ALERT 4 bits allocated, 3 used: one HW ALERT, one SW ALERT, one HW SYNCH (not needed for the I/O port). MBZ Must be zero. 1 RPID, SPID, and DESTPID are typically one more bit than DPID because a DPID cannot be an I07 ASIC. Compaq Confidential 5 January 2001 ··· Subject To Change Router Interface -the Rbox 13-11 Message Format Details Packet formats are listed in Tables 13-14 through 13-19. In all formats, bits ADD[34:7] equal OFF[34:7] (??OLD SECTION 3.1 ??). Bits ADD[36:35] and IOADD[37:35] are unused (zeros). Bit ADD[34] must be zero when not in 32GB/processor mode. In Tables 13-14 through 13-19, RPID[lO], SPID[lO], DESTPID[lO] when asserted, indicate the I07 ASIC connected to the corresponding 21464. RPID[9:8], SPID[9:8], IPID[9:8], DESTPID[9:8], and DPID[9:8] are unused (zeros). Bits RMAF[5:4] are unused, but all of RMAF[5:0] are passed on; therefore, each 107 ASIC can have up to 64 outstanding references. Bits RWRI0[5:0] are unused by the 21464 because outstanding write I/Os are tracked only by means of the account; however, all of RWRI0[5:0] are passed on. Therefore, each 107 ASIC can track up to 64 outstanding 1/0 writes. In the following formats, the most-significant bits are on the left (the ECC bits are a separate field). The line following each header tick line indicates the bit positions of the corresponding field in the header tick line. 
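A small sketch of the wrapped-octaword ordering defined by the WRAP identifier in Table 13-13 (illustrative only): WRAP is address bits [5:4], the four octawords are transferred starting with the wrapped octaword and wrapping modulo 4, and the longwords of each octaword are sent low-order first.

#include <stdio.h>

void print_wrap_order(unsigned wrap)        /* wrap = ADD[5:4], 0..3 */
{
    for (unsigned i = 0; i < 4; i++) {
        unsigned octaword = (wrap + i) & 3;
        for (unsigned lw = 0; lw < 4; lw++)
            printf("%u.%u ", octaword, lw); /* X.Y = longword Y within octaword X */
    }
    printf("\n");
}

int main(void)
{
    print_wrap_order(2);   /* prints 2.0 2.1 ... 3.3 0.0 ... 1.3, matching the table */
    return 0;
}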
13.2.3.1 IO_CHANNEL Formats Table 13-14 lists the IO_CHANNEL packet formats. Table 13-14 l/O_CHANNEL Formats (3 Ticks) Tick Contents RdBytes Packet Format (3 Ticks) TickO Tick 1 ROUTE[15:0] DEALLOC[2:0] OP[7:0] IOADD[10:6] [31 :16] [15:13] [12:5] [4:0] DEALLOC[5:3] IOADD[21:11] QWADD[5:4] IOADD[37:22] [31 :16] Tick2 [15:13] [12:2] [1:0] RMAF[5] SPARE[2:0] DEALLOC[8 :6] IOMASK[7:0] RMAF[4:0] [15:13] [12:5] [4:0] ROUTE[15:0] DEALLOC[2:0] OP[7:0] IOADD[l 0:6] [31 :16] [15:13] [12:5] [4:0] DEALLOC[5:3] IOADD[21:11] QWADD[5:4] RPID[lO:O] QWADD[3] [31 :21] [20] [19] [18:16] ECC[6:0] ECC[6:0] ECC[6:0] WrBytes Packet Format (19 Ticks) TickO Tick 1 IOADD[37:22] [31 :16] Tick2 [15:13] [12:2] [1:0] RPID[lO:O] QWADD[3] RWRI0[5] SPARE[2:0] DEALLOC[8:6] IOMASK[7:0] RWRI0[4:0] [12:5] [4:0] [31 :21] [20] [19] [18:16] [15:13] ECC[6:0] ECC[6:0] ECC[6:0] First Wrapped Octaword: Tick 3 DATA[31:0] ECC[6:0] (low-order longword of the octaword) Tick 4 DATA[63:32] ECC[6:0] Tick 5 DATA[95:64] ECC[6:0] Tick 6 DATA[127:96] ECC[6:0] (high-order longword of the octaword) Compaq Confidential 13-12 Router Interface - the Rbox 5 Janut1r;12001 ~-Subject To Change Message Format Details 13.2.3.2 REQUEST_CHANNEL Format Table 13-15 lists the REQUEST_CHANNEL packet format. Table 13-15 REQUEST_CHANNEL Format Tick Contents REQUEST_CHANNEL Packet Format (3 Ticks) TickO Tick 1 Tick2 ROUTE[15 :O] DEALLOC[2:0] OP[7:0] ADD[10:6] [31:16] [15:13] [12:5] [4:0] DEALLOC[5:3] ADD[21:11] WRAP[5:4] [15:13] [12:2] [1:0] SPARE[2:0] DEALLOC[8:6] SPARE[7:0] RMAF[4:0] [12:5] [4:0] STRIPE[O] ADD[36:22] [31] [30:16] RPID[lO:O] SPARE[O] RMAF[5] [31:21] [20] [19] [18:16] [15:13] ECC[6:0] ECC[6:0] ECC[6:0] 13.2.3.3 FORWARD_ CHANNEL Format Table 13-16 lists the FORWARD_CHANNEL packet format. Table 13-16 FORWARD_CHANNEL Format Tick Contents FORWARD_CHANNEL Packet Format (3 Ticks) Tick 0 ROUTE[15:0] DEALLOC[2:0] OP[7:0] [31:16] [15:13] [12:5] Tick 1 STRIPE[O] ADD[36:22] [31] Tick 2 RPID[lO:O] [31:21] [4:0] DEALLOC[5:3] ADD[21:11] WRAP[5:4] ECC[6:0] [30:16] [15:13] SPARE[O] RMAF[5] SPARE[O] DPID[9:8] DEALLOC[8:6] [20] ADD[10:6] ECC[6:0] [19] [18] [17:16] [15:13] [12:2] [1:0] DPID[7:0] RMAF[4:0] ECC[6:0] [12:5] [4:0] Compaq Confidential 5 January 2001 - Subject To Change Router Interface -the Rbox 13-13 Message Format Details 13.2.3.4 RESPONSE_CHANNEL Formats Table 13-17 lists the RESPONSE_CHANNEL packet formats. 
Table 13-17 RESPONSE_CHANNEL Formats Tick Contents Block Response Packet (18 Ticks) TickO Tick 1 ROUTE[15:0] DEALLOC[2:0] OP[7:0] RMAF[4:0] [31:16] [15:13] [12:5] [4:0] SPARE[l 1:O] RMAF[5] [31:20] [19] SPARE[2:0] DEALLOC[5 :3] [18:16] ECC[6:0] SPARE[5:0] COUNT[4:0] WRAP[5:4] ECC[6:0] [15:13] [12:7] [6:2] ECC[6:0] [1:0] First Wrapped Octaword: Tick2 DATA[31:0] ECC[6:0] (low-order longword of the octaword) Tick3 DATA[63:32] ECC[6:0] Tick4 DATA[95:64] ECC[6:0] Tick5 DATA[127:96] ECC[6:0] (high-order longword of the octaword) Second Wrapped Octaword: Tick6 DATA[31:0] ECC[6:0] (low-order longword of the octaword) Tick7 DATA[63:32] ECC[6:0] Tick8 DATA[95:64] ECC[6:0] Tick9 DATA[127:96] ECC[6:0] (high-order longword of the octaword) Victim Block Response Packet (19 Ticks) TickO Tick 1 ROUTE[15:0] DEALLOC[2:0] OP[7:0] ADD[10:6] [31 :16] [15:13] [12:5] [4:0] STRIPE[O] ADD[36:22] DEALLOC[5:3] ADD[21:11] MBZ[l:O] [15:13] [12:2] [31] Tick2 SPID[lO:O] [31:21] [30:16] SPARE[4:0] DEALLOC[8:6] [20:16] ECC[6:0] [1:0] SPARE[l 2:0] ECC[6:0] [12:0] [15:13] First Unwrapped Octaword: Tick3 DATA[31:0] ECC[6:0] (low-order longword of the octaword) Tick4 DATA[63:32] ECC[6:0] Tick5 DATA[95:64] ECC[6:0] Tick6 DATA[127:96] ECC[6:0] (high-order longword of the octaword) No Block Response Packet (2 Ticks) TickO Tick 1 ROUTE[15:0] DEALLOC[2:0] OP[7:0] RMAF[4:0] [31 :16] [15:13] [12:5] [4:0] SPARE[l 1:O] RMAF[5] [31:20] [19] SPARE[2:0] DEALLOC[5 :3] [18:16] ECC[6:0] SPARE[5:0] COUNT[4:0] SPARE[l:O] ECC[6:0] [15:13] [12:7] [6:2] [1:0] Compaq Confidential 13-14 Router Interface -the Rbox 5 Jc1nuc1ry 2001 ··· Subject To Change Message Format Details Table 13-17 RESPONSE_CHANNEL Formats (Continued) Tick Contents Release Response Packet (3 Ticks) TickO Tick 1 ROUTE[15:0] DEALLOC[2:0] OP[7:0] ADD[10:6] [31:16] [15:13] [12:5] [4:0] STRIPE[O] ADD[36:22] DEALLOC[5:3] ADD[21:11] SPARE[l:O] [12:2] [1:0] [31] Tick2 SPID[lO:O] [30:16] [15:13] SPARE[l 2:0] SPARE[4:0] DEALLOC[8:6] [31 :21] [20:16] ECC[6:0] ECC[6:0] ECC[6:0] [12:0] [15:13] 13.2.3.5 SPECIAL_CHANNEL Formats Table 13-18 lists the SPECIAL_CHANNEL packet formats. Table 13-18 SPECIAL_CHANNEL Formats Tick Contents Nop Packet (1 Tick) TickO SYNCH[O] SW_ALERT[O] HW_ALERT[O] ECC[6:0] SPARE[l 2:0] DEALLOC[5:0] OP[7:0] SPARE[l:O] [2] [1] ROUTE[l 5 :O] DEALLOC[2:0] OP[7:0] [31:16] [15:13] [12:5] [31:19] [12:5] [18:13] [4:3] [0] lnval Broadcast Packet Format (3 Ticks) TickO Tick 1 Tick2 STRIPE[O] ADD[37:22] DEALLOC[5:3] ADD[21:11] [31] [30:16] [15:13] [12:2] RPID[lO:O] SPARE[2:0] DPID[9:8] DEALLOC[8:6] DPID[7:0] [31 :21] [20:18] [15:13] [12:5] [17:16] ADD[10:6] ECC[6:0] [4:0] SPARE[l:O] ECC[6:0] [1:0] SPARE[4:0] ECC[6:0] [4:0] 13.2.3.6 INPUT 1/0 PORT HEADER TICK Formats Table 13-19 lists the INPUT 1/0 PORT HEADER TICK packet formats. Table 13-19 INPUT 1/0 PORT HEADER TICK Formats Tick Contents Nop Packet (1 Tick) TickO SPARE[15:0] DEALLOC[2:0] OP[7:0] ALERT[4:0] [31:16] [15:13] [12:5] [4:0] ECC[6:0] All Other Packets (2 - 19 Ticks) TickO SPARE[4:0] DESTPID[lO:O] DEALLOC[2:0] OP[7:0] TBD[10:6] [31:27] [26:16] [15:13] [12:5] [4:0] ECC[6:0] Compaq Confidential 5 Jam.mry 2001 -~Subject To Change Router Interface -the Rbox 13-15 SharedlnvalBroadcast Details 13.2.3.7 ROUTE FIELD Format Table 13-20 lists the ROUTE FIELD format. 
Table 13-20 ROUTE FIELD Format
ROUTE Bit(s)    Meaning            Header Tick Bit(s)
ROUTE[0]        Did adapt          [16]
ROUTE[1]        Virtual channel    [17]
ROUTE[2]        Is for I/O         [18]
ROUTE[3]        Can adapt          [19]
ROUTE[8:4]      EW value           [24:20]
ROUTE[9]        EW direction       [25]
ROUTE[14:10]    NS value           [30:26]
ROUTE[15]       NS direction       [31]
13.3 SharedInvalBroadcast Details
SharedInvalBroadcasts are sent from the directory to one of the nodes in the cluster of processors sharing a mask bit. That processor receives the SharedInvalBroadcast and buffers it in an internal structure called the inval widget. This is the root of the inval fanin/fanout tree.
SpecialInvalBroadcast messages are fanned out within the cluster from node to node. At each node a new inval widget entry is allocated. The inval widget entry waits for all the children processors in its subtree to complete before it completes. It also performs the invalidate on the local processor (if the local processor is not the requesting processor) and must wait for the local inval to complete before the inval widget entry completes. This means that once the children of the root node complete and the root node itself performs its inval, all of the invalidations are complete. Once a node is complete, the inval widget entry is deallocated and a SpecialInvalBroadcast complete message is sent to the parent node. Once the root node completes the SharedInvalBroadcast, an InvalAck is returned to the requesting processor.
To avoid deadlock, SpecialInvalBroadcast messages are expected to be fanned out (and back in) in a dimension order. This, combined with the fact that the inval widgets are specific to particular inputs, allows multiple broadcast messages to be fanned out from different starting points while avoiding deadlock.
SpecialInvalBroadcast messages have their own special buffering. This buffering is deallocated in the normal way. The SpecialInvalBroadcast complete signals are also transferred along the deallocation channel.
13.4 I/O Port and I/O ASIC Assumptions
The I/O ASIC communicates with the rest of the 21464 via the I/O port. The packet formats are the same as the 21464-to-21464 packet formats. The I/O ASIC can only issue a subset of the commands, and only needs to be able to receive a subset of the commands, as described in Section 13.5.
The interface to the I/O ASIC follows the same dealloc strategy as the 21464-to-21464 ports. The only difference is that the packets coming from the I/O ASIC encode the destination PID in place of the route information. (The 21464 then does a routing table lookup and replaces the destination PID with the routing information.) The header tick format is specified in Section 13.2.3.6.
New messages on the outgoing I/O port can only be initiated on the rising edge of the outgoing forward clock. This can make the decode on the I/O ASIC simpler. The only packet type that can start on the falling edge is the NZNOP (or the true NOP during reset).
An I/O ASIC has a PID with the uppermost bit set. The lower PID bits equal the PID of the processor with the I/O port the ASIC is connected to. The PID of the I/O ASIC is what allows messages routed on the 21464 interconnect to reach the I/O device: first route to the processor with the same lower bits, then route out the I/O port.
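As a rough illustration of how a destination PID supplied by the I/O ASIC could be turned into route information with the Table 13-20 layout, consider the sketch below. The routing-table lookup is only a stand-in; the real Rbox lookup fills in the NS/EW direction and value bits and the adapt controls.

#include <stdint.h>

#define PID_IO_BIT   (1u << 10)      /* uppermost PID bit marks an IO7 ASIC */

/* Stand-in for the Rbox routing table lookup for a destination processor. */
static uint16_t routing_table_lookup(unsigned proc_pid)
{
    (void)proc_pid;
    return 0;
}

/* Build a ROUTE field (Table 13-20 bit layout) for a packet whose destination
 * PID was supplied in place of route information. */
uint16_t build_route(unsigned dest_pid)
{
    unsigned proc_pid = dest_pid & ~PID_IO_BIT;       /* processor the IO7 hangs off of */
    uint16_t route = routing_table_lookup(proc_pid);  /* NS/EW travel, can adapt, etc. */

    if (dest_pid & PID_IO_BIT)
        route |= (uint16_t)(1u << 2);                 /* ROUTE[2]: "is for I/O" at that node */
    return route;
}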
Here are the high-level operations supported in the 21464 I/O port interface:
• DMA read and write access by the I/O ASIC to the memory in the 21464 system
• Read and write access by the 21464 microprocessors to registers in the I/O ASIC and on the I/O buses connected to the I/O ASIC
• Read and write access by the I/O ASIC to the system IPRs in the 21464 microprocessors
DMA read and write access to memory space via the I/O ASIC is described in Section 13.6.
The read and write access by the 21464 microprocessors to the registers in the I/O ASIC and on the I/O buses connected to the I/O ASIC allows the microprocessors to control the I/O devices connected on the port. The 21464 interconnect has enough virtual channels, and the coherence protocol is such, that the I/O ASIC may stall these accesses pending completion of DMA references and the system will not deadlock.
I/O ASIC references can read/write the system IPRs of any 21464 in the system. This allows 21464 system IPRs to be configured by the I/O ASIC or another device on an I/O bus connected to the I/O ASIC. This is also the mechanism by which interrupts are delivered from an I/O device to a 21464 (see Section 13.5 for more information on interrupts). Note that these references must never block either of the two prior types of access (DMA or I/O register access by the microprocessor), otherwise deadlock may occur.
Note that an I/O ASIC may send an IO_CHANNEL Rd* or Wr* message to another I/O ASIC in order to implement peer-to-peer I/O. See Section 13.7 for deadlock-avoidance requirements.
Note that the I/O port protocol does not explicitly support coherent I/O TLBs, but I/O TLB coherence can be maintained by hardware exclusive caching of TLB entries.
The I/O port also has a synchronous mode for lock-step operation. In this mode, data from the I/O port input is not directly taken into the router core. Rather, it is written into a four-entry FIFO using a (two-bit) write pointer. The router core later reads the data from this FIFO using its two-bit read pointer. The FIFO write path writes a piece of data into the FIFO and increments the write pointer every cycle that valid data is sampled. The FIFO read path reads a FIFO entry and increments the read pointer at the rate of the incoming data, synchronous to the internal 21464 clock. This synchronous mode allows the 21464 router core to sample data from the incoming I/O port (via the FIFO) at a predictable time even though the 21464 pads may sample the incoming I/O port data at an unpredictable time. Short-term jitter can be tolerated. Over the long term, the system must be synchronous.
The read and write pointers are initialized as follows. At boot time the 21464 forward clocks on the I/O output ports are not transitioning. Also, the forward clocks on the I/O input ports are not transitioning. The write pointer increments only when an incoming forward clock is received, so the write pointer is initialized at this time. At some point boot software starts the outgoing forward clocks from the 21464. It does this by a write to the RBOX_IO_CFG IPR. During the same register write it also initializes the read pointer to the appropriate value. The I/O ASIC must detect the start of the forward clock from the 21464 and start its forward clocks a fixed (predictable) amount of time later. When the 21464 starts receiving the forward clocks from the I/O ASIC (a fixed time later), the write pointer starts incrementing. Since this time is fixed, it should have been possible to pre-calculate the necessary read pointer value. (Either that, or try all four possible combinations.) The same read pointer value can be used from one boot to the next even though the latency for the forward clocks to transition may vary slightly from one boot to the next.
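A minimal sketch of the synchronizing FIFO just described, with hypothetical names; in the real design the pointers are hardware counters clocked by the incoming forward clock and the internal core clock, respectively.

#include <stdint.h>

typedef struct {
    uint32_t entry[4];
    unsigned wr_ptr;   /* two-bit pointer: advanced when valid pad data is sampled */
    unsigned rd_ptr;   /* two-bit pointer: initialized by the RBOX_IO_CFG write */
} sync_fifo_t;

/* Write path: clocked by the incoming forward clock. */
void fifo_write(sync_fifo_t *f, uint32_t data)
{
    f->entry[f->wr_ptr] = data;
    f->wr_ptr = (f->wr_ptr + 1) & 3;
}

/* Read path: clocked by the core, at the rate of the incoming data. */
uint32_t fifo_read(sync_fifo_t *f)
{
    uint32_t data = f->entry[f->rd_ptr];
    f->rd_ptr = (f->rd_ptr + 1) & 3;
    return data;
}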
13.5 Interrupt Delivery
There are two mechanisms to deliver an interrupt to a 21464 microprocessor:
• Queueing an identifier
• Setting a mask bit
The 21464 includes a four-entry queue to hold 24-bit identifiers that can uniquely identify the source of an interrupt. These 24-bit identifiers are called interrupt IDs (IIDs). Interrupt software can read the head of the queue to determine how to process an interrupt. This queue is accessible via references to the RBOX_INTQ and RBOX_INTA IPRs.
The 21464 also includes a 32-bit mask for coarser interrupt receipt. This mask can be referenced via the RBOX_INT IPR. The individual interrupts can be masked via the RBOX_IMASK system IPR. Some of the bits in this mask will be reserved for specific purposes (e.g., interrupt queue, performance counter, error, I/O error), and other bits will be available for general software use (e.g., interprocessor interrupt). New interrupts can be launched via the RBOX_IREQ IPR.
I/O devices will typically queue an IID to produce an interrupt. In order to use this method, the I/O ASIC will issue a write to the RBOX_INTA IPR in the appropriate 21464. The I/O device must be prepared to receive a WrIONAck response indicating that the given 21464's interrupt queue has overflowed. When the I/O ASIC receives the overflow response, it must resend the interrupt to the same or another 21464 until it is accepted by one of them.
Interprocessor interrupts will typically be performed via writes (to the RBOX_IREQ IPR) that set a mask bit in the RBOX_INT IPR of another 21464. Interprocessor interrupts will typically not use the interrupt queue method since there is no hardware mechanism to determine when the interrupt queue overflows.
The 21264 core allows for six interrupt wires into the core. The 21464 will partition the interrupt sources onto these six wires as follows:
Table 13-21 Interrupt Level Sources
Interrupt Level    Source
IRQ(0)             System correctable / performance count
IRQ(1)             Interrupt queue
IRQ(2)             Interval timer
IRQ(3)             Other (interprocessor/SW ALERT)
IRQ(4)             Halt interrupt/other
IRQ(5)             Uncorrectable/machine-check/HW ALERT
13.6 DMA Device Assumptions
This section describes two alternative techniques for the I/O ASIC to perform DMA accesses to the 21464 system memory: exclusive caching and timeouts.
A DMA device is contained within the I/O ASIC off the I/O port. Its purpose is to service I/O bus reads and writes. A DMA device can access data in one of three different ways:
1. An uncached FetchBlk request to read the block
2. A ReadMod request to obtain exclusive access to the block (often to write a portion of the block)
3. An InvaltoDirty request to gain exclusive access to the block (presumably to write the entire block)
For a DMA read stream there are two ways to prefetch data in multiple blocks, depending on the ordering required by the DMA device. The most efficient way is to use a stream of FetchReq (i.e., non-cacheable fetch) commands.
As an example, the I/O controller might fetch blocks A and B. The references to blocks A and B may be serviced in any order by the memory system, and the responses may return in any order. Note the two sources of difficulty: (1) the references are serviced out of order, and (2) the references may return out of order. Source (1) may violate the memory reference ordering constraints required by the DMA read stream (the returned loads are not sequentially consistent, for example). Source (2) makes the implementation of the DMA controller more difficult because the data may have to be reordered.
The second way to prefetch data in multiple blocks for a DMA read stream is to use ReadModReq commands. The advantage of this method is that the I/O device can implement a sequentially consistent read stream since the exclusive access forces order. One disadvantage is that a VictimClean must be generated to release exclusive access to the block. The other disadvantage is that exclusive access is required. Multiple DMA devices that attempt to access the same block at the same time will be serialized as a consequence, as will a processor and a DMA device.
There are also two ways to prefetch data in multiple blocks for a DMA write stream. The first way is via a stream of ReadModReq commands. The second is via a stream of InvaltoDirtyReq commands. The InvaltoDirtyReqs require that the writes be full-block writes. Note that the protocol specifies that InvaltoDirtys may be issued speculatively from a DMA device since the memory always contains the prior copy of the block; a VictimClean will back out the request if it is found to be a mis-speculation. Also, the DMA device will never dirty blocks in response to a ReadModReq. This means that Victim commands will never be needed for a DMA read (via ReadMod command) stream.
13.6.1 I/O DMA Access and Exclusive Caching
When using this technique, the DMA device is expected to force the eviction of a cache block soon after receiving a forward for the cache block. The I/O ASIC may (exclusively) cache copies of blocks for long time periods. If a processor or another I/O ASIC (or even this I/O ASIC) requests a copy of the block, the directory will see that this I/O ASIC is the exclusive owner of the block and will forward the request to the I/O ASIC. When this happens, the directory expects to eventually receive both a ForwardMiss and a Victim (or VictimClean) in response.
When the I/O ASIC is using this mode for DMA access, it should respond ForwardMiss to every received forward request. The following is additionally required:
• Any currently cached blocks/TLB entries that could possibly match the address in the forward must be marked for eventual eviction (after a time-out)
• Any currently pending MAF entries that could possibly match the address must be marked so that the block eventually gets evicted after it returns
Note that the receipt of a forward does not imply that the I/O ASIC currently holds a copy of the block. (A victim may be on its way from the I/O ASIC to the directory before the I/O ASIC receives the forward.)
Note that this scheme allows the I/O ASIC to (exclusively) cache copies of scatter-gather maps or I/O TLB entries.
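A minimal sketch of this forward handling, using hypothetical data structures on the I/O ASIC side (not the actual ASIC design): always reply ForwardMiss, and mark anything that could match the forwarded address for eventual eviction.

#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>

#define N_CACHE 8
#define N_MAF   4

typedef struct { bool valid; uint64_t addr; bool evict_soon; } cache_entry_t;
typedef struct { bool valid; uint64_t addr; bool evict_on_return; } maf_entry_t;

static cache_entry_t cache[N_CACHE];
static maf_entry_t   maf[N_MAF];

static void send_forward_miss(uint64_t addr)          /* stand-in for the reply path */
{
    printf("ForwardMiss for block 0x%llx\n", (unsigned long long)addr);
}

void io_asic_handle_forward(uint64_t addr)
{
    send_forward_miss(addr);                 /* directory also expects a Victim(Clean) later */
    for (int i = 0; i < N_CACHE; i++)
        if (cache[i].valid && cache[i].addr == addr)
            cache[i].evict_soon = true;      /* evict cached block/TLB entry after a time-out */
    for (int i = 0; i < N_MAF; i++)
        if (maf[i].valid && maf[i].addr == addr)
            maf[i].evict_on_return = true;   /* block in flight: evict once it returns */
}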
13.6.2 I/O DMA Access via Timeouts
When using this technique, the DMA device is expected to evict blocks soon after it obtains exclusive access to them. This allows the I/O ASIC to ignore the forwards.
When the I/O ASIC is using this mode for DMA access, it should simply respond ForwardMiss to every received forward request, and otherwise ignore the forward.
DMA devices must take care to avoid deadlock in this mode. Take the following scenario:
1. The DMA device requests exclusive access to blocks A and B simultaneously.
2. The response for block B returns but cannot be written until the response for block A returns.
In this scenario deadlock could result if the DMA device does not eventually release exclusive access to block B. It is possible that the response to request A cannot be completed since requests are blocked waiting for the eviction of B. Thus, after a timeout, block B must be released in order to make forward progress, even though the reference to the block has not been completed.
Note also that I/O TLBs may not be cached (long-term) when this timeout mechanism is used.
13.7 I/O Space Ordering and Assumptions
The 21464 supports the same I/O space ordering rules as the 21264: LD-LD ordering is maintained to the same I/O ASIC or processor, ST-ST ordering is maintained to the same I/O ASIC or processor, LD-ST or ST-LD ordering is maintained to the same address, and LD-ST or ST-LD ordering is not maintained when the addresses are different. All these ordering constraints are on a single-processor basis to the same I/O ASIC or processor.
Multiple loads (to the same or different addresses) may be in flight without being responded to, though their in-flight order is maintained to the destination by the core/CBOX and the router. Similarly, multiple stores (to the same or different addresses) can be in flight. When there is a load I/O outstanding to address A, the 21464 will not launch a store to address A until the BlkIO response to the load I/O is received. The 21464 may have an earlier write I/O request to address B in flight at the same time as there are load I/O requests in flight to address B; the CBOX/router guarantee that the earlier write I/O request reaches the destination before the later load I/O requests.
The 21464 also supports peer-to-peer I/O. In order to avoid deadlock among peer I/O ASIC clients, writes must be able to bypass prior reads. This is required because read responses cannot be returned until prior writes have completed in order to maintain some PCI ordering constraints. By allowing the writes to bypass the reads, we guarantee that the writes will eventually drain, thereby guaranteeing that the reads will eventually drain.
In order to implement all these requirements, the 21464 router must maintain the following point-to-point rules on the IO_CHANNEL:
Table 13-22 Router IO_CHANNEL Point-to-Point Rules
First    Second    Description
Rd*      Rd*       Order must be maintained
Rd*      Wr*       The later Wr* must be allowed to bypass the earlier Rd* to avoid deadlock
Wr*      Rd*       Order must be maintained
Wr*      Wr*       Order must be maintained
In other words, except for the case of a read followed by a write, a total order must be maintained.
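A minimal sketch of this point-to-point rule: the only pair of packets allowed to reorder between the same source and destination is a later write passing an earlier read.

#include <stdbool.h>

typedef enum { IO_READ, IO_WRITE } io_op_t;

/* May `second` (issued later) bypass `first` (issued earlier) on the
 * IO_CHANNEL between the same source and destination? */
bool io_may_bypass(io_op_t first, io_op_t second)
{
    return first == IO_READ && second == IO_WRITE;   /* Rd* then Wr*: bypass allowed */
}                                                    /* all other pairs stay ordered */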
Note that the 21464 does not support instruction references to I/O space. The 21464 cannot execute code received directly from the I/O ASIC. Code residing in I/O space must first be copied/DMA'ed into cacheable memory before it can be directly executed.
Note that all I/O writes are acknowledged. MBs wait for all I/O write acknowledgements to be received before proceeding. MBs also wait for the response to all I/O reads before proceeding.
Note also that there are no ordering constraints between different I/O space accesses that reference different I/O ASICs or processors; the ordering rules apply only with the same source and destination for the references. MBs must be used to order references to different I/O ASICs or processors.
14 Rambus Interface - the Zbox
Introductory information about the Zbox is located in Section 2.7.3. Information about the Zbox IPRs is located in Section 16.6.
14.1 The 5th Rambus Channel
For higher reliability in large memory systems, a fifth Rambus channel (one extra for every four channels) can optionally be enabled. This extra channel allows the system to tolerate the failure of any single DRAM part or any single DRAM row. The technique used is one used in RAID schemes for disks. (Is it RAID 5?)
When enabled, the stored bits in the 5th channel are the bit-wise exclusive-or of the corresponding bits in the original four channels. If the stored bits in channel i are a bit-stream Pi, and if the expected information for channel i is a bit-stream Vi, then the operation to store the expected bit-stream to memory is:
P0 = V0
P1 = V1
P2 = V2
P3 = V3
P4 = V0 ^ V1 ^ V2 ^ V3
And the operation to read the expected bit-stream from memory under normal operation is:
V0 = P0
V1 = P1
V2 = P2
V3 = P3
check that (P4 == P0 ^ P1 ^ P2 ^ P3)
In the case when the parity check calculation on the stored bits indicates that there are no mismatches, the read operation is complete; we assume that the reconstructed data V0-V3 is correct. In the case that there is a single-bit mismatch on the parity check calculation, the read operation is also complete; we assume that the ECC codes contained in V0-V3 will correct the (likely) resultant single-bit error.
However, in the case when the parity check calculation mismatches on more than one bit, this indicates that the resultant data V0-V3 may have a multi-bit error that is uncorrectable, and that it may have bad data that appears either good or correctable. In this case we can attempt to correct the error by mapping out a channel.
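The store, check, and reconstruct operations above can be expressed compactly. A minimal sketch follows, assuming each channel contributes one 32-bit slice per operation (the real Zbox operates on Rambus bit-streams with per-channel ECC):

#include <stdint.h>

/* Store path: channel 4 holds the bit-wise XOR of channels 0-3. */
void store_slices(const uint32_t v[4], uint32_t p[5])
{
    for (int i = 0; i < 4; i++)
        p[i] = v[i];
    p[4] = v[0] ^ v[1] ^ v[2] ^ v[3];
}

/* Normal read path: returns the parity-check syndrome (0 means no mismatch;
 * a single set bit is left to the per-channel ECC; multiple set bits suggest
 * mapping out a channel and reconstructing). */
uint32_t read_slices(const uint32_t p[5], uint32_t v[4])
{
    for (int i = 0; i < 4; i++)
        v[i] = p[i];
    return p[4] ^ (p[0] ^ p[1] ^ p[2] ^ p[3]);
}

/* Reconstruction with channel `bad` (0-3) mapped out: rebuild its data from
 * the other three data channels plus the parity channel. */
void reconstruct(const uint32_t p[5], int bad, uint32_t v[4])
{
    uint32_t x = p[4];
    for (int i = 0; i < 4; i++) {
        if (i != bad)
            x ^= p[i];
        v[i] = p[i];
    }
    v[bad] = x;
}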
Note also that when the 5th channel is enabled, correctable memory errors will not propagate; All errors that can be corrected by remapping a channel will be corrected without sending corrupt data on the network or putting corrupt data in the local caches. (When the 5th channel is not enabled, corrupted memory data may be propagated in the network (via "garbage codes") and may also be written into the local caches. Compaq Confidentia I 14-2 Rambus Interface - the Zbox 5 Jc1nuc1ry 2001 ·- Subject To Change The GIO Port 15 Miscellaneous Interfaces 15.1 The GIO Port As in the 21363, the GIO port is a way for the 21464 to interface with miscellaneous external I/O functions. During power-up initialization sequences, the GIO port provides access to system configuration information including PLL divisors, WHOAMI and other router configuration components, the Rambus SIO and 120 chains. The GIO port also provides a connection to server management memory where XSROM or console code can be loaded or mailboxes set-up to allow communication with the server management subsystem. The functional model is for the 21464 to control the operation of the port performing all reads and writes. Status bits and interrupts are used for reverse communications. This is in contrast to the JTAG port where the system is the master and initiates all read and write transactions to the 21464. 15.1.1 Signals The GIO port is a low bandwidth, simple interlace to external logic. In currently planned systems the GIO port is expected to interface with an external FPGA running in the 33Mhz range. The GIO port signals are: Table 15-1 GIO Port Signals Name Path Description GIO_TFR<3:0> B Address/Data bus. Transfers 32 data bits per transaction in 8 cycle bursts of 4 bits per cycle. Addresses are 8-bit values transferred in the first two 4-bit cycles. GIO_ALE 0 Address latch enable signal. Asserts for two cycles at the beginning of each transaction defining when transaction address bits 0-3, then 4-7 are driven on GIO_TFR<3:0>. GIO_RD 0 Read enable signal. If asserted in cycles 2-9 of a transaction, the operation is a read, if deasserted, the operation is a write transaction. GIO_CLK 0 Free running clock. GIO_INT I External interrupt signal. Asserted when system logic is requesting communication with the 21464. Interrupt handlers then perform reads and writes across the GIO port to examine and respond to the interrupt request. GIO_HINT I High priority interrupt. Currently just a separate device interrupt but we are evaluating a true HALT type of function. Compaq Confidential 5 January 2001 -~ Subject To Change Miscellaneous Interfaces 15-1 The GIO Port 15.1.2 Transactions Both read and write transactions are fixed 12 GIO clock cycle operations. Eight bits of address are transferred in the first two cycles and 32 bits of data are transferred in cycles 4-11. Cycles 3 and 12 are used to tum-around the bus for read operations and are included in write transactions to simplify the state machines. Figure 15-1 GIO Port Read Transaction Timing 0 1 2 3 5 4 6 7 8 9 10 11 12 8 9 10 11 12 GIO_CLK GIO_TFR GIO_ALE GIO_RD Read Transaction Figure 15-2 GIO Port Write Transaction Timing 0 1 2 3 4 5 6 7 GIO_CLK GIO_TFR GIO_ALE GIO_RD Write Transaction 15.1.3 Registers There are three registers in the 21464 that software uses to interact with the GIO port. These CSRs reside in the Rbox and are mapped to IO space. 15.1.3.1 GIO_CNFG The GIO_CNFG register defines the characteristics of the GIO_CLK signal. 
The GIO_CLK pin is disabled by reset (which resets?) and enabled by writing a non-zero value to the Divisor. The Divisor allows GIO_CLK to be between half and 1/256th of the core clock. This allows the GIO port to run as slow as lOMHz on a 2.5 GHz 21464. Compaq Confidential 15-2 Miscellaneous Interfaces 5 Jc1nuc1ry 2001 ·-Subject To Change The GIO Port Figure 15-3 GIO_CNFG Register s..;...7_ _ _ ,~o 63 Divisor Table 15-2 GIO_CNFG Register Field Descriptions Field Name Extent Type Description DIVISOR 7:1 Defines the number of Core clock cycles in each GIO_CLK phase (half cycle). A divisor of zero also disables the clock. RW,O 15.1.3.2 GIO_ADDR The GIO_ADDR register defines the address of the next GIO transaction. If the START_READ bit is set when the GIO_ADDR register is written, a read transaction on the GIO bus is initiated. If the START_READ bit is written with a zero, no transaction is initiated. Software must not write this register while a GIO transaction is in progress. A GIO transaction is in progress when the DONE bit of the GIO_DATA register is clear. Figure 15-4 GIO_ADDR Register 63 Addr START_READ _ __, Table 15-3 GIO_ADDR Register Fields Description Field Name Extent Type Description ADDR 7:1 RW,O Defines the eight-bit address of the GIO transaction. START_READ 0 RW,O When written with a 1, a read transaction to ADDR is performed on the GIO bus. 15.1.3.3 GIO_DATA The GIO_DATA register specifies the data to be written on the GIO bus and holds the data read from the GIO bus on read transactions. When the GIO_DATA register is written, a write transaction is initiated on the GIO bus. Software must poll the DONE bit of the GIO_DATA register to detect the completion of the write transaction and must not perform any subsequent writes to the GIO_DATA register before the DONE bit is set. The DO NE bit also indicates completion of a read transaction. ***I find the sense of the DONE bit to be backwards. It should be a BUSY bit. Software will typically issue a read and poll until complete. The loop control would be a BLT instead of a BGE but with a BUSY bit, the value returned from the completed read would be correct whereas a DONE bit would need to be masked-off before the value was used. A BUSY bit also naturally resets to zero. Compaq Confidential 5 January 2001 - Subject To Change Miscellaneous Interfaces 15-3 The GIO Port ***The BUSY vs. DONE debate is a minor nit. My inclination is to mimic the 21364, rather than implement the more natural interface but comments are welcome. Figure 15-5 GIO_DATA 323:...:...1_ _ _ _ _ _ _ _ _ _ _ _ _ _ __..::., 63 Data Table 15-4 GIO_DATA Register Fields Description Field Name Extent Type Description DONE 63 RO, 1 Status bit indicating the GIO bus controller is idle and has completed any read or write transactions. Software must query this bit to ensure the GIO logic is not busy before updating any GIO registers. DATA 31:0 RW,O Data written to the GIO_DATA register is immediately written to the GIO bus. For read transactions, the DATA field is the valid result of the read operation once the DONE bit is set. 15.1.4 Use Aside from the restrictions imposed by an 8-bit address and 32-bit data word, the 21464 does not define the external structure of the GIO port. Marvel systems intend to use a FPGA to interface to the GIO port so the external structure of the GIO port could be modified by a simple update to the FPGA program. 
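Whatever the external register map, software drives the port through the three CSRs above using the same idle-check, launch, and poll discipline. The sketch below follows the sequencing rules stated for GIO_ADDR and GIO_DATA; the CSR pointer values are placeholders and the field placements are taken from Tables 15-3 and 15-4 (the real CSRs are the IO-space-mapped Rbox registers, accessed with I/O loads and stores).

    #include <stdint.h>

    /* Placeholders: the real values are the Rbox IO-space mappings of the CSRs. */
    static volatile uint64_t *const GIO_ADDR_CSR = (volatile uint64_t *)0;
    static volatile uint64_t *const GIO_DATA_CSR = (volatile uint64_t *)0;

    #define GIO_DONE        (1ull << 63)  /* GIO_DATA<63>: controller idle/complete     */
    #define GIO_START_READ  (1ull << 0)   /* GIO_ADDR<0>: initiate a read when set      */
    #define GIO_ADDR_SHIFT  1             /* ADDR field placement per Table 15-3        */

    static void gio_wait_idle(void)
    {
        while ((*GIO_DATA_CSR & GIO_DONE) == 0)
            ;                               /* poll until the GIO controller is idle */
    }

    /* Write a 32-bit value to an 8-bit GIO address. */
    static void gio_write(uint8_t addr, uint32_t data)
    {
        gio_wait_idle();                                   /* GIO_ADDR must not be written mid-transaction */
        *GIO_ADDR_CSR = (uint64_t)addr << GIO_ADDR_SHIFT;  /* START_READ = 0: no transaction yet           */
        *GIO_DATA_CSR = data;                              /* writing GIO_DATA launches the write          */
        gio_wait_idle();                                   /* wait for the bus transaction to finish       */
    }

    /* Read a 32-bit value from an 8-bit GIO address. */
    static uint32_t gio_read(uint8_t addr)
    {
        gio_wait_idle();
        *GIO_ADDR_CSR = ((uint64_t)addr << GIO_ADDR_SHIFT) | GIO_START_READ;  /* launches the read */
        gio_wait_idle();                                   /* DONE set: DATA holds the result       */
        return (uint32_t)*GIO_DATA_CSR;                    /* low 32 bits; DONE bit dropped by cast */
    }

The server-management access sequences described next are simply repeated applications of these two helpers against the Marvel-defined GIO register addresses.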
To better understand the capabilities and operation of the GIO port, consider the proposed interface to server management on a Marvel platform. Marvel has defined several registers in GIO address space: Table 15-5 GIO Address Space Registers Defined by Marvel Register Name OOh CPUx_CTRL Olh CPUx_DMA_Data 02h CPUx_DMA_Addr 03h CPUx_RIMM_Serial_Port 04h CPUx_ComO_Xmit 05h CPUx_ComO_Rcv 06h CPUx_Coml_Xmit 07h CPUx_Coml_Rcv These registers allow three basic functions to be performed: Compaq Confidential 15-4 Miscellaneous Interfaces 5 Janu,1ry 2001 - Subject To Change The GIO Port • Direct interaction with the RAMbus serial controls • Dual Simple UART communication links • DMA access to Server Management memory As an example of how the GIO bus is used, consider the sequence of operations necessary to read a value from server management memory: Operation Description IOstore GIO_ADDR CPUx_DMA_Addr IOstore GIO_DATA addr IOload GIO_DATA (until done bit set) \\Send DMA address to FPGA IOstore GIO_ADDR CPUx_CTRL IOstore GIO_DATA 1 IOload GIO_DATA (until done bit set) \\Tell FPGA to start the DMA IOstore GIO_ADDR CPUx_CTRL I Start IOload GIO_DATA (until done bit set) (repeat read of CPUx_CTRL until bit<O> clear) \\Get DMA status from FPGA \\Wait for GIO port to idle \\Repeat read until DMA done IOstore GIO_ADDR CPUx_DMA_Data I Start IOload GIO_DATA (until done bit set) (mask off done bit) \\Get DMA data from FPGA \\Wait for GIO port to idle \\Wait for GIO port to idle \\Wait for GIO port to idle Each three-line group above equates to a single (360ns) GIO bus transaction. For the 16-bit DMA transactions currently planned, the third group would likely execute twice because the actual DMA from server management memory into the FPGA is expected to take -600ns. In total the 16-bit transfer would require (5 * 360ns) 1.8us to complete yielding an effective bandwidth of just over lMB/sec. If the FPGA implementation can accommodate it, a recommended optimization is to define the DMA start bit and opcode in the CPUx_DMA_Addr register instead of the CPUx_CRTL register. This would merge the first two GIO bus transactions into a single transaction and reduce the latency of each read sequence to 1.44us. Increasing the DMA transfer size to 32-bits per operation is another available option. The bandwidth of large transfers would be increased to over 2MB/sec but the latency of a single operation would likely be extended back to 1.8us. To initiate communications, server management will post a GIO interrupt after storing a message packet in a predefined location in server management memory. 21464 PALcode, in response to the interrupt, will read the packet and can respond by writing a response packet to a predefined location in server management memory. Multi-threading creates several problems unique to the 21464: • Which TPU (s) will service the GIO interrupt? • The GIO port is a shared resource, how to we prevent multiple TPU s from accessing it simultaneously? Compaq Confidential 5 January 2001 --· Subject To Change Miscellaneous Interfaces 15-5 The GIO Port • Does server management need to allocate separate message blocks or addresses for each TPU? The current plan is to require software to restrict access to the GIO port to a single TPU at a time. A mask register will define which TPU(s) are to receive the GIO interrupt. If multiple TPU s are selected, software must synchronize among the TPU s and ensure there is no contention for the GIO port. 
We currently do not have a mechanism that allows server management to target specific interrupts to specific TPUs. The mask register must be pre-set.

15.1.4.1 Differences In Implementation Between the 21364 and 21464

The following differences exist between the 21464's GIO port implementation and the 21364 implementation:

• The enable bit in GIO_CNFG does not exist in the 21464. If software ensures the divisor is zero whenever GIO_CLK should be disabled, both chips will behave identically.
• The 21364 defined GIO transactions to be 64-bit writes and 63-bit reads. The 21464 restricts all GIO transactions to 32 bits. As long as systems define GIO port registers to be no larger than 32 bits, the size difference should be transparent. The largest GIO operation defined in Marvel is currently 23 bits.
• Because the 21464 sequences 4 bits per cycle across the GIO bus, the FPGA will need to shift/pack differently than for the 21364, but this will be transparent to software.
• It has been requested that we make the GIO_HINT pin perform a real halt function rather than act as another device interrupt. This is under consideration.

The motivation for implementing a different interface from the 21364 was primarily the availability of pins. The Marvel CMM module connector and FPGA are already pin constrained. The 21464 will have additional voltages to control and needs to connect to the JTAG port for access to debug controls. Optimizing the GIO port to use a four-bit datapath and a single read/write control wire saves enough pins to connect the CMM to the JTAG port on the 21464. We are looking at consolidating other functions (such as GIO_CLK and SROM_CLK) as a way to further optimize the interface to the Marvel CMM.

16 Internal Processor Registers

16.1 Internal Processor Register Summary

See the Preface for the location of other documents that provide additional information about the 21464 internal processor registers. Information can be read from and written into IPRs in various ways, as described in Section 2.12.1. Table 16-1 distinguishes registers that are explicitly written by an MTPR instruction, implicitly written as the result of executing an instruction, and implicitly written as a result of some event not directly associated with the execution of a specific instruction.

Not all IPRs can be read. To aid debug, it is important that those IPRs with a strong need to be readable be identified early. The ability for one TPU to read or write another TPU's IPRs is still a source of debate. Currently no mechanism exists, but it is generally believed that debugging needs and error fix-up code might require this capability.
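For readers generating register headers or PALcode macros from the summary, the columns of Table 16-1 reduce to a small descriptor. The sketch below is illustrative only: the type and field names are not part of the specification, and the two sample entries are taken from Table 16-1 and Sections 16.2.3 and 16.2.19.

    /* Writer classes M / I / E and init classes per the Table 16-1 footnotes and Table 16-2. */
    typedef enum { WR_MTPR, WR_IMPLICIT, WR_EVENT } writer_class_t;
    typedef enum { INIT_ALL, INIT_DBG, INIT_NO }    init_class_t;

    typedef struct {
        const char    *mnemonic;
        int            per_tpu;    /* one copy per TPU?                  */
        unsigned       index;      /* HW_MTPR / HW_MFPR index            */
        writer_class_t writer;
        int            readable;
        init_class_t   init;
    } ipr_desc_t;

    static const ipr_desc_t ipr_examples[] = {
        { "EXC_ADDR",  1, 0xA1, WR_IMPLICIT, 1, INIT_DBG },  /* implicitly written by traps (Section 16.2.3) */
        { "PAL_TEMP1", 1, 0xA9, WR_MTPR,     1, INIT_DBG },  /* explicit MTPR scratch (Section 16.2.19)      */
    };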
Table 16-1 Internal Processor Register Summary Mnemonic PerTPU Index Writer Class1 Read lnit2 Grp I General Events for 1PU 0 IAGG_EVENTO N 1 1100 000 E y Dbg I General Events for 1PU 1 IAGG_EVENTl N 1 1100 001 E y Dbg I General Events for 1PU 2 IAGG_EVENT2 N 1 1100 010 E y Dbg I General Events for 1PU 3 IAGG_EVENT3 N 1 1100 011 E y Dbg M General Events for TPU 0 MAGG_EVENTO N 0 1100 000 E y Dbg M General Events for TPU 1 MAGG_EVENTl N 0 1100 001 E y Dbg M General Events for 1PU 2 MAGG_EVENT2 N 0 1100 010 E y Dbg M General Events for TPU 3 MAGG_EVENT3 N 0 1100 011 E y Dbg PR_INST_CTL N 11101 000 M y All Name Performance Monitoring IPRs3 Event Counter IPR Bundle M Event Counter IPR Bundle Profile I Data IPR Bundle Profile Instruction Control 11 Compaq Confidential 5 January 2001 ~· Subject To Change Internal Processor Registers 16-1 Internal Processor Register Summary Table 16-1 Internal Processor Register Summary (Continued) Name Mnemonic PerTPU Index Writer Class1 Read lnit2 Grp Profile Trigger on PC PR_TRIG_PC N 1 1101 001 M y All Profile Instruction Character. PRn_pc N 1 1101 Oln E y Dbg Profile Instruction Ibox Info PR_I_INFO N 1 1101 100 E y Dbg Profile Instruction Qbox Info PR_Q_INFO N 1 1101 101 E y Dbg Profile MAGG_EVENT CTRL PR_MEM_EVENT_CTL N 0 1101 000 M y All Profile Memory Information PRn_MEM_INFO N 0 1101 Oln E y Dbg Profile Store Latency Info PR_ST _LATENCY N 0 1101 100 E y Dbg Profile Instr Timeline part 0 PRn_TIMELINEO N 1 1110 OOn E y Dbg Profile Instr Timeline part 1 PRn_TIMELINEl N 1 1110 Oln E y Dbg Profile Instr Timeline part 2 PRn_TIMELINE2 N 1 1110 lOn E y Dbg Profile Instr Timeline part 3 PRn_TIMELINE3 N 1 1110 lln E y Dbg PRn_DMISS_INFO N 0 1110 OOn E y Dbg - 11 Profile M Data IPR Bundle Ml PROFILE Timeline IPR Bundle Profile D Miss Bundle Profile Dcache Miss Info lbox (Instruction Fetch Unit) IPRs Cycle Counter cc y 1 0111 o:xx M y * 11 CPU Configuration CPU_CNFG N 1 1001 000 M y All 11 DTB Single-Miss Return Address DTBMS_REI _ADDR y 10100111 y Dbg 11 Exception Address EXC_ADDR y 10100001 y Dbg 11 Exception Summary EXC_SUM y 10100000 y Dbg 11 !box Control I_CTL y 1 oooooxx M y Dbg 11 !_MODE y 1 0001 o:xx M y Dbg 11 !box Process Context I_PCTX y 1 OOlOXXX M y Dbg 11 !cache Status IC_STAT y 11001 001 E y Dbg 11 !cache Flush IC_FLUSH y 11001100 M N !cache Flush (ASM=O) IC_FLUSH_ASM y 1 1001101 M N ITB Invalidate Multiple ITB_IM y 11000000 M N ITB Invalidate Single ITB_IS y 11000010 M N No 12 Instruction PfE Array Write ITB_PfE y 11000100 M N No 13 Instruction TAG Array Write ITB_TAG y 11000110 M N No 12 Inst. Virtual Address Format IVA_FORM y 10100011 PALcode Base PAL_BASE y 1 0101 000 M y ? 
11 CPU Base CPU_BASE N 1 0101 010 M y All 11 !box Mode 12 y Compaq Confidential 16-2 Internal Processor Registers 5 Janw~ry 2001 -· Subject To Change Internal Processor Register Summary Table 16-1 Internal Processor Register Summary (Continued) Name Mnemonic PerTPU Index Writer Class1 Read lnit2 Grp PALcode Temporary 1 PAL_TEMPl y 10101001 M y Dbg 11 PALcode Temporary 2 PAL_TEMP2 y 10101010 M y Dbg 11 TPU_CNFG y 11011000 M y All 11 M y All Ml y Dbg Ml Thread Config Mbox ( Internal Memory Controller Unit ) IPRs Dcache Control DC_CTL N 0 1001 000 Dcache Status DC_STAT y 01001001 DTB Invalidate Multiple DTB_IM y 01000000 M N DTB_IS y 0 1000010 M N No M2 DTB PIE Array Write DTB_PfE y 01000 lOX M N No M3 DTB Tag Array Write DTB_TAG y 0100011X M N No M2 Mbox Control M_CTL y y Dbg Ml Mbox Process Mode M_MODE y ooooooxx M 00001 xxx M y Dbg Ml Mbox Process Context M_PCTX y 0 OOlOOXX M y Dbg Ml Mbox Mem. Management Status M_STAT y 0 0100000 y Dbg Ml QUIESCE_TIMEOUT y 0 0111000 y Dbg Ml Virtual Address VA y 0 0100001 y Dbg Ml Virtual Address Format VA_FORM y 0 0100011 y Watch Physical Address WATCH_PHYS_ADDR y HW_INT_CLR y Int. Enable and Current Mode IER_CM y Interrupt Summary ISUM y Software Interrupt Request SIRR y DTB Invalidate Single Quiesce Timeout M Ml Dbg Cbox (Scache Control) IP Rs Hardware Interrupt Clear M N M y M y Rbox ( External Router Unit ) IPRs Router Configuration 1 R_CFGl Router Configuration 2 R_CFG2 Router Interrupt Mask R_INT_MASK Router Interrupt Queue R_INT_QUE Router Interrrupt Queue Add R_INT_QUEADD Router Interrupt Request R_INT_REQ Router Interrupt Status R_INT_STAT Router Interval Timer R_INTER_TIM Router IO Port Buffer Size R_IO_BUFSIZ Router IO Port Config 1 R_IO_CFGl Router IO Port Config 2 R_IO_CFG2 Compaq Confidential 5 January 2001 - Subject To Change Internal Processor Registers 16-3 Internal Processor Register Summary Table 16-1 Internal Processor Register Summary (Continued) Name Mnemonic Router IO Port Error Status R_IO_ERR Router IO Port Pelf. Counter R_IO_PERF Router IO Port Timerl Config R_IO_TlCFG Router IO Port Timer2 Config R_IO_T2CFG Router Local Port Error Status R_LOC_ERR Router Channeln Config 1 R_n_CFGl Router Channeln Config 2 R_n_CFG2 Router Channeln Error Status R_n_ERR Router Channeln Pelf Count R_n_PERF Router Channeln Timerl Config R_n_TlCFG Router Channeln Timer2 Config R_n_T2CFG Router Overall Timer-Control R_OVER Router Routing Table R_ROUT Router Scratch 1 R_SCRATCHl Router Scratch 2 R_SCRATCH2 Router Who-Am-I? 
R_WHOAMI PerTPU Index Writer Class1 Read lnit2 Grp Zbox (External Memory Controller Unit) IPRs DIFT Control ZBOXn_DIFf_CTL DIFT Error Status ZBOXn_DIFf_ERR_STATUS DIFT Timeout ZBOXn_DIFf_TIMEOUT DRAM Calibration Control 1 ZBOXn_DRAM_CALIB_CTLl DRAM Calibration Control 2 ZBOXn_DRAM_CALIB_CTL2 DRAM Error Address ZBOXn_DRAM_ERR_ADR DRAM Error Status 1 ZBOXn_DRAM_ERR_STATUSl DRAM Error Status 2 ZBOXn_DRAM_ERR_STATUS2 DRAM Error Status 3 ZBOXn_DRAM_ERR_STATUS3 DRAM Error Control ZBOXn_DRAM_ERROR_CTL DRAM Initialization Control ZBOXn_DRAM_INIT_CTL DRAM Mapper Control ZBOXn_DRAM_MAPPER_CTL DRAM Refresh Control ZBOXn_DRAM_REFR_CTL DRAM Refresh Row ZBOXn_DRAM_REFRESH_ROW DRAM Sweep Directory Bits ZBOXn_DRAM_SWEEP_DIR DRAM Timing Control 1 ZBOXn_DRAM_TIMING_CTLl DRAM Timing Control 2 ZBOXn_DRAM_TIMING_CTL2 DRAM Timing Control 3 ZBOXn_DRAM_TIMING_CTL3 Compaq Confidential 16-4 Internal Processor Registers 5 Jc1m.1c1ry 2001 - Subject To Change Internal Processor Register Summary Table 16-1 Internal Processor Register Summary (Continued) Name Mnemonic DRAM Timing Control 4 ZBOXn_DRAM_TIMING_C1L4 Force Error Address ZBOXn_FRC_ERR_ADR RAC Control ZBOXn_RAC_C1L Performance Counter 1 ZBOXn_ZPM_C1Ll Performance Counter 0 ZBOXn_ZPM_CTRO PerTPU Index 2 M = MTPR; I = Implict; E = Event See Table 16-2. 3 Chapter 19 contains the information for Performance Monitoring IPRs. Writer Class1 Read lnit2 Grp 16.1.1 PALcode Coding Rules PALcode coding rules are described in Section 17.5. 16.1.2 IPR Issues: • The 21264 had the behavior that an implicitly written register would read as zero if read while being written. Will the 21464 have the same behavior? Should we define a valid bit in each of the implicitly-written registers to explicitly flag this case? All registers except VA have bit<63> available. • Need to better understand SLEEP modes and GCLK PLL programming. This is also tied into how to bring the chip alive. What state must be preserved when entering/exiting sleep mode? • What does I_CTL[CHIP_ID] do? If it cannot be written, how is it different than AMASK/IMPLVER? • Disruptions and PALmode in the !box describes several cases where a combination of traps within traps overwrites implicitly written IPRs. Can these cases be enumerated to form guidelines or define a rule? 16.1.3 Reset This section should be moved to the !nit chapter when more known .... The 21464 will have at least three major reset modes (with maybe a fourth for manufacturing test): 1. Cold Power-on full-reset. Initialize all IPRs. 2. Fast Quick, complete reset for Tandem synchronization. Initialize all IPRs. 3. Debug Compaq Confidential 5 January 2001 - Subject To Change Internal Processor Registers 16-5 lbox IPRs Programmable reset for debug. Initialize only required IPRs; the required subset being defined as IPRs that contain bits that could alter the initial post-reset code flow. IPRs fall into two basic categories: • Registers that must be set by hardware to an initial value for all reset flows. • Registers that can be initialized by software (PALcode) during the flow. The primary reason to NOT initialize an IPR is for debug. Manufacturing test patterns, Tandem synchronization, and general chip simulation and verification benefit from hardware initialization of most or all IPR values. In addition, implicitly written and event-written IPRs that are not also writable by a HW_MTPR can be difficult to initialize with software. 
It should also be noted that if a destructive scan dump operation precedes a debug reset, the contents of all uninitialized IPRs are potentially unknown random values. Table 16-2 defines the classes of initialization. Table 16-2 IPR Initialization Classification Class Meaning All The value is initialized by hardware for all reset flows. Dbg The value is initialized by hardware for Cold and Fast reset flows but left to software to initialize for the debug reset flows. No The value is not initialized by hardware during any reset flow. The goal is to eliminate this class. 16.2 lbox IPRs This section describes the Ibox IPRs. The IPR reserved fields can have the following type: Table 16-3 IPR Reserved Field Type Definitions Type Meaning MBZ Must be zero when written and always read as zero. RAZ Ignored for writes and always read as zero. X Ignored for writes and reads. 16.2.1 Cycle Counter Register - CC[tpu] The process cycle counter consists of two fields. The COUNT field is an unsigned, wrapping counter, the OFFSET field is an operating-system specific offset which, when added to the wrapping counter, forms a per-process or per-thread cycle count. The ENABLE field is used to enable/disable the counter. The RPCC instruction is used to read the process cycle counter. It is TBD whether a MFPR instruction will also read the cycle counter. Section (II-A) 2.1.12 of the Alpha SRM requires a mechanism to cause the RPCC instruction to read-as-zero, writing CC_CTL with a zero achieves that result. Com p.aq Confidential 16-6 Internal Processor Registers 5 Jc1nw1ry 2001 ···Subject To Cfumge lbox IPRs Notes: • Most event-written IPRs in the 21464 will have a valid bit because they may not read correctly when being updated by an event. This register must read correctly even if the counter is being incremented. RPCC Read: Written Index HW_MTPR OxB8-0xBB The low two index bits allow for selective writing of fields. 00 01 10 11 write nothing write OFFSET and ENABLE fields only write COUNT field only write all fields Figure 16-1 Cycle Counter Register- CC[tpu] Table 16-4 Cycle Counter Register Fields Description Field Name Extent Type Description OFFSET 63:32 RW,O OS specific value that is added to PCC_CNT to derive the per-process cycle count. Overflow of PCC_CNT does not alter PCC_OFF. COUNT 31:4 RW,O Wrapping counter which increments once every 16th CPU cycle. Reserved 3:1 RAZ ENABLE 0 RW,O Cycle counter enable. The COUNT field increments monotonically when enabled and remains unchanged when disabled. 16.2.2 OTB Single-Miss Return Address Register - DTBMS_RET_ADDR[tpu] Stores the return PC for single-level DTB miss traps. On the 21264, the return address was stored in EXC_ADDR, but saving the value in a separate register avoids cases where disruptions during the single miss flow cause the EXC_ADDR register to be modified and the return PC for the DTB miss flow to be lost. The traps that set DTBMS_RET_ADDR are: DTBMS_SINGLE DTBMS_SINGLE_CONS Compaq Confidential 5 January 2001 - Subject To Change Internal Processor Registers 16-7 lbox IPRs DTBMS_RET_ADDR is readable at two different locations. The first location (OxA6) is a general location with no side-effects. The second location (OxA7) has the sideeffect of setting a issue block against load and store instructions and is intended to only be used within the block of instructions that modifies a DTB entry. Written Readable Index Reset Implicitly written when a DTB single miss trap occurs. HW_MFPR OxA6, OxA7 Unchanged for debug. 
Figure 16-2 OTB Single-Miss Return Address Register - DTBMS_RET_ADDR[tpu] ~·1 63 ADDR MODE Table 16-5 OTB Miss Return Address Register Field Descriptions Field Name Extent Type Description ADDR 63: 2 IR,O Sign-extended PC of the instruction that caused the TB miss, where ADDR is SEXT(PC<51:2>). MODE 1:0 IR,O Mode of the trapping instruction: 00-Normal 01-PALmode 11 - SuperPALmode 16.2.3 Exception Address Register - EXC_ADDR[tpu] Implicitly written with the expected restart PC for most PALmode traps. The only PALmode traps that do not write EXC_ADDR are: DTBM_SINGLE DTBM_SINGLE_CONS IMCHK For Interrupts, this register contains the PC of the next instruction that would have executed had the interrupt not occurred. PALcode uses this address as the return address from the interrupt handler. Written Readable Index Reset Implicitly written when a trap occurs. HW_MFPR OxAl Unchanged for debug. Figure 16-3 Exception Address Register - EXC_ADDR[tpu] ~·1 63 ADDR MODE Compaq Confidential 16-8 Internal Processor Registers 5 Jc1nuc1ry 2001 - Subject To Change lbox IPRs Table 16-6 Exception Address Register Field Descriptions Field Name Extent Type Description ADDR 63:2 IR,O For all traps except ARITH and MT_FPCR, the restart PC written into EXC_ADDR is the sign-extended PC, SEXT(PC<51:2>), of the instruction that caused the trap. For ARITH andMT_FPCR, the restart address is the PC of the next instruction MODE 1:0 IR,O Mode of the trapping instruction: 00-Normal 01 - PALmode 11 - SuperPALmode 16.2.4 Exception Summary Register - EXC_SUM[tpu] The exception summary register is an implicitly written register that contains trap status information and any register specifiers present in the original instruction. The traps that set EXC_SUM are: ARI TH DFAULT UNALIGN DTBM_SINGLE DTBM_SINGLE_CONS BAD_JMP_IVA The register fields actually reflect bits <26:16> and <4:0> of the instruction longword independent of the type of operation. The fields are not qualified in any way for instructions that lack one or more of the register fields. Written Readable Index Reset Implicitly written when a trap occurs. HW_MFPR OxAO Unchanged for debug. Compaq Confidential 5 January 2001 -~ Subject To Change Internal Processor Registers 16-9 lbox IPRs Figure 16-4 Exception Summary Register 63 EXC_SUM[tpu] 28 2423 Ra 1918 Re Rb INT SET_IOV SET_INE SET_UNF - - - - ' SET_OVF - - - - ' SET_DZE - - - - - ' SET_INV _ _ ____, IOV------' INE - - - - - - - ' UNF - - - - - - - ' OVF--------' DZE - - - - - - - - ' INV--------' swc--------' Table 16-7 Exception Summary Register Field Descriptions Field Name Extent Type Reserved 63:29 RAZ Ra 28:24 IR,O Instruction bits<25:21> of the trapping instruction Rb 23:19 IR,O Instruction bits<20: 16> of the trapping instruction Re 18:14 IR,O Instruction bits<4:0> of the trapping instruction INT 13 IR,O Integer overflow/underflow trap SET_IOV 12 IR,0 PALcode should set FPCR[IOV] SET_INE 11 IR,0 PALcode should set FPCR[INE] SET_UNF 10 IR,O PALcode should set FPCR[UNF] SET_OVF 9 IR,O PALcode should set FPCR[OVF] SET_DZE 8 IR,O PALcode should set FPCR[DZE] SET_INV 7 IR,O PALcode should set FPCR [INV] IOV 6 IR,O Floating convert to integer trap INE 5 IR,O Floating inexact error trap UNF 4 IR,O Floating underflow trap OVF 3 IR,0 Floating overflow trap DZE 2 IR,O Divide by zero trap INV 1 IR,O Invalid operation trap swc 0 IR,O Software completion possible/requested. Set if the instruction that triggered the trap contained the /S specifier. 
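Because EXC_SUM simply captures the specifier fields regardless of the instruction type, decoding it is a matter of fixed shifts and masks. A minimal sketch follows, assuming the bit layout of Table 16-7; the helper names are illustrative and not part of the PALcode interface.

    #include <stdint.h>

    /* Extract bits <hi:lo> of an EXC_SUM value, per the extents in Table 16-7. */
    #define EXC_SUM_FIELD(v, hi, lo)  (((v) >> (lo)) & ((1u << ((hi) - (lo) + 1)) - 1))

    typedef struct {
        unsigned ra, rb, rc;   /* register specifiers of the trapping instruction        */
        unsigned set_mask;     /* SET_IOV..SET_INV: FPCR bits PALcode should set          */
        int      swc;          /* /S qualifier present: software completion requested     */
    } exc_sum_t;

    static exc_sum_t decode_exc_sum(uint64_t exc_sum)
    {
        exc_sum_t d;
        d.ra       = EXC_SUM_FIELD(exc_sum, 28, 24);  /* instruction bits <25:21> */
        d.rb       = EXC_SUM_FIELD(exc_sum, 23, 19);  /* instruction bits <20:16> */
        d.rc       = EXC_SUM_FIELD(exc_sum, 18, 14);  /* instruction bits <4:0>   */
        d.set_mask = EXC_SUM_FIELD(exc_sum, 12, 7);   /* SET_IOV .. SET_INV       */
        d.swc      = (int)(exc_sum & 1);              /* SWC, bit 0               */
        return d;
    }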
Description Compaq Confidential 16-10 Internal Processor Registers 5 Janwiry 2001 ··· Subject To Change lbox IPRs 16.2.5 lbox CPU Configuration Register - CPU_CNFG Per-chip configuration register. Settings apply to all TPUs. Written Readable Index Reset HW_MTPR HW _MFPR OxCS All modes. Figure 16-5 lbox CPU Configuration Register - CPU_CNFG 4 3 63 CHIP_ID SLOT1_DIS _ ___, LPR_SEQ_DIS _ ____, FTC_RR _ _ ___. CBBYP_DIS ----..J CBCLPS_DIS - - - - - - l CBRFAST_DIS - - - - - - - ' TLB_USE1 - - - - - - - - - ' UPD_TD - - - - - - - - l RMP_WAY - - - - - - - - - ' PREF_EN - - - - - - - - - ' ANTl_STARVE - - - - - - - - - - ' SS_2TRAIN - - - - - - - - - - - ' SS_GH - - - - - - - - - - - - ' SS_CLR------------' THRASH_LIMIT - - - - - - - - - - - - - - - ' Table 16-8 CPU Configuration Register Fields Description Field Name Extent Type Description CHIP_ID 63:56 RO,O Reserved 55:25 MBZ µITB_DIS 24 RW,O Disable the µITB performance feature (a debug mode). BPR_DIS 23 RW,O Disable the branch predictor. When set, all branches will be predicted not-taken. SLOTl_DIS 22 RW,O Disable the use of slot 1 fetching. LPR_SEQ_DIS 21 RW,O Disable sequential training, predict non-sequential FTC_RR 20 RW,O Force the fetch thread chooser into round-robin mode CBBYP_DIS 19 RW,O Disable bypasses around the collapsing instruction buffer CBCLPS_DIS 18 RW,O Disable the collapsing capability of the collapsing buffer CBRFAST_DIS 17 RW,O Disable Oldest CBR mispredict fast restart optimization TLB_USEl 16 RW,O Use only one entry in the ITB. UPD_TD 15 RW,O Update the thrash detector array RMP_WAY 14 RW,O Enable remapping the !cache way when a thrash has been detected by the thrash detector. Read-only CHIP_ID code Compaq Confidential 5 January 2001 -~Subject To Change Internal Processor Registers 16-11 lbox IPRs Table 16-8 CPU Configuration Register Fields Description (Continued) Field Name Extent Type Description PREF_EN 13 RW,O Enable the prefetch hardware. ANTI_STARVE 12:11 RW,O Controls the fetch starvation threshold detection. If a TPU does not retire an instruction for the selected number of instructions, the other TPUs will be suspended until the starving threads have retired at least one good instruction. 00 01 10 11 Off (anti-starvation detection disabled) lK cycle non-retire threshold 16K cycle non-retire threshold 128K cycle non-retire threshold SS_2TRAIN 10 RW,O SS_GH 9:8 RW,O By enabling these bits, one or both of the upper two bits of the Store Set array index will include LGHIST information. SS_CLR 7:4 RW,O The Store Set array is cleared whenever the bit defined by this field overflows in a free-running counter. The bit monitored is bit (9 + SS_CLR) creating a clear frequency of 2< 9+ss_CLR)_ A SS_CLR value of zero disables the Store Set array. The recommended value of SS_CLR is ten causing the Store Sets to be cleared every 512K cycles. THRASH_LIMIT 3:0 RW,O Number of thrashes before the entry is remapped to a set location in the le ache. 16.2.6 lbox TPU Configuration Register-TPU_CNFG Icache/Ibox configuration register. Written Readable Index Reset HW_MTPR HW_MFPR OxD8 All modes. Figure 16-6 lbox TPU Configuration Register - TPU_CNFG 63 58 56 Table 16-9 lbox TPU Configuration Register Field Descriptions Field Name Extent Type Description Reserved 63:58 MBZ TPU_ID 57:56 R0,0-3 Read-only ID of the current TPU. The 21464 has four TPUs numbered 0,1,2,3. 
Reserved 55:8 MBZ Compaq Confidential 16-12 Internal Processor Registers 5 Jc1nuc1ry 2001 ·- Subject To Change lbox IPRs Table 16-9 lbox TPU Configuration Register Field Descriptions Field Name Extent Type Description MCHK_EN 7 RW,O Enable machine-check interrupts to this TPU. PREF_RANGE 6:4 RW,O Number of le ache blocks to prefetch beyond a demand miss Reserved 3:0 MBZ 16.2.7 lbox Control Register - LCTL[tpu] The per-TPU l_CTL register controls !stream memory management functions. Most fields in I_CTL are replicated in M_CTL. It is expected (but not required) that these registers will be typically written together. Written Readable Index Reset HW_MTPR HW_MFPR Ox80-0x83 Unchanged for debug. The low two index bits allow for selective writing of fields. 00 01 10 11 Write nothing Write VPfE_BASE only Write SUPERPAGE, VA_SIZE and PAGE_SIZE fields Write all fields Figure 16-7 lbox Control Register 63 I_CTL[tpu] 525;;.,;..1_ _ _ _ _ _ _ _--""""33 26 24 2120 VPTE_BASE<51 :33> REDUCED_PT SUPERPAGE VA_SIZE - - - - - - - ' PAGE_SIZE _ _ _ __, Table 16-10 lbox Control Register Field Descriptions Field Name Extent Type Reserved 63:52 MBZ VPTE_BASE 51:33 RW,O Reserved 32:28 MBZ REDUCED_Pf 27 RW Description Virtual Page Table Base. See IVA_FORM, Section 16.2.17, for details. See Appendix C.3. Compaq Confidential 5 January 2001 ··· Subject To Change Internal Processor Registers 16-13 lbox IPRs Table 16-10 lbox Control Register Field Descriptions Field Name Extent Type Description SUPERPAGE 26:24 !stream Super Page mode enables. Any combination of bits can be set at once. Any non-kernel mode access to an enabled superpage region must result in an access violation. RW,O SPE[2] SPE[l] SPE[O] Enables super page mapping when PC[63:50] = Ox3FFE. In this mode PC[47:0] is mapped directly to PA[47:0]. Enables super page mapping when PC[63:41] = Ox7FFFFE. In this mode PA[47:0] = SEXT(PC[40:0]). Enables super page mapping when PC[63:30] = Ox3FFFFFFFE. In this mode PA[47:0] = ZEXT(PC[29:0]). Reserved 23:22 RAZ VA_SIZE 21 RW,O Defines the I-Stream Virtual address size. Controls the IVA_FORM register and sign extension checking. VA_SIZE = 0 - 43-bit addressing VA_SIZE = 1 - 52-bit addressing (invalid if PAGE_SIZE = 0) PAGE_SIZE 20 RW,O Defines the I-Stream page size. Controls the IVA_FORM register PAGE_SIZE = 0 - 8KB pages PAGE_SIZE = 1 - 64KB pages Reserved 19:0 As follows: Bits Type 19 18 17:8 7:6 5:3 2:0 x MBZ x RAZ x MBZ 16.2.8 lbox Process Mode Register- l_MODE[tpu] The Ibox process mode register specifies the console and current process mode. These mode bits shadow the bits in M_MODE and exists on a per-TPU basis. Written Read Index Reset HW_MTPR HW_MFPR Ox88-0x8B Unchanged for debug. The low two index bits allow for selective writing of fields. 00 01 10 11 Write nothing Write CURRENT field only Write CONSOLE field only Write all fields Compaq Confidential 16-14 Internal Processor Registers 5 Jc1nuary 2001 m Subject To Change lbox IPRs Figure 16-8 lbox Process Mode Register - l_MODE[tpu] 65432 63 0 Table 16-11 lbox Process Mode Register Fields Description Field Name Extent Type Reserved 63:6 Description As follows: Bits 63:52 51:33 32:28 27:19 18 17:6 Type MBZ x MBZ x MBZ x CONSOLE 5 RW,O ITB traps in console mode are reported to the trap handler separately from non-console mode traps so they can be vectored to different addresses. 
CURRENT 4:3 RW,O The CURRENT mode field is encoded as follows 00-Kernel 01-Executive 10 - Supervisor 11-User Reserved 2:0 MBZ 16.2.9 lbox Process Context Register - l_PCTX[tpu] The process context register contains information associated with the context of the process currently running on the TPU. Written Readable Index Reset HW_MTPR HW_MFPR Ox90-0x97 Unchanged for debug. The low three index bits allow for selective writing of fields. 000 001 010 100 111 Write nothing Write ASN field only Write TPU_GRP field only Write FP_ENABLE field only Write all fields Compaq Confidential 5 January 2001 ···Subject To Change Internal Processor Registers 16-15 lbox IPRs Figure 16-9 lbox Process Context Register - l_PCTX[tpu] Table 16-12 lbox Process Context Register Field Descriptions Field Name Extent Type Reserved 63:20 FP_ENABLE 19 Description As follows: Bits Type 63:52 51:33 32:28 MBZ 27:20 x RW,O If clear, floating-point instructions generate FEN exceptions. Used at x MBZ process context switch time to detect if any FP state exists. Software clears the bit when a process is initially created and only sets it when the first FP instruction traps. Reserved 18 MBZ TPU_GRP 17: 16 RW,0 TPU group number associated with this TPU. Allows TB entries to be collectively allocated/invalidated for all TPUs that belong to the same group. ASN 15:8 RW,O Address space number. Stored in TBs and compared during invalidate operations to minimize the number of entries invalidated on a process context switch. See SRM Section (II-B) 3.8 for details. Reserved 7:0 As follows: Bits Type 7:3 2:0 x MBZ 16.2.10 lcache Status Register- IC_STAT[tpu] The Ibox status register is a read/write-1-to-clear register that contains Ibox status information. Written Readable Index Reset Set implicitly by lcache data or tag parity error, WlC HW_MFPR OxC9 Unchanged for debug. Figure 16-10 lcache Status Register- IC_STAT[tpu] 1 0 INDEX MULTIPLE _ ___, PAR_ERROR _ _ __, Compaq Confidential 16-16 Internal Processor Registers 5 Jc1m.1c1ry 2001 ~·Subject To Change lbox IPRs Table 16-13 lcache Status Register Fields Descriptions Description Field Name Extent Type Reserved 63:16 RAZ INDEX 15:5 RO,O Reserved 4:2 RAZ MULTIPLE 1 RO,O PAR_ERROR 0 WlC,O Set when a data or tag parity error occurs, Cleared when wtten with a 1. Writing a 1 also clears the MULTIPLE field. Cache line index that caused the PAR_ERROR bit to be set. Set when a data or tag parity error occurs and the PAR_ENABLE bit is already set. Cleared whenever the PAR_ERROR bit is cleared. 16.2.11 lcache Flush Register- IC_FLUSH[tpu] When a write to the IC_FLUSH pseudo-register retires, all !cache blocks that match the group number of this TPU (in l_PCTX) are marked invalid. Written Readable Index Reset HW_MTPR No OxCC NIA Figure 16-11 lcache Flush Register- IC_FLUSH[tpu] 63 Table 16-14 lcache Flush Register Fields Description Field Name Extent Type Reserved 63:0 Description x 16.2.12 lcache Flush (ASM=O) Register- IC_FLUSH_ASM[tpu] When a write to the IC_FLUSH_ASM pseudo-register retires, all !cache blocks that match the group number of this TPU (in l_PCTX) and have their ASM bit cleared are marked invalid. The current implementation actually performs an IC_FLUSH when this IPR is written. 
Written Readable Index Reset HW_MTPR No OxCD NIA Compaq Confidential 5 January 2001 ··· Subject To Change Internal Processor Registers 16-17 lbox IPRs Figure 16-12 lcache Flush (ASM =0) Register- IC_FLUSH_ASM[tpu] 63 0 Table 16-15 lcache Flush (ASM Field Name Extent Type Reserved 63:0 =O) Register Fields Description Description x 16.2.13 ITB Invalidate Multiple Register- ITB_IM[tpu] The ITB Invalidate Multiple register is a write-only pseudo-register. When write instructions to this register retire, all ITB entries matching the criteria specified by the mode bits are invalidated. An explicit write to IC_FLUSH is required to flush the lcache of the corresponding blocks. Written Readable Index Reset HW_MTPR NO OxCO NIA Figure 16-13 ITB Invalidate Multiple Register- ITB_IM[tpu] ---. MODE 5 4 63 Table 16-16 ITB Invalidate Multiple Register Fields Descriptions Field Name Extent Type Reserved 63:5 MBZ MODE 4:0 WO,O Description Defines the invalidation mode, as follows: Value Mnemonic Description OxOO OxOl IAG IA Ox03 IASM Ox05 IAP Ox OD IASN OxlO IAG Ox16 IWRP Invalidate all entries independent of group Invalidate all entries that match the current TPUGRP field in M_PCTX. Invalidate all entries with the ASM bit set that also match the current TPUGRP. Invalidate all entries with the ASM bit clear that also match the current TPUGRP Invalidate all non-ASM entries that match the current TPUGRP and ASN fields. Invalidate all entries independent of group and reset the write pointer Reset the write pointer only. Nothing is invalidated Compaq Confidential 16-18 Internal Processor Registers 5 Jc1nuc1ry 2001 -·Subject To Change lbox IPRs 16.2.14 ITB Invalidate Single Register- ITB_IS[tpu] When a write to the ITB_IS pseudo-register retires, all ITB entries that match the group number and address space number of this TPU (in I_PCTX) and match the tag value supplied are marked invalid. The implementation physically shares storage for this register with the ITB_TAG register so writes to these registers must be separated by a IFETCHB instruction to ensure correct ordering. Written Readable Index Reset HW_MTPR No OxC2 No Figure 16-14 ITB Invalidate Single Register- ITB_IS[tpu] 63 525.;..;..1_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _--'-'1312 VA<51:13> Table 16-17 ITB Invalidate Single Register Fields Description Field Name Extent Type Reserved 63:52 VA<51:13> 51:13 Reserved 12:0 Description MBZ MBZ 16.2.15 Instruction PTE Array Write Register- ITB_PTE[tpu] The ITB PfE array is written by way of this register. A write transaction to the ITB_TAG writes a register outside of the ITB array. When a write to the ITB_PTE register is retired, the contents of both the ITB_TAG and ITB_PfE registers are written into the ITB entry. The specific ITB entry written is determined by the round-robin algorithm described above. Written Readable Index Reset HW_MTPR No OxC4 No Compaq Confidential 5 January 2001 -· Subject To Change Internal Processor Registers 16-19 lbox IPRs Figure 16-15 Instruction PTE Array Write Register - ITB_PTE[tpu] 63 PA<44:13> or PA<47:16> URE--~ SAE--__. ERE---~ KRE - - - - - - - ' GH[1:0] _ _ _ _ _...... ASM - - - - - - - - ' Table 16-18 Instruction PTE Array Write Register Field Descriptions Field Name Extent Type Description PA<44:13> or PA<47:16> 63:32 W,? The physical page number is PA<44:13> when in SK page mode and PA<47: 16> when in 64K page mode. The 21464 cannot address more than 16TB when in 8K page mode. Reserved 31: 12 MBZ URE 11 W,? User write enable. 
When process context is User mode, this bit must be set to write this entry. SRE 10 W,? Supervisor write enable. When process context is Supervisor mode, this bit must be set to write this entry. ERE 9 W,? Executive write enable. When process context is Executive mode, this bit must be set to write this entry. KRE 8 W,? Kernel write enable. When process context is Kernel mode, this bit must be set to write this entry. Reserved 7 MBZ GH[l:O] 6:5 W,? Granularity hint. ASM 4 W,? Address space match bit. When set, this PTE matches all address space numbers. Reserved 3:0 MBZ 16.2.16 Instruction Tag Array Write Register- ITB_TAG[tpu] The ITB tag array is written by way of this register. A write transaction to the ITB _TAG writes a register outside of the ITB array. When a write to the ITB _PTE register is retired, the contents of both the ITB_TAG and ITB_PTE registers are written into the ITB entry. The specific ITB entry written is determined by a round-robin algorithm; the algorithm writes to entry number 0 as the first entry after the 21464 is reset. The implementation shares the physical register with the ITB_IS register so a IFETCHB instruction must separate writes to the ITB_TAG and ITB_IS registers. Written Readable Index Reset HW_MTPR No OxC6 No Compaq Confidential 16-20 Internal Processor Registers 5 Jc1m.1c1ry 2001 -· Subject To Change lbox IPRs Figure 16-16 Instruction Tag Array Write Register- ITB_TAG[tpu] 525.;...;..1_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 1312 63 VA<51:13> Table 16-19 Instruction Tag Array Write Register Fields Description Field Name Extent Type Reserved 63:52 VA<51:13> 51:13 Reserved 12:0 Description MBZ MBZ 16.2.17 Instruction Virtual Address Format Register - IVA_FORM[tpu] The read-only virtual address format register contains the virtual page table entry address derived from the faulting virtual address stored in the EXC_ADDR register along with the virtual page table base and associated control bits stored in the I_CTL register. 
Written Readable Index Reset NIA (Derived from other implicitly and explicitly written registers) HW_MFPR OxA3 NIA Figure 16-17 Instruction Virtual Address Format Register- IVA_FORM[tpu] 43-bitVA/ 8KB pages (VA_SIZE=O, PAGE_SIZE=O, REDUCED_PT=O) ~63'------------------'-33~32;;;..__ _ _ _ _ _ _ _ _ _ _ _ _ _.....;..,32 ~I 0 _____s_EX_T_N_P_T_E__B_A_S_E<_5_1:_33_>_)_ _ _ _~l_______PC_<4_2_:1_3>_________~ 52-bitVA/ 64KB pages (VA_SIZE=1, PAGE_SIZE=O, REDUCED_PT=O) 63 4241 --S-EXT_N_P_T_E-_B_A_S_E<_5_1-:42_>_)_-....l _ _ _ _ _ _ _ _S_E_XT_(_PC_<_5_1:-16_>_)_ _ _ _ _ _ _ 3 2 ~I 0 __,~ 52-bitVA/ 64KB pages (VA_SIZE=1, PAGE_SIZE=1, REDUCED_PT=1) ~63'------------4~24~1-~~------~-------------.32 0 ~I_ _s_EXT_N_P_T_E__B_A_s_E<_5_1_:4_2>_)_ ___.l_ooo__.l~o_1~I_ooo_oooo __ooo_o_oo_ _,__ _ _ _P_C_<4_9_:2_9>_ _ ____.~ Table 16-20 Instruction VA Format Register (43-Bit VA) Fields Description Field Name Extent Type SEXT(VPfE_BASE<51:33>) 63:33 PC<42: 13> 32:3 Reserved 2:0 Description RAZ Compaq Confidential 5 January 2001 -· Subject To Change Internal Processor Registers 16-21 lbox IPRs Table 16-21 Instruction VA Format Register (52-Bit VA, REDUCED-PT =0) Fields Description Field Name Extent Type SEXT(VP1E_BASE<51 :42>) 63:42 SEXT(PC<51: 16>) 41:3 Reserved 2:0 Description RAZ Table 16-22 Instruction VA Format Register (52-Bit VA, REDUCED-PT=1) Fields Description Field Name Extent Type SEXT(VP1E_BASE<51 :42>) 63:42 Description 41:39 38:37 36:24 PC<49:29> 23:3 Reserved 2:0 RAZ 16.2.18 PALcode Base Address Register - PAL_BASE[tpu] Based on the type of fault, the hardware vectors into the appropriate PALcode handler as an offset from the physical address in PAL_BASE. The specific PALcode entry points and offsets are: Table 16-23 PALcode Base Address Entry Points and Offsets RESERVED PB + xOOO DTBM_DOUBLE PB +xlOO DTBM_DOUBLE_ALT PB + x180 FEN PB +x200 UNALIGN PB + x280 DTBM_SINGLE PB + x300 DFAULT PB + x380 OPCDEC PB + x400 IACV PB + x480 MCHK PB + x500 ITB_MISS PB + x580 ARITH PB + x600 INTERRUPT PB + x680 MT_FPCR PB +x700 IMCHK PB + x780 DTBM_SINGLE_CONS PB + x800 ITB_MISS_CONS PB + x880 Compaq Confidential 16-22 Internal Processor Registers 5 J~1n1.u~ry 2001 ··· Subject To Change lbox IPRs Table 16-23 PALcode Base Address Entry Points and Offsets RESERVED PB + xOOO BAD _JUMP_IVA PB +x900 FAULT_RESET PB +x980 WAKEUP PB +xAOO IP_RESET PB +xA80 RESET PB +xBOO PAL_BASE is also used in the computation of CALL_PAL branches. The 21464 computes the target PC of a CALL_PAL instruction as follows: Bits Contents PC<51:15> PC<l4> PC<l3> PC<l2:12> PC<ll:6> PC<5:2> PC<l> PC<O> PAL_BASE<51: 15> 0 CallPal function<?> CallPal function<5:0> 0 Current PC< 1> 1 The SRM does not actually define the behavior of PAL_BASE<63:52>, but reading anything other than zero feels unwise. Written Readable Index HW_MTPR HW_MFPR, Implicitly by trap to PALmode. OxA8 Figure 16-18 PALcode Base Address Register- PAL_BASE[tpu] Table 16-24 PALcode Base Address Register Fields Description Field Name Extent Type Reserved 63:52 VA<51:15> 51:15 Reserved 14:0 Description MBZ MBZ 16.2.19 PALcode Temp Registers - PAL_TEMP1 [tpu], PAL_TEMP2[tpu] The PAL_TEMP registers are for miscellaneous use by PALcode. The primary intention is not to use these registers for save/restore type sequences within normal PALcode flows, but as infrequently written holders of important state. 
Compaq Confidential 5 January 2001 -~ Subject To Change Internal Processor Registers 16-23 Mbox IPRs The 21464 does not define a specific use for these registers but discussed uses include: • Hold the physical address of a scratch area of memory where this TPU can save and restore values. During the PALcode initialization sequence, each CPUffPU would do a calculation to produce a unique pal_temp_address. On previous processors, this value would have been stored in a PALcode shadow register, but given the lack of shadow registers in the 21464, this might be a good use for the scratch registers. • Use during PALcode flows, where the ability to reliably access memory is questionable. In this case, HW_ST/HW _LD would not be an option so the overhead of synchronizing writes and reads to these registers is reasonable. Written Readable Index HW_MTPR HW _MFPR OxA9, OxAA Figure 16-19 PALcode Temp Registers- PAL_TEMP1[tpu], PAL_TEMP2[tpu] 63 Anything 16.3 Mbox IPRs 16.3.1 Dcache Control Register - DC_CTL The Dcache control register is a chip-wide register controlling Dcache state. There are many open issues relating to the structure of this register. One new bit, when written with a one, will reset the DTB write pointer to zero. Is this the flush bit? Written Read: Index HW_MTPR HW_MFPR Ox48 Figure 16-20 Dcache Control Register- DC_CTL ....___ _ _ FLUSH .___ _ _ _ F_BAD_TPAR ' - - - - - - - - F_BAD_DECC ' - - - - - - - - DCTAG_PAR_EN - - - - - - - DCDAT_PAR_EN Table 16-25 Dcache Control Register Field Descriptions Field Name Extent Type Description Reserved 63:56 MBZ DCDAT_PAR_EN 55 RW,O Dcache data parity error enable DCTAG_PAR_EN 54 RW,O Dcache tag parity enable Compaq Confidentia I 16-24 Internal Processor Registers 5 J(1nu(1ry 2001 -~ Subject To Change Mbox IPRs Table 16-25 Dcache Control Register Field Descriptions (Continued) Field Name Extent Type Description F_BAD_DECC 53 RW,O Force Bad Data ECC. When set, ECC data is not written into the cache along with the block that is loaded by the fill or store. F_BAD_TPAR 52 RW,O Force Bad Tag Parity. When set this bit cause bad tag parity to be put in the Dcache tag array during Dcache fill operations FLUSH 51 RW,O F_HIT 50 RW,O Force Hit. When set, this bit causes all memory space load and store instructions to hit in the Dcache, independent of the Dcache tag address compare. SET_EN 49:48 RW,O Dcache Set Enable. At least one set must be enabled. Reserved 47:0 MBZ 16.3.2 Dcache Status Register- DC_STAT[tpu] The Dcache status,register is a per-TPU read-write register containing information about Dcache parity and ECC errors. The status bits indicate an error when set and must be explicitly written with a 1 to clear. Written Read Set when the parity or ECC error event occurs. HW_MFPR Index Reset Ox49 Unchanged for debug modes. Figure 16-21 Dcache Status Register- DC_STAT[tpu] ....__ _ _ TPERR_PO .____ _ _ _ TPERR_P1 ' - - - - - - - TPERR_P2 .____ _ _ _ _ ECC_ERR_LD .____ _ _ _ _ ECC_ERR_ST .____ _ _ _ _ _ SEO Table 16-26 Dcache Status Register Field Descriptions Field Name Extent Type SEO 63 WlC,O A second ECC error occurred within N cycles of a previous ECC error. 
ECC_ERR_ST 62 WlC,O An ECC error occurred when processing a store ECC_ERR_LD 61 WlC,O An ECC error occurred when processing a load from the Dcache or any fill Description Compaq Confidential 5 January 2001 ··· Subject To Change Internal Processor Registers 16-25 Mbox IPRs Table 16-26 Dcache Status Register Field Descriptions (Continued) Field Name Extent Type Description TPERR_P2 60 WlC,O A Dcache tag probe from pipe 2 resulted in a tag parity error. The error is uncorrectable. TPERR_Pl 59 WlC,O A Dcache tag probe from pipe 1 resulted in a tag parity error. The error is uncorrectable. TPERR_PO 58 WlC,O A Dcache tag probe from pipe 0 resulted in a tag parity error. The error is uncorrectable. Reserved 57:0 RAZ Figure 16-22 OTB Invalidate Address Space Register - DTB_IASN[tpu] 0 63 16.3.3 OTB Invalidate Multiple Register - DTB_IM[tpu] The DTB fuvalidate Multiple register is a write-only pseudo-register. When write instructions to this register retire, all DTB entries matching the criteria specified by the mode bits are invalidated. Written Readable Index HW_MTPR No Ox40 Compaq Confidentia I 16-26 Internal Processor Registers 5 Jc1mJc1ry 2001 ~-Subject To Change Mbox IPRs Figure 16-23 OTB Invalidate Multiple Register - DTB_IM[tpu] 5 4 63 Table 16-27 OTB Invalidate Multiple Register Fields Description Field Name Extent Type Reserved 63:5 MBZ MODE 4:0 WO,O Description Defines the invalidation mode, as follows: Value Mnemonic Description OxOO OxOl IAG IA Ox03 IASM Ox05 IAP OxOD IASN OxlO IAG Ox16 IWRP 16.3.4 OTB Invalidate Single Register - Invalidate all entries independent of group Invalidate all entries that match the current TPUGRP field in M_PCTX. Invalidate all entries with the ASM bit set that also match the current TPUGRP. Invalidate all entries with the ASM bit clear that also match the current TPUGRP Invalidate all non-ASM entries that match the current TPUGRP and ASN fields. Invalidate all entries independent of group and reset the write pointer Reset the write pointer only. Nothing is invalidated DTB_IS[tpu] The DTB Invalidate Single register is a write-only pseudo-register. Write instructions to this register invalidate any DTB entries that would match the virtual page number written, given the current values of the TPUGRP and DTB_ASN registers. Written Readable Index HW_MTPR No Ox42 Figure 16-24 OTB Invalidate Single Register - DTB_IS[tpu] Table 16-28 OTB Invalidate Single Register Fields Description Field Name Extent Type Reserved 63:52 VA<51:13> 51:13 Reserved 12:0 Description MBZ MBZ Compaq Confidential 5 January 2001 -~ Subject To Change Internal Processor Registers 16-27 Mbox IPRs 16.3.5 OTB PTE Array Write Registers - DTB_PTEO[tpu], DTB_PTE1 [tpu] The DTB PTE write register is a write-only register used to write the PTE part of the DTB array. It contains the physical page mapping and protection bits for the array. When a write to it retires, the PTE, along with the DTB_TAG, DTB_ASN and TPUGRP registers, are written to the DTB. PALcode must perform two consecutive writes to DTB_PTE to guarantee both copies of the TB are updated together. The bits in this register are also specified in Section (II-A) 3.6 of the Alpha SRM. 
Written Readable Index HW_MTPR No Ox44 Figure 16-25 OTB PTE Array Write Registers - DTB_PTEO[tpu], DTB_PTE1 [tpu] 2 63 0 PA<44:13> or PA<47:16> UWE-----' SWE----' EWE---KWE---~ URE----~ SRE-------' ERE _ _ _ _ ____, KRE - - - - - - - - ' GH[1:0) - - - - - - - - - ' ASM - - - - - - - - - FOW - - - - - - - - - - - ' FOR---------VALID ---------~ Table 16-29 DTB_PTE Array Write Registers Fields Descriptions Field Name Extent Type Meaning PA<44: 13> or PA<47:16> 63:32 wo,o Reserved 31:16 MBZ UWE 15 wo,o User write enable. When process context is User mode, this bit must be set to write this entry. SWE 14 WO,O Supervisor write enable. When process context is Supervisor mode, this bit must be set to write this entry. EWE 13 WO,O Executive write enable. When process context is Executive mode, this bit must be set to write this entry. KWE 12 WO,O Kernel write enable. When process context is Kernel mode, this bit must be set to write this entry. URE 11 WO,O User read enable. When process context is User mode, this bit must be set to read this entry. The physical page number is PA<44: 13> when in 8K page mode and PA<47: 16> when in 64K page mode. The 21464 cannot address more than 16 TB when in 8K page mode. Compaq Confidential 16-28 Internal Processor Registers 5 Jc1nwiry 2001 -~Subject To Change Mbox IPRs Table 16-29 DTB_PTE Array Write Registers Fields Descriptions Field Name Extent Type Meaning SRE 10 WO,O Supervisor read enable. When process context is Supervisor mode, this bit must be set to read this entry. ERE 9 wo,o Executive read enable. When process context is Executive mode, this bit must be set to read this entry. KRE 8 WO,O Kernel read enable. When process context is Kernel mode, this bit must be set to read this entry. Reserved 7 MBZ GH[l:O] 6:5 WO,O Granularity Hint. ASM 4 wo,o Address space match bit. When set, this PfE matches all address space numbers. Reserved 3 MBZ FOW 2 WO,O Fault-on-write control bit. FOR 1 WO,O Fault-on-read control bit VALID 0 WO,O Valid bit. Not actually written to the TB, but is used to prevent the writer block interlock from being lifted unitl the MTPR instruction is killed. The DTB miss PALcode must include a branch-on-invalid check before the DTB write or the 21464 hangs if an invalid TB entry is written. 16.3.6 OTB Tag Array Write Registers- DTB_TAGO[tpu], DTB_TAG1[tpu] The DTB Tag write register is a write-only register used to write the DTB tag array. It contains the virtual page number of the entry currently being written to the DTB, and will be committed to the DTB array when the corresponding write to DTB_PTE retires. The DTB_ASN and TPUGRP registers are also implicitly included in the data written to the DTB when the write to DTB_PTE retires. PALcode must perform two consecutive writes to DTB_PTE to guarantee both copies of the TB are updated together. Written Readable Index HW_MTPR No Ox46 Figure 16-26 OTB Tag Array Write Registers 63 DTB_TAGO[tpu], DTB_TAG1 [tpu] 52;;..;.51_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _____;,,,,;1312 VA<51:13> Compaq Confidential 5 January 2001 -~ Subject To Change Internal Processor Registers 16-29 Mbox IPRs Table 16-30 OTB Tag Array Write Registers Fields Description Field Name Extent Type Reserved 63:52 VA<51: 13> 51: 13 Reserved 12:0 Description MBZ MBZ 16.3.7 Mbox Control Register- M_CTL(tpu] The Mbox control register was a write-only register in the 21264, but is proposed to be read-write in the 21464. 
M_CTL controls the formatting of the VA_FORM register by storing the virtual page table base, along with control bits for big-endian and 64K page modes. HW_MTPR HW_MFPR Ox00-0x03 Written Read Index The low two index bits allow for selective writing of fields. 00 01 10 11 Write nothing Write VPTE_BASE field only Write SUPERPAGE, BIG_ENDIAN, VA_SIZE and PAGE_SIZE only Write all fields Figure 16-27 Mbox Control Register- M_CTL[tpu] 63 VPTE_BASE<51 :33> BIG_ENDIAN - - DBL_ALT _ ____, VA_SIZE _ _ ____, PAGE_SIZE - - - - - ' Table 16-31 Mbox Control Register Fields Description Field Name Extent Type Reserved 63:52 Description MBZ VPTE_BASE<51 :33> 51 :33 RW,O Virtual Page Table Base. See the VA_FORM register section for details. Reserved 32:28 MBZ REDUCED_PT 27 Compaq Confidential 16-30 Internal Processor Registers 5 January 2001 ··· Subject To Change Mbox IPRs Table 16-31 Mbox Control Register Fields Description Field Name Extent Type SUPERPAGE 26:24 Description RW,O Dstream Super Page mode enables. Any combination of bits can be set at once. Any non-kernel mode access to an enabled superpage region must result in an access violation. Bits Meaning SPE[2] Enables super page mapping when VA[63:50] = Ox3FFE. In this mode VA[47:0] is mapped directly to PA[47:0]. Enables super page mapping when VA[63:41] = Ox7FFFFE. In this mode PA[47:0] =SEXT(VA[40:0]). Enables super page mapping when VA[63:30] = Ox3FFFFFFFE. In this mode PA[47:0] = ZEXT(VA[29:0]). SPE[l] SPE[O] BIG_ENDIAN 23 RW,O When set, the lower bits of the physical address for Dstream accesses are inverted based upon the length of the datatype referenced. Also, the shift amount (Rbv[2:0]) is inverted for EXTxx, INSxx and MSKxx instructions. DBL_ALT 22 RW,O Determines which double miss flow will be vectored to when a hw_ld/ vpte misses in the TB. This bit controls the vectoring for all double TB misses -- I-Stream and D-Stream. 0 Vector to DTB_MISS_DOUBLE 1 Vector to DTB_MISS_DOUBLE_ALT DTB_MISS_DOUBLE and DTB_MISS_DOUBLE_ALT are in used in place of the 21264's DTB_MISS_DOUBLE_3 and DTB_MISS_DOUBLE_4 with the distinction being that the decision is the discretion of PALcode. VA_SIZE 21 RW,O Defines the D-Stream Virtual address size. Controls the VA_FORM register and sign extension checking. VA_SIZE = 0 specifies 43-bit addressing VA_SIZE = 1 specifies 52-bit addressing (invalid if PAGE_SIZE = 0) PAGE_SIZE 20 RW,O Defines the D-Stream page size. Controls the VA_FORM register. PAGE_SIZE = 0 specifies 8KB pages PAGE_SIZE = 1 specifies 64KB pages Reserved 19:0 As follows: Bits Type 19 18 17:3 2:0 x MBZ x MBZ 16.3.8 Mbox Process Mode Register- M_MODE[tpu] The Mbox process mode register specifies the console, current and alternate processor mode. These mode bits shadow the bits in !_MODE and exists on a per-TPU basis. Written Read Index HW_MTPR HW_MFPR Ox08-0xOF Compaq Confidential 5 January 2001 ··· Subject To Change Internal Processor Registers 16-31 Mbox IPRs The low three index bits allow for selective writing of fields. 
000 001 010 100 111 Write nothing Write CURRENT field only Write CONSOLE field only Write ALT field only Write all fields Figure 16-28 Mbox Process Mode Register - M_MODE[tpu] 63 8765432 0 CONSOLE _ ___, CURRENT _ ____, Table 16-32 Mbox Process Mode Register Field Descriptions Field Name Extent Type Reserved 63:8 Description As follows: Bits Type 63:52 51:33 32:28 27:19 18 17:8 MBZ x MBZ x MBZ x ALT 7:6 RW,O The ALT field is encoded as follows: 00-Kernel 01-Executive 10 - Supervisor 11- User CONSOLE 5 RW,O DTB traps in console mode are reported to the trap handler separately from non-console mode traps so they can be vectored to different addresses. CURRENT 4:3 RW,O The CURRENT field is encoded as follows: 00-Kernel 01 - Executive 10 - Supervisor 11-User Reserved 2:0 MBZ 16.3.9 Mbox Process Context register - M_PCTX[tpu] The Mbox process context register is a copy of the Ibox process context register stored locally to the Mbox for implementation convenience. Written Readable Index HW_MTPR HW_MFPR Ox10-0x13 Compaq Confidential 16-32 Internal Processor Registers 5 Janu~1ry 2001 ··· Subject To Change Mbox IPRs The low two index bits allow for selective writing of fields. 00 01 Write nothing Write ASN field only Write TPU_GRP field only Write all fields 10 11 Figure 16-29 Mbox Process Context Register - M_PCTX[tpu] ASN Table 16-33 Mbox Process Context Register Field Descriptions Field Name Extent Type Reserved 63: 18 Description As follows: Bits Type 63:52 51:33 32:28 27:19 18 MBZ x MBZ x MBZ TPU_GRP 17:16 RW,O Thread Group number this TPU belongs to ASN 15:8 RW,O Address space number, should be identical to the value in PCTX controlling the ITB Reserved 7:0 As follows: Bits Type 7:3 2:0 x MBZ 16.3.10 Mbox Memory Management Status Register - M_STAT[tpu] The memory management status register is implicitly written register containing information about the most recent Dstream TB miss or fault in the TPU. The traps that set M_STAT are: UNALIGN DFAULT MCHK DTBM_SINGLE DTBM_SINGLE_CONS Compaq Confidential 5 January 2001 ·-Subject To Change Internal Processor Registers 16-33 Mbox IPRs *** One of the bits should indicate a BAD_VA fault (sign extension check failure)??? Written ·Readable Index Implicitly written when a Dstream fault occurs. HW_MFPR Ox20 Figure 16-30 Mbox Memory Management Status Register- M_STAT[tpu] DCDAT_PERR ---~ DCTAG_PERR _ _ ___. OPCODE - - - - - - - ' FOW - - - - - - - - - ' FOR _ _ _ _ _ _ _ __.. ACV - - - - - - - - - - - - ' WR-----------' Table 16-34 Mbox Memory Management Status Register Field Descriptions Field Name Extent Type Reserved 63:12 RAZ DCDAT_PERR 11 IR,O Set when a Dcache data parity error occurs during the initial tag probe of a load or store instruction. A DFAULT PALmode trap is generated. DCTAG_PERR 10 IR,0 Set when a Dcache tag parity error occurs during the initial tag probe of a load or store instruction. A DFAULT PALmode trap is generated. OPCODE 9:4 IR,O The opcode of the instruction that generated the error. FOW 3 IR,O Set when a fault-on-write error occurs and PfE[FOW] was set FOR 2 IR,O Set when a fault-on-read error occurs and PfE[FOR] was set IR,O Set when an access violation occurs. This includes bad virtual addresses IR,0 Set when an error occurs during a write operation ACV WR 0 Description 16.3.11 Quiesce Timeout Register - QUIESCE_TIMEOUT[tpu] This IPR specifies a limit to the number of CPU cycles that may elapse between issuing the QUIESCE instruction and the watch_flag (see WATCH_PHYS_ADDR) being cleared. 
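Since M_STAT is the register a Dstream fault handler consults first, a small decode sketch is shown here. Field positions are taken from Table 16-34 below; the structure and function names are illustrative and simply model what PALcode would extract after an HW_MFPR from index 0x20.

    /* Hedged software model of an M_STAT read; field positions follow
     * Table 16-34 below. */
    #include <stdint.h>

    struct mstat_info {
        unsigned opcode;        /* bits 9:4, opcode of the faulting instruction */
        int dcache_data_perr;   /* bit 11, Dcache data parity error (DFAULT)    */
        int dcache_tag_perr;    /* bit 10, Dcache tag parity error (DFAULT)     */
        int fault_on_write;     /* bit 3, FOW error with PTE[FOW] set           */
        int fault_on_read;      /* bit 2, FOR error with PTE[FOR] set           */
        int access_violation;   /* bit 1, ACV, includes bad virtual addresses   */
        int is_write;           /* bit 0, error occurred on a write operation   */
    };

    static struct mstat_info decode_mstat(uint64_t m_stat)
    {
        struct mstat_info s;
        s.dcache_data_perr = (m_stat >> 11) & 1;
        s.dcache_tag_perr  = (m_stat >> 10) & 1;
        s.opcode           = (unsigned)((m_stat >> 4) & 0x3F);
        s.fault_on_write   = (m_stat >> 3) & 1;
        s.fault_on_read    = (m_stat >> 2) & 1;
        s.access_violation = (m_stat >> 1) & 1;
        s.is_write         =  m_stat       & 1;
        return s;
    }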
The value in this register is used to load the quiesce timer. Allowing this value to be set per-TPU is necessary to give the 21464 the capability of running virtual machines, i.e., the ability for different TPUs to run different O/S 's simultaneously. Does the counter wrap or saturate and what is the behavior of a zero value? Compaq Confidential 16-34 Internal Processor Registers 5 Jc1m.u1ry 2001 - Subject To Change Mbox IPRs The background for booting document assumes the ability to define an infinite wait condition in the Suspending TPUs section. Disabling the timer should suffice. Written Readable Index HW_MTPR HW_MFPR, Implicitly by Quiesce instruction Ox38 Figure 16-31 Quiesce Timeout Register- QUIESCE_TIMEOUT[tpu] 201"""'"9_ _ _ _ _ _ _ _ 4 63 TIMEOUT WATCH_EN TIMEOUT_EN Table 16-35 Quiesce Timeout Register Field Descriptions Field Name Extent Type Description Reserved 63:20 MBZ TIMEOUT 19:4 RW, Ox280 Reserved 3:2 MBZ WATCH_EN 1 RW,O Enables comparison against the physical address specified by the LDx_ARM instruction. If disabled, a TPU will not awaken when an access to the WATCH_PHYS_ADDR is detected. TIMEOUT_EN 0 RW,O Enables the timeout counter. If disabled, a TPU will not timeout from a quiesce operation. Number of CPU cycles to wait before clearing the watch_flag of a quiesced TPU. What does zero do??? The actual timer value is not currently readable. While the process is quiesced, that is not important, but would the ability to read the timer value when awaken by an interrupt or address match be useful? 16.3.12 Virtual Address Register - VA[tpu] When a Dstream fault occurs, the associated virtual address is stored in the VA register. The VA is not written when an LD_ VPTE gets a DTB miss or Dstream fault. Traps that cause VA to be written are: UNALIGN DFAULT DTBM_SINGLE DTBM_SINGLE_CONS Written Readable Index Implicitly by instruction that caused the miss. HW_MFPR Ox21 Compaq Confidential 5 January 2001 -~Subject To Change Internal Processor Registers 16-35 Mbox IPRs Figure 16-32 Virtual Address Register - VA[tpu] 63 VA<63:0> NOTE: The SRM states that the pre-endian adjusted address (va, not va') is reported for memory management faults. The 21464 stores va' and requires PALcode to adjust back to va where necessary. 16.3.13 Virtual Address Format Register - VA_FORM[tpu] The read-only virtual address format register contains the virtual page table entry address derived from the faulting virtual address stored in the VA register along with the virtual page table base and associated control bits stored in the VA_CTL register. 
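The derivation just described can be made concrete for the simplest case. The sketch below forms the VPTE address for the 43-bit VA / 8KB page format shown in Figure 16-33 and Table 16-36 below; it is a minimal software model, assuming the caller has already read the VPTE_BASE<51:33> field out of M_CTL, and the function name is illustrative.

    /* Hedged sketch of VA_FORM for VA_SIZE=0, PAGE_SIZE=0 (43-bit VA,
     * 8KB pages), per Table 16-36 below. */
    #include <stdint.h>

    static uint64_t va_form_43_8k(uint64_t va, uint64_t vpte_base_51_33)
    {
        /* Sign-extend the 19-bit VPTE_BASE<51:33> field, then place it
         * in bits 63:33 of the result. */
        uint64_t base = (vpte_base_51_33 & 0x40000ull)
                            ? (vpte_base_51_33 | ~0x7FFFFull)
                            : vpte_base_51_33;
        uint64_t hi = base << 33;
        /* VA<42:13> indexes the 8-byte PTEs of the virtual page table. */
        uint64_t idx = (va >> 13) & ((1ull << 30) - 1);
        return hi | (idx << 3);
    }

The 52-bit VA formats differ only in which VA and VPTE_BASE bits are selected and sign-extended.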
Written Readable Index NIA (Derived from address in VA and control bits in M_CTL) HW_MFPR Ox23 Figure 16-33 Virtual Address Format Register - VA_FORM[tpu] 43-bitVA/ 8KB pages (VA_SIZE=O, PAGE_SIZE=O, REDUCED_PT=O) ~63~~~~~~~~~~~~~~~~33~32"--~~~~~~~~~~~~~__.;;..32 0 ~I~~~~S_EX~n_v_PT_E___ BA_S_E_<5_1_:33_>_)~~~~~'~~~~~~-VA_<_42_:_13_>~~~~~~mll 52-bitVA/ 64KB pages (VA_SIZE=1, PAGE_SIZE=O, REDUCED_PT=O) 63 4241 3 2 0 -63~~~~~~~~~~---"'i42-41~.--..--~~~~~----.-~~~~~~~~~~---32 0 ~I~~S-EXT~(V-P-TE-_-B-AS_E_<-51-:4-2>-)~____,,.--~~~~~~~S-E-XT-~-A-<-51-:1-6-~~~~~~~~~mll 52-bit VA/ 64KB pages (V A_SIZE=1, PAGE_SIZE=1, REDUCED_PT=1) ~I~~s_EXT~(V_P_TE___B_As_E_<_51_:4_2>_)~~1~000~~'0_1~l~ooo~ooo~ooo~oo_oo~--'-~~~~V_A_<_49_:2_9>~~~--E11 Table 16-36 Instruction VA Format Register (43-Bit VA) Fields Description Field Name Extent Type SEXT(VPIE_BASE<51 :33>) 63:33 VA<42: 13> 32:3 Reserved 2:0 Description RAZ Table 16-37 Instruction VA Format Register (52-Bit VA, REDUCED-PT=O) Fields Description Field Name Extent Type SEXT(VPIE_BASE<51 :42>) 63:42 SEXT(VA<51:16>) 41:3 Reserved 2:0 Description RAZ Compaq Confidential 16-36 Internal Processor Registers 5 Jc1nw~ry 2001 - Subject To Cf1ange CboxlPRs Table 16-38 Instruction VA Format Register (52-Bit VA, REDUCED-PT =1) Fields Description Field Name Extent Type SEXT(VPIE_BASE<51 :42>) 63:42 Description 41:39 38:37 36:24 VA<49:29> 23:3 Reserved 2:0 RAZ 16.3.14 Watch Physical Address Register - WATCH_PHVS_ADDR[tpu] When a LDx_ARM instruction retires, the physical address specified is loaded into this register and the watch flag is set. If the watch flag is still set when a Quiesce instruction to the TPU retires, the TPU is put to sleep until the flag is cleared. The watch flag is cleared by a memory write to the physical address, an interrupt to the TPU or when the quiesce timer expires. Written Readable Index Implicitly by LDx_ARM instruction No NIA Figure 16-34 Watch Physical Address Register- WATCH_PHYS_ADDR[tpu] 63 phys_addr<47:4> WATCH_FLAG _ ____, Table 16-39 Watch Physical Address Register Fields Description Field Name Extent Type Reserved 63:48 MBZ PHYS_ADDR<47:4> 47:4 Reserved 3:1 WATCH_FLAG 0 Description MBZ 16.4 Cbox IPRs 16.4.1 Hardware Interrupt Clear Register - HW_INT_CLR[tpu] The hardware interrupt clear register is a write-only register used to clear edge-sensitive interrupt requests. I believe this register is moving to the Cbox and will be completely reworked given how the 21364/21464 handle interupts, as opposed to the 21264. Compaq Confidential 5 January 2001 - Subject To Change Internal Processor Registers 16-37 Rbox IPRs Note: The FBTP bit will be move. Written Readable ?? No Figure 16-35 Hardware Interrupt Clear Register - HW_INT_CLR[tpu] 63 CR----- P C - - -..... MCHK_ID _ _ _ __. Table 16-40 Hardware Interrupt Clear Register Fields Description Field Name Extent Type Description Reserved 63:32 CR 31 Clears a corrected read error interrupt request. PC 30:29 Clears a performance counter interrupt request. MCHK_ID 28:27 Clears a Dstream machine check interrupt request. Reserved 26:0 16.5 Rbox IPRs This section describes the Rbox IPRs. 16.5.1 Router Configuration1 (R,W) - R_CFG1 Table 16-41 shows the router configuration register fields. Table 16-41 Router-Configuration1 Register Fields Description Bit Field <31> IRW Value Meaning Comments If set then ignore writes to the Router Table. This bit helps reduce the risk that an errant IPR write will corrupt the Routing table. 
Table 2: Router-Configuration register (Part 2) <30:25> reserved <24:22> DRl<2:0> <21> DRE 0 1 2 3 4 5 6 7 0 3 15 63 255 1023 409 516383 Drain Interval. Indicates how many cycles after the drain interval starts, before the router forces out the starved packet. The drain interval starts once an input-buffer slot becomes available for the starved packet. Enable Drain Mode Compaq Confidential 16-38 Internal Processor Registers 5 J(1m.1(1ry 2001 --Subject To Change Rbox IPRs Table 16-41 Router-Configuration1 Register Fields Description Bit Field Value Meaning Comments <20:18> STI 0 1 2 3 4 0 3 15 63 255 1023 4095 16383 Starvation Interval. Indicates how many cycles the starvation token can last in the header queue before it triggers the starvation mode. When in starvation mode, the router treats all packets in the header queue in front of the token as starved. 5 6 7 <17> STE <16:14> SYF<2:0> <13> SYE <12:10> reserved <9> ADA <8:7> SHB<l:O> 0 1 2 3 0 64 128 256 Size of Packet queue in ticks. A value of zero, and ADA = 0 forces deterministic routing. The other values allow performance experiments. <6:3> BR0<3:0> <0> <1> <2> <3> North South East West Determines which output ports to send the broadcast packet that are in the local input port. <2:1> TRT 0 1 2 3 North South East West Turn Route Type: Selects whether the routing type is North-last, etc. Northlast and South-Last imply XY-routing for the Starvationrecovery routine. East-last and West-last imply YX-routing. <0> ECB 01 Normal Low ECC Bypass: If set enable the low latency ECC checker. (Currently not implemented) Enable Starvation mode SYNCH frequency: Interval = (N+ 1) * 4096 cycles Period = (N+ 1) * (1024) 2 cycles N (range is 0:7) Enable SYNCHs Enable adaptive routing when set. 16.5.2 Router Configuration2 (R, W)- R_CFG2 Table 16-42 shows the Router Configuration2 register fields. Table 16-42 Router-Configuration2 Register Fields Description Bit Field Value Meaning Comments <31 :22> reserved <21:20> TGl 0 1 2 3 0% 33% 67% 100% Toggle rate for Header queue entries 4 through 15. Same asTGO. <19:18> TGO 0 1 2 3 0% 33% 67% 100% Toggle rate for the first four Header-Queue entries. Chooses fifth channel over the Adaptive-route field. Compaq Confidentia I 5 January 2001 ·- Subject To Change Internal Processor Registers 16-39 Rbox IPRs Table 16-42 Router-Configuration2 Register Fields Description Bit Field Value Meaning Comments <17:14> WAL<3:0> <0> <1> <2> <3> NorthO/P SouthO/P EastO/P WestO/P Wall: The output port is a wall. Packets with the Bounce bit set will tum on encountering the wall. <13:12> DNC 0 1 2 3 16 32 64 128 De-allocate NOP Counter: Number of network cycles before the 21464 tries to force a De-allocate NOP on the links. This NOP will dispatch as soon as any packet currently on the link completes. <11> WID 0 1 Narrow Wide Width: Selects the width of the network links. <10:9> FDR<l:O> 0 1 2 3 No force every 4th every 8th every 16th Force Deterministic Route: Force every 4th to 16th cycle to route deterministically. This is a performance tweak. <8> TCB 0 1 normal 2-cycle Two Cycle Bid: The local arbiter does not bid in the cycle after issuing a bid. Setting this bit prevents the local arbiter from bidding in the next two cycles, which allows packets to route in order. <7:6> reserved <5:0> DRM <0> <1> <2> <3> <4> <5> Request Forward Block-Response Victim-Block Non-Block Release Deterministically Route Message class: Setting a bit causes all packets in the selected message class to route deterministically. 
This is an insurance policy in case we find a ships-passing-in-the-night problem 16.5.3 Router Channel {N,S,E,W} Configuration1 (R,W)- R_n_CFG1 There are four such registers - one per network port - called R_N_CFGl, R_S_CFGl, R_E_CFGl, R_W_CFGl. Compaq Confidential 16-40 Internal Processor Registers 5 January 2001 ··· Subject To Change RboxlPRs Table 16-43 Router-{N,S,E,W}-Configuration1 Register Fields Description Bit Field Value Meaning Comments <31:26> reserved <25> <24:21> SPD<3:0> <20> When set, the Output-Port issues Null ticks. NUL 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 1:1 3:2 2:1 5:2 3:1 7:2 4:1 9:2 5:1 11:2 6:1 13:2 7:1 15:2 8:1 reserved This table is still TBD. Need some slower ratios for test purposes. Ignore incoming de-allocate NOP packets. This is a test bit, used to ensure that the router timers expire. IGD <19:18> FEM<l:O> The clock ratio between the interface and the CPU. For instance, a ratio of 3:2 means that there are three CPU clocks for every 2 interface clocks. 0 1 2 3 Normal Error-D Error-C Error-S Force-Error mode: Error-D: force 1-shot, double-bit error Error-C: force a continuous stream of 1-bit errors Error-S: force 1-shot, single-bit error A hidden, 20-bit counter triggers the forced error modes. Once per million GCLK clock cycles, this counter forces the error on a random, outward-bound packet tick. The Router-TCTL IPR defines this counter. A write to this IPR clears the hidden counter, and clears the forceerror conditions. The interval timer, the SYNC timer, and the time-out timer, all share this hidden counter. <17> INI Initialize Mode: Causes the port to go through a clock-forward init on the next fast reset. Also causes the output to send true NOP packets rather than NUL-NOP packets, until the clock initialization sequence completes. The hardware clears this bit at the end of a clock-forward initialization sequence. <16> SYC Run the port input in synchronous mode <15:14> UNI <13> SYE Unload pointer init value (for clock-forward reset) Enable the port to respond to a SYNCH. Compaq Confidential 5 January 2001 -~ Subject To Change Internal Processor Registers 16-41 Rbox IPRs Table 16-43 Router-{N,S,E,W}-Configuration1 Register Fields Description Bit Field <12> FCC Enable the port to check the Forwarded clock. This check logic confirms that the clocks are at the expected rate, and that the clocks are in synchronization. <11> ECC Enable the ECC checking/correction logic. <10> SAE Enable the port to respond to a SW alert <9> HAE Enable the port to respond to a HW alert <8> reserved <7> reserved <6:3> BR0<3:0> <2> OE Output Port Enable: If clear, the router discards any packet destined for this port. The hardware clears this bit when the channel goes down. ICO Input Connected to Output: If set the input port is connected to an output port (i.e., another node). If it is not connected then the hardware disables the port logic to minimize power, noise, etc. IE Input Port Enable: When clear the router ignores any packets on this input port. The hardware clears this bit when the channel goes down. <0> Value <0> <1> <2> <3> Meaning North South East West Comments Broadcast Output port: These bits direct the broadcast packet on the Local Port to the enabled output ports. 16.5.4 Router Channel {N,S,E,W} Configuration2 (R,W)- R_n_CFG2 Table 16-44 shows the Router Channel Configuration2 register fields. 
Table 16-44 Router Channel {N,S,E,W} Configuration2 Register Fields Description Value Meaning Comments Bit Field <31 :9> reserved <8:6> FOF<2:0> Output-FIFO Fullness offset for fifth channel see SOF<2:0> <5:3> TOF<2:0> Output-FIFO Fullness offset for turning path: see SOF<2:0> <2:0> SOF<2:0> Output-FIFO-Fullness offset for straight-through path: This value is added to the actual Output Buffer fullness amount in the pre-decode logic. The pre-decode uses this result to determine which output FIFO is the least heavily used, and it routes new packets to this output port. Com p.aq Confidentia I 16-42 Internal Processor Registers 5 J~1nwiry 2001 -- Subject To Change RboxlPRs 16.5.5 Router Channel {N,S,E,W} Timer1 Configuration (R,W)- R_n_T1CFG There are four such registers - one per network port- called R_N_TlCFG, R_S _Tl CFG, R_E_Tl CFG, R_W _Tl CFG Table 16-45 Router {N,S,E,W} Timer1 Configuration Register Fields Description Bit Field Value Meaning Comments <31:28> reserved <27> WITE Enable the Write-IO timer <26:21> WITV<5:0> Write-IO message-class timer value <20> RITE Enable the Read-IO timer <19:14> RITV<5:0> Read-IO message-class timer value <13> FWTE Enable the Forward timer <12:7> FWTV<5:0> Forward message-class timer value <6> RSTE Enable the Response timer <5:0> RSTV<5:0> Response message-class (both block and non-block ) timer value 16.5.6 Router Channel {N,S,E,W} Timer2 Configuration (R,W)- R_n_T2CFG There are four such registers - one per network port - called R_N_T2CFG, R_S_T2CFG, R_E_T2CFG, R_W _T2CFG Table 16-46 Router {N,S,E,W} Timer2 Configuration Register Fields Description Bit Field Value Meaning Comments <31 :21> reserved <20> FITE Enable the Fan-in timer <19:14> FITV<5:0> Broadcast-Acknowledge Fan-in class timer value <13> FOTE Enable the Fan-out timer <12:7> FOTV<5:0> Broadcast Fan-out class timer value <6> RETE Enable the Request timer <5:0> RETV<5:0> Request message-class timer value Compaq Confidential 5 January 2001 ·- Subject To Change Internal Processor Registers 16-43 Rbox IPRs 16.5.7 Router Channel {N,S,E,W} Error Status (R, W1C)- R_n_ERR There are four such registers - one per network port - called R_N_ERR, R_S_ERR, R_E_ERR, R_W _ERR. Table 16-47 Router {N,S,E,W} Error Status Register Fields Description Bit Field <31:19> reserved <18> FITX Broadcast-Acknowledge Fan-in Trmer expired <17> FOTX Broadcast Fan-out Timer expired <16> WITX Write-IO Timer expired <15> RITX Read-IO Trmer expired <14> FWTX Forward Trmer expired <13> RETX Request Trmer expired <12> RSTX Response Timer expired <11> reserved <10> FCE Forwarded-clock error: To determine the presence of a clock, then clear this bit are read it again (while the forward-clock checking is enabled). <9> DBE Double-bit error <8:2> SYN<6:0> ECC syndrome <1> MSE Multiple, single-bit errors <0> SBE Single-bit error Value Meaning Comments Notes: • The syndrome reflects the first error condition. For example, if the double-bit error bit is set then the syndrome is for the first occurrence of the double-bit error. • The hardware will disable the input and output ports on the failing compass point, for the following errors: Double-bit error Forward-clock error The expiration of any timer. In addition, the hardware will force the adjacent node to shut-down its port by forcing a double-bit error in the first tick of the next packet heading to the adjacent node. 
16.5.8 Router Channel {N,S,E,W} Performance Counter (R, W)- R_n_PERF There are four such registers - one per network port - called R_N_PERF, R_S_PERF, R_E_PERF, R_W _PERF. Compaq Confidential 16-44 Internal Processor Registers 5 Jc1nuc1ry 2001 -· Subject To Change RboxlPRs The PCV counter stops incrementing when it reaches the maximum value (all ones). It also sets an interrupt at this point if the interrupt mask has enabled this interrupt. Table 16-48 Router {N,S,E,W} Performance Counter Register Fields Description Bits Field Value <31:3> PCV<27:0> <2:0> PCC Meaning Comments Counter value Performance Counter Selection: Port usage 0 - increment the count for every outward tick. Port usage 1 - increment the count for every outward packet TBD TBD TBD TBD TBD TBD 0 1 2 3 4 5 6 7 16.5.9 Router 1/0-Port Configuration1 Register {R, W)- R_IO_CFG1 Table 16-49 shows the Router I/O Port Configuration register fields. Table 16-49 Router 1/0-Port Configuration Register Fields Description Bits Field Value <31:27> reserved <26> KCL Keep Clock running: The IO-ASIC may derive its clock from the forwarded clock sent by the 21464. If this bit is set, then keep the forwarded clock running, and instead set the DTN (Drive True-NOP) bit (see above). <25> NUL When set, the Output-Port issues Null ticks. <24:21> SPD<3:0> <20> IGD 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 reserved Meaning 3:2 2:1 5:2 3:1 7:2 4:1 9:2 5:1 11:2 6:1 13:2 7:1 15:2 8:1 reserved Comments The clock ratio between the interface and the CPU. For instance, a ratio of 3 :2 means that there are three CPU clocks for every 2 interface clocks. Ignore incoming de-allocate NOP packets. This is a test bit, used to ensure that the router timers expire. Compaq Confidentia I 5 January 2001 ~· Subject To Change Internal Processor Registers 16-45 Rbox IPRs Table 16-49 Router 1/0-Port Configuration Register Fields Description Bits Field Value Meaning Comments <19:18> FEM<l:O> 0 1 2 3 Normal Error-D Error-C Error-S Force error mode: Error-D: force 1-shot, double-bit error Error-C: force a continuous stream of 1-bit errors Error-S: force 1-shot, single-bit error A hidden, 20-bit counter triggers the forced error modes. Once per million G clocks, this counter forces the error on a random, outward-bound packet tick. The Router TCTL IPR defines this counter. A write to this IPR clears the hhlden counter, and clears the force-error conditions. The interval timer, the SYNC timer, and the time-out timer, all share this hidden counter. <17> DTN Drive True-NOP packet. Software should first assert this bit before pulsing the UNU bit (see previous field). Hardware will then reset the bit. Hardware will set this bit whenever it disables the 1/0 port because of an error condition, and if the KCL (Keep clock - see below) bit is set. If this bit is set then the port will discard any messages that are attempting to travel through the port. <16> UNU(RAZ, RW) Unload Pointer Update (for lock-step synchronous mode). Setting this bit: Causes the unload pointer to initialize to the UNI value and to start counting when the clock-forward initialize sequence begins (i.e., when the NUL-NOP is sent). Causes the DTN bit (see next field) to transition from one to zero <15:14> UNl<l:O> Unload pointer init value (for lock-step synchronous mode). <13> SYE Synchronous input mode enabled (for lock-step) <12> FCC Enable the port to check the Forwarded clock. This check logic confirms that the clocks are at the expected rate, and that the clocks are in synchronization. 
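A fault handler reading R_n_ERR can classify the failure directly from the bit positions in Table 16-47 above. The sketch below is illustrative only (the logging mechanism is a placeholder); it also flags the cases listed in the notes above for which the hardware shuts the failing compass point down.

    /* Hedged sketch: parses an R_n_ERR value per Table 16-47 above. */
    #include <stdint.h>
    #include <stdio.h>

    #define RERR_TIMER_MASK 0x0007F000u  /* bits 18:12, all timer-expired flags */
    #define RERR_FCE        (1u << 10)   /* forwarded-clock error               */
    #define RERR_DBE        (1u << 9)    /* double-bit error                    */
    #define RERR_MSE        (1u << 1)    /* multiple single-bit errors          */
    #define RERR_SBE        (1u << 0)    /* single-bit error                    */

    static void report_rerr(uint32_t err)
    {
        /* The syndrome reflects the first error condition only. */
        unsigned syndrome = (err >> 2) & 0x7F;
        int port_down = (err & (RERR_TIMER_MASK | RERR_FCE | RERR_DBE)) != 0;

        if (err & RERR_DBE)
            printf("double-bit ECC error, syndrome=0x%02x\n", syndrome);
        else if (err & RERR_SBE)
            printf("%s single-bit error(s), syndrome=0x%02x\n",
                   (err & RERR_MSE) ? "multiple" : "one", syndrome);
        if (err & RERR_FCE)
            printf("forwarded-clock error\n");
        if (err & RERR_TIMER_MASK)
            printf("message-class timer expired\n");
        if (port_down)
            printf("hardware will have disabled this compass point\n");
    }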
<11> ECC Enable ECC checking and correcting. <10> reserved <9> HAE <8:3> reserved <2> OE <1> reserved <0> IE Enable the port to respond to a HW alert Output Port Enable. If clear, the router discards any packet destined for this port. The hardware clears this bit when the channel goes down. Input Port Enable. When clear the router ignores any packets on this input port. The hardware clears this bit when the channel goes down. Compaq Confidential 16-46 Internal Processor Registers 5 Janw~ry 2001 - Subject To Change RboxlPRs 16.5.10 Router 1/0-Port Configuration2 Register {R, W) - R_IO_CFG2 Table 16-50 shows the Router 1/0 Port Configuration2 register fields. Table 16-50 Router 1/0-Port Configuration 2 Register Field Description Value Meaning Comments FTG 01 Toggle No toggle Disable toggling of fifth channel in the local arbiter when selecting routing direction. <12> FCW 01 East West Fifth channel is wired in an East or West direction <11> PCS 01 North South Fifth channel is wired in a North or South direction <10:6> FEW<4:0> East-West Coordinates of node at the other end of the fifth channel. <5:1> FNS<4:0> North-South Coordinates of node at the other end of the fifth channel. 0 PIO Bits Field <31:13> reserved <13> 01 IO-ASIC 5th I/O Channel usage. 16.5.11 Router 1/0-Port Buffer Size {R,W)- R_IO_BUFSIZ This register only applies when the 1/0-port is acting as an 1/0 channel. When it is acting as a fifth channel then it ignores this register. The buffer sizes correspond to the size of the input buffers inside the I/O ASIC. (Hence there is no request buffer because the network does not send request packets to the 1/0 ASIC.) Table 16-51 Router 1/0-Port Buffer Size Register Fields Description Value Meaning Comments Bits Field <31:15> reserved <14:12> WIB<2:0> Number of Write-IO buffers in IO-ASIC <11:9> RIB<2:0> Number of Read-IO buffers in IO-ASIC <8:6> FWB<2:0> Number of Forward buffers in IO-ASIC <5:3> NSB<2:0> Number of Non-Block-response buffers in IO-ASIC <2:0> RSB<2:0> Number of Block-response buffers in IO-ASIC Compaq Confidential 5 January 2001 ··· Subject To Change Internal Processor Registers 16-47 Rbox IPRs 16.5.12 Router 1/0-Port Timer1 Configuration {R,W)- R_IO_T1CFG These timers apply to the out-going port. Whenever the timers expire then the 1./0-port starts to discard the packets. These values apply whether the 1/0-port is acting as the I/ 0-channel or the Fifth-channel. However, the fifth channel never transfers Read-IO or Write-I 0 packets, and thus these counters should not be enabled. Table 16-52 RouterUO-Port Timer1 Configuration Register Fields Description Bits Field Value Meaning Comments <31:28> reserved <27> WITE Enable the Write-IO timer <26:21> WITV<5:0> Write-IO message-class timer value <20> RITE Enable the Read-IO timer <19:14> RITV<5:0> Read-IO message-class timer value <13> FWTE Enable the Forward timer <12:7> FWTV<5:0> Forward message-class timer value <6> RSTE Enable the Response timer <5:0> RSTV<5:0> Response message-class (both block and non-block) timer value 16.5.13 Router 1/0-Port Timer2 Configuration {R,W}- R_IO_T2CFG This register only applies when the I/0-port is acting as a fifth channel. When the 1/0port is acting as a I/0-channel then it disables these timers. 
Table 16-53 Router 1/0-Port Timer2 Configuration Register Fields Description Bits Field <31 :7> reserved <6> RETE Enable the Request timer <5:0> RETV<5:0> Request message-class timer value Value Meaning Comments 16.5.14 Router 1/0-Port Error Status (R, W1C)- R_IO_ERR The RETX field only applies to the fifth channel. The RITX and WITX only apply to the I/O channel. Table 16-54 Router VO-Port Error Status Register Fields Description Bits Field Value Meaning Comments <31:17> reserved <16> WITX Write-IO Timer expired (1/0 channel only) <15> RITX Read-IO Ttmer expired (I/O channel only) <14> FWTX Forward Ttmer expired <13> RETX Request Timer expired (Fifth channel only) <12> RSTX Response Timer expired <11> reserved Compaq Confidential 16-48 Internal Processor Registers 5 Jc1nuary 2001 - Subject To Change RboxlPRs Table 16-54 Router 1/0-Port Error Status Register Fields Description Bits Field Value Meaning Comments <10> FCE Forwarded-clock error: To determine the presence of a clock, then clear this bit are read it again (while the forward-clock checking is enabled). <9> DBE Double-bit error <8:2> SYN<6:0> ECC syndrome <1> MSE Multiple, single-bit errors <0> SBE Single-bit error Notes: • The syndrome reflects the first error condition. For example, if the double-bit error bit is set then the syndrome is for the first occurrence of the double-bit error. • The hardware will disable the input and output ports on the failing compass point, for the following errors: Double-bit error Forward-clock error The expiration of any timer. In addition, the hardware will force the adjacent node to shut-down its port by forcing a double-bit error in the first tick of the next packet heading to the adjacent node. 16.5.15 Router 1/0-Port Performance Counter (R, W) - R_IO_PERF The PCV counter stops incrementing when it reaches the maximum value (all ones). It also sets an interrupt at this point if the interrupt mask has enabled this interrupt. Table 16-55 Router 1/0-Port Performance Counter Register Fields Description Bits Field <31:3> PCV<27:0> <2:0> PCC Value Meaning Comments Counter value 0 1 2 3 4 5 6 7 Performance Counter Selection: Port usage 0 - increment the count for every outward tick. Port usage 1 - increment the count for every outward packet TBD TBD TBD TBD TBD TBD 16.5.16 Router Local-Port Error Status Register (R, W1C)- R_LOC_ERR The 21464 does not check the interface between the router and the Scache/memory for errors. The packets travel to and from the Router and the C-box or Z-box without ECC or parity bits. However, the interface ports perform error checking, as follows: The input ports write a reserved-double-bit error pattern into the packet tick on detecting a double-bit error. When this Compaq Confidential 5 January 2001 ··· Subject To Change Internal Processor Registers 16-49 Rbox IPRs packet arrives at the local-output port, the router sends the packet to the C-box with the error signal asserted. Table 16-56 Router 1/0-Port Error Status Register Fields Description Value Bits Field <31 :9> reserved <8> RTP <7: 1> reserved <0> RES Meaning Comments Router-Table parity error Reserved Double-bit error code detected 16.5.17 Router Routing Table Register (R, W) - R_ROUT This table holds the routing information the packet needs to reach the destination processor. There are 532 routing table entries, There is one for each of the 512 nodes, and 20 for each sharing mask bit in the directory mask. These 20 entries define which node the router sends the SharedlnvalBroadcast packets. 
Table 16-57 Router Routing Table Register Fields Description Bits Field <23> PAR <22> reserved <21:20> IND Value Meaning Comments Parity 0 1 2 3 Initial Direction: This field defines which direction the packet will leave the source node if CAD = 0. If CAD= 1 then hardware will ignore this field if it chooses to adaptively route. The Initial direction should always be on the deterministic path (even if CAD = 1). North South East West <19:16> DNS<3:0> North-South (Y) coordinate of the destination <15:12> DEW<3:0> East-West (X) coordinate of the destination <11> REW 0 1 East West Route in the East-West direction <10> RNS 0 1 North South Route in the North-South direction <9> STF 0 1 Cut-thru. Store/Fwd Store-&-Forward. If set then packet destined for an I/O ASIC waits in destination node until complete packet is in the output FIFO. Only required if packet crosses a slow link. <8> BOU Bounce: If set the packet will turn on encountering the wall; otherwise it will pass through the wall. <7> CAD Can-adapt: When set the packet can adapt (i.e., use the A-route paths). Typically, this bit is set. The error-recovery code clears this bit when it needs to prevent a packet from adaptively routing into a faulty region of the network. <6> PME If set, request packet can access memory at this destination node. <5> PIO If set, 1/0-packets can access the IO-ASIC at this destination node. Compaq Confidential 16-50 Internal Processor Registers 5 Jc1nuary 2001 - Subject To Change RboxlPRs Table 16-57 Router Routing Table Register Fields Description Bits Field Value Meaning Comments <4> PIP If set, 1/0-packets can access the IPRs at this destination node. <3> IME If set then the IO-ASIC (on this node) can access memory at this destination node. <2> 110 If set then the IO-ASIC (on this node) can access the IO-ASIC at this destination node (for peer-to-peer transactions). <1> IIP If set then the IO-ASIC (on this node) can access the IPRs at this destination node. <0> VAL Destination is valid for all transfers Occasionally, a torus offers two paths to the destination node from the current node that are equidistant. The routing algorithm should favor both paths equally, because this increases the network performance. The 21464 uses the 21363 approach, which assumes that the software creating the routing table has balanced the equidistant paths amongst the nodes. Thus an individual node will favor a particular direction, but a different node in the network will favor the opposite direction. Hence, overall, the network will exhibit no particular bias. 16.5.18 Router WHOAMI Register (R,W) - R_WHOAMI This contains a 10-bit address defining the nodes address. Table 16-58 WhoAml Register Fields Description Bits Field <31 : 10> reserved <9:5> EW<4:0> East-West (x-axis) coordinate <4:0> NS<4:0> North-South (y-axis) coordinate Value Meaning Comments 16.5.19 Router Overall-Timer-Control Register (R,W)- R_OVER This register specifies the period between increment pulses to the port, fan-in, and fanout timers. Table 16-59 Router Overall-Timer-Control Register Fields Description Bits Field <31:21> TIM Tllllet Value: Software visible portion. 
<20:0> HID Hidden portion: Only the hardware can write to this portion Value Meaning Comments 16.5.20 Router Interrupt Status (R, WIC) - R_INT_STAT TBD: 16.5.21 Router Interrupt Mask (R, W) - R_INT_MASK TBD: Compaq Confidentia I 5 Jam.1ary 2001 -- Subject To Change Internal Processor Registers 16-51 Zbo:xlPRs 16.5.22 Router Interrupt Request (WO) - R_INT_REQ TBD: 16.5.23 Router Interrupt Queue Register (RO) - R_INT_QUE A read of this register reads the head of the interrupt queue. This read is non-destructive - to remove the entry at the head of the queue, software must write an arbitrary value to this register. TBD: 16.5.24 Router Interrupt Queue Add Register (WO) -R_INT_QUEADD This register is typically used by 1/0 devices to post interrupts. A write to this register attempts to add an interrupt identifier to the interrupt queue (via an write-IO command). If the queue is full then the 21464 discards the interrupt identifier and returns a WrIONAck packet. Otherwise, the write succeeds and the 21464 returns a WrIOAck packet. The interrupt queue is two-entries deep. TBD: 16.5.25 Router Interval Timer Register (R, W) - R_INTER_TIM TBD: Do we need this? Table 16-60 Router Overall-Timer-Control Register Fields Description Bits Field Value <31 :8> reserved <7:0> ITV Meaning Comments Interval Tllller value 16.5.26 Router Scratch Register 1 (R,W) - R_SCRATCH1 TBD: Do we need this? 16.5.27 Router Scratch Register 2 (R,W) - R_SCRATCH2 TBD: Do we need this? 16.6 Zbox IPRs This section describes the internal processor registers that control Zbox functions. These registers are duplicated for each of the two memory controllers. 16.6.1 DRAM Error Status 1 - ZBOXn_DRAM_ERR_STATUS1 There are two DRAM error status 1 registers; ZBOXO_DRAM_ERR_STATUS 1 and ZBOXl_DRAM_ERR_STATUS 1. Figure 16-36 shows the DRAM error status 1 register. Figure 16-36 DRAM Error Status 1 Compaq Confidential 16-52 Internal Processor Registers 5 Jc1nwiry 2001 ·- Subject To Change ZboxlPRs 2322 1413 1110 s 4 DAT_ERRSYN0[8:0]--------------~ DAT_ERRSYN1[8:0]-------------------~ Rese~ed----------------------~ DIR_ERRSYN[5:0]------------------------~ TCLOCK_CHAN[4:0]---------------------------~ LK99-0093A Table 16-61 describes the DRAM error status 1 register fields. Table 16-61 DRAM Error Status 1 Fields Description Name Extent Type Description Reserved [63:32] RO, MBZ DAT_ERRSYN0[8:0] [31 :23] RWRC ECC syndrome for octaword 0 - valid only when RAID, RMAP, SGL, DBL, GE3, or PAR set DAT_ERRSYN1[8:0] [22:14] RWRC ECC syndrome for octaword I -valid only when RAID, RMAP, SGL, DBL, GE3, or PAR set Reserved [13:11] MBZ, WAC DIR_ERRSYN[5 :0] [10:5] RWRC Directory syndrome for single-bit ECC error TCLOCK_CHAN[4:0] [4:0] RWAC Bit mask of which channels had tclock errors - only valid when TCLK error bit set Any write to DRAM_ERR_STATUSl forces a load of TCLOCK_CHAN. This should clear it if no errors are occurring. See ???? , which describes error syndromes while a channel is being remapped. 16.6.2 DRAM Error Status 2 - ZBOXn_DRAM_ERR_STATUS2 There are two DRAM error status 2 registers; ZBOXO_DRAM_ERR_STATUS2 and ZBOX1_DRAM_ERR_STATUS2. Figure 16-37 shows the DRAM error status 2 register. 
Figure 16-37 DRAM Error Status 2 Compaq Confidentia I 5 January 2001 -~ Subject To Change Internal Processor Registers 16-53 ZboxlPRs DAT_ERRSYN2[8:0]---------------~ DAT_ERRSYN3[8:0]---------------------' Rese~ed----------------------~ TEMPCAL_DEV[4:0]------------------------~ T E M P C A L _ C H A N [ 4 : 0 J - - - - - - - - - - - - - - - - - - - - - - - - - - - L K - - '-0o A 99 94 Table 16-62 describes the DRAM error status 2 register fields. Table 16-62 DRAM Error Status 2 Fields Description Name Extent Type Description Reserved [63:32] RO, MBZ DAT_ERRSYN2[8:0] [31 :23] RWAC ECC syndrome for octaword 2 - valid only when RAID, RMAP, SGL, DBL, GE3, or PAR set DAT_ERRSYN3 [8:0] [22:14] RWAC ECC syndrome for octaword 3 - valid only when RAID, RMAP, SGL, DBL, GE3, or PAR set Reserved [13:10] MBZ, WAC TEMPCAL_DEV[4:0] [ 9:5] RWAC Identifies which device had a temperature calibration error valid only when TCALERR is set TEMPCAL_CHAN[ 4:0] [ 4:0] RWAC A bit mask of those channels that had temperature calibration errors - valid only when TCALERR is set Any write to DRAM_ERR_STATUS2 register will: • Force a reload of the syndrome registers DAT_ERRSYNO, DAT_ERRSYNl, DAT_ERRSYN2, DAT_ERRSYN3, and DIR_ERRSYN (if ECC_COR_ENABLED = 0, the syndromes are defaulted to 0). • Force a write of TEMPCAL_CHAN (if any read has been performed after reset, this will be some bits of the data read, which will be the same on replicated systems). Writer note: Unclear- needs updating/augmenting. • Clear TEMPCAL_DEV See ????? , which describes error syndromes while a channel is being remapped. 16.6.3 DRAM Error Status 3 - ZBOXn_DRAM_ERR_STATUS3 There are two DRAM error status 3 registers; ZBOXO_DRAM_ERR_STATUS3 and ZBOX1_DRAM_ERR_STATUS3. Figure 16-38 shows the DRAM error status 3 register. Compaq Confidential 16-54 Internal Processor Registers 5 Jc1nuc1ry 2001 m Subject To Change ZboxlPRs Figure 16-38 DRAM Error Status 3 ERR_STATUS[15:0]--------------------------~ LK99-0095A Compaq Confidential 5 January 2001 ··· Subject To Change Internal Processor Registers 16-55 ZboxlPRs Table 16-63 describes the DRAM error status 3 register fields. Table 16-63 DRAM Error Status 3 Register Fields Description Name Extent Type Reserved [63:16] RO, MBZ ERR_STATUS[15:0] [15: 0] Description RWlC Bitmask of DRAM error conditions: Name Bit Meaning When Set SWP [15] An error occurred during sweep mode read SEO [14] A second uncorectable error occurred for which no physical address was saved MEO [13] A second correctable error occurred for which no Phys Addr. was saved Reserved [12] Reserved [11] OLCK [10] A DLL had an out-of-lock condition TJME [9] A DIFf timeout occurred TCAL [8] Some channel had a over temperature fault TCLK [7] Some channel had a clock fault D21 [6] Directory[21] was read as 1 DBL [5] A double ECC error was detected on a read GE3 [4] Three or more single ECC errors were detected on a read MAPF [3] A raid-remap occurred, and no unique best remapping was found RAID [2] A raid-remap occurred, and a remapping was selected SGL [1] One or two single bit ECC errors were detected on a read PAR [0] One or more parity errors were detected on a read See ????? , which describes error syndromes while a channel is being remapped. 16.6.4 DRAM Error Control - ZBOXn_DRAM_ERROR_CTL There are two DRAM error control registers; ZBOXO_DRAM_ERROR_CTL and ZBOXl_DRAM_ERROR_CTL. Figure 16-39 shows the DRAM error control register. 
Compaq Confidential 16-56 Internal Processor Registers 5 Janu,1ry 2001 - Subject To Cfumge ZboxlPRs Figure 16-39 DRAM Error Control FRC_LOCAL--------------~ ECC_COR_ENABLED-----------------' FRC_WTERR[2:0]---------------~ SET_DIR21 - - - - - - - - - - - - - - - - - - - - - ' RAID_ON-----------------~ RAID_MAP[4:0]------------------~ Rese~ed----------------------~ ERR_INT_ENAB[10:0]---------------------------~ LK99-0096A Table 16-64 describes the DRAM error control register fields. Table 16-64 DRAM Error Control Register Fields Description Name Extent Type Description Reserved [63:32] RW, MBZ FIFTH_CH_ENA [31] RW RAID channel has power and is enabled. FRC_LOCAL [30] RW If set, forces directory data being sent to the DIFT to be read as zero (local). Directory data that is stored into the DRAM_SWEEP_DIR register when SWEEP_ON is set is not affected. No directory ECC errors are reported when FRC_LOCAL is set. ECC_COR_ENABLED [29] RW Used to disable ECC correction of fill data to fill buffer. If clear, data is not corrected, errors are not reported, and syndromes are forced to zero. FRC_WTERR[2:0] RW If the address matches the value of Zbox force-error address [28:26] register, then do one of the following, depending on the value in this field:: SET_DIR_21 [25] RW Bits Meaning 000 001 010 011 100 101 110 111 Do nothing Substitute victim_data[27:0] for Dir[21,E5:0,20:0] Force COL_ADR[O] to 0 on channel 4 only Force COL_ADR[O] to 0 on all channels Force COL_ADR[O] to 0 on channel 0 only Force COL_ADR[O] to 0 on channel 1 only Force COL_ADR[O] to 0 on channel 2 only Force COL_ADR[O] to 0 on channel 3 only If set, forces the spare directory bit to be set based on address-match when matched block is written to memory (Section 6.7.22). RAID_ON must be set if this function is enabled. Compaq Confidential 5 January 2001 --· Subject To Change Internal Processor Registers 16-57 Zbox:IPRs Table 16-64 DRAM Error Control Register Fields Description (Continued) Name Extent Type Description RAID_ON [24] RW Used to disable byte writes - If set: 1. FIFTH_CH_ENA must be set 2. Directory byte writes are disabled RAID _MAP[ 4:0] [23: 19] RW Indicates which channel should be remapped (one-hot encoded). There are only eight legal encodings for the combination of RAID_ON, ECC_COR_ENABLED, and RAID_MAP: ECC_COR_ RAID_ON ENABLED RAID_MAP 1 1 1 [18: 11] RO, MBZ ERR_INT_ENAB[lO:O] [10:0] RW Reserved 1 1 1 1 1 1 1 1 0 0 1 0 00000 Use raid channel, nothing mapped out yet 10000 Raid channel exists, some channel mapped out (used for chanel 4 observation port). 01000 Raid channel exists, some channel mapped out. 00100 Raid channel exists, some channel mapped out. 00010 Raid channel exists, some channel mapped out. 00001 Raid channel exists, some channel mapped out. 10000 No raid channel, ECC enabled. 10000 No raid channel, no ECC ch/corr. If set, enables appropriate interrupt to be generated when the error status bit in the corresponding bit position of ERR_STATUS[lO:O] is being set. Setting the enable after the ERR_STATUS bit is already set does not cause an interrupt to be generated. 16.6.5 DRAM Timing Control 1 - ZBOXn_DRAM_TIMING_CTL 1 There are two DRAM timing control 1 registers; ZB OXO_D RAM_TIMING_CTLl and ZBOXl_DRAM_TIMING_CTLl. Figure 16-40 shows the DRAM timing control 1 register. 
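Error-handling firmware typically logs the ERR_STATUS bits of Table 16-63 above and then acknowledges them with a write-one-to-clear store. The sketch below names the bits and returns the W1C mask; the function name and the logging are illustrative, and the actual register store would be an implementation-specific IPR access.

    /* Hedged sketch: names the ERR_STATUS bits of Table 16-63 above and
     * computes the RW1C acknowledge mask. */
    #include <stdint.h>
    #include <stdio.h>

    static const char *const zbox_err_name[16] = {
        "PAR",  "SGL",  "RAID", "MAPF", "GE3", "DBL",  "D21", "TCLK",
        "TCAL", "TIME", "OLCK", "rsvd", "rsvd", "MEO", "SEO", "SWP"
    };

    /* Returns the value to write back to ERR_STATUS3 (RW1C: writing a 1
     * clears only the bits that were set when the handler sampled them). */
    static uint64_t log_and_ack_dram_errors(uint64_t err_status3)
    {
        for (int bit = 0; bit < 16; bit++)
            if (err_status3 & (1ull << bit))
                printf("ZBox DRAM error: %s\n", zbox_err_name[bit]);
        return err_status3 & 0xFFFFull;
    }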
Compaq Confidentia I 16-58 Internal Processor Registers 5 Jc1f'IU«iry 2001 - Subject To Change ZboxlPRs Figure 16-40 DRAM Timing Control 1 302926272625 63 2322 191617161514 1110 9 4 3 ReseNed------~ ROW_STAG_SEL[1:0]--------------------' TCWD_TCLK_OFF[1 : O ] - - - - - - - - - - - - - - - - - - - - - ' TCWD_TCLK_WIDTH[2:0]---------------------' TCWD_GCLK_WIDTH[3:0]------------------------' CLOCK_RANGE[1 :O]--------------------~ SYNC_LD_UNLD--------------------~ CLOCK_ENABLED[O]---------------------~ CLOCK_RATI0[3:0]----------------------~ CLOCK_RATIO_HALF-----------------------~ TCAC_TCLK_SEL[S:O]-------------------------------' TCAC_ADJ_SEL[3:0]----------------------------~ LK99-0097A Table 16-65 describes the DRAM timing control 1 register fields. Table 16-65 DRAM Timing Control 1 Fields Description Name Extent Type Description Reserved [63:30] RW, MBZ ROW_STAG_SEL[l:O] [29:28] RW Row stagger select for refresh (0, 1, 2 or 4 TLCKs). The refresh operations to the ROW bus can be staggered from channel to channel by 0 (no stagger), 1, 2 or 4 TCLKs, controlled by ROW _STAG_SEL[l:O], encoded as follows: Value Meaning 0 No stagger 1 TCLK stagger 2 TCLKs stagger 4 TCLKs stagger 1 2 3 TCWD_ TCLK_OFF[l:O] [27:26] TCWD_TCLK_WIDTH[2:0] [25:23] RW RW Signal tcwd_off_a_h adjusts for tCWD less than 6 by starting write_sent pulse at an early TCLK. Values are as follows: RDRAM tCWD TCWD_TCLK_OFF ~6 0 5 1 4 2 See tableunderTCWD_GCLK_WIDTH[3:0] fortCWD = 4, 5 or 6 values. For tCWD = 7, use the value from the table +1. Compaq Confidential 5 January 2001 -- Subject To Change Internal Processor Registers 16-59 ZboxlPRs Table 16-65 DRAM Timing Control 1 Fields Description (Continued) Name Extent TCWD_GCLK_WIDTH[3:0] [22:19] Type Description RW Use the following table for any tCWD value of 4 ... 7: . Clock Ratio TCWD_TCLK_WIDTH[2:0] TCWD_GCLK_WIDTH[3:0] 2 3 0 2 2.5 3 4 3 3.5 3 6 4 5 0 4.5 1 5 2 5 5 5.5 5 3 4 6 5 6.5 5 5 7 5 6 7.5 5 7 8 6 0 16 6 8 For RDRAM tCWD < 6, program TCWD_TCLK_WIDTH and TCWD _ GCLK_WIDTH as above, and adjust TCWD_ TCLK_OFF. For tCWD = 7, add 1 to the TCWD_TCLK_WIDTH value above. CLOCK_RANGE[l:O] [18:17] RW Tell DLL min/max elk freq range (encoding TBF). SYNC_LD_UNLD [16] RW Synchronize silos. CLOCK_ENABLED[O] [15] RW,O Rambus clocks are not enabled until this bit is set. CLOCK_RATI0[3:0] [14:11] RW This is the GCLK to TCLK ratio according to this table: Clock_Ratio[3:0] GCLK:TCLK Ratio 0 1 2* 3* 4* 5* 6* 7* 8 9 ... 15 16:1 Illegal 2:1 3:1 4:1 5:1 6:1 7:1 8:1 Illegal CLOCK_RATIO_HALF [10] RW This adds 0.5 to the GCLK to TCLK ratio if set. This bit may only bit set for the ratios marked with * under CLOCK_RATI0[3 :0]. TCAC_TCLK_SEL[5:0] [9:4] RW Selects TCAC RDRAM delay parameter (COL= RD-7Data). Set this value according to table under TCAC_ADJ_SEL[3:0]. Compaq Confidential 16-60 Internal Processor Registers 5 Jc1nwtry 2001 --· Subject To Change ZboxlPRs Table 16-65 DRAM Timing Control 1 Fields Description (Continued) Name Extent Type Description TCAC_ADJ_SEL[3:0] [3:0] RW Used for fine (GCLK) adjustment of the tCAC parameter. The supported range is 0 ... 15. This is used to: • Compensate for the runway depth •Center the read-strobe in the clock-forward silo data-valid window The following table lists base values (no delay compensation). Baseline CSR values for TCAC (RDRAM tCAC=8 to 0 fine adjustment). For RDRAM tCAC>8, add offset to TCAC_TCLK_SEL baseline value (support for RDRAM tCAC from 7 to 39). 
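Because only eight combinations of RAID_ON, ECC_COR_ENABLED, and RAID_MAP are legal (see Table 16-64 above), initialization firmware would normally validate a proposed setting before writing the register. The check below is a hedged sketch of that table plus the rule that RAID_ON requires FIFTH_CH_ENA; the function name and argument names are illustrative.

    /* Hedged sketch of the legal RAID/ECC encodings from Table 16-64 above. */
    #include <stdbool.h>

    static bool raid_config_is_legal(bool fifth_ch_ena, bool raid_on,
                                     bool ecc_cor_enabled, unsigned raid_map)
    {
        bool one_hot = raid_map != 0 && raid_map < 0x20 &&
                       (raid_map & (raid_map - 1)) == 0;

        if (raid_on)
            /* RAID channel present: ECC must be on; either nothing is mapped
             * out yet (00000) or exactly one channel is mapped out (one-hot). */
            return fifth_ch_ena && ecc_cor_enabled && (raid_map == 0 || one_hot);

        /* No RAID channel: RAID_MAP must read 10000 whether ECC is on or off. */
        return raid_map == 0x10;
    }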
For fine adjustment, add needed cycles to TCAC_ADJ_SEL baseline value Clock Ratio TCAC_TCLK_SEL[S:O] TCAC_ADJ_SEL[3:0] 2 9 0 2.5 9 2 3 3.5 4 9 9 2 2 9 9 2 4.5 5 5.5 6 6.5 7 7.5 8 16 9 2 2 9 9 2 2 9 2 A 9 9 9 6 D 2 7 16.6.6 DRAM Timing Control 2 - ZBOXn_DRAM_TIMING_CTL2 There are two DRAM timing control 2 registers; ZBOXO_DRAM_TIMING_CTL2 and ZBOX1_DRAM_TIMING_CTL2. Figure 16-41 shows the DRAM timing control 2 register. Compaq Confidential 5 January 2001 ·- Subject To Change Internal Processor Registers 16-61 ZboxlPRs Figure 16-41 DRAM Timing Control 2 4 3 TRAS_OFF[5:0]-----------------' Reserved-----------------~ TRP_OFF[4:0]------------------~ Reserved--------------------~ TPP_ O F F [ 3 : 0 ] - - - - - - - - - - - - - - - - - - - - - - - - - - ' Reserved----------------------~ TRR_OFF[3:0]---------------------------' TRCD_OFF[4:0]--------------------------~ TRDP_OFF[3:0]----------------------------~ LK99-0098A Table 16-66 describes the DRAM timing control 2 register fields. Table 16-66 DRAM Timing Control 2 Fields Description Name Extent Type Description Reserved [63:31] RW,MBZ TRAS_OFF[5:0] [30:25] RW Reserved [24] TRP_OFF[4:0] [23:19] RW Reserved [18] TPP_OFF[3:0] [17:14] RW Reserved [13] RW,MBZ TRR_OFF[3:0] [12:9] RW Used to determine tRR (RAS-RAS to same device) TRR_OFF = Rambus tRR TRCD_OFF[4:0] [8:4] RW Used to determine tRCD (RAS-CAS delay) TRP_OFF= Rambus tRP TRDP_OFF[3:0] [3:0] RW Used to determine tRDP (CAS=RD-PRE delay) TRDP_OFF= Rambus tRDP Used to determine tRAS (RAS-PRE) TRAS_OFF = Rambus tRAS RW,MBZ Used to determine tRP (PRE-RAS) TRP_OFF= Rambus tRP RW,MBZ Used to determine tPP (PRE-PRE to same device) TPP_OFF = Rambus tPP 16.6.7 DRAM Timing Control 3- ZBOXn_DRAM_TIMING_CTL3 There are two DRAM timing control 3 registers; ZB OXO_DRAM_TIMING_CTL3 and ZBOXl_DRAM_TIMING_CTL3. Compaq Confidential 16-62 Internal Processor Registers 5 Jam.u~ry 2001 - Subject To Cfumge ZboxlPRs Z_SLT attempts to gang commands of like type (rd or wr) in consecutive COLC packets. To prevent locking out the command of unlike type, there is logic which monitors the number of pending and consecutively slotted transactions. Based on thresholds programmed by means of CSRs, the slotter switches to the other command type. Figure 16-42 shows the DRAM timing control 3 register. Figure 16-42 DRAM Timing Control 3 63 3231 2827 2423 2019 1413 9 8 7 4 3 Reserved-----~ RD_STRV_MAX_WR[3:0]-----------------' WR_STRV_MIN_WR[3:0]-------------------' WR_STRV_MAX_RD[3:0]---------------------' RD_WR_SPC[5:0]-----------------------' WR_RD_SPC[4:0]-----------------------~ Reserved---------------------------' TRTP_ O F F [ 3 : 0 ] - - - - - - - - - - - - - - - - - - - - - - - - - - - - - ' TRTR[3:0]----------------------------~ LK99-0099A Compaq Confidential 5 January 2001 ··· Subject To Change Internal Processor Registers 16-63 ZboxlPRs Table 16-67 describes the DRAM timing control 3 register fields. Table 16-67 DRAM Timing Control 3 Fields Description Description Name Extent Type Reserved [63:32] RO, MBZ RD _STRV_MAX_WR [3:0] [31:28] RW This is used by the read starvation logic. It is the maximum number of consecutive writes that may be slotted to the COLC bus while a read is pending in the ZRQ-CSQ. Supported range is 0 ... 15. This is the maximum number of consecutively-slotted writes that are tolerated before considering reads to be starved. The additional criterion is that there must be at least one (not programmable) read pending in the ZRQ. The CSR is programmed with the desired maximum - 1. Legal values are 0 ... 
15, yielding 1 ... 16 maximum number of consecutively slotted writes while a read is pending. WR_STRV_MIN_WR [3:0] [27:24] RW This is the minimum number of writes that must be pending before interrupting a stream of reads. Used to enforce the Alpha architecture write-timeliness requirement. Supported range is 0 ... 7, yeilding a pending write count of 1 ... 8. To disable interrupting the read stream, use a value of 8 in this field. Values in the range of 9 ... 15 can result in UNPREDICTABLE operation. This is the minimum number of writes that must be pending in the ZRQ before considering writes to be starved by reads. WR_STRV_MAX_RD [3:0] [23:20] RW This is the maximum number of consecutive reads allowed before attempting to satisfy the write timeliness clause. This is the maximum number of consecutively-slotted reads that will be tolerated before considering writes to be starved. The additional criterion is that there must be at least z_csrN-7wr_strv_rnin_wr_a_h + 1 writes pending in the ZRQ. The CSR is programmed with the desired maximum- 1. Legal values are 0 ... 15, yielding 1 ... 16 maximum number of consecutively slotted reads while the minimum number of writes are pending. RD_WR_SPC[5:0] [19: 14] RW This is the number of TCLKs of dead time that must be inserted between COLC=RD and COLC=WR packets. WR_RD_SPC[4:0] [13:9] RW This is the number of TCLKs of dead time that must be inserted between COLC=WR and COLC=RD packets. Reserved [8] RW, MBZ TRTP_OFF[3:0] [7:4] RW Used to determine tRTP (COLC=RET-ROWR=PRER) TRTP_OFF= Rambus tRTP TRTR[3:0] [3:0] RW Specifies tRTR value (COLC=WR-COLC=RET) or COLC=WR-COLC=BMSK) 16.6.7 .1 Calculating Read to Write and Write to Read Spacing There are CSRs to control the number of "gap" cycles to inject between adjacent read and write transaction packet pairs. These gap cycles are required to avoid driver overlap on the Rambus data lines. Compaq Confidential 16-64 Internal Processor Registers 5 Jc1nw~ry 2001 -- Subject To Change ZboxlPRs In an ideal (zero skew, zero CFM-7CTM delay) arrangement, the CSRs can be programmed such that the last data cycle of the first transaction can be followed immediately (the next cycle) by the first data cycle of the second transaction. 16.6.7 .2 Terminology Timing parameter tCAC is the CAS access delay, and specifies the number of TCLK cycles between the end of a CAS=READ packet and its corresponding data cycles. Direct RD RAMs support tCAC values of 7 to 12 cycles. The ZBox supports a higher tCAC limit to allow for repeater chips, which add approximately 30nS of delay. Timing parameter tCWD is the CAS write delay, and specifies the number of TCLK cycles between the end of a CAS=WRITE packet and its corresponding data cycles. The Direct RD RAM tCWD parameter is defined as 4 + tCLS. tCLS is a RD RAM core parameter and, for example, has the value of 2 for the 256k x 18 x l 6d device, yielding tCWD = 6. tCLS is a 2 bit field, so tCWD can vary from 4 to 7. 16.6.7 .3 Ideal Rambus In an ideal arrangement, the minimum read to write spacing that must be injected (expressed in TCLKs) is tCAC - tCWD. The formula for calculating the (ideal) read to write spacing CSR in the ZBox, namely, RD_WR_SPC[5:0] is: RD_WR_SPC {IDEAL) = {tCAC-tCWD) + tPACKET similarly for wr_rd spacing: WR_RD_SPC {IDEAL) = {tCWD-tCAC} + tPACKET Note: Because tCWD is typically less than tCAC, a negative result should be that a tPKT value is placed in the CSR (where tPKT = 4 TCLKs). 
16.6.7 .4 Non-Ideal Rambus Skews and delays in actual (non-ideal) designs must be accounted for in determining the values of the spacing CSRs. Thus, a read to write transition might need to have further dead cycles injected (for example, to account for delay that could make a read's data collide with a subsequent write's data if IDEAL values are programmed in the CS Rs). System designers must determine these additional delays, round up to the nearest TCLK, and add them to the ideal calculated values. These delays are called: tRES tRead Extra Spacing tWES tWrite Extra Spacing Furthermore, when the 21464 is issuing write-data to the channel, there must be 2 TCLKs of dead space before a read transaction can supply its data, to allow ringing on the channel to settle, and not disrupt read data arriving at the 21464. Thus, if tCAC tCWD < 2, then tWRA (Write Ringing Avoidance) cycles must be added to the write to read spacing to enforce the 2 TCLK minimum. The CSR equations now become: RD_WR_SPC[5:0] = {tCAC-tCWD} + tRES Compaq Confidential 5 January 2001 ~·Subject To Change Internal Processor Registers 16-65 ZboxlPRs WR_RD_SPC[4:0] = (tCWD-tCAC} + tWES + tWRA (If negative result, set CSR to 4) 16.6.8 DRAM Refresh Control - ZBOXn_DRAM_REFR_CTL There are two DRAM refresh control registers; ZBOXO_DRAM_REFR_CTL and ZBOXl_DRAM_REFR_CTL. Figure 16-43 shows the DRAM refresh control register. Figure 16-43 DRAM Refresh Control 1716 1312 FRC_PRE------------___, Reserved-----------------' FORCE_NOCOP-----------------' ENA_PREC--------------------' ENA_PREX-------------------' DRAIN_WRITE_CTL[1:0]-------------------' REFBIT_BNK[2:0]---------------------' REF_ B U R S T [ 3 : 0 ] - - - - - - - - - - - - - - - - - - - - - - - ' REF_INT[12:0]--------------------------' LK99-0102A Compaq Confidential 16-66 Internal Processor Registers 5 Jc1mJc1ry 2001 - Subject To Change ZboxlPRs Table 16-68 describes the DRAM refresh control register fields. Table 16-68 DRAM Refresh Control Fields Description Name Extent Type Reserved [63:32] RO, MBZ FRC_PRE [31] RW Description When set, disables page-hit logic in memory controller, forcing full PRE-RAS-CAS for every access. Set to 1 on cold or fast reset. Firmware must clear this bit to enable use of the ZBox page table. Reserved [30:25] RW, MBZ FORCE_NOCOP [24] RW When set, forces nocop to have higher priority than READS during retire slot. ENA_PREC [23] RW Enables the slotter to use a COLC_PREC precharge packet. ENA_PREX [22] RW Enables the slotter to use a COLX-PREX precharge packet. DRAIN_WRITE_CTL[l:O] [21:20] RW Controls when to force write drains. Encoded as: REFBIT_BNK[2:0] [19:17] RW Bits Meaning when set 00 01 10 11 Drain write timer disabled, and reset to 0 Drain writes every 64 TCLKs Drain writes every 128 TCLKs Reserved A 3-bit mask that corresponds to the BNK[5:3] address bits that are to be ignored during refresh (REFA/REFP) in support of multi-bank refresh. REFBIT_BNK[2:0] BNK[5:3] 000 100 110 111 No mask Mask BNK[5] Mask BNK[5:4] Mask BNK[5 :3] REF_BURST[3:0] [16:13] RW The number of refresh commands in a burst. The number is encoded as REF_BURST[3:0] + 1 REF_INT[12:0] [12:0] RW The number of Tpkts (each= 4 Rambus clocks) to wait between refresh intervals. The number of refreshes that will be serviced within a given TREF_INT interval is determined by REF_BURST as described in text following this table. 
The refresh interval (expressed in TCLKs) is programmed by using REF_INT[12:0]. The formula is:

REF_INT[12:0] = MIN(uREF_INT, uRAS_INT) - 1

Where:

• uREF_INT is the "micro" refresh interval
• uRAS_INT is the constraint that guarantees all banks are precharged (due to refresh operations) such that the RDRAM tRAS,MAX parameter is satisfied
• uREF_INT = INT((0.25 * tREF_T * nBURST) / (2**(b+r)))
• uRAS_INT = INT((0.25 * tRASMAX_T * nBURST) / (2**b))
• tREF_T is the RDRAM refresh interval expressed as a number of TCLKs
• b = number of refresh bank bits (may not be equal to the number of bank-address bits due to multibank refresh)
• r = number of row address bits
• nBURST = number of refresh operations per interval. This should be set to the number of REFP-REFA transactions that can be issued within the tRAS,MIN interval. This reduces the overhead of refresh activity
• tRASMAX_T is the RDRAM tRAS,MAX parameter expressed as a number of TCLKs
• A REF_INT value of 0 disables memory refresh.

The number of refresh operations issued each interval is programmed by means of REF_BURST[3:0]. This register serves as an offset, in that the actual burst length is 1 more than the programmed value. The legal range for the CSR is 0 ... 15, yielding a burst length range of 1 ... 16.

16.6.9 DRAM Calibration Control 1 - ZBOXn_DRAM_CALIB_CTL1

There are two DRAM calibration control 1 registers; ZBOX0_DRAM_CALIB_CTL1 and ZBOX1_DRAM_CALIB_CTL1.

Figure 16-44 shows the DRAM calibration control 1 register.

Figure 16-44 DRAM Calibration Control 1 (register layout showing fields CCTLIN[6:0], RAC_QUIET_SEL[1:0], RD_CC_SPC[4:0], and CC_INT[14:0])

Table 16-69 describes the DRAM calibration control 1 register fields.

Table 16-69 DRAM Calibration Control 1 Fields Description (Name / Extent / Type / Description)

Reserved [63:30] RW, MBZ

TC_INT[14:0] [29:15] RW: The number of refresh intervals between temperature calibrations.

CC_INT[14:0] [14:0] RW: The number of refresh intervals between current calibrations.

16.6.9.1 Temperature Calibration Interval

The temperature calibrate interval (expressed in number of refresh intervals) is programmed by means of TC_INT[14:0]. It should be set to the number of intervals contained in one-half of the RDRAM's temperature calibrate (tTCAL) parameter. The one-half arises from the fact that each temperature calibration sequence consists of two commands, TCEN followed by TCAL, such that the TCAL is issued once per RDRAM tTCAL interval.

16.6.9.2 Current Control Interval

The current calibrate interval (expressed in number of refresh intervals) is programmed by means of field CC_INT[14:0]. It should be set to the number of refresh intervals in one RDRAM current calibrate interval, divided by the number of devices on the channel. The 21464 RAC performs its current calibration immediately following the calibration of RDRAM device #0.

16.6.10 DRAM Calibration Control 2 - ZBOXn_DRAM_CALIB_CTL2

There are two DRAM calibration control 2 registers; ZBOX0_DRAM_CALIB_CTL2 and ZBOX1_DRAM_CALIB_CTL2.

Figure 16-45 shows the DRAM calibration control 2 register.
Figure 16-45 DRAM Calibration Control 2 (register layout showing fields TCQUIET[7:0], TC_QUIET_SEL, RD_TC_SPC[5:0], and TC_INT[14:0])

Table 16-70 describes the DRAM calibration control 2 register fields.

Table 16-70 DRAM Calibration Control 2 Fields Description (Name / Extent / Type / Description)

Reserved [63:30] RW, MBZ

CCTLIN[6:0] [29:23] RW: Used for manual RAC current control update. The value of CCTLIN[6:0] is copied to the die bumps.

RAC_QUIET_SEL[1:0] [22:21] RW: This field selects the number of tPKTs (1 tPKT = 4 TCLKs) of quiet period (no Rambus activity) after performing current calibration of the 21464 internal RACs.
  RAC_QUIET_SEL[1:0]  Amount of additional delay
  0  0
  1  1 tPKT
  2  2 tPKTs
  3  4 tPKTs

RD_CC_SPC[4:0] [20:15] RW: This is the number of TCLKs of dead time that must be inserted between COLC=CC and COLC=RD packets.

TCQUIET[7:0] [14:7] RW: Number of read cycles prohibited after TCAL.

TC_QUIET_SEL [6] RW: Enables an optional all-quiet period after a TempCal command is issued to the RDRAMs. When set, the quiet period as specified by TCQUIET[7:0] causes the ROW, COL, and DATA buses not to be driven. When clear, the quiet period only applies to a period of prohibition of reads.

RD_TC_SPC[5:0] [5:0] RW: This is the number of TCLKs of dead time that must be inserted between COLC=RD and COLC=TC packets.

16.6.10.1 Read to Current Control Transition

Prior to executing a current calibration packet to a given device, that device must not have been read within the previous READTOCC TCLKs. The field that controls this quiet period is RD_CC_SPC. This CSR should be programmed to the RDRAM's READTOCC parameter.

16.6.10.2 Temperature Calibrate to Read Transition

A gap must be injected on the Rambus to enforce the "quiet" period between temperature calibrate (TCAL) and the next read transaction. TCQUIET[7:0] controls the length of the quiet period. The formula for calculating TCQUIET[7:0] is:

TCQUIET[7:0] = tCAL + tTCQUIET - tPACKET - tCAC

16.6.10.3 Read to Temperature Calibrate Transition

One must make sure that the beginning of the tTCQUIET period is not violated by a prior read or CAL/SAM operation. To do this, the TCAL packet must be delayed to a certain point past the last previous READ or CAL/SAM packet. RD_TC_SPC[5:0] controls the length of this period. The formula for calculating RD_TC_SPC[5:0] is:

RD_TC_SPC[5:0] = tCAC - tTCAL + tPACKET + tPACKET

16.6.11 DRAM Timing Control 4 - ZBOXn_DRAM_TIMING_CTL4

There are two DRAM timing control 4 registers; ZBOX0_DRAM_TIMING_CTL4 and ZBOX1_DRAM_TIMING_CTL4.

Figure 16-46 shows the DRAM timing control 4 register.

Figure 16-46 DRAM Timing Control 4 (register layout showing fields TRASref_OFF[6:0], TRPref_OFF[5:0], TPPref_OFF[4:0], TRRref_OFF[4:0], Reserved, REF_TIMER, and TOFFP[3:0])

Table 16-71 describes the DRAM timing control 4 register fields.
Table 16-71 DRAM Timing Control 4 Fields Description Name Extent Type Description Reserved [63:32] RO, MBZ TRASref_OFF[6:0] [31:25] RW Used to determine tRAS (RAS-PRE) during burst refresh. TRASref_OFF = Rambus tRASref TRPref_OFF[5:0] [24:19] RW Used to determine tRP (PRE-RAS) during burst refresh. TRPref_OFF = Rambus tRPref TPPref_OFF[4:0] [18:14] RW Used to determine tPP during burst refresh (PRE-PRE to same device). TPPref_OFF =Rambus tPPref TRRref_OFF[4:0] [13:9] RW Used to determine tRR during burst refresh (RAS-RAS to same device). TRRref_OFF =Rambus tRRref Reserved [8:5] RW, MBZ REF_TIMER [4] RW If set, enables alternate parameters during refresh. TOFFP[3:0] [3:0] RW Specifies tOFFP value (COLX = PREX to "implied" ROWR = PRER) 16.6.12 DRAM Refresh Row - ZBOXn_DRAM_REFRESH_ROW There are two DRAM refresh row registers; ZBOXO_DRAM_REFRESH_ROW and ZBOXl_DRAM_REFRESH_ROW. Compaq Confidentia I 5 January 2001 ~· Subject To Change Internal Processor Registers 16-71 Zbox:IPRs Figure 16-47 shows the DRAM refresh row register. Figure·16-47 DRAM Refresh Row --------.-. 63 1413 1 0 REF_ R O W _ R E G [ 1 2 : 0 ] - - - - - - - - - - - - - - - - - - - - - - - - - - - - ' RIP-------------------------------J LK99-0104A Table 16-72 describes the DRAM refresh row register fields. Table 16-72 DRAM Refresh Row Fields Description Name Extent Reserved [63:14] Type Description RO, MBZ REF_ROW_REG[12:0] [13:1] RW The next row to be refreshed - the refresh row number should be written to zero for the refresh all memory function (110 for CMD in DRAM_INIT_CTL). RIP [0] RO Refresh-all-mem_In_Progress status (polled to control PDN entry/exit). 16.6.13 DRAM Initialization Control - ZBOXn_DRAM_INIT_CTL There are two DRAM initialization control registers; ZBOXO_DRAM_INIT_CTL and ZB OXl_DRAM_INIT_CTL. Figure 16-48 shows the DRAM initialization control register. Figure 16-48 DRAM Initialization Control 63 3 2 0 LDR-------------------------~ SAE-------------------------~ SFZ--------------------------~ 8--------------------------~ DEV[4:0] -----------------------------' RACCMD[4:0] CDM[2:0]---------------------------~ LK99-0105A Com p.aq Confidential 16-72 Internal Processor Registers 5 JamJc1ry 2001 - Subject To Change ZboxlPRs Table 16-73 describes the DRAM initialization control register fields. Table 16-73 DRAM Initialization Control Fields Description Name Extent Type Reserved [63: 11] RO, MBZ SAE [10] WO Stop refresh-all-mem after refreshing bank=max, row=max SFZ [ 9] WO Start from bank=O, row=O for Refresh-All-mem B [ 8] WO Broadcast to all devices. Cannot be used for AT1N DEV[4:0] [7:3] WO If non-RAC function (CMD!=Oblll), device number targeted by CMD RACCMD[ 4:0] [7:3] WO If RAC function (CMD=Oblll), encoded as follows: CMD[2:0] [2: 0] WO Description Bits Meaning Bits Meaning 0000 0010 0100 0110 1000 1010 1100 1110 NOP Assert PWRUP Assert RESET Reserved Assert CCTLAUTO Assert CCTLEN Assert CCTLLD Reserved 0001 0011 0101 0111 1001 1011 1101 1111 NOP Deassert PWRUP Deassert RESET Reserved Deassert CCTLAUTO Deassert CCTLEN Deassert CCTLLD Reserved Function encoded as follows: Bits Meaning 000 001 010 011 100 101 110 111 NOP TCALdevice TCENdevice Current calibrate PDNR AT1N Refresh all mem RAC Function 16.6.14 DIFT Control -ZBOXn_DIFT_CTL There are two DIFT control registers; ZBOXO_DIFT_CTL and ZBOXl_DIFT_CTL. Figure 16-49 shows the D IFT control register. 
Compaq Confidential 5 January 2001 -- Subject To Change Internal Processor Registers 16-73 ZboxlPRs Figure 16-49 DIFT Control 63 2928 2726 17161514131211 9 8 7 PRBQ_FORCE_Sn<C---------------~ PRBQ_STI<C_DIS-------------------' DIFT_ISS_CTL[9:0]---------------------' ReseNed----------------------~ DIFT_SGLSTP----------------------~ BYP_EN-----------------------~ INIT_ON--------------------------' SWEEP_ON---------------------------' PIDSHIFT[2:0]------------------------~ PIDWIDTH--------------------------~ PID[7:0]------------------------------' LK99-0106A Compaq Confidential 16-74 Internal Processor Registers 5 Jammry 2001 - Subject To Change ZboxlPRs Table 16-74 describes the PID control register fields. Table 16-74 PIO Control Fields Description Name Extent Type Reserved [63:29] RW, MBZ Description PRBQ_FORCE_STXC [28] RW Mimics functions of signal with same name in Cbox CSR and should be set to the same value. PRBQ_STXC_DIS [27] RW Mimics functions of signal with same name in Cbox CSR and should be set to the same value. DIFT_ISS_CTL[9:0] [26: 17] RW Reserved [16] RW, MBZ DIFT_SGLSTP [15] RW When set, BeginQueue is stalled until the DIFT is idle (that is, all previous transactions have retired). BYP_EN [14] RW Bypass enable from allocation to ZRQ. This bit, when set, enables bypassing directly from DIFT allocation directly into the Zbox middle (ZRQ). INIT_ON [13] RW Put DIFT into init mode. INIT mode is used to initialize memory. When the Zbox is in this mode the processor is expected to submit inval_to_dirty requests to the zbox. The zbox ignores the current memory state and responds success to the inval_to_dirty command. Once the block is victimized, memory is initialized. SWEEP_ON [12] RW Put DIFT into sweep mode. PIDSHIFT[2:0] [11:9] RW Processor Shift value. Specifies how many bits to right shift the PID. Supported Range [0 .. .4]. PIDWIDTH [8] RW Processor ID width mode: PID[7:0] [7:0] RW Maximum value to initialize Softsnap counter. Pidwidth Mode 0 1 6bit PID 8bit PID Processor ID value (loaded from module CSR) 16.6.15 DRAM Error Address - ZBOXn_DRAM_ERR_ADR There are two DRAM error address registers; ZBOXO_DRAM_ERR_ADR and ZBOXl_DRAM_ERR_ADR. Figure 16-50 shows the DRAM error address register. Compaq Confidential 5 January 2001 ·- Subject To Change Internal Processor Registers 16-75 Zbo:xlPRs Figure 16-50 DRAM Error Address L.K99-0107A Table 16-75 describes the DRAM error address register fields. Table 16-75 DRAM Error Address Fields Description Name Extent Type Description Reserved [63:29] RO, MBZ ERR_ADDR [28:0] RWAC Memory Address of the first (more serious) error encountered. A correctable error address can be overwritten by an uncorrectable error address. (Memory Address is Physical Address with PID and byte-in-Cache_line address bits PA[5:0] removed. 16.6.16 DIFT Timeout - ZBOXn_DIFT_TIMEOUT There are two DIFT timeout registers; ZBOXO_DIFT_TIMEOUT and ZBOXl_DIFT_TIMEOUT. Figure 16-51 shows the DIFT timeout register. Figure 16-51 DIFT Timeout 63 323130 DIFT_TIMEOUT_EN-------------~ DIFT_TIMEOUT_VALUE--------------------~ LK9!)-0108A Table 16-76 describes the DIFT timeout register fields. Table 16-76 DIFT Timeout Fields Description Name Extent Type Description Reserved [63:32] RO, MBZ DIFT_TIMEOUT_EN [31] RW Enables DIFT timeout interrupts if set. DIFT_TIMEOUT_VALUE [30:0] RW Value to reload DIFT timer when it counts down to 0. 
This register specifies a timeout value for the overall DIFT timer. This timer sends pulses to the 5-bit timers held with each DIFT entry. The value in this register specifies the period of the pulses. When the N-bit timer with a given DIFT entry cycles through all 2**N states, the timer expires. Each timer cycles to the next state with each pulse from the overall DIFT timer. This allows DIFT timeouts in the range of 2**6 to 2**36 cycles. The DIFT timer is reloaded when it counts to zero, or when DIFT_TIMEOUT_EN transitions from a 0 to a 1.

16.6.17 DRAM Mapper Control - ZBOXn_DRAM_MAPPER_CTL

There are two DRAM mapper control registers; ZBOX0_DRAM_MAPPER_CTL and ZBOX1_DRAM_MAPPER_CTL.

The system programmer is given the flexibility to extract the Rambus BANK, DEVICE, ROW, and COLUMN fields from the Memory Address (MA) to reduce the overall number of bank conflicts (that is, the unnecessary act of closing pages).

Physical → Memory Address Mapping

There are four modes of interpretation for the physical → memory address mapping, based on the small_addr and striped mode enable bits in the Cbox. ??????? describes and illustrates these modes, and the following table summarizes them.

  striped mode  small_addr  mem_adr[28:0]
  0             0           MA[28:0] = PA[42] | PA[33:6]
  0             1           MA[28:0] = PA[42] | PA[35:9] | PA[6]
  1             0           MA[28:0] = PA[42] | PA[37] | PA[35] | PA[31:6]
  1             1           MA[28:0] = PA[42] | PA[33:32] | PA[37] | PA[35] | PA[31:9] | PA[6]

Figure 16-52 shows the DRAM mapper control register.

Figure 16-52 DRAM Mapper Control (register layout showing fields NUM_PORT_OFF[0], DEV_START_OFF[2:0], BNK_START_OFF[3:0], RWH_START_OFF[3:0], RWL_START_OFF[2:0], DEV_WIDTH_OFF[1:0], BNK_WIDTH_OFF[2:0], ROW_WIDTH_OFF[2:0], RWL_WIDTH[2:0], COL_WIDTH_OFF[1:0], DEP_BNK, and SPLIT_BNK)

Table 16-77 describes the DRAM mapper control register fields.

Table 16-77 DRAM Mapper Control Fields Description (Name / Extent / Type / Description)

Reserved [63:30] RW, MBZ

NUM_PORT_OFF[0] [29] RW: This offset value specifies the number of memory ports that will be active. If one port is specified (num_port_off=0), the colstart is implied to be MA[0]. If two ports are specified (num_port_off=1), the colstart is implied to be MA[1].

DEV_START_OFF[2:0] [28:26] RW: This offset value specifies the bit position within the MA at which to start the DEVICE field extraction. Valid values are 0 ... 7, with the device start = MA[DEV_START_OFF+5]:
  DEV_START_OFF[2:0]  Dev_start
  0  MA[5]
  1  MA[6]
  2  MA[7]
  3  MA[8]
  4  MA[9]
  5  MA[10]
  6  MA[11]
  7  MA[12]

BNK_START_OFF[3:0] [25:22] RW: This offset value specifies the bit position within the MA at which to start the BANK field extraction. The BANK field is extracted from the MA in reverse bit order (for example, B[0:n]) to minimize bank conflicts in dependent-bank Direct RDRAM devices.
Unlike the other start fields, BNK_START_OFF describes which MA bit to start B[O] extraction. Valid ranges are 14 ... 0, with the bank start= MA[DEV_START_OFF+8] BNK_START_OFF[3:0] Bnk_start B[O] 0 1 2 3 4 5 6 7 8 9 10 MA[8] MA[9] MA[lO] MA[ll] MA[12] MA[13] MA[14] MA[15] MA[16] MA[17] MA[18] MA[19] MA[20] MA[21] MA[22] 11 12 13 14 RWH_START_OFF[3 :0] [21: 18] RW This offset value specifies the bit position within the MA to start the high order ROW field extraction. This field must compensate for any low order ROW bits that have already been extracted from the row hole. If none of the row address bits are used in the row hole (rwl_width= 0), then the values of 0 and 10 correspond to bit positions MA[9] and MA[19] respectivly. The corresponding values increase by one for each bit taken into the hole. Valid ranges are 10 ... 0, with RowN starting at bit position RWH_START_OFF+9+n. The following table shows the appropriate values. Figure 16-53 shows an intpretation of Row High. RWH_START_OFF RWL_WIDTH 0 0 9 10 11 12 13 14 15 16 17 18 19 10 11 12 13 14 15 16 17 18 19 20 11 12 13 14 15 16 17 18 19 20 21 12 13 14 15 16 17 18 19 20 21 22 13 14 15 16 17 18 19 20 21 22 ~ 14 15 16 17 18 19 20 21 22 ~ ~ 2 3 4 5 1 2 3 4 5 6 7 8 9 10 The chart shows the initial high order bit to be extracted. For example if a three bit row hole (2) is found in the lower bits, then the "Three bit hole(R3)" is consulted. To begin the upper order extraction from bit position 22, a RWH_START_OFF of 4 is required. Compaq Confidential 5 January 2001 -~ Subject To Change Internal Processor Registers 16-79 ZboxlPRs Table 16-77 DRAM Mapper Control Fields Description (Continued) Name Extent Type Description RWL_START_OFF[2:0] [17:15] RW This offset value specifies the bit position within the MA to start the low order ROW field extraction. Valid ranges are 4 ... 0, with the row lower start= MA[5+RWL_START_OFF] RWL_START_OFF[3:0] Rwl_start B[O] 0 1 MA[5] MA[6] MA[7] MA[8] MA[9] 2 3 4 DEV_WIDTH_OFF[l:O] [14: 13] RW This offset value specifies the number of bits to extract from the MA for the DEVICE field. DEV_WIDTH_OFF[1:0] Dev_width 0 2b 4 devices 3b 8 devices 4b 16 devices 5b 3 2 devices 1 2 3 Compaq Confidential 16-80 Internal Processor Registers 5 Jc1nuc1ry 2001 ... Subject To Change ZboxlPRs Table 16-77 DRAM Mapper Control Fields Description (Continued) Name Extent Type Description BNK_WIDTH_OFF[2:0] [12: 10] RW This offset value specifies the number of bits to extract from the MA for the BANK field. The BANK field is extracted from the MA in reverse bit order (eg: B [O:n]) to minimize bank conflicts in dependent bank Direct RDRAM devices. The BNK_WIDTH_OFF field describes how many bits to extract to the right of the MA starting bit position. The BNK bits are extracted [LSB:MSB]. To enable 64 bank mode (BNK_WIDTH_OFF=4), software must guarantee that Dependent Bank mode is also enabled (DEP_BNK=l). BNK_WIDTH_ OFF[2:0] Bnk_width 0 2b 4 banks 1 3 3b 8 banks 4b 16 banks Sb 32 banks 4 6b 64 banks 2 An example of how BANK is extracted from MA: f BNK_START_OFF[3:0] l*l*I.. J BNK_WIDTH_OFF[2:0] The system programmer is given the ability to extract lower order ROW bits from the MA between the DEVICE and COLUMN extraction points. This "row hole" is required if the DEVICE starting MA bit position does not fall exactly next to the COLUMN ending MA bit position, as shown in the following figure. JoojRsjR4jRsjR2 jR1 jRo Jcmax Jcmax-1 J... 
J ~ Row Hole -...j ROW_WIDTH_OFF[2:0] [9:7] RW This offset value specifies the total number of row bits to extract from the MA (including those from the row hole). For instance, If row_width_off=O (row size=9b) andrwl_width_off=5 (row hole=5b), then (9b-5b)=4b are extracted for the high order row. See table under BNK_WIDTH_OFF[2:0] to determine where the high order row bits are extracted based on the size of the row hole. The Rambus ROW packet protocol allows extensions for 12- and 13-bit rows by using bits defined as bank bits. ROW _WIDTH_OFF[2:0] Row_width 0 1 9b lOb llb 12b [bank(5) = row(ll)] 13b [bank(5,4) = row(ll,12)] 2 3 4 Compaq Confidential 5 January 2001 - Subject To Change Internal Processor Registers 16-81 ZboxlPRs Table 16-77 DRAM Mapper Control Fields Description (Continued) Name Extent Type Description RWL_WIDTH[2:0] [6:4] RW This value specifies the number of bits to extract from the MA for the low order ROW field. RWL_WIDTH[2:0] Rwl_width 0 Ob (no row hole) lb 2b 3b 4b 5b 1 2 3 4 5 COL_WIDTH_OFF[l:O] [3:2] RW This offset value specifies the number of bits to extract from the PA for the COLUMN field. COL_WIDTH_OFF[1 :O] Col_width 0 5b 6b 7b 2 DEP_BNK [1] RW Dependent Bank Mode. When set, assumes dependent bank devices. SPLIT_BNK [0] RW Split Bank Mode. When set, assumes a split bank device. When SPLIT_BNK is set, DEP_BNK must also be set because a split bank device implies that the banks are dependent. Compaq Confidential 16-82 Internal Processor Registers 5 Jc1nuary 2001 ··· Subject To Change ZboxlPRs Figure 16-53 Interpretation of Row High Figure TBS (LK99-0110A.WMF). Need clarification on the ASCII figure below in order to create useful line art. 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1 1 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ I I I I I I I I I I I I I I I I I I I I I I I I I I I I I I +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ rwh_start_off rwl_width A Row-High starting bit = MA<9+rwh_start_of f +rwl_width> <---1 <-------1 16.6.18 Zbox Performance Counter 0 - ZBOXn_ZPM_CTRO There are two Zbox performance counter 0 registers; ZBOXO_ZPM_CTRO and ZBOXl_ZPM_CTRO. This is a 31-bit event counter and an underflow bit. ZBOXn_ZPM_CTRO can be programmed to count one of 32 items related to the Zbox middle. The counter can be preloaded with an initial count via software. When the selected event occurs, the corresponding counter is decremented. When either counter counts below zero, the Zbox generates a performance_monitor interrupt. Only the first underflow causes a performance_monitor interrupt, so that the interrupt can be disabled by writing a 1 to the underflow bit. The interrupt occurs on the 0-71 transition, therefore, #event-1 must be loaded into the counters. Figure 16-54 shows the Zbox performance counter 0 register. Figure 16-54 Zbox Performance Counter O ZBOX_PERF_ C T R O _ U N D - - - - - - - - - - - - - - - ' ZBOX_PERF_CTRO-----------------------' LK91t-0111A Compaq Confidential 5 January 2001 ··· Subject To Change Internal Processor Registers 16-83 ZboxlPRs Table 16-78 describes the Zbox performance cowiter 0 fields. Table 16-78 Zbox Performance Counter o Fields Description Name Extent Type Reserved [63:32] RW, MBZ Description ZBOX_PERF_CTRO_UND [31] RW Indicates counter underflow. ZBOX_PERF_CTRO RW Zbox Performance counter 0. Decrements when the condition specified by ZBOX_PERF_CTL[ 4:0] have been met. A performance counter interrupt is signalled when the counter underflows. 
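The counter preload rule described above (load the count minus 1, and take the performance_monitor interrupt on the underflow bit's 0→1 transition) can be illustrated with a small C sketch. This is illustration only; the macro and function names are invented, and the mechanism for actually reading or writing ZBOXn_ZPM_CTR0 is not shown here.

    #include <stdint.h>
    #include <stdio.h>

    #define ZPM_CTR_UND   (1u << 31)         /* underflow bit             */
    #define ZPM_CTR_MASK  0x7FFFFFFFu        /* 31-bit event count field  */

    /* Value to load so that an interrupt fires after 'events' occurrences. */
    static uint32_t zpm_ctr_preload(uint32_t events)
    {
        return (events - 1u) & ZPM_CTR_MASK; /* #events - 1, underflow bit clear */
    }

    int main(void)
    {
        uint32_t ctr = zpm_ctr_preload(100000);     /* sample every 100000 events */
        printf("preload value = 0x%08x\n", ctr);

        /* A hypothetical handler tests the underflow bit; writing a 1 to it
           disables further performance_monitor interrupts until the counter
           is reloaded. */
        uint32_t snapshot = ctr | ZPM_CTR_UND;      /* pretend it underflowed */
        if (snapshot & ZPM_CTR_UND)
            printf("counter underflowed; clear the bit and reload\n");
        return 0;
    }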
[30:0] 16.6.19 Zbox Performance Counter 1 - ZBOXn_ZPM_CTR1 There are two Zbox performance cowiter 1 registers; ZBOXO_ZPM_CTRl and ZBOXl_ZPM_CTRl. This is a 31-bit event cowiter and an underflow bit. ZBOXn_ZPM_CTRl can be programmed to count one of 16 items related to the Zbox front-end (DIFT). The counter can be preloaded with an initial count via software. When the selected event occurs, the corresponding counter is decremented. When either counter counts below zero, the Zbox generates a performance_monitor interrupt. Only the first underflow causes a performance_monitor interrupt, so that the interrupt can be disabled by writing a 1 to the underflow bit. The interrupt occurs on the 0-71 transition, therefore, #event-1 must be loaded into the counters. Figure 16-55 shows the Zbox performance cowiter 1 register. Figure 16-55 Zbox Performance Counter 1 63 3231 30 ZBOX_PERF_ C T R 1 _ U N D - - - - - - - - - - - - - - - ' ZBOX_PERF_CTR1 ---------------------~ Compaq Confidential 16-84 Internal Processor Registers 5 Jc1m1ary 2001 m Subject To Change ZboxlPRs Table 16-79 describes the Zbox performance counter 1 fields. Table 16-79 Zbox Performance Counter 1 Fields Description Name Extent Type Reserved [63:32] RW, MBZ Description ZBOX_PERF_CTRl_UND [31] RW Indicates counter underflow. ZBOX_PERF_CTRl RW Zbox Performance counter 1. Decrements when the condition specified by ZBOX_PERF_CTL[8:5] have been met. A performance counter interrupt is signalled when the counter underflows. [30:0] 16.6.20 Zbox Performance Control - ZBOXn_ZPM_CTL There are two Zbox performance control registers; ZBOXO_ZPM_CTL and ZBOXl_ZPM_CTL. Figure 16-56 shows the Zbox performance control register. Figure 16-56 Zbox Performance Control 63 9 8 5 4 lllt:::::::·:·:·:.Jllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll!llllllllllllll!llllllllllllllllllllll Reserved-------------' ZPM_CTL1 [ 3 : 0 ) - - - - - - - - - - - - - - - - - - - - - - - - - - - - - ' ZPM_CTL0[4:0]-------------------------------' LK99-0113A Table 16-80 describes the Zbox performance control fields. Table 16-80 Zbox Performance Control Fields Description Name Extent Type Reserved [63:12] RO,MBZ Reserved [11:9] RW,MBZ ZPM_CTL1[3:0] [8:5] RW Description Control for Zbox Performance Counter 1. CTLl Item to count (ZPM_CTR1) 00000 nq_any - regardless of reject status 00001 nq_prq - regardless of reject status 00010 nq_rsq - regardless of reject status 00011 nq_csq - regardless of reject status 00100 nq_any-qualified by&! reject 00101 nq_prq - qualified by & ! reject Compaq Confidential 5 January 2001 ·- Subject To Change Internal Processor Registers 16-85 ZboxlPRs Table 16-80 Zbox Performance Control Fields Description (Continued) Name Extent Type Description 00110 nq_rsq -qualified by&! reject 00111 nq_csq - qualified by & ! 
reject 01000 nq_any - qualified by & reject 01001 nq_prq - qualified by & reject 01010 nq_rsq - qualified by & reject 01011 nq_csq - qualified by & reject 01100 nq_rej -No Fill Buffers available 01101 nq_rej - Shadow Reject (tRR, tPP) in PDN or SHDPND Interval Compaq Confidential 16-86 Internal Processor Registers 5 Jc·muc1ry 2001 - Subject To Change ZboxlPRs Table 16-80 Zbox Performance Control Fields Description (Continued) Name Extent ZPM_CTL0[4:0] [4:0] Type RW Description CTLl Item to count(ZPM_CTR1) 01110 nq_rej- Page-Conflict Reject (tRP,tRCD) in PND or BLKPND or NQPRPND Interval, (tRAS) in PND or HLDPND Interval (R/w), (tRDP) in SHDPND interval (R) 01111 nq_rej- WRB Reject (tRTR, tRTP) 10000 nq_rej - Queue Full Reject 10001 nq-rej-NQ' Waterfall priority over DFf-NQ 10010 cmd = dir_only_read 10011 cmd = dir+data_read 10100 cmd = dir_only_write 10101 cmd = dir+data_ write 10110 PRER precharge 10111 PREX precharge 11000 PREC precharge 11001 COL=RD 11010 COL=WR 11011 COL=NOCOP 11100 COL=Any 11101 Starvation detections 11110 Force write retire 11111 Deferred write retire Control for Zbox Performance Counter 0. CTLO Item to count {ZPM_CTRO) 0000 Incoming transaction (any) 0001 Incoming ReadSharedReq 0010 Incoming ReadModREq 0011 Incoming ReadReq 0100 Incoming FetchReq 0101 Incoming SharedToDirtyReq 0110 incoming SharedToDirtySTCReq 0111 Incoming InvalToDirtyReq 1000 Incoming Victim 1001 Incoming VictimClean 1001 Outgoing Forward (any) 1010 Outgoing Forward=InvalSingle Compaq Confidential 5 January 2001 - Subject To Change Internal Processor Registers 16-87 Zbo:xlPRs Table 16-80 Zbox Performance Control Fields Description (Continued) Name Extent Description Type CTLO Item to count (ZPM_CTRO) 1011 Outgoing Forward=lnvalMask 1100 Outgoing Forward=Read(anytype)Forward 1101 Outgoing Forward=FetchForward 1110 Outgoing Forward=ItoDForward 1111 Forward Miss received 16.6.21 Zbox Sweep Directory Bits - ZBOXn_DRAM_SWEEP_DIR There are two Zbox sweep directory bits registers; ZB OXO_DRAM_SWEEP_DIR and ZBOXl_DRAM_SWEEP_DIR. Figure 16-57 shows the Zbox sweep directory bits register. Figure 16-57 Zbox Sweep Directory Bits 282726 63 2120 DIR_DATA[21]---------------~ DIR_ECC[5:0]-----------------~ DIR_DATA[20:0]------------------------~ LK99-0114A Table 16-81 describes the Zbox sweep directory bits fields. Table 16-81 Zbox Sweep Directory Bits Fields Description Name Extent Type Reserved [63:28] RO, MBZ DIR_DATA[21] [27] DIR_ECC[5:0] [26:21] RO Contains directory entry ECC from last read from memory. Valid only if SWEEP_ON is set. DIR_DATA[20:0] [20:0] Contains directory entry data from last read from memory. Valid only if SWEEP_ON is set. RO RO Description Contains directory entry data[21] from last read from memory. Valid only if SWEEP_ON is set. This bit is normally 0, and should be set only if the directory entry is written with the SET_DIR21 control bit set. Compaq Confidentia I 16-88 Internal Processor Registers 5 Janwiry 2001 - Subject To Change ZboxlPRs 16.6.22 Zbox Force-Error Address register - ZBOXn_FRC_ERR_ADR There are two Zbox force-error address registers; ZBOXO_FRC_ERR_ADR and ZBOXl_FRC_ERR_ADR. The Zbox provides a means to force errors when a particular physical address is written. The address is specified in Rambus device, bank, row and column format. When the matching function is enabled (by means ofDRAM_ERR_CTL[MAT_ERR_ENA]), the LSB of the Column address is cleared (as specified by DRAM_ERROR_CTL[FRC_WTERR]). 
The physical address match is on a cacheblock granularity, with separate, independent registers for each Zbox. Use of this register requires knowledge of the address mapping scheme in use. This register can also be used to provide a means to set directory _data[21] when an address match occurs (by means of DRAM_ERR_CTL[ADR_MAT_TRIGGER]). The Zbox can also optionally generate an interrupt (if DRAM_ERROR_CTL[ERR_INT_ENAB[6]] is set) upon reading a block of memory with directory_data[21]=1. Figure 16-58 shows the Zbox force-error address registers. Figure 16-58 Zbox Force-Error Address Register 2625 2019 7 6 0 Reserved------~ FRC_DEV[4:0]---------------~ FRC_BNK[5:0]--------------------' FRC_ROW[12:0]--------------------------' FRC_COL[6:0]------------------------------' LK99-0124A Table 16-82 describes the Zbox force-error address fields. Table 16-82 Zbox Force-Error Address Fields Description Name Extent Type Reserved [63:31] RW, MBZ FRC_DEV[4:0] [30:26] RW Rambus Device number FRC_BNK[5:0] [25:20] RW Rambus Bank number FRC_ROW[12:0] [19:7] RW Rambus Row number FRC_COL[6:0] [6:0] RW Rambus Column number Description Compaq Confidential 5 Jatnmry 2001 -- Subject To Change Internal Processor Registers 16-89 ZboxlPRs 16.6.23 Zbox DIFT Error Status - ZBOXn_DIFT_ERR_STATUS There are two Zbox DIFT error status registers; ZBOXO_DIFT_ERR_STATUS and ZB OXl_DIFT_ERR_STATUS. Figure 16-59 shows the Zbox DIFT error status registers. Figure 16-59 Zbox DIFT Error Status Register DIFT_ERR_STATUS[31 : 0 ) - - - - - - - - - - - - - - - - - - - - - - - - - ' LK99-0125A Compaq Confidential 16-90 Internal Processor Registers 5 Janw~ry 2001 ·-Subject To Change ZboxlPRs Table 16-83 describes the Zbox DIFT error status fields. Table 16-83 Zbox DIFT Error Status Fields Description Name Extent Type Reserved [63:32] RO, MBZ DIFT_ERR_STATUS[31:0] [31:0] RWAC Contains hardware error status bits from DIFT. For HW debugging only. A write clears all bits. [31:18] MBZ Reserved Description [17] DIFT debit counter overflow [16] DIFT debit counter underflow [15] Block response to Rbox credit overflow [14] Block response to Rbox credit underflow [13] Non-block response to Rbox credit overflow [12] Non-block response to Rbox credit underflow [11] Forward to Rbox credit overflow [10] Forward to Rbox credit underflow [9] Protocol violation or unsupported case detected during DIR write [8] DirWrite command logic error [7] DirRead command logic error [6] Simultaneous issue of Forward and Invalidate to FORWARD channel [5] Simultaneous issue of DirRead and DirWrite to MEM channel [4] Incoming ACK failed to merge to any DIFT entry [3] New entry allocation failed due to empty freelist [2] Unknown command decoded by packet accumulator [l] Framing error on incoming packet from Rbox [0] z_dft_acc->err_cbad_frame_a_h = framing error on incoming packet from Cbox 16.6.24 Zbox RAC Control - ZBOXn_RAC_CTL There are two Zbox RAC control registers; ZBOXO_RAC_CTL and ZBOXl_RAC_CTL. Figure 16-60 shows the Zbox RAC control registers. Figure 16-60 Zbox RAC Control Register Figure TBS when bits are defined. Compaq Confidential 5 January 2001 -·Subject To Change Internal Processor Registers 16-91 ZboxlPRs Table 16-84 describes the Zbox RAC control fields. Table 16-84 Zbox RAC Control Fields Description Name Extent Type Reserved [63:32] RO, MBZ RAC_CTL[31:0] [31:0] RW Description Contains control bits for RAC. No bits are currently defined. 
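As an illustration of how diagnostic software might report the hardware debug bits listed in Table 16-83 for ZBOXn_DIFT_ERR_STATUS, the following C sketch walks the defined bit positions. The function and array names are invented for this document, and the mechanism for reading the register (and the write that clears all bits) is not shown.

    #include <stdint.h>
    #include <stdio.h>

    /* Bit positions and meanings per Table 16-83 (hardware debug only). */
    static const char *dift_err_name[18] = {
        [17] = "DIFT debit counter overflow",
        [16] = "DIFT debit counter underflow",
        [15] = "Block response to Rbox credit overflow",
        [14] = "Block response to Rbox credit underflow",
        [13] = "Non-block response to Rbox credit overflow",
        [12] = "Non-block response to Rbox credit underflow",
        [11] = "Forward to Rbox credit overflow",
        [10] = "Forward to Rbox credit underflow",
        [ 9] = "Protocol violation or unsupported case during DIR write",
        [ 8] = "DirWrite command logic error",
        [ 7] = "DirRead command logic error",
        [ 6] = "Simultaneous Forward and Invalidate to FORWARD channel",
        [ 5] = "Simultaneous DirRead and DirWrite to MEM channel",
        [ 4] = "Incoming ACK failed to merge to any DIFT entry",
        [ 3] = "New entry allocation failed due to empty freelist",
        [ 2] = "Unknown command decoded by packet accumulator",
        [ 1] = "Framing error on incoming packet from Rbox",
        [ 0] = "Framing error on incoming packet from Cbox",
    };

    static void report_dift_errors(uint32_t status)
    {
        for (int bit = 17; bit >= 0; bit--)
            if (status & (1u << bit))
                printf("DIFT error bit %d: %s\n", bit, dift_err_name[bit]);
    }

    int main(void)
    {
        report_dift_errors(0x00000009);   /* example: bits 0 and 3 set */
        return 0;
    }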
Compaq Confidentia I 16-92 Internal Processor Registers 5 Jc1nuc1ry 2001 - Subject To Change HW___.LD and HW___.ST Instructions 17 Privileged Architecture Library Code 17.1 HW_LD and HW_ST Instructions PALcode uses the HW_LD and HW_ST instructions to access memory outside the realm of normal Alpha memory management and perform special Dstream load and store transactions. The data conversions are identical to byte, word, long and quad integer counterparts. Data alignment traps are disabled for all forms of the HW_LD and HW_ST instructions and the effective address is forced to match the specified data size. The instruction format of the HW_LD and HW _ST instructions is: Figure 17-1 HW_LD/HW_ST Instruction Format 31 HW_LD: I 0 2625 Opcode I Displacement HW_ST: Table 17-1 HW_LD/HW_ST Instruction Fields Description Field Name Extent Description Opcode 31: 26 The instruction Opcode. OxlB HW_LD OxlF HW_ST Ra 25:21 The destination register number for loads or the write data for stores Rb 20: 16 Source register which holds the base address of the operation. Compaq Confidential 5 Janm1ry 2001 - Subject To Change Privileged Architecture Library Code 17-1 HW____ lD and HW____ ST Instructions Table 17-1 HW_LD/HW_ST Instruction Fields Description (Continued) Field Name Extent Description Type 15:13 Type of memory reference to perform. The /PfE and /WrChk operations are valid only for HW _LD operations. Length Disp 12: 11 10:0 Bits Set Type of Reference Meaning 000 Physical 010 Virtual/PfE 100 Virtual 101 Virtual/WrChk 110 Virtual/Alt 111 Virtual/WrChk/Alt The effective address for the HW_LD/ST instruction is physical, not virtual. Valid only for HW _LD, used to fetch page table entries from memory. TB faults vector directly to the double-miss flows. Kernel mode access checks are performed. The address virtual. How is this different from LD/ST? Alignment checks are disabled. The effective address for a HW_LD instruction is virtual. Access checks for fault-on-read, fault-on-write, read and write are performed Same as Virtual but the ALT field of the M_MODE register is used for access checking. Same as Vitrual/WrChk but the ALT field of the M_MODE register is used for access checking. The size of the data transaction. Data alignment checks are not performed but alignment is forced to the data size. Bits Set Meaning 00 01 10 11 Byte Access Word Access Longword Access Quadword Access An 11-bit signed displacement that is added the value in Rb to form the effective address of the load or store. Compaq Confidential 17-2 Privileged Architecture Library Code 5 Jc1nuc1ry 2001 m Subject To Change HW____MFPR and HW.___MTPR Instructions 17.2 HW_MFPR and HW_MTPR Instructions PALcode uses the HW_MFPR and HW_MTPR instructions to access the internal processor registers. The HW_MFPR instruction reads the value from the specified IPR into the integer register specified by the Ra field. The HW_MTPR instruction writes the value from the integer register specified by the Rb field into the specified IPR. 17.2.1 HW_MFPR Instruction The instruction format of the HW_MFPR instruction is: Figure 17-2 HW_MFPR Instruction Format 5 4 31 HW MFPR: Opcode Index 0 Re Table 17-2 HW_MFPR Fields Description Field Name Extent Description Opcode 31:26 The instruction Opcode: Ox.19 Re 4:0 Destination integer register. Index 12:5 Identifier of the IPR to read. See the IPR table for a complete list of indexes. The MSB of the Index field differentiates between IPRs located in the Mbox and IPRs located in the Ibox. 
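Because Figure 17-1 and Table 17-1 fully define the field positions, an HW_LD or HW_ST instruction word can be assembled mechanically. The following C sketch is illustrative only; the enum and function names are invented, while the opcodes, field widths, and encodings are taken from the table above.

    #include <stdint.h>
    #include <stdio.h>

    /* Field packing per Figure 17-1 / Table 17-1 (illustrative sketch). */
    enum { HW_LD_OPCODE = 0x1B, HW_ST_OPCODE = 0x1F };
    enum { TYPE_PHYS = 0, TYPE_VIRT_PTE = 2, TYPE_VIRT = 4,
           TYPE_VIRT_WRCHK = 5, TYPE_VIRT_ALT = 6, TYPE_VIRT_WRCHK_ALT = 7 };
    enum { LEN_BYTE = 0, LEN_WORD = 1, LEN_LONG = 2, LEN_QUAD = 3 };

    static uint32_t hw_ldst(unsigned opcode, unsigned ra, unsigned rb,
                            unsigned type, unsigned len, int disp)
    {
        return (opcode & 0x3F) << 26 |
               (ra     & 0x1F) << 21 |
               (rb     & 0x1F) << 16 |
               (type   & 0x7)  << 13 |
               (len    & 0x3)  << 11 |
               ((unsigned)disp & 0x7FF);     /* 11-bit signed displacement */
    }

    int main(void)
    {
        /* HW_LD r1, 8(r16): physical reference, quadword access */
        printf("HW_LD encoding: 0x%08x\n",
               hw_ldst(HW_LD_OPCODE, 1, 16, TYPE_PHYS, LEN_QUAD, 8));
        return 0;
    }

The example encodes a physical-mode quadword HW_LD with Ra = 1, Rb = 16, and a displacement of 8.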
MSB =0 Mbox MSB = 1 Ibox Rclass 24:21 Reader class of the instruction. The reader class defines an dependency against a previous IPR writer of the same class. The reader will not issue until the writer dependency has cleared. The format of the reader class field is as follows: Bits Description 3 Valid bit. If clear, no dependency exists 2:0 Class number The currently defined/allowed values for reader class are: Bits Set Meaning OXX:X lXXO lXXl No dependency Dependency against an IPR writer class of 0 Dependency against an IPR writer class of 1 5 January 2001 - Subject To Change Compaq Confidential Privileged Architecture Library Code 17-3 HW____ MFPR and HW___ MTPR Instructions 17.2.2 HW_MTPR Instruction The instruction format of the HW_MTPR instruction is: Figure 17-3 HW_MTPR Instruction Format 31 HW_MTPR: 26 24 2120 16 12 5 4 0 -O-p-cod_e....,l __R_c_la-ss_,..l_R_b_ _.--_ _ lnd-e-~-...... 1-w-c-las_s_,I .-1 Table 17-3 MT_MTPR Instruction Fields Description Field Name Extent Description Opcode 31:26 The instruction Opcode: OxlD. Rb 20: 16 Source integer register. Index 12:5 Identifier of the IPR to read or write. See the IPR table for a complete list of indicies. The MSB of the fudex field differentiates between IPRs located in the Mbox and IPRs located in the Ibox. MSB =0 Mbox MSB = 1 Ibox Rclass 24:21 Reader class of the instruction. The reader class defines an dependency against a previous IPR writer of the same class. The reader will not issue until the writer dependency has cleared. The format of the reader class field is as follows: Bits Description 3 2:0 Valid bit. If clear, no dependency exists Class number The currently defined/allowed values for reader class are: Bits Set Meaning OXXX lXXO lXXl No dependency Dependency against an IPR writer class of 0 Dependency against an IPR writer class of 1 Compaq Confidential 17-4 Privileged Architecture Library Code 5 J<1nwiry 2001 -~Subject To Change Execution of the RET Instruction in PAlmode Table 17-3 MT_MTPR Instruction Fields Description (Continued) Field Name Extent Description Wclass 4:0 Writer class of the instruction. The writer class defines the source of a reader class dependency. HW_MTPR instructions that define a writer class create an issue dependency that must be cleared before any IPR reader (MFPR or MTPR) of the same class can issue. The dependency is cleared when the writer issues unless the bubble-bit is set, when the bubble-bit is set, the dependency does not clear until a bubble acknowledgement is received for the writer. The general format of the writer class field is: Bits Description 4 3 Bubble bit - If set, issue logic waits for notification Valid bit. - If clear, no writer dependency is set. Class number. 2:0 The currently defined/allowed values for writer class are: Bits Set Meaning xoxxx No dependency Set dependency for IPR writer class 0 Set dependency for IPR writer class 1 Set completion buble dependency for IPR writer class 0 Set completion buble dependency for IPR writer class 1 OlXXO Ol:XX:l llXXO llXXl Completion bubble dependencies are only created for HW _MTPR instructions that target the Mbox IPRs. If the MSB of the Index field is set, indicating an Ibox register target, the bubble bit is ignored and an issue dependency is created. 17.3 Execution of the RET Instruction in PALmode The special PALmode HW_RET instruction that was implemented in the 21264 is not supported by the 21464. 
Instead, the normal RET instruction is used to return instruction flow to a specified PC and to exit PALmode and SuperPALmode. The RET instruction Rb field specifies an integer general-purpose register (GPR) that holds the target PC. GPR[l :0] specifies the new value of PALmode after the RET is executed, as follows: Table 17-4 GPR[1 :O] Encoding Value Meaning 00 Normal mode 01 PALmode 11 SuperPALmode The only exception is that Normal mode RET instructions cannot cause a transition into PAlmode or SuperPALmode. Only a CALL_PAL instruction, an interrupt, or a trap condition can elevate the mode from Normal mode to PAlmode, and only a PNMI event Compaq Confidential 5 January 2001 ··· Subject To Change Privileged Architecture Library Code 17-5 CMOV Execution Within PAlcode can cause a transition from Normal mode to SuperPAlmode. The implementation actually allows PALmode code to transition to SuperPALmode by using the RET instruction. It is not clear why PALcode would ever do that. Table 17-5 RET Instruction Mode Transitions Old Mode GPR[1 :O] New Mode Normal mode PALmode SuperPALmode Normal mode RET CALL_PAL!frap/Interrupt PALmode RET RET PNMI/RET SuperPALmode RET RET RET In a RET to Native mode, the Rb field is likely to be a PALcode shadow register. In a RET to PALmode, the register may or may not be a PALcode shadow register. It is expected that Ra field of the RET will usually be R31. Normally, the RET instruction succeeds a CALL_PAL instruction, an exception entry, or a BSR subroutine call from within PALmode. Those cases push the return PC onto the prediction stack and subsequently pop that stack to generate a predicted target address. That address is always predicted Native mode, regardless of the circumstances of the push onto the prediction stack and, therefore, all returns to PALmode incur a mispredict. Figure 17-4 RET Instruction Fields 31 RET: 2625 2120 1615 I Opcode I Ra I Rb I Hint Table 17-6 RET Instruction Fields Description Name Extent Description Opcode 31:26 The instruction Opcode. OxlA Ra 25:21 Receives the PC of the instruction following the RET. Rb 20: 16 Holds the target PC of the RET. Hint 15:0 Return predictor stack hints. See Section (I) 4.3.3 of the Alpha SRM for a complete description. 17.4 CMOV Execution Within PALcode Because the shadow register replacement process in PALmode is keyed to different registers numbers for Rb and Re, the 21464 does not correctly replace the inserted reference to Re for native CMOVxxl instructions in PALmode. Legacy CMOV instructions in PALcode are special cased to disable all replacements of Re/Fe. This allows PALcode to modify the architectural registers R24/F24 and R25/ F25 without requiring a special mode to control the PALcode shadow replacement process. Compaq Confidential 17-6 Privileged Architecture Library Code 5 Januc1ry 2001 m Subject To Change PALcode Restrictions and Guidelines The general coding rule is that shadow registers cannot be used as the destination of either a legacy or native CMOV instruction in PALmode. If PALcode needs to modify architectural registers R24/F24 or R25/F25, it must do so in PALmode by using the legacy version of CMOVxx/FCMOVxx. Other uses of either legacy or native CMOV instructions in PALmode are allowed. See Section 2.11.2.5 for complete information about CMOV instruction execution. 17.5 PALcode Restrictions and Guidelines Open questions: 1. Is Restriction 5 necessary? Clarification needed. 2. Is Restriction 6 necessary? Should we doubly map all elements of the DTBWINQ group? 
This would allow us to avoid the DTBWINQ mechanism if there is a bug in it, an IFETCHB would be required in the single miss flow. 3. For Restriction 7, clarify how back-to-back single misses work. 4. For Restriction 16, why is the IFETCHB necessary in the flow? 17.5.1 Restriction 1: PALcode Must Guarantee That IPR Writes Retire Before Returning Use the IFETCHB instruction to guarantee IPR write data is committed before instructions that depend on the IPR value are allowed to proceed. In general, all PALcode flows that write IPRs have an IFETCHB instruction after the last IPR write before returning. Exception: The DTB writer block in DTBM_SINGLE is protected through the DTB Writer In Queue (DTBWINQ) interlock logic. In that case, an IFETCHB is not necessary. 17.5.2 Restriction 2: IFETCHB Required Between IPR Writes in the Same IPR Group There can be at most one in-flight good-path IPR Write for each TPU to each IPR group. IPR writes are speculatively issued but not committed until the HW_MTPR instruction retires. The internal storage that holds the speculative value is shared among all IPRs in a group, so take care to ensure another IPR write does not attempt to overwrite the speculative storage before the first writer retires. The 21464 interlocks the speculative register that grants access to the oldest writer, which ensures that random bad-path code does not alter the speculative value before it is written. However, if a younger goodpath IPR write is issued before an older good-path IPR write in the same speculative group, the younger IPR write might get the data of the older write. PALcode must separate writes to IPRs in the same speculative group with an IFETCHB instruction. 17.5.3 Restriction 3: Mbox IPRs Must be Written Twice to Ensure Correct SlotCompaq Confidential 5 January 2001 ···Subject To Change Privileged Architecture Library Code 17-7 PAlcode Restrictions and Guidelines ting Mbox IPRs must be written twice by consecutive instructions in the same fetch block. Mbox IPR write data is communicated to the Mbox by using the primary address busses. Mbox IPR write instructions that slot to an odd position utilize bus PO and instructions that slot to an even position utilize bus Pl. Mbox IPRs consist of two groups: • Mbox IPRs in speculative group Ml only connect to bus PO and can, therefore, only be written by instructions that slot to odd positions. Writing these IPRs in consecutive instructions in the same fetch block guarantees that one of the writes is slotted in an odd position. Exception: Mbox IPRs in speculative group Ml can be written with only a single write if care is taken to use the map-block alignment instruction to guarantee that the write is slotted in an odd position. • Mbox IPRs in speculative groups M2 and M3 have two copies of each IPR, one connected to bus PO and one connected to bus Pl. Writing these IPRs in consecutive instructions in the same fetch block guarantees that one write is slotted for each bus and both copies are updated. If the instructions were allowed to span a fetch block, the second fetch block could !cache miss, allowing the instructions to possibly map into separate blocks. Without forced alignment, the last instruction in the first block has a 50 percent chance of being even aligned and, since the first instruction of the second block is guaranteed to be even aligned, the rule could be violated. 
17.5.4 Restriction 4: All Instructions in the OTB Writer Block Must be in the Same Map Block The DTB Writer Block consists of the following instructions: HW_MTPR SO-> DTB_TAG, R#l, W#O HW_MTPR SO-> DTB_TAG, R#l, W#l HW_MTPR SI-> DTB_PTE, R#O HW_MTPR SI-> DTB_PTE, R#l, W#l, BB HW_MFPR DTBMS_RET_ADDR -> SI NOP NOP ALIGN_NOP These instructions must be in the same map block beause the DTB Writer In Queue (DTBWINQ) interlock logic assumes that all members of the block are allocated into the queue in the same cycle. The ALIGN_NOP instruction must be in position 7 of a fetch chunk so these eight instructions are also guaranteed to be in the same fetch block. 17.5.5 Restriction 5: All Four OTB MTPR Instructions Must Appear in the Same 17-8 Compaq Confidential Privileged Architecture Library Code 5 Jt1nwiry 2001 -· Subject To Change PAlcode Restrictions and Guidelines Fetch Block If any MTPR to DTB_TAG or DTB_PTE is in a fetch block, all four MTPRs must be in that fetch block. All four MTPR instructions are necessary to write the DTB. These instructions cannot be separated: HW_MTPR SO-> DTB_TAG, R#l, W#O HW_MTPR SO-> DTB_TAG, R#l, W#l HW_MTPR Sl -> DTB_P'fE, R#O HW_MTPRSl -> DTB_P'fE,R#l, W#l,BB .***What hardware case caused this restriction? Assuming Restriction 3 is obeyed, the TAG and PTE writes are correctly paired. Why must the TAG and PTE writes be in the same fetch chunk? 17.5.6 Restriction 6: Non-OTB Writer Block OTBMS_RET_AOOR MFPRs Require IFETCHB H an MFPR From DTBMS_RET_ADDR appears in a PALcode Flow, there must be an IFETCHB before the end of that PALcode flow. MFPRs from DTBMS_RET_ADDR are considered part of the DTB Writer In Queue (DTBWINQ) interlock mechanism. Therefore, PALcode must guarantee that a read from DTBMS_RET_ADDR retires before leaving PALmode. Exception: If the MFPR from DTBMS_RET_ADDR is part of the writer block in a flow that is protected by the DTBWINQ mechanism (that is, DTB Miss Single), the IFETCHB is not necessary ***Is this restriction necessary? We have provided a second mapping for DTBMS_RET_ADDR that has no side effects. What is the impact of not including the IFETCHB? I assume it has to do with the MFPR for the DTBMS_RET_ADDR creating a writer block and somehow effecting the writer block of a subsequent OTB _MISS? We need to clarify this, and explain how normal back-to-back DTB misses do not have this problem. 17.5.7 Restriction 7: IFETCHB Required Between Non-OTB Writer Block OTB Writer Block MxPRs If any DTB writer block instruction appears in a PALcode flow, an IFETCHB is necessary before any subsequent fetch block that contains DTB writer block instructions. Since the DTB writer block instructions are part of the DTB Writer In Queue (DTBWINQ) mechanism, PALcode must guarantee that only one DTB Writer Block can be in flight at a time. Therefore, PALcode must issue an IFETCHB between fetch blocks containing DTB Writer instructions. ***How does PALcode accomplish this? How can it know that back-to-back DTB_SINGLE flows are separated by a IFETCHB? 17.5.8 Restriction 8: Padding Required Between OTB Writer Block and OTBCompaq Confidential 5 January 2001 ~· Subject To Change Privileged Architecture Library Code 17-9 PALcode Restrictions and Guidelines Dependent Instructions OTB-dependent instructions must not be allowed to map the cycle after a OTB writer sequence maps. The implementation does not recognize load or store instructions that allocate into the Instruction Queue in the cycle immediately following the DTB Writer Block as DTBdependent. 
Those instructions could thus issue before the DTB Writer Block has modified the speculative DTB entry or even left the IQ, causing a second DTB miss before the first has been satisfied. Padding the DTB writer flow with enough NOPs to fill the succeeding map block guarantees that memory operations after the DTB writer flow are not allocated on the cycle following the DTB Writer Block. Note: This issue should only apply to the DTBM_SINGLE_CONS and DTBM_SINGLE flows because they are the only flows expected to rely on the DTB WRT interlock mechanism. 17.5.9 Restriction 9: PALcode Must Not Allow Writes INVALID DTB_PTE Entries to Retire PALcode must explicitly check the value being written to the DTB PTE and ensure bit<O> is set. Ifbit<O> is not set, PALcode must branch away. Since the 21464 predicts all branches in PALmode as not-taken, the MTPR speculatively issues, but is killed when the branch resolves. The valid bit of the DTB_PTE entry is used as a completion condition. Invalid PTE entries do not complete and, therefore, preserve the DTBWRT interlock. If a invalid DTB_PfE write was not killed and attempted to retire, the machine would hang. Note: Writing R31 to the DTB_PTE has a slightly different effect. The 21464 completes the write of R3 l, so the machine does not hang, but the DTBWRT block is also lifted and subsequent loads and stores are allowed to issue. 17.5.10 Restriction 10: TAG and PTE Must be Written as Pairs with TAG Writes Before PTE Writes The TAG (OTB_TAG/ITB_TAG) and PTE (OTB_PTE/ITB_PTE) must be written together with the tAG being ·written before the PTE. The TAG and PfE IPRs represent two fields of a single PfE and, therefore, updates must be atomic - hence the pairing requirement. Also, the PTE field might contain granularity hint information that modifies the contents of the TAG field; therefore, the updates must occur in the order of TAG, and then PfE. Otherwise the TAG value might not correctly reflect the granularity hint data in PTE, potentially causing multiple matches and electrical contention in the TB. To ensure ordering of the writes, the write the PfE must be Reader Class dependent on the write to the TAG. 17.5.11 Restriction 11: Register-Dependent MTPRs Must Not Have Read Class Compaq Confidential 17-10 Privileged Architecture Library Code 5 Jc1m1c1ry 2001 -·Subject To Change PALcode Restrictions and Guidelines Dependent MxPRs HW_MTPR and HW_MFPR instructions must never have reader class issue dependencies on any HW_MTPR possessing a register dependency (direct or indirect) on a load unless they share the same register dependency. The reason for this rule is poison, because poison is communicated through physical register dependencies, while Reader/Writer Class dependencies are independently transmitted through INum dependencies (which are translated into queue entry dependencies in the Instruction Queue). Consider the following instruction sequence: LD (Sl)-> SO HW_MTPR SO -> IPRx, W#O HW_MTPR S 1 -> IPRy, R#O If the load misses, the HW_MTPR from SO is poisoned. In principal, the HW _MTPR from Sl should also be poisoned, but isn't, because it has no physical register dependency on the first HW_MTPR. It is free to issue, and may do so long before the load is satisfied and the first HW_MTPR is replayed. The two HW_MFPRs effectively issue out of order, violating the Reader/Writer class dependency. Recall that writes to the speculative DTB entry must occur in the order of DTB_TAG, and then DTB_PfE. 
Now consider this hypothetical PALcode flow: LD (Sl)-> SO HW_MTPRSO->DTB_TAG, W#O HW_MTPR Sl -> DTB_PTE, R#O If the load misses, the MBox ignores the poisoned write to DTB_TAG, allows the write to DTB_PTE, and then allows the second write to DTB_TAG, possibly causing the speculative DTB entry to be incorrect. Note that if the value used in the HW_MTPR to DTB_TAG is sourced by a HW_MFPR, the HW _MTPR cannot be poisoned. This is the case in the DTBM_SINGLE flow. Also note that this rule does not apply to Reader/Writer Class retire dependencies, because any poison cases have been resolved by the time the HW_MTPR makes its retire-time bubble. If there must be a Reader Class dependency on an MTPR that has a register dependency on a load, an artificial register dependency can be created so that if the MTPR is poisoned, the Reader Class dependent MTPR or MFPR is also poisoned. For example: LD (Sl)-> SO XOR Sl, SO-> Sl XOR Sl, SO-> Sl HW_MTPR SO -> IPRx, W#O HW_MTPR Sl -> IPRy, R#O 17.5.12 Restriction 12: CMOV instructions Cannot Specify PALcode Shadow Registers as Destinations 5 January 2001 ···Subject To Change Compaq Confidential Privileged Architecture Library Code 17-11 PAlcode Restrictions and Guidelines Because of the shadow register overlay rules and the way a CMOV is split into two instructions, PALcode shadow registers cannot be used as the destination (Re) of any CMOV instruction. Legacy CMOV instructions in PALmode are special cased to disable shadow register replacements of Re. This allows PALcode to use a legacy CMOV to modify the architectural registers R24 and R25 without requiring a special mode to disable the PALcode shadow replacement process. 17.5.13 Restriction 13: PALmode Native CMOV Instructions Cannot Specify R24 or R25 as Destinations The 21464 does not have a "mode" bit to enable/disable shadow register replacement. Instead, the 21464 implements the following position-dependent replacement policy. R22 R23 so Sl Operand A Operand B Destination R24 so R25 Sl so Sl The SrcA and SrcB operands have different mappings for shadow registers to make it easy for PALcode to read from any register, shadow or architectural. The SrcA operand and the destination have the same mapping to ensure the correct handling of STx_C, where Ra specifies both SrcA and Destination. For Native CMOV instructions, the 21464 replaces the original instruction: CMOVxx Ra, Rb-> Re With: CMOVxl Ra, Re-> Re CMOV2 Re, Rb-> Re Because the SrcB operand is keyed to a different replacement policy than the destination, the CMOVxl part of the instruction sequence fails to replace the SrcB operand with a the correct shadow register. If PALcode needs to modify architectural registers R24 or R25, it must do so in PALm- ode by using the legacy version of CMOVxx. Other uses of legacy style CMOV instructions in PALmode are allowed. Other uses of native the 21464 style CMOV instructions in PALmode are allowed. 17.5.14 Restriction 14: PALmode JMP Instructions Must be Followed by IFETCHB To provide PALmode with a nonspeculative jump instruction, all JMP instructions in PALmode predict to the next instruction in the flow and always cause a jump mispredict trap. For example, the following code sequence in PALmode behaves as follows. JMP Sl IFETCHB Compaq Confidential 17-12 Privileged Architecture Library Code 5 J~1m.u1ry 2001 -·Subject To Change PALcode Restrictions and Guidelines The JMP predicts to the IFETCHB, which prevents further speculation by inhibiting fetching. 
When the JMP issues, it causes a jump mispredict trap, which causes a kill and redirects the machine to the true jump target. Without the trailing IFETCHB, the 21464 speculates past the JMP, possibly leading to a trap on the bad path and corruption of implicitly written IPRs.

Note: An IFETCHB after a JMP does not satisfy the requirement for an IFETCHB prior to a return from PALcode (Restriction 2). The IFETCHB that follows a PALmode JMP never reaches its retire point and is killed by the JMP mispredict. It cannot serve as the required retire barrier since it is guaranteed to be on the bad path. Also note that this rule applies only to JMP, not to the other jump instructions (JSR, JSR_COROUTINE, and RET).

17.5.15 Guideline 15: No Push or Pop Instructions in the First Fetch Block of a PALmode Flow

During the cycle when the first eight instructions of a PALmode flow are fetched, the Ibox is busy writing the trap return address to the return stack and cannot write the return address for the push (BSR, JSR, JSC, CALL_PAL) or pop (RET, JSC) instruction. Violating this rule degrades performance because the stack order is broken and future return instructions are almost guaranteed to mispredict. Doing a superfluous push without a corresponding RET, such as a BSR to the next instruction, can repair the damage to the return stack. The first return mispredicts, but the rest of the return stack is not corrupted.

17.5.16 Restriction 16: PALmode MT_FPCR Must be Followed by IFETCHB

PALcode must guarantee the retirement of any MT_FPCR instruction in PALcode prior to the return of control to native code, and prior to the issuing of another PALmode MT_FPCR or MF_FPCR, or any PALmode instruction that implicitly reads the FPCR. The Floating-Point Control Register (FPCR) is a special form of Speculative-Committed IPR (SCIPR); the speculative value is only committed to the architectural FPCR when the MT_FPCR instruction that wrote it becomes retireable. The FPCR is explicitly read by an MF_FPCR instruction and implicitly read by all floating-point instructions.

**Peter: What makes it special? All SCIPRs commit when retireable.

MT_FPCR instructions in native mode cause a PALcode trap to the MT_FPCR entry point, which consists of the following code:

    HW_MFPR EXC_ADDR -> S1
    IFETCHB
    RET S1

The IFETCHB causes a next-to-retire event that is younger than the MT_FPCR, causing its value to be committed.

***Peter: Why is the IFETCHB necessary here? The user-mode MT_FPCR commits at retire time and kills any in-flight FP instructions. Isn't the trap alone enough to ensure correct state?

MT_FPCR instructions in PALmode do not trap, which means that the values in EXC_ADDR and S1 do not need to be safeguarded before executing an MT_FPCR. However, as a result, any PALcode that includes an MT_FPCR must execute an IFETCHB prior to exiting the flow. If there are any MT_FPCR, MF_FPCR, or floating-point instructions in a PALcode flow after an MT_FPCR, there must be an intervening IFETCHB to ensure that the MT_FPCR value has been committed before the reading/writing instructions issue.

Issues

* The 21264 had the behavior that an implicitly written register would read as zero if read while being written. Will the 21464 have the same behavior? Should we define a valid bit in each of the implicitly written registers to explicitly flag this case? All registers except VA have bit<63> available.
18 Initialization and Configuration

19 Performance Monitoring

The 21464 provides the most performance monitoring hardware of any Alpha implementation to date. The goal of the 21464 performance monitoring hardware is to provide information about the running CPU in order to:

1. Drive profiling-directed-feedback optimizations to improve application performance.
2. Enable the development of useful performance monitoring software tools.
3. Assist in post-silicon chip and system debug.
4. Provide information to enable more intelligent OS scheduling of processes onto TPUs.
5. Provide architectural feedback for future Alpha microprocessor and system implementations.

This chapter consists of two main sections. The first details the implementation of an instruction-based profiling algorithm called ProfileMe. The second describes performance monitoring hardware for memory addresses that was developed for the 21364 and is being supported by the 21464.

19.1 Instruction Based Profiling

As microprocessors get more and more complicated, the behavior of instructions flowing through the machine is harder to determine with aggregate event-counter style performance monitoring hardware, such as what was implemented in the 21164. Profile-based feedback has become very important in getting the best performance out of complex CPUs. The data obtained from hardware performance monitoring can be used to drive compiler optimizations that increase a CPU's architectural performance. Instruction-based profiling can provide more accurate data about performance bottlenecks in speculative out-of-order microprocessors such as the 21264, the 21364, and the 21464. The basic idea is to enable software to sample fetched instructions randomly, collecting detailed information about each sampled instruction's execution in the machine. This includes information such as the amount of time the instruction spends in each phase of its lifetime in the CPU, as well as performance-impacting events, such as cache misses and branch mispredictions.

In the 21464, we implement a variation of ProfileMe called paired sampling, in which two in-flight instructions can be sampled simultaneously. This provides for better analysis, because events and data that are collected for the two instructions can be correlated. For more information on ProfileMe in general, or on paired sampling specifically, refer to the Micro-30 paper:

Jeffrey Dean, James E. Hicks, Carl Waldspurger, William E. Weihl, and George Chrysos. ProfileMe: Hardware Support for Instruction-Level Profiling on Out-of-Order Processors. In Proceedings of the 30th Annual International Symposium on Microarchitecture, pages 292-302. IEEE, December 1997.

19.1.1 Profiling Methodology

Instruction-based profiling is performed by sampling the dynamic instruction stream running on the 21464. Sampled instructions are chosen at map time based upon a software-programmable IPR (PR_INST_CTL<63:0>) and are monitored while in flight in the CPU. Latencies and events are recorded for two separate instructions into a set of profile record IPRs.
When both instructions have finished utilizing CPU resources, a general interrupt to PALcode is triggered. The general interrupt service routine will read the INTERRUPT-SUMMARY IPR to determine that the interrupt was caused by an instruction profile event. A privileged PAL routine can then read out the associated data for each profiled instruction by reading from the profile record IPRs. In continuous sampling, software would record the data from the current sample and reinitialize the PR_INST_CTL IPR to begin the process for selecting the next pair of sampled instructions. The hardware provides two countdown values which determine the selection of the first and second profile instructions. In the normal profile mode, the countdown values indicate the number of valid (that is, having at least one valid instruction) map blocks from the specified profiled TPUs that are counted before the profiled instruction is assigned. In a special profile mode, called "PC trigger mode", the countdown values represent the number of valid map blocks from the specified profiled TPUs after instructions correspondh1g to a specified PC have been mapped. Due to hardware constraints, the countdown values actually represent the number of valid map blocks plus some small constant number of valid map blocks (currently 10). 19.1.2 Initiating an Instruction Profile Sample To setup a profile sample, software calls a PAL routine that writes to the PR_MEM_EVENT_CTL and (possibly) the PR_TRIG_PC IPRs, followed an IFETCHB and PR_INST_CTL. The PAL routine should first check that an existing profile is not already underway. It can do this by reading the PR_I_INFO IPR and exa.111ining the "outstanding" bits, PR_I_INF0<63> and PR_I_INF0<31>, which indicate whether either the first or the second profiled instruction from a prior profiling attempt is still outstanding. The outstanding bits should both be clear when a profile is complete, and it is safe to read the profile record and start a new sample. The profiling control IPRs: PR_INST_CTL, PR_MEM_EVENT_CTL, and PR_TRIG_PC are specified in Table 19-1. Compaq Confidential 19-2 Performance Monitoring 5 J(1nuary 2001 ~- Subject To Change Instruction Based Profiling Table 19-1 Control IPRs for Instruction-Based Profiling IPR Field Bits Description I_EVENT_CTL 63-60 Specifies the event to be counted by the IAGG_EVENTn IPRs between the retire/kill of the first profiled instruction and the retire/ kill of the section profile instruction. The events are listed in Table 19-10. PROFILING_EN 59 Profiling Enabled. This will create a profiled sample according to the fields specified below. PC_TRIG_MODE 58 PC Trigger Mode Enabled. PRO_CTR will not start countdown until instructions corresponding to the PR_TRIG_PC for one of the PR_TPUS have been mapped. PRl_POS 57-55 Profile Second Map-Block Position Contains the exact position of the second profiled instruction in the map-block selected by the PR l_CTR. Two bits per instruction are delivered with each map-block to the Pbox. The bits correspond to whether each instruction is the first or second profiled instruction. PRl_CTR 54-31 Profile Interval for the second instruction Contains an unsigned number between 0 and 16M-1 that represents the number of metered map-blocks between the profile 0 and profile 1 instructions. A counter is initially written with the PRl_CTR value when the PR_INST_CTL IPR is written. 
When the first interval's counter reaches 0, the second interval's counter begins decrementing its value, again based on metering map-blocks from all appropriate TPUs specified in PR_TPUS. The profile 1 instruction of the pair is chosen from the first valid map-block sent to the Pbox after the second interval counter reaches 0. If the PC_TRIG_MODE is enabled, the countdown does not begin until one of the PR_TPUs has mapped instructions that correspond to the PR_TRIG_PC. PRO_POS 30-28 Profile First Map-Block Position Contains the exact position of the first profiled instruction in the mapblock selected by the PRO_CTR. PRO_CTR 27-4 Profile Interval for the first instruction Contains an unsigned number between 0 and 16M-1 that represents the number of metered map-blocks. A counter is initialized to the PRO_CTR value when the PR_INST_CTL IPR is written. Each cycle that the Ibox 's collapsing buffer sends a map-block with any valid instructions for a TPU specified by PR_TPUS to the Pbox, the counter is decremented. The first profiled instruction of the pair will be chosen from the first valid map-block that is being sent to the Pbox after the first interval counter reaches 0. If the PC_TRIG_MODE is enabled, the countdown does not begin until one of the PR_TPUs has mapped instructions that correspond to the PR_TRIG_PC. PR_INST_CTL<63 :0> Compaq Confidential 5 January 2001 - Subject To Change Performance Monitoring 19-3 Instruction Based Profiling Table 19-1 Control IPRs for Instruction-Based Profiling IPR Field Bits Description PR_TPUS 3-0 Specifies the TPUs to be selected for profiling. Only map-blocks for specified TPUs are metered and only instructions from those TPUs are profiled. If the value is 0000, no TPU will be profiled. Different instructions in a profiled pair can come from different specified TPUs. M_EVENT_CTL 63-60 Specifies the event to be counted by the MAGG_EVENTn IPRs between the retire/kill of the first profiled instruction and the retire/ kill of the section profile instruction. The events are listed in Table 19-10. Reserved 59-0 Reserved for future use. Not currently used. Reserved 63-52 Reserved for future use. Currently, writing to these bits does not cause any defined action in the chip. Match PC 51-5 These bits are used to compare against bits 51:5 of the PCs of valid mapped instructions for the PR_TPUS. A match will start the MAJ and MIN countdown counters if PC_1RIG mode is set in the PR_INST_CTL IPR. If the PC_TRIG_MODE is not set in the PC_INST_CTL IPR the counters will countdown without reguard to the Match PC. Mapped instructions on a badpath (instructions that are mapped but never commit to machine state (as in a branch mispredicted path) will also trigger the match. Reserved 4:0 Reserved for future use. Currently, writing to these bits does not cause any defined action in the chip. This implies that bits 4:0 are don't care in the PC comparison. PR_MEM_EVENT_CTL<63:0> PR_TRIG_PC<63:0> Considerations: • The instruction position indicators (PRO_POS, PRl_POS) may be assigned to a map-block instruction position that does not contain a valid instruction. This will be reflected in the profile record, and will result in some invalid samples. The samples do provide some meaningful information, however, which is that the map-block was not full. • If the value in the second interval counter (PRl_CTR) is initially 0, the two profiled instructions will be chosen from the same map-block. 
Note, however, that PRO_POS and PRl_POS can point to any position in the map-block, so the second ProfileMe instruction could be before the first in program order. Also, if PRO_POS and PRl_POS indicate the same position in the map-block, the hardware collects two records of the SAME instruction. It is up to software to avoid these effects if they are unwanted. The I_EVENT_CTL and M_EVENT_CTL fields in the controlling IPRS listed above pertain to the use of the IAGG_EVENTn and the MAGG_EVENTn IPRs which are described in Section 19.1.3.3. Certain events per TPU in the window between when the first profile instruction is retired or killed and when the second profile instruction is retired or killed can be counted. During on paired profile sample, one Ibox event and Compaq Confidential 19-4 Performance Monitoring 5 Januc1ry 2001 - Subject To Change Instruction Based Profiling one Mbox event can be counted together, per TPU. The events that are counted are determined by the value in the I_EVENT_CTL and M_EVENT_CTL fields. Those event designations are listed in the following table: Table 19-2 IAGG_EVENT and MAGG_EVENT IPRs Event Counted Value I_EVENT_CTL M_EVENT_CTL 0000 !Cache Misses 0001 !stream Scache Misses 0010 Line Mispredicts 0011 !stream misses serviced by remote Rambus 0100 Way Mispredicts 0101 Squashes 0110 Squash Mispredicts 0111 !Cache read bank conflicts 1000 !Cache fill bank conflicts 1001 Total postmap exceptions 1010 Retired Instructions 1011 ITB Misses 1100 Branch Mispredicts 1101 Jump/Return Mis predicts 1110 Mapped Instructions 1111 Load/Store Order Traps (tentative/temporary) 0000 Total Retries 0001 Total Traps 0010 Synonym Traps 0011 Store/Load Order Traps 0100 Dcache Misses 0101 Dstream Scache Misses 0110 DTB Misses 0111 Bad End Inum Retries 1000 Wrong Size Retries 1001 Reserved 1010 Reserved 1011 Reserved 1100 Dcache Bank Conflict Retries Compaq Confidential 5 January 2001 ··· Subject To Change Performance Monitoring 19-5 Instruction Based Profiling Table 19-2 IAGG_EVENTand MAGG_EVENT IPRs Event Counted Value 1101 Reserved 1110 Dstream misses serviced by local Rambus 1111 Reserved 19.1.3 Instruction Profile Record IPRs Several IPRs collect information about the selected profiled instructions. The data consists of events, addresses, and various timing information about each profiled instruction's execution in the CPU. 19.1.3.1 Data/Event IPRs The program counter, address space number (ASN) and TPU identifier of each profiled instruction are recorded in the PRO_PC<63:0> and PR1_PC<63:0> IPRs. A valid bit associated with each IPR is set to indicate whether the profiled instruction was part of a valid position in a valid map-block. Finally, an additional bit indicates whether the instruction is PALcode. When the instruction is PALcode, the associated PC will be a physical address, and the ASN is irrelevant. Table 19-3 Fields in the PRO_PC<63:0> and PR1_PC<63:0> Field Extent Description Reserved 63 Not yet assigned. Valid 62 The profiled instruction was from a valid instruction in a valid map block. ASN 61-54 The address space number for the profiled instruction, if it v1as not in PALmode TPU 53-52 The encoded TPU ID of the profiled instruction PC 51-2 The PC (normally virtual) of the profiled instruction (bits 1 and 0 are always clear and not recorded). SuperPalMode PalMode Indicates the profiled instruction was taken in a special debug mode of the 21464 0 Indicates the profiled instruction was a PAL instruction in privileged mode. 
If this bit is set, it also indicates that the PC field is a physical address Compaq Confidential 19-6 Performance Monitoring 5 J<·u1uc1ry 2001 m Subject To Change Instruction Based Profiling Events and data associated with the profiled instructions during the fetch, map, and queue stages are recorded in the PR_I_INF0<63:0> and PR_Q_INF0<63:0>. These IPRs record events for both profiled instructions. Table 19-4 Fields in PR_l_INF0<63:0> Name Extent Description For the Profile1 Instruction PRl OUTSTANDING 63 Indicates that the profile 1 instruction is outstanding. The bit is cleared, when the profile 1 instruction is either killed or retired. PRlSLOT 62 Indicates which icache access (up to two per cycle) the profile 1 instruction was fetched with. PRl LGHIST 61-56 Indicates the latest 6 bits of the lghist of the branch predictor corresponding to its 3-slot-old index that was used to access the branch predictor the cycle that a branch in the profiled instruction's block would have been predicted. The higher numbered bits are older and the lower numbered bits are younger. RESERVED 55-53 Not yet assigned. PRl CAUSED KILL 52 The profiled instruction caused a kill. This means that if the instruction was a branch, the kill was a branch mispredict, if it was a jump, it was a jump mispredict, if it was a return, it was a return stack error, or if it was a load, a memory trap occurred. PRl SSID 51-47 The store set id of the profile 1 instruction. Only valid if PRl SSID [51] is set and this instruction is a load or a store. PRl BPRED 46 Indicates whether the profiled instruction was a PC changing instruction, ie, either a predicted taken branch or an unconditional branch or jump or callpal. For a conditional branch, this bit indicates the branch prediction. PR 1 RETIRED 45 Indicates that the profile 1 instruction was retired. If PRl RETIRED is set, the profiled instruction retired; if clear, the profiled instruction was killed. PR 1 LP BNK CONF 44 Indicates that the fetch block for the profile 1 instruction was delayed by 1 cycle due to a conflict with a training write into the line predictor. PRl EXCPREST 43 Indicates that the profile 1 instruction was in the first block fetched after an exception restart in the Ibox (branch mispredict, etc). PRlICFBNKCNF 42 Indicates the profile 1 instruction was delayed by 1 cycle because an attempt to fetch it from the icache failed due to a conflict with an icache fill into the same Icache bank. PRl IC BNK CNF 41 Indicates the profile 1 instruction was delayed by 1 cycle because an attemp to fetch it from the icache failed due to a conflict with an icache read for another icache fetch in the same cycle. PR 1 SQSH MISP 40 Indicates the profile 1 instruction had a line mispredict due to an incorrectly predicted squash. PRlSQSH 39 Indicates that the profile linstruction was delayed for 1 cycle due to a squash. Compaq Confidential 5 January 2001 --·Subject To Change Performance Monitoring 19-7 Instruction Based Profiling Table 19-4 Fields in PR_l_INF0<63:0> Name Extent Description PRl WAYMISP 38 Indicates the profile 1 instruction was delayed for either 4 or 5 cycles due to an lcache way mispredict. PRl LOCAL RAM 37 Indicates that the profile 1 instruction was an icache miss, and an scache miss, and hit in the local RAMBUS memory (not a router request). PRl LINEMISP 36 Indicates the profile 1 instruction was delayed for 2 or 3 cycles due to a line predictor mispredict. 
PRl SC MISS 35 Indicates that the profile 1 instruction was an icache miss and an scache miss PRl ITB ENA 34 Indicates that the profile 1 instruction was an icache miss, and was delayed an extra 4 or 5 cycles due to the micro TB being out of date. This bit also indicates that the main 128 entry ITB was utilized to translate the PC from a VA to a PA. PRl_ICMISS 33 Indicates that the profile 1 instruction was an icache miss. PRl_SLOTl 32 Indicates that the original icache fetch for the profile 1 instruction was a slot 1 fetch. This can help to determine penalties for line mispredicts and way mispredicts. It this bit is clear, an indicated line mispredict caused a 2 cycle delay, and a way mispredict caused a 4 cycle delay. If the bit is set, an indicated line mispredict caused a 3 cycle delay, and a way mispredict caused a 5 cycle delay. Note these delay assumtions may be inaccu ate for a number of reasons, but should be right in the common cases. For the ProfileO Instruction: PRO OUTSTANDING 31 Indicates that the profile 0 instruction is outstanding. The bit is cleared, when the prnfile 0 instruction is eit.11er killed or retired. PRO SLOT 30 Indicates which icache access (up to two per cycle) the profile 0 instruction was fetched with. PRO LGHIST 29-24 Indicates the latest 6 bits of the lghist of the branch predictor corresponding to its 3-slot-old index that was used to access the branch predictor the cycle that a branch in the profiled instruction's block would have been predicted. The higher numbered bits are older and the lower numbered bits are younger. RESERVED 23-21 Not yet assigned. PRO CAUSED KILL 20 The profiled instruction caused a kiiL Tnis means that if the instruction was a branch, the kill was a branch mispredict, if it was a jump, it was a jump mispredict, if it was a return, it was a return stack error, or if it was a load, a memory trap occurred. PRO SSID 19-15 The store set id of the profile 0 instruction. Only valid if PRO SSID[19] is set and this instruction is a load or a store. PROBPRED 14 Indicates whether the profiled instruction was a PC changing instruction, that is, either a predicted taken branch or an unconditional branch or jump or callpal. For a conditional branch, this bit indicates the branch prediction. Compaq Confidential 19-8 Performance Monitoring 5 Januc1ry 2001 ··· Subject To Cfumge Instruction Based Profiling Table 19-4 Fields in PR_l_INF0<63:0> Name Extent Description PRO RETIRED 13 Indicates that the profile 0 instruction was retired. If PRO RETIRED is set, the profiled instruction retired; if clear, the profiled instruction was killed. PRO LP BNK CONF 12 Indicates that the fetch block for the profile 0 instruction was delayed by 1 cycle due to a conflict with a training write into the line predictor. PRO EXCP REST 11 Indicates that the profile 0 instruction was in the first block fetched after an exception restart in the Ibox (branch mispredict, etc). PRO ICF BNK CNF 10 Indicates the profile 0 instruction was delayed by 1 cycle because an attempt to fetch it from the !cache failed due to a conflict with an !cache fill into the same !cache bank. · PRO IC BNK CNF 9 Indicates the profile 0 instruction was delayed by 1 cycle because an attempt to fetch it from the icache failed due to a conflict with an icache read for another icache fetch in the same cycle. PRO SQSH MISP 8 Indicates the profile 0 instruction had a line mispredict due to an incorrectly predicted squash. 
PROSQSH 7 Indicates that the profile 0 instruction was delayed for 1 cycle due to a squash. PROWAYMISP 6 Indicates the profile 0 instruction was delayed for either 4 or 5 cycles due to an !cache way mispredict. PRO LOCAL RAM 5 Indicates that the profile 0 instruction was an icache miss, and an scache miss, and hit in the local RAMBUS memory (not a router request). PRO LINE MISP 4 Indicates the profile 0 instruction was delayed for 2 or 3 cycles due to a line predictor mispredict. PRO SC MISS 3 Indicates that the profile 0 instruction was an icache miss and an scache miss PROITB ENA 2 Indicates that the profile 0 instruction was an icache miss, and was delayed an extra 4 or 5 cycles due to the micro TB being out of date. This bit also indicates that the main 128 entry ITB was utilized to translate the PC from a VA to a PA. PRO_ICMISS 1 Indicates that the profile 0 instruction was an icache miss. PRO_SLOTl 0 Indicates that the original icache fetch for the profile 0 instruction was a slot 1 fetch. This can help to determine penalties for line mispredicts and way mispredicts. It this bit is clear, an indicated line mispredict caused a 2 cycle delay, and a way mispredict caused a 4 cycle delay. If the bit is set, an indicated line mispredict caused a 3 cycle delay, and a way mispredict caused a 5 cycle delay. Note these delay assumtions may be inaccu ate for a number of reasons, but should be right in the common cases. Compaq Confidential 5 January 2001 ···Subject To Change Performance Monitoring 19-9 Instruction Based Profiling Table 19-5 Fields in PR_Q_INF0<63:0> Field Bits Description For the Profile1 Instruction: Reserved 63-45 Not yet assigned PR 1 INSTRS MAPPED 44-41 Indicates the number of valid instructions in the profile 1 instruction's map block. PRl GRANTCNT 40-36 Indicates the number of times the profile 1 instruction was granted for execution. If the number is greater than 1, it indicates that the profile 1 instruction was poisoned and reexcecuted (this value - 1) times. PR 1 BRMISS OLD 35 If the profile 1 instruction was a branch that mispredicted, this bit indicates that it was the oldest mispredicting conditional branch in the cycle it mispredicted, thus qualifying it to take the "fastpath" to restart the PC on the goodpath. If it is not set for a mispredicting branch, it indicates that this branch had a several cycles of additional branch mispredict penalty (about 4 or 5). PRl PICKER 34-32 Indicates the number of the Qbox Picker which selected the profile 1 instruction for execution. If the picker number is not the same as the map block position of this instruction (PRl_POS for the profile 1 instruction), it indicates that this instruction followed a parent instruction to another picker. For the ProfileO Instruction: Reserved 31-13 Not yet assigned PRO INSTRS MAPPED 12-9 Indicates the number of valid instructions in the profile 0 instruction's map block. PRO GRANT CNT 8-4 Indicates the number of times the profile 0 instruction was granted for execution. If the number is greater than 1, it indicates that the profile 0 instruction was poisoned and reexcecuted (this value - 1) times. PRO BRMISS OLD 3 If the profile 0 instruction was a branch that mispredicted, this bit indi- cates that it was the oldest mispredicting conditional branch in the cycle it mispredicted, thus qualifying it to take the "fastpath" to restart the PC on the goodpath. 
If it is not set for a mispredicting branch, it indicates that this branch had a several cycles of additional branch mispredict penalty (about 4 or 5). PRO PICKER 2-0 Indicates the number of the Qbox Picker which selected the profile 0 instruction for execution. If the picker number is not the same as the map block position of this instruction (PRl_POS for the profile 0 instruction), it indicates that this instruction followed a parent instruction to another picker. Compaq Confidential 19-1 o Performance Monitoring 5 January 2001 m Subject To Change Instruction Based Profiling Data associated with loads and stores is recorded in the PRO_MEM_INF0<63 :0> and PRl_MEM_INF0<63:0> IPRs. Table 19-6 Fields in PRO_MEM_INF0<63:0> and PR1_MEM_INF0<63:0> Name Extent Description Reserved <63:62> Not yet assigned .. CONTENTION <61> Mbox trap or replay was due to contention with the other profiled instruction. TRP_INV <60> Mbox trap - invalidate speculative load or store TRP_SYN <59> Mbox trap - virtual synonym dependence ignored TRP_SLOO <58> Mbox trap - dependent store-load executed out of order TRP_DTBM <57> Mbox trap - DTB miss. RET_SCM <56> Mbox retry - scache miss RET_BAD_EINUM <55> Mbox retry - bad end inum RET_WRSZ <54> Mbox retry - wrong size Id/st RET_BCNF <53> Mbox retry - dcache bank conflict. RET_DCM <52> Mbox retry - dcache miss. VA <51:0> The virtual address of the profiled instruction if it was a load or store. The S cache, memory controller and router can be invoked for loads and stores that miss in the first level Dcache. Two IPRs, PRO_DMISS_INF0<63:0> and PRl_DMISS_INF0<63:0> collect latency and event information for the profiled instructions that generate activity in the Cbox. Table 19-7 Fields in PRO_DMISS_INF0<63:0> and PR1_DMISS_INF0<63:0> Name Extent Description Reserved <63:61> Not yet assigned. ROUTER_DEST <60:52> Processor id of a memory request for the profiled instruction that is serviced remotely (ie, another processor is the home node) LAT_SNAP_FWD <51:43> For a local RAM access, the number of cycles from the snapshot point until the data is forwarded. LAT_DENQ_SNAP <42:35> For a local RAM access, the number of cycles from DIFT enqueue until the snapshot point. RAM_RAS <34> Indicates the bank for the profiled instruction's request had to perform a row access strobe, ie, this request was not a page hit in the local RAM. For local RAM requests only. RAM_PRCHG <33> Indicates the bank for the profiled instruciton' s request had to precharge. For local RAM requests only. DIR_CACHE_HIT <32> Indicates the profiled instruction's request hit in the directory cache. LCL_RMBS <31> If the profiled instruction results in an scache miss, this bit indicates whether the data was resident in the local memory. Compaq Confidential 5 January 2001 ··· Subject To Change Performance Monitoring 19-11 Instruction Based Profiling Table 19-7 Fields in PRO_DMISS_INF0<63:0> and PR1_DMISS_INF0<63:0> Name Ex.tent Description COHER_CNT <30:26> Records the number of invalidates the profiled instructions memory request must await before being allowed to obtain ownership of the requested block. MAF_ALC_DLC <25: 16> Records the number of cycles the profiled instruction's MAF entry was allocated to the time when it was deallocated. PMAF_LAT_RPL <15: 12> Records the number of cycles the profiled instruction's preMAF request waited for the Scache due to replays. PMAF_LAT_PRB <11:8> Records the number of cycles the profiled instruction's preMAF request waited for the Scache due to probes. 
PMAF_LAT_ICM <7:4> Records the number of cycles the profiled instruction's preMAF request waited for the Scache due to Icache miss requests.
PMAF_LAT_FILL <3:0> Records the number of cycles the profiled instruction's preMAF request waited for the Scache due to other fills.

19.1.3.2 Timeline/Latency IPRs

The 21464 keeps an internal running cycle counter whose value is recorded when certain events happen to a profiled instruction. A sequence of recorded counter values creates a timeline of a profiled instruction's execution in the CPU. For example, the counter is recorded into a timeline register field when the instruction fetch unit first tries to fetch the profiled instruction. The counter is recorded into another timeline register field when the instruction is retired. By subtracting the two recorded counter values, software can determine how long a profiled instruction was in flight in the CPU. A complete timeline for an instruction gives insight about the performance bottlenecks in the CPU. Figure 19-1 illustrates the timeline that is captured for each profiled instruction.

Figure 19-1 Captured Timeline for Each Profiled Instruction
[Figure: the timeline points and the number of counter bits recorded at each - Fetch (Last Valid Map): 32, Map: 16, IQ Alloc: 16, Data Ready: 16, Bid Enable: 16, Grant: 16, IQ Dealloc: 16, IQ Chunk Dealloc: 16, Retireable: 16, Killed or Retired: 48.]

In addition to the basic timeline described above, the timeline IPRs also record the time that the last valid instruction before the profiled instruction, and in the same TPU, became retireable. This is useful to determine whether the profiled instruction's execution delayed the executing program, and if so, by how many cycles. If the prior instruction's retireable time and the profiled instruction's retireable time are the same, the profiled instruction did not directly contribute to the execution time of the running program. If, on the other hand, the prior instruction's retireable time is earlier than the profiled instruction's, speeding up the processing of the profiled instruction could increase the performance of the running program.

The sizes of the Fetch and Retire timeline fields are intentionally large, to ensure that in the event of an instruction that was very slow to execute, the total time from Fetch to Retire can still be computed. The 32-bit Fetch field and the 48-bit Retire field also allow software to determine the time between the two profiled instructions. The upper 16 bits of the Retire field allow software to determine the time between two pairs of samples. So, the latency between when the first profiled instruction retired and when the second profiled instruction retired can be calculated simply by subtracting the two retired timeline snapshots. The same running cycle counter is recorded for both profiled instructions, which provides this feature.

Due to poisoning (see the Mbox/Qbox documentation), instructions can actually pass through some of the timeline points more than once. If this happens, the register field(s) corresponding to the recurring events are simply set more than once. This yields the correct data, because the earlier event occurrences were due to a misspeculation; for example, an instruction may appear data ready at one point without actually being so, and the later, correct occurrence overwrites the field.

There are four timeline IPRs per profiled instruction. The fields of the timeline IPRs and their meanings are specified in Table 19-8. A brief sketch of how software might combine these snapshots follows.
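As a minimal illustration (not part of the IPR specification), the following C sketch combines the Fetch and Retire snapshots to compute in-flight time, and compares RETIREABLE with PRED_RETIREABLE to estimate the profiled instruction's contribution to program execution time, using the field positions given in Table 19-8 ahead. The read_ipr() accessor and the IPR index constants are hypothetical placeholders; on real hardware these profile record IPRs would be read from privileged PALcode.

    #include <stdint.h>

    /* Hypothetical accessor that returns the 64-bit value of a profile
     * record IPR; the index constants below are placeholders. */
    extern uint64_t read_ipr(int ipr);
    #define PR0_TIMELINE0 0
    #define PR0_TIMELINE2 2
    #define PR0_TIMELINE3 3

    /* Extract 'width' bits starting at bit 'lsb'. */
    static inline uint64_t fld(uint64_t v, int lsb, int width)
    {
        return (v >> lsb) & ((1ull << width) - 1);
    }

    void profile0_latencies(uint64_t *inflight, uint64_t *critical_path)
    {
        uint64_t t0 = read_ipr(PR0_TIMELINE0);
        uint64_t t2 = read_ipr(PR0_TIMELINE2);
        uint64_t t3 = read_ipr(PR0_TIMELINE3);

        /* Fetch snapshot: PRn_TIMELINE0<63:32>.  Retire/kill snapshot:
         * low 32 bits in PRn_TIMELINE0<31:0>, upper 16 bits in
         * PRn_TIMELINE3<47:32> (see Table 19-8). */
        uint64_t fetch  = fld(t0, 32, 32);
        uint64_t retire = (fld(t3, 32, 16) << 32) | fld(t0, 0, 32);

        /* Both values are snapshots of the same running counter, so the
         * in-flight time is their difference; only 32 bits of the fetch
         * snapshot are kept, so mask accordingly. */
        *inflight = (retire - fetch) & 0xffffffffull;

        /* RETIREABLE minus PRED_RETIREABLE (PRn_TIMELINE2<15:0> and
         * <31:16>) estimates how many cycles this instruction added to
         * the program's critical path; both are 16-bit snapshots. */
        *critical_path = (fld(t2, 0, 16) - fld(t2, 16, 16)) & 0xffffull;
    }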
Note that the following timestamps are UNPREDICTABLE for instructions that are invalid (not mapped), decoded as a 21464 NOP, or killed: IQ_ALLOC DATA_READY BID_ENABLE GRANT IQ_DEALLOC Also, the RETIREABLE timestamp is UNPREDICATABLE for instructions that are killed. Table 19-8 PRn_TIMELINE IPRs 1 IPR Bits Description FETCH 63-32 Fetch really implies the earliest time at which the profiled instruction could have been fetched, or the time of the last valid map from one of the PR_TPUs. If there is more than 1 cycle between the last valid map, and the map time of the profiled instruction, a fetch delay of some sort was encountered. The fetch delay could have been due to an icache miss, line mispredict, way mispredict, etc, or just because the map thread chooser chose against the PR_TPUs. The use of the PR_l_INFO IPR data can help to determine the cause of the fetch delay. RETIRE/KILL (bits 31-0 The time when the profiled instruction's map block inum was broadcast on the RK bus as a retire block, OR, the time that a kill that killed the profiled instruction was broadcast on the rk bus. PR_l_INFO will indicate whether the profiled instruction retired, was killed, or caused a kill itself. 63-48 The time when the profiled instruction's map block was driven from the Ibox to the Pbox. Field PRn_TIMELINEO 31-0) PRn_TIMELINEl MAP Compaq Confidential 5 January 2001 ··· Subject To Change Performance Monitoring 19-13 Instruction Based Profiling Table 19-8 PRn_TIMELINE IPRs 1 IPR Field Bits Description IQ_ALLOC 47-32 The time when the profiled instruction's map block was allocated space in the instruction queue. This is usually a fixed delay from the "map" time, unless the instruction queue is backed up and not able to allocate more space. If this happens the profiled instruction's map block can be delayed in the post-map skid buffer. DATA_READY 31-16 The time when the profiled instruction is data ready, that is all of it's source operands (including store sets for loads) were available. BID_ENABLE 15-0 This time is normally the cycle following data_ready for most instructions. The notable exceptions are loads and stores, which could be data ready, but not issue because there are not enough entries in the load or store queues. Also, if bid_enable occured the same cycle as data_ready, it implies that the profiled instruction followed it's parent to another picker. GRANT 63-48 This is more commonly known as "issue time", or the time when the profiled instruction is sent to it's functional unit for execution. If there is more cycles beween grant and bid_enable than the normal fixed delay, it implies that the profiled instruction suffered from functional unit queueing delay. IQ_DEALLOC 47-32 This is the time that an instruction is past it's poison point and will no longer hold up it's queue chunk's iq deallocation time. This also corresponds to the completion unit's notion of "complete". PRED_RETIREABLE 31-16 This is the time that the last valid predecessor instruction, before the profiled instruction, became retireable. The predecessor instruction must be in the same thread (and running on the same TPU) as the profiled instruction. RETIREABLE This is the time that the profiled instruction itself became retireable, that is when it is complete, and all instructions older than it in the same TPU are also complete. It should be greater than or equal to PRED_RETIREAB LE above. 
If this is not the same as PRED_RETIREABLE, it indicates that the profiled instruction contributed to the overall program execution time by the number of cycles of difference between the two. Instructions that have much greater retireable times than PRED_RETIREABLE times point to areas in the program that contribute to significant performance loss. (RETIREABLE occupies bits 15-0 of PRn_TIMELINE2.)

PRn_TIMELINE3
IQ_CHK_DEALLOC 63-48 This is the time that the IQ chunk that the profiled instruction was a part of is deallocated. If this is the same time as the IQ_DEALLOC time in PRn_TIMELINE2 above, it indicates that the profiled instruction was one of, or the only, instruction gating the deallocation of the queue chunk. If the IQ_CHK_DEALLOC time is greater than the IQ_DEALLOC time, it indicates that a different instruction in the IQ chunk was gating the deallocation of the queue chunk. By subtracting the IQ_ALLOC time (PRn_TIMELINE1) from the IQ_CHK_DEALLOC time, software can obtain the total queue chunk lifetime for the profiled instruction's queue chunk. Since the queue chunks are a limited resource, a high average queue chunk lifetime may indicate a performance bottleneck in a running program. The compiler, or a run-time optimizer, may be able to group long-latency instructions together in the same queue chunks, so that other queue chunks with all shorter-latency instructions are deallocated sooner, alleviating the queue chunk resource constraint.
RETIRE 47-32 Upper 16 bits of the retire timestamp.
Reserved 31-0 Not yet assigned.

1 Where n = 0 means the first profiled instruction and n = 1 means the second profiled instruction.

There is an additional latency IPR associated with store processing. The latency counters are only for the first profiled instruction. The data is interesting if both profiled instructions are stores and the latter store is not able to issue because the prior store is clogging store processing. The IPR that holds the latencies is called PR_ST_LATENCY<63:0>.

Table 19-9 Fields in PR_ST_LATENCY<63:0>
Name Extent Boxes Description
Reserved <63:24> - Not yet assigned.
ACK_2_MBFREE <23:16> M Number of cycles between the time the first profiled instruction's merge buffer entry is acknowledged and the time that merge buffer entry is freed.
MB_2_ACK <15:8> M Number of cycles between the time the first profiled instruction is eligible to merge in the merge buffer and the time its entry into the merge buffer is acknowledged.
STQ_2_MB <7:0> M Number of cycles between the time the first profiled instruction enters the store queue and the time it becomes eligible to merge in the merge buffer.

19.1.3.3 Aggregate Event/Data IPRs

The IPRs listed earlier in this section obtain information specifically about the profiled instructions. The 21464 also collects aggregate events in the region of execution between the retires or kills of the two profiled instructions. This allows aggregate event or data information to be collected in a certain region, per TPU. So, over an interval delimited by the retire times of two dynamic instructions, information such as:

Retired Instructions
ITB Misses
Store/Load Order Traps
Scache Misses
The per TPU information can be summed to give total CPU stats. The rate of these events can also be determined by subtracting the retire/kill time of the first profiled instruction from the second. This gives the total number of cycles that the aggregate event counters were monitoring the selected events. Dividing the events by cycles, yields the event rate (eg. Retired instructions per cycle). Repeated profile samples will give multiple data points over the span of a programs execution. In general, fairly infrequent samples can create very accurate data, but, if needed, the samples can be very close together, so as not to miss any statistically significant information using the following algorithm: In PR_INST_CTL the profile PRO_CTR to something small (say 0 or slightly larger to ignore the profiling PAL routine and other software overhead), and the PRl_CTR to something relatively large (say 1-16 Million Map Blocks). When the interrupt is triggered, collect the information, and reset the PR_INST_CTL register again to initiate the next sample. The profiling hardware can only collect two aggregate events per sample, 1 IBOX(lnstruction Unit) related event, and 1 MBOX (Memory Unit) event. A software profiler can alternate between the events to get them all. This should work well for programs whose behavior is repetative. Varying the PRl_CTR, or the order that the events are collected, will avoid missing phasic behavior (eg, If retired instructions and icache misses are alternately collected for l 6M map blocks each, and the program just so happens to have different phases at the same frequency, it would be a mistake to assume that the icache miss rate and the IPC rate are constantly at their measured values. Alternating the order in which they are collected, or varying the PRl_CTR time wiii heip avoid this). If a program does not have repeated behavior, the program can be sampled over several runs to obtain all the data. This can be quite powerful in finding bottlenecks in a running program. A chart of IPC over time will reveal sections of the program that are performing the least. The other aggregate event information can hint at the cause of this diminished performance. Also, the PC's of profiled instructions collected in the regions of low performance, can later become trigger PCs in the PR_TRIG_PC IPR, in order to collect more instruction samples in the regions of diminished performance. The aggregate event counter IPRs are specified in Table 19-10 Compaq Confidential 19-16 Performance Monitoring 5 Jc1nuc1ry 2001 - Subject To CfJange Memory Reference Performance Monitoring Table 19-10 Aggregate Event Counter IPRs IPR Bits Description IAGG_EVENTn 31-0 The aggregate event count chosen by the value written by software into I_EVENT_CTL. The events are listed in Table 19-1. Then pertains to the TPU for which the events are collected. There are a total of 4 32 bit IPRs here, one per TPU. The events are counted between the retire/kill of the first and second profile instructions. MAGG_EVENTn 31 :0 The aggregate event count chosen by the value written by software into M_EVENT_CTL. The events are listed in Table 19-1. Then pertains to the TPU for which the events are collected. There are a total of 4 32 bit IPRs here, one per TPU. The events are counted between the retire/kill of the first and second profile instructions. 19.2 Memory Reference Performance Monitoring The memory reference performance monitoring hardware is identical to that of the 21364. 
While the 21464 designers intend to support the same functionality, this specification may change to reflect architectural differences in the memory subsystem of the two processors. Instead of IPRs, this performance monitoring hardware is controlled and collected via IO mapped CSRs. There are separate sections for the Cbox, Rbox, and Zbox. 19.2.1 Cbox Performance CSRs 19.2.1.1 Cbox Performance Control -CBOX_PRF _CTL<31 :0> Table 19-11 Fields in CBOX_PRF_CTL<31:0> Name Extent Access Description ISTM_SAMP_ENA <31> RW,O Enable istm sampling (on non-abtd bcache lookup) PRF_SAMP_ENA <30> RW,O Enable performance sampling PAGE_MIGR_FAST <29> RW,O Selects between 0 (fastest possible sample (=1)) or 16 (=0) events between migration samples PAGE_MIGR_ENA <28> RW,O Enable page migration sampling WATCH_SEL <27> RW,O Event for watch register to trigger on 0- BC lookup (non-aborted) 1- SYS sent WATCH_ENA <26> RW,O Enable watch register Compaq Confidential 5 January 2001 -·· Subject To Change Performance Monitoring 19-17 Memory Reference Performance Monitoring 19.2.1.2 Cbox Performance Address- CBOX_PRF_ADR<63:0> Table 19-12 Fields in CBOX_PRF_ADR<63:0> Name Extent Access Description PA<42:6> <63:27> RW RW physical address REQPID<lO:O> <25:15> RW Requestor PID OPCODE<7:0> <9:2> RW Network opcode -or- CMAF rdtype w/opcode<7:4> == 0 The performance sample. A sampled watch address due to WATCH_EN locks the register until CBOX_PRF_ADR is written. 19.2.1.3 Cbox Performance Status - CBOX_PRF _STS<25:0> Table 19-13 Fields in CBOX_PRF_STS<25:0> Name Extent Access Description SYS_BYP_USED <25> RO,O System port bypass was used SYS_BYP <24> RO,O Address granted bypass to system port TAG_BYP <23> RO,O Address bypassed from Mbox to BTAG COUPLED <22> RO,O Lookup was CMAF coupled lookup CMAF_HIT <21> RO,O c_cmf -> prq_cmaf_addr_hit_l2a SVAF_HIT <20> RO,O c_cmf-> prq_svaf_addr_hit_l2a BC_DTY <19> RO,O b -> c_blk_dirty_12a BC_SHR <18> RO,O b -> c_blk_shared_12a DMR_BCV <17> RO,O b -> c_dm_reqd_bcv_12a DMR_BCVDC <16> RO,O b-> c_dm_reqd_bcv_in_dc_12a DMR_DCS <15> RO,O b -> c_dm_reqd_dc_syn_12a BC_VLC <14> RO,O b -> c_local_bcv_l2a BC_VSH <13> RO,O b -> c_bcv_shared_12a DC_BCV <12> RO,O b -> c_bcv_in_dcache_l2a BC_VIC <11> RO,O b -> c_bc_victim_l2a DC_SYN <10> RO,O b -> c_dc_synonym_12a BC_HIT <9> RO,O b -> c_bc_hit_l2a DC_VIC <8> RO,O b -> c_dc_victim_l2a OPCODE<7 :0> <7:0> RO,O Network opcode -or- CMAF rdtype w/opcode<7:4> == 0 The status associated with CBOX_PRF_ADR.These bits are reset to zero on either cold or fast reset. 19.2.1.4 Cbox Performance Match - CBOX_PRF _MAT<25:0> Fields are the same as CBOX_PRF_STS, except they are RW. See Section 19.2.1.3) Compaq Confidential 19-18 Performance Monitoring 5 J<"inmiry 2001 -· Subject To Change Memory Reference Performance Monitoring This register provides, in combination with CBOX_PRF_MATV, a way to filter performance events. A set bit in this register means that the CBOX_PRF_CNTevent counter will only increment forperformance events which would set the corresponding bit in CBOX_PRF_STS to the value given by the corresponding bit in CBOX_PRF_MATV. For example, if CBOX_PRF_MAT<25:24> == 3 and ICBOX_PRF_MATV <25:24> == 1, only performance events which were granted a system port bypass and did not use it would be countedby the event counter. 19.2.1.5 Cbox Performance Match Value - CBOX_PRF_MATV<25:0> The fields are the same as CBOX_PRF_STS, except they are RW. See Section 19.2.1.3. Also see the description under CBOX_PRF_MAT, Section 19.2.1.4. 
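As a hedged illustration (not part of the CSR definition), the following C sketch models the match test described above: an event increments CBOX_PRF_CNT only if, for every bit set in CBOX_PRF_MAT, the corresponding CBOX_PRF_STS bit equals the CBOX_PRF_MATV bit. The function names and the way the status word is obtained are assumptions for illustration only.

    #include <stdbool.h>
    #include <stdint.h>

    /* Model of the Cbox filter decision described for CBOX_PRF_MAT/MATV:
     * bits set in 'mat' select which CBOX_PRF_STS bits must match 'matv'. */
    static bool cbox_event_counted(uint32_t sts, uint32_t mat, uint32_t matv)
    {
        return ((sts ^ matv) & mat) == 0;
    }

    /* Example from the text: MAT<25:24> == 3 and MATV<25:24> == 1 count only
     * events that were granted a system port bypass (STS<24> == 1) but did
     * not use it (STS<25> == 0). */
    static bool example_filter(uint32_t sts)
    {
        const uint32_t mat  = 0x3u << 24;
        const uint32_t matv = 0x1u << 24;
        return cbox_event_counted(sts, mat, matv);
    }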
19.2.1.6 Cbox Performance Counter - CBOX_PRF_CNT<31:0>

Table 19-14 Fields in CBOX_PRF_CNT<31:0>
Name Extent Access Description
EVENT_CNT<31:0> <31:0> RW RW performance counter

Software can write this counter to any value to provide any resolution desired. Interrupts are triggered upon carry out of the high bit of the counter, so writing the initial counter value with a large number will cause an earlier interrupt. See the description of CBOX_PRF_MAT (Section 19.2.1.4) for an explanation of how the decision to increment this counter is made.

19.2.2 Zbox Performance CSRs

19.2.2.1 Zbox Performance Counter 0 - ZBOXn_ZPM_CTR0<31:0>

Table 19-15 Fields in ZBOXn_ZPM_CTR0<31:0>
Name Extent Access Description
ZBOX_PERF_CTR0_UND <31> RW Indicates counter underflow.
ZBOX_PERF_CTR0 <30:0> RW Zbox performance counter 0. Decrements when the condition specified by ZPM_CTL0 has been met. A performance counter interrupt will be signalled when the counter underflows.

A 31-bit event counter and an underflow bit. ZPM_CTR0 can be programmed to count one of 32 items related to the Zbox middle. The counter can be preloaded with an initial count via software. When the selected event occurs, the corresponding counter is decremented. When either counter counts below zero, the Zbox will generate a performance-monitor interrupt. Note that only the first underflow causes a performance-monitor interrupt, so the interrupt can be disabled by writing a 1 to the underflow bit. The interrupt occurs on the 0 -> -1 transition, so #events-1 must be loaded into the counters.

19.2.2.2 Zbox Performance Counter 1 - ZBOXn_ZPM_CTR1<31:0>

Table 19-16 Fields in ZBOXn_ZPM_CTR1<31:0>
Name Extent Access Description
ZBOX_PERF_CTR1_UND <31> RW Indicates counter underflow.
ZBOX_PERF_CTR1 <30:0> RW Zbox performance counter 1. Decrements when the condition specified by ZPM_CTL1 has been met. A performance counter interrupt will be signalled when the counter underflows.

A 31-bit event counter and an underflow bit. ZPM_CTR1 can be programmed to count one of 16 items related to the Zbox front-end (DIFT). The counter can be preloaded with an initial count via software. When the selected event occurs, the corresponding counter is decremented. When either counter counts below zero, the Zbox will generate a performance-monitor interrupt. Note that only the first underflow causes a performance-monitor interrupt, so the interrupt can be disabled by writing a 1 to the underflow bit. The interrupt occurs on the 0 -> -1 transition, so #events-1 must be loaded into the counters. A short sketch of this preload calculation follows.
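The following minimal C sketch (not from the specification) illustrates the preload arithmetic described above: to take an interrupt after N occurrences of the selected event, software loads N-1 into the 31-bit counter field. The write_zbox_csr() routine and the CSR offset constant are hypothetical placeholders for whatever IO-mapped CSR access mechanism the platform provides.

    #include <stdint.h>

    /* Hypothetical accessor for an IO-mapped Zbox CSR write. */
    extern void write_zbox_csr(uint64_t csr_offset, uint32_t value);

    #define ZBOXN_ZPM_CTR0  0x0ull          /* placeholder CSR offset          */
    #define ZPM_CTR_UND     (1u << 31)      /* underflow bit <31>              */
    #define ZPM_CTR_MASK    0x7fffffffu     /* 31-bit counter field <30:0>     */

    /* Arm Zbox performance counter 0 so that it underflows (and interrupts)
     * after 'events' occurrences of the event selected in ZPM_CTL0.
     * The interrupt fires on the 0 -> -1 transition, so events-1 is loaded. */
    static void arm_zbox_ctr0(uint32_t events)
    {
        /* Underflow bit left clear so the first underflow raises the interrupt. */
        write_zbox_csr(ZBOXN_ZPM_CTR0, (events - 1u) & ZPM_CTR_MASK);
    }

    /* Per the note above, writing a 1 to the underflow bit suppresses the
     * performance-monitor interrupt, e.g. while the counter is not in use. */
    static void quiet_zbox_ctr0(void)
    {
        write_zbox_csr(ZBOXN_ZPM_CTR0, ZPM_CTR_UND);
    }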
19.2.2.3 Zbox Performance Control- ZBOXn_ZPM_CTL<31 :O> Table 19-17 shows the fields in ZB0Xn_ZPM_CTL<31:0> IPR Table 19-17 Fields in ZBOXn_ZPM_CTL<31 :0> Name Extent Access Description unused<31: 12> <31:12> RO MBZ Compaq Confidential 19-20 Performance Monitoring 5 J(1nuary 2001 - Subject To Change Memory Reference Performance Monitoring Table 19-17 Fields in ZBOXn_ZPM_CTL<31 :0> (Continued) Name Extent Access Description unused<l1:9> <11:9> RW MBZ RW Control for Zbox performance counter 1: ZPM_CTL1<3:0> <8:5> ctl1 Item to Count (ZPM_CTR1) 0000 0001 0010 0011 0100 0101 0110 0111 1000 1001 1001 1010 1011 1100 1101 1110 1111 Incoming transaction (any) Incoming ReadSharedReq Incoming ReadModREq Incoming ReadReq Incoming FetchReq Incoming SharedToDirtyReq Incoming SharedToDirtySTCReq Incoming InvalToDirtyReq Incoming Victim Incoming VictimClean Outgoing Forward (any) Outgoing Forward=InvalSingle Outgoing Forward=InvalMask Outgoing Forward=Read(anytype)Forward Outgoing Forward=FetchForward (tRR, tPP) Outgoing Forward=ItoDForward Forward Miss received Compaq Confidential 5 January 2001 ··· Subject To Change Performance Monitoring 19-21 Memory Reference Performance Monitoring Table 19-17 Fields in ZB0Xn_ZPM_CTL<31:0> (Continued) Name Extent ZPM_CTL0<4:0> <4:0> Access Description RW Control for Zbox performance counter 0: ctlO Item to Count (ZPM_CTRO) 00000 nq_any -- regardless of reject status 00001 nq_prq -- regardless of reject status 00010 nq_rsq -- regardless of reject status 00011 nq_csq -- regardless of reject status 00100 nq_any -- qualified by & !reject 00101 nq_prq -- qualified by & !reject 00110 nq_rsq -- qualified by & !reject 00111 q_csq -- qualified by & !reject 01000 q_any -- qualified by & reject 01001 nq_prq -- qualified by & reject 01010 nq_rsq -- qualified by & reject 01011 nq_csq -- qualified by & reject 01100 nq_rej -- No Fill Buffers available 01101 nq_rej -- Shadow Reject in SHDPND, PDN Interval 01110 nq_rej -- Page-Conflict Reject (tRAS) in HLDPND, PND Interval (R/w) (tRDP) in SHDPND interval (R) (tRP,tRCD) in BLKPND, PND or NQPRPND Interval 01111 nq_rej -- WRB Reject (tRTR, tRTP) 10000 nq_rej -- Queue Full Reject 10001 nq-rej -- NQ' Waterfall prior over DFf-NQ 10010 cmd = dir_only_read 10011 cmd = dir+data_read 10100 cmd = dir_only_write 10101 cmd = dir+data_write 10110 PRER precharge 10111 PREX precharge 11000 PREC precharge 11001 COL=RD 11010 COL=WR 11011 COL=NOCOP 11100 COL=Any 11101 Starvation detections 11110 Force wr. ret 11111 Deferred write retire 19.2.3 Rbox Peformance CSRs This section describes the Rbox performance IPRs. Compaq Confidential 19-22 Performance Monitoring 5 Januc1ry 2001 ·- Subject To Change Addendum: lmplemention Notes 19.2.3.1 Rbox Port Performance Counter- RBOX_n_PERF<27:0> Table 19-18 Fields in RBOX_n_PERF<27:0> Name Extent Access Description PCV<23:0> <27:4> RW Counter value There is a hidden (not software visible) 8-bit register. The hidden register is cleared on every write (by software) to this register. The software-visible counter is only incremented when the hidden register overflows. The counter stops incrementing once it overflows (carry out of bit <27>). An interrupt is also asserted at that point (but may be blocked by the interrupt mask). 
PCC <1:0> RW Counter Control 00 - port utilization (increment for every outgoing used tick) 01 - undefined 10 - #of message bypasses 11 - #of messages 19.2.3.2 Rbox 10 Port Performance Counter- RBOX_IO_PERF<27:0> Table 19-19 Fields in RBOX_IO_PERF<27:0> Name Extent Access Description PCV<23:0> <27:4> RW Counter value There is a hidden (not software visible) 8-bit register. The hidden register is cleared on every write (by software) to this register. The software-visible counter is only incremented when the hidden register overflows. The counter stops incrementing once it overflows (carry out of bit <27> ). An interrupt is also asserted at that point (but may be blocked by the interrupt mask). PCC<l:O> <1:0> Counter Control 00 - port utilization (increment for every outgoing used tick) 01 - # cycles the router is in drain mode 10 - # cycles the router is in starve mode 11 - # of messages RW 19.3 Addendum: lmplemention Notes 19.3.1 From Data/Event I PRs Implementation Note: The information for generating the profileMe PCs is already kept in the PC Table. When the collapsing buffer drives map blocks to the Pbox, information from the premap PC table is read out and merged with information coming from the collapsing buffer to be stored into the Post-Map PC Table. The PC<51 :2> of a profiled instruction can be determined from the following pieces of information: • • • PC A<51:5> -the first fetch-block's PC in the map block PC B<51 :5> - the second fetch-block's PC in the map block Total Map Block Length<3 :0> - in instructions Compaq Confidentia I 5 January 2001 ··· Subject To Change Performance Monitoring 19-23 Addendum: lmplemention Notes • Length of instructions from slot A<3 :0> - in instructions • Starting PC offset position for slot A <3 :0> • Starting PC offset position for slot B <3 :0> • Position of the profiled instruction in the map block, (PRO_POS, and PRl_POS in PR_INST_CTL) • Two bits from the selection engine that indicate either or both of the profiled instructions are being chosen this cycle. 19.3.2 Following Table 17-4 Implementation Note: • Event-flags will be kept, per TPU, along with the current PC latches in the PC datapath. This state will indicate whether certain types of events happened while attempting to fetch from that PC. We can take advantage of the fact that all restarts: Icache Miss, Line Mispredict, Way Mispredict, micro TB (uTB) out of date, and Exception/Disruption restarts, will all restart in slot 0, resetting the current PC latch in I2B. When that PC latch is set, the event-flags for those restarts can also be set. Whenever a PC is sent onto I3, the event information is sent along with it, indicating the events that occurred regarding that PC fetch. If a pipe restart occurs while attempting to fetch a PC that already has event-flags set, the event-flags will follow the PC and be copied back into the current event-flags state. In this way, we can record multiple events for one attempted PC fetch. So, if first we line mispredict, then Icache miss, we can see both events. • These event-flag bits will be recorded for each of the profileMe instructions when they are chosen. If the PC changes due to a PC-redirecting exception, the eventflags are cleared, because the old event-flags no longer pertain to THIS fetch. Also, an event flag indicating that this was a PC redirect is set. Whenever the PC is incremented or changed as part of normal program flow, the event-flags are cleared. 
19.3 Addendum: Implementation Notes

19.3.1 From Data/Event IPRs

Implementation Note: The information for generating the profileMe PCs is already kept in the PC Table. When the collapsing buffer drives map blocks to the Pbox, information from the pre-map PC Table is read out and merged with information coming from the collapsing buffer to be stored into the Post-Map PC Table. The PC<51:2> of a profiled instruction can be determined from the following pieces of information:

• PC A<51:5> - the first fetch-block's PC in the map block
• PC B<51:5> - the second fetch-block's PC in the map block
• Total Map Block Length<3:0> - in instructions
• Length of instructions from slot A<3:0> - in instructions
• Starting PC offset position for slot A<3:0>
• Starting PC offset position for slot B<3:0>
• Position of the profiled instruction in the map block (PR0_POS and PR1_POS in PR_INST_CTL)
• Two bits from the selection engine that indicate whether either or both of the profiled instructions are being chosen this cycle

19.3.2 Following Table 17-4

Implementation Note:

• Event-flags will be kept, per TPU, along with the current PC latches in the PC datapath. This state will indicate whether certain types of events happened while attempting to fetch from that PC. We can take advantage of the fact that all restarts (Icache Miss, Line Mispredict, Way Mispredict, micro TB (uTB) out of date, and Exception/Disruption restarts) will restart in slot 0, resetting the current PC latch in I2B. When that PC latch is set, the event-flags for those restarts can also be set. Whenever a PC is sent onto I3, the event information is sent along with it, indicating the events that occurred during that PC fetch. If a pipe restart occurs while attempting to fetch a PC that already has event-flags set, the event-flags will follow the PC and be copied back into the current event-flags state. In this way, we can record multiple events for one attempted PC fetch. So, if we first line mispredict and then take an Icache miss, we can see both events.

• These event-flag bits will be recorded for each of the profileMe instructions when they are chosen. If the PC changes due to a PC-redirecting exception, the event-flags are cleared, because the old event-flags no longer pertain to THIS fetch. Also, an event flag indicating that this was a PC redirect is set. Whenever the PC is incremented or changed as part of normal program flow, the event-flags are cleared.

The event-flags must be passed through the pre-map PC Table to stay with the appropriate fetch block. Unfortunately, the pre-map PC Table will have space for event-flags for all fetch slots, even though slot 1 fetches do not need any. This is because any or all slots in the collapsing buffer could be slot-0 fetches. When a map block is sent with one or more profileMe instructions in it, we store the event-flags in a register (there is no need to store these in the post-map PC Table), to be read later by IPR reads. Right now, we should plan for about 8 bits for event flags.

20 Hardware Debug Features

Debugging real 21464 hardware in a timely fashion is a key element to achieving our time-to-market goals. As the complexity and speed of Alpha chips increase, the ability to observe, understand, and potentially affect the operation of faulty hardware through software or the pin interface alone gets more difficult and time-consuming. This document outlines the capabilities we intend to include in the 21464 to aid in the hardware debug and FRS effort.

In the past, capabilities such as feature disable or bypass bits, performance monitors, and status ports have been embedded into silicon for the purpose of debug. Many of these features have proved invaluable in the quest to understand unexpected behavior. Previous experience has shown how difficult and time-consuming the process can be when there is minimal controllability and almost no visibility features. The 21464 will be significantly more complex and must have better debug hooks in the hardware.

Although this document focuses on features to help find logic bugs, other sources of problems, such as manufacturing defects, implementation and layout errors, or even software errors, can create problems which appear to be logic errors and often can be isolated through the same techniques. This document does not cover manufacturing-test-specific goals or features, but it should be noted that observability features added for system debug can ease the manufacturing test pattern generation process.

The process of specifying the debug features in the 21464 has just begun. At this point, the goal of this document is to provide some focus and structure and to specify the common high-level features that are currently being proposed. Eventually, as the detailed lists of signals and controls being integrated into each box become better defined, this document will become a reference. As this is still in the proposal stage, any suggestions for enhancements or other concerns are definitely welcome.

20.1 Debug Process

Typically, when something goes wrong, the first goal is to just reproduce the failure, then to prune the case down to reduce the number of cycles and eliminate interactions with other possible sources of error. The difficult task here is identifying the conditions that contribute to the failure and recreating just those conditions in an ever more simplified way.

Once the failure is easily reproducible, the task shifts to isolation of the exact cause. In a simulator we would typically augment tracefiles with signals until the problem is traced backwards from symptom to cause. In real cases where the stimulus fails to reproduce the failure in the model, hardware features that allow internal state to be observed are precious.
Quickly identifying the cause of a failure is very important to achieving time-to-market goals, but the ability to find a workaround may be equally important. Fab times for the 21464 will likely result in weeks or months to turn around changes to silicon. Once prototypes or revenue hardware is shipped, the cost of hardware upgrades also factors into the cost and impact of bugs. Bugs will exist; our success depends on the ability to quickly resolve the issues and ship revenue hardware.

The debug features we incorporate into the 21464 need to address the demands of all the phases of debug. When reviewing this document or defining visibility hooks, keep the debug flow in mind and try to ensure we create a complete solution.

20.2 Feature Overview

The 21464 has committed to implementing some amount of debug logic specifically targeted at observing and controlling the processor's flow. The trick is to carefully balance the cost with the benefit. The overall guideline is minimal area penalty (<5%) and no speed penalty. The current plan is to support debug with the following global structures.

20.2.1 Scan

Manufacturing test is the driving force behind the Scan and BiST implementation, but this infrastructure also allows for significant debug visibility into internal state. The 21364 model of multiple scan islands, each with multiple scan chains, is also the plan of record for the 21464. This methodology requires scan on only a small percentage of the latches but allows almost any latch to be included in the chain if desired. Detecting difficult stuck-at faults is the most common reason for adding latches to the scan chains, but debug visibility is an equally good reason, and designers should be encouraged to make as many important states as possible visible through scan.

When considering what to add to the scan chains for debug, remember that although scan is an efficient way to extract the current state of a large number of signals, it is a destructive process that reflects only one instant in time. If the failure symptom is a hang, the problem that caused the hang can hopefully be deduced from the scannable state. When the failure symptom is dynamic (data corruption, application crash, etc.), narrowing in on a point in time where incorrect behavior can be observed through scan is much more difficult.

BiST can help. The BiST engines in the 21464 will have a read-out mode that extracts the internal state into the scan chain. Structures like the register file, PC, and branch history tables can be completely dumped through the scan paths. Counters, FIFOs, or other structures that preserve some extended state can also provide some historical information to the scan dump. CAM structures like the TBs are directly dumpable even through BiST. Additional debug hooks will need to be designed into CAMs if they are to be dumped through scan.

Aside from ensuring the correct information is scannable, the major issue relating to scan for debug is how to activate the scan dump.

20.2.2 Trace Bus

The traditional method of debugging hardware in the lab is to attach a logic analyzer to some number of pins, find a trigger condition near the failure, and infer internal operation from the pin trace. Compared to processors of just a few years ago, the 21464 more closely resembles a system on a chip.
The information needed to understand the events surrounding most failures is buried deep in the processor, inaccessible to traditional oscilloscopes or logic analyzers. Our solution for the 21464 is to embed the logic analyzer functions directly into the silicon.

A logic-analyzer-type trace differs from a scan dump in that it provides a continuous view of signal state over a large number of cycles. The number of signals that can be traced is very small, but the time window is large. The 21464 will be able to trace up to 36 signals over a window of 64 million cycles.

Figure 20-1 Trace Bus Timing Relationships

The signals captured on the trace bus will be dumped off-chip through the redundant RDRAM channel, where software can later recover and process the information. Since this channel runs at a fixed 800 MHz, the full 36 bits can only be dumped when the core is running at or below 800 MHz. Between 800 MHz and 1.2 GHz the channel will be limited to 27 bits. Between 1.2 GHz and 1.6 GHz, 18 bits can be dumped, and above 1.6 GHz, 9 bits per cycle is all that can be traced.

The selection of which 36 bits to trace will be somewhat programmable, but the degree of flexibility is still undecided. The simple approach is to allow each box to selectively drive any or all of four 9-bit chunks. Once the list of signals that can be traced is available, we can better evaluate whether more sophisticated sharing is necessary.

Each box can make many more than 36 signals available for tracing, but software must select no more than 36 at once. Multiple runs can collect multiple 36-bit traces, but exact cycle-to-cycle reproducibility may not be possible and will make correlation of multiple traces difficult. When selecting signals to be traced, consider generating marker signals that could give a strong correlation among multiple traces. A signal indicating the PC matched some address within a loop would give a strong indication of how to overlay two traces.

20.2.3 Internal Processor Registers

The ability to read/write internal processor registers exists for many reasons other than debug, but it also plays a major role in supporting debug. Values that are readable can provide good visibility into the current state of the machine and can allow software to watch for situations that may be causing, or symptomatic of, a failure. Status bits, aggregate counters, and the ProfileMe registers are excellent examples of planned readable Internal Processor Registers that hold information valuable to debug. Debug is a perfectly valid reason to make additional information readable through IPRs.

IPR readability differs from scan and trace by being both non-destructive and immediate. Debug or workaround software that is trying to evaluate whether the machine is in trouble could use the information to trigger a scan or trace dump. This is the primary consideration when determining whether a signal should be traceable, scannable, or readable.

Writable IPR bits are a major part of the low-level infrastructure. The debug process also needs writable IPR bits for two functions. First, IPRs will be used to customize the selection of signals to trace as well as the trigger conditions for the trace bus and scan chains. Writes to flow control or configuration IPRs will also be the primary method to alter chip behavior and avoid conditions that are not reliable.
Even if the controls are insufficient to

20.2.4 Derived Signals

None of the methods described above provide visibility to more than a small fraction of the total internal state. Often some form of internal processing can significantly increase the information content. Encoders, programmable counters, and comparators are examples of simple structures that compress the information. A comparison match against a PC or memory address is an example of a very valuable derived signal. It is only one bit versus a 64-bit virtual address; it makes an excellent marker in a trace file, and it could be used as a trigger for the scan or trace dump, or even a PAL trap that could do some fix-up and prevent a bug from propagating.

20.3 Global Support

20.3.1 Scan

The debug process will leverage the existing scan for manufacturing support as much as possible. Additional capabilities that will be added include:

1. The ability for the BiST engines to dump the existing contents of structures onto the scan paths. The traditional BiST engine overwrites the existing contents first, then reads and computes a signature. In debug mode, the overwrite phase will be skipped and the reads will be sequenced to deposit the data onto the scan chains.

2. The ability for internal trigger conditions to initiate a scan dump operation. The 21464 will have a fairly elaborate mechanism to detect internal events which may indicate a failure. The output of this detection circuit (the trigger) can be used to initiate a scan dump operation.

QUESTION: How does the external logic (that captures the dump) know the scan operation is in progress?

20.3.2 Trace Bus

The trace bus is a 36-bit bus that winds through the chip touching the section or sections of debug logic in each box. It terminates in the Cbox, where the data is multiplexed onto the redundant channel in debug mode. When debug mode is enabled, the Cbox will continuously dump until signaled to stop. Every 64M (or is it RDRAM depth?) cycles the addresses will wrap and start overwriting previously collected trace data. The bus will be highly pipelined, with latches positioned wherever necessary to avoid timing problems. The bus will take a low priority in routing, utilizing low/slow metal and winding around congestion whenever necessary.

Figure 20-2 Trace Bus Routing

Each point will either source or repeat the value on to the next point in the chain. An enable bit will be used to quiesce the bus when tracing is not enabled. The trace bus control signals will likely be distributed through the standard IPR mechanisms rather than from a centralized debug control section. Another suggestion was to utilize a serial channel to distribute the controls in much the same way the scan control information is distributed to the control registers in the scan islands.

There is no need to synchronize the injection of data onto the trace bus. Any skew between traced signals relative to the architectural pipeline can be adjusted for when the trace data is extracted and analyzed. This also eliminates any dependencies on the number of stages or order of connection to the trace bus. Once the final structure of the bus is known, it will be important to specify the timing of each traceable signal relative to a common point so software can reassemble and interpret the data.

*** The address of the last location written to the RDRAM will need to be readable so software can unwind the trace. If this is a problem, we could burn a bit in the data stream that inverts with each pass, but that would be a waste of a precious resource.
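As a rough software-side sketch, the dump-channel limits quoted in Section 20.2.2 can be captured as follows when planning how wide a trace to collect; the treatment of the exact boundary frequencies is an assumption, and the helper name is illustrative.

    /* Width of the trace sample that can be dumped per core cycle,
     * given the fixed 800 MHz dump channel (see Section 20.2.2). */
    static int trace_bits_per_cycle(unsigned core_mhz)
    {
        if (core_mhz <= 800)  return 36;   /* full trace bus          */
        if (core_mhz <= 1200) return 27;
        if (core_mhz <= 1600) return 18;
        return 9;                          /* above 1.6 GHz           */
    }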
20.3.3 Trigger Logic

On a 1.5 GHz chip, even 64M samples is only 1/20th of a second of information. Some means to stop or initiate the data collection process near the probable cause of a failure is required. The proposed trigger will accept one or two trigger signals from each box, combine them, and count the hits. When the counter reaches some programmable value, the global trigger will assert and cause one of the following events:

• Initiate scan dump
• Stop trace capture
• Trap to PAL

The trigger signals generated by each box are assumed to be a Boolean combination of interesting status signals within the box. The Ibox may signal the trigger logic when it detects a specific PC, or detects an Icache miss at a specific PC, or takes a cache miss at a specific PC that is a mispredicted branch. How the global trigger combines the box triggers is still under debate. The simple approaches are to either OR the signals or select a specific signal to use. More complex approaches involve PLA-type functions. A more precise definition of the types of trigger conditions that would be useful to create needs to take place before precisely defining the method of combining the signals.

Figure 20-3 Trigger Logic

Another good suggestion was to place a variable delay after the global trigger. If a reproducible trigger could be found relatively near an interesting event, the variable delay would be useful in conjunction with the scan dump to capture several positions near the failure. The variable delay might also be useful to delay stopping the trace dump until the interesting data surrounding the failure has been written into the RDRAMs.

The dynamic range of the two counters is an interesting question. The basic mechanism should degenerate into a "fire after n cycles" (increment every cycle, set threshold to n) or a "fire on first occurrence" (increment when signaled, set threshold to 1). The counter should probably be in the 16-24 bit range; the delay is probably more like a 10-bit value.

PALcode can directly force a scan dump or stop the trace collection. The operational model would be to trigger into PAL and examine the machine state. If not near the expected failure, continue; otherwise, dump. Writing the threshold to zero, the delay to zero, and setting the scan_dump_en bit would force a scan dump. The trace dump is started by setting an IPR bit and stopped either by clearing the bit or when the trigger fires to stop the trace.

It would also be desirable for external logic to be able to activate the trigger and for the trigger signal to be sent externally. This would allow multiple chips to trigger each other, or an external logic analyzer to trigger or be triggered in conjunction with the internal logic. An external interface, maybe the JTAG port, that could trap to PAL and/or manipulate IPRs and memory would allow a simple configurable monitor to be created.
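A behavioral sketch of the counter-plus-delay mechanism may help when thinking about trigger programming. The structure below assumes a simple OR of the box triggers (the combining function is still under debate) and the counter and delay widths suggested above; none of the names correspond to real IPR fields.

    #include <stdbool.h>
    #include <stdint.h>

    /* Behavioral sketch of the proposed global trigger: a hit counter
     * with a programmable threshold, followed by a programmable delay. */
    struct global_trigger {
        uint32_t threshold;          /* programmable hit count              */
        uint32_t count;              /* hit counter (24 bits assumed)       */
        uint32_t delay;              /* programmable post-trigger delay     */
        uint32_t delay_left;
        bool     count_every_cycle;  /* true: "fire after n cycles" mode    */
        bool     armed;              /* threshold reached, delay running    */
    };

    /* Called once per cycle; returns true on the cycle the trigger action
     * (scan dump, stop trace capture, or trap to PAL) should be taken. */
    static bool trigger_tick(struct global_trigger *t, bool any_box_trigger)
    {
        if (t->armed) {
            if (t->delay_left == 0) { t->armed = false; return true; }
            t->delay_left--;
            return false;
        }
        if (t->count_every_cycle || any_box_trigger) {
            if (++t->count >= t->threshold) {
                if (t->delay == 0)
                    return true;          /* no delay programmed: fire now */
                t->armed = true;
                t->delay_left = t->delay;
            }
        }
        return false;
    }

"Fire on first occurrence" is threshold = 1 with count_every_cycle false; "fire after n cycles" is threshold = n with count_every_cycle true, matching the degenerate modes described above.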
20.4 Box Support

Each box needs to:

1. Identify the states and structures that will be most useful to debug and determine which of the techniques (scan, trace, IPR) is best suited to provide visibility. The box verification teams should play an active role in the definition.

2. Define the interesting trigger conditions and any interactions with other boxes that should be considered when developing the global trigger conditions. The number of triggers sent to the global trigger logic should also be defined.

3. Define the IPRs necessary to support the debug features. These registers should include trigger condition controls, trace bus controls, counter or comparison values, and general readable or writable signals for debug. To allow multiple boxes to drive pieces of the trace bus, some type of swizzling (4:1 mux) at each box would ensure that combinations of signals are not inaccessible because they share a fixed bit position.

4. Review the BiSTable structures to ensure the contents are readable for debug. Given the value to debug of most structures large enough to contain BiST, structures that will not be readable should be clearly identified and discussed.

5. Identify the point or points within the box where the trace bus should be routed. The bus needs to be identified in the global floorplan.

The following is a list of items suggested during recent discussions about debug features. It does not represent a complete list or even a committed list, but rather a seed for thought as the individual boxes begin to develop official plans.

20.4.1 Ibox

The Ibox must:

• Allow the PC to be traced. Must include thread ID.
• Include a programmable PC comparator. The output should be traceable as a marker and usable as a trigger.
• Make the TPUalive status bits IPR readable and traceable.
• Trace many of the PR_FE_INFO bits.
• Trigger on the profiled instruction issuing.
• Make the PC comparison match tag an instruction for profiling.

20.4.2 Pbox/Qbox

The Pbox and Qbox must:

• Trace INUM allocation and kill sequences.
• Trace LSNUM allocation and high-water marks.
• Consider implementing some model assertion checks as debug trigger conditions.
• Provide bits to unwedge a hung TPU.

20.4.3 Ebox/Register File

The Ebox and Register File must:

• Encode and trace TPU for all pipes.
• Add a programmable instruction trigger. Detect specific opcode/function/TPU/pipeline patterns and trigger on a match.
• BiST-dump the entire register file contents.

20.4.4 Mbox

The Mbox must:

• Include both virtual and physical address match comparison logic. Trace and trigger against matches.
• Trace inputs to the exception funnel.
• Trace the PR_MEM_INFO bits (except VA).

20.5 Software Support

Extracting 64M samples from the RDRAMs and reassembling the data in some useful way will take some assistance from various pieces of system software. Reset modes need to be created that will ensure the information is not lost or overwritten. Boot flags will need to be defined that cause the information to be extracted and saved, or left untouched, during the boot process. Either PAL or OS hooks will be needed to make the trace data available to an application for processing. And the application that processes the trace data needs to be written. Console, PAL, and/or OS hooks will need to be added to allow state to be monitored and trigger conditions manipulated relative to initiating a failure.
Anyone who has been close to this process in the past is encouraged to help define the software features that will best optimize the debug process.

21 Testability and Diagnostics

The Tbox provides the testability strategy and test solutions. The Tbox provides comprehensive tester-driven access to the chip's testability features during manufacturing, and also allows a simple automatic chip-initiated access that leverages the same features during normal chip operation. The testability features themselves are scattered throughout the chip, implementing various components of the testability strategy, namely self-test, self-repair, and internal controllability and observability for debug diagnosis, manufacturing test, and test pattern development. See [1] for details of the testability strategy.

Figure 21-1 shows the basic contract between the Tbox and a test target in the 21464. A Test Target is simply a functional block under test for which hardware test assist, that is, a design-for-test feature, is desired. For example, the Icache, Dcache, and Scache arrays, the Register File, and the TLB are some test targets. The testability feature implements engines that exercise test algorithms or provide the controllability and observability required for testing the test target. The Satellite Interface Unit provides the local control of the testability feature and communicates with the Central Controller for the transport of test commands and test data and results.

Figure 21-1 Basic Tbox Contract

The Test Target, the Testability Feature, and the Satellite Interface Unit together make up a generic test satellite. The 21464 has a number of such test satellites. The Central Controller communicates with all test satellites in the chip over dedicated test command broadcast buses and interfaces with the test pins to provide comprehensive and orderly control of, and exchange of test data/results with, the test features.

The Central Controller, the Satellite Interface Units and the testability features in the distributed test satellites, the test pins, and the various test command broadcast buses and test data lines make up the bulk of the Tbox.

21.1 Global Block Diagram

Figure 21-2 shows the global block diagram of the Tbox. The Tbox consists of three basic components: the Central Controller; the distributed test satellites that house the function under test and their testability features; and the test command broadcast buses connecting the Central Controller and the distributed test satellites. The test satellites have serial input and output data ports that are daisy-chained to form scan rings with the Central Controller.

Figure 21-2 Tbox Global Block Diagram

The Central Controller is based on the IEEE 1149.1 TAP. It receives commands from both the IEEE 1149.1 TAP and the automatic chip agents (such as the chip's reset state machine), encodes them into command packets, and distributes them to the test satellites over five distinct test command broadcast buses. The test satellites are organized into five general groups.
Each group is serviced by the Central Controller over a dedicated test command broadcast bus and one or more pairs of serial data lines. Sections 21.1.1 through 21.1.5 describe the five groups of satellites.

21.1.1 Group 1 - Array BiST/BiSR Satellites

This group consists of the satellites with large array structures with self-test and self-repair features. The satellites in this group are the most comprehensive. Each has substantial Built-in Self-Test (BiST) and Built-in Self-Repair (BiSR) engines, and each has a sophisticated Satellite Interface Unit capable of supporting a multitude of local test feature controls and exchanging test data with the tester via the Central Controller. This group is accessed both externally from a tester and internally by the chip-initiated automatic actions. The group is serviced by a 3-wire Array Test Command Bus, ATcb_h(2,0). Table 21-1 lists the commands supported by the bus. The satellite itself is described later. The Icache Data, Tag, and Line Predict arrays, the BHT array, the Dcache Data and Tag arrays, the Register File, and the Scache Data and Tag arrays are expected to belong to this group.

Table 21-1 Array Test Command Broadcast Bus Atcb(2,0)

Atcb  Command        Purpose
111   ATNop          No operation.
110   ATShiftTcr     Shift Test Command Registers (Tcrs) in satellites.
101   ATShiftTdr     Shift the test data registers selected by Tcrs.
100   ATDoIt         Initiate execution.
011   ATDoBiST       Chip-initiated simultaneous BiST in satellites.
010   ATDoResult     Chip-initiated simultaneous result extraction from satellites.
001   ATDoQuickInit  Chip-initiated simultaneous quick-init of embedded RAMs and other structures in satellites.
000   ATDoReset      Reset all satellites.

21.1.2 Group 2 - BiST Satellites

This group also consists of array structures, with only self-test support. The test features do not have as many test modes as in the Group 1 satellites, and the only data exchanged with the Central Controller is the Pass/Fail result and occasionally an address map. The satellite's test feature consists of a simple BiST engine. The satellite interface unit is also simple, with limited capabilities.

This group is serviced by a 2-wire test command broadcast bus called Btcb(1,0). Table 21-2 lists the broadcast commands on this bus. A number of smaller embedded RAM arrays and CAM arrays such as TLBs are expected to belong to this group.

Table 21-2 Simple BiST Command Bus Btcb(1,0)

Btcb  Command     Purpose
11    BTNop       DoBiST
10    ATShiftTcr  Shift Test Command Registers (Tcrs) in satellites.
01    ATShiftTdr  Shift the test data registers selected by Tcrs.
00    ATDoReset   Reset all satellites.

21.1.3 Group 3 - Observability Registers (LFSRs)

This group consists of the observability registers. Unlike the arrays and their self-test features with multiple modes in the first group, the observability registers are highly uniform and require a simple satellite interface unit. This unit can turn on the observation, shift out the contents, and selectively bypass itself in a chain. This group is controlled by the two-wire observability register command bus RTcb_h(1:0). The commands are listed in Table 21-3.

Table 21-3 Observability Register Command Bus LCB(1:0)

LCB  Command     Purpose
00   RTNop       Observability registers inactive.
01   RTShiftTcr  Shifts control bits in observability register satellite interface units.
10   RTShiftTdr  Shifts through observability registers to initialize them or off-load signatures.
11   RTCapture   Captures chip data for test and debug.
21.1.4 Group 4 - Scan Islands (TBD)

21.1.5 Group 5 - Boundary Scan Register

The boundary scan register cells are located at the I/O pins. There is no satellite interface unit; the broadcast from the Central Controller directly controls each boundary scan register cell.

21.2 Test Pins

The Testability Access Architecture uses both dedicated and some shared pins. Table 21-4 lists the dedicated pins. Table 21-5 lists the shared pins.

Table 21-4 Dedicated Test Port Pins

Pin Name     Type    Function
Tms_H        Input   IEEE 1149.1 test mode select
Tdi_H        Input   IEEE 1149.1 test data in
Trst_L       Input   IEEE 1149.1 test logic reset
Tck_H        Input   IEEE 1149.1 test clock
Tdo_H        Output  IEEE 1149.1 test data output
SromData_H   Input   SROM data/Diagnostic terminal data input
SromClk_H    Output  SROM clock/Diagnostic terminal data output
SromEn_L     Output  SROM enable/Diagnostic terminal enable
ScanMode_H   Input   Scan Mode Control (placeholder)
ScanShift_H  Input   Scan shift operation control

Table 21-5 Shared Test Pins

Pin Name           Type    Test Function/Normal Function
TestStat_H         Output  BiST status/timeout output
DumpData0_H(63,0)  Output  Bitmap-LFSR Dump port-0/Tbd
DumpValid0_H       Output  Bitmap-LFSR Sample valid for port-0/Tbd
DumpData1_H(63,0)  Output  Bitmap-LFSR Dump port-1/Tbd
DumpValid1_H       Output  Bitmap-LFSR Sample valid for port-1/Tbd
ScanDataIn_H(3,0)  Input   Scan Data Inputs/Tbd
ScanDOut_H(3,0)    Output  Scan Data Outputs/Tbd

21.3 Central Port Controller

The Central Port Controller links the external world with the testability features. It broadcasts test commands to control test operations and exchanges test data between the test controller (tester) and the testability features. It is an IEEE 1149.1 based controller that accesses both the standard-compliant features and the chip manufacturing test features. The latter are accessed synchronously with the CPU clock.

Figure 21-3 shows the block diagram of the Central Port Controller, which consists of the:

• IEEE 1149.1 TAP Controller
• Timing Control Unit
• Configuration Flags and Fire Wall
• SROM Engine
• Reset Engine
• Output Mux
• Dispatch Units for the IEEE 1149.1, Cache, and LFSR features

Figure 21-3 Central Port Controller

21.3.1 IEEE 1149.1 Test Access Port Controller

This is the IEEE 1149.1 compliant Test Access Port Controller. The port's pin interface consists of the Tdi_H, Tdo_H, Tms_H, Tck_H, and Trst_L pins. The port supports access to the IEEE 1149.1 mandated public test features as well as several chip manufacturing test features. The scope of IEEE 1149.1 compliant features on the 21464 is expected to be limited to board-level assembly verification test. Systems that do not intend to drive this port MUST terminate the port pins as follows: pull-ups on Tdi_H and Tms_H, pull-downs on Tck_H and Trst_L.

The controller is clocked by the Clock Control Unit. It is clocked externally by the test clock Tck_H during normal operation, and internally by the CPU clock during synchronous manufacturing operation.

The Port Controller consists of the TAP Controller State Machine, the Instruction Register, the Bypass Register, and the TDO Mux.
The Bypass Register provides a short shift path through the chip's IEEE 1149.1 logic. It is generally useful for board-level testing. It consists of a 1-bit shift register. The Instruction Register holds test instructions. It is 8 bits wide. Section 4.9 lists (Table 8) and describes the instructions supported on the 21464.

Figure 21-4 shows the TAP Controller State Machine state diagram. Tms_H controls the state transitions. The transitions occur with the rising edge of the clock. The TAP state machine states are decoded and used for initiating various actions for testing. The Output Mux steers the output from the various testability shift registers in the chip to the Tdo_H pin.

Figure 21-4 TAP Controller State Machine

21.3.2 Port Configuration and Firewall Logic

21.3.3 Clock Control Unit

21.3.4 Tbox Reset Engine

The Reset Engine controls the flow of automatic BiST/BiSR, self-init, and Icache initialization operations. The Reset Engine consists of the Reset State Machine, the IRESET flag, and the Master BiST Counter. Figure 21-5 shows the flow diagram of the Reset Engine.

The engine is triggered upon detection of the chip reset deassertion edge. The MRESET and DONTBIST flags determine the path through the flow diagram. The Do-Array-Test state performs either the simultaneous BiST or the self-init. The Do-Results state extracts the result of BiST from the test satellites. The IRESET flag holds reset to the internal chip logic. The flag is set by the chip reset and deasserted by the Reset State Machine returning to the Idle state. Figure 21-6 shows the Reset Engine state machine.

Figure 21-5 Tbox Reset Engine

Figure 21-6 Tbox Reset Engine State Diagram

21.3.5 SROM Engine

The SROM Engine controls the serial initialization of the Cbox configuration bits and the instruction cache array. The SROM Engine drives the SROM pin interface as well as controls the shift and write operation in the Icache test satellites. Normal loading occurs at the rate of 1 bit per 256 CPU cycles. If FASTSROM is set, the loading occurs at the rate of one bit per 16 CPU cycles. The maximum rate at which the port may be operated is limited to one bit per 160 nanoseconds. Thus, the fast SROM loading is usable only if the CPU clock is slowed down, for example, at wafer probe.

Figure 21-7 shows the state diagram of the SROM Engine. The engine is triggered by the Reset Engine upon entry into the DOSROM state. It first loads the Cbox configuration registers, followed by the Icache address counter. The value shifted into the Icache address counter determines the number of fetch blocks to be filled. Unlike the previous generations of Alpha microprocessors, EV6 allows a variable amount of the Icache to be filled. In the next step, the engine loads each fetch block in reverse order. See Section 9.2 for the details of the SROM map and SROM operation.

Figure 21-7 SROM Engine State Diagram
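The arithmetic behind the fast-mode restriction is worth making explicit. A minimal sketch, assuming only the rates quoted above: FASTSROM at 16 CPU cycles per bit respects the 160 ns per-bit port limit only when the CPU cycle time is at least 160 ns / 16 = 10 ns, that is, a core clock of 100 MHz or less, as at wafer probe.

    /* SROM load timing check, using only the rates quoted in
     * Section 21.3.5: 256 CPU cycles per bit normally, 16 with
     * FASTSROM, and a port limit of one bit per 160 ns. */
    static int srom_cycles_per_bit(int fastsrom)
    {
        return fastsrom ? 16 : 256;
    }

    /* Returns nonzero when the selected mode respects the 160 ns
     * per-bit limit at the given CPU cycle time. */
    static int srom_rate_is_legal(double cpu_cycle_ns, int fastsrom)
    {
        return srom_cycles_per_bit(fastsrom) * cpu_cycle_ns >= 160.0;
    }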
This port supports two functions. During power-on, it supports automatic initialization of the Cbox configuration registers and the instruction cache (Icache) from the system serial ROMs. After power-on, it supports a serial diagnostic terminal.

During SROM load:

• The Srom_OE_L pin supplies the output enable as well as the reset to the serial ROM. (Refer to the serial ROM specifications for details.) The 21264 asserts this signal low for the duration of the Icache load from the serial ROM. Once the load is complete, the signal remains deasserted.

• The Srom_Data_H pin reads data from the SROMs.

• The Srom_Clk_H pin supplies the clock to the SROMs, causing them to advance to the next bit. Simultaneously, it causes the existing data on the Srom_Data_H pin to be shifted into an internal shift register. The cycle time of this clock is 256 times the CPU cycle time. (If the FASTSROM flag is set, it is 16 times the CPU cycle time.) The hold time on Srom_Data_H is twice the CPU cycle time with respect to Srom_Clk_H.

Once the Icache load is complete, the port reconfigures into a simple software-timed serial line interface, similar to RS422, that may be used for system debug and diagnosis. In a system, the serial line interface is automatically enabled if the Srom_OE_L pin is wired to the active-high enable of an RS422 (or 26LS32) driver driving to Srom_Data_H and to the active-high enable of an RS422 (or 26LS31) receiver driven from the Srom_Clk_H pin.

After reset, the Srom_Clk_H pin is driven from the sl_xmit bit, I_CTL(13), in the Ibox IPR. This IPR is cleared during reset, so it will start driving as a 0, but it can be written and modified by any program. The data becomes available at the pin after retire of the HW_MTPR instruction that writes the sl_xmit bit. (Remember that the output only changes after retirement of the HW_MTPR, which can take a variable number of cycles depending on machine state.)

On the receive side, while in native mode, any transition on the sl_rcv bit (I_CTL(14)), driven from the Srom_Data_H pin, results in a trap to the PAL interrupt handler (assuming that the serial line interrupt enable bit is set in the SIRR). Once in PAL mode, all interrupts are blocked. The interrupt routine can then begin sampling the sl_rcv bit in the I_CTL IPR under a software timing loop to input as much data as needed, using whatever serial line timing protocol is chosen. The delay between a transition on the pin and the interrupt trap is TBD, but probably around 5 cycles or so. For a complete description of the IPRs associated with this interface, refer to the IPR chapter.

21.4 Dot1 Test Decode and Dispatch Logic

This logic controls the operation of the boundary scan register and the Die-ID register. It basically decodes the instructions held in the Port Controller and combines them with suitable TAP Controller state machine decodes to generate control signals. When the BSRDLY instruction is loaded, the bsr_drv_pins_h, bsr_highz_h, bsr_capture_h, and bsr_update_h signals are connected back-to-back to form a long inverter delay path consisting of 780x4 = 3120 inverters with a nominal delay of approximately 624 ns. This delay path may be used for predicting the speed performance of a die.

22 Error Detection and Error Handling

22.1 Disruptions

We need a word for "Interrupts and Exceptions".
We will use the word disruption to describe an event that could be either an interrupt or an exception. A disruption will be delivered - that is, it will start the events in motion that cause the change in program flow - from one of three points in the pipeline.

• Retire-time delivery - Most disruptions will be delivered when the instruction that caused them retires.

• Execution-time delivery - Some disruptions, whose rapid handling is important to performance, will be delivered when their triggering instruction executes.

• Pre-map time delivery - A few disruptions, including all interrupts, will be delivered before the triggering instruction has reached the INum map stage.

Besides delivery time alone, disruptions are divided into several classes. The first division is into Pre-map and Post-map disruptions; the former have no INums associated with them, while the latter do. Post-map disruptions with PALcode handlers are further divided into those delivered at execution time, i.e. DTB misses only, and those delivered at retire time, which includes everything else. Micro-traps, which have no associated PALcode flows, can have either execution-time or retire-time delivery depending on the cause. Pre-map disruptions are subdivided into internal IBox disruptions, such as ITB misses, and Interrupts proper, both from hardware and software sources. Finally, there is a special class of disruptions, Machine Checks, which signify fatal hardware errors.

The PBox Bid/grant Exception Logic (BEL) prioritizes all post-map disruptions. It puts execution-time disruptions from all sources (not just PALcode-assisted ones) through a structure known as the Exception Funnel (or Efunnel) and picks the oldest. The BEL also monitors whether the Completion Unit (CU) in the QBox is posting a retire-time exception (RTE) for this TPU. RTEs are reported directly to the Completion Unit, or to the QBox Inflight Table (from which they flow into the CU), depending on the point in the Arana pipeline at which they are detected. A valid retire-time exception will always take priority in the BEL, since a retiring instruction is by definition the oldest in the CPU.

The IBox uses an algorithm to arbitrate between pre-map and post-map disruptions. Briefly, every cycle the IBox services disruptions with the following priority:

1. Post-map
2. IBox internal
3. Interrupts

Interrupts are postponed in favor of IBox internal disruptions, which are deferred in favor of post-map disruptions. If the IBox has decided to take a post-map disruption, it stores the INum of the disrupting instruction and tells the BEL to broadcast the kill. The IBox will accept no younger post-map disruptions for a fixed period of time - enough for the kill to take effect across the chip, but not so long as to ignore valid disruptions on the new good path. However, the IBox will restart the disruption flow if an older disruption is signaled.

For taken post-map disruptions, the BEL vectors the IBox into the appropriate PALcode flow, while on pre-map disruptions the IBox redirects itself. For all taken PALcode-assisted disruptions, the IBox starts fetching down the PALcode path after saving the correct return PC. Note that there is a special class of interrupt, RESET/WAKEUP, which has a PALcode entry point that the IBox only vectors into when a TPU is restarted or woken from sleep mode.
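A compact way to state the per-cycle arbitration is as a priority select. The sketch below is behavioral only; the names are illustrative and do not correspond to actual signals.

    /* Per-cycle IBox arbitration described above: post-map disruptions
     * are serviced first, then IBox-internal disruptions, then
     * interrupts. */
    enum disruption_class {
        DISR_NONE,
        DISR_POST_MAP,        /* highest priority                  */
        DISR_IBOX_INTERNAL,   /* e.g. ITB miss                     */
        DISR_INTERRUPT        /* postponed in favor of the others  */
    };

    static enum disruption_class ibox_pick_disruption(int post_map_pending,
                                                      int ibox_internal_pending,
                                                      int interrupt_pending)
    {
        if (post_map_pending)      return DISR_POST_MAP;
        if (ibox_internal_pending) return DISR_IBOX_INTERNAL;
        if (interrupt_pending)     return DISR_INTERRUPT;
        return DISR_NONE;
    }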
It should be noted that there is an important consequence to delaying the taking of certain disruptions until retire time. This means that all of the data needed to identify and rectify the disruption must be stored somewhere in the time between the error event and the retirement of the instruction. Although the details are still being worked out, our current plan is to store most of the retire-time disruption type information in the Completion unit in a compressed form. We have also defined, coded up, and reviewed the Virtual Register Table (VRT) in the PBox, which supplies the virtual source or destination of the faulting instruction (depending on the exception type) for a particular class of disruptions that require it. Keep in mind that there are many error cases that can occur in different parts of the chip which do not rise to the level of disruptions. For instance, !Cache misses and line mispredictions are not only not visible architecturally, but they are handled entirely within the IBox without intervention or assistance from any other part of the CPU. 22-2 Compaq Confidential Error Detection and Error Handling 5 Jc1nuc1ry 2001 -~ Subject To Change Disruptions 22.1.1 High-Level Features Text goes here..... . Table 22-1 Key to Table 22-2, "Summary of Disruption High-Level Features' Heading Meaning Name: Name of exception, interrupt, trap, and so forth, such as Integer Overflow. Posted Time Point in time when disruption is delivered. Values Meaning Interrupt Asynchronous with respect to instruction stream, lower priority than other disruptions. Reported to Ibox PCC. At interrupt time but Non-Maskable - even by PALcode. Reported to Ibox PCC. At interrupt time but taken unconditionally - that is, with highest priority of all disruptions. Reported to Ibox PCC. Prior to mapping INum assignment of disrupting instruction. Reported to Ibox PCC After mapping of disrupting instruction. Reported to Pbox BEL. When disrupting instruction is next eligible to retire. Reported to Qbox CMP. N-M Interrupt Reset Pre-map Execution Retire Restart PC: Virtual address of post-disruption good path after handler (if any) relative to PC of disrupting instruction or interrupt victim. Values Meaning PC Disrupting instruction or interrupt victim. PC+ 4 Instruction after disrupting instruction. CBR/FCBR Target True branch target of mispredicted (integer/floating) conditional branch. Jump Target True jump target of mispredicted jump. n/a Code does not return from handler. Kill Point: Location of kill relative to disrupting INum. Values Meaning At After n/a Kill disrupting instruction and all younger in TPU Kill instruction after disrupting instruction and all younger in TPU Not relevant- for example, is for pre-map disruptions) PALcode Entry Point: Name of PALcode disruption entry if any (for example, DTBM_SINGLE). Description: Textual explanation of disruption meaning, purpose, function, etc. 
Table 22-2 Summary of Disruption High-Level Features Posted Time Restart PC PALcode Entry Point 1 Kill Point Description Bad Jump !stream VA Retire pc2 BAD_JUMP_IVA At Jump target is outside of current virtual address space lbox Debug Trap Retire PC n/a At Placeholder for chip debug exception from Ibox Name lbox Disruptions 5 January 2001 ··· Subject To Change Compaq Confidential Error Detection and Error Handling 22-3 Disruptions Table 22-2 Summary of Disruption High-Level Features (Continued) Name Posted Time Restart PC PALcode Entry Point 1 Kill Point Description !stream Access Violation Pre-map pc2 IACV n/a !stream access violation (prvilege mismatch or walk/branch out of IVA space) ITB Miss Single Pre-map PC ITB_MISS n/a Single level ITB miss (not in console mode) ITB Miss Single Console Pre-map PC ITB_MISS_CONS n/a Single level ITB miss while in console mode Jump Mispredict Execution Jump Target n/a After Predicted Jump Target did not match true Jump Target Uncorrectable !stream ECC Error Pre-map pc2 IMCHK n/a Instruction fetch experienced an uncorrectable ECC error Add Overflow Retire PC+42 AR ITH After Add/Subtract operation overflowed/ underflowed CBR Mispredict Execution CBR Target n/a After An integer conditional branch Ebox Disruptions instruction was incorrectly predicted Ebox Debug Trap Retire PC n/a At Placeholder for chip debug exception from Ebox Floating-Point Disabled Fault Retire pc2 FEN At A legal FP instruction issued while the Floating-Point Enable (FPE) bit was deasserted IFETCHB Issued Retire PC+4 n/a After A IFETCHB instruction was executed Illegal Instruction Retire PC2 OPCDEC At Thread not allowed to execute this instruction, or invalid opcode/function Mui Overflow Retire PC+42 AR ITH After Integer multiply operation overflowed/underflowed Native Mode MT_FPCR Issued Retire PC+4 MT_FPCR After An MT _FPCR instruction has issued in user mode Bad VA Alignment Execution pc2 UNALIGN At Computed virtual address LSBs (VA<2:0>) not legal for datatype Bad VA Sign Execution pc2 DFAULT At Computed Dstream virtual address sign extension (VA<63:52>) not correct Dstream Access Violation Execution pc2 DFAULT At Process has insufficient privileges to load to/store from this page DTB Miss Double Execution PC DTBM_DOUBLE At DTB miss on LD_VPfE with default page table configuration DTB Miss Double Alternate Execution PC DTBM_DOUBLE_ALT At DTB miss on LD_VPfE with alternate page table configuration DTB Miss Single Execution PC DTBM_SINGLE At Single level DTB miss (not in console mode) DTB Miss Single Console Execution PC DTBM_SINGLE_CONS At Single level DTB miss while in console mode Fault On (Read/Write) Execution pc2 DFAULT Fault on Read or Fault on Write bits set in PfE for this VA Mbox Disruptions 22-4 At Compaq Confidential 5 Jam.utry 2001 -· Subject To Change Error Detection and Error Handling Disruptions Table 22-2 Summary of Disruption High-Level Features (Continued) Name Posted Time Restart PC PALcode Entry Point1 Kill Point Description Load Data Parity Error Execution PC n/a At Data returned from the Dcache had bad parity Load Double-Bit ECC Error Execution pc2 MCHK At Load tag or data experienced an uncorrectable ECC error Load ErrResp from Memory Execution pc2 MCHK At Memory system returned an Error Response on a load Load Invalidate Execution PC n/a At TPU received an invalidate probe Load NXMResp from Memory Execution pc2 MCHK At Load attempted from Non-eXistent Memory Load Rambus Uncorrectable Error Execution pc2 MCHK At Rambus interface detected an uncorrectable 
error on a load Load Single-Bit ECC Error Execution PC n/a At Load tag or data experienced a correctable ECC error Load Tag Parity Error Execution PC n/a At Tag matching load VA experienced a parity error Load/Store Order Violation Execution PC n/a At Store executed out of order with respect to a load Load/Store Synonym Detection Execution PC n/a At Mbox has detected a virtual-tophysical alias Mbox Debug Trap Execution PC n/a At Placeholder for chip debug exception from Mbox QUIESCE Execution3 PC+4 n/a After A QUIESCE instruction is about to retire FCBR Mispredict Execution FCBRTarget n/a After A floating conditional branch instruction was incorrectly predicted FP Trap (SW = 0) Retire PC+42 AR ITH After Floating-Point trap without software completion FPTrap (SW= 1) Retire PC+42 AR ITH After Floating-Point trap with software completion FPCR Update Retire PC+4 AR ITH After Fbox Disruptions Fbox requests an update of the FPCR Pbox/Qbox Disruptions Pbox/Qbox Debug Trap Retire PC n/a At Placeholder for chip debug excepti on from Pbox/Qbox Cold Reset Reset n/a MILD_RESET4 n/a Cold start (power-on or platform/ remote reset) - initialize all state, run SROM and BIST Fast Reset Reset n/a FAST_RESET n/a Reset after loss of lockstep - initialize core/caches, no SROM, no BIST - e.g. Tandem re-sync Mild Reset Reset n/a MILD_RESET n/a Reset after core HW error- initialize core, no SROM, no BIST Cbox Reset Interrupts 5 January 2001 -- Subject To Change Compaq Confidential Error Detection and Error Handling 22-5 Disruptions Table 22-2 Summary of Disruption High-Level Features (Continued) Name Posted Time Restart PC PALcode Entry Point 1 Kill Point Description Tepid Reset Reset ri/a MILD_RESET4 ri/a Reset after system HW error - initiali:re core, system/memory interfaces, run SROM, no BIST TPU Restart N-M Inter- ri/a rupt TPU_RESTARf ri/a Another TPU has requested a restart Wakeup Reset ri/a WAKEUP ri/a Wakeup from sleep mode - initialize core/caches, no SROM, no BIST Cbox Service/Error Interrupts ALERT Interrupt Interrupt pc2 INTERRUPf ri/a A remote CPU has signaled an ALERT External Interrupt Interrupt pc2 INTERRUPf ri/a External hardware interrupt IP Bus Correctable Error Interrupt pc2 INTERRUPf ri/a Switchport experienced a correctable (single-bit) ECC error IP Bus Uncorrectable Error Interrupt pc2 INTERRUPf ri/a Switchport experienced an uncorrectable (double-bit) ECC error ProfileMe Service Interrupt pc2 INTERRUPf ri/a Data collection for a ProfileMe instruction pair is complete Rambus Correctable Error Interrupt pc2 INTERRUPf ri/a Rambus experienced an correctable (single-bit/RAID-correctable multibit) ECC error Rambus Uncorrectable Error Interrupt pc2 INTERRUPf ri/a Rambus experienced an uncorrectable (double-bit/RAID-uncorrectable) ECC error Scache Data Correctable ECC Error Interrupt pc2 INTERRUPf ri/a Second-level cache data experienced a correctable (single-bit) ECC error Scache Tag Correctable ECC Error Interrupt pc2 INTERRUPf ri/a Second-level cache tag experienced a correctable (single-bit) ECC error Scache Uncorrectable ECC Error Interrupt pc2 INTERRUPf ri/a Second-level cache tag or data experienced an uncorrectable (doublebit) ECC error Software Interrupt Interrupt pc2 INTERRUPf ri/a Software interrupt TPU PALmode Timeout N-M Inter- pc2 rupt INTERRUPf ri/a A TPU has been in PALmode too long Cbox Logging Interrupts Dcache Parity Error Interrupt pc2 INTERRUPf ri/a Data cache tag or data experienced a parity error Icache Parity Error Interrupt pc2 INTERRUPf ri/a Instruction cache tag or data 
experienced a parity error Load IP Bus Parity Error Interrupt pc2 INTERRUPf ri/a Load switchport experienced a parity error Oustanding DIFT Entry Timeout Interrupt pc2 INTERRUPf ri/a A forwarded DIFf entry has been outstanding too long 22-6 Compaq Confidential Error Detection and Error Handling 5 Jc1nuary 2001 ·- Subject To Change Disruptions Table 22-2 Summary of Disruption High-Level Features (Continued) Name Posted Time Restart PC PALcode Entry Point1 Point Description Outstanding MAP Entry Timeout Interrupt pc2 INTERRUPf n/a A MAP entry has been outstanding too long Store IP Bus Parity Error Interrupt pc2 INTERRUPf n/a Store switchport experienced a parity error TPU Inst. Retirement Timeout Interrupt pc2 INTERRUPf n/a A TPU has not retired any instructions for too long Kill 1 See Table 22-3 2 For these PALcode traps, the Restart PC is nominal; the PALcode handler may elect to not return to the trapping code flow (e.g. in the case of uncorrectable errors). However, this value still needs to be saved in the appropriate IPR. 3 The disruption is reported via the execution-time interface, but only when the disrupting instruction is reported as next-to-retire on the Retire/Kill bus. 4 Since both Cold Reset and Tepid Reset execute from SROM code, which essentially has its own address space, the 21464 overlays their entry points on top of the one for Mild Reset. Table 22-3 Disruption PALcode Entry Points Disruption PALcode Entry Points PC IPRs Implicitly Written Reserved 1 PB + xOOO 11fa Available PB+ x080 n/a DTBM_DOUBLE PB + xlOO EXC_ADDR DTBM_DOUBLE_ALT PB+ x180 EXC_ADDR FEN PB + x200 EXC_ADDR UN ALIGN PB+ x280 EXC_ADDR, EXC_SUM, VA, VA_FORM, M_STAT DTBM_SINGLE PB+ x300 DTBMS_RET_ADDR, EXC_SUM, VA, VA_FORM, M_STAT DFAULT PB+ x380 EXC_ADDR, EXC_SUM, VA, VA_FORM, M_STAT OPCDEC PB + x400 EXC_ADDR IACV PB + x480 EXC_ADDR MCHK PB+ x500 EXC_ADDR, M_STAT ITB_MISS PB + x580 EXC_ADDR, IVA_FORM ARI TH PB+ x600 EXC_ADDR, EXC_SUM INTERRUPT PB + x680 EXC_ADDR MT_FPCR PB + x700 EXC_ADDR IMCHK PB + x780 EXC_ADDR DTBM_SINGLE_CONS PB+ x800 DTBMS_RET_ADDR, EXC_SUM, VA, VA_FORM, M_STAT ITB_MISS_CONS PB + x880 EXC_ADDR, IVA_FORM BAD_JUMP_IVA PB + x900 EXC_ADDR, EXC_SUM FAST_RESET PB + x980 11fa WAKEUP PB + xAOO 11fa TPU_RESTART PB + xA80 11fa Compaq Confidential 5 January 2001 ~· Subject To Change Error Detection and Error Handling 22-7 Disruptions Table 22-3 Disruption PALcode Entry Points (Continued) Disruption PALcode Entry Points PC IPRs Implicitly Written MILD_RESET PB+xBOO llfa DST_NXM PB+xB80 EXC_ADDR Available PB+xCOO n/a Available PB+xC80 llfa Available PB+xDOO llfa Available PB+xD80 n/a Available PB+xEOO llfa Available PB+xE80 llfa Available PB+xFOO llfa Available PB+ xF80 llfa 1 PB + xOOO is reserved as entry point FROM the Swap PALcode (CALL_PAL SWPPAL) routine or the SROM boot codeinto the RESET code sequence. 22.1.2 Low-Level Features Text here ..... Table 22-4 Key to Table 22-5, "Summary of Disruption Low-Level Features' Heading Meaning Name: Name of exception, interrupt, trap, and so forth, such as Integer Overflow. Detected By: box responsible for detecting the disruption. ETypeCode: Encoding of exception type communicated to the Ibox to determine its restart address (symbolic name defined in global I arana_traps. mnh) Completion Prevention: Method of preventing retirement of disrupting instruction or interrupt victim. Values Meaning Map Inflight Zap Never mapped. Invalidated in inflight table. Zapped/retire stalled in completion unit. Any other text here.... 
Table 22-5 Summary of Disruption Low-Level Features Kill Point Detected By EType Code Completion Prevention Bad Jump !stream VA At Ibox AT_RTE_BAD_JUMP_IVA Inflight Ibox Debug Trap At Ibox AT_RTE_I_DBG Inflight Name lbox Disruptions Compaq Confidential 22-8 Error Detection and Error Handling 5 Jc1nw1ry 2001 ·- Subject To Cfumge Disruptions Table 22-5 Summary of Disruption Low-Level Features (Continued) Name Kill Point Detected By EType Code Completion Prevention !stream Access Violation n/a Ibox n/a Map ITB Miss Single n/a Ibox n/a Map ITB Miss Single Console n/a Ibox n/a Map Jump Mispredict After Ibox AT_ETE_JMP_MISPRED Inflight Uncorrectable !stream ECC Error n/a Ibox n/a Map Add Overflow After Eb ox AT_RTE_IOVF Inflight CBR Mispredict After Eb ox AT_ETE_CBR_MISPRED Inflight Ebox Debug Trap At Eb ox AT_RTE_E_DBG Inflight Floating-Point Disabled Fault At Ebox AT_RTE_FPDIS Inflight IFETCHB Issued After Eb ox AT_RTE_IFETCHB lnflight Illegal Instruction At Ebox AT_RTE_OPCDEC Inflight Mui Overflow After Eb ox AT_RTE_IOVF Inflight Native Mode MT_FPCR Issued After Ebox AT_RTE_MT_FPCR Inflight Bad VA Alignment At Mbox AT_ETE_BADVA Zap Bad VA Sign At Mb ox AT_ETE_DST Zap Dstream Access Violation At Mbox AT_ETE_DST Zap DTB Miss Double At Mbox AT_ETE_DTB_DBL In flight DTB Miss Double Alternate At Mb ox AT_ETE_DTB_DBL_ALT Inflight DTB Miss Single At Mbox AT_ETE_DTB_SING Inflight DTB Miss Single Console At Mb ox AT_ETE_DTB_SING_CONS Inflight Fault On (Read/Write) At Mb ox AT_ETE_DST Zap Load Data Parity Error At Mb ox AT_ETE_DST_RPLAY Zap Load Double-Bit ECC Error At Mbox AT_ETE_DST_MCHK Zap Load ErrResp from Memory At Mb ox AT_ETE_DST_MCHK Zap Load Invalidate At Mbox AT_ETE_DST_RPLAY Zap Load NXMResp from Memory At Mbox AT_ETE_DST_MCHK Zap Load Rambus Uncorrectable Error At Mbox AT_ETE_DST_MCHK Zap Load Single-Bit ECC Error At Mbox AT_ETE_DST_RPLAY Zap Load Tag Parity Error At Mbox AT_ETE_DST_RPLAY Zap Ebox Disruptions Mbox Disruptions 5 January 2001 ~·Subject To Change Compaq Confidential Error Detection and Error Handling 22-9 Disruptions Table 22-5 Summary of Disruption Low-Level Features (Continued) Name Kill Point Detected By ETypeCode Completion Prevention Load/Store Order Violation At Mbox AT_ETE_LDST_ORDER Zap Load/Store Synonym Detection At Mbox AT_ETE_DST_RPLAY Zap Mbox Debug Trap At Mbox AT_RTE_M_DBG Zap QUIESCE After Mbox AT_ETE_QUIESCE Zap FCBR Mispredict After Ebox AT_ETE_CBR_MISPRED Inflight FP Trap (SW = 0) After Fbox AT_RTE_SWO [110xxxx] Inflight FP Trap (SW = 1) After Fbox AT_RTE_SW 1 [ 11 lxxxx] In flight FPCR Update After Fbox AT_RTE_FPCR [lOlxxxx] Inflight At Pbox/Qbox AT_RTE_PQ_DBG Inflight Cold Reset n/a Cbox n/a Map Fast Reset n/a Cbox n/a Map Mild Reset n/a Cbox n/a Map Tepid Reset n/a Cbox n/a Map TPU Restart n/a Cbox n/a Map Wakeup n/a Cbox n/a Map ALERT Interrupt n/a Cbox n/a Map External Interrupt n/a Cbox n/a Map IP Bus Correctable Error n/a Cbox n/a Map IP Bus Uncorrectable Error n/a Cbox n/a Map ProfileMe Service n/a Ibox n/a Map Rambus Correctable Error n/a Cbox n/a Map Rambus Uncorrectable Error n/a Cbox n/a Map Scache Data Correctable ECC Error n/a Cbox n/a Map Scache Tag Correctable ECC Error n/a Cbox n/a Map Scache Uncorrectable ECC Error Cbox n/a Map Fbox Disruptions Pbox/Qbox Disruptions Pbox/Qbox Debug Trap Cbox Reset Interrupts Cbox Service/Error Interrupts n/a Compaq Confidential 22-1 o Error Detection and Error Handling 5 J,1m1,1ry 2001 ·-Subject To Change Disruptions Table 22-5 Summary of Disruption Low-Level Features (Continued) Name Kill Point Detected 
By EType Code Completion Prevention Software Interrupt n/a Cbox n/a Map TPU PALmode Timeout n/a Qbox n/a Map Dcache Parity Error n/a Mb ox n/a Map Icache Parity Error n/a Ibox n/a Map Load IP Bus Parity Error n/a Mbox n/a Map Oustanding DIFT Entry Timeout n/a Cbox n/a Map Outstanding MAF Entry Timeout n/a Cbox n/a Map Store IP Bus Parity Error n/a Mbox n/a Map TPU Inst. Retirement Timeout n/a Qbox n/a Map Cbox Logging Interrupts 5 January 2001 ···Subject To Change Compaq Confidential Error Detection and Error Handling 22-11 Disruptions Compaq Confidential 22-12 Error Detection and Error Handling 5 Jam.u~ry 2001 ···Subject To Change Signal Pad Requirements 23 Hardware lnterface 23.1 Signal Pad Requirements Table 23-1 lists the signal pad requirements for the 21464. Table 23-1 Signal Pad Requirements Signal 110/B Pins Type Description RamDataA_L(8,0) B 9 RSL RAM Data RamDataB_L(8,0) B 9 RSL RAMData RamRow_L(2,0) 0 3 RSL RAM Row Control RamCol_L(4,0) 0 5 RSL RAM Column Control RamClkToMaster_H RSL RAM Receive Clock RamClkToMaster_L RSL RAM Receive Clock RamClkFromMaster_H RSL RAM Transmit Clock RamClkFromMaster_L RSL RAM Transmit Clock RamCMD 0 CMOS RAM Control register command RamSCK 0 CMOS RAM Control register clock RamSIO(l,O) B CMOS RAM Serial rd/wr data for register (daisy chained) RamVRef Analog RAM Reference Voltage for above signals RamVTerm Analog RAM Termination Voltage for above signals 2 RamSCL 0 CMOS RAM Presence Detect Clock RamSDA B CMOSOC RAM Presence Detect Data RamClkOut_L 0 ? RAM 400 Mhz clock for distribution to RClk/TClk Subtotal Per-Rambus 39 Subtotal Rambus Signals PortData_L(55,0) IorO 56 ? Port Data PortClock_H(2,0) I orO 6 ? Port Clock PortVRef !Analog Analog Port Reference Voltage for Data & Clock Subtotal Per-Port 63 Srom_Data_H Subtotal Port Signals ? Serial ROM data/receive data Srom_Clk_H 0 ? Serial ROM clock/transmit data Srom_OE_L 0 ? Serial ROM output enable Compaq Confidential 5 January 2001 ···Subject To Change Hardware Interface 23-1 Signal Pad Requirements Table 23-1 Signal Pad Requirements Signal Pins Type Description ? JTAG test data in ? JTAG test data out Trst_L ? JTAG test reset Tck_H ? JTAG test clock Tms_H ? JTAG test mode select ? Test??? Clkln_H ? Clock input, differential Clkln_L ? Clock input, differential reset_L ? Processor reset DcOK_H ? System DC power OK PllBypass_H ? Bypass internal PLL 1/0/B Tdi_H Tdo_H 0 TestStat_H 0 PllVdd !Analog ? PLL Supply voltage VddSel !Analog ? 
Supply selection

Subtotal, common signals: 16

Signal totals:
Rambus signals: 39 per Rambus * 10 = 390
Port signals: 63 per port * 10 = 630
Common signals: 16
Total signals: 1036

24 New Instructions

//This is a placeholder for this chapter.//

25 System Configurations

//This is a placeholder for a new chapter.//

26 Physical Addressing and Input/Output

27 Requirements to Support "Tandem"

A Instruction Decoding

This appendix defines the exact behavior of instruction decoding in the 21464. It is not a rewrite of the Alpha System Reference Manual (the SRM); rather, it clarifies some of the exact implementation details. The target audience is the design and verification teams, but the information might also be useful to compiler developers or anyone who generates assembly code by hand. The instruction set is organized in ascending order by opcode value, or by instruction type for Load and Store, Jump and Branch, and PALcode instructions.

Instruction decoding is a distributed event. The Ibox, Pbox, Ebox, Fbox, and Mbox all decode portions of the Istream. To ease verification, we want to ensure that all boxes that decode instructions make the same assumptions. For the instructions defined by the SRM this is straightforward, but there are many unused function codes and combinations of instruction bits that the SRM defines only as producing UNPREDICTABLE behavior. Verifying that several boxes that separately decode UNPREDICTABLE instructions do not cause hangs or otherwise violate the requirements of the SRM would be a tedious task at best. This appendix specifies an instruction decoding that uniquely maps all unused function codes to a known behavior. Assuming that all boxes in the 21464 use this decoding scheme, behavior is easily predictable and we avoid cross-box bugs in which the instruction stream is interpreted differently by different boxes.

Because of the large number of instructions and function codes, the decoding descriptions are broken into opcode groups.
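As a minimal illustration of the first decode step that every box performs, the sketch below extracts the 6-bit opcode field (instruction bits <31:26>) and maps it to one of the groups listed in Table A-1 below. The function names and the partial group mapping are illustrative assumptions, not the 21464 decode logic.

    /* Sketch only: extract the opcode and name its decode group (Table A-1). */
    #include <stdint.h>
    #include <stdio.h>

    static unsigned opcode_of(uint32_t inst) { return (inst >> 26) & 0x3F; }

    static const char *opcode_group(uint32_t inst)
    {
        unsigned op = opcode_of(inst);
        if (op == 0x00)               return "CALL_PAL";
        if (op >= 0x01 && op <= 0x07) return "Reserved";
        if (op >= 0x08 && op <= 0x0F) return "Load and store";
        if (op == 0x10)               return "Integer add/sub/compare";
        if (op >= 0x20 && op <= 0x2F) return "Load and store";
        if (op >= 0x30 && op <= 0x3F) return "Branch";
        return "Other operate/miscellaneous group";
    }

    int main(void)
    {
        uint32_t ldq = 0xA4000000;    /* opcode 29 (LDQ) with zeroed operand fields */
        printf("opcode %02x -> %s\n", opcode_of(ldq), opcode_group(ldq));
        return 0;
    }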
Table A-1 Opcode Groups Opcode Type 00 01-07 08-0F 10 11 12 13 14 15 Call_PALL Reserved Load and store Integer add/sub/compare Integer logical Integer shift Integer multiply ITOFx and FSQRT VAX floating-point Format In Section PALcode PALcode Memory Displacement Integer Operate Integer Operate Integer Operate Integer Operate Floating Operate Floating Operate A.6.1 A.6.2 A.6.12 A.6.3 A.6.4 A.6.5 A.6.6 A.6.7 A.6.8 Compaq Confidential 5 January 2001 -·Subject To Change Instruction Decoding A-1 Instruction Format Table A-1 Opcode Groups (Continued) Opcode Type Format In Section 16 IEEE floating-point 17 Miscellaneous floating-point 18 Miscellaneous lA Jump lC Multimedia 19,lD HW_MxPR lB,lF HW_LD/ST lE IFETCHB 20-2F Load and store 30-3F Branch Floating Operate Floating Operate Memory Function Memory Function Integer Operate IPR Memory Displacement PALcode Memory Displacement Branch A.6.9 A.6.10 A.6.11 A.6.14 A.6.13 17.2 17.1 A.6.12 A.6.14 A.1 Instruction Format The Alpha SRM defines several instruction formats and specifies the format for each instruction. The 21464 generally uses the SRM-defined formats with the following exceptions: • The FTOix instructions are decoded as Integer Operate rather than the Floating Operate specified in the SRM. • Instructions listed as Memory format in the SRM are explicitly categorized as either Memory Displacement format or Memory Function format in the 21464. • Instructions listed as Misc format by the SRM are decoded as Memory Function format by the 21464. Figure A-1 Instruction Formats 31 PALcode: I I I I I 2625 Opcode 31 Branch: Opcode Opcode 31 Opcode I I I I 0 Ra Rb 11 Literal 11 1615 Fb I 26 31 31 2625 Opcode 31 Memory Function: 24 I I 2120 Ra 2625 Opcode 2120 I I I I 12 I Fe 0 5 4 Index I 0 Wclass I 0 Displacement 1615 Rb 0 Re 1615 Rb 2120 Ra • 16 I Re 5 4 12 I Rclass I Rb 0 5 4 Function I Re Index 26 Opcode I I I 0 5 4 11 Function Opcode I Memory Displacement: I 5 4 Function 13 2120 Fa - 16 2120 Ra I Displacement 2120 2625 31 IPR: Ra 2625 Opcode I 2120 2625 31 Floating Operate: 0 PALcode function 2625 31 Integer Operate: I I I I I 0 Function I I Compaq Confidential A-2 Instruction Decoding 5 Jc1nw~ry 2001 ~· Subject To Change Predecodes The instruction format defines the location of the operand specifiers and any function code bits. The function code further subdivides the opcode into many separate instructions. The decode tables in this document define a decoding of function code bits that form a non-overlapping map of every possible bit combination. The behavior of every instruction bit pattern is known and consistently decoded throughout the 21464. The tables use the Alpha SRM mnemonics (in upper-case) to identify instructions. Mnemonics listed in lower-case do not exactly map to a specific Alpha instruction. A.2 Predecodes The lbox does a quick partial decode of instructions defining several buckets useful to the early stages of the pipeline. The predecode logic identifies an instruction as belonging to one of the following 23 groups. 
Table A-2 Predecode Logic Groups Dest PreDec Bits2 p 00010 Instruction Type Format PreDec Type 1 SrcA CALL_PAL instruction PALcode XXP Floating conditional branch Branch FXX Ra Floating-point load operation Memory SIF s Rb Fa 11000 Floating-point operation Floating FFF Fa Fb Fe 11100 Floating-point store instruction Memory FIS Fa Rb s 11101 FTOI instruction Integer FXI Fa Re 01110 HW _MFPR instruction IPR RXI R Re 00110 HW _MTPR instruction IPR RIW R w 10110 Integer conditional branch Branch IXX Ra Integer load operation Memory Sii s Rb Ra 10000 Integer operation Integer III Ra Rb Re 00100 Integer operation with Rb a literal Integer IXI Ra Re 00101 Integer store instruction (Not STx_C) Memory IIS Ra s 11111 ITOF instruction Floating IXF Ra Fe 00111 LDQ_U instruction Memory SUI s Rb Ra 11001 Misc with no A operand Memory XII Rb Ra 11011 Misc with no operands Memory xxx Misc with no result Memory IIX Ra Rb MT_FPCR instruction Floating FFC Fa Fb c 11110 RPCC instruction Memory XIY Rb Ra 01011 Rs I Re VAX compatibility Memory XXN Ra 00011 Store conditional Memory IIL L 00000 Unconditional branch Branch XXI Ra 00001 SrcB 01100 Rb 01000 Rb 01001 Ra Rb 01111 Compaq Confidential 5 January 2001 -~ Subject To Change Instruction Decoding A-3 Instruction latency 1 The three-character type identifier defines the type of the A operand, B operand, and result, as follows: Character Meaning C F I L N P R S U W X Y 2 Floating-Point Control register Floating-point register or result Integer register or result Lock flag value Interrupt flag value PALmode shadow register Sl (CALL_PAL only) IPR reader class specifier Store Set identifier Unaligned address operand IPR writer class specifier No operand or No result Cycle counter IPR The Ibox IFU predecode bits EDCBA. See Section 3.8.2.3.1. For Opcodes 10, 11, 12, 13 and lC, bit<l2> of the instruction defines whether a literal or a register is used for Rb. In the tables, the predecode is listed as I?I for these instructions. They predecode to III if bit<12> is clear (register Rb operand) or IXI if bit<12> is set (literal Rb operand). Not every instruction is defined exactly as the predecodes suggest. Many instructions identified as III or FFF do not require two input operands (ex. SEXT, SQRT). In most of these cases the SRM requires the unused register to be R31/F31 which results in the exact same treatment as if the extra predecodes had existed. The few exceptions are listed in the format discussion below. A.3 Instruction Latency Defines the parent-to-child issue latency. Also identifies any cross-pipeline delay associated with broadcasting the parents results to other pipelines. Instructions that are not pipelined are also identified as "bubbling" for completion. For Example: n N cycle latency to a child in any pipeline m+n M cycle latency plus extra n cycle to other pipelines. n+B N cycle latency non-pipelined, requires bubble (B) to signal completion. A.4 Execution Pipelines Identifies which of the eight pipelines the instruction can execute in. The actual slotting algorithm is a function of the types and positions of the instructions in each map block. Details about instruction slotting can be found at<???>. 
Just because an instruction is slotted to a particular pipeline does not mean it must execute there, follow-me capabili- Compaq Confidential A-4 Instruction Decoding 5 Jcwuary 2001 - Subject To Change Instruction Info (INST.JNF0<15:0>} ties in the Qbox allow instructions whose operands are data-ready in another allowed pipeline in the same half of the Queue to issue from that pipeline. Pipelines 0, 2, 5 and 7 are in one half of the Queue, pipes 1, 3, 4, 6 are in the other half. Format Meaning 0-7 Can execute in any pipe Can execute in pipes 0, 1, 2, or 3. Can execute in only pipes 0 or 3 Can execute in only pipes 0 or 1 and not both in the same cycle. Can execute in pipes 0, 1, 2, or 3, but does not issue to the same pipe in consecutive cycles 0-3 0,3 o. . . 1 Alt 0-3 A.5 Instruction Info (INST_INF0<15:0>) To optimize the efficiency of internal queues, the instruction longword is not passed throughout the chip but compressed into two separate fields. The opcode field contains the original 6-bit instruction opcode but the rest of the instruction longword is compressed into a 16-bit inst_info field based on instruction format. The general rule is: Instruction Format Contents of INST_INF0<15:0> PALcode Memory/IPR Otherwise OR(inst<25:15>), inst<14:0> inst<15:0> inst<20:5> The only exceptions to the general rule follow: Instruction INST_INF0<15:0> RPCC RS/RC inst<15: 13>, index:OblOll 1000, inst<4:0> inst<l5: l>, flag A.6 Specific Opcode and Instruction Type Decoding A.6.1 Opcode 00, CALL_PAL The CALL_PAL instruction is only executed in combination with a valid PALcode instruction. For example, a valid combination for OpenVMS is CALL_PAL BPT, with an opcode/function code of 00.0080. Valid PALcode instructions and their function codes are specified in the Alpha SRM according to operating system. The CALL_PAL instruction issues on pipelines 0-1 with a latency of 5. A.6.2 Opcodes 01 through 07, Reserved Opcodes 01through07 are reserved for the 21464. They predecode to XXX and if executed, return an OPCDEC (or opDec) fault. Compaq Confidential 5 January 2001 -· Subject To Change Instruction Decoding A-5 Specific Opcode and Instruction Type Decoding A.6.3 Opcode 10, Integer Add/Subtract/Compare Integer Add/Subtract/Compare instructions. 
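For reference alongside the decode tables that follow, here is a minimal sketch of the general compression rule from Section A.5; the format tags and function name are illustrative only (not the 21464 RTL), and the two exceptions to the rule (RPCC and RS/RC) are deliberately omitted.

    /* Sketch only: general INST_INFO<15:0> packing rule from Section A.5. */
    #include <stdint.h>
    #include <stdio.h>

    enum inst_format { FMT_PALCODE, FMT_MEMORY, FMT_IPR, FMT_OTHER };

    static uint16_t inst_info(uint32_t inst, enum inst_format fmt)
    {
        switch (fmt) {
        case FMT_PALCODE: {
            /* INST_INFO<15> = OR(inst<25:15>); INST_INFO<14:0> = inst<14:0> */
            uint16_t or_hi = ((inst >> 15) & 0x7FF) ? 1 : 0;
            return (uint16_t)((or_hi << 15) | (inst & 0x7FFF));
        }
        case FMT_MEMORY:
        case FMT_IPR:
            return (uint16_t)(inst & 0xFFFF);           /* inst<15:0> */
        default:
            return (uint16_t)((inst >> 5) & 0xFFFF);    /* inst<20:5> */
        }
    }

    int main(void)
    {
        uint32_t ldq = 0xA4230010;   /* LDQ r1, 16(r3): opcode 29, 16-bit displacement */
        printf("INST_INFO = 0x%04x\n", inst_info(ldq, FMT_MEMORY));
        return 0;
    }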
Table A-3 Opcode 10 Instruction Decoding Function Code 21464 Decode Mnemonic Predecode Pipelines Latency 00 xOO OxOx ADDL I?I 0-7 1+1 02 xOO Oxlx S4ADDL I?I 0-7 1+1 09 xOO lOOx SUBL I?I 0-7 l+l OB xOO 101x S4SUBL I?I 0-7 1+1 OF xxx 111x CMPBGE I?I 0-7 l+l 12 XOl Oxxx S8ADDL I?I 0-7 1+1 1B XOl lOxx S8SUBL I?I 0-7 1+1 1D 001 110x CMPULT I?I 0-7 1+1 20 xlO OxOx ADDQ I?I 0-7 1+1 22 xlO Oxlx S4ADDQ I?I 0-7 1+1 29 xlO lOOx SUBQ I?I 0-7 1+1 2B xlO 101x S4SUBQ I?I 0-7 1+1 2D OxO 110x CMPEQ I?I 0-7 1+1 32 xll Oxxx S8ADDQ I?I 0-7 1+1 3B xll lOxx S8SUBQ I?I 0-7 1+1 3D 011 110x CMPULE I?I 0-7 1+1 4D 10x 110x CMPLT I?I 0-7 1+1 6D llx 110x CMPLE I?I 0-7 1+1 The specific logic functions within the Integer adder are selected as: Table A-4 Opcode 1O Specific Logic Functions Within the Integer Adder 21464 Decode Mnemoic Description xxx Oxxx ADD Add operations xxx 10xx SUB Subtract operations xxx llxx CMP Compare operations xxo xxOx so Ra used unshifted xxo xxlx S4 Ra shifted left two bits before use xxl xxxx S8 Ra shifted left three bits before use lxO OXOX ADDxN Enable overflow/under flow exception trapping lxO lOOX SUBx/V Enable overflow/under flow exception trapping xOx xxxx Long 32-bit inputs/outputs sign extended into 64-bits xlx xxxx Quad 64-bit inputs/outputs Compaq Confidential A-6 Instruction Decoding 5 Jc1m.1c1ry 2001 -~ Subject To Cfwnge Specific Opcode and Instruction Type Decoding A.6.4 Opcode 11, Integer Logical Integer Logical instructions. Table A-5 Opcode 11 Instruction Decoding Function Code 21464 Decode Mnemonic Predecode Pipelines Latency 00 oox OOxx AND I?I 0-7 1+1 08 OOx lxxx BIC I?I 0-7 1+1 14 oox OlOx CMOVLBS I?I 0-7 1+1 16 OOx Ollx CMOVLBC I?I 0-7 1+1 20 Olx OOxx BIS I?I 0-7 1+1 24 Olx OlOx CMOVEQ I?I 0-7 1+1 26 Olx Ollx CMOVNE I?I 0-7 1+1 28 Olx lxxx ORNOT I?I 0-7 1+1 40 lOx OOxx XOR I?I 0-7 1+1 44 lOx OlOx CMOVLT I?I 0-7 1+1 46 lOx Ollx CMOVGE I?I 0-7 1+1 48 lOx lxxx EQV I?I 0-7 1+1 61 llx OOxx AMA SK I?I 0-7 1+1 64 llx OlOx CMOVLE I?I 0-7 1+1 66 llx Ollx CMOVGf I?I 0-7 1+1 68 llx lOxx CMOV2 I?I 0-7 1+1 6C llx llxx IMPLVER I?I 0-7 1+1 A.6.5 Opcode 12, Integer Shift The mskbh, insbh and extbh decodes are not formally defined by the Alpha SRM because all combinations of inputs produce a zero result. The generalized decoding in the 21464 Integer Shifter does not special case these code points and will produce a zero result. 
Table A-6 Opcode 12 Instruction Decoding Function Code 21464 Decode Mnemonic Predecode Pipelines Latency 02 000 OOlx MSKBL I?I 0-3 1+1 06 xOO 0110 EXTBL I?I 0-3 1+1 OB xOO 1011 INSBL I?I 0-3 1+1 12 001 OOlx MSKWL I?I 0-3 1+1 16 xOl 0110 EXTWL I?I 0-3 1+1 1B xOl 1011 INSWL I?I 0-3 1+1 Compaq Confidential 5 January 2001 -~ Subject To Change Instruction Decoding A-7 Specific Opcode and Instruction Type Decoding Table A-6 Opcode 12 Instruction Decoding (Continued) Function Code 21464 Decode Mnemonic Predecode Pipelines Latency 22 010 OOlx MSKLL I?I 0-3 1+1 26 xlO 0110 EXTLL I?I 0-3 1+1 2B xlO 1011 INSLL I?I 0-3 1+1 30 xxx 0000 ZAP I?I 0-3 1+1 31 xxx 0001 ZAPNOT I?I 0-3 1+1 32 011 OOlx MSKQL I?I 0-3 1+1 34 xxx OlOx SRL I?I 0-3 1+1 36 xll 0110 EXTQL I?I 0-3 1+1 39 xxx lOOx SLL I?I 0-3 1+1 3B xlO 1011 INSQL I?I 0-3 1+1 3C xxx llxx SRA I?I 0-3 1+1 42 100 OOlx Mskbh I?I 0-3 1+1 47 xOO 0111 Insbh I?I 0-3 1+1 4A xOO 1010 Extbh I?I 0-3 1+1 52 101 OOlx MSKWH I?I 0-3 1+1 57 xOl 0111 INSWH I?I 0-3 1+1 5A xOl 1010 EXTWH I?I 0-3 1+1 62 110 OOlx MSKLH I?I 0-3 1+1 67 xlO 0111 INSLH I?I 0-3 1+1 6A xlO 1010 EXTLH I?I 0-3 1+1 72 111 OOlx MSKQH I?I 0-3 1+1 77 xll 0111 INSQH I?I 0-3 1+1 7A xll 1010 EXTQH I?I 0-3 1+1 A.6.6 Opcode 13, Integer Multiply Integer Multiply Instructions. Table A-7 Opcode 13 Instruction Decoding Function Code 21464 Decode Mnemonic Predecode Pipelines Latency 00 xOx xxxx MULL I?I 4,5 5 20 xlO xxxx MULQ I?I 4,5 5 30 xll xxxx UMULH I?I 4,5 5 Com p.aq Confidentia I A-8 Instruction Decoding 5 Jc1nuary 2001 -· Subject To Change Specific Opcode and Instruction Type Decoding The specific logic functions within the Integer adder are selected as: Table A-8 Opcode 13 Specific Logic Functions Within the Integer Adder 21464 Decode Qualifier lxO xxxx 110 xxxx MULLN Enable overflow/under flow exception trapping for MULQN MULL and MULQ instructions Description A.6.7 Opcode 14, ITOFx and Floating-Point Square Root Integer to Floating register transfer and Floating square root instructions. For ITOFx instructions, the Ebox format converts Ra and multiplexes the result into the Fbox load datapath. SQRT instructions only issue on even cycles and are not pipelined. Table A-9 Opcode 14 Instruction Decoding Function Code 21464 Decode 1 Mnemonic Predecode Pipelines Latency 004 xxx xxOO Oxxx ITOFS IXF 6,7 5 014 xxx xxOl Oxxx ITOFF IXF 6,7 5 024 xxx xxlx Oxxx ITOFT IXF 6,7 5 xOA ttt rrOx lxxO SQRTF FFF Alt 0-3 18 + B + 1 xOB ttt rrOx lxxl SQRTS FFF Alt0-3 18 + B + 1 x2A ttt rrlx lxxO SQRTG FFF Alt 0-3 33+B+1 x2B ttt rrlx lxxl SQRTI FFF Alt 0-3 33+B+1 For SQRT instructions, the ttt and rr fields define the trapping and rounding modes, and all modes are defined for each function code (see below). The 21464 generates an OPCDEC (illegal instruction) trap for any opcode 14 function code that is not defined. rr ttt Ox lx OxO Oxl lxO lxl IC None Chopped Normal (default) None Imprecise (default) IV IS Underflow Enable Exception completion enabled Underflow & Exception enabled /SU The FBOX decodes these modes for IEEE instructions (SQRTS, SQTRT) as follows: rr ttt 01 IC JM Minus Infinity 10 None Normal (default) 00 Chopped 11 ID OxO None Dynamic Imprecise (default) Oxl /U Underflow Enable lOx /SU Software completion w/underflow llx /SUI Software completion w /inexact Compaq Confidential 5 January 2001 -· Subject To Change Instruction Decoding A-9 Specific Opcode and Instruction Type Decoding A.6.8 Opcode 15, VAX Floating-Point VAX floating-point instructions. 
Table A-10 Opcode 15 Instruction Decoding Function Code 21464 Decode 1 Mnemonic Predecode Pipelines Latency xOO ttt rrOx xOOO ADDF FFF 0-3 3+1 xOl ttt rrOx xOOl SUBF FFF 0-3 3+1 x02 ttt rrOx xOlO MULF FFF 0-3 3+1 x03 ttt rrOx xOll DIVF FFF Alt 0-3 9+B+l xlE ttt rrOx llxx CVTDG FFF 0-3 3+1 x20 ttt rrlx xOOO ADDG FFF 0-3 3+1 x21 ttt rrlx xOOl SUBG FFF 0-3 3+1 x22 ttt rrlx xOlO MULG FFF 0-3 3+1 x23 ttt rrlx xOll DIVG FFF Alt0-3 13 + B + 1 x25 ttt xxxx OlOx CMPGEQ FFF 0-3 3+1 x26 ttt xxxx 0110 CMPGLT FFF 0-3 3+1 x27 ttt xxxx 0111 CMPGLE FFF 0-3 3+1 x2C ttt rrlO 1100 CVTGF FFF 0-3 3+1 x2D ttt rrlO 1101 CVTGD FFF 0-3 3+1 x2F ttt rrlO lllx CVTGQ FFF 0-3 3+1 x3C xxx rrll llOx CVTQF FFF 0-3 3+1 x3E xxx rrll lllx CVTQG FFF 0-3 3+1 1 The ttt and rr fields define the trapping and rounding modes. The Fbox decodes these modes for VAX instructions as shown below. rr ttt Ox lx OxO Oxl lxO lxl IC None None /U IS /SU Chopped Normal (default) Imprecise (default) Underflow Enable Exception completion enabled Underflow & Exception enabled The CMPxxx instructions only define one trap option. The txx mode is decoded as follows: txx Oxx lxx None IS Imprecise (default) ForCMPxxx Compaq Confidential A-10 Instruction Decoding 5 January 2001 -- Subject To Change Specific Opcode and Instruction Type Decoding A.6.9 Opcode 16, IEEE Floating-Point IEEE floating point instructions. Table A-11 Opcode 16 Instruction Decoding Function Code 21464 Decode 1 Mnemonic Predecode Pipelines Latency xOO ttt rrOx xOOO ADDS FFF 0-3 3+1 xOl ttt rrOx xOOl SUBS FFF 0-3 3+1 x02 ttt rrOx x010 MULS FFF 0-3 3+1 x03 ttt rrOx x011 DIVS FFF Alt0-3 9+B+1 x20 ttt rrlx xOOO ADDT FFF 0-3 3+1 x21 ttt rrlx xOOl SUBT FFF 0-3 3+1 x22 ttt rrlx x010 MULT FFF 0-3 3+1 x23 ttt rrlx x011 DIVT FFF Alt0-3 13+B+l x24 txx xxxx 0100 CMPIUN FFF 0-3 3+1 x25 txx xxxx 0101 CMPTEQ FFF 0-3 3+1 x26 txx xxxx 0110 CMPfLT FFF 0-3 3+1 x27 txx xxxx 0111 CMPfLE FFF 0-3 3+1 x2C ttt rrxO 110x CVTTS FFF 0-3 3+1 2AC tlO xxxO 110x CVTST FFF 0-3 3+1 x2F ttt rrxO lllx CVTTQ FFF 0-3 3+1 x3C txx rrxl 110x CVTQS FFF 0-3 3+1 X3E txx rrxl lllx CVTQT FFF 0-3 3+1 1 The ttt and rr fields define the trapping and rounding modes. The Fbox decodes these modes for VAX instructions as shown below: rr ttt 00 01 10 11 OxO Oxl lOx llx IC /M None ID None /U /SU /SUI Chopped Minus Infinity Normal (default) Dynamic Imprecise (default) Underflow Enable Software completion w/underflow Software completion w/inexact The CMPxxx and CVTQx instructions only define one trap option. The txx mode is decoded as follows: txx Oxx lxx None /SU Imprecise (default) For CMPxxx, /SUI For CVTQx Compaq Confidential 5 January 2001 ·-Subject To Change Instruction Decoding A-11 Specific Opcode and Instruction Type Decoding Unlike any other floating-point instruction, decoding of the trap mode bits differentiates CVTST and CVTTS instructions. The special decoding is: 000 Oxl lOx 111 010 110 ttt tlO CVTIS CVTIS CVTIS CV TIS CVTST CVTST Imprecise (default) Underflow enable Software w/Underflow Software w/Inexact Imprecise (default) Software denormal fixup None /U /SU /SUI None IS A.6.1 O Opcode 17, Miscellaneous Floating-Point For FCMOVxx instructions, the Ibox will scan for a FCPYS->R31 instruction immediately following the FCMOVxx instruction and if found replace it with a FCMOV2 instruction. If the instruction following a FCMOVxx is not a FCPYS->R31, the Ibox tags the FCMOVxx instruction as legacy. Legacy FCMOVxx instructions terminate a map-block and are repeated in the next map-block. 
The Pbox then converts the repeated FCMOVxx instruction (which is always in position 0 of the new map-block) to the FCMOV2. The Ebox detects user-mode MT_FPCR instructions and traps to PALmode to fix-up. Table A-12 Opcode 17 Instruction Decoding Function Code 21464 Decode Mnemonic Predecode Pipelines Latency xlO xxx xxOl xxxx CVTLQ FFF 0-3 3+1 x20 xxx xxxO 0000 CPYS FFF 0-3 1+1 x21 xxx xxxO 0001 CPYSN FFF 0-3 1+1 x22 xxx xxxO OOlx CPYSE FFF 0-3 1+1 x24 xxx xxxO OlxO MT_FPCR FFC 0,3 x25 xxx xxxO Olxl MF_FPCR FFF 0,3 3+1 x68 xxx xxxO lOOx FCMOV2 FFF 0-3 1+1 x2A xxx xxxO 1010 FCMOVEQ FFF 0-3 1+1 x2B xxx xxxO 1011 FCMOVNE FFF 0-3 1+1 x2C xxx xxxO 1100 FCMOVLT FFF 0-3 1+ 1 x2D xxx xxxO 1101 FCMOVGE FFF 0-3 1+1 x2E xxx xxxO 1110 FCMOVLE FFF 0-3 1+ 1 X2F xxx xxxO 1111 FCMOVGT FFF 0-3 1+1 CVTQL FFF 0-3 3 X30 1 ttt 1 xxll xxxx Only CVTQL has a defined trap mode, as shown below: ttt OxO Oxl lxx None /U /SU Imprecise (default) Underflow Enable Software completion w/underflow Compaq Confidential A-12 Instruction Decoding 5 Januc1ry 2001 - Subject To Change Specific Opcode and Instruction Type Decoding A.6.11 Opcode 18, Miscellaneous TRAPB, EXCB and FETCHx instructions never actually issue from the Qbox but are completed immediately and therefore act as NOPs. MBs also never formally issue from the Qbox but are instead sent to the Mbox as soon as they enter the Qbox. MB instructions do not complete until the Mbox notifies the Qbox that the necessary conditions have been met. The WH64EN instruction is currently proposed as ECO#l27 to the Alpha SRM. Table A-13 Opcode 18 Instruction Decoding Function Code 21464 Decode Mnemonic Predecode 0000 OOxx xOxx xxxx xxxx TRAPB 0400 OOxx xlxx xxxx xxxx EXCB 4000 0 lxx OOxx xxxx xxxx MB xxx xxx xxx 4400 · Olxx Olxx xxxx xxxx WMB IIX 4800 0 lxx lxxx xxxx xxxx (MB) 8000 1OOx xxxx xxxx xxxx FETCH AOOO 1010 xxxx xxxx xxxx FETCH_M xxx xxx xxx BOOO 1011 OOxx xxxx xxxx LDL_ARM Sii 6,7 3 B400 1011 0 lxx xxxx xxxx LDQ_ARM Sii 6,7 3 B800 1011 lxxx xxxx xxxx QUIESCE IIX 4,5 cooo 11 Ox xxxx xxxx xxxx RPCC XIY 0-1 5 EOOO 1110 Oxxx xxxx xxxx RC XXN 4,5 1+1 E800 1110 1Oxx xxxx xxxx ECB IIX 4,5 ECOO 1110 llxx xxxx xxxx CCB IIX 4,5 FOOO 1111 Oxxx xxxx xxxx RS XXN 4,5 F800 1111 lOxx xxxx xxxx WH64 IIX 4,5 FCOO 1111 llxx xxxx xxxx WH64EN IIX 4, 5 Pipelines Latency 4,5 1+1 A.6.12 Load and Store Instructions Load and store instructions. Table A-14 Load and Store Instruction Decoding Opcode Mnemonic Predecode Pipelines Latency 08 LDA XII 0-7 1+1 09 LDAH XII 0-7 1+1 OA LDBU SII 6,7 3 OB LDQ_U SUI 6,7 3 oc LDWU SII 6,7 3 Compaq Confidential 5 January 2001 -· Subject To Change Instruction Decoding A-13 Specific Opcode and Instruction Type Decoding Table A-14 Load and Store Instruction Decoding (Continued) Opcode Mnemonic Predecode Pipelines Latency OD STW IIS 4,5 32 OE STB IIS 4,5 32 OF STQ_U IIS 4,5 32 20 LDF SIF 6,7 5 21 LDG SIF 6,7 5 22 LDS SIF 6,7 5 23 LDT SIF 6,7 5 24 STF FIS 4,5 32 25 STG FIS 4,5 32 26 STS FIS 4,5 32 27 STT FIS 4,5 32 28 LDL Sii 6,7 3 29 LDQ Sii 6,7 3 2A LDL_L Sii 6,7 3 2B LDQ_L Sii 6,7 3 2C STL IIS 4,5 32 2D STQ IIS 4,5 2E STL_C IIL 4 '5 2F STQ_C IIL 4 '5 1 32 1 3 1 3 Store Conditional instructions issue as stores to pipelines 4 and 5 but bubble back completion to the QBOX, and the final completion of the STx_C instruction appears on the load pipes 6 and 7. 2 Although store instructions do not produce a register result and therefore do not have normal dependents, the IBOX store-set logic can create dependency groups of loads and stores. 
A load that is store-set dependent on a store instruction will have an effective issue latency of three cycles from the issue of the store. A.6.13 Opcode 1C, Integer Multimedia Integer multimedia instructions. Table A-15 Opcode 1C Instruction Decoding Function Code 21464 Decode Mnemonic Predecode Pipelines Latency 00 000 OOxO SEXTB I?I 0-7 1+1 01 000 OOxl SEXTW I?I 0-7 1+1 04 000 OlxO CMPWGE I?I 2,3 5 05 000 Olxl CMPLGE I?I 2,3 5 08 000 lxxO PERMB8 I?I 0' 1 5 Compaq Confidential A-14 Instruction Decoding 5 Jc1nuc1ry 2001 - Subject To Change Specific Opcode and Instruction Type Decoding Table A-15 Opcode 1C Instruction Decoding (Continued) Function Code 21464 Decode Mnemonic Predecode Pipelines Latency 09 000 lxxl GPKBLB4 I?I 0,1 5 10 001 OOzz VADDzzz I?I 2,3 5 14 001 OlOx VADDUL2 I?I 2,3 5 16 001 Ollx VADDSL2 I?I 2,3 5 18 001 lOzz VSUBzzz I?I 2,3 5 lC 001 110x VSUBUL2 I?I 2,3 5 1E 001 lllx VSUBSL2 I?I 2,3 5 20 010 OOzz VMINMAXzzz I?I 2,3 5 24 010 OlOx VMINMAXUL2 I?I 2,3 5 26 010 Ollx VMINMAXSL2 I?I 2,3 5 28 010 1000 PKUWB8 I?I 0' 1 5 29 010 1001 PKULW4 I?I 0,1 5 2A 010 1010 PKSWB8 I?I 0,1 5 2B 010 1011 PKSLW4 I?I 0' 1 5 2C 010 1100 UPKUBW4 I?I 0' 1 5 2D 010 1101 UPKUWL2 I?I 0' 1 5 2E 010 1110 UPKSBW4 I?I 0' 1 5 2F 010 1111 UPKSWL2 I?I 0' 1 5 30 011 0000 CTPOP I?I 2,3 5 31 011 0001 PERR I?I 2,3 5 32 011 0010 CTLZ I?I 2,3 5 33 011 0011 CTIZ I?I 2,3 5 34 011 0100 UNPKBW I?I 0,1 5 35 011 0101 UNPKBL I?I 0' 1 5 36 011 0110 PKWB I?I 0' 1 5 37 011 0111 PKLB I?I 0' 1 5 38 011 1000 MINSB8 I?I 2,3 5 39 011 1001 MINSW4 I?I 2,3 5 3A 011 1010 MINUB8 I?I 2,3 5 3B 011 1011 MINUW4 I?I 2,3 5 3C 011 llzz MAXzzz I?I 2,3 5 40 100 OOzz TADDzzz I?I 2,3 5 44 100 Olzz TSUBzzz I?I 2,3 5 48 100 lOzz TABSERRzzz I?I 2,3 5 Compaq Confidential 5 January 2001 ··· Subject To Change Instruction Decoding A-15 Specific Opcode and Instruction Type Decoding Table A-15 Opcode 1C Instruction Decoding (Continued) Function Code 21464 Decode Mnemonic Predecode Pipelines Latency 4C 100 llzz TSQERRzzz I?I 2,3 5 50 101 OOzz TMULzzz I?I 2,3 5 54 101 OlxO TMULUSB8 I?I 2,3 5 55 101 Olxl TMULUSW4 I?I 2,3 5 59 101 lxOx VMULLUW4 I?I 2,3 5 5b 101 lxlx VMULHUW4 I?I 2,3 5 60 110 0000 VSRB8 I?I 0' 1 5 61 110 0001 VSRW4 I?I 0' 1 5 62 110 0010 VSRAB8 I?I 0' 1 5 63 110 0011 VSRAW4 I?I 0' 1 5 64 110 OlOx VSRL2 I?I 0' 1 5 66 110 Ollx VSRAL2 I?I 0' 1 5 68 110 lOxO VSLB8 I?I 0' 1 5 69 110 lOxl VSLW4 I?I 0' 1 5 6C 110 llxx VSLL2 I?I 5 70 111 Ox.xx FTOIT FXI 0' 1 4,5 78 111 lxxx FTOIS FXI 4,5 3 3 A.6.14 Branch and Jump Instructions Branch and Jump instructions. Table A-16 Branch and Jump Instruction Decoding Opcode 21464 Decode Mnemonic Predecode Pipelines Latency lA.O 0 Oxx xxxx xx ... JMP XII 0-1 5 lA.1 Olxx xxxx xx ... JSR XII 0-1 5 lA.2 1 Oxx xxxx xx ... RET XII 0-1 5 lA.3 1 lxx xxxx xx ... 
JSR_CO XII 0-1 5 30 BR XXI 0-1 5 31 FBEQ FXX 0-1 32 FBLT FXX 0-1 33 FBLE FXX 0-1 34 BSR XXI 0-1 35 FBNE FXX 0-1 36 FBGE FXX 0-1 5 Compaq Confidential A-16 Instruction Decoding 5 Jc1nuc1ry 2001 ~· Subject To Change Specific Opcode and Instruction Type Decoding Table A-16 Branch and Jump Instruction Decoding (Continued) Opcode 21464 Decode Mnemonic Predecode Pipelines 37 FBGf FXX 0-1 38 BLBC IXX 0-7 39 BEQ IXX 0-7 3A BLT IXX 0-7 3B BLE IXX 0-7 3C BLBS IXX 0-7 3D BNE IXX 0-7 3E BGE IXX 0-7 3F BGf IXX 0-7 Latency A.6.15 PALcode Instructions The MSB of the index field of the HW_MTPR instruction indicates the destination: O=Mbox 1 = Ibox Table A-17 PALcode Instruction Decoding Opcode 21464 Decode1 Mnemonic Predecode Pipelines Latency Ox19 xxxO i i ii iiix xxxx xxxl iiii iiix xxxx HW_MFPR HW_MFPR RXI RXl 4-5 0-1 5 5 OxlB ttts SXXX XXXX XXXX HW_LD SIT 6,7 3 OxlD xxxO iiii iiiw wwww xxxl iiii iiiw wwww HW_MTPR HW_MTPR RIW RIW 6,7 0-1 1132 12 OxlE xxxx xxxx xxxx xxxx IFETCHB xxx 4,5 OxlF ttts SXXX XXXX XXXX HW_ST IIX 4,5 1 The decode bit symbols i, r, s, and t are defined as follows: For HW_LD and HW_ST: ttt ss Type of memory reference to perform. See the Type field in Table 17-1. Size of the data transaction. See the Length field in Table 17-1. For HW_MFPR and HW_MTPR: iiii iiii Identifier of the WR to read. See the Index field in Tables 17-2 and 17-3 and Table 16-1. rrrr Reader class of the instruction. See the Rclass field in Tables 17-2 and 17-3. w wwww Writer class of the instruction. See the Wclass field in Table 17-3. Compaq Confidential 5 January 2001 ··· Subject To Change Instruction Decoding A-17 Specific Opcode and Instruction Type Decoding 2 HW_MTPR instructions can specify a writer class to create an issue dependency to future HW_MxPR instructions. HW_MxPR instructions that identify a reader class dependency are scheduled to issue no earlier than 1 cycle after the HW_MTPR instruction that wrote the class dependency. HW_MTPR instructions can also specify writer class dependencies that are satisfied on completion rather than issue. HW_MxPR instructions that identify a reader class dependency against this type of writer class are scheduled to issue no earlier than 3 cycles after the issue of the completion bubble signal for the writer. The 21464 only allows specifying completion dependencies for Mbox HW_MTPR instructions; the completion bit is ignored for lbox destinations. Compaq Confidential A-18 Instruction Decoding 5 Jc1m.1c1ry 2001 ~·Subject To CfJange Relationship Between SMT and LD:x____ARM/QUIESCE B LDx_ARM/QUIESCE Instruction Characteristics The 21464 supports simultaneous multithreading (SMT), where up to 4 threads (or processes) share the resources of the CPU. On an SMT CPU, a spin-lock loop wastes CPU resources that could be used by other processes or threads that are executing. We are proposing two new instructions for the Alpha architecture, LDx_ARM and QUIESCE, which will permit a thread to wait on a memory location without actively spinning. 8.1 Relationship Between SMT and LDx_ARM/QUIESCE The 21464 is implementing Simultaneous Multithreading because of the boost in throughput it provides when running independent programs, and because of the expected performance improvement for decomposed application. For independent programs executing simultaneously, performance studies show roughly 100% increase in throughput, compared with running the programs only one at a time. 
June 1998 results showed:

Programs                                  1-threaded IPC (harmonic mean)   4-threaded IPC   4T/1T IPC increase
Compress, Gcc, M88ksim, Go (int)          2.38                             5.15             2.16x
Tomcatv, Applu, Swim, Povray (float)      3.50                             6.12             1.75x
SQL traces (database)                     1.33                             3.40             2.55x

IPC = Instructions Per Cycle

We anticipate that SMT will also be very useful for decomposed applications, where one program is broken into multiple threads and locking protocols are used by each thread to control access to shared data. Preliminary results show speedups from 1.1x to 2.5x (Cilk results from CRL). These results were produced with code that used QUIESCE; other runs done without QUIESCE showed no speedup at all, or even degradation. The reason for the poor performance without QUIESCE is the way threads wait for access to a lock, as explained in the following paragraphs.

An integral part of many locking protocols is a busy-wait loop, often referred to as a spin lock. In a spin lock, a process loops reading a particular memory location, waiting for it to change to a specific value before proceeding. Once the value has changed, the process is free to attempt an atomic update of the location, thus obtaining the lock.
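The kind of busy-wait loop just described can be sketched generically; the fragment below uses portable C11 atomics rather than the Alpha LDx_L/STx_C sequence, and the names are illustrative. On an SMT processor, every pass through the inner loop consumes fetch and issue resources that the other TPUs could have used, which is exactly the waste LDx_ARM/QUIESCE is meant to avoid.

    /* Sketch only: a conventional spin lock, written with generic C11 atomics. */
    #include <stdatomic.h>

    static atomic_int lock = 0;                  /* 0 = free, nonzero = held */

    static void acquire(void)
    {
        for (;;) {
            while (atomic_load(&lock) != 0)
                ;                                /* spin until the value changes */
            int expected = 0;                    /* then attempt the atomic update */
            if (atomic_compare_exchange_weak(&lock, &expected, 1))
                return;                          /* update succeeded: lock obtained */
        }
    }

    static void release(void)
    {
        atomic_store(&lock, 0);
    }

    int main(void)
    {
        acquire();
        /* critical section */
        release();
        return 0;
    }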
B.2 Goals for the LDx_ARM and QUIESCE Instruction Definition We would like to achieve the following goals in our definition of these instructions: • Define the instructions such that in-order execution of LDx_ARM and QUIESCE is ensured. This is accomplished through the defined dependency on watch_flag LDx_ARM sets it and QUIESCE uses it as a condition on its operation. • Eliminate possibility of a race between the lock just becoming available, and quiescing the machine. This is accomplished by having the LDx_ARM load the lock value so that code can test the lock before executing the QUIESCE. • • May be used either in PALmode or in normal mode . Have code using these instructions still be functional if executed in older machines . B.2.1 Specific LDx_ARM Instruction Characteristics The following sections contain the specific characteristics and requirements that define the LDL_ARM and LDQ_ARM instructions. B-2 Compaq Co11fide11tia I LDx_ARM/QUIESCE Instruction Characteristics 5 Jc1nuary 2001 ···Subject To Change Goals for the LDx____ARM and QUIESCE Instruction Definition B.2.1.1 Instruction Description The mnemonics/description for the LDx_ARM instructions are: Mnemonic LDL_ARM LDQ_ARM Description Load sign-extended longword from memory to register and arm Load quadword from memory to register and arm The LDx_ARM instructions are described in a manner that is as similar as possible to the LDx_L instructions, except the LDx_L instructions affect lock_flag and locked_physical_address, while the LDx_ARM instructions affect watch_flag and watch_physical_address. Note however, that LDx_ARM, because they use the Memory/function code instruction format, have no displacement. Instruction Format: LDx_ARM Ra, {Rb.ab) ! Mfc fo:rmat Operation: va <- Rbv CASE big_endian_data: va' <- va XOR 000 {base 2) !LDQ_ARM big_endian_data: va' <- va XOR 100 {base 2) !LDL_ARM little_endian_data: va' <- va !LDL_ARM ENDCASE watch_f lag <- 1 watch_physical_address <- PHYSICAL_ADDRESS{va) Ra<- SEXT{ (va')<31:0>) !LDL_ARM Ra<- {va')<63:0> !LDQ_ARM Exceptions: Access Violation Alignment Fault on Read Translation Not Valid Qualifiers: None LDx_ARM is used in conjunction with QUIESCE to idle a process while waiting for a shared resource (rather than looping and continually testing the lock bit). The virtual address is in register Rb. For a big-endian longword access, va<2> (bit 2 of the virtual address) is inverted, and any memory management fault is reported for va (not va'). The source operand is fetched from memory, sign-extended for LDL_L, and written to register Ra. If the LDx_ARM instruction encounters an exception, it is treated just as for a LDQ instruction. 5 January 2001 -·Subject To Change Compaq Confidential LDx_ARM/QUIESCE Instruction Characteristics B-3 Goals for the LDx____ARM and QUIESCE Instruction Definition When a LDx_ARM instruction is executed without faulting, the processor records the target physical address in a per-processor watch_physical_address register and sets the per-processor watch_flag. If the per-processor watch_flag is (still) set when a QUIESCE instruction is executed, the processor quiesces, as described for the QUIESCE instruction. Processor A causes the clearing of a set watch_flag in Processor B by doing any of the following in B's watched range of physical addresses: a successful store, a successful store_conditional, or executing a WH64 instruction that modifies data on processor B. 
A processor's watched range is the aligned block of 2**N bytes that includes the watch_physical_address. The 2**N value is implementation dependent, and must match the lock range implemented for LDx_L and STx_C. It is at least 16 (minimum lock range is an aligned 16-byte block) and is at most the page size for that implementation (maximum lock range is one physical page). A processor's watch_flag is cleared if that processor's implementation-specific quiesce timeout counter expires, as described for the QUIESCE instruction. A processor's watch_flag is also cleared if that processor encounters a CALL_PAL REI, CALL_PAL rti, or CALL_PAL rfe instruction. A processor's watch_flag is cleared if that processor encounters a CALL_PAL retsys (return from syscall) or CALL_PAL urti (return from user mode trap). It is UNPREDICTABLE whether or not a processor watch_flag is cleared on any other CALL_PAL instruction. It is UNPREDICTABLE whether a processor's watch_flag is cleared by that processor executing a WH64 or ECB instruction. The watch_flag may also be cleared for implementation-specific reasons. It is UNPREDICTABLE whether the watch_flag is cleared by an interrupt to the pro- cessor, whether or not the processor is in PALmode. The fallowing sequence: LDQ_ARM <branch to GetLock if lock available> QUIESCE GetLock: when executed on a given processor, will quiesce the processor if the branch to GetLock is not taken, and will continue execution if the branch is taken. Notes B-4 • The conditions which clear watch_flag are intended to cover the cases when a change of control, occurring between a LDx_ARM and a QUIESCE, may have executed a LDx_ARM and changed the watch_physical_address value. We don't want to quiesce on the wrong value of watch_physical_address. • Executing a LDx_ARM on one processor does not affect any architecturally visible state on another processor, and in particular cannot clear watch_flag on another processor, causing the other processor to come out of a quiescent state. Note: Without this restriction, two processors executing LDQ_ARM/QUIESCE sequences could be continually re-arming each other. Compaq Confidential LDx_ARM/QUIESCE Instruction Characteristics 5 Jmuuiry 2001 ···Subject To Change Goals for the lDx____ARM and QUIESCE Instruction Definition LDx_ARM and QUIESCE instructions need not be paired. In particular, a LDx_ARM may be followed by a conditional branch: on the fall-through path a QUIESCE is executed, whereas on the taken path no matching QUIESCE is executed. If two LDx_ARM instructions execute with no intervening QUIESCE, the second one overwrites the state of the first one. If two QUIESCE instructions execute with no intervening LDx_ARM, the second one never quiesces the processor because watch_flag was clear after execution of the first, whether it quiesced the processor or not. • • • Software will not emulate any LDx_ARM instruction. If the physical address of the LDx_ARM and the physical address intended to be watched are not within the same naturally aligned 16-byte sections of physical memory, the processor may continue to be quiesced despite another processor's store to the watched range; hence, care should be taken to specify the addresses with correct alignment. If any other memory access (ECB, LDx, LDDQ_U, STx, STQ_U, WH64) is executed on the given processor between the LDx_ARM and the QUIESCE, the QUIESCE may always fail on some implementations. Note: Otherwise, a direct-mapped TB could thrash. 
Or, the memory reference could change the contents of the cache which the implementation might depend upon. It should be possible to always code very few instructions between the LDx_ARM and the QUIESCE. • If a branch is taken between the LDx_ARM and the QUIESCE, the sequence above may always fail (processor will not quiesce) on some implementations. (CMOVxx may be used to avoid branching.) • If a subsetted instruction (for example, floating-point) is executed between the LDx_ARM and the QUIESCE, the QUIESCE may always fail on some implementations because of the Illegal Instruction Trap. • If an instruction with an unused function code is executed between the LDx_ARM and the QUIESCE, the QUIESCE may always fail on some implementations because an instruction with an unused function code is UNPREDICTABLE. • If a large number of instructions are executed between the LDx_ARM and the QUIESCE, the QUIESCE may always fail on some implementations because of a timer interrupt always clearing the watch_flag before the sequence completes. • Execution of a WH64 instruction on processor A to a region within the watched range of processor B, where the execution of the WH64 changes the contents of memory, causes the watch_flag on processor B to be cleared. If the WH64 does not change the contents of memory of processor B, it need not clear the watch_flag. Implementation Notes • When not in PALmode, the signalling of an interrupt should clear the watch flag, so that QUIESCE cannot be used as a way to delay interrupt processing. 5 January 2001 ···Subject To Change Compaq Confidential LDx_ARM/QUIESCE Instruction Characteristics B-5 Goals for the LDx____ARM and QUIESCE Instruction Definition • The watch_flag and watch_physical_address register must be loaded simultaneously with reading the value of the lock. Hardware must ensure that even if the lock value becomes unlocked immediately after reading it, and before the QUIESCE is executed, watch_flag will be cleared and will prevent the processor from quiescing (the QUIESCE will fail, as should happen). Note: in some sense, this is a performance issue, not a functional issue: if the watch_flag is not cleared due to the change in the lock, QUIESCE_TIMER will eventually time out and end the quiesce period. • Since watch_flag and watch_physical_address are implicitly written by LDx_ARM and implicitly read by QUIESCE, implementations must ensure that any speculative execution of those instructions preserves the read-order and write-order of watch_flag and watch_physical_address, as intended in the original program. For example, in the code below, if the first branch is incorrectly predicted taken, the second LDx_ARM must not be allowed to affect the behavior of the first QUIESCE by changing watch_physical_address. LDQ_ARM Rl, (RS) BEQ Rl, test QUIE'SCE test: LDQ_ARM Rl, (RS) BEQ Rl, xxx QUIE'SCE B.2.2 Specific QUIESCE Instruction Characteristics The following sections contain the specific characteristics and requirements that define the QUIESCE instruction. The rrmemonic/description for the QUIESCE instruction are as follows: Mnemonic Description QUIESCE Quiesce Conditional Format ! 
Mfc format QUIESCE Operation: IF (watch_flag != 0) THEN start QUIESCE_TIMER suspend program execution resume execution ~en watch_flag==O Exceptions: NONE B-6 Compaq Confidential LDx_ARM/QUIESCE Instruction Characteristics 5 J~1nuary 2001 ···Subject To Change Goals for the LDx.___ARM and QUIESCE Instruction Definition Qualifiers: None QUIESCE checks the watch_flag to see if it is set; if it is, the processor starts the implementation-specific QUIESCE_TIMER and pauses execution of this instruction stream. When the watch_flag is cleared, execution begins again. It is implementation-specific exactly how/if the machine pauses execution and when exactly it resumes. If the watch_flag is set, the QUIESCE instruction is considered complete at the beginning of the quiescent period. The implementation-specific QUIESCE_TIMER starts counting when the QUIESCE determines that the processor is going to quiesce. After some implementation-specific finite period of time, QUIESCE_TIMER expires and clears the watch_flag. If the quiesce period ends before the QUIESCE_TIMER expires, the QUIESCE_TIMER must be stopped, to prevent it clearing watch_flag after a future LDx_ARM. After the quiescent period, execution resumes at the instruction following the QUI ESCE, or, if the QUIESCE was terminated because watch_flag was cleared by an interrupt, execution may resume at an interrupt servicing routine. By definition, the watch_flag is clear at the end of the quiescent period, since the quiescent period cannot end until watch_flag is clear. Implementation notes • If an interrupt causes a processor to end a quiescent period and immediately start executing the interrupt servicing routine, that ISR may return to the QUIESCE instruction only if watch_flag is guaranteed to be clear. If it is not, the ISR must return to the instruction after the QUIESCE, since the value of watch_physical_address may have been changed by a LDx_ARM executed while servicing the interrupt. • If an interrupt occurs during a quiescent period, an implementation does not have to start the ISR immediately after the QUIESCE; it may choose to delay execution of the ISR until some later point in the instruction stream. • An implementation may allow software to specify the QUIESCE timeout period through an IPR. A timeout value of 0 would effectively disable the QUIESCE instruction. • The implementation-specific maximum timeout value should not exceed n microseconds (where n is TBD ). Software/Hardware Note The quiesce timeout counter is useful/necessary for the following reasons: • The timeout enables the implementation of a backoff algorithm, where a process can deschedule itself after some period of time if it hasn't gotten the lock. • The quiesce timeout counter prevents a processor from deadlocking if there is a coding error. • Suppose the code updating the memory location takes an access violation and never gets to unlock the lock. The quiesce timeout allows the waiting processor to wake up and discover the problem with checking code. 5 January 2001 ···Subject To Change Compaq Confidentia I LDx_ARM/QUIESCE Instruction Characteristics B-7 Goals for the LDx____ARM and QUIESCE Instruction Definition Software Note If a longer quiesce period is desired than that provided by a given implementation, soft- ware can accomplish that by looping and quiescing repeatedly. 
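To summarize the architectural state defined above, the following is a minimal behavioral model, in C, of watch_flag, watch_physical_address, and the quiesce timeout as seen by one TPU. It is a software sketch, not the hardware design; the names, the fixed 16-byte watched range, and the cycle-granular timer are simplifying assumptions (the real watched range is implementation specific, and the timer is described further in Section B.4.2).

    /* Sketch only: per-TPU behavioral model of the LDx_ARM/QUIESCE state. */
    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    typedef struct {
        bool     watch_flag;
        uint64_t watch_physical_address;
        uint64_t quiesce_timer;          /* cycles remaining; 0 = not running */
    } tpu_watch_state;

    /* LDx_ARM retires without faulting: arm the watch. */
    static void ldx_arm_retire(tpu_watch_state *t, uint64_t pa)
    {
        t->watch_flag = true;
        t->watch_physical_address = pa;
    }

    /* Another processor stores to (or WH64-modifies) physical address pa. */
    static void remote_write(tpu_watch_state *t, uint64_t pa)
    {
        if (t->watch_flag &&
            (pa & ~0xFULL) == (t->watch_physical_address & ~0xFULL)) {
            t->watch_flag = false;       /* ends any quiescent period */
            t->quiesce_timer = 0;        /* timer must be stopped as well */
        }
    }

    /* QUIESCE retires: quiesce only if still armed, and start the timeout. */
    static bool quiesce_retire(tpu_watch_state *t, uint64_t timeout_cycles)
    {
        if (!t->watch_flag)
            return false;                /* condition already satisfied; continue */
        t->quiesce_timer = timeout_cycles;
        return true;                     /* TPU pauses until watch_flag clears */
    }

    /* One CPU cycle during the quiescent period: expiry clears watch_flag. */
    static void cycle(tpu_watch_state *t)
    {
        if (t->quiesce_timer != 0 && --t->quiesce_timer == 0)
            t->watch_flag = false;
    }

    int main(void)
    {
        tpu_watch_state t = { false, 0, 0 };
        ldx_arm_retire(&t, 0x1000);
        bool slept = quiesce_retire(&t, 10000);  /* 10K-cycle default timeout */
        cycle(&t);                               /* time passes while quiesced */
        remote_write(&t, 0x1008);                /* same 16-byte block: wake up */
        printf("quiesced=%d awake=%d\n", slept, !t.watch_flag);
        return 0;
    }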
B.2.2.1 Data Sharing Using LDx_ARM/Quiesce Efficient Data Sharing in a Simultaneous Multithreaded Processor In a simultaneous multithreaded (SMT) CPU, multiple threads, or processes, are executed simultaneously while sharing the resources of a single CPU. On an SMT CPU, a spin-lock loop wastes CPU resources that could be used by other processes or threads that are executing. The LDx_ARM and QUIESCE instructions are used in a simultaneous multithreaded CPU to keep a thread from consuming resources while it waits for a lock. An example code sequence using the quiesce operation follows. In this program, R5 contains the address of a lock. The program is spinlocking on the lock until it is 0. RO is loaded with the value of the lock. GetLock: LDQ_L RO, (RS} ;load the lock value BNEQ RO, HandleBusyLock ; i f not available, quiesce <modify RO> STQ_C RO (RS} ;store new lock value if lock_flag still set BEQ RO, GetLock ;if store conditional failed, try again I <critical section> ; we have the lock, now do the real work <clear lock> ;done RET HandleBusyLock: SMI' bit in AMASK IDA R2, Ox400 (R31} ;set bit 10, AMASK R2, R2 ;test whether SMT processor BEQ R2, CheckLock ;if no SMT, skip quiesce LDQ_ARM RO, (RS} ;load the lock value at address RS into RO ;put lock address into watch_physical_address ; set watch_flag BEQ RO, GetLock ;if lock available, t:ry to get it ;if watch_flag set, go quiet QUIESCE CheckLock: 1DQ RO, (RS} ;load lock value again BEQ RO, GetLock ;if available, try for it again <check for spinning on lock too long> BR HandleBusyLock ;loop again Compaq Confidential B-8 LDx_ARM/QUIESCE Instruction Characteristics 5 Jc1nuary 2001 -·Subject To Change Proposed Opcode Assignments In that code sequence, testing the lock just after the LDQ_ARM is crucial to performance in the case where the lock is available - otherwise the code would quiesce for no reason. Having the QUIESCE fall through into the CheckLock section allows us to check the lock again, in case the QUIESCE ended for some other reason than a change in the lock value. Note however that for a lock which is highly contended, the lines "BEQ RO, GetLock" will mispredict when the lock is finally given up, assuming that we issued QUIESCE multiple times before getting a chance at the lock. This mispredict will slow down the attempt to get the lock. Note also that if we execute the LDQ_ARM and we don't QUIESCE, because we branch away to get the lock, the watch_flag will still be set. It will continue to be set until it is cleared by one of the conditions given for clearing watch_flag. This should have no actual effect on machine hardware since it won't be quiesced at the time. The fact that watch_flag is set when a QUIESCE is not actually watching for anything is harmless - the next LDx_ARM which executes will load a new watch_physical_address and set watch_flag whether or not it is already set. B.3 Proposed Opcode Assignments We propose the following opcode assignments Table B-2 Proposed LDx_ARM/QUIESCE Opcode Assignments Mnemonic Instruction Type Opcode LDL_ARM Mfc 18.BOOO LDQ_ARM Mfc 18.B400 QUIESCE Mfc 18.B800 Ideally, we would choose opcodes with the following characteristics: • • • They are memory format instructions, for ease of implementation . They are NOPs to all pre-EV8 Alpha processors . LDx_ARM has a displacement field. If we found opcodes meeting these criteria, QUIESCE code could be written without using AMAS K to condition the code based on the processor type. 
(If code depended on the register value loaded by the LDx_ARM, an ordinary load would be needed before the LDx_ARM, to accomplish the load operation in older machines.) We do not believe it is possible to find opcodes that look like NOPs to all older implementations, and also fit our other criteria. For example, we could use opcodes in the 11.xx (integer logical) category, such as were used for AMASK; however, these are operate format, not memory format. We conclude that QUIESCE code sequences will have to be conditioned with AMASK. Since this is the case, we'd like to choose opcodes with the following characteristics: • They are memory format instructions, for ease of implementation . • LDx_ARM instructions are illegal operations to all pre-EV8 Alpha processors . • LDx_ARM has a displacement field. 5 January 2001 -~Subject To Change Compaq Confidentia I LDx_ARM/QUIESCE Instruction Characteristics B-9 Implementation The Miscellaneous category of opcodes ( 18.xxxx) provides memory format instructions. But, this category has no displacement field, since that field is used as the function field. This gives LDx_ARM a dissimilarity from LDx_L: LDx_ARM cannot have a displacement when specifying the load address. Note: Matt Reilly tried all the 18.xxxx opcodes and found that most of the unused ones trap (illegal instruction trap) on the 21164, but are NOPs on the 21064 and the 21264. So, we don't think we can meet the goal of having the LDx_ARM instructions trap on all previous implementations. B.4 Implementation The design intent is that the LDx_ARM/Quiesce mechanism have the following properties: • For the most part, it makes use of hardware or architectural features or components that already exist to support single threaded uniprocessor operation. • The quiesce instructions are implemented such that speculative or spurious execution of these instructions in any form or sequence will not result in an UND EFINED operation. This is accomplished by deferring most state changes related to LDx_ARM and QUIESCE until retire time. • The quiesce operation eliminates the possibility of quiescing the TPU just as the lock becomes available (this is necessary for conformance to the instruction definitions). • Restarting after a quiesce is low-overhead . The LDx_ARM instruction looks very much like a load-lock. The load-lock returns load data, sets lock_flag and loads lock_physical_address. LDx_ARM returns load data, sets watch_flag and loads watch_physical_address. The load data is returned to the LDx_ARM at the time the cache access is done, before retire, as is done also for load-lock. The watch_flag and watch_physical_address are updated when the LDx_ARM retires Gust as lock_flag and lock_physical_address are changed when the load-lock retires). When the LDx_ARM is executed, it is put into the Load Queue in the Mbox. Among other things, the entry contains the physical address specified by the LDx_ARM. If, at any time before the LDx_ARM retires, a memory write of any type occurs to that physical address, the LDx_ARM is aborted and scheduled to be reexecuted. At the time when the LDx_ARM retires successfully, watch_flag is set and watch_physical_address is loaded with the address from the LDx_ARM. The LDx_ARM instruction, since it has the characteristics of LDx, may incur a DTB miss. In this case the DTB miss is handled before watch_flag and watch_physical_address are affected (since they don't change until retire time). Similarly, the QUIESCE effects occur at the point of the QUIESCE retiring. 
This ensures that the instructions are executed in-order, and the watch_flag/ watch_physical_address values loaded by the LDx_ARM are what are used by the QUIESCE. B-10 Compaq Confidential LDx_ARM/QUIESCE Instruction Characteristics 5 Januc1ry 2001 --Subject To Change Implementation If the watch_flag is set when the QUIESCE instruction retires, this TPU enters sleep mode. In this case, all instructions subsequent to the QUIESCE are flushed, the QUIESCE_TIMER is started, and the QUIESCE retires. Certain hardware resources on the chip are deallocated from the quiescent TPU (described below). Once watch_flag is cleared, this TPU becomes active again, the map thread chooser resumes mapping at the instruction following the QUIESCE, and the hardware resources are reallocated back to the reactivated TPU. We implement the QUIESCE trap by treating a QUIESCE as a WMB. When it is issued to the MBOX, the MBOX makes an entry in the store queue for it and waits for the QBOX to signal that the QUIESCE instruction is the next to retire. The MBox then checks the watch_flag: if it is set, the MBOX signals a trap to flush the subsequent instructions. The QUIESCE is allowed to retire. The quiesce trap is analogous to a branch-mispredict, in that it kills the wrongly speculated instructions and restarts at the next correct instruction after the branch. The QUIESCE retires, all instructions after the QUIESCE are killed, and instruction fetch restarts at the next instruction after the QUIESCE. Alternate method: The trap clears out all subsequent instructions and instruction fetch restarts at the QUIESCE. In this case, the second QUIESCE would never successfully quiesce the machine, since watch_flag would by definition be clear since no LDx_ARM instruction has set it again. Or, if an interrupt was serviced in this TPU just after finishing the trap - the watch_flag would have been cleared on the REI. After the TPU's instructions are flushed, instruction fetch resumes for the quiesced TPU. Only at the map thread chooser is this TPU idled (not chosen). This means that when the thread restarts, it can start from the map point, which is much faster than starting at instruction fetch. This brings the idle TPU's instructions as far forward as possible into the pipe, without using Inum space or registers for those waiting instructions. B.4.1 Interaction of Interrupts and QUIESCE When a TPU is quiescing, it kills all following instructions and starts refetching from the instruction following the quiesce. These instructions enter the pipe up to the map stage, where they are not chosen for mapping until the quiesce is over. If the watch_flag is cleared due to an interrupt, the pipeline is already full of the instruc- tions following the QUIESCE. The Ibox starts fetching the ISR but does not disturb the instructions already in the pipeline. Thus, the ISR will be executed at some point downstream from the QUIESCE instruction. If a branch mispredict on the previously fetched code kills the ISR code, the TPU needs to remember to service the interrupt. This works because the interrupt signal is levelsensitive, and is only cleared once the interrupt servicing routine code is successfully executing. If an interrupt is directed to a quiesced TPU, the watch_flag is cleared so that the qui- esce period will end immediately, in the interest of getting to the interrupt as soon as possible. 
If an interrupt is pending to a quiesced TPU, any attempted setting of the watch_flag by an LDx_ARM fails, so that the TPU will not quiesce. Again, this is in the interest of getting to the ISR as soon as possible. This situation could come up if a QUIESCE is already in the pipe with an ISR coming along behind it.
B.4.2 Quiesce-Related Hardware
• QUIESCE_TIMEOUT_VALUE[3:0]<19:0> IPR. One per TPU, each 20 bits wide. This is an implementation-specific IPR, which specifies a limit to the number of CPU cycles that may elapse between the QUIESCE instruction and watch_flag being cleared. This IPR is writable by PALcode; it does not have to be readable, as it is never modified by hardware. The value in this register is used to load the QUIESCE_TIMER (internal 21464 hardware). The default value loaded by hardware at power-up is 10K cycles, which proved to be an effective timeout period in simulation. QUIESCE_TIMEOUT_VALUE[3:0]<19:0> is loaded by startup (boot) code. The timeout value can be specified up to 2^20, or 1048576 (1M), cycles.
Note: For ease of implementation, it may be useful to have the bottom two bits be free running, so that the incrementer only has to cycle every fourth cycle.
If software wanted to use a different QUIESCE_TIMEOUT_VALUE for each process that is scheduled on a TPU, then QUIESCE_TIMEOUT_VALUE would have to become part of the process context. We are assuming this is not the case; instead, QUIESCE_TIMEOUT_VALUE is loaded by power-up code for each TPU.
Note: Allowing a different value for each TPU is necessary to provide the capability of running virtual machines, that is, the ability for different TPUs to run different operating systems simultaneously. If we rule out this design alternative, one single QUIESCE_TIMEOUT_VALUE IPR, used by all TPUs and loaded at power-up, is sufficient.
• QUIESCE_TIMER[3:0]<19:0>. This is hardware internal and not accessible by software. There is one QUIESCE_TIMER per TPU. The QUIESCE_TIMER is loaded with the value in the QUIESCE_TIMEOUT_VALUE IPR when the QUIESCE instruction retires. It then decrements once per CPU cycle. When it reaches zero, watch_flag is cleared, and the timer remains at zero until restarted by the next QUIESCE retiring. Since QUIESCE_TIMER[n] is only started when a QUIESCE retires on TPU[n], it is guaranteed to count down to zero eventually; it can't be restarted speculatively by another QUIESCE. Also, this implementation should give reproducible results, as desired for verification and also for Tandem.
• QUIESCE_TIMEOUT[3:0]. Each TPU has its own QUIESCE_TIMEOUT signal. This signal is asserted for one cycle when QUIESCE_TIMER reaches zero. This assertion has the effect of clearing watch_flag.
• watch_flag[3:0]. As specified by the LDx_ARM and QUIESCE instructions; one per TPU.
• watch_physical_address[3:0]<43:4>. One per TPU. Note that this register does not have to be the full width of the physical address; it could be less wide. In this case watch_flag would be cleared more frequently than with the full-width address.
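The timer behavior described by these IPRs can be modeled compactly. The following C sketch is illustrative only, assuming simple per-TPU arrays and a once-per-cycle tick function; none of these names are the actual 21464 array or signal names.

    #include <stdint.h>

    #define NUM_TPUS 4

    static uint32_t quiesce_timeout_value[NUM_TPUS];  /* 20-bit IPR, written by PALcode */
    static uint32_t quiesce_timer[NUM_TPUS];          /* internal hardware, one per TPU */
    static int      watch_flag[NUM_TPUS];

    /* Called when a QUIESCE retires on TPU n with watch_flag set. */
    static void quiesce_timer_start(int n)
    {
        quiesce_timer[n] = quiesce_timeout_value[n] & 0xFFFFF;   /* 20 bits */
    }

    /* Modeled once per CPU cycle for each TPU.  The timer decrements toward
     * zero, clears watch_flag when it gets there (the QUIESCE_TIMEOUT pulse),
     * and then stays at zero until the next QUIESCE retires on this TPU. */
    static void quiesce_timer_tick(int n)
    {
        if (quiesce_timer[n] == 0)
            return;
        if (--quiesce_timer[n] == 0)
            watch_flag[n] = 0;
    }

Because the timer is reloaded only when a QUIESCE retires, it cannot be restarted speculatively, which is what guarantees the eventual wake-up and the reproducibility noted above.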
B.4.3 Reallocation of Hardware Resources During Quiesce
For as long as TPU[j] is quiesced, one of the four bits M%QUIESCE_TPU_Q19A_H[3:0] is asserted, which has the following effects:
• The thread map chooser no longer chooses instructions from TPU[j].
• The Inum allocator allocates no more Inums to TPU[j] and, as TPU[j] frees Inums, allows them to be allocated to other, active TPUs.
• Load queue entries are repartitioned among the remaining active TPUs. (Need more description here.)
• Store queue entries are repartitioned among the remaining active TPUs. (Need more description here.)
• As a result of Inum reallocation, non-architectural physical registers gradually migrate from TPU[j] to other TPUs, as they are freed. (More description needed here.)
• Other effects worth mentioning?
When watch_flag is cleared, M%QUIESCE_TPU_Q19A_H[3:0] is deasserted and machine resources are gradually given back to the previously quiesced TPU, so that it can resume execution. At this point, instructions are waiting in the collapsing buffer, ready to be mapped once chosen by the map chooser. The TPU is out of Inums, and it may take some time for Inums to become available for the TPU; this is probably the critical resource as far as restarting this TPU is concerned.
We estimate it will cost about 100 cycles for a TPU to quiesce and wake up. It will take about 50 cycles, on average, from the time a TPU comes out of quiesce to the time it executes its first instruction. The delay is because, while quiesced, the TPU gave up resources (Inums, registers, load queue and store queue entries). It can only gradually get those resources back as other TPUs retire instructions. It is not desirable to reserve an Inum block for a quiesced TPU, because we would not want to do that for a TPU that is not being used at all, and we want to use the same mechanisms no matter why the TPU is inactive.
B.4.4 Issues to Consider While Finalizing the Hardware Design
• How much performance is lost because LDx_ARM, QUIESCE, LDx_L, and STx_C all wait until retire time to have an effect?
• If 4 TPUs are using a lock heavily, is the hardware fair in passing the lock from TPU to TPU? Consider 2 EV8 CPUs, each of which has 4 TPUs. A contended lock will have the same kinds of concerns between CPUs as exist today with load-lock/store-conditional: the local TPUs will have an advantage over the remote ones. We need to avoid the situation where the winner keeps winning on repeated uses of the lock.
• Does a TPU that quiesces and times out repeatedly eat up too much of the CPU from the other TPUs?
• For real-time applications, should we have a mode so that a quiesced TPU would not give up resources at all (except being scheduled for execution slots)?
• How can we make branch predictions go the right way, so that a non-contended lock works fast? Can we build a branch predictor hint in somehow?
B.5 Alternative Proposals to the LDx_ARM/QUIESCE Current Design
As the current proposal for LDx_ARM/QUIESCE was being developed, a number of alternatives were considered but not chosen. The alternatives are presented here as background material.
B.5.1 Timer-Based
A purely timer-based approach was studied at CRL, using the 21464 model, but was found not to work. A QUIESCE instruction that watched the memory location until it changed was needed to obtain speedups.
B.5.2 Unified QUIESCE Instruction
QUIESCE Ra, (Rb)
This QUIESCE is a load. If a QUIESCE is executed when the watch_flag is clear, it loads watch_physical_address, sets watch_flag, and does not quiesce the processor. Thus, it acts as an ARM. If a QUIESCE is executed when the watch_flag is set, it does quiesce the processor. It then stays quiesced until watch_flag is cleared by a store to watch_physical_address.
For the "first" QUIESCE, the load data can be tested by subsequent instructions to find out if the lock is held. For a "second" QUIESCE, it is unclear what that load means or when it is loaded. It would be nice to load it at the end of the QUIESCE period, to see what it has changed to, but this is very difficult to implement.
Pros:
• Just one instruction.
Cons:
• More difficult to understand and implement.
• Two flavors of the instruction, "first" and "second", are hard to think about.
• Returning meaningful load data to the second QUIESCE would be difficult.
• Specifying what can or can't happen "between" QUIESCEs seems unmanageable.
B.5.3 Use Architectural Registers to Enforce LDx_ARM/QUIESCE Dependency
Here, LDQ_ARM is a load and QUIESCE is a store, of sorts.

    LDQ_ARM R0, (R5)        this is a load
    BEQ     R0, getlock
    QUIESCE R0, (R31)       this is a "store"

Since the QUIESCE reads the value in R0, the already-existing hardware in an out-of-order implementation will naturally keep the QUIESCE in order with the LDQ_ARM, on which it is dependent. The watch_physical_address and watch_flag registers are used as in the originally proposed instructions. However, having these registers explicitly part of the instruction still does not solve the problem of keeping writes/reads to watch_flag and watch_physical_address in order. Hardware still must solve this (the 21464 does it by not accessing them until retire time). So this suggestion does not really solve any problem.
Pros:
• Hardware implementations don't have to take special care to order the instructions.
Cons:
• QUIESCE "looks" like a store but it really isn't; non-intuitive.
• watch_physical_address and watch_flag access order still must be managed by hardware other than the usual register-ordering hardware.
B.5.4 Add LDx_ARM Functionality to LDx_L
The LDx_ARM functionality is overloaded on the LDx_L instruction. Whenever an LDx_L is executed, the watch_physical_address and the watch_flag are set, in addition to the lock_flag and the lock_physical_address. Or, instead of having the watch_flag and watch_physical_address registers at all, the lock_flag and the lock_physical_address could be used both for LDx_L/STx_C functionality and for ARM/QUIESCE functionality. In this case, QUIESCE would watch for the clearing of the lock_flag. The same LDx_L would not be used both as the partner of a QUIESCE and the partner of a STx_C. If the watch* registers are used, LDx_ARM functionality could be specified using the low address bit of the LDx_L to specify ARM. If only the lock* registers are used, no differentiation in the LDx_L instruction is needed.
Pros:
• Have to define only one new instruction (QUIESCE).
• Backwards-compatible code, if QUIESCE is a NOP.
• LDx_L and LDx_ARM already share a lot of functionality.
Cons:
• Overloading the LDx_L instruction (even more difficult to understand and verify).
• Restricts implementations by requiring two functionalities; for example, LDx_L would not be able to request write privileges for a block, since it might be used in conjunction with a QUIESCE rather than a STx_C.
• Using the low-order address bit to differentiate functionality seems kludgy.
B.5.5 Define QUIESCE to be a Load and Test
The idea here is to have the QUIESCE load a value, and quiesce based on that value. QUIESCE Ra, (Rb) would load Ra from the memory address in Rb. Then, the thread would quiesce if the value in Ra was non-zero, and would effectively be a NOP if the value in Ra was zero. The QUIESCE instruction would also load the watch_flag and the watch_physical_address. It is too restrictive to have just one flavor of test, so we would have to define different types of QUIESCE, just as there are many types of branches.
Pros:
• LDx_ARM is not needed.
• Coding restrictions are not needed.
• Only one instruction accomplishes the functionality.
Cons:
• Too many new instructions to define (multiple flavors).
• Different type of instruction: hardware has to operate on load data (data from memory).
B.5.6 Define QUIESCE to be a Read of Memory and Compare With a Register (Ernie Petrides)
This version of QUIESCE is used as follows:

    LDQ     R0, (R5)
    BEQ     R0, getlock
    QUIESCE R0, (R5)

The QUIESCE translates the VA in R5 and reads the lock value from that physical address. It then compares that lock value with the contents of R0, which was previously loaded by a vanilla load preceding the QUIESCE. If the two values are equal, the QUIESCE succeeds and the thread goes to sleep. If they are not equal, the QUIESCE is like a NOP and does not put the thread to sleep. While the processor is asleep, the hardware watches the PA as calculated when the QUIESCE executed. This is analogous to the watch_physical_address register as defined for the other proposed instructions, but is entirely private to the hardware (not software visible at all). The quiesce period ends if some write access happens to that PA, etc.
Pros:
• LDx_ARM is not needed.
• Coding restrictions are not needed.
• Only one instruction accomplishes the functionality.
• watch_flag and watch_physical_address do not need to be defined as IPRs or mentioned in the SRM at all.
• Very appealing from a software point of view.
Cons:
• Complicated instruction, unlike any other: it loads from memory, reads from a register, and does a compare, all in the same instruction.
• Difficult to implement: it introduces a datapath completely unlike anything we have already.
B.6 Open Issues
• Is the lack of a displacement for LDx_ARM a problem? We believe it is not an issue for Unix or VMS.
• Should we add anything to Section B.2.2.1?
• Sections B.2.1.1 and B.2.2.1 use "processor" to mean "TPU" and "CPU" to mean a thing that can contain multiple processors. It seems confusing to use "processor" to mean both "TPU" and "CPU". Does the terminology need to be changed?
C Proposed Memory Management IPR Design
This appendix proposes a design for the 21464 memory management IPRs from Jeff Wiedemeier, Judy Hall, and Eileen Samberg. Upon approval, it will be incorporated into the body of the Specification.
There are references in this appendix to ECO 129, which is available at:
http://amt233.lkg.dec.com/alphaarchitect/approved-ecos/eco129/eco129_prelim_mm.doc
C.1 Motivation for This Design
The 21464 memory management IPR design eliminates bit overloading that can limit the options that are available to an operating system and the PALcode.
Limitation: The same bit (VA_48) that controls sign-extension checking also dictates the format of VA_FORM. When VA_48 is set, VA_FORM is based on a 4-level page table. Software that uses 48-bit superpage and 43-bit (or smaller) mapped addresses has no VA_FORM that works. If software uses 3-level page tables, it must use VA_48=0. This prevents it from using 48-bit superpage and thus from being able to directly address (without mapping) the entire physical address space of the CPU outside of PALmode.
Correction: Separate the control of sign-extension checking from the format control of VA_FORM.
Limitation: Sign extension checking is applied to superpage and mapped addresses equally. If software wants to use 48-bit superpage, sign extension checking must be set up for 48-bit checking. This means that if software uses 48-bit superpage and 43-bit virtual addresses, VA<46:42> are not checked by hardware for proper sign extension and must either be checked by PALcode or ignored and assumed to be correct. Besides being slow, checking by PALcode can only be done in memory-management-related traps and therefore cannot catch all cases.
Correction: Modify the sign extension checking algorithm to accommodate the large superpage in all virtual addressing modes. This is how the 21464 will support the mixed mode described in ECO 129.
Limitation: Arbitrarily basing the decision of which double miss flow to use (DTB_MISS_DOUBLE_3 or DTB_MISS_DOUBLE_4) on VA_48 prevents other uses for multiple double miss flows.
Correction: Base the decision of which double miss flow to use on an independent IPR bit rather than VA_48.
C.2 Page Table Assumptions
The following assumptions are made concerning the page tables.
1. The SRM allows only 3-level page tables. ECO 129, Section 1, removes 4-level page tables from the architecture.
2. The SRM allows the level 1 page table to be partially utilized. ECO 129, Section 3, states: The level one of the page table is partially utilized, similar to the previous 4-level proposal. If the level 1 page table is required to be fully utilized, then 64 KB pages require 55-bit virtual addresses. Since the 21464 implements a 52-bit virtual address, the level 1 page table in 64 KB page mode will not be fully utilized.
3. At least 2 bits of level 1 page table index must be implemented. Page (II-A) 3-3 of the Version 7 SRM states: An implementation that supports the fourth level-number field may further subset the supported address space to include only a subset of low-order bits within that field.
That subset must be at least two bits(1), and may be as large as n bits, where n is the full bit count of any given level-number field. The most significant bit in the chosen subset is sign-extended to VA<63> for any valid virtual address.
(1) OpenVMS requires at least three PTEs in the highest-level page table. The lowest-order PTE must map process space, the highest-order PTE must map system space, and the penultimate PTE maps the page table structure.
If 2 bits of level 1 page table index must be implemented, then 64 KB pages require at least a 44-bit virtual address [2/13/13/16] but can be used with up to a 55-bit virtual address [13/13/13/16].
C.3 I-Stream (I_CTL) and D-Stream (M_CTL) Control Registers
The following sections define the fields for I_CTL and M_CTL.
C.3.1 I_CTL
The following fields are defined for I_CTL.
Table C-1 I_CTL Field Definitions
PAGE_SIZE (RW)
  Controls the I-Stream page size:
    1 = I-Stream page size is 64 KB
    0 = I-Stream page size is 8 KB
  PAGE_SIZE influences the format of IVA_FORM (see Section C.4).
VA_SIZE (RW)
  Controls the I-Stream virtual address size:
    1 = I-Stream virtual address size is 52 bits
    0 = I-Stream virtual address size is 43 bits
  VA_SIZE influences the format of IVA_FORM (see Section C.4) and influences sign extension checking (see Section C.5).
REDUCED_PAGE_TABLE (RW)
  Controls reduced page table mode:
    1 = Quadrant 1 of the virtual address space (VA<n:n-1> = 01) is the reduced page table region
    0 = No special handling of quadrant 1
  REDUCED_PAGE_TABLE influences the format of IVA_FORM (see Section C.4). See ECO 129 for information on reduced page table mode.
SPE<2:0> (RW)
  I-Stream superpage mode enable.
    SPE<2> (SPE52): Enables superpage mapping when VA<63:50> = 0x3FFE. In this mode VA<47:0> are mapped directly to PA<47:0>. Because the physical address is only 48 bits, VA<49:48> are ignored.
    SPE<1> (SPE43): Enables superpage mapping when VA<63:41> = 0x7FFFFE. In this mode VA<40:0> are mapped directly to PA<40:0> and PA<47:41> are copies of PA<40> (sign extension).
    SPE<0> (SPE32): Enables superpage mapping when VA<63:30> = 0x3FFFFFFFE. In this mode, VA<29:0> are mapped directly to PA<29:0> and PA<47:30> are cleared.
  o Any non-kernel mode access to an enabled superpage region must result in an access violation.
  o Any combination of these bits is allowed.
  o These bits influence sign extension checking (see Section C.5).
DOUBLE_MISS_CONTROL (RW)
  Controls the vectoring for all double TB misses, both I-Stream and D-Stream, and determines which double miss flow is vectored to when a hw_ld/vpte misses in the TB:
    0 = A TB miss on a hw_ld/vpte will vector to DTB_MISS_DOUBLE_ALT
    1 = A TB miss on a hw_ld/vpte will vector to DTB_MISS_DOUBLE
  DTB_MISS_DOUBLE and DTB_MISS_DOUBLE_ALT are used in place of the 21264's DTB_MISS_DOUBLE_3 and DTB_MISS_DOUBLE_4, the distinction from the 21264 being that PALcode decides which to use.
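Two parts of this register lend themselves to a compact illustration. The following C sketch is a hedged example only: the function names and the enum are assumptions, the region constants come from the SPE<2:0> descriptions above, and the 0/1 encoding of DOUBLE_MISS_CONTROL follows the table as reconstructed here. The full translation flow, including the kernel-mode requirement, appears as pseudo-code in Section C.5.2.

    #include <stdint.h>

    /* Superpage region tests implied by the SPE<2:0> descriptions above. */
    static int in_spe52_region(uint64_t va) { return (va >> 50) == 0x3FFE; }
    static int in_spe43_region(uint64_t va) { return (va >> 41) == 0x7FFFFE; }
    static int in_spe32_region(uint64_t va) { return (va >> 30) == UINT64_C(0x3FFFFFFFE); }

    /* Double-miss flow selection for a TB miss on a hw_ld/vpte. */
    enum double_miss_flow { FLOW_DTB_MISS_DOUBLE_ALT, FLOW_DTB_MISS_DOUBLE };

    static enum double_miss_flow double_miss_vector(int double_miss_control)
    {
        return double_miss_control ? FLOW_DTB_MISS_DOUBLE : FLOW_DTB_MISS_DOUBLE_ALT;
    }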
C.3.2 M_CTL
The following fields are defined for M_CTL.
Table C-2 M_CTL Field Definitions
PAGE_SIZE (RW)
  Controls the D-Stream page size:
    1 = D-Stream page size is 64 KB
    0 = D-Stream page size is 8 KB
  PAGE_SIZE influences the format of VA_FORM (see Section C.4).
VA_SIZE (RW)
  Controls the D-Stream virtual address size:
    1 = D-Stream virtual address size is 52 bits
    0 = D-Stream virtual address size is 43 bits
  VA_SIZE influences the format of VA_FORM (see Section C.4) and influences sign extension checking (see Section C.5).
REDUCED_PAGE_TABLE (RW)
  Controls reduced page table mode:
    1 = Quadrant 1 of the virtual address space (VA<n:n-1> = 01) is the reduced page table region
    0 = No special handling of quadrant 1
  REDUCED_PAGE_TABLE influences the format of VA_FORM (see Section C.4). See ECO 129 for information on reduced page table mode.
SPE<2:0> (RW)
  D-Stream superpage mode enable.
    SPE<2> (SPE52): Enables superpage mapping when VA<63:50> = 0x3FFE. In this mode VA<47:0> are mapped directly to PA<47:0>. Because the physical address is only 48 bits, VA<49:48> are ignored.
    SPE<1> (SPE43): Enables superpage mapping when VA<63:41> = 0x7FFFFE. In this mode VA<40:0> are mapped directly to PA<40:0> and PA<47:41> are copies of PA<40> (sign extension).
    SPE<0> (SPE32): Enables superpage mapping when VA<63:30> = 0x3FFFFFFFE. In this mode, VA<29:0> are mapped directly to PA<29:0> and PA<47:30> are cleared.
  o Any non-kernel mode access to an enabled superpage region must result in an access violation.
  o Any combination of these bits is allowed.
  o These bits influence sign extension checking (see Section C.5).
C.3.3 PAGE_SIZE, VA_SIZE, and REDUCED_PAGE_TABLE Field Combinations
Combinations of the PAGE_SIZE, VA_SIZE, and REDUCED_PAGE_TABLE fields are valid or invalid as shown in Table C-3. Although every valid combination of these bits has PAGE_SIZE and VA_SIZE set the same way, it is recommended that the bits remain separate, since the two controls serve distinct functions. One of the primary situations that led to this document was the overloading of bits in previous Alpha implementations.
Table C-3 Valid and Invalid PAGE_SIZE, VA_SIZE, and REDUCED_PAGE_TABLE Combinations
  PAGE_SIZE=0, VA_SIZE=0, REDUCED_PAGE_TABLE=0: 43-bit VA with 8 KB pages. This is the addressing mode used by Tru64 UNIX and OpenVMS today.
  PAGE_SIZE=0, VA_SIZE=0, REDUCED_PAGE_TABLE=1: 43-bit VA with 8 KB pages and reduced page tables. This mode is invalid(1).
  PAGE_SIZE=0, VA_SIZE=1, REDUCED_PAGE_TABLE=0: 52-bit VA with 8 KB pages. This mode is invalid(2).
  PAGE_SIZE=0, VA_SIZE=1, REDUCED_PAGE_TABLE=1: 52-bit VA with 8 KB pages and reduced page tables. This mode is invalid(1)(2).
  PAGE_SIZE=1, VA_SIZE=0, REDUCED_PAGE_TABLE=0: 43-bit VA with 64 KB pages. This mode is invalid(3).
  PAGE_SIZE=1, VA_SIZE=0, REDUCED_PAGE_TABLE=1: 43-bit VA with 64 KB pages and reduced page tables. This mode is invalid(3).
  PAGE_SIZE=1, VA_SIZE=1, REDUCED_PAGE_TABLE=0: 52-bit VA with 64 KB pages.
  PAGE_SIZE=1, VA_SIZE=1, REDUCED_PAGE_TABLE=1: 52-bit VA with 64 KB pages and reduced page tables.
  (1) Reduced page table mode requires 64 KB pages.
  (2) 3-level page tables with 8 KB pages only allow a 43-bit virtual address.
  (3) 64 KB pages require at least a 44-bit virtual address.
C.4 VA_FORM and IVA_FORM
This is a generalized discussion of the impact of the PAGE_SIZE, VA_SIZE, and REDUCED_PAGE_TABLE IPR bits on the format of VA_FORM and IVA_FORM.
Note: These bits in I_CTL control the format of IVA_FORM; in M_CTL, they control the format of VA_FORM. Because the behavior of VA_FORM and IVA_FORM is the same, VA_FORM represents both VA_FORM and IVA_FORM throughout this discussion.
The effect of these bits on VA_FORM is:
• PAGE_SIZE controls where VA is positioned for inclusion in VA_FORM. If PAGE_SIZE is set, VA<16> is positioned at VA_FORM<3>. If PAGE_SIZE is clear, VA<13> is positioned at VA_FORM<3>. VA_FORM<2:0> are always 0 for alignment of the 64-bit PTE address.
• VA_SIZE controls how many bits are transferred from VA to VA_FORM. If set, VA<51:n> are transferred. If clear, VA<42:n> are transferred. n is either 16 if PAGE_SIZE is set or 13 if PAGE_SIZE is clear.
• REDUCED_PAGE_TABLE controls how VA_FORM is formed in quadrant 1 (VA<n:n-1> == 01) and is only valid in 52-bit VA/64 KB page mode. If set, VA_FORM is formed as discussed in the transformations that follow.
The transformations in Section C.4.1 show how VA_FORM is formed in each of the valid modes.
C.4.1 The Transformation From VA to VA_FORM
The diagrams below show how the transformation from VA to VA_FORM is made for each of the valid combinations of VA_SIZE, PAGE_SIZE, and REDUCED_PAGE_TABLE. The addresses in the diagrams are broken down into their component fields:
  L1 - Level 1 PFN
  L2 - Level 2 PFN
  L3 - Level 3 PFN
  BI - Byte Index (offset within page)
The data in VA_FORM is different from the data in VA. Therefore, the component fields in VA_FORM are referred to as L1', L2', L3', and BI'. Additional information indicating where in VA the data came from is listed parenthetically for VA_FORM. So, L2' (L1) indicates that this is the Level 2 PFN in VA_FORM and the data came from the Level 1 PFN of VA.
C.4.2 43-bit VA / 8 KB Page (VA_SIZE = 0, PAGE_SIZE = 0, REDUCED_PAGE_TABLE = 0)
[Field diagram: VA = sign<63:43> | L1<42:33> | L2<32:23> | L3<22:13> | BI<12:0>; the upper bits of VA_FORM come from VPTE_BASE.]
So:
  VA_FORM<63:33> comes from VPTE_BASE
  VA_FORM<32:3>  comes from VA<42:13>
  VA_FORM<2:0>   is 0
C.4.3 52-bit VA / 64 KB Page (VA_SIZE = 1, PAGE_SIZE = 1, REDUCED_PAGE_TABLE = 0)
[Field diagram: VA = sign<63:52> | L1<51:42> | L2<41:29> | L3<28:16> | BI<15:0>; the upper bits of VA_FORM come from VPTE_BASE.]
So:
  VA_FORM<63:42> comes from VPTE_BASE
  VA_FORM<41:39> comes from VA<54:52>, that is, SEXT(VA<51>)
  VA_FORM<38:3>  comes from VA<51:16>
  VA_FORM<2:0>   is 0
Note: There are only 52 bits of VA (VA<51:0>), but VA<54:52>, the sign extension of VA<51:0>, is used in VA_FORM. This is required because with 64 KB pages, a 52-bit address does not fully utilize a 3-level page table. With 64 KB pages, 55 bits of virtual address are required to fully utilize a 3-level page table. See Assumptions 2 and 3 at the beginning of this appendix for the full discussion of partially utilized page tables.
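The two transformations above can be expressed directly in C. The following sketch is illustrative only: the BITS helper and the function names are assumptions, and VPTE_BASE is passed as an argument rather than read from its IPR. The reduced-page-table variant in the next subsection could be written the same way.

    #include <stdint.h>

    /* Extract bits hi..lo of x (hi >= lo, hi < 64). */
    #define BITS(x, hi, lo) (((x) >> (lo)) & ((UINT64_C(1) << ((hi) - (lo) + 1)) - 1))

    /* 43-bit VA / 8 KB pages:  VA_FORM<63:33> = VPTE_BASE<63:33>,
     * VA_FORM<32:3> = VA<42:13>, VA_FORM<2:0> = 0. */
    static uint64_t va_form_43bit_8k(uint64_t va, uint64_t vpte_base)
    {
        return (BITS(vpte_base, 63, 33) << 33) | (BITS(va, 42, 13) << 3);
    }

    /* 52-bit VA / 64 KB pages:  VA_FORM<63:42> = VPTE_BASE<63:42>,
     * VA_FORM<41:39> = SEXT(VA<51>), VA_FORM<38:3> = VA<51:16>,
     * VA_FORM<2:0> = 0. */
    static uint64_t va_form_52bit_64k(uint64_t va, uint64_t vpte_base)
    {
        uint64_t sign = BITS(va, 51, 51) ? 0x7 : 0x0;   /* three copies of VA<51> */
        return (BITS(vpte_base, 63, 42) << 42) | (sign << 39) | (BITS(va, 51, 16) << 3);
    }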
C.4.4 52-bit VA / 64 KB Page / Reduced Page Tables (VA_SIZE = 1, PAGE_SIZE = 1, REDUCED_PAGE_TABLE = 1)
Quadrant 1 (VA<51:50> = 01 binary) behaves as described here. VA_FORM is set for all other virtual addresses as described above under 52-bit VA / 64 KB Page.
[Field diagram: VA = sign<63:52> | quadrant<51:50> | index<49:29> | offset<28:0>; the upper bits of VA_FORM come from VPTE_BASE.]
So:
  VA_FORM<63:42> comes from VPTE_BASE
  VA_FORM<41:39> is 0
  VA_FORM<38:37> comes from VA<51:50> (01 binary)
  VA_FORM<36:24> is 0
  VA_FORM<23:3>  comes from VA<49:29>
  VA_FORM<2:0>   is 0
Note: ECO 129, Section 4, states: This reduced page table mode does not modify the format of the PTE's (sic) from the base 64KB mode. The lower 13 bits of the PFN are unused and the GH bits must be all ones (a value of 3) in this mode for the VA<47:46> == 1 space. In reduced page table mode, the OS is required to set the granularity hints in quadrant 1 such that each second-level PTE maps what an entire third-level page table would normally map.
C.5 Sign Extension Checking
C.5.1 Previous Implementation
The 21264 implemented sign extension and superpage checking as shown in the following pseudo-code. As shown in the code, although 48-bit superpage was the only superpage mode that could directly address the entire physical address space of the processor, it could only be used if 48-bit addressing was turned on. Unfortunately, enabling 48-bit addressing lost some sign-extension validation of legitimate addresses in a 43-bit or smaller virtual addressing environment and broke VA_FORM for 43-bit or 32-bit addressing modes.

    if ((VA_48 && (VA<63:0> != SEXT(VA<47:0>))) ||
        (!VA_48 && (VA<63:0> != SEXT(VA<42:0>)))) {
        DFAULT;                                // improperly sign extended address
    }
    ...
    if (SPE48 && (VA<47:46> == 2)) {
        if (mode == kernel) {
            PA<43:0> = VA<43:0>;
        } else {
            DFAULT;                            // superpage access in non-kernel mode
        }
    } else if (SPE43 && (VA<47:41> == 0x7E)) {
        if (mode == kernel) {
            PA<43:0> = SEXT(VA<40:0>);
        } else {
            DFAULT;                            // superpage access in non-kernel mode
        }
    } else if (SPE32 && (VA<47:30> == 0x3FFFE)) {
        if (mode == kernel) {
            PA<43:30> = 0;
            PA<29:0>  = VA<29:0>;
        } else {
            DFAULT;                            // superpage access in non-kernel mode
        }
    } else {
        PA<43:0> = TBLookup(VA);
    }

C.5.2 Proposed Implementation
To allow any superpage mode (most importantly, the mode that can directly address the entire physical address space of the processor), the pseudo-code in Section C.5.1 can be changed to the following (assuming IPR bits as discussed above):

    if (SPE52 && (VA<63:50> == 0x3FFE)) {
        if (mode == kernel) {
            PA<47:0> = VA<47:0>;
        } else {
            DFAULT;                            // superpage access in non-kernel mode
        }
    } else if (SPE43 && (VA<63:41> == 0x7FFFFE)) {
        if (mode == kernel) {
            PA<47:0> = SEXT(VA<40:0>);
        } else {
            DFAULT;                            // superpage access in non-kernel mode
        }
    } else if (SPE32 && (VA<63:30> == 0x3FFFFFFFE)) {
        if (mode == kernel) {
            PA<47:30> = 0;
            PA<29:0>  = VA<29:0>;
        } else {
            DFAULT;                            // superpage access in non-kernel mode
        }
    } else if (((VA_SIZE == 52-bit) && (VA<63:0> == SEXT(VA<51:0>))) ||
               ((VA_SIZE == 43-bit) && (VA<63:0> == SEXT(VA<42:0>)))) {
        PA<47:0> = TBLookup(VA);
    } else {
        DFAULT;                                // improperly sign extended address
    }

Note: Besides including the superpage checks as peers of the traditional virtual address sign-extension check, the superpage checks are changed to check all the way to bit 63. The superpage checks require that the superpage address be a properly sign-extended address for the size of the superpage region.
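As a concrete illustration of only the sign-extension test in the proposed flow, the following self-contained C fragment checks whether a virtual address is properly sign-extended for a given mapped VA size. The function name and the example addresses are assumptions chosen for illustration, not part of the design.

    #include <stdint.h>
    #include <stdio.h>

    /* True if va satisfies VA<63:0> == SEXT(VA<size-1:0>) for size = 43 or 52. */
    static int va_sign_ok(uint64_t va, int va_size_bits)
    {
        uint64_t sign_bit = (va >> (va_size_bits - 1)) & 1;
        uint64_t upper    = va >> va_size_bits;                       /* VA<63:size> */
        uint64_t expect   = sign_bit ? (UINT64_C(1) << (64 - va_size_bits)) - 1 : 0;
        return upper == expect;
    }

    int main(void)
    {
        /* 43-bit mode: bit 42 is the sign bit. */
        printf("%d\n", va_sign_ok(UINT64_C(0x000003FFFFFFFFFF), 43)); /* 1: bits 63:42 all zero     */
        printf("%d\n", va_sign_ok(UINT64_C(0xFFFFFC0000000000), 43)); /* 1: bits 63:42 all one      */
        printf("%d\n", va_sign_ok(UINT64_C(0x0000040000000000), 43)); /* 0: bit 42 set, 63:43 clear */
        return 0;
    }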
5 January 2001 ·-Subject To Change Compaq Confidential Proposed Memory Management IPR Design C-11 Sign Extension Checking Compaq Confidential C-12 Proposed Memory Management IPR Design 5 J<1nuary 2001 ·-Subject To Change Glossary Bank Conflict The Dcache is implemented as eight independent, interleaved memories (banks), so that they can perform multiple operations per cycle. If two instructions need the same bank at the same time, they cannot both be satisfied. This event is called a bank conflict, and causes one of the instructions to be retried. Similarly, the Scache has several banks (probably 16) which may be needed in different pipeline stages for different kinds of requests. If two requests conflict for use of the same bank in different stages, one request is retried. Bbox BIU (see Cbox) Interface unit which controls the Scache (formerly Bcache), the second-level cache, which is shared by data and instructions. Bcache External second-level cache, which has been eliminated from the design in favor of an internal second-level cache called the Scache. Blacklist A cache containing the PC addresses of load and store instructions which have recently caused memory order traps. Whenever a load instruction is found to be on the blacklist, it is forced to wait for completion until all older blacklisted stores have been executed. Block A contiguous, naturally aligned 64-byte region of the logical memory space. It may be contained in a single cache which is permitted to modify it, or it may be shared by many caches which have read access. A block is the unit of interprocessor communication, and also the unit managed by the coherence protocol. We don't use the word block to refer to a group of instructions in the pipeline. Groups of instructions in the pipeline are called "chunks" -- as in "Map Chunk", "Fetch Chunk" and others. Cache Coherence In a system with multiple processors, each having a cache, correct operation of the software requires that the caches maintain a consistent (or coherent) representation of the contents of memory. This is accomplished by communication among the caches and memories using a coherence protocol. Compaq Confidential 5 January 2001 ~· Subject To Change Glossary-1 CAM Content addressible memory. A structure that takes an input and compares it with a number of tags and automatically reads the contents of every location that has a matching tag. Although similar to a cache, in a CAM, the tag check and data lookup is integrated. Typically, a CAM is fully associative. Cb ox Secondary cache, external memory, and system interface, including cache coherence. Documentation in Cbox. Clean Victim See Victim. Coherence Message In order to ensure that all processors see memory modifications in the same order, the system must make sure that all sharing processors have invalidated their copies of a block before the block is passed from the owner to another processor. Sharing proces ors respond to the invalidate message with a coherence message, and the new owner counts coherence messages to make sure all sharers have been invalidated before forwarding the block. Complete An instruction has been completed (alt. is complete) when it has produced a value that can be consumed by its dependents and it has passed the point at which it may itself trigger a trap. (e.g. A STore instruction is not complete until the Mbox has determined that execution of the STore it will not result in a DTB miss.) Even speculatively executed instructions can complete. Db ox Data Cache (Dcache, see Mbox). 
First-level data cache. • • • • • • 64K Bytes Writeback 2-way set associative 8 banks, interleaved on bits 5-3 of address (quadword banks) Double-pumped for two reads or one write per cycle per bank Write & victim-extract performed in otherwise-idle banks to avoid conflict DIFT Directory In-Flight Table. A list of uncompleted requests to the local memory. Any new request which matches the address of a DIFT entry must wait for release of that entry before being processed. Requests which do not collide with existing DIFf entries are eligible for service by the memory, and are scheduled to optimize memory utilization. Directory Compaq Confidential Glossary-2 5 Jc1nuary 2001 ~· Subject To CfJange Information associated with each 64-byte block of memory which indicates which node, if any owns the block (meaning that the node has permission to write the block), and which nodes, if any, are sharers (meaning that they may have cached read-only copies of the block). Dirty Descriptive of a block which has been modified with respect to the version of the same block held in memory. In general, a dirty block must eventually be written back to memory, but it may be forwarded among caches for an indefinite period before the writeback occurs, and it is even possible that it will be invalidated before being written back. A dirty block is owned by the cache that holds it, but ownership does not imply that the block is dirty, and it is possible for a processor to obtain ownership speculatively and later evict the block without ever modifying it. Done Done-ness is a matter of perspective. While rare is acceptable to many, others prefer well-done or charred. We do not use the term "done" to refer to the status of an instruction. The correct technical term is outa here 11 11 • OTB Dcache Translation Buffer. A 128-entry, fully-associative cache of virtual-to-physical address translations used for data references. The DTB has 4 read ports, so that it can be accessed by four memory references concurrently, plus one write port so that it can be updated. Eb ox Execution unit for Integer Operate instructions. Exclusive One of the four possible states of a cache block. A block in the Exclusive state is a clean copy of the corresponding data in memory, but no other processor has a copy, and this processor has been granted permission to write it. Fbox Execution unit for Floating Point Operate instructions. Fetch Chunk The Ibox fetches up to sixteen instructions at one time. The group of instructions that the Ibox fetches in a given cycle is referred to as a fetch chunk. Fill The process by which data which was not present in the cache is assigned to a location in the cache, and stored there for future access. Documentation in Filling. Fill Buffer A small memory which holds cache blocks from the system or Scache while waiting for an opportunity to update the Dcache. • 32 entries, each 64 bytes • 2 write ports, 1 read port Compaq Confidential 5 January 2001 -~ Subject To Change Glossary-a FRO Fill Retry Data Buffer, pronounced "Fred". A small buffer which can bypass fill data to the load result busses before the Dcache is updated. Home The node, consisting of CPU and memory, at which a given 64-byte memory block is stored. The home node number is bits 45-36 of the physical address. The home memory stores both the block and the directory, which identifies the owner and sharing nodes, if any. Ibox Instruction Fetch Frontend. 
Instruction cache with prefetcher, line predictor, branch/ jump/return predictor, collapsing buffer, and mapper. Documentation in Ibox. In Flight The state of an instruction which has been mapped but not yet retired or trapped. This corresponds to the time during which it may be out of order with respect to other inflight instructions. In Flight is also used to describe cache coherence messages which have been sent by one node but not yet processed by the intended recipients. Instruction Execution The performance of the operation specified by the opcode of an instruction. In general, an instruction begins execution three cycles after it is issued; its completion time may be one or more cycles later, depending on the operation. Instruction Issue The process of assigning an instruction whose operands are ready to a particular function unit for execution beginning in a particular cycle. More precisely, an instruction may be issued as soon as its operands are expected to be ready in time for the instruction to meet the operands at the selected function unit. This can lead to complexity when the operand is the result of a load which needs to be retried, because the successor instruction may already have been issued at the time the retry condition is detected. Instruction Mapping The process of identifying the physical registers currently associated with the virtual registers used by an instruction. Mapping also assigns each instruction to a subset of the instruction pickers so that it can be sent to an appropriate function unit. Once it has been mapped, an instruction is kept in the instruction queue while waiting for its operands to become ready, then waiting to be picked for issue to a function unit. Invalid One of the possible states of a block in its home directory. The name is confusing, because it does not mean that the block in memory is not valid; in fact, it means that the copy in memory is the only one, because the block is known not to be valid in any cache. Invalidate Compaq Confidential Glossary-4 5 January 2001 m Subject To Change (when used as a noun) A variant of probe used by the coherence protocol, which removes a single cache block if present in the dcache or scache of a processor. Also forces a trap of any load which refers to the cache block and is in the shadow of an MB. 1/0 Space That portion of the processor's physical address space which is not assured to have memory-like behavior (reads and writes may have side-effects, and data may change without having been written). I/O space must not be cached, may not be referred to speculatively, and must be referred to in the order given by the program. Jbox Instruction Cache (See Ibox). Kbox Clocks. Lb ox Phase Locked Loop. Load Queue An associative memory in the Mbox which keeps track of the state of, and addresses used by, load instructions which have been issued but not yet retired or trapped. The load queue contains the information necessary to retry a load which failed to complete when it first issued, and it detects and causes a trap in the case in which an older store (that is, earlier in program order) modifies the data used by this load. • • • • • partitioned by threads, allocated in order within thread • controls load retries 64 entries, fully associative virtual & physical addresses, opcode attributes 3 read ports (2 stores, inval), 3 write ports detects next-to-retire, store-data-ready The components of the load queue are: • • • SSB Index CAM (2 ports). Initiate retry of load when matching store data arrives . 
• • • • Ld Opcode . MAF Index CAM (2 ports). Initiates retry when MAF index matches . Flags: DTB Miss, MAF Full, Bank Conflict, Retry Ready, Lock, Valid, Done, DC Hit, I/O. Ld VA register (64 bits) . Ld PA register, with 3 write ports, 3 CAM ports (2 Stores plus Invalidate) . Ld Inum comparitors: 2 ports for Stores, to detect RAW hazard trap; plus 1 port to initiate retry at retire point and free entry after instruction retired or killed. Local Compaq Confidentia I 5 January 2001 -~ Subject To Change Glossary-5 Descriptive of requests from a CPU to the memory attached directly to the same chip. Some local memory transactions can be optimized because of knowlege that the Scache state is always coherent with the memory directory, at the cost of probing the Scache for every memory access by an external node. Long Latency Instruction Instructions that return results at either an unpredictable time, or with a latency greater than 4 cycles (is it 4? or 5? or 6?) are termed "long latency instructions". Dependents for these instructions do not become data ready (in the instruction queue) until the long latency instruction signals that it will complete via a bubble request to the IQ. Among other instructions, integer multiply, floating divide, and square root are all long-latency instructions. MAF Miss Address File, sometimes called Miss Latch. An associative memory in the interface between the Mbox and the Cbox which keeps track of the address and state of all outstanding requests by the Mbox or Ibox. • • • 64 entries: address and flags Merges new misses with outstanding fill requests Detects new misses to blocks in the Write Buffer or Fill Buffer Map Chunk The Ibox sends instructions to the Pbox in groups of up to eight instructions. The eight instructions are contiguous and formed in the collapsing buffer. These groups are referred to as map chunks. MB Memory Barrier or Write Memory Barrier. An instruction which requires that all previous memory reference instructions (or in the case of WMB, store instructions) be completed before any subsequent memory reference instructions. Mb ox Dcache/Intemal Memory. First-level cache with load and store queues. Tutorial in Mbox. Merge Buffer A intermediate memory in which multiple writes to the same Dcache line are buffered so that they can update the Dcache in the same cycle, and where data waits to be written to the Dcache until there is a cycle in which the required bank is not in use. Helps reduce tag bandwidth, as well as letting stores give priority to fills. • 8 entries, each 64 bytes of data, plus mask & address • 2 CAM ports (store address merge), 2 write ports (store retire), 1 read port (Dcache update) Node Compaq Confidential Glossary-6 5 Januc1ry 2001 ···Subject To Change A 21464 CPU chip with its directly-connected memory, if any, or in some circumstances, an I/O interface which participates in the interprocessor communication network. I/O nodes may make requests in the cache coherence protocol, and may keep cached copies of memory data, but never serve as the home node for cacheable memory. Older Earlier in program order, though not necessarily in order of execution. (See 000). 000 Out of Order execution. The 21464, like many modem microprocessors, attempts to find and exploit opportunities for instruction-level parallelism by looking far ahead, and executing those instructions whose operands are known, even if previous instructions in the program order have not been completed. 
This often permits the processor to find useful work even while waiting for cache misses and other long delays to be completed. Correct operation of the processor requires that results be the same as if instructions were performed in the order written. Out of order execution is possible but not very effective without speculation. Out of order execution See 000. Ownership In order to ensure that the memory system behaves in accordance with the rules set out in the Alpha SRM, we require that the system establish and enforce a particular order in which store instructions are serviced by the memory. This is accomplished by identifying, at each instant, a single cache (the owner) as having permission to write any block. The directory for the block shows it to be in Exclusive state, and the cache holds it as Exclusive (while it has permission to write but has not yet modified), or Dirty (when it has been modified and remains writable). Ownership carries extra responsibility, in that the home memory must always be able to identify the owner of every block, and must be able to obtain the block from the owner. Therefore the cache notifies the home memory whenever a block it owns is replaced in the cache, even if it had not been modified (is not dirty). Packet In the context of interprocessor communication, the message unit. Packets may consist of 1, 3, or 5 ticks, but the ticks are transmitted in consecutive cycles, and are handled as a unit throughout the system. Pb ox Dependency mapper unit. PAF Probe Address File is a queue which records requests which have been received by this cache but have not been serviced yet. Once serviced, PAF entries may change the MAF state, may cause invalidation of the Dcache and/or forwarding of a block, or may generate a Local Probe Response to the DIFT. Compaq Confidential 5 January 2001 -· Subject To Change Glossary-7 Probe A cache-coherence transaction which tests a processor's Dcache and Scache for the presence of data from a particular addresst optionally modifying the valid/shared/dirty state of the data if foundt and/or forcing a transfer of the data to another processor. Processor Either a TPU or a CPU. Qbox Instruction issue and retire unit. Quiesce A variant of a load instruction which is used to put a thread "to sleep"t so that it doesn't delay the execution of other threads on the same processor while the sleeping thread waits for release of a lock used for interprocessor communication. It is expected that Quiesce will be used in the spin loop in which a process waits after failing to acquire a lockt and before the next attempt. Rambus A high-bandwidtht syncronous interface for dynamic memory chips. The bus consists of 18 data and 15 control signalst clocked at 800 MHz. AraOa will provide a glueless direct interface to two independentt concurrently active arrays of Rambus memory. Each array consists of 4 parallel bussest each of which can interface up to 32 DRAM chips. Also the company which designedt developedt and promoted the interface. See their technical overview. Rbox Router unit. Register File The multi-ported memory consisting of the physical registers which contain integer and floating point values in the virtual registers of some thread. Requestor In discussions of the memory system and cache coherencet the node which initiated a transactiont for examplet by executing a load instruction which missed the cachet resulting in a read request. Retire Chunk The Qbox retires instructions in groups of eight or sixteen at a time. 
A group of instructions retired all at the same time is referred to as a retire chunk. Retireable An instruction is retireable when it is complete and the Mbox has determined that no other instruction may cause this instruction to trap because of a litmus test violation or other ordering constraint. Retired Compaq Confidential Glossary-a 5 Jc1nuary 2001 ···Subject To Change An instruction is retired after it becomes retireable and all instructions occurring before it (in program order) are retireable. Retirement While the 21464 is able to execute instructions out of order, it has only finite resources for keeping track of all the instructions in progress, and so it must release those resources for reuse after instructions have completed correctly. This release process is called retirement, and is performed in the order specified by the program. Before retirement, an instruction is speculative, and all its effects can be undone; after retirement, an instruction is said to be retired, or committed. This can be particularly confusing in the case of store instructions, which have only begun at the time of retirement. Once a store has retired, its write must be performed (unless a subsequent store from the same processor replaces it), but if the processor does not have exclusive write access to the referenced block, such access must be obtained before the Dcache and Bcache can be updated. Register Renaming In order to support out-of-order execution, the processor has many more registers (called physical registers) than are required by the instruction-set architecture (the virtual registers), and it associates a physical register with a particular assignment of a value to a virtual register. When a new value is assigned to a virtual register, a new physical register is chosen to hold the value. This permits the calculation and storage of the new value to be performed before the final use of the old value has taken place, and also facilitates roll-back of the processor state in case of mis-speculation. Retry The process which completes a load instruction which was issued but could not complete on the normal schedule. The inability to complete was detected in time to prevent execution of the dependents, so they can be held for execution following the retry. Distinguished from replay trap, which occurs when the dependents may already have executed, and therefore must be flushed. Router The interprocessor crossbar switch, incorporated in the AraOa chip, which permits arrays of processors to be directly interconnected and communicate with and through one another. Scache Secondary Cache. Internal cache of 2-4MB, replacing the external Bcache. The Scache is smaller than the Bcache, but has much lower latency and higher bandwidth, which has a larger performance benefit on most benchmarks than the Bcache's size. Set In logic, to force true; in arithmetic, to force to one. In caches, the collection of tags and blocks which are are selected by one address comparitor; one column of a cache. Compaq Confidential 5 January 2001 - Subject To Change Glossary-9 This usage is contrary to another widely-observed usage, in which a set is the collection of blocks and tags selected by a single index value; one row of the cache. Usage within the Alpha engineering organization seems predominantly to favor the column definition, so we chose to accept this convention, with a warning to those who may be used to the alternative. Shared One of the four possible states of a cache block. 
A block in the Shared state is a clean copy of the corresponding data in memory. Other processors may have copies, and this processor does not have permission to write it. The processor uses SharedToDirty to get write permission. Sharer In discussions of the memory system and cache coherence, a node which may have a cached copy of a given block, but not ownership. Ship Passing A reference to the expression "Like ships passing in the night", refers to the problem that two elements of the system may see related events in different orders, creating confusion about each other's state. A major source of bugs in pipelined systems. SMT Simultaneous Multithreading. Provides, on a single CPU, the capability of simultaneously executing instructions from multiple threads. Snarf In bus-snooping coherence protocols, the practice of using one bus transaction both to pass a modified block between caches, and to update the memory. In the 21464, snarfing requires a separate transaction, but is used with the same intent, namely, to update the memory with modified data that is being passed between processors, with the hope of reducing the network traffic required for forwarding and minimizing interference in the writer's cache. SRM Alpha System Reference Manual, the ultimate reference for definition of the softwarevisible architecture (sometimes called "instruction-set architecture") of Alpha processors. Speculation The policy of assuming or predicting some condition before it is known. This permits the processor to discover opportunities for parallel execution, but requires the ability to discard the effect of any operation which depends on a condition which was incorrectly speculated (mis-speculation). There are many kinds of unlikely but possible events which could make the result of an instruction executed out of order different from the result that should have occurred if the program were performed in order, including branch misprediction, memory reference order hazards, and exceptions. Speculatively Complete Compaq Confidential Glossary-10 5 Jc1nuc1ry 2001 - Subject To Change The state of an instruction which has produced its result (hence complete) but has not yet retired (hence speculative). The dependents of an instruction may issue once it has reached this state. SSB Speculative Store Buffer. The memory which holds the addresses and data of store instructions which have been issued but not yet retired and propagated to the Dcache. Used to satisfy younger (later in program order) load instructions in the same thread. SSB slots are assigned as store instructions are mapped. The address and data portions of a store instruction are issued separately, so may arrive at the SSB in any order. An SSB entry is regarded as not valid until the address arrives, valid but not ready when there is a valid address but no data, and ready when both address and data are present. • 64 entries, fully associative • 8 byte data with mask • virtual & physical addresses, opcode attributes • 3 CAM ports (loads), 2 write ports (store exec), 2 read ports (store retire) partitioned by threads, allocated in order within thread The components of the SSB are: • MAF Index register, with 2 CAM's. • Snum read decoder. • Snum write decoder. Writes the SSB entry when selected by the execution of a store instruction assigned to this slot. • Flags: STC, Evict, WMB, DC Hit, Valid, Retry, I/O. • Store VA register, with 3 CAM's to match Load addresses. • Store PA register. • Store Data Register. 
• Store Opcode Register, with 3 load opcode compares (for byte control?) • Store Inum register, with 3 subtractors for load instructions, and 1 for retire/kill. Spurious Spurious instructions are encountered when a program branches or jumps to a sequence of bits that may or may not form valid Alpha instructions. The branch or jump may be incorrectly speculated (i.e. a mispredict) or may be the result of a broken or malicious program or programmer. (Watch out for broken programmers.) Such sequences, by definition, are unlikely to obey coding rules and standards. Spurious instructions, if we arrive at them via a mispredicted flow, may not cause any architecturally visible state changes. If a spurious instruction is retired (i.e. it was not incorrectly speculated) then its effect on architectural state is UNPREDICTABLE if the current processor mode is not "kernel". If the current processor mode is "kernel" then non-speculative spurious instructions may cause the processor to perform an UNDEFINED operation. STAQ Compaq Confidential 5 January 2001 ···Subject To Change Glossary-11 Store Address Queue. The address portion of the SSB. Tb ox Box that manages testability and diagnostics. Thread The state of a program. A thread consists of the PC, registers, address space, and other state that an Alpha program uses to complete its task. Each 21464 processor is capable of running up to four threads simultaneously, with each behaving as if it had a processor to itself, and it is also possible for any subset of the threads to communicate through shared memory in order to cooperate on completing a single task. Tick In general, one cycle. Used specifically in the interprocessor interface ports to refer to the 40-bit unit of information transferred in one cycle of the port. Commands used among the processors are encoded in packets, which consist of 1, 3, or 5 consecutive ticks. TPU Thread processing unit. On a simultaneously multithreaded processor, the hardware that is capable of executing a thread. A TPU has all the capabilities of a conventional CPU. The TPU holds a full process context while a process or thread is executing on that TPU. The 21464 contains four TPUs. Trap The recovery process when possible mis-speculation is detected. A trap is associated with a particular instruction (the trapped instruction), whose result may not be consistent with sequential execution of the program. Instructions prior to the trapped instruction in program order are permitted to retire normally, but the results of the trapped instruction and all subsequent instructions are discarded, and the processor resumes execution with the trapped instruction. Traps may be described as: • Replay traps, meaning that the appropriate instruction was executed with incorrect data or timing. • Branch mispredict traps, where the processor has followed the wrong sequence of instructions. • Exceptions, where the SRM requires a break in the normal instruction flow because of some data condition (divide by zero, NaN) or processor state (access violation). UNDEFINED See definition in the Preface. UNPREDICTABLE See definition in the Preface. VAF Victim Address File stores the address and state of blocks evicted from the cache which were held exclusively by this processor. Corresponding data is kept in the VDB until it has been sent to the home memory and/or requesting processor. 
UNDEFINED
See definition in the Preface.

UNPREDICTABLE
See definition in the Preface.

VAF
Victim Address File. Stores the address and state of blocks evicted from the cache that were held exclusively by this processor. Corresponding data is kept in the VDB until it has been sent to the home memory and/or the requesting processor.

VDB
Victim Data Buffer. After the Scache detects a miss, and before the appropriate fill data is written into the cache line, the victim block must be read out of the Scache. While it is waiting to be written to memory, it is held in the victim data buffer. The VDB is also used to hold forwarded blocks while they are waiting to be sent to a requestor. Contains 64 entries of 64 bytes each.

Victim
Whenever a new block is filled into the cache, the old contents of the cache line are evicted. If the line was dirty, it held the only current copy of the memory location it represents, so its value must be written back to the system. The line that is about to be replaced in the cache is called the victim, and the process is called victimization. There are significant circumstances under which a processor is granted ownership of data but never modifies it before displacing it from the cache. In that case, the processor must notify the home memory that it is no longer keeping the data; this is called clean victimization, and no data is sent back to the memory, because the memory copy is still valid.

Virtual Channel
A technique for preventing deadlock in a network by ensuring that when resources are scarce, they are assigned to the messages closest to completion.

Valid, Shared, Dirty (VSD) state
Each block in a cache may be in any of four states, which are encoded in the VSD bits associated with that block. Valid means that the block is a useful representation of the memory location; if Valid is not set, the remaining bits are meaningless. Shared means that there may be other caches with copies of the data, and hence that this processor must negotiate for write permission. Dirty means that the data is modified and supersedes the value in memory.

Younger
Later in program order, but not necessarily in order of execution. (See Out-of-Order execution.)

Zbox
Rambus interface unit.

Some terminology (for glossary):
uITB - Micro Istream Translation Buffer
ITB - the main Istream Translation Buffer
TPU - Thread Processing Unit
TG - Thread Group
PTE - Page Table Entry
IPR - Internal Processor Register
ASN - Address Space Number
ASM - Address Space Match
TBIAG - TB Invalidate All Groups
TBIA - TB Invalidate All
TBIS - TB Invalidate Single
TBIAP - TB Invalidate All Process-specific

Index

A
Abbreviations, 1-1
    binary multiples, 1-1
    register access, 1-1
Address conventions, 1-2
Aligned convention, 1-2

B
Binary multiple abbreviations, 1-1
Bit notation conventions, 1-3
Block response packet, 13-14

C
Caution convention, 1-3
Conventions, 1-1
    abbreviations, 1-1
    address, 1-2
    aligned, 1-2
    bit notation, 1-3
    caution, 1-3
    data units, 1-3
    do not care, 1-3
    external, 1-3
    field notation, 1-3
    note, 1-3
    numbering, 1-3
    ranges and extents, 1-3
    register figures, 1-4
    signal names, 1-4
    unaligned, 1-2
    x, 1-3

D
Data units convention, 1-3
Do not care convention, 1-3

E
External convention, 1-3

F
Field notation convention, 1-3
FORWARD_CHANNEL messages, 13-3
    SharedInvalBroadcast, 13-16
FORWARD_CHANNEL packet format, 13-13

I
INPUT I/O PORT HEADER TICK packet formats, 13-15
Inval broadcast packet format, 13-15
IO_CHANNEL
    messages, 13-2
    packet formats, 13-12

M
Messages
    Dealloc, 13-7
    flow control, 13-7
    formats, 13-6
    FORWARD_CHANNEL, 13-3
    IO_CHANNEL, 13-2
    packet formats, 13-11
    REQUEST_CHANNEL, 13-3
    RESPONSE_CHANNEL, 13-4
    route information, 13-6
    SPECIAL_CHANNEL, 13-5

N
No block response packet, 13-14
Nop packet
    under INPUT I/O PORT HEADER TICK packet format, 13-15
    under SPECIAL_CHANNEL packet format, 13-15
Note convention, 1-3
Numbering convention, 1-3

P
Packet formats
    FORWARD_CHANNEL, 13-13
    INPUT I/O PORT header tick, 13-15
    IO_CHANNEL, 13-12
    REQUEST_CHANNEL, 13-13
    RESPONSE_CHANNEL, 13-14
    SPECIAL_CHANNEL, 13-15
Privileged architecture library code. See PALcode

R
Ranges and extents convention, 1-3
RdBytes packet format, 13-12
Register access abbreviations, 1-1
Register figure conventions, 1-4
Release response packet, 13-15
REQUEST_CHANNEL
    messages, 13-3
    packet format, 13-13
RESPONSE_CHANNEL
    messages, 13-4
    packet format, 13-14
RO,n convention, 1-2
RW,n convention, 1-2

S
Second-level cache. See Bcache
Security holes with UNPREDICTABLE results, 1-5
SharedInvalBroadcast message, 13-16
Signal name convention, 1-4
SPECIAL_CHANNEL
    messages, 13-5
    packet formats, 13-15

U
Unaligned convention, 1-2

V
Victim block response packet, 13-14

W
WO,n convention, 1-2
WrBytes packet format, 13-12

X
X convention, 1-3

Z
Zbox
    DIFT control ZBOXn_DIFT_CTL, 16-73
    DIFT timeout ZBOXn_DIFT_TIMEOUT, 16-76
    DRAM calibration control 1 ZBOXn_DRAM_CALIB_CTL1, 16-68
    DRAM calibration control 2 ZBOXn_DRAM_CALIB_CTL2, 16-69
    DRAM error address ZBOXn_DRAM_ERR_ADR, 16-75
    DRAM error control ZBOXn_DRAM_ERROR_CTL, 16-56
    DRAM error status 1 ZBOXn_DRAM_ERR_STATUS1, 16-52
    DRAM error status 2 ZBOXn_DRAM_ERR_STATUS2, 16-53
    DRAM error status 3 ZBOXn_DRAM_ERR_STATUS3, 16-54
    DRAM initialization control ZBOXn_DRAM_INIT_CTL, 16-72
    DRAM mapper control ZBOXn_DRAM_MAPPER_CTL, 16-77
    DRAM refresh control ZBOXn_DRAM_REFR_CTL, 16-66
    DRAM refresh row ZBOXn_DRAM_REFRESH_ROW, 16-71
    DRAM timing control 1 ZBOXn_DRAM_TIMING_CTL1, 16-58
    DRAM timing control 2 ZBOXn_DRAM_TIMING_CTL2, 16-61
    DRAM timing control 3 ZBOXn_DRAM_TIMING_CTL3, 16-62
    DRAM timing control 4 ZBOXn_DRAM_TIMING_CTL4, 16-71
Zbox DIFT error status ZBOXn_DIFT_ERR_STATUS, 16-90
Zbox force-error address ZBOXn_FRC_ERR_ADR, 16-89
Zbox performance control ZBOXn_ZPM_CTL, 16-85
Zbox performance counter 0 ZBOXn_ZPM_CTR0, 16-83
Zbox performance counter 1 ZBOXn_ZPM_CTR1, 16-84
Zbox RAC control ZBOXn_RAC_CTL, 16-91
Zbox sweep directory bits ZBOXn_DRAM_SWEEP_DIR, 16-88
ZBOXn_DIFT_CTL
    DIFT control register, 16-73
ZBOXn_DIFT_ERR_STATUS
    Zbox DIFT error status register, 16-90
ZBOXn_DIFT_TIMEOUT
    DIFT timeout register, 16-76
ZBOXn_DRAM_CALIB_CTL1
    DRAM calibration control 1 register, 16-68
ZBOXn_DRAM_CALIB_CTL2
    DRAM calibration control 2 register, 16-69
ZBOXn_DRAM_ERR_ADR
    DRAM error address register, 16-75
ZBOXn_DRAM_ERR_STATUS1
    DRAM error status 1 register, 16-52
ZBOXn_DRAM_ERR_STATUS2
    DRAM error status 2 register, 16-53
ZBOXn_DRAM_ERR_STATUS3
    DRAM error status 3 register, 16-54
ZBOXn_DRAM_ERROR_CTL
    DRAM error control register, 16-56
ZBOXn_DRAM_INIT_CTL
    DRAM initialization control register, 16-72
ZBOXn_DRAM_MAPPER_CTL
    DRAM mapper control register, 16-77
ZBOXn_DRAM_REFR_CTL
    DRAM refresh control register, 16-66
ZBOXn_DRAM_REFRESH_ROW
    DRAM refresh row register, 16-71
ZBOXn_DRAM_SWEEP_DIR
    Zbox sweep directory bits register, 16-88
ZBOXn_DRAM_TIMING_CTL1
    DRAM timing control 1 register, 16-58
ZBOXn_DRAM_TIMING_CTL2
    DRAM timing control 2 register, 16-61
ZBOXn_DRAM_TIMING_CTL3
    DRAM timing control 3 register, 16-62
ZBOXn_DRAM_TIMING_CTL4
    DRAM timing control 4 register, 16-71
ZBOXn_FRC_ERR_ADR
    Zbox force-error address register, 16-89
ZBOXn_RAC_CTL
    Zbox RAC control register, 16-91
ZBOXn_ZPM_CTL
    Zbox performance control register, 16-85
ZBOXn_ZPM_CTR0
    Zbox performance counter 0 register, 16-83
ZBOXn_ZPM_CTR1
    Zbox performance counter 1 register, 16-84