Document:	Alpha Architecture Handbook
Order Number:	EC-H1689-10
Revision:	0
Pages:	246

Alpha Architecture
Handbook

Digital believes that the information in this publication is accurate as of its publication date; such
information is subject to change without notice. Digital is not responsible for any inadvertent
errors.
Copyright © 1992 Digital Equipment Corporation
All rights reserved. Printed in U.S.A.
The following are trademarks of Digital Equipment Corporation: PDP-11, VAX, VMS, ULTRIX,
and the Digital logo.
OSF/1 is a registered trademark of Open Software Foundation, Inc.

UNIX is a registered trademark of UNIX System Laboratories, Inc.

Table of Contents

Preface

Chapter 1 · Introduction
The Alpha Approach to RISC Architecture
Data Format Overview
Instruction Format Overview
Instruction Overview
Instruction Set Characteristics
Terminology and Conventions
Numbering
Security Holes
UNPREDICTABLE and UNDEFINED
Ranges and Extents
ALIGNED and UNALIGNED
Must Be Zero (MBZ)
Read As Zero (RAZ)
Should Be Zero (SBZ)
Ignore (IGN)
Implementation Dependent (IMP)
Figure Drawing Conventions
Macro Code Example Conventions


Chapter 2 • Basic Architecture
Addressing
Data Types
Byte
Word
Longword
Quadword
VAX Floating-Point Formats
F_floating
G_floating
D_floating
IEEE Floating-Point Formats
S_floating
T_floating
Longword Integer Format in Floating-Point Unit
Quadword Integer Format in Floating-Point Unit
Data Types with No Hardware Support



Chapter 3 • Instruction Formats

Alpha Registers
Program Counter
Integer Registers
Floating-Point Registers
Lock Registers
Optional Registers
Memory Prefetch Registers
VAX Compatibility Register
Notation
Operand Notation
Instruction Operand Notation
Operators
Notation Conventions
Instruction Formats
Memory Instruction Format
Memory Format Instructions with a Function Code
Memory Format Jump Instructions
Branch Instruction Format
Operate Instruction Format
Floating-Point Operate Instruction Format
Floating-Point Convert Instructions
PALcode Instruction Format


Chapter 4 • Instruction Descriptions

Instruction Set Overview
Subsetting Rules
Floating-Point Subsets
Software Emulation Rules
Opcode Qualifiers
Memory Integer Load/Store Instructions
Load Address
Load Memory Data into Integer Register
Load Unaligned Memory Data into Integer Register
Load Memory Data into Integer Register Locked
Store Integer Register Data into Memory Conditional
Store Integer Register Data into Memory
Store Unaligned Integer Register Data into Memory
Control Instructions
Conditional Branch
Unconditional Branch
Jumps



Integer Arithmetic Instructions
Longword Add
Scaled Longword Add
Quadword Add
Scaled Quadword Add
Integer Signed Compare
Integer Unsigned Compare
Longword Multiply
Quadword Multiply
Unsigned Quadword Multiply High
Longword Subtract
Scaled Longword Subtract
Quadword Subtract
Scaled Quadword Subtract
Logical and Shift Instructions
Logical Functions
Conditional Move Integer
Shift Logical
Shift Arithmetic
Byte-Manipulation Instructions
Compare Byte
Extract Byte
Byte Insert
Byte Mask
Zero Bytes
Floating-Point Instructions
Floating Subsets and Floating Faults
Definitions
Encodings
Floating-Point Rounding Modes
Floating-Point Trapping Modes
Imprecise/Software Completion Trap Modes
Invalid Operation Arithmetic Trap
Division by Zero Arithmetic Trap
Overflow Arithmetic Trap
Underflow Arithmetic Trap
Inexact Result Arithmetic Trap
Integer Overflow Arithmetic Trap
Floating-Point Single-Precision Operations
FPCR Register and Dynamic Rounding Mode
Accessing the FPCR
Default Values of the FPCR
Saving and Restoring the FPCR
IEEE Standard



Memory Format Floating-Point Instructions
Load F_floating
Load G_floating
Load S_floating
Load T_floating
Store F_floating
Store G_floating
Store S_floating
Store T_floating
Branch Format Floating-Point Instructions
Conditional Branch
Floating-Point Operate Format Instructions
Copy Sign
Convert Integer to Integer
Floating-Point Conditional Move
Move from/to Floating-Point Control Register
VAX Floating Add
IEEE Floating Add
VAX Floating Compare
IEEE Floating Compare
Convert VAX Floating to Integer
Convert Integer to VAX Floating
Convert VAX Floating to VAX Floating
Convert IEEE Floating to Integer
Convert Integer to IEEE Floating
Convert IEEE Floating to IEEE Floating
VAX Floating Divide
IEEE Floating Divide
VAX Floating Multiply
IEEE Floating Multiply
VAX Floating Subtract
IEEE Floating Subtract
Miscellaneous Instructions
Call Privileged Architecture Library
Prefetch Data
Memory Barrier
Read Process Cycle Counter
Trap Barrier
VAX Compatibility Instructions
VAX Compatibility Instructions



Chapter 5 • System Architecture and Programming Implications

Introduction
Physical Memory Behavior
Coherency of Memory Access
Granularity of Memory Access
Width of Memory Access
Memory-Like Behavior
Translation Buffers and Virtual Caches
Caches and Write Buffers
Data Sharing
Atomic Change of a Single Datum
Atomic Update of a Single Datum
Atomic Update of Data Structures
Ordering Considerations for Shared Data Structures
Read/Write Ordering
Alpha Shared Memory Model
Architectural Definition of Processor Issue Sequence
Definition of Processor Issue Order
Definition of Memory Access Sequence
Definition of Location Access Order
Definition of Storage
Relationship Between Issue Order and Access Order
Definition of Before
Definition of After
Timeliness
Litmus Tests
Litmus Test 1 (Impossible Sequence)
Litmus Test 2 (Impossible Sequence)
Litmus Test 3 (Impossible Sequence)
Litmus Test 4 (Sequence Okay)
Litmus Test 5 (Sequence Okay)
Litmus Test 6 (Sequence Okay)
Litmus Test 7 (Impossible Sequence)
Litmus Test 8 (Impossible Sequence)
Litmus Test 9 (Impossible Sequence)
Implied Barriers
Implications for Software
Single-Processor Data Stream
Single-Processor Instruction Stream
Multiple-Processor Data Stream (Including Single Processor with DMA I/O)
Multiple-Processor Instruction Stream (Including Single Processor with DMA I/O)
Multiple-Processor Context Switch
Multiple-Processor Send/Receive Interrupt
Implications for Hardware
Arithmetic Traps


Chapter 6 • Common PALcode Architecture
PALcode
PALcode Environment
Special Functions Required for PALcode
PALcode Effects on System Code
PALcode Replacement
Required PALcode Instructions
Halt
Instruction Memory Barrier


Chapter 7 • Console Subsystem Overview
Chapter 8 • Alpha VMS
Unprivileged VMS PALcode Instructions
Privileged VMS PALcode Instructions


Chapter 9 • Alpha OSF/1
Unprivileged OSF/1 PALcode Instructions
Privileged OSF/1 PALcode Instructions


Appendix A • Software Considerations
Hardware-Software Compact
Instruction-Stream Considerations
Instruction Alignment
Multiple Instruction Issue - Factor of 3
Branch Prediction and Minimizing Branch-Taken - Factor of 3
Improving I-Stream Density - Factor of 3
Instruction Scheduling - Factor of 3
Data-Stream Considerations
Data Alignment - Factor of 10
Shared Data in Multiple Processors - Factor of 3
Avoiding Cache/TB Conflicts - Factor of 1
Sequential Read/Write - Factor of 1
Prefetching - Factor of 3
Code Sequences
Aligned Byte/Word Memory Accesses
Division
Stylized Code Forms
NOP
Clear a Register
Load Literal
Register-to-Register Move
Negate
NOT
Booleans



Trap Barrier
Pseudo-Operations (Stylized Code Forms)
Timing Considerations: Atomic Sequences


Appendix B · IEEE Floating-Point Conformance
Alpha Choices for IEEE Options
Alpha Hardware Support of Software Exception Handlers
Mapping to IEEE Standard


Appendix C · Instruction Encodings
Memory Format Instructions
Branch Format Instructions
Operate Format Instructions
Floating-Point Operate Format
IEEE Floating-Point Instructions
VAX Floating-Point Instructions
Required PALcode Function Codes
Opcodes Reserved to PALcode
Opcodes Reserved to Digital
Opcode Summary


Index
Figures
1-1 Instruction Format Overview
2-1 Byte Format
2-2 Word Format
2-3 Longword Format
2-4 Quadword Format
2-5 F_floating Datum
2-6 F_floating Register Format
2-7 G_floating Datum
2-8 G_floating Format
2-9 D_floating Datum
2-10 D_floating Register Format
2-11 S_floating Datum
2-12 S_floating Register Format
2-13 T_floating Datum
2-14 T_floating Register Format
2-15 Longword Integer Datum
2-16 Longword Integer Floating-Register Format
2-17 Quadword Integer Datum
2-18 Quadword Integer Floating-Register Format
3-1 Memory Instruction Format
3-2 Memory Instruction with Function Code Format
3-3 Branch Instruction Format
3-4 Operate Instruction Format
3-5 Floating-Point Operate Instruction Format
3-6 PALcode Instruction Format
4-1 Floating-Point Control Register (FPCR) Format
B-1 IEEE Trap Handling Behavior

Tables
2-1 F_floating Load Exponent Mapping
2-2 S_floating Load Exponent Mapping
3-1 Operand Notation
3-2 Operand Value Notation
3-3 Expression Operand Notation
3-4 Operators
4-1 Opcode Qualifiers
4-2 Memory Integer Load/Store Instructions
4-3 Control Instructions Summary
4-4 Jump Instructions Branch Prediction
4-5 Integer Arithmetic Instructions Summary
4-6 Logical and Shift Instructions Summary
4-7 Byte-Manipulation Instructions Summary
4-8 Floating-Point Control Register (FPCR) Bit Descriptions
4-9 Memory Format Floating-Point Instructions Summary
4-10 Floating-Point Branch Instructions Summary
4-11 Floating-Point Operate Instructions Summary
4-12 Miscellaneous Instructions Summary
4-13 VAX Compatibility Instructions Summary
5-1 Processor Issue Order
5-2 Location Access Order
6-1 Required PALcode Instructions
8-1 Unprivileged VMS PALcode Instruction Summary
8-2 Privileged VMS PALcode Instructions Summary
9-1 Unprivileged OSF/1 PALcode Instruction Summary
9-2 Privileged OSF/1 PALcode Instruction Summary
A-1 Decodable Pseudo-Operations (Stylized Code Forms)
B-1 IEEE Floating-Point Trap Handling
B-2 IEEE Standard Charts
C-1 Memory Format Instruction Opcodes
C-2 Memory Format Instructions with a Function Code
C-3 Memory Format Branch Instruction Opcodes
C-4 Branch Format Instruction Opcodes
C-5 Operate Format Instruction Opcodes and Function Codes
C-6 Function Codes for Floating Data Type Independent Operations
C-7 IEEE Floating-Point Instruction Function Codes
C-8 VAX Floating-Point Instruction Function Codes
C-9 Required PALcode Function Codes
C-10 Opcodes Reserved for PALcode
C-11 Opcodes Reserved for Digital
C-12 Opcode Summary
C-13 Key to Opcode Summary (Table C-12)

Preface

This book describes Digital's next generation RISC architecture. It is directly derived from
sections of the Alpha System Reference Manual and is an accurate representation of the described
parts of the Alpha architecture.

Chapter 1 · Introduction

Alpha is a 64-bit load/store RISC architecture that is designed with particular emphasis on the
three elements that most affect performance: clock speed, multiple instruction issue, and multiple
processors.
The Alpha architects examined and analyzed current and theoretical RISC architecture design
elements and developed high-performance alternatives for the Alpha architecture. The architects
adopted only those design elements that appeared valuable for a projected 25-year design
horizon. Thus, Alpha becomes the first 21st century computer architecture.
The Alpha architecture is designed to avoid bias toward any particular operating system or
programming language. Alpha initially supports the VAX VMS and OSF/1 (UNIX) operating
systems, and supports simple software migration from applications that run on those operating
systems.
This handbook describes in detail how Alpha is designed to be the leadership 64-bit architecture
of the computer industry.

• The Alpha Approach to RISC Architecture
Alpha Is a True 64-Bit Architecture
Alpha was designed as a 64-bit architecture. All registers are 64 bits in length and all operations
are performed between 64-bit registers. It is not a 32-bit architecture that was later expanded to
64 bits.
Alpha Is Designed for Very High-Speed Implementations
The instructions are very simple. All instructions are 32 bits in length. Memory operations are
either loads or stores. All data manipulation is done between registers.

The Alpha architecture facilitates pipelining multiple instances of the same operations because
there are no special registers and no condition codes.
The instructions interact with each other only by one instruction writing a register or memory and
another instruction reading from the same place. That makes it particularly easy to build
implementations that issue multiple instructions every CPU cycle. (The first implementation issues
two instructions per cycle.)
Alpha makes it easy to maintain binary compatibility across multiple implementations and easy to
maintain full speed on multiple-issue implementations. For example, there are no implementation-specific pipeline timing hazards, no load-delay slots, and no branch-delay slots.
Alpha's Approach to Byte Manipulation
The Alpha architecture does byte shifting and masking with normal 64-bit register-to-register
instructions, crafted to keep instruction sequences short.


Alpha does not include single-byte store instructions. This has several advantages:
• Cache and memory implementations need not include byte shift-and-mask logic, and sequencer
logic need not perform read-modify-write on memory locations. Such logic is awkward for
high-speed implementation and tends to slow down cache access to normal 32-bit or 64-bit
aligned quantities.
• Alpha's approach to byte manipulation makes it easier to build a high-speed error-correcting
write-back cache, which is often needed to keep a very fast RISC implementation busy.
• Alpha's approach can make it easier to pipeline multiple byte operations.
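
The consequence of having no single-byte store can be sketched in Python. This is only an illustration under stated assumptions: the dictionary-backed `memory` of aligned quadwords and the helper name `store_byte` are hypothetical, and a real program would perform the mask and insert steps with the byte-manipulation instructions described in Chapter 4, between a quadword load and a quadword store.

```python
MASK64 = (1 << 64) - 1

def store_byte(memory, address, value):
    """Write one byte using only aligned 64-bit loads and stores.

    `memory` is a hypothetical dict mapping 8-byte-aligned addresses to
    64-bit integers, with little-endian byte numbering in each quadword.
    """
    base = address & ~7                  # address of the enclosing aligned quadword
    shift = (address & 7) * 8            # bit position of the target byte
    quad = memory.get(base, 0)           # load the whole quadword
    quad &= ~(0xFF << shift) & MASK64    # mask out the old byte
    quad |= (value & 0xFF) << shift      # insert the new byte
    memory[base] = quad                  # store the whole quadword back
    return quad
```

The memory system sees only full aligned quadword accesses; all shifting and masking happens in registers.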
Alpha's Approach to Arithmetic Traps

Alpha lets the software implementor determine the precision of arithmetic traps. With the Alpha
architecture, arithmetic traps (such as overflow and underflow) are imprecise: they can be
delivered an arbitrary number of instructions after the instruction that triggered the trap. Also,
traps from many different instructions can be reported at once. That makes implementations that
use pipelining and multiple issue substantially easier to build.
However, if precise arithmetic exceptions are desired, trap barrier instructions can be explicitly
inserted in the program to force traps to be delivered at specific points.
Alpha's Approach to Multiprocessor Shared Memory
As viewed from a second processor (including an I/O device), a sequence of reads and writes
issued by one processor may be arbitrarily reordered by an implementation. This allows implementations to use multibank caches, bypassed write buffers, write merging, pipelined writes with
retry on error, and so forth. If strict ordering between two accesses must be maintained, explicit
memory barrier instructions can be inserted in the program.

The basic multiprocessor interlocking primitive is a RISC-style load_locked, modify,
store_conditional sequence. If the sequence runs without interrupt, exception, an interfering
write from another processor, or a CALL_PAL instruction, then the conditional store succeeds.
Otherwise, the store fails and the program eventually must branch back and retry the sequence.
This style of interlocking scales well with very fast caches, and makes Alpha an especially
attractive architecture for building multiple-processor systems.
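
The retry structure of a load_locked, modify, store_conditional sequence can be modeled with a toy Python simulation. The `Memory` class, its `lock_valid` flag, and the `interfere` callback are assumptions made for illustration; they are not part of the architecture and model only a single location and a single interfering write.

```python
class Memory:
    """Toy single-location model of load_locked / store_conditional."""

    def __init__(self, value=0):
        self.value = value
        self.lock_valid = False      # models the processor's lock state

    def load_locked(self):
        self.lock_valid = True
        return self.value

    def interfering_write(self, value):
        # A write from another processor invalidates the lock.
        self.value = value
        self.lock_valid = False

    def store_conditional(self, value):
        if not self.lock_valid:
            return False             # store fails; caller must retry
        self.value = value
        self.lock_valid = False
        return True


def atomic_increment(mem, interfere=None):
    """Retry load_locked/modify/store_conditional until the store succeeds.

    Returns the number of attempts that were needed.
    """
    attempts = 0
    while True:
        attempts += 1
        old = mem.load_locked()
        if interfere is not None and attempts == 1:
            interfere(mem)           # simulate a competing write on the first try
        if mem.store_conditional(old + 1):
            return attempts
```

When no other write intervenes, the sequence completes on the first attempt; after an interfering write, the conditional store fails and the loop branches back, reloading the updated value.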
Alpha Instructions Include Hints for Achieving Higher Speed

A number of Alpha instructions include hints for implementations, all aimed at achieving higher
speed.
• Calculated jump instructions have a target hint that can allow much faster subroutine calls and
returns.
• There are prefetching hints for the memory system that can allow much higher cache hit rates.
• There are granularity hints for the virtual-address mapping that can allow much more effective
use of translation lookaside buffers for large contiguous structures.


PALcode: Alpha's Very Flexible Privileged Software Library
A Privileged Architecture Library (PALcode) is a set of subroutines that are specific to a
particular Alpha operating system implementation. These subroutines provide operating-system
primitives for context switching, interrupts, exceptions, and memory management. PALcode is
similar to the BIOS libraries that are provided in personal computers.
PALcode subroutines are invoked by implementation hardware or by software CALL_PAL
instructions.
PALcode is written in standard machine code with some implementation-specific extensions to
provide access to low-level hardware.
One version of PALcode lets Alpha implementations run the full VMS operating system by
mirroring many of the VAX VMS features. The VMS PALcode instructions let Alpha run VMS with
little more hardware than that found on a conventional RISC machine: the PAL mode bit itself,
plus 4 extra protection bits in each Translation Buffer entry.
Another version of PALcode lets Alpha implementations run the OSF/1 operating system by
mirroring many of the RISC ULTRIX features. Other versions of PALcode can be developed for
real-time, teaching, and other applications.
PALcode makes Alpha an especially attractive architecture for multiple operating systems.
Alpha and Programming Languages
Alpha is an attractive architecture for compiling a large variety of programming languages. Alpha
has been carefully designed to avoid bias toward one or two programming languages. For
example:
• Alpha does not contain a subroutine call instruction that moves a register window by a fixed
amount. Thus, Alpha is a good match for programming languages with many parameters and
programming languages with no parameters.
• Alpha does not contain a global integer overflow enable bit. Such a bit would need to be changed
at every subroutine boundary when a FORTRAN program calls a C program.

• Data Format Overview
Alpha is a load/store RISC architecture with the following data characteristics:
• All operations are done between 64-bit registers.
• Memory is accessed via 64-bit virtual little-endian byte addresses.
• There are 32 integer registers and 32 floating-point registers.
• Longword (32-bit) and quadword (64-bit) integers are supported.
• Four floating-point data types are supported:
- VAX F_floating (32-bit)
- VAX G_floating (64-bit)

- IEEE single (32-bit)
- IEEE double (64-bit)


• Instruction Format Overview
As shown in Figure 1-1, Alpha instructions are all 32 bits in length. As represented in Figure 1-1,
there are four major instruction format classes that contain 0, 1, 2, or 3 register fields. All formats
have a 6-bit opcode.
 31     26 25     21 20     16 15           5 4      0
| Opcode  | Number                                    |  PALcode Format
| Opcode  | RA      | Disp                            |  Branch Format
| Opcode  | RA      | RB      | Disp                  |  Memory Format
| Opcode  | RA      | RB      | Function    | RC      |  Operate Format
Figure 1-1 • Instruction Format Overview
• PALcode instructions specify, in the function code field, one of a few dozen complex operations
to be performed.
• Conditional branch instructions test register Ra and specify a signed 21-bit PC-relative longword
target displacement. Subroutine calls put the return address in register Ra.
• Load and store instructions move longwords or quadwords between register Ra and memory,
using Rb plus a signed 16-bit displacement as the memory address.
• Operate instructions for floating-point and integer operations are both represented in Figure 1-1
by the operate format illustration and are as follows:
- Floating-point operations use Ra and Rb as source registers, and write the result in register Rc.
There is an 11-bit extended opcode in the function field.
- Integer operations use Ra and Rb or an 8-bit literal as the source operand, and write the result
in register Rc.
Integer operate instructions can use the Rb field and part of the function field to specify an
8-bit literal. There is a 7-bit extended opcode in the function field.
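
As an illustration of the four formats, the following Python sketch splits a 32-bit instruction word into its candidate fields. The field boundaries are read from the bit positions in Figure 1-1 (31, 26/25, 21/20, 16/15, 5/4); the function name `decode` and the dictionary keys are hypothetical, and which fields are meaningful depends on the format selected by the opcode.

```python
def decode(instruction):
    """Split a 32-bit instruction word into its candidate fields."""

    def bits(hi, lo):
        # Extract bits <hi:lo>, inclusive at both ends.
        return (instruction >> lo) & ((1 << (hi - lo + 1)) - 1)

    def signed(value, width):
        # Interpret `value` as a two's-complement number of `width` bits.
        return value - (1 << width) if value >> (width - 1) else value

    return {
        "opcode": bits(31, 26),                  # 6 bits, present in all formats
        "ra": bits(25, 21),
        "rb": bits(20, 16),
        "rc": bits(4, 0),                        # operate format only
        "mem_disp": signed(bits(15, 0), 16),     # memory format, in bytes
        "branch_disp": signed(bits(20, 0), 21),  # branch format, in longwords
        "pal_function": bits(25, 0),             # PALcode format
    }
```

For example, a memory-format word built with a 16-bit displacement field of FFF8 (hexadecimal) decodes to a displacement of -8.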

• Instruction Overview
PALcode Instructions
As described above, a Privileged Architecture Library (PALcode) is a set of subroutines that is
specific to a particular Alpha operating-system implementation. These subroutines can be
invoked by hardware or by software CALL_PAL instructions, which use the function field to
vector to the specified subroutine.
Branch Instructions
Conditional branch instructions can test a register for positive/negative or for zero/nonzero. They
can also test integer registers for even/odd.
Unconditional branch instructions can write a return address into a register.


There is also a calculated jump instruction that branches to an arbitrary 64-bit address in a
register.
Load/Store Instructions
Load and store instructions move either 32-bit or 64-bit aligned quantities from and to memory.
Memory addresses are flat 64-bit virtual addresses, with no segmentation.

The VAX floating-point load/store instructions swap words to give a consistent register format for
floating-point operations.
A 32-bit integer datum is placed in a register in a canonical form that makes 33 copies of the high
bit of the datum. A 32-bit floating-point datum is placed in a register in a canonical form that
extends the exponent by 3 bits and extends the fraction with 29 low-order zeros. The 32-bit
operates preserve these canonical forms.
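
The canonical integer form amounts to sign extension: bits 63 through 31 of the register all hold the high bit of the 32-bit datum. A minimal Python sketch, in which the helper name `canonical_longword` is an assumption for illustration:

```python
MASK64 = (1 << 64) - 1

def canonical_longword(value):
    """Return the 64-bit register image of a 32-bit integer datum."""
    low = value & 0xFFFFFFFF
    if low & 0x80000000:                 # high bit set: fill bits 63..32 with ones
        return low | (0xFFFFFFFF << 32)
    return low                           # high bit clear: bits 63..32 are zero
```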
There are facilities for doing byte manipulation in registers, eliminating the need for 8-bit or
16-bit load/store instructions.
Compilers, as directed by user declarations, can generate any mixture of 32-bit and 64-bit
operations. The Alpha architecture has no 32/64 mode bit.
Integer Operate Instructions
The integer operate instructions manipulate full 64-bit values, and include the usual assortment of
arithmetic, compare, logical, and shift instructions.

There are just three 32-bit integer operates: add, subtract, and multiply. They differ from their
64-bit counterparts only in overflow detection and in producing 32-bit canonical results.
There is no integer divide instruction.
The Alpha architecture also supports the following additional operations:
• Scaled add/subtract instructions for quick subscript calculation
• 128-bit multiply for division by a constant, and multiprecision arithmetic
• Conditional move instructions for avoiding branch instructions
• An extensive set of in-register byte and word manipulation instructions
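
Because there is no integer divide instruction, division by a constant is typically done by multiplying by a scaled reciprocal and keeping the high half of the 128-bit product. The Python sketch below shows this for unsigned division by 10. The helper names `umulh` and `div10` are illustrative, and the reciprocal constant is the widely used value ceil(2**67 / 10), which is not taken from this handbook.

```python
MASK64 = (1 << 64) - 1
RECIP_10 = 0xCCCCCCCCCCCCCCCD            # ceil(2**67 / 10)

def umulh(a, b):
    """High 64 bits of the unsigned 128-bit product of two quadwords."""
    return ((a & MASK64) * (b & MASK64)) >> 64

def div10(x):
    """Unsigned 64-bit x // 10 using a multiply-high and a shift."""
    return umulh(x, RECIP_10) >> 3       # 64 + 3 = 67 bits of total right shift
```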
Integer overflow trap enable is encoded in the function field of each instruction, rather than kept
in a global state bit. Thus, for example, both ADDQ/V and ADDQ opcodes exist for specifying
64-bit ADD with and without overflow checking. That makes it easier to pipeline
implementations.
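
The effect of encoding the overflow enable in each instruction can be sketched as a Python function whose flag plays the role of the /V qualifier. This is a simplified model under stated assumptions: the names `addq` and `trap_on_overflow` are hypothetical, and a Python exception stands in for the arithmetic trap without modeling its imprecise delivery.

```python
MASK64 = (1 << 64) - 1

def to_signed(x):
    """Interpret the low 64 bits of x as a two's-complement integer."""
    x &= MASK64
    return x - (1 << 64) if x >> 63 else x

def addq(a, b, trap_on_overflow=False):
    """64-bit add; the flag models the per-instruction /V qualifier."""
    exact = to_signed(a) + to_signed(b)
    result = exact & MASK64              # wrapped 64-bit result
    if trap_on_overflow and to_signed(result) != exact:
        raise OverflowError("integer overflow arithmetic trap")
    return result
```

Two adjacent calls can make different choices about overflow checking without changing any shared state between them.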
Floating-Point Operate Instructions

The floating-point operate instructions include four complete sets of VAX and IEEE arithmetic
instructions, plus instructions for performing conversions between floating-point and integer
quantities.


In addition to the operations found in conventional RISC architectures, Alpha includes conditional move instructions for avoiding branches and merge sign/exponent instructions for simple
field manipulation.
The arithmetic trap enables and rounding mode are encoded in the function field of each
instruction, rather than kept in global state bits. That makes it easier to pipeline implementations.

• Instruction Set Characteristics
Alpha instruction set characteristics are as follows:
• All instructions are 32 bits long and have a regular format.
• There are 32 integer registers (R0 through R31), each 64 bits wide. R31 reads as zero, and writes
to R31 are ignored.
• There are 32 floating-point registers (F0 through F31), each 64 bits wide. F31 reads as zero, and
writes to F31 are ignored.
• All integer data manipulation is between integer registers, with up to two variable register source
operands (one may be an 8-bit literal), and one register destination operand.
• All floating-point data manipulation is between floating-point registers, with up to two register
source operands and one register destination operand.
• All memory reference instructions are of the load/store type that move data between registers and
memory.
• There are no branch condition codes. Branch instructions test an integer or floating-point register
value, which may be the result of a previous compare.
• Integer and logical instructions operate on quadwords.
• Floating-point instructions operate on G_floating, F_floating, IEEE double, and IEEE single
operands. D_floating "format compatibility," in which binary files of D_floating numbers may be
processed, but without the last 3 bits of fraction precision, is also provided.
• A minimal number of VAX compatibility instructions are included.

• Terminology and Conventions
The following sections describe the terminology and conventions used in this book.

Numbering
All numbers are decimal unless otherwise indicated. Where there is ambiguity, numbers other
than decimal are indicated with the name of the base in subscript form, for example, 10₁₆.

Security Holes
A security hole is an error of commission, omission, or oversight in a system that allows
protection mechanisms to be bypassed.


Security holes exist when unprivileged software (that is, software running outside of kernel mode)
can:
• Affect the operation of another process without authorization from the operating system;
• Amplify its privilege without authorization from the operating system; or
• Communicate with another process, either overtly or covertly, without authorization from the
operating system.
The Alpha architecture has been designed to contain no architectural security holes. Hardware
(processors, buses, controllers, and so on) and software should likewise be designed to avoid
security holes.

UNPREDICTABLE and UNDEFINED
In this book, the terms UNPREDICTABLE and UNDEFINED are used. Their meanings are quite
different and must be carefully distinguished. One key difference is that only privileged software
(that is, software running in kernel mode) may trigger UNDEFINED operations, whereas either
privileged or unprivileged software may trigger UNPREDICTABLE results or occurrences. A
second key difference is that UNPREDICTABLE results and occurrences do not disrupt the basic
operation of the processor; the processor continues to execute instructions in its normal manner.
In contrast, UNDEFINED operation may halt the processor or cause it to lose information.
A result specified as UNPREDICTABLE may acquire an arbitrary value subject to a few constraints. Such a result may be an arbitrary function of the input operands or of any state
information that is accessible to the process in its current access mode. UNPREDICTABLE results
may be unchanged from their previous values. Operations that produce UNPREDICTABLE results
may also produce exceptions.

UNPREDICTABLE results must not be security holes.
Specifically, UNPREDICTABLE results must not:
• Depend upon, or be a function of, the contents of memory locations or registers that are
inaccessible to the current process in the current access mode.
Also, operations that may produce UNPREDICTABLE results must not:
• Write or modify the contents of memory locations or registers to which the current process in the
current access mode does not have access, or
• Halt or hang the system or any of its components.
For example, a security hole would exist if some UNPREDICTABLE result depended on the value
of a register in another process, on the contents of processor temporary registers left behind by
some previously running process, or on a sequence of actions of different processes.

An occurrence specified as UNPREDICTABLE may happen or not based on an arbitrary choice
function. The choice function is subject to the same constraints as are UNPREDICTABLE results
and, in particular, must not constitute a security hole.


Results or occurrences specified as UNPREDICTABLE may vary from moment to moment,
implementation to implementation, and instruction to instruction within implementations. Software can never depend on results specified as UNPREDICTABLE.
Operations specified as UNDEFINED may vary from moment to moment, implementation to
implementation, and instruction to instruction within implementations. The operation may vary
in effect from nothing, to stopping system operation. UNDEFINED operations must not cause the
processor to hang, that is, reach an unhalted state from which there is no transition to a normal
state in which the machine executes instructions. Only privileged software (that is, software
running in kernel mode) may trigger UNDEFINED operations.

Ranges and Extents
Ranges are specified by a pair of numbers separated by a ".." and are inclusive. For example, a
range of integers 0..4 includes the integers 0, 1, 2, 3, and 4.
Extents are specified by a pair of numbers in angle brackets separated by a colon and are
inclusive. For example, bits <7:3> specify an extent of bits including bits 7, 6, 5, 4, and 3.
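
The extent notation maps directly onto a shift and a mask. A small Python sketch, in which the helper name `extract_extent` is an assumption for illustration:

```python
def extract_extent(value, high, low):
    """Return bits <high:low> of value, inclusive at both ends."""
    width = high - low + 1               # <7:3> covers 5 bits: 7, 6, 5, 4, 3
    return (value >> low) & ((1 << width) - 1)
```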

ALIGNED and UNALIGNED
In this document the terms ALIGNED and NATURALLY ALIGNED are used interchangeably to
refer to data objects that are powers of two in size. An aligned datum of size 2**N is stored in
memory at a byte address that is a multiple of 2**N, that is, one that has N low-order zeros.
Thus, an aligned 64-byte stack frame has a memory address that is a multiple of 64.

If a datum of size 2**N is stored at a byte address that is not a multiple of 2**N, it is called
UNALIGNED.
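
The alignment rule can be checked by testing the N low-order address bits. A short Python sketch, in which the helper name `is_aligned` is an assumption for illustration:

```python
def is_aligned(address, n):
    """True if `address` is a multiple of 2**n, i.e. has n low-order zero bits."""
    return address & ((1 << n) - 1) == 0
```

For the 64-byte stack frame mentioned above, N is 6, so an aligned frame address has its six low-order bits clear.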

Must Be Zero (MBZ)
Fields specified as Must be Zero (MBZ) must never be filled by software with a non-zero value.
These fields may be used at some future time. If the processor encounters a non-zero value in a
field specified as MBZ, an Illegal Operand exception occurs.

Read As Zero (RAZ)
Fields specified as Read as Zero (RAZ) return a zero when read.

Should Be Zero (SBZ)
Fields specified as Should be Zero (SBZ) should be filled by software with a zero value. Non-zero
values in SBZ fields produce UNPREDICTABLE results and may produce extraneous instruction-issue delays.

Ignore (IGN)
Fields specified as Ignore (IGN) are ignored when written.


Implementation Dependent (IMP)
Fields specified as Implementation Dependent (IMP) may be used for implementation-specific
purposes. Each implementation must document fully the behavior of all fields marked as IMP by
the Alpha specification.

Figure Drawing Conventions
Figures that depict registers or memory follow the convention that increasing addresses run right
to left and top to bottom.

Macro Code Example Conventions
All instructions in macro code examples are either listed in Chapter 4 or are stylized code forms
found in Appendix A.

Chapter 2 · Basic Architecture

• Addressing
The basic addressable unit in Alpha is the 8-bit byte. Virtual addresses are 64 bits long. An
implementation may support a smaller virtual address space. The minimum virtual address size is
43 bits.
Virtual addresses as seen by the program are translated into physical memory addresses by the
memory management mechanism.

· Data Types
Following are descriptions of the Alpha architecture data types.

Byte
A byte is 8 contiguous bits starting on an addressable byte boundary. The bits are numbered from
right to left, 0 through 7, as shown in Figure 2-1.

Figure 2-1 • Byte Format
A byte is specified by its address A. A byte is an 8-bit value. The byte is only supported in Alpha
by the extract, mask, insert, and zap instructions.

Word
A word is 2 contiguous bytes starting on an arbitrary byte boundary. The bits are numbered from
right to left, 0 through 15, as shown in Figure 2-2.
Figure 2-2 • Word Format
A word is specified by its address, the address of the byte containing bit 0.
A word is a 16-bit value. The word is only supported in Alpha by the extract, mask, and insert
instructions.

Longword
A longword is 4 contiguous bytes starting on an arbitrary byte boundary. The bits are numbered
from right to left, 0 through 31, as shown in Figure 2-3.
Figure 2-3 • Longword Format
A longword is specified by its address A, the address of the byte containing bit 0. A longword is a
32-bit value.
When interpreted arithmetically, a longword is a two's-complement integer with bits of increasing
significance from 0 through 30. Bit 31 is the sign bit. The longword is only supported in Alpha by
sign-extended load and store instructions and by longword arithmetic instructions.
Note
Alpha implementations will impose a significant performance penalty
when accessing longword operands that are not naturally aligned. (A
naturally aligned longword has zero as the low-order two bits of its
address.)

Quadword
A quadword is 8 contiguous bytes starting on an arbitrary byte boundary. The bits are numbered
from right to left, 0 through 63, as shown in Figure 2-4.
Figure 2-4 • Quadword Format
A quadword is specified by its address A, the address of the byte containing bit 0. A quadword is
a 64-bit value. When interpreted arithmetically, a quadword is either a two's-complement integer
with bits of increasing significance from 0 through 62 and bit 63 as the sign bit, or an unsigned
integer with bits of increasing significance from 0 through 63.
Note
Alpha implementations will impose a significant performance penalty
when accessing quadword operands that are not naturally aligned. (A
naturally aligned quadword has zero as the low-order three bits of its
address.)

VAX Floating-Point Formats
VAX floating-point numbers are stored in one set of formats in memory and in a second set of
formats in registers. The floating-point load and store instructions convert between these formats
purely by rearranging bits; no rounding or range-checking is done by the load and store
instructions.

F_floating
An F_floating datum is 4 contiguous bytes in memory starting on an arbitrary byte boundary. The
bits are labeled from right to left, 0 through 31, as shown in Figure 2-5.
Figure 2-5 • F_floating Datum
[Layout: the word at :A holds S <15>, Exp. <14:7>, and Frac. Hi <6:0>; the word at :A+2 holds Fraction Lo.]

An F_floating operand occupies 64 bits in a floating register, left-justified in the 64-bit register, as
shown in Figure 2-6.
Figure 2-6 • F_floating Register Format
[Layout of register Fx: S <63>, Exp. <62:52>, Frac. Hi <51:45>, Fraction Lo <44:29>, 0 <28:0>.]

The F_floating load instruction reorders bits on the way in from memory, expands the exponent
from 8 to 11 bits, and sets the low-order fraction bits to zero. This produces in the register an
equivalent G_floating number suitable for either F_floating or G_floating operations. The
mapping from 8-bit memory-format exponents to 11-bit register-format exponents is shown in
Table 2-1.
Table 2-1 • F_floating Load Exponent Mapping
Memory <14:7>    Register <62:52>
1 1111111        1 000 1111111
1 xxxxxxx        1 000 xxxxxxx    (xxxxxxx not all 1's)
0 xxxxxxx        0 111 xxxxxxx    (xxxxxxx not all 0's)
0 0000000        0 000 0000000

This mapping preserves both normal values and exceptional values.
The F_floating store instruction reorders register bits on the way to memory and does no
checking of the low-order fraction bits. Register bits <61:59> and <28:0> are ignored by the store
instruction.
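The rows of Table 2-1 can be expressed as a small function. This is a sketch for illustration; the name f_load_exponent is hypothetical. It keeps the high-order exponent bit and the low seven bits, and inserts three bits between them as the table specifies.

```python
def f_load_exponent(mem_exp: int) -> int:
    """Map an 8-bit F_floating memory exponent to the 11-bit register exponent."""
    top = (mem_exp >> 7) & 1          # high-order exponent bit
    low7 = mem_exp & 0x7F             # remaining seven exponent bits
    if top == 1:
        middle = 0b000                # rows "1 1111111" and "1 xxxxxxx"
    elif low7 != 0:
        middle = 0b111                # row "0 xxxxxxx", xxxxxxx not all 0's
    else:
        middle = 0b000                # row "0 0000000"
    return (top << 10) | (middle << 7) | low7
```

For every non-zero memory exponent the result equals the memory exponent plus 896, which is the difference between the excess-1024 and excess-128 biases; a zero exponent stays zero.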

An F_floating datum is specified by its address A, the address of the byte containing bit 0. The
memory form of an F_floating datum is sign magnitude with bit 15 the sign bit, bits <14:7> an
excess-128 binary exponent, and bits <6:0> and <31:16> a normalized 24-bit fraction with the
redundant most significant fraction bit not represented. Within the fraction, bits of increasing
significance are from 16 through 31 and 0 through 6. The 8-bit exponent field encodes the values
0 through 255. An exponent value of 0, together with a sign bit of 0, is taken to indicate that the
F_floating datum has a value of 0.

If the result of a VAX floating-point format instruction has a value of zero, the instruction always
produces a datum with a sign bit of 0, an exponent of 0, and all fraction bits of 0. Exponent
values of 1..255 indicate true binary exponents of -127..127. An exponent value of 0, together
with a sign bit of 1, is taken as a reserved operand. Floating-point instructions processing a
reserved operand take an arithmetic exception. The value of an F_floating datum is in the
approximate range 0.29*10**-38..1.7*10**38. The precision of an F_floating datum is
approximately one part in 2**23, typically 7 decimal digits.
Note
Alpha implementations will impose a significant performance penalty
when accessing F_floating operands that are not naturally aligned. (A
naturally aligned F_floating datum has zero as the low-order two bits of
its address.)
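The memory form described above can be decoded as follows. This is a sketch under stated assumptions: the function name f_floating_value is hypothetical, the input is the 32-bit longword read from address A, and the hidden fraction bit is given the weight 1/2 (the VAX convention of a fraction in the interval 0.5 up to but not including 1), which is consistent with the range quoted above.

```python
def f_floating_value(mem: int) -> float:
    """Decode the 32-bit memory form of an F_floating datum."""
    s = (mem >> 15) & 1                                   # sign, bit 15
    e = (mem >> 7) & 0xFF                                 # excess-128 exponent, bits <14:7>
    frac = ((mem & 0x7F) << 16) | ((mem >> 16) & 0xFFFF)  # bits <6:0> then <31:16>
    if e == 0:
        if s:
            raise ValueError("reserved operand")          # exponent 0 with sign 1
        return 0.0                                        # exponent 0 with sign 0
    magnitude = (0.5 + frac / 2**24) * 2.0 ** (e - 128)   # hidden bit has weight 1/2
    return -magnitude if s else magnitude
```

With these assumptions, the memory longword 0x00004080 decodes to 1.0, and the smallest positive value (exponent 1, fraction 0) is 2**-128, approximately 0.29*10**-38.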

G_floating
A G_floating datum in memory is 8 contiguous bytes starting on an arbitrary byte boundary. The
bits are labeled from right to left, 0 through 63, as shown in Figure 2-7.
Figure 2-7 • G_floating Datum
[Layout: the word at :A holds S <15>, Exp. <14:4>, and Frac. Hi <3:0>; :A+2 holds Fraction Midh; :A+4 holds Fraction Midl; :A+6 holds Fraction Lo.]
A G_floating operand occupies 64 bits in a floating register, arranged as shown in Figure 2-8.
Figure 2-8 • G_floating Format
[Layout of register Fx: S <63>, Exp. <62:52>, Frac. Hi <51:48>, Fraction Midh <47:32>, Fraction Midl <31:16>, Fraction Lo <15:0>.]

A G_floating datum is specified by its address A, the address of the byte containing bit 0. The
form of a G_floating datum is sign magnitude with bit 15 the sign bit, bits <14:4> an excess-1024
binary exponent, and bits <3:0> and <63:16> a normalized 53-bit fraction with the redundant
most significant fraction bit not represented. Within the fraction, bits of increasing significance
are from 48 through 63, 32 through 47, 16 through 31, and 0 through 3. The 11-bit exponent
field encodes the values 0 through 2047. An exponent value of 0, together with a sign bit of 0, is
taken to indicate that the G_floating datum has a value of 0.

If the result of a floating-point instruction has a value of zero, the instruction always produces a
datum with a sign bit of 0, an exponent of 0, and all fraction bits of 0. Exponent values of 1..2047
indicate true binary exponents of -1023..1023. An exponent value of 0, together with a sign bit of
1, is taken as a reserved operand. Floating-point instructions processing a reserved operand take a
user-visible arithmetic exception. The value of a G_floating datum is in the approximate range
0.56*10**-308..0.9*10**308. The precision of a G_floating datum is approximately one part in
2**52, typically 15 decimal digits.
Note
Alpha implementations will impose a significant performance penalty
when accessing G_floating operands that are not naturally aligned. (A
naturally aligned G_floating datum has zero as the low-order three bits
of its address.)
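Comparing the memory form (Figure 2-7) with the register form (Figure 2-8) suggests that a G_floating load amounts to reversing the order of the four 16-bit words of the quadword. The sketch below assumes that reading of the figures and a little-endian quadword in which the word at :A occupies bits <15:0>; the function name g_floating_reorder is hypothetical.

```python
def g_floating_reorder(quad: int) -> int:
    """Reverse the four 16-bit words of a 64-bit value (memory form <-> register form)."""
    words = [(quad >> (16 * i)) & 0xFFFF for i in range(4)]   # words[0] is the word at :A
    return (words[0] << 48) | (words[1] << 32) | (words[2] << 16) | words[3]
```

Because the word reversal is its own inverse, the same function models both the load and the store direction in this sketch.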

D_floating
A D_floating datum in memory is 8 contiguous bytes starting on an arbitrary byte boundary. The
bits are labeled from right to left, 0 through 63, as shown in Figure 2-9.
Figure 2-9 • D_floating Datum
[Layout: the word at :A holds S <15>, Exp. <14:7>, and Frac. Hi <6:0>; :A+2 holds Fraction Midh; :A+4 holds Fraction Midl; :A+6 holds Fraction Lo.]

A D_floating operand occupies 64 bits in a floating register, arranged as shown in Figure 2-10.
Figure 2-10 • D_floating Register Format
[Layout of register Fx: S <63>, Exp. <62:55>, Frac. Hi <54:48>, Fraction Midh <47:32>, Fraction Midl <31:16>, Fraction Lo <15:0>.]

The reordering of bits required for a D_floating load or store is identical to that required for a
G_floating load or store. The G_floating load and store instructions are therefore used for
loading or storing D_floating data.

A D_floating datum is specified by its address A, the address of the byte containing bit 0. The
memory form of a D_floating datum is identical to an F_floating datum except for 32 additional
low significance fraction bits. Within the fraction, bits of increasing significance are from 48
through 63, 32 through 47, 16 through 31, and 0 through 6. The exponent conventions and
approximate range of values are the same for D_floating as for F_floating. The precision of a
D_floating datum is approximately one part in 2**55, typically 16 decimal digits.
Note
D_floating is not a fully supported data type; no D_floating arithmetic
operations are provided in the architecture. For backward compatibility,
exact D_floating arithmetic may be provided via software emulation.
D_floating "format compatibility" in which binary files of D_floating
numbers may be processed, but without the last 3 bits of fraction precision, can be obtained via conversions to G_floating, G arithmetic operations, then conversion back to D_floating.
Note
Alpha implementations will impose a significant performance penalty on
access to D_floating operands that are not naturally aligned. (A naturally
aligned D_floating datum has zero as the low-order three bits of its
address.)

IEEE Floating-Point Formats
The IEEE standard for binary floating-point arithmetic, ANSI/IEEE 754-1985, defines four floating-point formats in two groups, basic and extended, each having two widths, single and double.
The Alpha architecture supports the basic single and double formats, with the basic double
format serving as the extended single format. The values representable within a format are
specified by using three integer parameters:
1. P, the number of fraction bits
2. Emax, the maximum exponent
3. Emin, the minimum exponent
Within each format, only the following entities are permitted:

1. Numbers of the form (-1)**S x 2**E x b(0).b(1)b(2)..b(P-1) where:
a. S = 0 or 1
b. E = any integer between Emin and Emax, inclusive
c. b(n) = 0 or 1
2. Two infinities, positive and negative
3. At least one Signaling NaN
4. At least one Quiet NaN
NaN is an acronym for Not-a-Number. A NaN is an IEEE floating-point bit pattern that
represents something other than a number. NaNs come in two forms: Signaling NaNs and Quiet
NaNs. Signaling NaNs are used to provide values for uninitialized variables and for arithmetic
enhancements. Quiet NaNs provide retrospective diagnostic information regarding previous
invalid or unavailable data and results. Signaling NaNs signal an invalid operation when they are
an operand to an arithmetic instruction, and may generate an arithmetic exception. Quiet NaNs
propagate through almost every operation without generating an arithmetic exception.
Arithmetic with the infinities is handled as if the operands were of arbitrarily large magnitude.
Negative infinity is less than every finite number; positive infinity is greater than every finite
number.

S_floating
An IEEE single-precision, or S_floating, datum occupies 4 contiguous bytes in memory starting
on an arbitrary byte boundary. The bits are labeled from right to left, 0 through 31, as shown in
Figure 2-11.
Figure 2-11 • S_floating Datum
[Layout: the word at :A holds Fraction Lo; the word at :A+2 holds S <15>, Exp. <14:7>, and Frac. Hi <6:0>.]

An S_floating operand occupies 64 bits in a floating register, left-justified in the 64-bit register, as
shown in Figure 2-12.
Figure 2-12 • S_floating Register Format
[Layout of register Fx: S <63>, Exp. <62:52>, Frac. Hi <51:45>, Fraction Lo <44:29>, 0 <28:0>.]
The S_floating load instruction reorders bits on the way in from memory, expanding the
exponent from 8 to 11 bits, and sets the low-order fraction bits to zero. This produces in the
register an equivalent T_floating number, suitable for either S_floating or T_floating operations.
The mapping from 8-bit memory-format exponents to 11-bit register-format exponents is shown
in Table 2-2.
Table 2-2 • S_floating Load Exponent Mapping
Memory <30:23>   Register <62:52>
1 1111111        1 111 1111111
1 xxxxxxx        1 000 xxxxxxx    (xxxxxxx not all 1's)
0 xxxxxxx        0 111 xxxxxxx    (xxxxxxx not all 0's)
0 0000000        0 000 0000000

This mapping preserves both normal values and exceptional values. Note that the mapping for all
1's differs from that of F_floating load, since for S_floating all 1's is an exceptional value and for
F_floating all 1's is a normal value.
The S_floating store instruction reorders register bits on the way to memory and does no
checking of the low-order fraction bits. Register bits <61:59> and <28:0> are ignored by the store
instruction. The S_floating load instruction does no checking of the input.
The S_floating store instruction does no checking of the data; the preceding operation should
have specified an S_floating result.
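Table 2-2 can be sketched in the same way as the F_floating mapping, with one extra row for the all-ones exponent. The function name s_load_exponent is a hypothetical name used only for this illustration.

```python
def s_load_exponent(mem_exp: int) -> int:
    """Map an 8-bit S_floating memory exponent to the 11-bit register exponent."""
    top = (mem_exp >> 7) & 1          # high-order exponent bit
    low7 = mem_exp & 0x7F             # remaining seven exponent bits
    if mem_exp == 0xFF:
        middle = 0b111                # all 1's stays all 1's (NaN or infinity)
    elif top == 1:
        middle = 0b000                # row "1 xxxxxxx", xxxxxxx not all 1's
    elif low7 != 0:
        middle = 0b111                # row "0 xxxxxxx", xxxxxxx not all 0's
    else:
        middle = 0b000                # row "0 0000000"
    return (top << 10) | (middle << 7) | low7
```

In this sketch the normal exponents 1..254 are rebased from excess-127 to excess-1023 (an addition of 896), while 0 and 255 map to 0 and 2047 so that zeros, denormals, infinities, and NaNs keep their special encodings.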
An S_floating datum is specified by its address A, the address of the byte containing bit 0. The
memory form of an S_floating datum is sign magnitude with bit 31 the sign bit, bits <30:23> an
excess-127 binary exponent, and bits <22:0> a 23-bit fraction.
The value (V) of an S_floating number is inferred from its constituent sign (S), exponent (E), and
fraction (F) fields as follows:
1. If E=255 and F<>0, then V is NaN, regardless of S.
2. If E=255 and F=0, then V = (-1)**S x Infinity.
3. If 0 < E < 255, then V = (-1)**S x 2**(E-127) x (1.F).
4. If E=0 and F<>0, then V = (-1)**S x 2**(-126) x (0.F).
5. If E=0 and F=0, then V = (-1)**S x 0 (zero).
Floating-point operations on S_floating numbers may take an arithmetic exception for a variety of
reasons, including invalid operations, overflow, underflow, division by zero, and inexact results.
Note
Alpha implementations will impose a significant performance penalty
when accessing S_floating operands that are not naturally aligned. (A
naturally aligned S_floating datum has zero as the low-order two bits of
its address.)
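The five value rules for S_floating can be checked with a short Python sketch. The function name s_floating_value is an assumption made for illustration; the input is the 32-bit memory form taken as an integer.

```python
import math
import struct


def s_floating_value(bits: int) -> float:
    """Evaluate the S_floating value rules on a 32-bit memory-form integer."""
    s = (bits >> 31) & 1              # sign, bit 31
    e = (bits >> 23) & 0xFF           # excess-127 exponent, bits <30:23>
    f = bits & 0x7FFFFF               # 23-bit fraction, bits <22:0>
    sign = -1.0 if s else 1.0
    if e == 255:
        return math.nan if f != 0 else sign * math.inf      # rules 1 and 2
    if e == 0:
        return sign * (f / 2**23) * 2.0 ** -126             # rules 4 and 5 (0.F)
    return sign * (1 + f / 2**23) * 2.0 ** (e - 127)        # rule 3 (1.F)


# Cross-check one pattern against the host's IEEE single-precision decoding.
reference = struct.unpack("<f", struct.pack("<I", 0xC0490FDB))[0]
```

The cross-check relies on the host Python using IEEE 754 single precision for the struct "f" format, which is the case for the standard sizes selected by the "<" prefix.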

T_floating
An IEEE double-precision, or T_floating, datum occupies 8 contiguous bytes in memory starting
on an arbitrary byte boundary. The bits are labeled from right to left, 0 through 63, as shown in
Figure 2-13.
Figure 2-13 • T_floating Datum
[Layout: :A holds Fraction Lo; :A+2 holds Fraction Midl; :A+4 holds Fraction Midh; the word at :A+6 holds S <15>, Exponent <14:4>, and Frac. Hi <3:0>.]

A T_floating operand occupies 64 bits in a floating register, arranged as shown in Figure 2-14.
Figure 2-14 • T_floating Register Format
[Layout of register Fx: S <63>, Exp. <62:52>, Frac. Hi <51:48>, Fraction Midh <47:32>, Fraction Midl <31:16>, Fraction Lo <15:0>.]
The T_floating load instruction performs no bit reordering on input, nor does it perform
checking of the input data.
The T_floating store instruction performs no bit reordering on output. This instruction does no
checking of the data; the preceding operation should have specified a T_floating result.
A T_floating datum is specified by its address A, the address of the byte containing bit 0. The
form of a T_floating datum is sign magnitude with bit 63 the sign bit, bits <62:52> an
excess-1023 binary exponent, and bits <51:0> a 52-bit fraction.
The value (V) of a T_floating number is inferred from its constituent sign (S), exponent (E), and
fraction (F) fields as follows:
1. If E=2047 and F<>0, then V is NaN, regardless of S.
2. If E=2047 and F=0, then V = (-1)**S x Infinity.
3. If 0 < E < 2047, then V = (-1)**S x 2**(E-1023) x (1.F).
4. If E=0 and F<>0, then V = (-1)**S x 2**(-1022) x (0.F).
5. If E=0 and F=0, then V = (-1)**S x 0 (zero).
Floating-point operations on T_floating numbers may take an arithmetic exception for a variety
of reasons, including invalid operations, overflow, underflow, division by zero, and inexact
results.
Note
Alpha implementations will impose a significant performance penalty
when accessing T_floating operands that are not naturally aligned. (A
naturally aligned T_floating datum has zero as the low-order three bits of
its address.)

Longword Integer Format in Floating-Point Unit
A longword integer operand occupies 32 bits in memory, arranged as shown in Figure 2-15.
Figure 2-15 • Longword Integer Datum
[Layout: the word at :A holds Integer Lo; the word at :A+2 holds Integer Hi.]

A longword integer operand occupies 64 bits in a floating register, arranged as shown in
Figure 2-16.
Figure 2-16 • Longword Integer Floating-Register Format
[Layout of register Fx: S <63>, I <62>, xxx <61:59>, Integer Hi <58:45>, Integer Lo <44:29>, 0 <28:0>.]
There is no explicit longword load or store instruction; the S_floating load/store instructions are
used to move longword data into or out of the floating registers. The register bits <61:59> are set
by the S_floating load exponent mapping. They are ignored by S_floating store. They are also
ignored in operands of a longword integer operate instruction, and they are set to 000 in the
result of a longword operate instruction.
The register format bit <62>, "I", in Figure 2-16 is part of the Integer Hi field in Figure 2-15 and
represents the high-order bit of that field. Bits <58:45> of Figure 2-16 are the remaining bits of
the Integer Hi field of Figure 2-15.
Note

Alpha implementations will impose a significant performance penalty
when accessing longwords that are not naturally aligned. (A naturally
aligned longword datum has zero as the low-order two bits of its
address.)
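The relationship between a 32-bit longword and its floating-register format can be sketched as follows. Both function names are hypothetical. The packing function produces the form described for the result of a longword operate instruction (bits <61:59> set to 000); a value loaded through the S_floating load would instead have those bits set by the exponent mapping.

```python
def longword_to_fp_register(lw: int) -> int:
    """Place longword bits <31:30> in register <63:62> and <29:0> in <58:29>."""
    lw &= 0xFFFFFFFF
    return ((lw >> 30) << 62) | ((lw & 0x3FFFFFFF) << 29)


def longword_from_fp_register(reg: int) -> int:
    """Recover the longword, ignoring register bits <61:59> and <28:0>."""
    return (((reg >> 62) & 0x3) << 30) | ((reg >> 29) & 0x3FFFFFFF)
```

Because the unpacking step ignores bits <61:59>, it returns the same longword whichever of the two sources wrote the register.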

Quadword Integer Format in Floating-Point Unit
A quadword integer operand occupies 64 bits in memory, arranged as shown in Figure 2-17.
Figure 2-17 • Quadword Integer Datum
[Layout: :A holds Integer Lo; :A+2 holds Integer Midl; :A+4 holds Integer Midh; :A+6 holds Integer Hi.]
A quadword integer operand occupies 64 bits in a floating register, arranged as shown in
Figure 2-18.
Figure 2-18 • Quadword Integer Floating-Register Format
[Layout of register Fx: S <63>, Integer Hi <62:48>, Integer Midh <47:32>, Integer Midl <31:16>, Integer Lo <15:0>.]

There is no explicit quadword load or store instruction; the T_floating load/store instructions are
used to move quadword data into or out of the floating registers.
The T_floating load instruction performs no bit reordering on input. The T_floating store
instruction performs no bit reordering on output. This instruction does no checking of the data;
when used to store quadwords, the preceding operation should have specified a quadword result.
Note
Alpha implementations will impose a significant performance penalty
when accessing quadwords that are not naturally aligned. (A naturally
aligned quadword datum has zero as the low-order three bits of its
address.)

Data Types with No Hardware Support
The following VAX data types are not directly supported in Alpha hardware.
• Octaword
• H_floating
• D_floating (except load/store and convert to/from G_floating)
• Variable-Length Bit Field
• Character String
• Trailing Numeric String
• Leading Separate Numeric String
• Packed Decimal String

Chapter 3 · Instruction Formats

• Alpha Registers
Each Alpha processor has a set of registers that hold the current processor state. If an Alpha
system contains multiple Alpha processors, there are multiple per-processor sets of these registers.

Program Counter
The Program Counter (PC) is a special register that addresses the instruction stream. As each
instruction is decoded, the PC is advanced to the next sequential instruction. This is referred to as
the updated PC. Any instruction that uses the value of the PC will use the updated PC. The PC
includes only bits <63:2> with bits <1:0> treated as RAZ/IGN. This quantity is a
longword-aligned byte address. The PC is an implied operand on conditional branch and subroutine jump instructions. The PC is not accessible as an integer register.

Integer Registers
There are 32 integer registers (R0 through R31), each 64 bits wide.
Register R31 is assigned special meaning by the Alpha architecture:
• When R31 is specified as a register source operand, a zero-valued operand is supplied.
• For all cases except the Unconditional Branch and Jump instructions, results of an instruction
that specifies R31 as a destination operand are discarded. Also, it is UNPREDICTABLE whether
the other destination operands (implicit and explicit) are changed by the instruction. It is
implementation dependent to what extent the instruction is actually executed once it has been
fetched. It is also UNPREDICTABLE whether exceptions are signaled during the execution of
such an instruction. Note, however, that exceptions associated with the instruction fetch of such
an instruction are always signaled.
There are some interesting cases involving R31 as a destination:
- STx_C R31,disp(Rb)
Although this might seem like a good way to zero out a shared location and reset the lock_flag,
this instruction causes the lock_flag and virtual location {Rbv + SEXT(disp)} to become
UNPREDICTABLE.
- LDx_L R31,disp(Rb)
This instruction produces no useful result since it causes both lock_flag and
locked_physical_address to become UNPREDICTABLE.
Unconditional Branch (BR and BSR) and Jump (JMP, JSR, RET, and JSR_COROUTINE) instructions, when R31 is specified as the Ra operand, execute normally and update the PC with the
target virtual address. Of course, no PC value can be saved in R31.

Floating-Point Registers
There are 32 floating-point registers (F0 through F31), each 64 bits wide.
When F31 is specified as a register source operand, a true zero-valued operand is supplied. See
Definitions in Chapter 4 for a definition of true zero.
Results of an instruction that specifies F31 as a destination operand are discarded and it is
UNPREDICTABLE whether the other destination operands (implicit and explicit) are changed by
the instruction. In this case, it is implementation-dependent to what extent the instruction is
actually executed once it has been fetched. It is also UNPREDICTABLE whether exceptions are
signaled during the execution of such an instruction. Note, however, that exceptions associated
with the instruction fetch of such an instruction are always signaled.

A floating-point instruction that operates on single-precision data reads all bits <63:0> of the
source floating-point register. A floating-point instruction that produces a single-precision result
writes all bits <63:0> of the destination floating-point register.

Lock Registers
There are two per-processor registers associated with the LDx_L and STx_C instructions, the
lock_flag and the locked_physical_address register. The use of these registers is described in
Memory Integer Load/Store Instructions in Chapter 4.

Optional Registers
Some Alpha implementations may include optional memory prefetch or VAX compatibility
processor registers.

Memory Prefetch Registers
If the prefetch instructions FETCH and FETCH_M are implemented, an implementation will
include two sets of state prefetch registers used by those instructions. The use of these registers is
described in Miscellaneous Instructions in Chapter 4. These registers are not directly accessible by
software and are listed for completeness.

VAX Compatibility Register

The VAX compatibility instructions RC and RS include the intr_flag register, as described in VAX
Compatibility Instructions in Chapter 4.

• Notation
The notation used to describe the operation of each instruction is given as a sequence of control
and assignment statements in an ALGOL-like syntax.

Operand Notation
Tables 3-1, 3-2, and 3-3 list the notation for the operands, the operand values, and the
other expression operands.
Table 3-1 • Operand Notation
Notation    Meaning
Ra          An integer register operand in the Ra field of the instruction.
Rb          An integer register operand in the Rb field of the instruction.
#b          An integer literal operand in the Rb field of the instruction.
Rc          An integer register operand in the Rc field of the instruction.
Fa          A floating-point register operand in the Ra field of the instruction.
Fb          A floating-point register operand in the Rb field of the instruction.
Fc          A floating-point register operand in the Rc field of the instruction.

Table 3-2 • Operand Value Notation
Notation    Meaning
Rav         The value of the Ra operand. This is the contents of register Ra.
Rbv         The value of the Rb operand. This could be the contents of register
            Rb, or a zero-extended 8-bit literal in the case of an Operate format
            instruction.
Fav         The value of the floating point Fa operand. This is the contents of
            register Fa.
Fbv         The value of the floating point Fb operand. This is the contents of
            register Fb.

Table 3-3 • Expression Operand Notation
Notation        Meaning
IPR_x           Contents of Internal Processor Register x
IPR_SP[mode]    Contents of the per-mode stack pointer selected by mode
PC              Updated PC value
Rn              Contents of integer register n
Fn              Contents of floating-point register n
X[m]            Element m of array X

Instruction Operand Notation
The notation used to describe instruction operands follows from the operand specifier notation
used in the VAX Architecture Standard. Instruction operands are described as follows:

<name>.<access type><data type>
<name>
Specifies the instruction field (Ra, Rb, Rc, or disp) and register type of the operand (integer or
floating). It can be one of the following:
Name    Meaning
disp    The displacement field of the instruction.
fnc     The PAL function field of the instruction.
Ra      An integer register operand in the Ra field of the instruction.
Rb      An integer register operand in the Rb field of the instruction.
#b      An integer literal operand in the Rb field of the instruction.
Rc      An integer register operand in the Rc field of the instruction.
Fa      A floating-point register operand in the Ra field of the instruction.
Fb      A floating-point register operand in the Rb field of the instruction.
Fc      A floating-point register operand in the Rc field of the instruction.

<access type>
Is a letter denoting the operand access type:
Access Type    Meaning
a              The operand is used in an address calculation to form an effective
               address. The data type code that follows indicates the units of
               addressability (or scale factor) applied to this operand when the
               instruction is decoded.
               For example:
               ".al" means scale by 4 (longwords) to get byte units (used in branch
               displacements); ".ab" means the operand is already in byte units (used
               in load/store instructions).
i              The operand is an immediate literal in the instruction.
r              The operand is read only.
m              The operand is both read and written.
w              The operand is write only.

<data type>
Is a letter denoting the data type of the operand:
Data Type    Meaning
b            Byte
f            F_floating
g            G_floating
l            Longword
q            Quadword
s            IEEE single floating (S_floating)
t            IEEE double floating (T_floating)
w            Word
x            The data type is specified by the instruction

Operators
The operators shown in Table 3-4 are used:
Table 3-4 • Operators
Operator                  Meaning
!                         Comment delimiter
+                         Addition
-                         Subtraction
*                         Signed multiplication
*U                        Unsigned multiplication
**                        Exponentiation (left argument raised to right argument)
/                         Division
←                         Replacement
||                        Bit concatenation
{}                        Indicates explicit operator precedence
(x)                       Contents of memory location whose address is x
x<m:n>                    Contents of bit field of x defined by bits n through m
x<m>                      M'th bit of x

ACCESS(x,y)               Accessibility of the location whose address is x using the
                          access mode y. Returns a Boolean value TRUE if the address
                          is accessible, else FALSE.
AND                       Logical product
ARITH_RIGHT_SHIFT(x,y)    Arithmetic right shift of first operand by the second operand.
                          Y is an unsigned shift value. Bit 63, the sign bit, is copied
                          into vacated bit positions and shifted out bits are discarded.
BYTE_ZAP(x,y)             X is a quadword, y is an 8-bit vector in which each bit
                          corresponds to a byte of the result. The y bit to x byte
                          correspondence is y<n> ↔ x<8n+7:8n>. This correspondence
                          also exists between y and the result.
                          For each bit of y from n = 0 to 7, if y<n> is 0 then byte
                          <n> of x is copied to byte <n> of result, and if y<n> is 1
                          then byte <n> of result is forced to all zeros.
CASE                      The CASE construct selects one of several actions based on
                          the value of its argument. The form of a case is:
                          CASE argument OF
                            argvalue1: action_1
                            argvalue2: action_2
                            ...
                            argvaluen: action_n
                            [otherwise: default_action]
                          ENDCASE
                          If the value of argument is argvalue1 then action_1 is
                          executed; if argument = argvalue2, then action_2 is executed,
                          and so forth.
                          Once a single action is executed, the code stream breaks to
                          the ENDCASE (there is an implicit break as in Pascal). Each
                          action may nonetheless be a sequence of pseudocode
                          operations, one operation per line.
                          Optionally, the last argvalue may be the atom 'otherwise'. The
                          associated default action will be taken if none of the other
                          argvalues match the argument.
DIV                       Integer division (truncates)
LEFT_SHIFT(x,y)           Logical left shift of first operand by the second operand.
                          Y is an unsigned shift value. Zeros are moved into the
                          vacated bit positions, and shifted out bits are discarded.

NOT                       Logical (ones) complement
OR                        Logical sum
x MOD y                   x modulo y
Relational Operators      Operator  Meaning
                          LT        Less than signed
                          LTU       Less than unsigned
                          LE        Less or equal signed
                          LEU       Less or equal unsigned
                          EQ        Equal signed and unsigned
                          NE        Not equal signed and unsigned
                          GE        Greater or equal signed
                          GEU       Greater or equal unsigned
                          GT        Greater signed
                          GTU       Greater unsigned
                          LBC       Low bit clear
                          LBS       Low bit set
MINU(x,y)                 Returns the smaller of x and y, with x and y interpreted as
                          unsigned integers
PHYSICAL_ADDRESS          Translation of a virtual address
PRIORITY_ENCODE           Returns the bit position of most significant set bit,
                          interpreting its argument as a positive integer
                          (= int( lg( x ) ) ).
                          For example:
                          priority_encode( 255 ) = 7
RIGHT_SHIFT(x,y)          Logical right shift of first operand by the second operand. Y
                          is an unsigned shift value. Zeros are moved into vacated bit
                          positions, and shifted out bits are discarded.
SEXT(x)                   X is sign-extended to the required size.
TEST(x,cond)              The contents of register x are tested for branch condition
                          (cond) true. TEST returns a Boolean value TRUE if x bears
                          the specified relation to 0, else FALSE is returned. Integer
                          and floating test conditions are drawn from the preceding list
                          of relational operators.
XOR                       Logical difference
ZEXT(x)                   X is zero-extended to the required size.
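A few of the operators in Table 3-4 can be modeled in Python to make their definitions concrete. This is an illustrative sketch only: the Python function names mirror the table entries in lower case, quadwords are represented as unsigned 64-bit integers, and sext takes an explicit source width because Python integers have no fixed size.

```python
MASK64 = (1 << 64) - 1


def byte_zap(x: int, y: int) -> int:
    """Copy byte n of x to the result where y<n> is 0; force it to zero where y<n> is 1."""
    result = 0
    for n in range(8):
        if not (y >> n) & 1:
            result |= x & (0xFF << (8 * n))
    return result


def priority_encode(x: int) -> int:
    """Bit position of the most significant set bit of a positive integer."""
    if x <= 0:
        raise ValueError("argument must be a positive integer")
    return x.bit_length() - 1


def sext(x: int, bits: int) -> int:
    """Interpret the low `bits` bits of x as a two's-complement value."""
    sign = 1 << (bits - 1)
    return (x & (sign - 1)) - (x & sign)


def arith_right_shift(x: int, y: int) -> int:
    """Shift a 64-bit value right by y, copying bit 63 into the vacated positions."""
    return (sext(x, 64) >> y) & MASK64
```

For instance, priority_encode(255) evaluates to 7, matching the example given for PRIORITY_ENCODE in the table.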

Notation Conventions
The following conventions are used:
1. Only operands that appear on the left side of a replacement operator are modified.
2. No operator precedence is assumed other than that replacement (←) has the lowest precedence. Explicit precedence is indicated by the use of "{}".
3. All arithmetic, logical, and relational operators are defined in the context of their operands.
For example, "+" applied to G_floating operands means a G_floating add, whereas "+"
applied to quadword operands is an integer add. Similarly, "LT" is a G_floating comparison
when applied to G_floating operands and an integer comparison when applied to quadword
operands.

• Instruction Formats
There are five basic Alpha instruction formats:
• Memory
• Branch
• Operate
• Floating-point Operate
• PALcode
All instruction formats are 32 bits long with a 6-bit major opcode field in bits <31:26> of the
instruction.
Any unused register field (Ra, Rb, Fa, Fb) of an instruction must be set to a value of 31.
Software Note
There are several instructions, each formatted as a memory instruction,
that do not use the Ra and/or Rb fields. These instructions are: Memory
Barrier, Fetch, Fetch_M, Read Process Cycle Counter, Read and Clear,
Read and Set, and Trap Barrier.

Memory Instruction Format
The Memory format is used to transfer data between registers and memory, to load an effective
address, and for subroutine jumps. It has the format shown in Figure 3-1.
Opcode <31:26> | Ra <25:21> | Rb <20:16> | Memory_disp <15:0>

Figure 3-1 • Memory Instruction Format

A Memory format instruction contains a 6-bit opcode field, two 5-bit register address fields, Ra
and Rb, and a 16-bit signed displacement field.


The displacement field is a byte offset. It is sign-extended and added to the contents of register
Rb to form a virtual address. Overflow is ignored in this calculation.
The virtual address is used as a memory load/store address or a result value, depending on the
specific instruction. The virtual address (va) is computed as follows for all memory format
instructions except the load address high (LDAH):
    va ← {Rbv + SEXT(Memory_disp)}

For LDAH the virtual address (va) is computed as follows:
    va ← {Rbv + SEXT(Memory_disp*65536)}
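
The field layout and the two address calculations can be sketched in Python as follows. This is an illustrative model under stated assumptions, not an implementation: the function names are invented for the example, the opcode value used in the usage note is an arbitrary 6-bit number, and `rbv` is assumed to already hold the Rbv operand value.

```python
MASK64 = (1 << 64) - 1  # overflow is ignored, so results wrap to 64 bits

def sext16(disp):
    """Sign-extend a 16-bit displacement field to a Python integer."""
    return ((disp & 0xFFFF) ^ 0x8000) - 0x8000

def decode_memory(inst):
    """Split a 32-bit Memory format instruction word into its four fields."""
    return {
        "opcode": (inst >> 26) & 0x3F,  # bits <31:26>
        "ra": (inst >> 21) & 0x1F,      # bits <25:21>
        "rb": (inst >> 16) & 0x1F,      # bits <20:16>
        "disp": inst & 0xFFFF,          # bits <15:0>
    }

def memory_va(rbv, disp, ldah=False):
    """va = Rbv + SEXT(disp), or Rbv + SEXT(disp*65536) for LDAH."""
    offset = sext16(disp)
    if ldah:
        offset *= 65536
    return (rbv + offset) & MASK64
```

With an instruction word built as `(0x29 << 26) | (1 << 21) | (2 << 16) | 0xFFF8`, `decode_memory` returns a displacement of 0xFFF8, which `memory_va` treats as -8 bytes relative to Rbv.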

Memory Format Instructions with a Function Code
Memory format instructions with a function code replace the memory displacement field in the
memory instruction format with a function code that designates a set of miscellaneous instructions. The format is shown in Figure 3-2.

Opcode <31:26> | Ra <25:21> | Rb <20:16> | Function <15:0>

Figure 3-2 • Memory Instruction with Function Code Format
The memory instruction with function code format contains a 6-bit opcode field and a 16-bit
function field. Unused function encodings produce UNPREDICTABLE but not UNDEFINED
results; they are not security holes.
There are two fields, Ra and Rb. The usage of those fields depends on the instruction. See
Miscellaneous Instructions in Chapter 4.

Memory Format Jump Instructions

For computed branch instructions (CALL, RET, JMP, JSR_COROUTINE) the displacement field is
used to provide branch-prediction hints as described in Control Instructions in Chapter 4.

Branch Instruction Format
The Branch format is used for conditional branch instructions and for PC-relative subroutine
jumps. It has the format shown in Figure 3-3.

Opcode <31:26> | Ra <25:21> | Branch_disp <20:0>

Figure 3-3 • Branch Instruction Format

A Branch format instruction contains a 6-bit opcode field, one 5-bit register address field (Ra),
and a 21-bit signed displacement field.


The displacement is treated as a longword offset. This means it is shifted left two bits (to address
a longword boundary), sign-extended to 64 bits and added to the updated PC to form the target
virtual address. Overflow is ignored in this calculation. The target virtual address (va) is computed as follows:
    va ← PC + {4*SEXT(Branch_disp)}
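
A small Python sketch of this calculation is given below. It assumes that the updated PC is the address of the instruction following the branch (the branch address plus 4, since every instruction is 32 bits long); the function name and argument names are illustrative.

```python
MASK64 = (1 << 64) - 1  # overflow is ignored, so the result wraps to 64 bits

def branch_target(branch_address, branch_disp):
    """Target va for a Branch format instruction located at branch_address."""
    updated_pc = (branch_address + 4) & MASK64
    # Sign-extend the 21-bit displacement field.
    disp = ((branch_disp & 0x1FFFFF) ^ 0x100000) - 0x100000
    # The displacement counts longwords, so scale it by 4 bytes.
    return (updated_pc + 4 * disp) & MASK64
```

A displacement of 0 therefore targets the next sequential instruction, and a displacement of -1 (all ones in the 21-bit field) targets the branch itself.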

Operate Instruction Format
The Operate format is used for instructions that perform integer register to integer register
operations. The Operate format allows the specification of one destination operand and two
source operands. One of the source operands can be a literal constant. The Operate format in
Figure 3-4 shows the two cases when bit <12> of the instruction is 0 and 1.
Bit <12> = 0:
Opcode <31:26> | Ra <25:21> | Rb <20:16> | SBZ <15:13> | 0 <12> | Function <11:5> | Rc <4:0>

Bit <12> = 1:
Opcode <31:26> | Ra <25:21> | LIT <20:13> | 1 <12> | Function <11:5> | Rc <4:0>

Figure 3-4 • Operate Instruction Format

An Operate format instruction contains a 6-bit opcode field and a 7-bit function field. Unused
function encodings produce UNPREDICTABLE but not UNDEFINED results; they are not security
holes.
There are three operand fields, Ra, Rb, and Rc.
The Ra field specifies a source operand. Symbolically, the integer Rav operand is formed as
follows:
    IF inst<25:21> EQ 31 THEN
        Rav ← 0
    ELSE
        Rav ← Ra
    END

The Rb field specifies a source operand. Integer operands can specify a literal or an integer
register using bit <12> of the instruction.

If bit <12> of the instruction is 0, the Rb field specifies a source register operand.


If bit <12> of the instruction is 1, an 8-bit zero-extended literal constant is formed by bits
<20:13> of the instruction. The literal is interpreted as a positive integer between 0 and 255 and
is zero-extended to 64 bits. Symbolically, the integer Rbv operand is formed as follows:
    IF inst<12> EQ 1 THEN
        Rbv ← ZEXT(inst<20:13>)
    ELSE
        IF inst<20:16> EQ 31 THEN
            Rbv ← 0
        ELSE
            Rbv ← Rb
        END
    END

The Rc field specifies a destination operand.
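
The operand selection rules above can be sketched in Python as shown here. This is a hedged model for illustration: `regs` is an assumed list of 32 integer register values, the function name is invented, and the opcode and function values in the usage note are arbitrary.

```python
def operate_operands(inst, regs):
    """Return (Rav, Rbv, Rc, function) for a 32-bit Operate format word.

    regs is an assumed list of 32 integer register values; a register
    number of 31 always supplies the value zero as a source operand.
    """
    ra = (inst >> 21) & 0x1F
    rav = 0 if ra == 31 else regs[ra]

    if (inst >> 12) & 1:
        # Literal form: bits <20:13> hold an 8-bit zero-extended constant.
        rbv = (inst >> 13) & 0xFF
    else:
        rb = (inst >> 16) & 0x1F
        rbv = 0 if rb == 31 else regs[rb]

    rc = inst & 0x1F                 # destination register number, bits <4:0>
    function = (inst >> 5) & 0x7F    # 7-bit function field, bits <11:5>
    return rav, rbv, rc, function
```

For example, a word with Ra = 3, a literal of 200, and bit <12> set yields `regs[3]` and 200 as the two source values, regardless of what bits <20:16> would name as a register.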

Floating-Point Operate Instruction Format
The Floating-point Operate format is used for instructions that perform floating-point register to
floating-point register operations. The Floating-point Operate format allows the specification of
one destination operand and two source operands. The Floating-point Operate format is shown
in Figure 3-5.
Opcode <31:26> | Fa <25:21> | Fb <20:16> | Function <15:5> | Fc <4:0>

Figure 3-5 • Floating-Point Operate Instruction Format
A Floating-point Operate format instruction contains a 6-bit opcode field and an 11-bit function
field. Unused function encodings produce UNPREDICTABLE results, as defined in UNPREDICTABLE and UNDEFINED in Chapter 1.
There are three operand fields, Fa, Fb, and Fc. Each operand field specifies either an integer or
floating-point operand as defined by the instruction.
The Fa field specifies a source operand. Symbolically, the Fav operand is formed as follows:
    IF inst<25:21> EQ 31 THEN
        Fav ← 0
    ELSE
        Fav ← Fa
    END

The Fb field specifies a source operand. Symbolically, the Fbv operand is formed as follows:
    IF inst<20:16> EQ 31 THEN
        Fbv ← 0
    ELSE
        Fbv ← Fb
    END

Note
Neither Fa nor Fb can be a literal in Floating-point Operate instructions.

The Fc field specifies a destination operand.

Floating-Point Convert Instructions
Floating-point Convert instructions use a subset of the Floating-point Operate format and
perform register-to-register conversion operations. The Fb operand specifies the source; the Fa
field must be F31.
The floating-point register to be used is specified by the Fa, Fb, and Fc fields all pointing to the
same floating-point register. If the Fa, Fb, and Fc fields do not all point to the same floating-point
register, then it is UNPREDICTABLE which register is used.

PALcode Instruction Format
The Privileged Architecture Library (PALcode) format is used to specify extended processor
functions. It has the format shown in Figure 3-6.
Opcode <31:26> | PALcode Function <25:0>

Figure 3-6 • PALcode Instruction Format
The 26-bit PALcode function field specifies the operation.
The source and destination operands for PALcode instructions are supplied in fixed registers that
are specified in the individual instruction descriptions.

An opcode of zero and a PALcode function of zero specify the HALT instruction.

Chapter 4 · Instruction Descriptions

• Instruction Set Overview
This chapter describes the instructions implemented by the Alpha architecture. The instruction
set is divided into the following sections:
Instruction Type                 Section

Integer load and store           Memory Integer Load/Store Instructions
Integer control                  Control Instructions
Integer arithmetic               Integer Arithmetic Instructions
Logical and shift                Logical and Shift Instructions
Byte manipulation                Byte-Manipulation Instructions
Floating-point load and store    Memory Format Floating-Point Instructions
Floating-point control           Branch Format Floating-Point Instructions
Floating-point operate           Floating-Point Operate Format Instructions
Miscellaneous                    Miscellaneous Instructions

Within each major section, closely related instructions are combined into groups and described
together. The instruction group description is composed of the following:
• The group name
• The format of each instruction in the group, which includes the name, access type, and data type
of each instruction operand
• The operation of the instruction
• Exceptions specific to the instruction
• The instruction mnemonic and name of each instruction in the group
• Qualifiers specific to the instructions in the group
• A description of the instruction operation
• Optional programming examples and optional notes on the instruction


Subsetting Rules
An instruction that is omitted in a subset implementation of the Alpha architecture is not
performed in either hardware or PALcode. System software may provide emulation routines for
subsetted instructions.

Floating-Point Subsets
Floating-point support is optional on an Alpha processor. An implementation that supports
floating-point must implement the 32 floating-point registers, the Floating-point Control Register
(FPCR) and the instructions to access it, floating-point branch instructions, floating-point copy
sign (CPYSx) instructions, floating-point convert instructions, floating-point conditional move
instruction (FCMOV), and the S_floating and T_floating memory operations.
Software Note
A system that will not support floating-point operations is still required
to provide the 32 floating-point registers, the Floating-point Control
Register (FPCR) and the instructions to access it, and the T_floating
memory operations if the system intends to support VMS. This requirement facilitates the implementation of a floating-point emulator and
simplifies context-switching.

In addition, floating-point support requires at least one of the following subset groups:
1. VAX Floating-point Operate and Memory instructions (F_ and G_floating).
2. IEEE Floating-point Operate instructions (S_ and T_floating). Within this group, an implementation can choose to include or omit separately the ability to perform IEEE rounding to
plus infinity and minus infinity.

Note: if one instruction in a group is provided, all other instructions in that group must be
provided. An implementation with full floating-point support includes both groups; a subset
floating-point implementation supports only one of these groups. The individual instruction
descriptions indicate whether an instruction can be subsetted.

Software Emulation Rules
General-purpose layered and application software that executes in User mode may assume that
certain loads (LDL, LDQ, LDF, LDG, LDS, and LDT) and certain stores (STL, STQ, STF, STG, STS,
and STT) of unaligned data are emulated by system software. General-purpose layered and
application software that executes in User mode may assume that subsetted instructions are
emulated by system software. Frequent use of emulation may be significantly slower than using
alternative code sequences.
Emulation of loads and stores of unaligned data and subsetted instructions need not be provided
in privileged access modes. System software that supports special-purpose dedicated applications
need not provide emulation in User mode if emulation is not needed for correct execution of the
special-purpose applications.


Opcode Qualifiers
Some Operate format and Floating-point Operate format instructions have several variants. For
example, for the VAX formats, Add F_floating (ADDF) is supported with and without floating
underflow enabled, and with either chopped or VAX rounding. For IEEE formats, IEEE unbiased
rounding, chopped, round toward plus infinity, and round toward minus infinity can be selected.
The different variants of such instructions are denoted by opcode qualifiers, which consist of a
slash (/) followed by a string of selected qualifiers. Each qualifier is denoted by a single character
as shown in Table 4-1. The opcodes for each qualifier are listed in Appendix C.

Table 4-1 · Opcode Qualifiers

Qualifier   Meaning

C           Chopped rounding
D           Rounding mode dynamic
M           Round toward minus infinity
I           Inexact result enable
S           Software completion enable
U           Floating underflow enable
V           Integer overflow enable

The default values are normal rounding, software completion disabled, inexact result disabled,
floating underflow disabled, and integer overflow disabled.


• Memory Integer Load/Store Instructions
The instructions in this section move data between the integer registers and memory.
They use the Memory instruction format. The instructions are summarized in Table 4-2.

Table 4-2 · Memory Integer Load/Store Instructions

Mnemonic    Operation

LDA         Load Address
LDAH        Load Address High
LDL         Load Sign-Extended Longword
LDL_L       Load Sign-Extended Longword Locked
LDQ         Load Quadword
LDQ_L       Load Quadword Locked
LDQ_U       Load Quadword Unaligned
STL         Store Longword
STL_C       Store Longword Conditional
STQ         Store Quadword
STQ_C       Store Quadword Conditional
STQ_U       Store Quadword Unaligned

Load Address
Format:
    LDAx    Ra.wq,disp.ab(Rb.ab)        !Memory format

Operation:
    Ra ← Rbv + SEXT(disp)               !LDA
    Ra ← Rbv + SEXT(disp*65536)         !LDAH

Exceptions:
    None

Instruction mnemonics:
    LDA     Load Address
    LDAH    Load Address High

Qualifiers:
    None
Description:

The virtual address is computed by adding register Rb to the sign-extended 16-bit displacement
for LDA, and 65536 times the sign-extended 16-bit displacement for LDAH. The 64-bit result is
written to register Ra.
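
One common use of this pair is to build a constant from two 16-bit fields. The Python sketch below models that idea under stated assumptions: the helper names are invented, the base register is taken to be zero, and the split routine is an illustration rather than a prescribed code sequence. Because both displacements are sign-extended, the high field must be adjusted whenever the low field has bit 15 set, and some values near the top of the positive 32-bit range cannot be reached with a single pair.

```python
MASK64 = (1 << 64) - 1

def sext16(disp):
    """Sign-extend a 16-bit displacement field to a Python integer."""
    return ((disp & 0xFFFF) ^ 0x8000) - 0x8000

def lda(rbv, disp):
    return (rbv + sext16(disp)) & MASK64

def ldah(rbv, disp):
    return (rbv + sext16(disp) * 65536) & MASK64

def split_constant(value):
    """Return (hi, lo) displacement fields so that LDAH then LDA yields value.

    Raises ValueError when one LDAH/LDA pair starting from zero cannot
    produce the value.
    """
    lo = value & 0xFFFF
    # Compensate for LDA sign-extending the low field.
    hi = ((value - sext16(lo)) >> 16) & 0xFFFF
    if lda(ldah(0, hi), lo) != value & MASK64:
        raise ValueError("constant not reachable with one LDAH/LDA pair")
    return hi, lo
```

For 0x12348765 the low field 0x8765 is negative after sign extension, so the high field becomes 0x1235 rather than 0x1234.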


Load Memory Data into Integer Register
Format:
    LDx     Ra.wq,disp.ab(Rb.ab)        !Memory format

Operation:
    va ← {Rbv + SEXT(disp)}
    Ra ← SEXT((va)<31:0>)               !LDL
    Ra ← (va)<63:0>                     !LDQ

Exceptions:
    Access Violation
    Alignment
    Fault on Read
    Translation Not Valid

Instruction mnemonics:
    LDL     Load Sign-Extended Longword from Memory to Register
    LDQ     Load Quadword from Memory to Register

Qualifiers:
    None
Description:
The virtual address is computed by adding register Rb to the sign-extended 16-bit displacement.
The source operand is fetched from memory, sign-extended, and written to register Ra. If the
data is not naturally aligned, an alignment exception is generated.


Load Unaligned Memory Data into Integer Register

Format:
    LDQ_U   Ra.wq,disp.ab(Rb.ab)        !Memory format

Operation:
    va ← {{Rbv + SEXT(disp)} AND NOT 7}
    Ra ← (va)<63:0>

Exceptions:
    Access Violation
    Fault on Read
    Translation Not Valid

Instruction mnemonics:
    LDQ_U   Load Unaligned Quadword from Memory to Register

Qualifiers:
    None
Description:

The virtual address is computed by adding register Rb to the sign-extended 16-bit displacement,
then the low-order three bits are cleared. The source operand is fetched from memory and
written to register Ra.
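
The difference between the aligned and unaligned address rules can be sketched in Python. The function names are illustrative assumptions; the sketch models only the address arithmetic and the natural-alignment test, not the memory access itself.

```python
MASK64 = (1 << 64) - 1

def sext16(disp):
    """Sign-extend a 16-bit displacement field to a Python integer."""
    return ((disp & 0xFFFF) ^ 0x8000) - 0x8000

def ldq_u_address(rbv, disp):
    """Address accessed by LDQ_U: the low three bits are cleared."""
    return (rbv + sext16(disp)) & MASK64 & ~7

def is_naturally_aligned(va, size_in_bytes):
    """True if va is a multiple of the datum size (4 for LDL, 8 for LDQ)."""
    return va % size_in_bytes == 0
```

A quadword that starts at byte address 0x1003 lies in the two aligned quadwords at `ldq_u_address(0x1003, 0)` and `ldq_u_address(0x1003, 7)`, which is why the unaligned form never raises an alignment exception.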


Load Memory Data into Integer Register Locked
Format:
    LDx_L   Ra.wq,disp.ab(Rb.ab)        !Memory format

Operation:
    va ← {Rbv + SEXT(disp)}
    lock_flag ← 1
    locked_physical_address ← PHYSICAL_ADDRESS(va)
    Ra ← SEXT((va)<31:0>)               !LDL_L
    Ra ← (va)<63:0>                     !LDQ_L

Exceptions:
    Access Violation
    Alignment
    Fault on Read
    Translation Not Valid

Instruction mnemonics:
    LDL_L   Load Sign-Extended Longword from Memory to Register Locked
    LDQ_L   Load Quadword from Memory to Register Locked

Qualifiers:
    None
Description:

The virtual address is computed by adding register Rb to the sign-extended 16-bit displacement.
The source operand is fetched from memory, sign-extended for LDL_L, and written to register
Ra.
When a LDx_L instruction is executed without faulting, the processor records the target physical
address in a per-processor locked_physical_address register and sets the per-processor lock_flag.

If the per-processor lock_flag is (still) set when a STx_C instruction is executed, the store occurs;
otherwise, it does not occur, as described for the STx_C instructions.

If processor A's lock_flag is set and processor B successfully does a store within A's locked range
of physical addresses, then A's lock_flag is cleared. A processor's locked range is the aligned
block of 2**N bytes that includes the locked_physical_address. The 2**N value is implementation dependent. It is at least 8 (minimum lock range is an aligned quadword) and is at most the
page size for that implementation (maximum lock range is one physical page).

A processor's lock_flag is also cleared if that processor encounters any exception, interrupt, or
CALL_PAL instruction. It is UNPREDICTABLE whether a processor's lock_flag is cleared by that
processor's executing a normal load or store instruction. It is UNPREDICTABLE whether a
processor's lock_flag is cleared by that processor's executing a taken branch (including BR, BSR,
and Jumps); conditional branches that fall through do not clear the lock_flag.
The sequence LDx_L, modify, STx_C, BEQ xxx executed on a given processor does an atomic
read-modify-write of a datum in shared memory if the branch falls through; if the branch is taken,
the store did not modify memory and the sequence may be repeated until it succeeds.
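
The interaction of lock_flag, the locked range, and the retry loop can be illustrated with a toy Python model. This is a simplified sketch under several assumptions, not a description of any implementation: the class and method names are invented, the lock range is fixed at the minimum of 8 bytes, physical and virtual addresses are treated as identical, and exceptions, interrupts, and ordinary loads and stores are not modelled.

```python
LOCK_RANGE = 8  # assumed lock range in bytes (the architectural minimum)

class Processor:
    def __init__(self):
        self.lock_flag = 0
        self.locked_physical_address = None

class Machine:
    """Toy shared-memory system with per-processor lock state."""

    def __init__(self, n_cpus):
        self.memory = {}
        self.cpus = [Processor() for _ in range(n_cpus)]

    def load_locked(self, cpu, addr):
        p = self.cpus[cpu]
        p.lock_flag = 1
        p.locked_physical_address = addr
        return self.memory.get(addr, 0)

    def store_conditional(self, cpu, addr, value):
        """Return 1 if the store happened, 0 otherwise (the value STx_C puts in Ra)."""
        p = self.cpus[cpu]
        success = p.lock_flag
        if success:
            self.memory[addr] = value
            # A successful store clears other processors' locks on the same block.
            block = addr // LOCK_RANGE
            for i, other in enumerate(self.cpus):
                if (i != cpu and other.lock_flag
                        and other.locked_physical_address // LOCK_RANGE == block):
                    other.lock_flag = 0
        p.lock_flag = 0
        return success

    def atomic_add(self, cpu, addr, delta):
        """LDx_L, modify, STx_C, retry-on-failure loop."""
        while True:
            old = self.load_locked(cpu, addr)
            if self.store_conditional(cpu, addr, old + delta):
                return old + delta
```

If two processors both load-lock the same location, whichever store-conditional runs first succeeds and clears the other processor's lock_flag, so the second store fails and that processor must retry.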
Notes:
• LDx_L instructions do not check for write access; hence a matching STx_C may take an
access-violation or fault-on-write exception.

• Executing a LDx_L instruction on one processor does not affect any architecturally visible state
on another processor, and in particular cannot cause a STx_C on another processor to fail.

• LDx_L and STx_C instructions need not be paired. In particular, an LDx_L may be followed by a
conditional branch: on the fall-through path an STx_C is done, whereas on the taken path no
matching STx_C is done.

• If two LDx_L instructions execute with no intervening STx_C, the second one overwrites the
state of the first one. If two STx_C instructions execute with no intervening LDx_L, the second
one always fails because the first clears lock_flag.
• Software will not emulate unaligned LDx_L instructions.

• If any other memory access (LDx, LDQ_U, STx, STQ_U) is done on the given processor between
the LDx_L and the STx_C, the sequence above may always fail on some implementations; hence,
no useful program should do this.

• If a branch is taken between the LDx_L and the STx_C, the sequence above may always fail on
some implementations; hence, no useful program should do this. (CMOVxx may be used to avoid
branching.)

• If a subsetted instruction (for example, floating-point) is done between the LDx_L and the
STx_C, the sequence above may always fail on some implementations, because of the Illegal
Instruction Trap; hence, no useful program should do this.

• If a large number of instructions are executed between the LDx_L and the STx_C, the sequence
above may always fail on some implementations, because of a timer interrupt always clearing the
lock_flag before the sequence completes; hence, no useful program should do this.


• Hardware implementations are encouraged to lock no more than 128 bytes. Software implementations are encouraged to separate locked locations by at least 128 bytes from other locations that
could potentially be written by another processor while the first location is locked.
Implementation Notes
Implementations that impede the mobility of a cache block on LDx_L,
such as that which may occur in a Read for Ownership cache coherency
protocol, may release the cache block and make the subsequent STx_C
fail if a branch-taken or memory instruction is executed on that
processor.

All implementations should guarantee that at least 40 non-subsetted
operate instructions can be executed between timer interrupts.


Store Integer Register Data into Memory Conditional

Format:
    STx_C   Ra.mq,disp.ab(Rb.ab)        !Memory format

Operation:
    va ← {Rbv + SEXT(disp)}
    IF lock_flag EQ 1 THEN
        (va)<31:0> ← Rav<31:0>          !STL_C
        (va) ← Rav                      !STQ_C
    Ra ← lock_flag
    lock_flag ← 0

Exceptions:
    Access Violation
    Fault on Write
    Alignment
    Translation Not Valid

Instruction mnemonics:
    STL_C   Store Longword from Register to Memory Conditional
    STQ_C   Store Quadword from Register to Memory Conditional

Qualifiers:
    None
Description:

The virtual address is computed by adding register Rb to the sign-extended 16-bit displacement.

If the lock_flag is set, the Ra operand is written to memory at this address. (See the LDx_L
description for conditions that clear the lock_flag.) The lock_flag is returned in Ra and then set
to zero.
Notes:

• Software will not emulate unaligned STx_C instructions.
• Each implementation must do the test and store atomically, so that if two processors execute
store conditionals within the same lock range, exactly one of the stores succeeds.

• The following sequence should not be used:
    try_again:  LDQ_L   R1,x
                <modify R1>
                STQ_C   R1,x
                BEQ     R1,try_again

That sequence penalizes performance when the STQ_C succeeds, because the sequence contains a
backward branch, which is predicted to be taken in the Alpha architecture. In the case where the
STQ_C succeeds and the branch will actually fall through, that sequence incurs unnecessary delay
due to a mispredicted backward branch. Instead, a forward branch should be used to handle the
failure case as shown in Atomic Update of a Single Datum in Chapter 5.
Software Note
Although this is not recommended, the address specified by a STx_C
instruction need not match that given in a preceding LDx_L. Further,
specifying unmatched addresses for those instructions requires an MB in
between to guarantee ordering.
Implementation Notes
A STx_C must propagate to the point of coherency, where it is guaranteed to prevent any other store from changing the state of the lock bit,
before its outcome can be determined.

If an implementation could encounter a TB or cache miss on the data
reference of the STx_C in the sequence above (as might occur in some
shared I- and D-stream direct-mapped TBs/caches), it must be able to
resolve the miss and complete the store without always failing.


Store Integer Register Data into Memory
Format:
    STx     Ra.rq,disp.ab(Rb.ab)        !Memory format

Operation:
    va ← {Rbv + SEXT(disp)}
    (va)<31:0> ← Rav<31:0>              !STL
    (va) ← Rav                          !STQ

Exceptions:
    Access Violation
    Fault on Write
    Alignment
    Translation Not Valid

Instruction mnemonics:
    STL     Store Longword from Register to Memory
    STQ     Store Quadword from Register to Memory

Qualifiers:
    None
Description:

The virtual address is computed by adding register Rb to the sign-extended 16-bit displacement.
The Ra operand is written to memory at this address. If the data is not naturally aligned, an
alignment exception is generated.


Store Unaligned Integer Register Data into Memory

Format:
    STQ_U   Ra.rq,disp.ab(Rb.ab)        !Memory format

Operation:
    va ← {{Rbv + SEXT(disp)} AND NOT 7}
    (va)<63:0> ← Rav<63:0>

Exceptions:
    Access Violation
    Fault on Write
    Translation Not Valid

Instruction mnemonics:
    STQ_U   Store Unaligned Quadword from Register to Memory

Qualifiers:
    None
Description:

The virtual address is computed by adding register Rb to the sign-extended 16-bit displacement,
then clearing the low order three bits. The Ra operand is written to memory at this address.


• Control Instructions
Alpha provides integer conditional branch, unconditional branch, Branch to Subroutine, and
Jump to Subroutine instructions. The PC used in these instructions is the updated PC, as
described in Program Counter in Chapter 3.
To allow implementations to achieve high performance, the Alpha architecture includes explicit
hints based on a branch-prediction model:
1. For many implementations of computed branches (JSR/RET/JMP), there is a substantial
performance gain in forming a good guess of the expected target I-cache address before
register Rb is accessed.
2. For many implementations, the first-level (or only) I-cache is no bigger than a page (8 KB to
64 KB).

3. Correctly predicting subroutine returns is important for good performance. Some implementations will therefore keep a small stack of predicted subroutine return I-cache addresses.
The Alpha architecture provides three kinds of branch-prediction hints: likely target address,
return-address stack action, and conditional branch-taken.
For computed branches (JSR/RET/JMP), otherwise unused displacement bits are used to specify
the low 16 bits of the most likely target address. The PC-relative calculation using these bits can
be exactly the PC-relative calculation used in unconditional branches. The low 16 bits are enough
to specify an I-cache block within the largest possible Alpha page and hence are expected to be
enough for branch-prediction logic to start an early I-cache access for the most likely target.
For all branches, hint or opcode bits are used to distinguish simple branches, subroutine calls,
subroutine returns, and coroutine links. These distinctions allow branch-predict logic to maintain
an accurate stack of predicted return addresses.
For conditional branches, the sign of the target displacement is used as a taken/fall-through hint.
The instructions are summarized in Table 4-3.

Table 4-3 · Control Instructions Summary

Mnemonic         Operation

BEQ              Branch if Register Equal to Zero
BGE              Branch if Register Greater Than or Equal to Zero
BGT              Branch if Register Greater Than Zero
BLBC             Branch if Register Low Bit Is Clear
BLBS             Branch if Register Low Bit Is Set
BLE              Branch if Register Less Than or Equal to Zero
BLT              Branch if Register Less Than Zero
BNE              Branch if Register Not Equal to Zero
BR               Unconditional Branch
BSR              Branch to Subroutine
JMP              Jump
JSR              Jump to Subroutine
RET              Return from Subroutine
JSR_COROUTINE    Jump to Subroutine Return

Conditional Branch
Format:
    Bxx     Ra.rq,disp.al               !Branch format

Operation:
    {update PC}
    va ← PC + {4*SEXT(disp)}
    IF TEST(Rav, Condition_based_on_Opcode) THEN
        PC ← va

Exceptions:
    None

Instruction mnemonics:
    BEQ     Branch if Register Equal to Zero
    BGE     Branch if Register Greater Than or Equal to Zero
    BGT     Branch if Register Greater Than Zero
    BLBC    Branch if Register Low Bit Is Clear
    BLBS    Branch if Register Low Bit Is Set
    BLE     Branch if Register Less Than or Equal to Zero
    BLT     Branch if Register Less Than Zero
    BNE     Branch if Register Not Equal to Zero

Qualifiers:
    None
Description:

Register Ra is tested. If the specified relationship is true, the PC is loaded with the target virtual
address; otherwise, execution continues with the next sequential instruction.
The displacement is treated as a signed longword offset. This means it is shifted left two bits (to
address a longword boundary), sign-extended to 64 bits, and added to the updated PC to form
the target virtual address.
The conditional branch instructions are PC-relative only. The 21-bit signed displacement gives a
forward/backward branch distance of +/- 1M instructions.
The test is on the signed quadword integer interpretation of the register contents; all 64 bits are
tested.
Notes:

• Forward conditional branches (positive displacement) are predicted to fall through. Backward
conditional branches (negative displacement) are predicted to be taken. Conditional branches do
not affect a predicted return address stack.
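
The displacement range and the static prediction rule can be sketched in Python as follows. The function names are illustrative, and the sketch assumes that the updated PC is the branch address plus 4; it models only the encoding arithmetic described above.

```python
def encode_branch_disp(branch_address, target):
    """Return the 21-bit displacement field for a branch to target.

    Raises ValueError if the target is not longword aligned relative to
    the updated PC or is outside the +/- 1M instruction range.
    """
    delta = target - (branch_address + 4)   # relative to the updated PC
    if delta % 4 != 0:
        raise ValueError("target is not on a longword boundary")
    disp = delta // 4
    if not -(1 << 20) <= disp < (1 << 20):
        raise ValueError("target is out of range for a 21-bit displacement")
    return disp & 0x1FFFFF

def predicted_taken(branch_disp):
    """Static hint: backward (negative displacement) branches are predicted taken."""
    return bool(branch_disp & 0x100000)
```

A loop-closing branch back to an earlier label therefore encodes a negative displacement and is predicted taken, while a forward branch over an error path is predicted to fall through.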


Unconditional Branch
Format:
    BxR     Ra.wq,disp.al               !Branch format

Operation:
    {update PC}
    Ra ← PC
    PC ← PC + {4*SEXT(disp)}

Exceptions:
    None

Instruction mnemonics:
    BR      Unconditional Branch
    BSR     Branch to Subroutine

Qualifiers:
    None
Description:

The PC of the following instruction (the updated PC) is written to register Ra, and then the PC is
loaded with the target address.
The displacement is treated as a signed longword offset. This means it is shifted left two bits (to
address a longword boundary), sign-extended to 64 bits, and added to the updated PC to form
the target virtual address.
The unconditional branch instructions are PC-relative. The 21-bit signed displacement gives a
forward/backward branch distance of +/- 1M instructions.
PC-relative addressability can be established by:
         BR      Rx,L1
    L1:

Notes:

• BR and BSR do identical operations. They only differ in hints to possible branch-prediction logic.
BSR is predicted as a subroutine call (pushes the return address on a branch-prediction stack),
whereas BR is predicted as a branch (no push).


Jumps
Format:
    mnemonic    Ra.wq,(Rb.ab),hint      !Memory format

Operation:
    {update PC}
    va ← Rbv AND {NOT 3}
    Ra ← PC
    PC ← va

Exceptions:
    None

Instruction mnemonics:
    JMP              Jump
    JSR              Jump to Subroutine
    RET              Return from Subroutine
    JSR_COROUTINE    Jump to Subroutine Return

Qualifiers:
    None
Description:

The PC of the instruction following the Jump instruction (the updated PC) is written to register
Ra, and then the PC is loaded with the target virtual address.
The new PC is supplied from register Rb. The low two bits of Rb are ignored. Ra and Rb may
specify the same register; the target calculation using the old value is done before the new value is
assigned.
All Jump instructions do identical operations. They only differ in hints to possible
branch-prediction logic. The displacement field of the instruction is used to pass this information.
The four different "opcodes" set different bit patterns in disp<15:14>, and the hint operand sets
disp<13:0>.


These bits are intended to be used as shown in Table 4-4.

Table 4-4 · Jump Instructions Branch Prediction

disp<15:14>   Meaning          Predicted Target<15:0>   Prediction Stack Action

00            JMP              PC + {4*disp<13:0>}      --
01            JSR              PC + {4*disp<13:0>}      Push PC
10            RET              Prediction stack         Pop
11            JSR_COROUTINE    Prediction stack         Pop, push PC

The design in Table 4-4 allows specification of the low 16 bits of a likely longword target address
(enough bits to start a useful I-cache access early), and also allows distinguishing call from return
(and from the other two less frequent operations).
Note that the above information is used only as a hint; correct setting of these bits can improve
performance but is not needed for correct operation. See Appendix A for more information on
branch prediction.
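
How the displacement field carries these hints can be sketched in Python. This is an illustrative model only: the function names are invented, and `updated_pc` is assumed to be the address of the instruction after the jump. Because only the low 16 bits of the predicted target are formed, it makes no difference to the result whether the 14-bit hint is treated as signed or unsigned.

```python
# disp<15:14> encodings for the four Jump mnemonics.
JUMP_KINDS = {"JMP": 0b00, "JSR": 0b01, "RET": 0b10, "JSR_COROUTINE": 0b11}

def jump_target(rbv):
    """Actual target: Rbv with the low two bits ignored."""
    return rbv & ~3

def hint_for(updated_pc, target):
    """14-bit hint whose scaled value reproduces the low 16 bits of target."""
    return ((target - updated_pc) >> 2) & 0x3FFF

def jump_disp(kind, hint):
    """Assemble the 16-bit displacement field from the kind and the hint."""
    return (JUMP_KINDS[kind] << 14) | (hint & 0x3FFF)

def predicted_target_low16(updated_pc, disp):
    """Low 16 bits of PC + {4*disp<13:0>} (used for JMP and JSR)."""
    return (updated_pc + 4 * (disp & 0x3FFF)) & 0xFFFF
```

A wrong hint only costs performance: `jump_target` never consults the displacement field.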

An unconditional long jump can be performed by:
    JMP     R31,(Rb),hint

Coroutine linkage can be performed by specifying the same register in both the Ra and Rb
operands. When disp<15:14> equals '10' (RET) or '11' (JSR_COROUTINE) (that is, the target
address prediction, if any, would come from a predictor implementation stack), then bits <13:0>
are reserved for software and must be ignored by all implementations. All encodings for bits
<13:0> are used by Digital software or Reserved to Digital, as follows:

Encoding      Meaning

0000 (hex)    Indicates non-procedure return
0001 (hex)    Indicates procedure return
              All other encodings are reserved to Digital.

• Integer Arithmetic Instructions
The integer arithmetic instructions perform add, subtract, multiply, and signed and unsigned
compare operations.
The integer instructions are summarized in Table 4-5.

Table 4-5 · Integer Arithmetic Instructions Summary

Mnemonic    Operation

ADD         Add Quadword/Longword
S4ADD       Scaled Add by 4
S8ADD       Scaled Add by 8
CMPEQ       Compare Signed Quadword Equal
CMPLT       Compare Signed Quadword Less Than
CMPLE       Compare Signed Quadword Less Than or Equal
CMPULT      Compare Unsigned Quadword Less Than
CMPULE      Compare Unsigned Quadword Less Than or Equal
MUL         Multiply Quadword/Longword
UMULH       Multiply Quadword Unsigned High
SUB         Subtract Quadword/Longword
S4SUB       Scaled Subtract by 4
S8SUB       Scaled Subtract by 8

There is no integer divide instruction. Division by a constant can be done via UMULH; division
by a variable can be done via a subroutine. See Appendix A.
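
As one hedged illustration of division by a constant, the Python sketch below divides an unsigned quadword by 3 using a high multiply followed by a right shift. The multiplier and shift count are the standard reciprocal for 3, (2**65 + 1) / 3; they are supplied here as an example and are not taken from this handbook. The function names are also illustrative, and `umulh` models only the "upper 64 bits of the 128-bit unsigned product" behavior implied by the instruction name.

```python
MASK64 = (1 << 64) - 1

def umulh(rav, rbv):
    """High 64 bits of the 128-bit unsigned product of two quadwords."""
    return ((rav & MASK64) * (rbv & MASK64)) >> 64

# Assumed reciprocal constant for division by 3: (2**65 + 1) // 3.
MAGIC_DIV3 = 0xAAAAAAAAAAAAAAAB

def divide_by_3(n):
    """Unsigned quadword n divided by 3, without a divide instruction."""
    return umulh(n, MAGIC_DIV3) >> 1
```

Other divisors need their own multiplier and shift count, and signed division needs an additional correction step.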


Longword Add
Format:
    ADDL    Ra.rq,Rb.rq,Rc.wq           !Operate format
    ADDL    Ra.rq,#b.ib,Rc.wq           !Operate format

Operation:
    Rc ← SEXT((Rav + Rbv)<31:0>)

Exceptions:
    Integer Overflow

Instruction mnemonics:
    ADDL    Add Longword

Qualifiers:
    Integer Overflow Enable (/V)
Description:

Register Ra is added to register Rb or a literal, and the sign-extended 32-bit sum is written to Rc.
The high order 32 bits of Ra and Rb are ignored. Rc is a proper sign extension of the truncated
32-bit sum. Overflow detection is based on the longword sum Rav<31:0> + Rbv<31:0> .
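
A Python sketch of the longword add follows. It is illustrative only: the function names are invented, register values are modelled as unsigned 64-bit integers, and the overflow condition is returned as a flag rather than raised as a trap.

```python
MASK64 = (1 << 64) - 1

def to_signed(x, bits):
    """Interpret the low `bits` bits of x as a two's-complement integer."""
    x &= (1 << bits) - 1
    return x - (1 << bits) if x >> (bits - 1) else x

def addl(rav, rbv):
    """Return (Rc, overflow) for a longword add.

    Only bits <31:0> of each operand take part; Rc is the sign-extended
    32-bit sum, and overflow reports whether the true longword sum does
    not fit in 32 signed bits.
    """
    true_sum = to_signed(rav, 32) + to_signed(rbv, 32)
    overflow = not -(1 << 31) <= true_sum < (1 << 31)
    rc = to_signed(true_sum, 32) & MASK64
    return rc, overflow
```

Adding 1 to 0x7FFFFFFF produces the sign-extended value 0xFFFFFFFF80000000 and reports overflow, while stray high-order bits in either source have no effect on the result.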


Scaled Longword Add
Format:
    SxADDL  Ra.rq,Rb.rq,Rc.wq    !Operate format
    SxADDL  Ra.rq,#b.ib,Rc.wq    !Operate format

Operation:
    CASE
      S4ADDL: Rc ← SEXT(((LEFT_SHIFT(Rav,2)) + Rbv)<31:0>)
      S8ADDL: Rc ← SEXT(((LEFT_SHIFT(Rav,3)) + Rbv)<31:0>)
    ENDCASE

Exceptions:
    None

Instruction mnemonics:
    S4ADDL  Scaled Add Longword by 4
    S8ADDL  Scaled Add Longword by 8

Qualifiers:
    None
Description:

Register Ra is scaled by 4 (for S4ADDL) or 8 (for S8ADDL) and is added to register Rb or a literal,
and the sign-extended 32-bit sum is written to Rc.
The high 32 bits of Ra and Rb are ignored. Rc is a proper sign extension of the truncated 32-bit
sum.


Quadword Add
Format:
    ADDQ    Ra.rq,Rb.rq,Rc.wq    !Operate format
    ADDQ    Ra.rq,#b.ib,Rc.wq    !Operate format

Operation:
    Rc ← Rav + Rbv

Exceptions:
    Integer Overflow

Instruction mnemonics:
    ADDQ    Add Quadword

Qualifiers:
    Integer Overflow Enable (/V)
Description:

Register Ra is added to register Rb or a literal, and the 64-bit sum is written to Rc.
On overflow, the least significant 64 bits of the true result are written to the destination register.
The unsigned compare instructions can be used to generate carry. After adding two values, if the
sum is less unsigned than either one of the inputs, there was a carry out of the most significant
bit.
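
The carry rule can be checked with a small Python model. This is a sketch under the assumption that register values are unsigned integers in the range 0 to 2**64 - 1; `addq_with_carry` is a hypothetical name, and the comparison stands in for a CMPULT of the sum against one input.

```python
MASK64 = (1 << 64) - 1

def addq_with_carry(rav, rbv):
    """Return (sum, carry) where sum is the 64-bit ADDQ result."""
    total = (rav + rbv) & MASK64        # ADDQ keeps the low 64 bits
    carry = 1 if total < rav else 0     # unsigned compare: sum < input
    return total, carry
```

A multi-precision add would feed `carry` into the addition of the next higher quadword.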


Scaled Quadword Add
Format:
    SxADDQ  Ra.rq,Rb.rq,Rc.wq    !Operate format
    SxADDQ  Ra.rq,#b.ib,Rc.wq    !Operate format

Operation:
    CASE
      S4ADDQ: Rc ← LEFT_SHIFT(Rav,2) + Rbv
      S8ADDQ: Rc ← LEFT_SHIFT(Rav,3) + Rbv
    ENDCASE

Exceptions:
    None

Instruction mnemonics:
    S4ADDQ  Scaled Add Quadword by 4
    S8ADDQ  Scaled Add Quadword by 8

Qualifiers:
    None
Description:

Register Ra is scaled by 4 (for S4ADDQ) or 8 (for S8ADDQ) and is added to register Rb or a
literal, and the 64-bit sum is written to Rc.
On overflow, the least significant 64 bits of the true result are written to the destination register.


Integer Signed Compare
Format:
    CMPxx   Ra.rq,Rb.rq,Rc.wq    !Operate format
    CMPxx   Ra.rq,#b.ib,Rc.wq    !Operate format

Operation:
    IF Rav SIGNED_RELATION Rbv THEN
        Rc ← 1
    ELSE
        Rc ← 0

Exceptions:
    None

Instruction mnemonics:
    CMPEQ   Compare Signed Quadword Equal
    CMPLE   Compare Signed Quadword Less Than or Equal
    CMPLT   Compare Signed Quadword Less Than

Qualifiers:
    None
Description:

Register Ra is compared to Register Rb or a literal. If the specified relationship is true, the value
one is written to register Rc; otherwise, zero is written to Rc.
Notes:

• Compare Less Than A,B is the same as Compare Greater Than B,A; Compare Less Than or
Equal A,B is the same as Compare Greater Than or Equal B,A. Therefore, only the less-than
operations are included.


Integer Unsigned Compare
Format:
    CMPUxx  Ra.rq,Rb.rq,Rc.wq    !Operate format
    CMPUxx  Ra.rq,#b.ib,Rc.wq    !Operate format

Operation:
    IF Rav UNSIGNED_RELATION Rbv THEN
        Rc ← 1
    ELSE
        Rc ← 0

Exceptions:
    None

Instruction mnemonics:
    CMPULE  Compare Unsigned Quadword Less Than or Equal
    CMPULT  Compare Unsigned Quadword Less Than

Qualifiers:
    None
Description:

Register Ra is compared to Register Rb or a literal. If the specified relationship is true, the value
one is written to register Rc; otherwise, zero is written to Rc.


Longword Multiply
Format:
    MULL    Ra.rq,Rb.rq,Rc.wq    !Operate format
    MULL    Ra.rq,#b.ib,Rc.wq    !Operate format

Operation:
    Rc ← SEXT((Rav * Rbv)<31:0>)

Exceptions:
    Integer Overflow

Instruction mnemonics:
    MULL    Multiply Longword

Qualifiers:
    Integer Overflow Enable (/V)
Description:
Register Ra is multiplied by register Rb or a literal, and the sign-extended 32-bit product is
written to Rc.
The high 32 bits of Ra and Rb are ignored. Rc is a proper sign extension of the truncated 32-bit
product. Overflow detection is based on the longword product Rav<31:0> * Rbv<31:0>. On
overflow, the proper sign extension of the least significant 32 bits of the true result is written to
the destination register.
The MULQ instruction can be used to return the full 64-bit product.


Quadword Multiply
Format:
    MULQ    Ra.rq,Rb.rq,Rc.wq    !Operate format
    MULQ    Ra.rq,#b.ib,Rc.wq    !Operate format

Operation:
    Rc ← Rav * Rbv               !MUL

Exceptions:
    Integer Overflow

Instruction mnemonics:
    MULQ    Multiply Quadword

Qualifiers:
    Integer Overflow Enable (/V)
Description:

Register Ra is multiplied by register Rb or a literal, and the 64-bit product is written to register
Rc. Overflow detection is based on considering the operands and the result as signed quantities.
On overflow, the least significant 64 bits of the true result are written to the destination register.
The UMULH instruction can be used to generate the upper 64 bits of the 128-bit result when an
overflow occurs.


Unsigned Quadword Multiply High
Format:
    UMULH   Ra.rq,Rb.rq,Rc.wq    !Operate format
    UMULH   Ra.rq,#b.ib,Rc.wq    !Operate format

Operation:
    Rc ← {Rav *U Rbv}<127:64>    !UMULH

Exceptions:
    None

Instruction mnemonics:
    UMULH   Unsigned Multiply Quadword High

Qualifiers:
    None
Description:

Register Ra and Rb or a literal are multiplied as unsigned numbers to produce a 128-bit result.
The high-order 64 bits are written to register Rc.
The UMULH instruction can be used to generate the upper 64 bits of a 128-bit result as follows:
    Ra and Rb are unsigned:  result of UMULH
    Ra and Rb are signed:    (result of UMULH) - Ra<63>*Rb - Rb<63>*Ra
The MULQ instruction gives the low 64 bits of the result in either case.
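
The signed correction can be modeled as follows. This is a minimal Python sketch, assuming registers are unsigned 64-bit images of two's-complement values; `umulh` and `signed_mulh` are hypothetical names, not instructions or routines defined by the architecture.

```python
MASK64 = (1 << 64) - 1

def umulh(rav, rbv):
    """Model UMULH: high 64 bits of the unsigned 128-bit product."""
    return ((rav & MASK64) * (rbv & MASK64)) >> 64

def signed_mulh(rav, rbv):
    """High 64 bits of the signed 128-bit product, derived from UMULH."""
    high = umulh(rav, rbv)
    if rav >> 63:          # Ra<63> set: subtract Rb
        high -= rbv
    if rbv >> 63:          # Rb<63> set: subtract Ra
        high -= rav
    return high & MASK64
```

The subtractions are performed modulo 2**64, which is what a pair of SUBQ instructions without overflow trapping would do.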


Longword Subtract
Format:
    SUBL    Ra.rq,Rb.rq,Rc.wq    !Operate format
    SUBL    Ra.rq,#b.ib,Rc.wq    !Operate format

Operation:
    Rc ← SEXT((Rav - Rbv)<31:0>)

Exceptions:
    Integer Overflow

Instruction mnemonics:
    SUBL    Subtract Longword

Qualifiers:
    Integer Overflow Enable (/V)
Description:

Register Rb or a literal is subtracted from register Ra, and the sign-extended 32-bit difference is
written to Rc.
The high 32 bits of Ra and Rb are ignored. Rc is a proper sign extension of the truncated 32-bit
difference. Overflow detection is based on the longword difference Rav<31:0> - Rbv<31:0>.


Scaled Longword Subtract
Format:
    SxSUBL  Ra.rq,Rb.rq,Rc.wq    !Operate format
    SxSUBL  Ra.rq,#b.ib,Rc.wq    !Operate format

Operation:
    CASE
      S4SUBL: Rc ← SEXT(((LEFT_SHIFT(Rav,2)) - Rbv)<31:0>)
      S8SUBL: Rc ← SEXT(((LEFT_SHIFT(Rav,3)) - Rbv)<31:0>)
    ENDCASE

Exceptions:
    None

Instruction mnemonics:
    S4SUBL  Scaled Subtract Longword by 4
    S8SUBL  Scaled Subtract Longword by 8

Qualifiers:
    None
Description:

Register Rb or a literal is subtracted from the scaled value of register Ra, which is scaled by 4 (for
S4SUBL) or 8 (for S8SUBL), and the sign-extended 32-bit difference is written to Rc.
The high 32 bits of Ra and Rb are ignored. Rc is a proper sign extension of the truncated 32-bit
difference.


Quadword Subtract
Format:
    SUBQ    Ra.rq,Rb.rq,Rc.wq    !Operate format
    SUBQ    Ra.rq,#b.ib,Rc.wq    !Operate format

Operation:
    Rc ← Rav - Rbv

Exceptions:
    Integer Overflow

Instruction mnemonics:
    SUBQ    Subtract Quadword

Qualifiers:
    Integer Overflow Enable (/V)
Description:

Register Rb or a literal is subtracted from register Ra, and the 64-bit difference is written to
register Rc. On overflow, the least significant 64 bits of the true result are written to the
destination register.
The unsigned compare instructions can be used to generate borrow. If the minuend (Rav) is less
unsigned than the subtrahend (Rbv), there will be a borrow.


Scaled Quadword Subtract
Format:
    SxSUBQ  Ra.rq,Rb.rq,Rc.wq    !Operate format
    SxSUBQ  Ra.rq,#b.ib,Rc.wq    !Operate format

Operation:
    CASE
      S4SUBQ: Rc ← LEFT_SHIFT(Rav,2) - Rbv
      S8SUBQ: Rc ← LEFT_SHIFT(Rav,3) - Rbv
    ENDCASE

Exceptions:
    None

Instruction mnemonics:
    S4SUBQ  Scaled Subtract Quadword by 4
    S8SUBQ  Scaled Subtract Quadword by 8

Qualifiers:
    None
Description:

Register Rb or a literal is subtracted from the scaled value of register Ra, which is scaled by 4 (for
S4SUBQ) or 8 (for S8SUBQ), and the 64-bit difference is written to Rc.


• Logical and Shift Instructions
The logical instructions perform quadword Boolean operations. The conditional move integer
instructions perform conditionals without a branch. The shift instructions perform left and right
logical shift and right arithmetic shift. These are summarized in Table 4-6.
Table 4-6 · Logical and Shift Instructions Summary
Mnemonic    Operation
AND         Logical Product
BIC         Logical Product with Complement
BIS         Logical Sum (OR)
EQV         Logical Equivalence (XORNOT)
ORNOT       Logical Sum with Complement
XOR         Logical Difference
CMOVxx      Conditional Move Integer
SLL         Shift Left Logical
SRA         Shift Right Arithmetic
SRL         Shift Right Logical

Software Note
There is no arithmetic left shift instruction. Where an arithmetic left shift
would be used, a logical shift will do. For multiplying by a small power
of two in address computations, logical left shift is acceptable.

Integer multiply should be used to perform an arithmetic left shift with overflow checking.
Bit field extracts can be done with two logical shifts. Sign extension can be done with left logical
shift and a right arithmetic shift.
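
The two shift idioms in this note can be sketched in Python. The model below assumes 64-bit unsigned register images; `sll`, `srl`, and `sra` mimic the shift instructions, while `extract_field` and `sign_extend` are hypothetical helper names used only for illustration.

```python
MASK64 = (1 << 64) - 1

def sll(value, count):
    return (value << (count & 63)) & MASK64

def srl(value, count):
    return (value & MASK64) >> (count & 63)

def sra(value, count):
    value &= MASK64
    count &= 63
    if value >> 63:                      # negative: shift the signed value
        return ((value - (1 << 64)) >> count) & MASK64
    return value >> count

def extract_field(value, low_bit, width):
    """Bit-field extract with two logical shifts (left, then right)."""
    return srl(sll(value, 64 - low_bit - width), 64 - width)

def sign_extend(value, width):
    """Sign-extend the low 'width' bits with SLL followed by SRA."""
    return sra(sll(value, 64 - width), 64 - width)
```

For example, `extract_field(0xABCD, 4, 8)` isolates bits <11:4>, and `sign_extend(0x80, 8)` yields a register image of -128.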


Logical Functions
Format:
    mnemonic  Ra.rq,Rb.rq,Rc.wq    !Operate format
    mnemonic  Ra.rq,#b.ib,Rc.wq    !Operate format

Operation:
    Rc ← Rav AND Rbv              !AND
    Rc ← Rav OR Rbv               !BIS
    Rc ← Rav XOR Rbv              !XOR
    Rc ← Rav AND {NOT Rbv}        !BIC
    Rc ← Rav OR {NOT Rbv}         !ORNOT
    Rc ← Rav XOR {NOT Rbv}        !EQV

Exceptions:
    None

Instruction mnemonics:
    AND     Logical Product
    BIC     Logical Product with Complement
    BIS     Logical Sum (OR)
    EQV     Logical Equivalence (XORNOT)
    ORNOT   Logical Sum with Complement
    XOR     Logical Difference

Qualifiers:
    None
Description:

These instructions perform the designated Boolean function between register Ra and register Rb
or a literal. The result is written to register Rc.
The "NOT" function can be performed by doing an ORNOT with zero (Ra = R31).


Conditional Move Integer
Format:
    CMOVxx  Ra.rq,Rb.rq,Rc.wq    !Operate format
    CMOVxx  Ra.rq,#b.ib,Rc.wq    !Operate format

Operation:
    IF TEST(Rav, Condition_based_on_Opcode) THEN
        Rc ← Rbv

Exceptions:
    None

Instruction mnemonics:
    CMOVEQ   CMOVE if Register Equal to Zero
    CMOVGE   CMOVE if Register Greater Than or Equal to Zero
    CMOVGT   CMOVE if Register Greater Than Zero
    CMOVLBC  CMOVE if Register Low Bit Clear
    CMOVLBS  CMOVE if Register Low Bit Set
    CMOVLE   CMOVE if Register Less Than or Equal to Zero
    CMOVLT   CMOVE if Register Less Than Zero
    CMOVNE   CMOVE if Register Not Equal to Zero

Qualifiers:
    None
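Description:

Register Ra is tested. If the specified relationship is true, the value Rbv is written to register Rc.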


Notes:
Except that it is likely in many implementations to be substantially faster, the instruction:
    CMOVEQ  Ra,Rb,Rc
is exactly equivalent to:
    BNE     Ra,label
    OR      Rb,Rb,Rc
label:

For example, a branchless sequence for:
    R1=MAX(R1,R2)
is:
    CMPLT   R1,R2,R3        ; R3=1 if R1<R2
    CMOVNE  R3,R2,R1        ; Move R2 to R1 if R1<R2
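
The branchless MAX sequence can be traced with a small Python model. This is a sketch, assuming registers are unsigned 64-bit images of signed values; `cmplt`, `cmovne`, and `branchless_max` are hypothetical function names that mirror the two instructions used.

```python
def to_signed(value):
    """Interpret a 64-bit register image as a signed quadword."""
    return value - (1 << 64) if value >> 63 else value

def cmplt(rav, rbv):
    """CMPLT: 1 if Rav < Rbv as signed quadwords, else 0."""
    return 1 if to_signed(rav) < to_signed(rbv) else 0

def cmovne(rav, rbv, rc):
    """CMOVNE: new Rc is Rbv when Rav is non-zero, otherwise Rc is kept."""
    return rbv if rav != 0 else rc

def branchless_max(r1, r2):
    r3 = cmplt(r1, r2)          # R3 = 1 if R1 < R2
    return cmovne(r3, r2, r1)   # R1 = R2 if R3 != 0
```

Because CMOVNE leaves Rc unchanged when the test fails, the destination register acts as a third input, which the model makes explicit with the `rc` parameter.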


Shift Logical
Format:
    SxL     Ra.rq,Rb.rq,Rc.wq    !Operate format
    SxL     Ra.rq,#b.ib,Rc.wq    !Operate format

Operation:
    Rc ← LEFT_SHIFT(Rav, Rbv<5:0>)     !SLL
    Rc ← RIGHT_SHIFT(Rav, Rbv<5:0>)    !SRL

Exceptions:
    None

Instruction mnemonics:
    SLL     Shift Left Logical
    SRL     Shift Right Logical

Qualifiers:
    None
Description:

Register Ra is shifted logically left or right 0 to 63 bits by the count in register Rb or a literal. The
result is written to register Rc. Zero bits are propagated into the vacated bit positions.


Shift Arithmetic
Format:
    SRA     Ra.rq,Rb.rq,Rc.wq    !Operate format
    SRA     Ra.rq,#b.ib,Rc.wq    !Operate format

Operation:
    Rc ← ARITH_RIGHT_SHIFT(Rav, Rbv<5:0>)

Exceptions:
    None

Instruction mnemonics:
    SRA     Shift Right Arithmetic

Qualifiers:
    None
Description:
Register Ra is right shifted arithmetically 0 to 63 bits by the count in register Rb or a literal. The
result is written to register Rc. The sign bit (Rav<63>) is propagated into the vacated bit
positions.


• Byte-Manipulation Instructions
Alpha provides instructions for operating on byte operands within registers. These instructions
allow full-width memory accesses in the load/store instructions combined with powerful
in-register byte manipulation.
The instructions are summarized in Table 4-7.
Table 4-7 • Byte-Manipulation Instructions Summary
Mnemonic    Operation
CMPBGE      Compare Byte
EXTBL       Extract Byte Low
EXTWL       Extract Word Low
EXTLL       Extract Longword Low
EXTQL       Extract Quadword Low
EXTWH       Extract Word High
EXTLH       Extract Longword High
EXTQH       Extract Quadword High
INSBL       Insert Byte Low
INSWL       Insert Word Low
INSLL       Insert Longword Low
INSQL       Insert Quadword Low
INSWH       Insert Word High
INSLH       Insert Longword High
INSQH       Insert Quadword High
MSKBL       Mask Byte Low
MSKWL       Mask Word Low
MSKLL       Mask Longword Low
MSKQL       Mask Quadword Low
MSKWH       Mask Word High
MSKLH       Mask Longword High
MSKQH       Mask Quadword High
ZAP         Zero Bytes
ZAPNOT      Zero Bytes Not


Compare Byte
Format:
    CMPBGE  Ra.rq,Rb.rq,Rc.wq    !Operate format
    CMPBGE  Ra.rq,#b.ib,Rc.wq    !Operate format

Operation:
    FOR i FROM 0 TO 7
        temp<8:0> ← {0 || Rav<i*8+7:i*8>} + {0 || NOT Rbv<i*8+7:i*8>} + 1
        Rc<i> ← temp<8>
    END
    Rc<63:8> ← 0

Exceptions:
    None

Instruction mnemonics:
    CMPBGE  Compare Byte

Qualifiers:
    None
Description:

CMPBGE does eight parallel unsigned byte comparisons between corresponding bytes of Rav and
Rbv, storing the eight results in the low eight bits of Rc. The high 56 bits of Rc are set to zero. Bit
0 of Rc corresponds to byte 0, bit 1 of Rc corresponds to byte 1, and so forth. A result bit is set in
Rc if the corresponding byte of Rav is greater than or equal to the corresponding byte of Rbv (unsigned).
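
The eight parallel comparisons can be modeled in Python as shown below. This is a minimal sketch, assuming 64-bit unsigned register images with byte 0 in the low-order bits; `cmpbge` mirrors the instruction and `first_zero_byte` is a hypothetical helper showing how comparing against zero locates a zero byte.

```python
def cmpbge(rav, rbv):
    """Return the 8-bit mask of byte positions where Rav byte >= Rbv byte."""
    result = 0
    for i in range(8):
        a = (rav >> (8 * i)) & 0xFF
        b = (rbv >> (8 * i)) & 0xFF
        if a >= b:                      # unsigned byte comparison
            result |= 1 << i
    return result

def first_zero_byte(quadword):
    """Index of the lowest zero byte, or None if every byte is non-zero."""
    mask = cmpbge(0, quadword)          # bit i set only when byte i is zero
    for i in range(8):
        if (mask >> i) & 1:
            return i
    return None
```

Comparing zero (R31) against a quadword works because an unsigned byte can only be less than or equal to zero when it is zero.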

Notes:

The result of CMPBGE can be used as an input to ZAP and ZAPNOT.
To scan for a byte of zeros in a character string:
    <initialize R1 to aligned QW address of string>
LOOP:
    LDQ     R2,0(R1)        ; Pick up 8 bytes
    LDA     R1,8(R1)        ; Increment string pointer
    CMPBGE  R31,R2,R3       ; If NO bytes of zero, R3<7:0>=0
    BEQ     R3,LOOP         ; Loop if no terminator byte found
                            ; At this point, R3 can be used to
                            ; determine which byte terminated

To compare two character strings for greater/less:
    <initialize R1 to aligned QW address of string1>
    <initialize R2 to aligned QW address of string2>
LOOP:
    LDQ     R3,0(R1)        ; Pick up 8 bytes of string1
    LDA     R1,8(R1)        ; Increment string1 pointer
    LDQ     R4,0(R2)        ; Pick up 8 bytes of string2
    LDA     R2,8(R2)        ; Increment string2 pointer
    XOR     R3,R4,R5        ; Test for all equal bytes
    BEQ     R5,LOOP         ; Loop if all equal
    CMPBGE  R31,R5,R5       ; At this point, R5 can be used to
                            ; determine the first not-equal
                            ; byte position.

To range-check a string of characters in R1 for '0'..'9':
    LDQ     R2,lit0s        ; Pick up 8 bytes of the character
                            ; BELOW '0' '////////'
    LDQ     R3,lit9s        ; Pick up 8 bytes of the character
                            ; ABOVE '9' '::::::::'
    CMPBGE  R2,R1,R4        ; Some R4=1 if character is LT '0'
    CMPBGE  R1,R3,R5        ; Some R5=1 if character is GT '9'
    BNE     R4,ERROR        ; Branch if some char too low
    BNE     R5,ERROR        ; Branch if some char too high


Extract Byte
Format:
    EXTxx   Ra.rq,Rb.rq,Rc.wq    !Operate format
    EXTxx   Ra.rq,#b.ib,Rc.wq    !Operate format

Operation:
    CASE
      EXTBL: byte_mask ← 0000 0001₂
      EXTWx: byte_mask ← 0000 0011₂
      EXTLx: byte_mask ← 0000 1111₂
      EXTQx: byte_mask ← 1111 1111₂
    ENDCASE
    CASE
      EXTxL:
        byte_loc ← Rbv<2:0>*8
        temp ← RIGHT_SHIFT(Rav, byte_loc<5:0>)
        Rc ← BYTE_ZAP(temp, NOT(byte_mask))
      EXTxH:
        byte_loc ← 64 - Rbv<2:0>*8
        temp ← LEFT_SHIFT(Rav, byte_loc<5:0>)
        Rc ← BYTE_ZAP(temp, NOT(byte_mask))
    ENDCASE

Exceptions:
    None

Instruction mnemonics:
    EXTBL   Extract Byte Low
    EXTWL   Extract Word Low
    EXTLL   Extract Longword Low
    EXTQL   Extract Quadword Low
    EXTWH   Extract Word High
    EXTLH   Extract Longword High
    EXTQH   Extract Quadword High

Qualifiers:
    None


Description:

EXTxL shifts register Ra right by 0 to 7 bytes, inserts zeros into vacated bit positions, and then
extracts 1, 2, 4, or 8 bytes into register Rc. EXTxH shifts register Ra left by 0 to 7 bytes, inserts
zeros into vacated bit positions, and then extracts 2, 4, or 8 bytes into register Rc. The number of
bytes to shift is specified by Rbv<2:0>. The number of bytes to extract is specified in the function
code. Remaining bytes are filled with zeros.
Notes:

The comments in the examples below assume that the effective address (ea) of X(R11) is such that
(ea mod 8) = 5, the value of the aligned quadword containing X(R11) is CBAx xxxx, and the
value of the aligned quadword containing X+7(R11) is yyyH GFED.
The examples below are the most general case unless otherwise noted; if more information is
known about the value or intended alignment of X, shorter sequences can be used.
The intended sequence for loading a quadword from unaligned address X(R11) is:
    LDQ_U   R1,X(R11)       ; Ignores va<2:0>, R1 = CBAx xxxx
    LDQ_U   R2,X+7(R11)     ; Ignores va<2:0>, R2 = yyyH GFED
    LDA     R3,X(R11)       ; R3<2:0> = (X mod 8) = 5
    EXTQL   R1,R3,R1        ; R1 = 0000 0CBA
    EXTQH   R2,R3,R2        ; R2 = HGFE D000
    OR      R2,R1,R1        ; R1 = HGFE DCBA
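
The unaligned quadword load can be simulated in Python to see why the sequence also works for an aligned address. This is a sketch under stated assumptions: memory is modeled as a little-endian `bytes` object, the two `int.from_bytes` calls stand in for the LDQ_U instructions, and `extql`, `extqh`, and `load_unaligned_quad` are hypothetical names.

```python
MASK64 = (1 << 64) - 1

def extql(rav, rbv):
    """EXTQL: shift right by Rbv<2:0> bytes."""
    return (rav & MASK64) >> ((rbv & 7) * 8)

def extqh(rav, rbv):
    """EXTQH: shift left by (64 - Rbv<2:0>*8)<5:0> bits."""
    shift = (64 - (rbv & 7) * 8) & 63
    return (rav << shift) & MASK64

def load_unaligned_quad(memory, addr):
    lo_base = addr & ~7                  # LDQ_U ignores va<2:0>
    hi_base = (addr + 7) & ~7
    r1 = int.from_bytes(memory[lo_base:lo_base + 8], "little")
    r2 = int.from_bytes(memory[hi_base:hi_base + 8], "little")
    return extqh(r2, addr) | extql(r1, addr)
```

When the address is aligned, both loads fetch the same quadword and both shifts are by zero, so the OR simply reproduces that quadword.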

The intended sequence for loading and zero-extending a longword from unaligned address X is:
    LDQ_U   R1,X(R11)       ; Ignores va<2:0>, R1 = CBAx xxxx
    LDQ_U   R2,X+3(R11)     ; Ignores va<2:0>, R2 = yyyy yyyD
    LDA     R3,X(R11)       ; R3<2:0> = (X mod 8) = 5
    EXTLL   R1,R3,R1        ; R1 = 0000 0CBA
    EXTLH   R2,R3,R2        ; R2 = 0000 D000
    OR      R2,R1,R1        ; R1 = 0000 DCBA

The intended sequence for loading and sign-extending a longword from unaligned address X is:
    LDQ_U   R1,X(R11)       ; Ignores va<2:0>, R1 = CBAx xxxx
    LDQ_U   R2,X+3(R11)     ; Ignores va<2:0>, R2 = yyyy yyyD
    LDA     R3,X(R11)       ; R3<2:0> = (X mod 8) = 5
    EXTLL   R1,R3,R1        ; R1 = 0000 0CBA
    EXTLH   R2,R3,R2        ; R2 = 0000 D000
    OR      R2,R1,R1        ; R1 = 0000 DCBA
    SLL     R1,#32,R1       ; R1 = DCBA 0000
    SRA     R1,#32,R1       ; R1 = ssss DCBA

The intended sequence for loading and zero-extending a word from unaligned address X is:
    LDQ_U   R1,X(R11)       ; Ignores va<2:0>, R1 = yBAx xxxx
    LDQ_U   R2,X+1(R11)     ; Ignores va<2:0>, R2 = yBAx xxxx
    LDA     R3,X(R11)       ; R3<2:0> = (X mod 8) = 5
    EXTWL   R1,R3,R1        ; R1 = 0000 00BA
    EXTWH   R2,R3,R2        ; R2 = 0000 0000
    OR      R2,R1,R1        ; R1 = 0000 00BA


The intended sequence for loading and sign-extending a word from unaligned address X is:
    LDQ_U   R1,X(R11)       ; Ignores va<2:0>, R1 = yBAx xxxx
    LDQ_U   R2,X+1(R11)     ; Ignores va<2:0>, R2 = yBAx xxxx
    LDA     R3,X(R11)       ; R3<2:0> = (X mod 8) = 5
    EXTWL   R1,R3,R1        ; R1 = 0000 00BA
    EXTWH   R2,R3,R2        ; R2 = 0000 0000
    OR      R2,R1,R1        ; R1 = 0000 00BA
    SLL     R1,#48,R1       ; R1 = BA00 0000
    SRA     R1,#48,R1       ; R1 = ssss ssBA

The intended sequence for loading and zero-extending a byte from address X is:
    LDQ_U   R1,X(R11)       ; Ignores va<2:0>, R1 = yyAx xxxx
    LDA     R3,X(R11)       ; R3<2:0> = (X mod 8) = 5
    EXTBL   R1,R3,R1        ; R1 = 0000 000A

The intended sequence for loading and sign-extending a byte from address X is:
    LDQ_U   R1,X(R11)       ; Ignores va<2:0>, R1 = yyAx xxxx
    LDA     R3,X+1(R11)     ; R3<2:0> = (X + 1) mod 8, i.e.,
                            ; convert byte position within
                            ; quadword to one-origin based
    EXTQH   R1,R3,R1        ; Places the desired byte into byte 7
                            ; of R1.final by left shifting
                            ; R1.initial by ( 8 - R3<2:0> ) byte
                            ; positions
    SRA     R1,#56,R1       ; Arithmetic Shift of byte 7 down
                            ; into byte 0

Optimized examples:

Assume that a word fetch is needed from 10(R3), where R3 is intended to contain a
longword-aligned address. The optimized sequences below take advantage of the known constant
offset, and the longword alignment (hence a single aligned longword contains the entire word).
The sequences generate a Data Alignment Fault if R3 does not contain a longword-aligned
address.
The intended sequence for loading and zero-extending an aligned word from 10(R3) is:
    LDL     R1,8(R3)        ; R1 = ssss BAxx
                            ; Faults if R3 is not longword aligned
    EXTWL   R1,#2,R1        ; R1 = 0000 00BA

The intended sequence for loading and sign-extending an aligned word from 10(R3) is:
    LDL     R1,8(R3)        ; R1 = ssss BAxx
                            ; Faults if R3 is not longword aligned
    SRA     R1,#16,R1       ; R1 = ssss ssBA


Byte Insert
Format:
    INSxx   Ra.rq,Rb.rq,Rc.wq    !Operate format
    INSxx   Ra.rq,#b.ib,Rc.wq    !Operate format

Operation:
    CASE
      INSBL: byte_mask ← 0000 0000 0000 0001₂
      INSWx: byte_mask ← 0000 0000 0000 0011₂
      INSLx: byte_mask ← 0000 0000 0000 1111₂
      INSQx: byte_mask ← 0000 0000 1111 1111₂
    ENDCASE
    byte_mask ← LEFT_SHIFT(byte_mask, Rbv<2:0>)
    CASE
      INSxL:
        byte_loc ← Rbv<2:0>*8
        temp ← LEFT_SHIFT(Rav, byte_loc<5:0>)
        Rc ← BYTE_ZAP(temp, NOT(byte_mask<7:0>))
      INSxH:
        byte_loc ← 64 - Rbv<2:0>*8
        temp ← RIGHT_SHIFT(Rav, byte_loc<5:0>)
        Rc ← BYTE_ZAP(temp, NOT(byte_mask<15:8>))
    ENDCASE

Exceptions:
    None

Instruction mnemonics:
    INSBL   Insert Byte Low
    INSWL   Insert Word Low
    INSLL   Insert Longword Low
    INSQL   Insert Quadword Low
    INSWH   Insert Word High
    INSLH   Insert Longword High
    INSQH   Insert Quadword High

Qualifiers:
    None


Description:

INSxL and INSxH shift bytes from register Ra and insert them into a field of zeros, storing the
result in register Rc. Register Rb<2:0> selects the shift amount, and the function code selects the
maximum field width: 1, 2, 4, or 8 bytes. The instructions can generate a byte, word, longword, or
quadword datum that is spread across two registers at an arbitrary byte alignment.


Byte Mask
Format:
    MSKxx   Ra.rq,Rb.rq,Rc.wq    !Operate format
    MSKxx   Ra.rq,#b.ib,Rc.wq    !Operate format

Operation:
    CASE
      MSKBL: byte_mask ← 0000 0000 0000 0001₂
      MSKWx: byte_mask ← 0000 0000 0000 0011₂
      MSKLx: byte_mask ← 0000 0000 0000 1111₂
      MSKQx: byte_mask ← 0000 0000 1111 1111₂
    ENDCASE
    byte_mask ← LEFT_SHIFT(byte_mask, Rbv<2:0>)
    CASE
      MSKxL:
        Rc ← BYTE_ZAP(Rav, byte_mask<7:0>)
      MSKxH:
        Rc ← BYTE_ZAP(Rav, byte_mask<15:8>)
    ENDCASE

Exceptions:
    None

Instruction mnemonics:
    MSKBL   Mask Byte Low
    MSKWL   Mask Word Low
    MSKLL   Mask Longword Low
    MSKQL   Mask Quadword Low
    MSKWH   Mask Word High
    MSKLH   Mask Longword High
    MSKQH   Mask Quadword High

Qualifiers:
    None

Description:

MSKxL and MSKxH set selected bytes of register Ra to zero, storing the result in register Rc.
Register Rb<2:0> selects the starting position of the field of zero bytes, and the function code
selects the maximum width: 1, 2, 4, or 8 bytes. The instructions generate a byte, word, longword,
or quadword field of zeros that can spread across two registers at an arbitrary byte alignment.
Notes:
The comments in the examples below assume that the effective address (ea) of X(R11) is such that
(ea mod 8) = 5, the value of the aligned quadword containing X(R11) is CBAx xxxx, the value of
the aligned quadword containing X+7(R11) is yyyH GFED, and the value to be stored from R5 is
hgfe dcba.

The examples below are the most general case; if more information is known about the value or
intended alignment of X, shorter sequences can be used.
The intended sequence for storing an unaligned quadword R5 at address X(R11) is:
    LDA     R6,X(R11)       ; R6<2:0> = (X mod 8) = 5
    LDQ_U   R2,X+7(R11)     ; Ignores va<2:0>, R2 = yyyH GFED
    LDQ_U   R1,X(R11)       ; Ignores va<2:0>, R1 = CBAx xxxx
    INSQH   R5,R6,R4        ; R4 = 000h gfed
    INSQL   R5,R6,R3        ; R3 = cba0 0000
    MSKQH   R2,R6,R2        ; R2 = yyy0 0000
    MSKQL   R1,R6,R1        ; R1 = 000x xxxx
    OR      R2,R4,R2        ; R2 = yyyh gfed
    OR      R1,R3,R1        ; R1 = cbax xxxx
    STQ_U   R2,X+7(R11)     ; Must store high then low for
    STQ_U   R1,X(R11)       ; degenerate case of aligned QW
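
The store sequence, including the reason the high quadword must be stored first, can be checked with a Python simulation. This is a sketch under stated assumptions: memory is a little-endian `bytearray`, slice reads and writes stand in for LDQ_U and STQ_U, and every function name here (`byte_zap`, `insql`, `insqh`, `mskql`, `mskqh`, `store_unaligned_quad`) is a hypothetical model of the operation pseudocode above.

```python
MASK64 = (1 << 64) - 1

def byte_zap(value, mask):
    """Clear byte i of value wherever bit i of mask is set."""
    for i in range(8):
        if (mask >> i) & 1:
            value &= ~(0xFF << (8 * i))
    return value & MASK64

def _field_mask(rbv):
    """16-bit byte mask for a quadword field starting at byte Rbv<2:0>."""
    return 0xFF << (rbv & 7)

def insql(rav, rbv):
    shifted = (rav << ((rbv & 7) * 8)) & MASK64
    return byte_zap(shifted, ~_field_mask(rbv) & 0xFF)

def insqh(rav, rbv):
    shift = (64 - (rbv & 7) * 8) & 63
    return byte_zap((rav & MASK64) >> shift, ~(_field_mask(rbv) >> 8) & 0xFF)

def mskql(rav, rbv):
    return byte_zap(rav, _field_mask(rbv) & 0xFF)

def mskqh(rav, rbv):
    return byte_zap(rav, (_field_mask(rbv) >> 8) & 0xFF)

def store_unaligned_quad(memory, addr, r5):
    lo, hi = addr & ~7, (addr + 7) & ~7
    r2 = int.from_bytes(memory[hi:hi + 8], "little")
    r1 = int.from_bytes(memory[lo:lo + 8], "little")
    r2 = mskqh(r2, addr) | insqh(r5, addr)
    r1 = mskql(r1, addr) | insql(r5, addr)
    memory[hi:hi + 8] = r2.to_bytes(8, "little")   # high first
    memory[lo:lo + 8] = r1.to_bytes(8, "little")   # low last wins if aligned
```

In the aligned case both stores target the same quadword; the high store writes back the unmodified original and the low store then writes the new data, so reversing the order would lose the update.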

The intended sequence for storing an unaligned longword R5 at X is:
    LDA     R6,X(R11)       ; R6<2:0> = (X mod 8) = 5
    LDQ_U   R2,X+3(R11)     ; Ignores va<2:0>, R2 = yyyy yyyD
    LDQ_U   R1,X(R11)       ; Ignores va<2:0>, R1 = CBAx xxxx
    INSLH   R5,R6,R4        ; R4 = 0000 000d
    INSLL   R5,R6,R3        ; R3 = cba0 0000
    MSKLH   R2,R6,R2        ; R2 = yyyy yyy0
    MSKLL   R1,R6,R1        ; R1 = 000x xxxx
    OR      R2,R4,R2        ; R2 = yyyy yyyd
    OR      R1,R3,R1        ; R1 = cbax xxxx
    STQ_U   R2,X+3(R11)     ; Must store high then low for
    STQ_U   R1,X(R11)       ; degenerate case of aligned


The intended sequence for storing an unaligned word R5 at X is:
    LDA     R6,X(R11)       ; R6<2:0> = (X mod 8) = 5
    LDQ_U   R2,X+1(R11)     ; Ignores va<2:0>, R2 = yBAx xxxx
    LDQ_U   R1,X(R11)       ; Ignores va<2:0>, R1 = yBAx xxxx
    INSWH   R5,R6,R4        ; R4 = 0000 0000
    INSWL   R5,R6,R3        ; R3 = 0ba0 0000
    MSKWH   R2,R6,R2        ; R2 = yBAx xxxx
    MSKWL   R1,R6,R1        ; R1 = y00x xxxx
    OR      R2,R4,R2        ; R2 = yBAx xxxx
    OR      R1,R3,R1        ; R1 = ybax xxxx
    STQ_U   R2,X+1(R11)     ; Must store high then low for
    STQ_U   R1,X(R11)       ; degenerate case of aligned

The intended sequence for storing a byte R5 at X is:
    LDA     R6,X(R11)       ; R6<2:0> = (X mod 8) = 5
    LDQ_U   R1,X(R11)       ; Ignores va<2:0>, R1 = yyAx xxxx
    INSBL   R5,R6,R3        ; R3 = 00a0 0000
    MSKBL   R1,R6,R1        ; R1 = yy0x xxxx
    OR      R1,R3,R1        ; R1 = yyax xxxx
    STQ_U   R1,X(R11)


Zero Bytes
Format:
    ZAPx    Ra.rq,Rb.rq,Rc.wq    !Operate format
    ZAPx    Ra.rq,#b.ib,Rc.wq    !Operate format

Operation:
    CASE
      ZAP:
        Rc ← BYTE_ZAP(Rav, Rbv<7:0>)
      ZAPNOT:
        Rc ← BYTE_ZAP(Rav, NOT Rbv<7:0>)
    ENDCASE

Exceptions:
    None

Instruction mnemonics:
    ZAP     Zero Bytes
    ZAPNOT  Zero Bytes Not

Qualifiers:
    None
Description:

ZAP and ZAPNOT set selected bytes of register Ra to zero, and store the result in register Rc.
Register Rb<7:0> selects the bytes to be zeroed; bit 0 of Rbv corresponds to byte 0, bit 1 of Rbv
corresponds to byte 1, and so on. A result byte is set to zero if the corresponding bit of Rbv is a
one for ZAP and a zero for ZAPNOT.
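
The byte-selection rule can be illustrated with a short Python model. This is a minimal sketch, assuming 64-bit unsigned register images; `byte_zap`, `zap`, and `zapnot` are hypothetical function names that follow the operation pseudocode above.

```python
MASK64 = (1 << 64) - 1

def byte_zap(value, mask):
    """Clear byte i of value wherever bit i of the 8-bit mask is set."""
    for i in range(8):
        if (mask >> i) & 1:
            value &= ~(0xFF << (8 * i))
    return value & MASK64

def zap(rav, rbv):
    """ZAP: zero the bytes selected by one bits in Rbv<7:0>."""
    return byte_zap(rav, rbv & 0xFF)

def zapnot(rav, rbv):
    """ZAPNOT: zero the bytes selected by zero bits in Rbv<7:0>."""
    return byte_zap(rav, ~rbv & 0xFF)
```

As a usage example, `zapnot(x, 0x0F)` keeps only the low four bytes, which zero-extends a longword held in a quadword register.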


• Floating-Point Instructions
Alpha provides instructions for operating on floating-point operands in each of four data formats:
• F_floating (VAX single)
• G_floating (VAX double, 11-bit exponent)
• S_floating (IEEE single)
• T_floating (IEEE double, 11-bit exponent)
Data conversion instructions are also provided to convert operands between floating-point and
quadword integer formats, between double and single floating, and between quadword and
longword integers.
Note
D_floating is a partially supported datatype; no D_floating arithmetic
operations are provided in the architecture. For backward compatibility,
exact D_floating arithmetic may be provided via software emulation.
D_floating "format compatibility," in which binary files of D_floating
numbers may be processed but without the last 3 bits of fraction precision, can be obtained via conversions to G_floating, G arithmetic operations, then conversion back to D_floating.

The choice of data formats is encoded in each instruction. Each instruction also encodes the
choice of rounding mode and the choice of trapping mode.
All floating-point operate instructions (that is, not including loads or stores) that yield an F_ or
G_floating zero result must materialize a true zero.

Floating Subsets and Floating Faults
All floating-point operations may take floating disabled faults. Any subsetted floating-point
instruction may take an Illegal Instruction Trap. These faults are not explicitly listed in the
description of each instruction.
All floating-point loads and stores may take memory management faults (access control violation,
translation not valid, fault on read/write, data alignment).
The Floating-point Enable (FEN) internal processor register (IPR) allows system software to
restrict access to the floating registers.

If a floating instruction is implemented and FEN = 0 , attempts to execute the instruction cause a
floating disabled fault.

If a floating instruction is not implemented, attempts to execute the instruction cause an Illegal
Instruction Trap. This rule holds regardless of the value of FEN.
An Alpha implementation may provide both VAX and IEEE floating-point operations, either, or
none.


Some floating-point instructions are common to the VAX and IEEE subsets, some are VAX only,
and some are IEEE only. These are designated in the descriptions that follow. If either subset is
implemented, all the common instructions must be implemented.
An implementation including IEEE floating-point may subset the ability to perform rounding to
plus infinity and minus infinity. If not implemented, instructions requesting these rounding
modes take Illegal Instruction Trap.

Definitions
The following definitions apply to Alpha floating-point support.
true result
The mathematically correct result of an operation, assuming that the input operand values are
exact. The true result is typically rounded to the nearest representable result.
representable result
A real number that can be represented exactly as a VAX or IEEE floating-point number, with finite
precision and bounded exponent range.

LSB
The least significant bit. For a positive representable number A whose fraction is not all ones,
A + 1 LSB is the next larger representable number, and A + 1/2 LSB is exactly halfway between A
and the next larger representable number.
true zero
The value +0, represented as exactly 64 zeros in a floating-point register.
Alpha finite number
A floating-point number with a definite, in-range value. Specifically, all numbers in the inclusive
ranges -MAX..-MIN, zero, +MIN..+MAX, where MAX is the largest non-infinite representable
floating-point number and MIN is the smallest non-zero representable normalized floating-point
number.

For VAX floating-point, finites do not include reserved operands or dirty zeros (this differs from
the usual VAX interpretation of dirty zeros as finite). For IEEE floating-point, finites do not
include infinities, NaNs, or denormals, but do include minus zero.
Not-a-Number
An IEEE floating-point bit pattern that represents something other than a number. This comes in
two forms: signaling NaNs (for Alpha, those with an initial fraction bit of 0) and quiet NaNs (for
Alpha, those with an initial fraction bit of 1).
infinity
An IEEE floating-point bit pattern that represents plus or minus infinity.
denormal
An IEEE floating-point bit pattern that represents a number whose magnitude lies between zero
and the smallest finite number.
dirty zero
A VAX floating-point bit pattern that represents a zero value, but not in true-zero form.


reserved operand
A VAX floating-point bit pattern that represents an illegal value.
trap shadow
The set of instructions potentially executed after an instruction that signals an arithmetic trap but
before the trap is actually taken.

Encodings
Floating-point numbers are represented with three fields: sign, exponent, and fraction. The sign is
1 bit; the exponent is 8 or 11 bits; and the fraction is 23, 52, or 55 bits. Some encodings represent
special values:

                                VAX             VAX      IEEE            IEEE
Sign  Exponent  Fraction        Meaning         Finite   Meaning         Finite
x     All-1's   Non-zero        Finite          Yes      +/-NaN          No
x     All-1's   0               Finite          Yes      +/-Infinity     No
0     0         Non-zero        Dirty zero      No       +Denormal       No
1     0         Non-zero        Resv. operand   No       -Denormal       No
0     0         0               True zero       Yes      +0              Yes
1     0         0               Resv. operand   No       -0              Yes
x     Other     x               Finite          Yes      finite          Yes

The values of MIN and MAX for each of the four floating-point data formats are:
Data Format   MIN                   MAX
F_floating    2**-127 * 0.5         2**127 * (1.0 - 2**-24)
              (0.294e-38)           (1.70e38)
G_floating    2**-1023 * 0.5        2**1023 * (1.0 - 2**-53)
              (0.56e-308)           (0.899e308)
S_floating    2**-126 * 1.0         2**127 * (2.0 - 2**-23)
              (1.175e-38)           (3.40e38)
T_floating    2**-1022 * 1.0        2**1023 * (2.0 - 2**-52)
              (2.225e-308)          (1.798e308)
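
The MIN and MAX expressions can be evaluated directly to confirm the approximate decimal values. The sketch below uses Python floats, which are assumed to be IEEE double precision (the same layout as T_floating), so the T_floating limits can also be compared against `sys.float_info`; the `LIMITS` dictionary is a hypothetical name for illustration.

```python
import sys

# (MIN, MAX) per format, written exactly as in the expressions above.
LIMITS = {
    "F_floating": (2.0**-127 * 0.5, 2.0**127 * (1.0 - 2.0**-24)),
    "G_floating": (2.0**-1023 * 0.5, 2.0**1023 * (1.0 - 2.0**-53)),
    "S_floating": (2.0**-126 * 1.0, 2.0**127 * (2.0 - 2.0**-23)),
    "T_floating": (2.0**-1022 * 1.0, 2.0**1023 * (2.0 - 2.0**-52)),
}

for name, (fmin, fmax) in LIMITS.items():
    print(f"{name}: MIN={fmin:.4g} MAX={fmax:.4g}")

# On an IEEE-double host the T_floating limits match the host float limits.
t_min, t_max = LIMITS["T_floating"]
host_matches = (t_min == sys.float_info.min and t_max == sys.float_info.max)
```

The G_floating MIN lies below the smallest normalized IEEE double, so on such a host it is held as a denormal; it is still exactly representable because it is a power of two.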

Floating-Point Rounding Modes
All rounding modes map a true result that is exactly representable to that representable value.


VAX Rounding Modes
For VAX floating-point operations, two rounding modes are provided and are specified in each
instruction: normal (biased) rounding and chopped rounding.
Normal VAX rounding maps the true result to the nearest of two representable results, with true
results exactly halfway between mapped to the larger in absolute value (sometimes called biased
rounding away from zero); maps true results ≥ MAX + 1/2 LSB in magnitude to an overflow; and
maps true results < MIN - 1/2 LSB in magnitude to an underflow.
Chopped VAX rounding maps the true result to the smaller in magnitude of two surrounding
representable results; maps true results ≥ MAX + 1 LSB in magnitude to an overflow; and maps
true results < MIN in magnitude to an underflow.

IEEE Rounding Modes
For IEEE floating-point operations, four rounding modes are provided: normal rounding (unbiased round to nearest), rounding toward minus infinity, round toward zero, and rounding toward
plus infinity. The first three can be specified in the instruction. Rounding toward plus infinity can
be obtained by setting the Floating-point Control Register (FPCR) to select it and then specifying
dynamic rounding mode in the instruction (See FPCR Register and Dynamic Rounding Mode in
this chapter). Alpha IEEE arithmetic does rounding before detecting overflow/underflow.
Normal IEEE rounding maps the true result to the nearest of two representable results, with true
results exactly halfway between mapped to the one whose fraction ends in 0 (sometimes called
unbiased rounding to even); maps true results ≥ MAX + 1/2 LSB in magnitude to an overflow;
and maps true results < MIN - 1/2 LSB in magnitude to an underflow.
Plus infinity IEEE rounding maps the true result to the larger of two surrounding representable
results; maps true results > MAX in magnitude to an overflow; maps positive true results ≤ +MIN
- 1 LSB to an underflow; and maps negative true results > -MIN to an underflow.
Minus infinity IEEE rounding maps the true result to the smaller of two surrounding representable results; maps true results > MAX in magnitude to an overflow; maps positive true results
< +MIN to an underflow; and maps negative true results ≥ -MIN + 1 LSB to an underflow.
Chopped IEEE rounding maps the true result to the smaller in magnitude of two surrounding
representable results; maps true results ≥ MAX + 1 LSB in magnitude to an overflow; and maps
non-zero true results < MIN in magnitude to an underflow.
Dynamic rounding mode uses the IEEE rounding mode selected by the FPCR register and is
described in more detail in FPCR Register and Dynamic Rounding Mode in this chapter.
The following tables summarize the floating-point rounding modes:

VAX Rounding Mode       Instruction Notation

Normal rounding         (No modifier)
Chopped                 /C

IEEE Rounding Mode      Instruction Notation

Normal rounding         (No modifier)
Dynamic rounding        /D
Plus infinity           /D and ensure that FPCR<DYN> = '11'
Minus infinity          /M
Chopped                 /C
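
The rounding rules described above can be illustrated with a small Python sketch that rounds an exact rational value to a significand with a given number of fraction bits. This is only a model of the rounding step: the overflow and underflow range checks are left out, and the function name and mode strings are assumptions made for the example (use 23 fraction bits for F_floating and S_floating, 52 for G_floating and T_floating).

```python
from fractions import Fraction


def round_significand(x: Fraction, frac_bits: int, mode: str) -> Fraction:
    """Round exact value x to a significand with frac_bits fraction bits.

    mode: 'vax_normal', 'ieee_normal', 'chopped', 'plus', or 'minus'.
    Range checking (overflow/underflow) is deliberately not modeled.
    """
    if x == 0:
        return Fraction(0)
    negative = x < 0
    mag = -x if negative else x
    # Find e such that 2**e <= mag < 2**(e + 1).
    e = mag.numerator.bit_length() - mag.denominator.bit_length()
    if Fraction(2) ** e > mag:
        e -= 1
    lsb = Fraction(2) ** (e - frac_bits)   # weight of the last fraction bit
    q = mag // lsb                         # whole number of LSBs (chopped)
    r = mag - q * lsb                      # discarded part, 0 <= r < lsb
    if mode == "chopped":
        up = False
    elif mode == "plus":
        up = r > 0 and not negative
    elif mode == "minus":
        up = r > 0 and negative
    elif mode == "ieee_normal":            # nearest, ties to even
        up = 2 * r > lsb or (2 * r == lsb and q % 2 == 1)
    elif mode == "vax_normal":             # nearest, ties away from zero
        up = 2 * r >= lsb
    else:
        raise ValueError(f"unknown rounding mode: {mode}")
    result = (q + 1 if up else q) * lsb
    return -result if negative else result


if __name__ == "__main__":
    ulp = Fraction(2) ** -23
    halfway = 1 + ulp / 2
    for m in ("vax_normal", "ieee_normal", "chopped", "plus", "minus"):
        print(m, float(round_significand(halfway, 23, m)))
```

A value exactly halfway between two representable results is the case that separates the two "normal" modes: VAX rounding moves it away from zero, while IEEE rounding picks the neighbor whose fraction ends in 0.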

Floating-Point Trapping Modes
There are six exceptions that can be generated by floating-point operate instructions, all signaled
by an arithmetic exception trap. These exceptions are:
• Invalid operation
• Division by zero
• Overflow
• Underflow, may be disabled
• Inexact result, may be disabled
• Integer overflow (conversion to integer only), may be disabled

VAX Trapping Modes
For VAX floating-point operations other than CVTxQ, four trapping modes are provided. They
specify software completion and whether traps are enabled for underflow.
For VAX conversions from floating-point to integer, four trapping modes are provided. They
specify software completion and whether traps are enabled for integer overflow.

IEEE Trapping Modes
For IEEE floating-point operations other than CVTxQ, four trapping modes are provided. They
specify software completion and whether traps are enabled for underflow and inexact results.
For IEEE conversions from floating-point to integer, four trapping modes are provided. They
specify software completion, and whether traps are enabled for integer overflow and inexact
results.
The modes and instruction notation are:

VAX Trap Mode                                    Instruction Notation

Imprecise, underflow disabled                    (No modifier)
Imprecise, underflow enabled                     /U
Software, underflow disabled                     /S
Software, underflow enabled                      /SU

VAX Convert-to-Integer Trap Mode                 Instruction Notation

Imprecise, integer overflow disabled             (No modifier)
Imprecise, integer overflow enabled              /V
Software, integer overflow disabled              /S
Software, integer overflow enabled               /SV

IEEE Trap Mode                                   Instruction Notation

Imprecise, unfl disabled, inexact disabled       (No modifier)
Imprecise, unfl enabled, inexact disabled        /U
Software, unfl enabled, inexact disabled         /SU
Software, unfl enabled, inexact enabled          /SUI

IEEE Convert-to-Integer Trap Mode                Instruction Notation

Imprecise, int.ovfl disabled, inexact disabled   (No modifier)
Imprecise, int.ovfl enabled, inexact disabled    /V
Software, int.ovfl enabled, inexact disabled     /SV
Software, int.ovfl enabled, inexact enabled      /SVI

Imprecise/Software Completion Trap Modes
Floating-point instructions may be pipelined, and all exceptions are imprecise traps:
• The trapping instruction may write an UNPREDICTABLE result value.
• The trap PC is an arbitrary number of instructions past the one triggering the trap. The trigger
instruction plus all intervening executed instructions are collectively referred to as the trap
shadow of the trigger instruction.
• The extent of the trap shadow is bounded only by a TRAPB instruction (or the implicit TRAPB
within a CALL_PAL instruction).
• Input operand values may have been overwritten in the trap shadow.
• Result values may have been overwritten in the trap shadow.
• An UNPREDICTABLE result value may have been used as an input operand in the trap shadow.
• Additional traps may occur in the trap shadow.
• In general, it is not feasible to fix up the result value or to continue from the trap.

This behavior is ideal for operations on finite operands that give finite results. For programs that
deliberately operate outside the overflow/underflow range, or use IEEE NaNs, software assistance
is required to complete floating-point operations correctly. This assistance can be provided by a
software arithmetic trap handler, plus constraints on the instructions surrounding the trap.
For a trap handler to complete non-finite arithmetic, the following conditions must hold:
1. On entry to the trap shadow, if any Alpha register or memory location contains a value that is
used as an operand value by some instruction in the trap shadow (live on entry), then no
instruction in the trap shadow may modify the register or memory location.
2. Within the trap shadow, the computation of the base register for a memory load or store
instruction may not involve using the result of an instruction that might generate an UNPREDICTABLE result.
3. Within the trap shadow, no register may be used more than once as a destination register.
4. The trap shadow may not include any branch instructions.
5. Each floating instruction to be completed must be so marked, by specifying the /S software
completion modifier.
The first condition allows a software trap handler to emulate the trigger instruction with its
original input operand values and then to reexecute the rest of the trap shadow.
The second condition prevents memory accesses at unpredictable addresses.
The remaining conditions make it possible for a software trap handler to find the trigger
instruction via a linear scan backwards from the trap PC.
Note
The /S modifier does not affect instruction operation or trap behavior; it
is an informational bit passed to a software trap handler. It allows a trap
handler to test easily whether an instruction is intended to be completed.
(The /S bits of instructions signaling traps are carried into the trap
summary.) The handler may then assume that the other conditions are
met without examining the code stream.

If a software trap handler is provided, it must handle the completion of all floating-point
operations marked /S that follow the rules above. In effect, one TRAPB instruction per basic
block can be used.
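
As an illustration of how a tool might check a candidate trap shadow against some of these conditions, here is a minimal Python sketch. It covers register operands only (conditions 1, 3, and 4); memory locations and the base-register rule of condition 2 are not modeled, and writes to the always-zero registers are not special-cased. The Insn record and the function name are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Insn:
    """Simplified instruction record; all field names are hypothetical."""
    text: str                  # assembly text, used only in messages
    srcs: Tuple[str, ...]      # registers read
    dest: str = ""             # register written ("" if none)
    is_branch: bool = False


def shadow_violations(shadow: List[Insn]) -> List[str]:
    """Return a list of rule violations for the instructions in a trap shadow."""
    problems: List[str] = []
    written = set()            # registers already used as a destination
    live_on_entry = set()      # registers read before any write in the shadow
    for insn in shadow:
        if insn.is_branch:
            problems.append(f"{insn.text}: branch in trap shadow (condition 4)")
        for reg in insn.srcs:
            if reg not in written:
                live_on_entry.add(reg)
        if insn.dest:
            if insn.dest in live_on_entry:
                problems.append(
                    f"{insn.text}: overwrites live-on-entry {insn.dest} (condition 1)")
            if insn.dest in written:
                problems.append(
                    f"{insn.text}: {insn.dest} reused as destination (condition 3)")
            written.add(insn.dest)
    return problems


if __name__ == "__main__":
    shadow = [
        Insn("ADDT/S F1,F2,F3", ("F1", "F2"), "F3"),
        Insn("MULT/S F3,F4,F1", ("F3", "F4"), "F1"),   # F1 was live on entry
    ]
    for line in shadow_violations(shadow):
        print(line)
```

An empty result means only that these three register-level conditions hold for the listed instructions; it does not by itself show that the sequence is safe to complete in software.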

Invalid Operation Arithmetic Trap
An invalid operation arithmetic trap is signaled if any operand of a floating arithmetic-operate
instruction is non-finite. (CMPTxy is an exception to the rule and operates normally with plus and
minus infinity and does not trap in this case.) This trap is always enabled. If this trap occurs, an
UNPREDICTABLE value is stored in the result register. (IEEE-compliant system software must
also supply an invalid operation indication to the user for SQRT of a negative non-zero number,
0/0, x REM 0, and conversions to integer that take an integer overflow trap.)

Division by Zero Arithmetic Trap
A division by zero arithmetic trap is taken if the numerator does not cause an invalid operation
trap and the denominator is zero. This trap is always enabled. If this trap occurs, an UNPREDICTABLE value is stored in the result register.

Overflow Arithmetic Trap
An overflow arithmetic trap is signaled if the rounded result exceeds in magnitude the largest
finite number of the destination format. This trap is always enabled. If this trap occurs, an
UNPREDICTABLE value is stored in the result register.

Underflow Arithmetic Trap
An underflow occurs if the rounded result is smaller in magnitude than the smallest finite number
of the destination format.

If an underflow occurs, a true zero (64 bits of zero) is always stored in the result register, even if
the proper IEEE result would have been -0 (underflow below the negative denormal range).

If an underflow occurs and underflow traps are enabled by the instruction, an underflow
arithmetic trap is signaled.

Inexact Result Arithmetic Trap
An inexact result occurs if the infinitely precise result differs from the rounded result.

If an inexact result occurs, the normal rounded result is still stored in the result register.
If an inexact result occurs and inexact result traps are enabled by the instruction, an inexact
result arithmetic trap is signaled.

Integer Overflow Arithmetic Trap
In conversions from floating to quadword integer, an integer overflow occurs if the rounded
result is outside the range -2**63..2**63-1. In conversions from quadword integer to longword
integer, an integer overflow occurs if the result is outside the range -2**31..2**31-1.

If an integer overflow occurs in CVTxQ or CVTQL, the true result truncated to the low-order 64
or 32 bits respectively is stored in the result register.

If an integer overflow occurs and integer overflow traps are enabled by the instruction, an integer
overflow arithmetic trap is signaled.
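
The stored value on integer overflow can be modeled in a few lines of Python. This is a sketch under the assumption that "truncated to the low-order 64 or 32 bits" means keeping the low-order bits of the two's-complement true result; the function name is illustrative.

```python
def store_integer_result(true_result: int, bits: int):
    """Return (stored_value, overflow) for a two's-complement destination.

    stored_value is the true result truncated to its low-order `bits` bits
    and reinterpreted as a signed integer. overflow reports whether the true
    result lay outside -2**(bits-1) .. 2**(bits-1)-1.
    """
    lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    overflow = not (lo <= true_result <= hi)
    low = true_result & ((1 << bits) - 1)   # keep the low-order bits
    if low > hi:                            # reinterpret as signed
        low -= 1 << bits
    return low, overflow


if __name__ == "__main__":
    print(store_integer_result(2 ** 63, 64))      # just past the quadword range
    print(store_integer_result(2 ** 32 + 7, 32))  # past the longword range
```

Whether a trap is then signaled depends on the /V qualifier of the instruction, which this sketch does not model.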

Floating-Point Single-Precision Operations
Single-precision values (F_floating or S_floating) are stored in the floating registers in canonical
form, as subsets of double-precision values, with 11-bit exponents restricted to the corresponding
single-precision range, and with the 29 low-order fraction bits restricted to be all zero.
Single-precision operations applied to canonical single-precision values give single-precision
results. Single-precision operations applied to non-canonical operands give UNPREDICTABLE
results.
Longword integer values in floating registers are stored in bits <63:62,58:29>, with bits <61:59>
ignored and zeros in bits <28:0>.

FPCR Register and Dynamic Rounding Mode
When an IEEE floating-point operate instruction specifies dynamic mode (/D) in its function field
(function code bits <7:6> = 11), the rounding mode to be used for the instruction is derived from
the FPCR register. The layout of the rounding mode bits and their assignments matches exactly
the format used in the 11-bit function field of the floating-point operate instructions.
In addition, the FPCR gives a summary, for each exception type, of the exception conditions
detected by all IEEE floating-point operates thus far, as well as an overall summary bit that
indicates whether any of these exception conditions has been detected. The individual exception
bits match exactly in purpose and order the exception bits found in the exception summary
quadword that is pushed for arithmetic traps. However, for each instruction, these exception
bits are set independent of the trapping mode specified for the instruction. Therefore, even
though trapping may be disabled for a certain exceptional condition, the fact that the exceptional
condition was encountered by an instruction will still be recorded in the FPCR.
Floating-point operates that belong to the IEEE subset and CVTQL, which belongs to both VAX
and IEEE subsets, appropriately set the FPCR exception bits. It is UNPREDICTABLE whether
floating-point operates that belong only to the VAX floating-point subset set the FPCR exception
bits.
Alpha floating-point hardware only transitions these exception bits from zero to one. Once set to
one, these exception bits are only cleared when software writes zero into these bits by writing a
new value into the FPCR.
The format of the FPCR is shown in Figure 4-1 and described in Table 4-8.
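  63   62     60  59 58   57    56    55    54    53    52   51             0
+-----+---------+-------+-----+-----+-----+-----+-----+-----+----------------+
| SUM | RAZ/IGN |  DYN  | IOV | INE | UNF | OVF | DZE | INV |    RAZ/IGN     |
+-----+---------+-------+-----+-----+-----+-----+-----+-----+----------------+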

Figure 4-1 • Floating-Point Control Register (FPCR) Format

Table 4-8 • Floating-Point Control Register (FPCR) Bit Descriptions

Bit     Description

63      Summary Bit (SUM). Records bitwise OR of FPCR exception bits. Equal to
        (FPCR[57] | FPCR[56] | FPCR[55] | FPCR[54] | FPCR[53] | FPCR[52]).

62-60   Reserved. Read As Zero; Ignored when written.

59-58   Dynamic Rounding Mode (DYN). Indicates the rounding mode to be used by an
        IEEE floating-point operate instruction when the instruction's function
        field specifies dynamic mode (/D). Assignments are:

        DYN     IEEE Rounding Mode Selected
        00      Chopped rounding mode
        01      Minus infinity
        10      Normal rounding
        11      Plus infinity

57      Integer Overflow (IOV). An integer arithmetic operation or a conversion from
        floating to integer overflowed the destination precision.

56      Inexact Result (INE). A floating arithmetic or conversion operation gave a result
        that differed from the mathematically exact result.

55      Underflow (UNF). A floating arithmetic or conversion operation underflowed the
        destination exponent.

54      Overflow (OVF). A floating arithmetic or conversion operation overflowed the
        destination exponent.

53      Division by Zero (DZE). An attempt was made to perform a floating divide
        operation with a divisor of zero.

52      Invalid Operation (INV). An attempt was made to perform a floating arithmetic,
        conversion, or comparison operation, and one or more of the operand values were
        illegal.

51-0    Reserved. Read As Zero; Ignored when written.
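
A short Python sketch of decoding an FPCR image may help when reading the bit descriptions. It assumes the six exception bits occupy bits 57 down to 52 in the order the descriptions are listed (IOV, INE, UNF, OVF, DZE, INV), which is what the SUM formula suggests, and that DYN values 00 through 11 select the modes in the order listed. The names decode_fpcr and expected_sum are illustrative.

```python
DYN_MODES = {
    0b00: "Chopped",
    0b01: "Minus infinity",
    0b10: "Normal",
    0b11: "Plus infinity",
}

# Assumed bit positions, in descending order from bit 57.
EXCEPTION_BITS = {"IOV": 57, "INE": 56, "UNF": 55, "OVF": 54, "DZE": 53, "INV": 52}


def decode_fpcr(fpcr: int) -> dict:
    """Split a 64-bit FPCR image into its named fields."""
    fields = {name: (fpcr >> bit) & 1 for name, bit in EXCEPTION_BITS.items()}
    fields["DYN"] = DYN_MODES[(fpcr >> 58) & 0b11]
    fields["SUM"] = (fpcr >> 63) & 1
    return fields


def expected_sum(fpcr: int) -> int:
    """Bitwise OR of the six exception bits, as SUM is defined."""
    return int(any((fpcr >> bit) & 1 for bit in EXCEPTION_BITS.values()))


if __name__ == "__main__":
    image = (1 << 63) | (0b11 << 58) | (1 << 56)   # SUM, DYN=11, INE
    print(decode_fpcr(image))
```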

FPCR is read from and written to the floating-point registers by the MF_FPCR and MT_FPCR
instructions respectively, which are described in Accessing the FPCR in this chapter.
FPCR and the instructions to access it are required for an implementation that supports floating-point (see Floating-Point Subsets in this chapter). On implementations that do not support
floating-point, the instructions that access FPCR (MF_FPCR and MT_FPCR) take an Illegal
Instruction Trap.
Software Note
As noted in Floating-Point Subsets in this chapter, support for FPCR is
required on a system that supports VMS even if that system does not
support floating-point.

Accessing the FPCR
Because Alpha floating-point hardware can overlap the execution of a number of floating-point
instructions, accessing the FPCR must be synchronized with other floating-point instructions. A
TRAPB must be issued both prior to and after accessing the FPCR to ensure that the FPCR access
is synchronized with the execution of previous and subsequent floating-point instructions; otherwise synchronization is not ensured.
Issuing a TRAPB followed by an MT_FPCR followed by another TRAPB ensures that only
floating-point instructions issued after the second TRAPB are affected by and affect the new value
of the FPCR. Issuing a TRAPB followed by an MF_FPCR followed by another TRAPB ensures that
the value read from the FPCR only records the exception information for floating-point instructions issued prior to the first TRAPB.
Consider the following example:
    ADDT/D
    TRAPB                       ;1
    MT_FPCR  F1,F1,F1
    TRAPB                       ;2
    SUBT/D

Without the first TRAPB, it is possible in an implementation for the ADDT/D to execute in
parallel with the MT_FPCR. Thus, it would be UNPREDICTABLE whether the ADDT/D was
affected by the new rounding mode set by the MT_FPCR and whether fields cleared by the
MT_FPCR in the exception summary were subsequently set by the ADDT/D.
Without the second TRAPB, it is possible in an implementation for the MT_FPCR to execute in
parallel with the SUBT/D. Thus, it would be UNPREDICTABLE whether the SUBT/D was affected
by the new rounding mode set by the MT_FPCR and whether fields cleared by the MT_FPCR in
the exception summary field of FPCR were previously set by the SUBT/D.

Default Values of the FPCR
Processor initialization leaves the value of FPCR UNPREDICTABLE.
Software Note
Digital software should initialize FPCR<DYN> = 11 during program
activation. Using this default, interval arithmetic code can switch from
plus to minus infinity rounding with no penalty in performance by using
/M and /D qualifiers.

Program activation should clear all other fields of the FPCR.

Saving and Restoring the FPCR
The FPCR must be saved and restored across context switches so that the FPCR value of one
process does not affect the rounding behavior and exception summary of another process.
The dynamic rounding mode put into effect by the programmer (or initialized by image activation) is valid for the entirety of the program and remains in effect until subsequently changed by
the programmer or until image run-down occurs.
Software Note
The IEEE standard precludes saving and restoring the FPCR across
subroutine calls.

IEEE Standard
The IEEE Standard for Binary Floating-Point Arithmetic (ANSI/IEEE Standard 754-1985) is
included by reference.

• Memory Format Floating-Point Instructions
The instructions in this section move data between the floating-point registers and memory. They
use the Memory instruction format. They do not interpret the bits moved in any way; specifically,
they do not trap on non-finite values.
The instructions are summarized in Table 4-9.
Table 4-9 • Memory Format Floating-Point Instructions Summary

Mnemonic    Operation                                    Subset

LDF         Load F_floating                              VAX
LDG         Load G_floating (Load D_floating)            VAX
LDS         Load S_floating (Load Longword Integer)      Both
LDT         Load T_floating (Load Quadword Integer)      Both
STF         Store F_floating                             VAX
STG         Store G_floating (Store D_floating)          VAX
STS         Store S_floating (Store Longword Integer)    Both
STT         Store T_floating (Store Quadword Integer)    Both

Load F_floating
Format:
    LDF     Fa.wf,disp.ab(Rb.ab)            !Memory format

Operation:
    va ← {Rbv + SEXT(disp)}
    Fa ← (va)<15> || MAP_F((va)<14:7>) || (va)<6:0> ||
         (va)<31:16> || 0<28:0>

Exceptions:
    Access Violation
    Fault on Read
    Alignment
    Translation Not Valid

Instruction mnemonics:
    LDF     Load F_floating

Qualifiers:
    None

Description:

LDF fetches an F_floating datum from memory and writes it to register Fa. If the data is not
naturally aligned, an alignment exception is generated.
The 8-bit memory-format exponent is expanded to an 11-bit register-format exponent according
to Table 2-1.
The virtual address is computed by adding register Rb to the sign-extended 16-bit displacement.
The source operand is fetched from memory and the bytes are reordered to conform to the
F_floating register format. The result is then zero-extended in the low-order longword and
written to register Fa.
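
The address computation shared by all of the memory-format loads and stores can be sketched in Python as follows. The sketch assumes that the sum wraps modulo 2**64 and that "naturally aligned" means the address is a multiple of the datum size in bytes; the function names are illustrative.

```python
MASK64 = (1 << 64) - 1


def sext16(disp: int) -> int:
    """Sign-extend a 16-bit displacement field to a Python integer."""
    disp &= 0xFFFF
    return disp - 0x10000 if disp & 0x8000 else disp


def effective_address(rbv: int, disp: int) -> int:
    """va = Rbv + SEXT(disp), assumed to wrap to 64 bits."""
    return (rbv + sext16(disp)) & MASK64


def is_naturally_aligned(va: int, size: int) -> bool:
    """True if va is a multiple of the datum size in bytes (4 or 8 here)."""
    return va % size == 0


if __name__ == "__main__":
    va = effective_address(0x1000, 0xFFF8)      # displacement field encodes -8
    print(hex(va), is_naturally_aligned(va, 4))
```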

Load G_floating
Format:
    LDG     Fa.wg,disp.ab(Rb.ab)            !Memory format

Operation:
    va ← {Rbv + SEXT(disp)}
    Fa ← (va)<15:0> || (va)<31:16> || (va)<47:32> || (va)<63:48>

Exceptions:
    Access Violation
    Fault on Read
    Alignment
    Translation Not Valid

Instruction mnemonics:
    LDG     Load G_floating (Load D_floating)

Qualifiers:
    None

Description:

LDG fetches a G_floating (or D_floating) datum from memory and writes it to register Fa. If the
data is not naturally aligned, an alignment exception is generated.
The virtual address is computed by adding register Rb to the sign-extended 16-bit displacement.
The source operand is fetched from memory, the bytes are reordered to conform to the
G_floating register format (also conforming to the D_floating register format), and the result is
then written to register Fa.
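
The reordering performed by LDG amounts to reversing the order of the four 16-bit words of the quadword. A minimal Python sketch (the function names are illustrative):

```python
def swap_words(q: int) -> int:
    """Reverse the order of the four 16-bit words in a 64-bit value."""
    w = [(q >> (16 * i)) & 0xFFFF for i in range(4)]   # w[0] is bits <15:0>
    return (w[0] << 48) | (w[1] << 32) | (w[2] << 16) | w[3]


def ldg(mem_quadword: int) -> int:
    """Register image produced by LDG from the quadword read from memory."""
    return swap_words(mem_quadword)


if __name__ == "__main__":
    print(f"{ldg(0x1111222233334444):016X}")
```

Because the permutation is its own inverse, applying it twice returns the original quadword, which is consistent with STG using the same word order in the other direction.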

Load S_floating
Format:
    LDS     Fa.ws,disp.ab(Rb.ab)            !Memory format

Operation:
    va ← {Rbv + SEXT(disp)}
    Fa ← (va)<31> || MAP_S((va)<30:23>) || (va)<22:0> || 0<28:0>

Exceptions:
    Access Violation
    Fault on Read
    Alignment
    Translation Not Valid

Instruction mnemonics:
    LDS     Load S_floating (Load Longword Integer)

Qualifiers:
    None

Description:

LDS fetches a longword (integer or S_floating) from memory and writes it to register Fa. If the
data is not naturally aligned, an alignment exception is generated.
The 8-bit memory-format exponent is expanded to an 11-bit register-format exponent according
to Table 2-2.
The virtual address is computed by adding register Rb to the sign-extended 16-bit displacement.
The source operand is fetched from memory, is zero-extended in the low-order longword, and
then written to register Fa.

Notes:
• Longword integers in floating registers are stored in bits <63:62,58:29>, with bits <61:59>
ignored and zeros in bits <28:0>.
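
The LDS operation can be modeled in Python as shown below. MAP_S is defined by Table 2-2, which lies outside this section, so the mapping coded in map_s is an assumption: all-ones and all-zeros exponents map to all-ones and all-zeros, and otherwise the high exponent bit is kept, three copies of its complement are inserted, and the low seven bits follow. Under that assumption the register image of a normal S_floating value equals the IEEE double-precision encoding of the same number, which the example uses as a check via the standard struct module.

```python
import struct


def map_s(exp8: int) -> int:
    """Assumed MAP_S: expand an 8-bit S_floating exponent to 11 bits."""
    if exp8 == 0xFF:
        return 0x7FF
    if exp8 == 0x00:
        return 0x000
    if exp8 & 0x80:
        return 0x400 | (exp8 & 0x7F)    # 1 000 xxxxxxx
    return 0x380 | (exp8 & 0x7F)        # 0 111 xxxxxxx


def lds(mem32: int) -> int:
    """Fa <- (va)<31> || MAP_S((va)<30:23>) || (va)<22:0> || 0<28:0>."""
    sign = (mem32 >> 31) & 1
    exp = map_s((mem32 >> 23) & 0xFF)
    frac = mem32 & 0x7FFFFF
    return (sign << 63) | (exp << 52) | (frac << 29)


def f32_bits(x: float) -> int:
    """Bit pattern of x rounded to IEEE single precision."""
    return struct.unpack(">I", struct.pack(">f", x))[0]


def f32_value(bits: int) -> float:
    """Value of an IEEE single-precision bit pattern."""
    return struct.unpack(">f", struct.pack(">I", bits))[0]


def f64_bits(x: float) -> int:
    """Bit pattern of x as an IEEE double."""
    return struct.unpack(">Q", struct.pack(">d", x))[0]


if __name__ == "__main__":
    for x in (1.0, -2.5, 3.0e38):
        mem = f32_bits(x)
        print(f"{x!r}: memory {mem:08X} -> register {lds(mem):016X}")
```

The low 29 bits of every result are zero, matching the canonical single-precision register form described earlier in this chapter.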

Load T_floating
Format:
    LDT     Fa.wt,disp.ab(Rb.ab)            !Memory format

Operation:
    va ← {Rbv + SEXT(disp)}
    Fa ← (va)<63:0>

Exceptions:
    Access Violation
    Fault on Read
    Alignment
    Translation Not Valid

Instruction mnemonics:
    LDT     Load T_floating (Load Quadword Integer)

Qualifiers:
    None

Description:

LDT fetches a quadword (integer or T_floating) from memory and writes it to register Fa. If the
data is not naturally aligned, an alignment exception is generated.
The virtual address is computed by adding register Rb to the sign-extended 16-bit displacement.
The source operand is fetched from memory and written to register Fa.

Store F_floating
Format:
    STF     Fa.rf,disp.ab(Rb.ab)            !Memory format

Operation:
    va ← {Rbv + SEXT(disp)}
    (va)<31:0> ← Fav<44:29> || Fav<63:62> || Fav<58:45>

Exceptions:
    Access Violation
    Fault on Write
    Alignment
    Translation Not Valid

Instruction mnemonics:
    STF     Store F_floating

Qualifiers:
    None

Description:

STF stores an F_floating datum from Fa to memory. If the data is not naturally aligned, an
alignment exception is generated.
The virtual address is computed by adding register Rb to the sign-extended 16-bit displacement.
The bits of the source operand are fetched from register Fa, the bits are reordered to conform to
F_floating memory format, and the result is then written to memory. Bits <61:59> and <28:0> of
Fa are ignored. No checking is done.

Store G_floating
Format:
    STG     Fa.rg,disp.ab(Rb.ab)            !Memory format

Operation:
    va ← {Rbv + SEXT(disp)}
    (va)<63:0> ← Fav<15:0> || Fav<31:16> || Fav<47:32> || Fav<63:48>

Exceptions:
    Access Violation
    Fault on Write
    Alignment
    Translation Not Valid

Instruction mnemonics:
    STG     Store G_floating (Store D_floating)

Qualifiers:
    None

Description:

STG stores a G_floating (or D_floating) datum from Fa to memory. If the data is not naturally
aligned, an alignment exception is generated.
The virtual address is computed by adding register Rb to the sign-extended 16-bit displacement.
The source operand is fetched from register Fa, the bytes are reordered to conform to the
G_floating memory format (also conforming to the D_floating memory format), and the result is
then written to memory.

Store S_floating
Format:
    STS     Fa.rs,disp.ab(Rb.ab)            !Memory format

Operation:
    va ← {Rbv + SEXT(disp)}
    (va)<31:0> ← Fav<63:62> || Fav<58:29>

Exceptions:
    Access Violation
    Fault on Write
    Alignment
    Translation Not Valid

Instruction mnemonics:
    STS     Store S_floating (Store Longword Integer)

Qualifiers:
    None

Description:

STS stores a longword (integer or S_floating) datum from Fa to memory. If the data is not
naturally aligned, an alignment exception is generated.
The virtual address is computed by adding register Rb to the sign-extended 16-bit displacement.
The bits of the source operand are fetched from register Fa, the bits are reordered to conform to
S_floating memory format, and the result is then written to memory. Bits <61:59> and <28:0> of
Fa are ignored. No checking is done.
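
The bit selection performed by STS can be sketched in Python. The example assumes that the T_floating register format coincides with the IEEE double encoding that the struct module produces, so a double whose value fits single precision serves as a canonical register image; the helper names are illustrative.

```python
import struct


def sts(fav: int) -> int:
    """(va)<31:0> <- Fav<63:62> || Fav<58:29>; other register bits are ignored."""
    top2 = (fav >> 62) & 0x3
    low30 = (fav >> 29) & 0x3FFFFFFF
    return (top2 << 30) | low30


def f64_bits(x: float) -> int:
    """Bit pattern of x as an IEEE double (used here as a register image)."""
    return struct.unpack(">Q", struct.pack(">d", x))[0]


def f32_bits(x: float) -> int:
    """Bit pattern of x as an IEEE single."""
    return struct.unpack(">I", struct.pack(">f", x))[0]


if __name__ == "__main__":
    for x in (1.5, -2.5):
        print(f"{x!r}: register {f64_bits(x):016X} -> memory {sts(f64_bits(x)):08X}")
```

Setting any of register bits <61:59> or <28:0> does not change the stored longword, which is the sense in which those bits are ignored.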

Store T_floating
Format:
    STT     Fa.rt,disp.ab(Rb.ab)            !Memory format

Operation:
    va ← {Rbv + SEXT(disp)}
    (va)<63:0> ← Fav<63:0>

Exceptions:
    Access Violation
    Fault on Write
    Alignment
    Translation Not Valid

Instruction mnemonics:
    STT     Store T_floating (Store Quadword Integer)

Qualifiers:
    None

Description:

STT stores a quadword (integer or T_floating) datum from Fa to memory. If the data is not
naturally aligned, an alignment exception is generated.
The virtual address is computed by adding register Rb to the sign-extended 16-bit displacement.
The source operand is fetched from register Fa and written to memory.

• Branch Format Floating-Point Instructions
Alpha provides six floating conditional branch instructions. These branch-format instructions test
the value of a floating-point register and conditionally change the PC.
They do not interpret the bits tested in any way; specifically, they do not trap on non-finite values.
The test is based on the sign bit and whether the rest of the register is all zero bits. All 64 bits of
the register are tested. The test is independent of the format of the operand in the register. Both
plus and minus zero are equal to zero. A non-zero value with a sign of zero is greater than zero. A
non-zero value with a sign of one is less than zero. No reserved operand or non-finite checking is
done.
The floating-point branch operations are summarized in Table 4-10.
Table 4-10 • Floating-Point Branch Instructions Summary

Mnemonic    Operation                                    Subset

FBEQ        Floating Branch if Equal                     Both
FBGE        Floating Branch if Greater Than or Equal     Both
FBGT        Floating Branch if Greater Than              Both
FBLE        Floating Branch if Less Than or Equal        Both
FBLT        Floating Branch if Less Than                 Both
FBNE        Floating Branch if Not Equal                 Both

Conditional Branch
Format:
    FBxx    Fa.rq,disp.al                   !Branch format

Operation:
    {update PC}
    va ← PC + {4*SEXT(disp)}
    IF TEST(Fav, Condition_based_on_Opcode) THEN
        PC ← va

Exceptions:
    None

Instruction mnemonics:
    FBEQ    Floating Branch if Equal
    FBGE    Floating Branch if Greater Than or Equal
    FBGT    Floating Branch if Greater Than
    FBLE    Floating Branch if Less Than or Equal
    FBLT    Floating Branch if Less Than
    FBNE    Floating Branch if Not Equal

Qualifiers:
    None

Description:

Register Fa is tested. If the specified relationship is true, the PC is loaded with the target virtual
address; otherwise, execution continues with the next sequential instruction.
The displacement is treated as a signed longword offset. This means it is shifted left two bits (to
address a longword boundary), sign-extended to 64 bits, and added to the updated PC to form
the target virtual address.
The conditional branch instructions are PC-relative only. The 21-bit signed displacement gives a
forward/backward branch distance of +/- 1M instructions.
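
The target-address calculation can be sketched in Python as follows. The sketch assumes that the updated PC is the address of the instruction after the branch (branch address plus 4, since each instruction occupies one longword) and that the sum wraps to 64 bits; the function names are illustrative.

```python
MASK64 = (1 << 64) - 1


def sext21(disp: int) -> int:
    """Sign-extend the 21-bit branch displacement field."""
    disp &= 0x1FFFFF
    return disp - 0x200000 if disp & 0x100000 else disp


def branch_target(branch_pc: int, disp: int) -> int:
    """va = updated PC + 4*SEXT(disp)."""
    updated_pc = branch_pc + 4          # assumed: address of the next instruction
    return (updated_pc + 4 * sext21(disp)) & MASK64


if __name__ == "__main__":
    print(hex(branch_target(0x1000, 0x1FFFFF)))   # displacement field encodes -1
```

A displacement of -1 therefore branches back to the branch instruction itself, and a displacement of 0 falls through to the next instruction.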
Notes:

• To branch properly on non-finite operands, compare to F31, then branch on the result of the
compare.
• The largest negative integer (8000 0000 0000 0000 hexadecimal) is the same bit pattern as floating minus
zero, so it is treated as equal to zero by the branch instructions. To branch properly on the largest
negative integer, convert it to floating or move it to an integer register and do an integer branch.
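
The register test used by the floating branches depends only on the sign bit and on whether the remaining 63 bits are all zero. A minimal Python model follows; the function name fp_test and the two-letter condition strings are illustrative, chosen to mirror the xx suffix of FBxx.

```python
import struct

MASK64 = (1 << 64) - 1
SIGN = 1 << 63


def fp_test(fav: int, cond: str) -> bool:
    """Evaluate an FBxx-style condition on a 64-bit register image."""
    is_zero = (fav & MASK64 & ~SIGN) == 0      # plus zero or minus zero
    negative = bool(fav & SIGN) and not is_zero
    positive = not negative and not is_zero
    return {
        "EQ": is_zero,
        "NE": not is_zero,
        "LT": negative,
        "LE": negative or is_zero,
        "GT": positive,
        "GE": positive or is_zero,
    }[cond]


def bits(x: float) -> int:
    """IEEE double bit pattern of x, used here as a T_floating register image."""
    return struct.unpack(">Q", struct.pack(">d", x))[0]


if __name__ == "__main__":
    minus_zero = 0x8000000000000000   # also the largest negative quadword
    print(fp_test(minus_zero, "EQ"), fp_test(minus_zero, "LT"))
```

The example shows why the largest negative quadword integer cannot be distinguished from zero by these branches: its bit pattern has only the sign bit set.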

• Floating-Point Operate Format Instructions
The floating-point bit-operate instructions perform copy and integer convert operations on 64-bit
register values. The bit-operate instructions do not interpret the bits moved in any way; specifically, they do not trap on non-finite values.
The floating-point arithmetic-operate instructions perform add, subtract, multiply, divide, compare, and floating convert operations on 64-bit register values in one of the four specified floating
formats.
Each instruction specifies the source and destination formats of the values, as well as the
rounding mode and trapping mode to be used. These instructions use the Floating-point Operate
format.
The floating-point operate instructions are summarized in Table 4-11.
Table 4-11 • Floating-Point Operate Instructions Summary

Mnemonic    Operation                                    Subset

Bit and FPCR Operations

CPYS        Copy Sign                                    Both
CPYSE       Copy Sign and Exponent                       Both
CPYSN       Copy Sign Negate                             Both
CVTLQ       Convert Longword to Quadword                 Both
CVTQL       Convert Quadword to Longword                 Both
FCMOVxx     Floating Conditional Move                    Both
MF_FPCR     Move from Floating-point Control Register    Both
MT_FPCR     Move to Floating-point Control Register      Both

Arithmetic Operations

ADDF        Add F_floating                               VAX
ADDG        Add G_floating                               VAX
ADDS        Add S_floating                               IEEE
ADDT        Add T_floating                               IEEE
CMPGxx      Compare G_floating                           VAX
CMPTxx      Compare T_floating                           IEEE
CVTDG       Convert D_floating to G_floating             VAX
CVTGD       Convert G_floating to D_floating             VAX
CVTGF       Convert G_floating to F_floating             VAX
CVTGQ       Convert G_floating to Quadword               VAX
CVTQF       Convert Quadword to F_floating               VAX
CVTQG       Convert Quadword to G_floating               VAX
CVTQS       Convert Quadword to S_floating               IEEE
CVTQT       Convert Quadword to T_floating               IEEE
CVTTQ       Convert T_floating to Quadword               IEEE
CVTTS       Convert T_floating to S_floating             IEEE
DIVF        Divide F_floating                            VAX
DIVG        Divide G_floating                            VAX
DIVS        Divide S_floating                            IEEE
DIVT        Divide T_floating                            IEEE
MULF        Multiply F_floating                          VAX
MULG        Multiply G_floating                          VAX
MULS        Multiply S_floating                          IEEE
MULT        Multiply T_floating                          IEEE
SUBF        Subtract F_floating                          VAX
SUBG        Subtract G_floating                          VAX
SUBS        Subtract S_floating                          IEEE
SUBT        Subtract T_floating                          IEEE

Copy Sign
Format:
    CPYSy   Fa.rq,Fb.rq,Fc.wq               !Floating-point Operate format

Operation:
    CASE
        CPYS:   Fc ← Fav<63> || Fbv<62:0>
        CPYSN:  Fc ← NOT(Fav<63>) || Fbv<62:0>
        CPYSE:  Fc ← Fav<63:52> || Fbv<51:0>
    ENDCASE

Exceptions:
    None

Instruction mnemonics:
    CPYS    Copy Sign
    CPYSE   Copy Sign and Exponent
    CPYSN   Copy Sign Negate

Qualifiers:
    None

Description:

For CPYS and CPYSN, the sign bit of Fa is fetched (and complemented in the case of CPYSN) and
concatenated with the exponent and fraction bits from Fb; the result is stored in Fc.
For CPYSE, the sign and exponent bits from Fa are fetched and concatenated with the fraction
bits from Fb; the result is stored in Fc.
No checking of the operands is performed.
Notes:

• Register moves can be performed using CPYS Fx,Fx,Fy. Floating-point absolute value can be
done using CPYS F31,Fx,Fy. Floating-point negation can be done using CPYSN Fx,Fx,Fy.
Floating values can be scaled to a known range by using CPYSE.
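
The three copy-sign operations and the idioms in the note above can be modeled in Python on 64-bit register images. The sketch assumes the T_floating register format matches the IEEE double encoding produced by the struct module, and models F31 as the value 0; the helper names are illustrative.

```python
import struct

SIGN = 1 << 63
EXP_AND_SIGN = 0xFFF << 52        # bits <63:52>


def cpys(fav: int, fbv: int) -> int:
    """Fc <- Fav<63> || Fbv<62:0>."""
    return (fav & SIGN) | (fbv & (SIGN - 1))


def cpysn(fav: int, fbv: int) -> int:
    """Fc <- NOT(Fav<63>) || Fbv<62:0>."""
    return (~fav & SIGN) | (fbv & (SIGN - 1))


def cpyse(fav: int, fbv: int) -> int:
    """Fc <- Fav<63:52> || Fbv<51:0>."""
    return (fav & EXP_AND_SIGN) | (fbv & ((1 << 52) - 1))


def bits(x: float) -> int:
    """IEEE double bit pattern of x, used as a T_floating register image."""
    return struct.unpack(">Q", struct.pack(">d", x))[0]


def value(b: int) -> float:
    """Floating value of a 64-bit register image."""
    return struct.unpack(">d", struct.pack(">Q", b))[0]


if __name__ == "__main__":
    F31 = 0                                        # modeled as reading zero
    x = bits(-2.5)
    print(value(cpys(F31, x)))                     # absolute value idiom
    print(value(cpysn(x, x)))                      # negation idiom
    print(value(cpyse(bits(1.0), bits(6.0))))      # fraction of 6.0, exponent of 1.0
```

The CPYSE call illustrates scaling to a known range: it keeps the fraction of the second operand and forces the sign and exponent of the first, so any normal value is mapped into the binade of the chosen reference.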

Convert Integer to Integer
Format:
    CVTxy   Fb.rq,Fc.wx                     !Floating-point Operate format

Operation:
    CASE
        CVTQL:  Fc ← Fbv<31:30> || 0<2:0> || Fbv<29:0> || 0<28:0>
        CVTLQ:  Fc ← SEXT(Fbv<63:62> || Fbv<58:29>)
    ENDCASE

Exceptions:
    Integer Overflow, CVTQL only

Instruction mnemonics:
    CVTLQ   Convert Longword to Quadword
    CVTQL   Convert Quadword to Longword

Qualifiers:
    Trapping:   Software (/S)
                Integer Overflow Enable (/V) (CVTQL only)

Description:

The two's-complement operand in register Fb is converted to a two's-complement result and
written to register Fc.
The conversion from quadword to longword is a repositioning of the low 32 bits of the operand,
with zero fill and optional integer overflow checking. Integer overflow occurs if Fb is outside the
range -2**31..2**31-1. If integer overflow occurs, the truncated result is stored in Fc, and an
arithmetic trap is taken if enabled.
The conversion from longword to quadword is a repositioning of 32 bits of the operand, with
sign extension.
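
The two repositioning operations can be sketched in Python directly from the Operation section. Register images are modeled as unsigned 64-bit integers; the function names and the returned overflow flag are illustrative, and trap delivery is not modeled.

```python
MASK64 = (1 << 64) - 1


def cvtql(fbv: int):
    """Fc <- Fbv<31:30> || 0<2:0> || Fbv<29:0> || 0<28:0>.

    Returns (fc, overflow), where overflow is True if the quadword in Fb
    is outside -2**31 .. 2**31-1.
    """
    top2 = (fbv >> 30) & 0x3
    low30 = fbv & 0x3FFFFFFF
    fc = (top2 << 62) | (low30 << 29)
    signed = fbv - (1 << 64) if fbv >> 63 else fbv
    overflow = not (-(1 << 31) <= signed <= (1 << 31) - 1)
    return fc, overflow


def cvtlq(fbv: int) -> int:
    """Fc <- SEXT(Fbv<63:62> || Fbv<58:29>)."""
    long32 = (((fbv >> 62) & 0x3) << 30) | ((fbv >> 29) & 0x3FFFFFFF)
    if long32 & 0x80000000:
        long32 -= 1 << 32               # sign-extend the 32-bit value
    return long32 & MASK64


if __name__ == "__main__":
    q = -123456 & MASK64
    fc, overflow = cvtql(q)
    print(f"{fc:016X}", overflow, cvtlq(fc) == q)
```

For any quadword within the longword range, converting with CVTQL and back with CVTLQ returns the original value, which is a convenient sanity check on the bit positions.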

Floating-Point Conditional Move
Format:
    FCMOVxx Fa.rq,Fb.rq,Fc.wq               !Floating-point Operate format

Operation:
    IF TEST(Fav, Condition_based_on_Opcode) THEN
        Fc ← Fbv

Exceptions:
    None

Instruction mnemonics:
    FCMOVEQ     FCMOVE if Register Equal to Zero
    FCMOVGE     FCMOVE if Register Greater Than or Equal to Zero
    FCMOVGT     FCMOVE if Register Greater Than Zero
    FCMOVLE     FCMOVE if Register Less Than or Equal to Zero
    FCMOVLT     FCMOVE if Register Less Than Zero
    FCMOVNE     FCMOVE if Register Not Equal to Zero

Qualifiers:
    None

Description:

Register Fa is tested. If the specified relationship is true, register Fb is written to register Fc;
otherwise, the move is suppressed and register Fc is unchanged. The test is based on the sign bit
and whether the rest of the register is all zero bits, as described for floating branches in Branch
Format Floating-Point Instructions in this chapter.

Notes:
Except that it is likely in many implementations to be substantially faster, the instruction:

FCMOVxx Fa,Fb,Fc

is exactly equivalent to:

       FByy  Fa,label      ! yy = NOT xx
       CPYS  Fb,Fb,Fc
label:

For example, a branchless sequence for:

F1=MAX(F1,F2)

is:

CMPxLT  F1,F2,F3      ! F3=one if F1<F2; x=F/G/S/T
FCMOVNE F3,F2,F1      ! Move F2 to F1 if F1<F2
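
The MAX sequence above can be modeled in Python to show how the compare result drives the conditional move. This is a sketch under stated assumptions: the helper names (t_bits, fcmov_test, cmptlt, fmax_branchless) are hypothetical, the T_floating variant of the compare is used, and the condition table is an assumed reading of "sign bit and whether the rest of the register is all zero bits" rather than a quotation of the floating branch definition.

```python
import struct

SIGN_BIT = 1 << 63
TRUE_BITS = 0x4000000000000000  # non-zero value written by the compare when true

def t_bits(x: float) -> int:
    """Return the 64-bit IEEE double (T_floating) pattern of x."""
    return struct.unpack(">Q", struct.pack(">d", x))[0]

def fcmov_test(fav: int, cond: str) -> bool:
    """Assumed FCMOVxx test: uses only the sign bit and whether bits <62:0> are zero."""
    negative = bool(fav & SIGN_BIT)
    zero = (fav & (SIGN_BIT - 1)) == 0
    return {
        "EQ": zero,
        "NE": not zero,
        "LT": negative and not zero,
        "LE": negative or zero,
        "GT": not negative and not zero,
        "GE": not negative or zero,
    }[cond]

def cmptlt(fa: float, fb: float) -> int:
    """Model of CMPTLT: write the true pattern or a true zero."""
    return TRUE_BITS if fa < fb else 0

def fmax_branchless(f1: float, f2: float) -> float:
    f3 = cmptlt(f1, f2)          # CMPTLT  F1,F2,F3
    if fcmov_test(f3, "NE"):     # FCMOVNE F3,F2,F1 (a data-dependent move in hardware)
        f1 = f2
    return f1
```

The Python if statement stands in for the conditional move; on the machine no branch is executed.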

Move from/to Floating-Point Control Register
Format:
Mx_FPCR Fa.rq,Fa.rq,Fa.wq    !Floating-point Operate format

Operation:
CASE
  MT_FPCR: FPCR ← Fav
  MF_FPCR: Fa ← FPCR
ENDCASE

Exceptions:

None
Instruction mnemonics:

MF_FPCR

Move from Floating-point Control Register

MT_FPCR

Move to Floating-point Control Register

Qualifiers:

None
Description:

The Floating-point Control Register (FPCR) is read into a floating-point register (MF_FPCR) or
written from a floating-point register (MT_FPCR). The floating-point register to be used is
specified by the Fa, Fb, and Fc fields all pointing to the same floating-point register. If the
Fa, Fb, and Fc fields do not all point to the same floating-point register, then it is
UNPREDICTABLE which register is used.
The use of these instructions and the FPCR are described in FPCR Register and Dynamic Rounding
Mode in this chapter.

VAX Floating Add
Format:
ADDx Fa.rx,Fb.rx,Fc.wx    !Floating-point Operate format

Operation:
Fc ← Fav + Fbv

Exceptions:

Invalid Operation
Overflow
Underflow
Instruction mnemonics:

ADDF

Add F_floating

ADDG

Add G_floating

Qualifiers:

Rounding:

Chopped (/C)

Trapping:

Software (/S)
Underflow Enable (/U)

Description:

Register Fa is added to register Fb, and the sum is written to register Fc.
The sum is rounded or chopped to the specified precision, and then the corresponding range is
checked for overflow/underflow. The single-precision operation on canonical single-precision
values produces a canonical single-precision result.

An invalid operation trap is signaled if either operand has exp=0 and is not a true zero (that is,
VAX reserved operands and dirty zeros trap). The contents of Fc are UNPREDICTABLE if this
occurs.
See Floating-Point Trapping Modes in this chapter for details of the stored result on overflow or
underflow.

IEEE Floating Add
Format:
ADDx Fa.rx,Fb.rx,Fc.wx    !Floating-point Operate format

Operation:
Fc ← Fav + Fbv

Exceptions:

Invalid Operation
Overflow
Underflow
Inexact Result
Instruction mnemonics:

ADDS

Add S_floating

ADDT

Add T_floating

Qualifiers:

Rounding:

Dynamic (/D)
Minus infinity (/M)
Chopped (/C)

Trapping:

Software (/S)
Underflow Enable (/U)
Inexact Enable (/I)

Description:

Register Fa is added to register Fb, and the sum is written to register Fc.
The sum is rounded to the specified precision, and then the corresponding range is checked for
overflow/underflow. The single-precision operation on canonical single-precision values produces
a canonical single-precision result.
An invalid operation trap is signaled if either operand has exp=0 and a non-zero fraction (IEEE
denormals trap), or if exp=all-ones (IEEE NaNs and infinities trap).
The contents of Fc are UNPREDICTABLE if this occurs.
See Floating-Point Trapping Modes in this chapter for details of the stored result on overflow,
underflow, or inexact result.

VAX Floating Compare
Format:
CMPGyy Fa.rg,Fb.rg,Fc.wq    !Floating-point Operate format

Operation:
IF Fav SIGNED_RELATION Fbv THEN
  Fc ← 4000 0000 0000 0000₁₆
ELSE
  Fc ← 0000 0000 0000 0000₁₆

Exceptions:

Invalid Operation
Instruction mnemonics:

CMPGEQ

Compare G_floating Equal

CMPGLE

Compare G_floating Less Than or Equal

CMPGLT

Compare G_floating Less Than

Qualifiers:

Trapping:

Software (/S)

Description:

The two operands in Fa and Fb are compared. If the relationship specified by the qualifier is true,
a non-zero floating value (0.5) is written to register Fc; otherwise, a true zero is written to Fc.
Comparisons are exact and never overflow or underflow. Three mutually exclusive relations are
possible: less than, equal, and greater than.
An invalid operation trap is signaled if either operand has exp=0 and is not a true zero (that is,
VAX reserved operands and dirty zeros trap). The contents of Fc are UNPREDICTABLE if this
occurs.
Notes:

• Compare Less Than A,B is the same as Compare Greater Than B,A; Compare Less Than or
Equal A,B is the same as Compare Greater Than or Equal B,A. Therefore, only the less-than
operations are included.

IEEE Floating Compare
Format:
CMPTyy Fa.rx,Fb.rx,Fc.wq    !Floating-point Operate format

Operation:
IF Fav SIGNED_RELATION Fbv THEN
  Fc ← 4000 0000 0000 0000₁₆
ELSE
  Fc ← 0000 0000 0000 0000₁₆

Exceptions:

Invalid Operation
Instruction mnemonics:

CMPTEQ

Compare T_floating Equal

CMPTLE

Compare T_floating Less Than or Equal

CMPTLT

Compare T_floating Less Than

CMPTUN

Compare T_floating Unordered

Qualifiers:

Trapping:

Software (/S)

Description:

The two operands in Fa and Fb are compared. If the relationship specified by the qualifier is true,
a non-zero floating value (2.0) is written to register Fc; otherwise, a true zero is written to Fc.
Comparisons are exact and never overflow or underflow. Four mutually exclusive relations are
possible: less than, equal, greater than, and unordered. The unordered relation is true if one or
both operands are NaN. (This behavior must be provided by a software trap handler, since NaNs
trap.) Comparisons ignore the sign of zero, so +0 = -0 .
An invalid operation trap is signaled if either operand has exp=0 and a non-zero fraction (IEEE
denormals trap), or if exp=all-ones and a non-zero fraction (IEEE NaNs). The contents of Fc are
UNPREDICTABLE if this occurs.
Comparisons with plus and minus infinity execute normally and do not take an invalid operation
trap.
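
The following Python sketch illustrates why the true result is described as 2.0 and how the four relations behave. It is a model of the final architectural result only (after any software trap handling for NaNs), and the names bits_to_t and cmpt are hypothetical.

```python
import math
import struct

TRUE_BITS = 0x4000000000000000  # pattern written to Fc when the relation holds

def bits_to_t(bits: int) -> float:
    """Interpret a 64-bit pattern as an IEEE double (T_floating) value."""
    return struct.unpack(">d", struct.pack(">Q", bits))[0]

def cmpt(fa: float, fb: float, relation: str) -> int:
    """Model CMPTEQ/CMPTLT/CMPTLE/CMPTUN, returning the Fc bit pattern."""
    unordered = math.isnan(fa) or math.isnan(fb)
    outcome = {
        "EQ": fa == fb,      # +0 and -0 compare equal
        "LT": fa < fb,
        "LE": fa <= fb,
        "UN": unordered,     # true if one or both operands are NaN
    }[relation]
    return TRUE_BITS if outcome else 0
```

Decoding TRUE_BITS with bits_to_t gives 2.0, which matches the value named in the description.
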
Notes:

Convert VAX Floating to Integer
Format:
CVTGQ Fb.rx,Fc.wq    !Floating-point Operate format

Operation:
Fc ← {conversion of Fbv}

Exceptions:

Invalid Operation
Integer Overflow
Instruction mnemonics:

CVTGQ

Convert G_floating to Quadword

Qualifiers:

Rounding:

Chopped (/C)

Trapping:

Software (/S)
Integer Overflow Enable (/V)

Description:

The floating operand in register Fb is converted to a two's-complement quadword number and
written to register Fc. The conversion aligns the operand fraction with the binary point just to the
right of bit zero, rounds as specified, and complements the result if negative.

An invalid operation trap is signaled if the operand has exp=0 and is not a true zero (that is, VAX
reserved operands and dirty zeros trap). The contents of Fc are UNPREDICTABLE if this occurs.
See Floating-Point Trapping Modes in this chapter for details of the stored result on integer
overflow.

Convert Integer to VAX Floating
Format:
CVTQy Fb.rq,Fc.wx    !Floating-point Operate format

Operation:
Fc ← {conversion of Fbv<63:0>}

Exceptions:

None
Instruction mnemonics:

CVTQF

Convert Quadword to F_floating

CVTQG

Convert Quadword to G_floating

Qualifiers:

Rounding:

Chopped (/C)

Description:

The two's-complement quadword operand in register Fb is converted to a single- or
double-precision floating result and written to register Fc. The conversion complements a
number if negative, normalizes it, rounds to the target precision, and packs the result with an
appropriate sign and exponent field.

Convert VAX Floating to VAX Floating
Format:
CVTxy Fb.rx,Fc.wx    !Floating-point Operate format

Operation:
Fc ← {conversion of Fbv}

Exceptions:

Invalid Operation
Overflow
Underflow
Instruction mnemonics:

CVTDG

Convert D_floating to G_floating

CVTGD

Convert G_floating to D_floating

CVTGF

Convert G_floating to F_floating

Qualifiers:

Rounding:

Chopped (/C)

Trapping:

Software (/S)
Underflow Enable (/U)

Description:

The floating operand in register Fb is converted to the specified alternate floating format and
written to register Fc.
An invalid operation trap is signaled if the operand has exp=0 and is not a true zero (that is, VAX
reserved operands and dirty zeros trap). The contents of Fc are UNPREDICTABLE if this occurs.
See Floating-Point Trapping Modes in this chapter for details of the stored result on overflow or
underflow.
Notes:

• The only arithmetic operations on D_floating values are conversions to and from G_floating. The
conversion to G_floating rounds or chops as specified, removing three fraction bits. The
conversion from G_floating to D_floating adds three low-order zeros as fraction bits, then the 8-bit
exponent range is checked for overflow/underflow.
• The conversion from G_floating to F_floating rounds or chops to single precision, then the 8-bit
exponent range is checked for overflow/underflow.
• No conversion from F_floating to G_floating is required, since F_floating values are always
stored in registers as equivalent G_floating values.

Convert IEEE Floating to Integer
Format:
CVTTQ Fb.rx,Fc.wq    !Floating-point Operate format

Operation:
Fc ← {conversion of Fbv}

Exceptions:

Invalid Operation
Inexact Result
Integer Overflow
Instruction mnemonics:

CVTTQ

Convert T_floating to Quadword

Qualifiers:

Rounding:

Dynamic (/D)
Minus infinity (/M)
Chopped (/C)

Trapping:

Software (/S)
Integer Overflow Enable (/V)
Inexact Enable (/I)

Description:

The floating operand in register Fb is converted to a two's-complement number and written to
register Fc. The conversion aligns the operand fraction with the binary point just to the right of
bit zero, rounds as specified, and complements the result if negative.

An invalid operation trap is signaled if the operand has exp=0 and a non-zero fraction (IEEE
denormals trap), or if exp=all-ones (IEEE NaNs and infinities trap).
The contents of Fc are UNPREDICTABLE if this occurs.
See Floating-Point Trapping Modes in this chapter for details of the stored result on integer
overflow and inexact result.
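
As a rough illustration of how the rounding qualifier changes the result, the following Python sketch models the conversion for three rounding choices. It rests on several assumptions: the function name cvttq and the mode strings are hypothetical, "normal" stands for round-to-nearest-even when no rounding qualifier is given, dynamic rounding is not modeled, and returning the low 64 bits on overflow is an assumption made here (the stored result is defined in Floating-Point Trapping Modes).

```python
import math

MASK64 = (1 << 64) - 1
INT64_MIN, INT64_MAX = -2**63, 2**63 - 1

def cvttq(value: float, rounding: str = "normal"):
    """Return (quadword bits, inexact, integer_overflow) for a finite input."""
    if math.isnan(value) or math.isinf(value):
        raise ValueError("invalid operation: NaN or infinity operand")
    if rounding == "chopped":            # /C: round toward zero
        rounded = math.trunc(value)
    elif rounding == "minus_infinity":   # /M: round toward minus infinity
        rounded = math.floor(value)
    elif rounding == "normal":           # assumed default: round to nearest even
        rounded = round(value)
    else:
        raise ValueError("unknown rounding mode")
    inexact = rounded != value
    overflow = not (INT64_MIN <= rounded <= INT64_MAX)
    return rounded & MASK64, inexact, overflow
```

For example, -2.5 converts to -2 when chopped and to -3 when rounded toward minus infinity; both results are inexact.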

Convert Integer to IEEE Floating
Format:
CVTQy Fb.rq,Fc.wx    !Floating-point Operate format

Operation:
Fc ← {conversion of Fbv<63:0>}

Exceptions:

Inexact Result
Instruction mnemonics:

CVTQS

Convert Quadword to S_floating

CVTQT

Convert Quadword to T_floating

Qualifiers:

Rounding:

Dynamic (/D)
Minus infinity (/M)
Chopped (/C)

Trapping:

Software (/S)
Inexact Enable (/I)

Description:

The two's-complement operand in register Fb is converted to a single- or double-precision
floating result and written to register Fc. The conversion complements a number if negative,
normalizes it, rounds to the target precision, and packs the result with an appropriate sign and
exponent field.
See Floating-Point Trapping Modes in this chapter for details of the stored result on inexact result.
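
The inexact condition arises because a quadword has more significant bits than a floating fraction can hold. The sketch below models CVTQT with round-to-nearest in Python; the names to_signed64 and cvtqt are hypothetical, and other rounding modes are not modeled.

```python
MASK64 = (1 << 64) - 1

def to_signed64(bits: int) -> int:
    """Interpret an unsigned 64-bit pattern as a two's-complement quadword."""
    return bits - (1 << 64) if bits & (1 << 63) else bits

def cvtqt(fbv: int):
    """Return (T_floating value, inexact) for a quadword register pattern."""
    value = to_signed64(fbv & MASK64)
    result = float(value)            # Python rounds int -> float to nearest
    inexact = int(result) != value   # True when low-order bits were rounded away
    return result, inexact
```

In this model every integer of magnitude up to 2**53 converts exactly, while 2**53 + 1 does not.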

Convert IEEE Floating to IEEE Floating
Format:
CVTTS Fb.rx,Fc.wx    !Floating-point Operate format

Operation:
Fc ← {conversion of Fbv}

Exceptions:

Invalid Operation
Overflow
Underflow
Inexact Result
Instruction mnemonics:

CVTTS

Convert T_floating to S_floating

Qualifiers:

Rounding:

Dynamic (/D)
Minus infinity (/M)
Chopped (/C)

Trapping:

Software (/S)
Underflow Enable (/U)
Inexact Enable (/I)

Description:

The floating operand in register Fb is converted to the specified alternate floating format and
written to register Fc.
An invalid operation trap is signaled if the operand has exp=0 and a non-zero fraction (IEEE
denormals trap), or if exp=all-ones (IEEE NaNs and infinities trap).
The contents of Fc are UNPREDICTABLE if this occurs.
See Floating-Point Trapping Modes in this chapter for details of the stored result on overflow,
underflow, or inexact result.
Notes:

• No conversion from S_floating to T_floating is required, since S_floating values are always stored
in registers as equivalent T_floating values.

VAX Floating Divide
Format:
DIVx Fa.rx,Fb.rx,Fc.wx    !Floating-point Operate format

Operation:
Fc ← Fav / Fbv

Exceptions:

Invalid Operation
Division by Zero
Overflow
Underflow
Instruction mnemonics:

DIVF

Divide F_floating

DIVG

Divide G_floating

Qualifiers:

Rounding:

Chopped (/C)

Trapping:

Software (/S)
Underflow Enable (/U)

Description:

The dividend operand in register Fa is divided by the divisor operand in register Fb, and the
quotient is written to register Fc.
The quotient is rounded or chopped to the specified precision and then the corresponding range
is checked for overflow/underflow. The single-precision operation on canonical single-precision
values produces a canonical single-precision result.

An invalid operation trap is signaled if either operand has exp=0 and is not a true zero (that is,
VAX reserved operands and dirty zeros trap). The contents of Fc are UNPREDICTABLE if this
occurs.
A division by zero trap is signaled if Fbv is zero. The contents of Fc are UNPREDICTABLE if this
occurs.
See Floating-Point Trapping Modes in this chapter for details of the stored result on overflow or
underflow.

IEEE Floating Divide
Format:
DIVx Fa.rx,Fb.rx,Fc.wx    !Floating-point Operate format

Operation:
Fc ← Fav / Fbv

Exceptions:

Invalid Operation
Division by Zero
Overflow
Underflow
Inexact Result
Instruction mnemonics:

DIVS

Divide S_floating

DIVT

Divide T_floating

Qualifiers:

Rounding:

Dynamic (/D)
Minus infinity (/M)
Chopped (/C)

Trapping:

Software (/S)
Underflow Enable (/U)
Inexact Enable (/I)

Description:

The dividend operand in register Fa is divided by the divisor operand in register Fb, and the
quotient is written to register Fc.
The quotient is rounded to the specified precision, and then the corresponding range is checked
for overflow/underflow. The single-precision operation on canonical single-precision values
produces a canonical single-precision result.

An invalid operation trap is signaled if either operand has exp=0 and a non-zero fraction (IEEE
denormals trap), or if exp=all-ones (IEEE NaNs and infinities trap).
The contents of Fc are UNPREDICTABLE if this occurs.
A division by zero trap is signaled if Fbv is zero. The contents of Fc are UNPREDICTABLE if this
occurs.
See Floating-Point Trapping Modes in this chapter for details of the stored result on overflow,
underflow, or inexact result.

VAX Floating Multiply
Format:
MULx Fa.rx,Fb.rx,Fc.wx    !Floating-point Operate format

Operation:
Fc ← Fav * Fbv

Exceptions:

Invalid Operation
Overflow
Underflow
Instruction mnemonics:

MULF

Multiply F_floating

MULG

Multiply G_floating

Qualifiers:

Rounding:

Chopped (/C)

Trapping:

Software (/S)
Underflow Enable (/U)

Description:

The multiplicand operand in register Fb is multiplied by the multiplier operand in register Fa,
and the product is written to register Fc.
The product is rounded or chopped to the specified precision, and then the corresponding range
is checked for overflow/underflow. The single-precision operation on canonical single-precision
values produces a canonical single-precision result.

An invalid operation trap is signaled if either operand has exp=0 and is not a true zero (that is,
VAX reserved operands and dirty zeros trap). The contents of Fc are UNPREDICTABLE if this
occurs.
See Floating-Point Trapping Modes in this chapter for details of the stored result on overflow or
underflow.

IEEE Floating Multiply
Format:
MULx Fa.rx,Fb.rx,Fc.wx    !Floating-point Operate format

Operation:
Fc ← Fav * Fbv

Exceptions:

Invalid Operation
Overflow
Underflow
Inexact Result
Instruction mnemonics:

MULS

Multiply S_floating

MULT

Multiply T_floating

Qualifiers:

Rounding:

Dynamic (/D)
Minus infinity (/M)
Chopped (/C)

Trapping:

Software (/S)
Underflow Enable (/U)
Inexact Enable (/I)

Description:

The multiplicand operand in register Fb is multiplied by the multiplier operand in register Fa,
and the product is written to register Fc.
The product is rounded to the specified precision, and then the corresponding range is checked
for overflow/underflow. The single-precision operation on canonical single-precision values
produces a canonical single-precision result.

VAX Floating Subtract
Format:
SUBx Fa.rx,Fb.rx,Fc.wx    !Floating-point Operate format

Operation:
Fc ← Fav - Fbv

Exceptions:

Invalid Operation
Overflow
Underflow
Instruction mnemonics:

SUBF

Subtract F_floating

SUBG

Subtract G_floating

Qualifiers:

Rounding:

Chopped (/C)

Trapping:

Software (/S)
Underflow Enable (/U)

Description:

The subtrahend operand in register Fb is subtracted from the minuend operand in register Fa,
and the difference is written to register Fc.
The difference is rounded or chopped to the specified precision, and then the corresponding
range is checked for overflow/underflow. The single-precision operation on canonical
single-precision values produces a canonical single-precision result.

IEEE Floating Subtract
Format:
SUBx Fa.rx,Fb.rx,Fc.wx    !Floating-point Operate format

Operation:
Fc ← Fav - Fbv

Exceptions:
Invalid Operation
Overflow
Underflow
Inexact Result

Instruction mnemonics:

SUBS

Subtract S_floating

SUBT

Subtract T_floating

Qualifiers:

Rounding:

Dynamic (/D)
Minus infinity (/M)
Chopped (/C)

Trapping:

Software (/S)
Underflow Enable (/U)
Inexact Enable (/I)

Description:
The subtrahend operand in register Fb is subtracted from the minuend operand in register Fa,
and the difference is written to register Fc.
The difference is rounded to the specified precision, and then the corresponding range is checked
for overflow/underflow. The single-precision operation on canonical single-precision values
produces a canonical single-precision result.

• Miscellaneous Instructions
Alpha provides the miscellaneous instructions shown in Table 4-12.
Table 4-12 · Miscellaneous Instructions Summary

Mnemonic    Operation

CALL_PAL    Call Privileged Architecture Library Routine
FETCH       Prefetch Data
FETCH_M     Prefetch Data, Modify Intent
MB          Memory Barrier
RPCC        Read Process Cycle Counter
TRAPB       Trap Barrier

Call Privileged Architecture Library
Format:
CALL_PAL fnc.ir    !PAL format

Operation:
{Stall instruction issuing until all
prior instructions are guaranteed to
complete without incurring exceptions.}
{Trap to PAL code.}

Exceptions:

None
Instruction mnemonics:

CALL_PAL

Call Privileged Architecture Library

Qualifiers:

None
Description:

The CALL_PAL instruction is not issued until all previous instructions are guaranteed to
complete without exceptions. If an exception occurs, the continuation PC in the exception stack
frame points to the CALL_PAL instruction. The CALL_PAL instruction causes a trap to PAL
code.

Prefetch Data
Format:
FETCHx 0(Rb.ab)    !Memory format

Operation:
va ← {Rbv}
{Optionally prefetch aligned 512-byte block surrounding va.}

Exceptions:

None
Instruction mnemonics:

FETCH

Prefetch Data

FETCH_M

Prefetch Data, Modify Intent

Qualifiers:

None
Description:

The virtual address is given by Rbv. This address is used to designate an aligned 512-byte block of
data. An implementation may optionally attempt to move all or part of this block (or a larger
surrounding block) of data to a faster-access part of the memory hierarchy, in anticipation of
subsequent Load or Store instructions that access that data.
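
As a small illustration of which block a given address designates, the following Python sketch computes the bounds of the aligned 512-byte block surrounding a virtual address. The function name fetch_block is hypothetical; an implementation may prefetch all, part, or none of this block.

```python
BLOCK_SIZE = 512  # bytes, as stated in the description

def fetch_block(va: int):
    """Return (first, last) byte addresses of the aligned block surrounding va."""
    base = va & ~(BLOCK_SIZE - 1)   # clear the low 9 address bits
    return base, base + BLOCK_SIZE - 1
```

For instance, address 0x1234 falls in the block 0x1200 through 0x13FF.
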
The FETCH instruction is a hint to the implementation that may allow faster execution. An
implementation is free to ignore the hint. If prefetching is done in an implementation, the order
of fetch within the designated block is UNPREDICTABLE.
The FETCH_M instruction gives the additional hint that modifications (stores) to some or all of
the data block are anticipated.
No exceptions are generated by FETCHx. If a Load (or Store in the case of FETCH_M) that uses
the same address would fault, the prefetch request is ignored. It is UNPREDICTABLE whether a
TB-miss fault is ever taken by FETCHx.
Implementation Note
Implementations are encouraged to take the TB-miss fault, then continue
the prefetch.

The programming model for effective use of FETCH and FETCH_M is given in Appendix A.
Software Note
FETCH is intended to help software overlap memory latencies on the
order of 100 cycles. FETCH is unlikely to help (or be implemented) for
memory latencies on the order of 10 cycles. Code scheduling should be
used to overlap such short latencies.

Memory Barrier
Format:
MB    !Memory format

Operation:
{Guarantee that all subsequent loads or stores
will not access memory until after all previous
loads and stores have accessed memory, as
observed by other processors.}

Exceptions:

None
Instruction mnemonics:

MB

Memory Barrier

Qualifiers:

None
Description:

The use of the Memory Barrier (MB) instruction is required only in multiprocessor systems.
In the absence of an MB instruction, loads and stores to different physical locations are allowed to
complete out of order on the issuing processor as observed by other processors. The MB
instruction allows memory accesses to be serialized on the issuing processor as observed by other
processors. See Chapter 5 for details on using the MB instruction to serialize these accesses.
Chapter 5 also details coordinating memory accesses across processors.
Note that MB ensures serialization only; it does not necessarily accelerate the progress of memory
operations.

Read Process Cycle Counter
Format:
RPCC Ra.wq    !Memory format

Operation:
Ra ← {cycle counter}

Exceptions:
None
Instruction mnemonics:
RPCC
Read Process Cycle Counter
Qualifiers:
None
Description:
Register Ra is written with the process cycle counter (PCC).
The low-order 32 bits of the process cycle counter are an unsigned 32-bit integer that increments
once per N CPU cycles, where N is an implementation-specific integer in the range 1..16. The
cycle counter frequency is the number of times the process cycle counter gets incremented per
second, rounded to a 64-bit integer. The integer count wraps to 0 from a count of FFFF FFFF₁₆.
The counter wraps no more frequently than 1.5 times the implementation's interval clock
interrupt period (which is two thirds of the interval clock interrupt frequency). The high-order
32 bits of the process cycle counter are an offset that when added to the low-order 32 bits gives
the cycle count for this process.
The process cycle counter is suitable for timing intervals on the order of nanoseconds and may be
used for detailed performance characterization. It is required on all implementations. PCC is
required for every processor, and each processor in a multiprocessor system has its own private,
independent PCC.
As an example, consider the following code that returns in R0 the current cycle count
MOD 2**32.

RPCC R0              ! Read the process cycle counter
SLL  R0, #32, R1     ! Line up the offset and count fields
ADDQ R0, R1, R0      ! Do add
SRL  R0, #32, R0     ! Zero extend the cycle count to 64 bits
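
The arithmetic of this four-instruction sequence can be checked with a short Python model. The function name cycle_count_mod_2_32 is hypothetical; registers are modeled as unsigned 64-bit integers, with the offset in the high 32 bits of the PCC value and the counter in the low 32 bits.

```python
MASK64 = (1 << 64) - 1

def cycle_count_mod_2_32(pcc: int) -> int:
    """Mirror RPCC/SLL/ADDQ/SRL on a 64-bit PCC value."""
    r0 = pcc & MASK64            # RPCC R0
    r1 = (r0 << 32) & MASK64     # SLL  R0, #32, R1  (counter moved to high half)
    r0 = (r0 + r1) & MASK64      # ADDQ R0, R1, R0   (high half = offset + counter)
    r0 = r0 >> 32                # SRL  R0, #32, R0  (zero-extended 32-bit sum)
    return r0
```

The high half after the add holds offset plus counter modulo 2**32, which is why the final shift yields the process cycle count.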

Trap Barrier
Format:
TRAPB    !Memory format

Operation:
{Stall instruction issuing until all prior instructions are
guaranteed to complete without incurring arithmetic traps.}

Exceptions:

None
Instruction mnemonics:

TRAPB

Trap Barrier

Qualifiers:

None
Description:

The TRAPB instruction allows software to guarantee that in a pipelined implementation, all
previous arithmetic instructions will complete without incurring any arithmetic traps before any
instructions after the TRAPB are issued. For example, TRAPB should be used before changing an
exception handler to ensure that all exceptions on previous instructions are processed in the
current exception-handling environment.

• VAX Compatibility Instructions
Alpha provides the instructions shown in Table 4-13 for use in translated VAX code. These
instructions are not a permanent part of the architecture and will not be available in some future
implementations. They are intended to preserve customer assumptions about VAX instruction
atomicity in porting code from VAX to Alpha.
These instructions should be generated only by the VAX-to-Alpha software translator; they should
never be used in native Alpha code. Any native code that uses them may cease to work.

Table 4-13 · VAX Compatibility Instructions Summary

Mnemonic    Operation

RC          Read and Clear
RS          Read and Set

VAX Compatibility Instructions
Format:
Rx Ra.wq    !Memory format

Operation:
Ra ← intr_flag
intr_flag ← 0    !RC
intr_flag ← 1    !RS

Exceptions:

None
Instruction mnemonics:

RC

Read and Clear

RS

Read and Set

Qualifiers:

None
Description:
The intr_flag is returned in Ra and then cleared to zero (RC) or set to one (RS).

These instructions may be used to determine whether the sequence of Alpha instructions between
RS and RC (corresponding to a single VAX instruction) was executed without interruption or
exception.
Intr_flag is a per-processor state bit. The intr_flag is cleared if that processor encounters any
exception or interrupt.
It is UNPREDICTABLE whether a processor's intr_flag is affected when that processor executes
an LDx_L or STx_C instruction. A processor's intr_flag is not affected when that processor
executes a normal load or store instruction.

A processor's intr_flag is not affected when that processor executes a taken branch.
Note
These instructions are intended only for use by the VAX-to-Alpha software translator; they should never be used by native code.
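
The intended use of RS and RC can be pictured with a small Python state model. This is only a sketch of the flag semantics described above: the class and function names are hypothetical, and the UNPREDICTABLE interaction with LDx_L and STx_C is not modeled.

```python
class Processor:
    """Holds the per-processor intr_flag state bit."""

    def __init__(self):
        self.intr_flag = 0

    def rs(self) -> int:
        """RS: return the old flag, then set it to one."""
        old, self.intr_flag = self.intr_flag, 1
        return old

    def rc(self) -> int:
        """RC: return the old flag, then clear it to zero."""
        old, self.intr_flag = self.intr_flag, 0
        return old

    def interrupt_or_exception(self) -> None:
        """Any exception or interrupt clears the flag."""
        self.intr_flag = 0

def ran_uninterrupted(cpu: Processor, body) -> bool:
    """Bracket a translated VAX instruction with RS ... RC and report the outcome."""
    cpu.rs()
    body(cpu)
    return cpu.rc() == 1
```

If RC returns zero, the bracketed sequence was interrupted and the translated code can treat the VAX instruction as not having executed atomically.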

Chapter 5 · System Architecture and Programming Implications

• Introduction
Portions of the Alpha architecture have implications for programming, and the system structure,
of both uniprocessor and multiprocessor implementations. Architectural implications considered
in the following sections are:
• Physical memory behavior
• Caches and write buffers
• Translation buffers and virtual caches
• Data sharing
• Read/write ordering
• Stacks
• Arithmetic traps
To meet the requirements of the Alpha architecture, software and hardware implementors need
to take these issues into consideration.

• Physical Memory Behavior
Alpha physical memory space is divided into four regions, based on the two most significant,
implemented, physical address bits. Each region's behavior can be described in terms of its
coherency, granularity, width, and memory-like behavior.
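
As an illustration of the region selection, the following Python sketch extracts the region number from a physical address. The function name physical_region is hypothetical, and the 34-bit address width used in the usage example is an arbitrary assumption, since the number of implemented physical address bits is implementation specific.

```python
def physical_region(pa: int, implemented_bits: int) -> int:
    """Return the region (0..3) selected by the two most significant implemented bits."""
    return (pa >> (implemented_bits - 2)) & 0x3
```

With 34 implemented bits, for example, address bits <33:32> choose the region, so physical_region(0x3_0000_0000, 34) selects region 3.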

Coherency of Memory Access
Alpha implementations must provide a coherent view of memory, in which each write by a
processor or I/O device (hereafter, called "processor") becomes visible to all other processors. No
distinction is made between coherency of "memory space" and "I/O space".
Memory coherency may be provided in different ways, for each of the four physical address
regions.

Possible per-region policies include, but are not restricted to:
1. No caching
No copies are kept of data in a region; all reads and writes access the actual data location
(memory or I/O register).
2. Write-through caching
Copies are kept of any data in the region; reads may use the copies, but writes update the
actual data location and either update or invalidate all copies.
3. Write-back caching
Copies are kept of any data in the region; reads and writes may use the copies, and writes use
additional state to determine whether there are other copies to invalidate or update.
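
The write-through policy in item 2 can be sketched as a toy Python model. It is a simplification under stated assumptions: the class name WriteThroughCache and its fields are hypothetical, memory is a plain dictionary, and the invalidate variant is chosen (the policy equally allows updating other copies).

```python
class WriteThroughCache:
    """Toy write-through cache that invalidates peer copies on every write."""

    def __init__(self, memory: dict):
        self.memory = memory   # the actual data locations
        self.copies = {}       # locally cached copies
        self.peers = []        # other caches holding copies of the same region

    def read(self, addr: int) -> int:
        if addr not in self.copies:               # miss: fetch from the data location
            self.copies[addr] = self.memory.get(addr, 0)
        return self.copies[addr]                  # reads may use the copy

    def write(self, addr: int, value: int) -> None:
        self.memory[addr] = value                 # writes update the actual location
        self.copies[addr] = value
        for peer in self.peers:
            peer.copies.pop(addr, None)           # invalidate all other copies
```

After one cache writes a location, a peer that held a stale copy misses on its next read and picks up the new value from memory.
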
Part of the coherency policy implemented for a given physical address region may include
restrictions on excess data transfers (performing more accesses to a location than is necessary to
acquire or change the location's value), or may specify data transfer widths (the granularity used
to access a location).
Independent of coherency policy, a processor may use different hardware or different hardware
resource policies for caching or buffering different physical address regions.

Granularity of Memory Access
For each region, an implementation must support aligned quadword access and may optionally
support aligned longword access.
For a quadword access region, accesses to physical memory must be implemented such that
independent accesses to adjacent aligned quadwords produce the same results regardless of the
order of execution. Further, an access to an aligned quadword must be done in a single atomic
operation.
For a longword access region, accesses to physical memory must be implemented such that
independent accesses to adjacent aligned longwords produce the same results regardless of the
order of execution. Further, an access to an aligned longword must be done in a single atomic
operation, and an access to an aligned quadword must also be done in a single atomic operation.
In this context, "atomic" means that if different processors do simultaneous reads and writes of
the same data, it must not be possible to observe a partial write of the subject longword or
quadword.

Width of Memory Access
Subject to the granularity, ordering, and coherency constraints given in the sections of this
chapter entitled Coherency of Memory Access, Granularity of Memory Access, and Read/Write
Ordering, accesses to physical memory may be freely cached, buffered, and prefetched.
A processor may read more physical memory data (such as a full cache block) than is actually
accessed, writes may trigger reads, and writes may write back more data than is actually updated.
A processor may elide multiple reads and/or writes to the same data.

Memory-Like Behavior
A memory-like region obeys the following rules:
• Each page frame in the region either exists in its entirety or does not exist in its entirety; there are
no holes within a page frame.
• All locations that exist are read/write.

• A write to a location followed by a read from that location returns precisely the bits written; all
bits act as memory.
• A write to one location does not change any other location.
• Reads have no side effects.
• Longword access granularity is provided.
• Instruction-fetch is supported.
• Load-locked and store-conditional are supported.
Non-memory-like regions may have much more arbitrary behavior:
• Unimplemented locations or bits may exist anywhere.
• Some locations or bits may be read-only and others write-only.
• Address ranges may overlap, such that a write to one location changes the bits read from a
different location.
• Reads may have side effects, although this is strongly discouraged.
• Longword granularity need not be supported.
• Instruction-fetch need not be supported.
• Load-locked and store-conditional need not be supported.
Hardware/Software Coordination Note
The details of such behavior are outside the scope of the Alpha architecture. Specific
processor and I/O adapter implementations may choose and document whatever behavior
they need. It is the responsibility of system designers to impose enough consistency to
allow processors successfully to access matching non-memory devices in a coherent way.

• Translation Buffers and Virtual Caches
A system may choose to include a Translation Buffer (TB), a virtual instruction cache (virtual
I-cache), or a virtual data cache (virtual D-cache). The contents of these caches and/or translation
buffers may become invalid, depending upon what operating system activity is being performed.
Whenever a nonsoftware field of a valid Page Table Entry (PTE) is modified, copies of that PTE
must be made coherent. Translation Buffer (TB) entries and virtual D-cache entries can be made
coherent by calling the appropriate PALcode routine to invalidate the TB. Virtual I-cache entries
can be made coherent via the IMB PAL call.

If a processor implements address space numbers (ASNs), and the old PTE has the address space
match (ASM) bit clear (ASNs in use) and the valid bit set, then entries can also effectively be made
coherent by assigning a new, unused ASN to the currently running process and not reusing the
previous ASN before calling the appropriate PALcode routine to invalidate the Translation Buffer
(TB).
In a multiprocessor environment, making the TBs and/or caches coherent on only one processor
is not always sufficient. An operating system must arrange to perform the above actions on each
processor that could possibly have copies of the PTE or data for any affected page.

• Caches and Write Buffers
A hardware implementation may include mechanisms to reduce memory access time by making
local copies of recently used memory contents (or those expected to be used) or by buffering
writes to complete at a later time. Caches and write buffers are examples of these mechanisms.
They must be implemented so that their existence is transparent to software (except for timing,
error reporting/control/recovery, and modification to the I-stream).
The following requirements must be met by all cache/write-buffer implementations. All processors must provide a coherent view of memory.
1. Write buffers may be used to delay and aggregate writes. From the viewpoint of another
processor, buffered writes appear not to have happened yet. (Write buffers must not delay
writes indefinitely. See Timeliness.)
2. Write-back caches must be able to detect a later write from another processor and invalidate
or update the cache contents.
3. A processor must guarantee that a data store to a location followed by a data load from the
same location must read the updated value.
4. Cache prefetching is allowed, but virtual caches must not prefetch from invalid pages.
5. A processor must guarantee that all of its previous writes are visible to all other processors
before a HALT instruction completes. A processor must guarantee that its caches are coherent
with the rest of the system before continuing from a HALT.
6. If battery backup is supplied, a processor must guarantee that the memory system remains
coherent across a powerfail/recovery sequence. Data that was written by the processor before
the powerfail may not be lost, and any caches must be in a valid state before (and if) normal
instruction processing is continued after power is restored.
7. Virtual instruction caches are not required to notice modifications of the virtual I-stream (they
need not be coherent with the rest of memory). Software that creates or modifies the instruction stream must execute an IMB PAL call before trying to execute the new instructions.
For example, if two different virtual addresses, VA1 and VA2, map to the same page frame, a
store to VA1 modifies the virtual I-stream fetched via VA2.


However, the sequence:
- Change the mapping of an I-stream page from valid to invalid, then
- Copy the corresponding page frame to a new page frame, then
- Change the original mapping to be valid and point to the new page frame
does not modify the virtual I-stream (this might happen in soft page faults).
8. Physical instruction caches are not required to notice modifications of the physical I-stream
(they need not be coherent with the rest of memory), except for certain paging activity. (See
Timeliness.) Software that creates or modifies the instruction stream must execute an IMB PAL
call before trying to execute the new instructions.
In this context, to "modify the physical I-stream" means any Store to the same physical
address that is subsequently fetched as an instruction.
In this context, to "modify the virtual I-stream" means any Store to the same physical address
that is subsequently fetched as an instruction via some corresponding (virtual address, ASN) pair,
or to change the virtual-to-physical address mapping so that different values are fetched.

• Data Sharing
In a multiprocessor environment, writes to shared data must be synchronized by the programmer.

Atomic Change of a Single Datum
The ordinary STL and STQ instructions can be used to perform an atomic change of a shared
aligned longword or quadword. ("Change" means that the new value is not a function of the old
value.) In particular, an ordinary STL or STQ instruction can be used to change a variable that
could be simultaneously accessed via an LDx_L/STx_C sequence.

Atomic Update of a Single Datum
The load-locked/store-conditional instructions may be used to perform an atomic update of a
shared aligned longword or quadword. ("Update" means that the new value is a function of the
old value.)
The following sequence performs a read-modify-write operation on location x. Only register-to-register operate instructions and branch fall-throughs may occur in the sequence:
try_again:
        LDQ_L   R1,x
        <modify R1>
        STQ_C   R1,x
        BEQ     R1,no_store

no_store:
        <code to check for excessive iterations>
        BR      try_again


If this sequence runs with no exceptions or interrupts, and no other processor writes to location x
(more precisely, the locked range including x) between the LDQ_L and STQ_C instructions, then
the STQ_C shown in the example stores the modified value in x and sets R1 to 1. If, however, the
sequence encounters exceptions or interrupts that eventually continue the sequence, or another
processor writes to x, then the STQ_C does not store and sets R1 to 0. In this case, the sequence is
repeated via the branches to no_store and try_again. This repetition continues until the reasons
for exceptions or interrupts are removed, and no interfering store is encountered.
To be useful, the sequence must be constructed so that it can be replayed an arbitrary number of
times, giving the same result values each time. A sufficient (but not necessary) condition is that,
within the sequence, the set of operand destinations and the set of operand sources are disjoint.
Note
A sufficiently long instruction sequence between LDQ_L and STQ_C will
never complete, because periodic timer interrupts will always occur
before the sequence completes. The rules in Appendix A describe
sequences that will eventually complete in all Alpha implementations.

This load-locked/store-conditional paradigm may be used whenever an atomic update of a shared
aligned quadword is desired, including getting the effect of atomic byte writes.
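
The retry behavior described above can be sketched in Python. This is only a model under stated assumptions, not Alpha code: the class Location, the per-processor reservation set, and the interfere callback are hypothetical names introduced for illustration, and an interfering store stands in for any event (interrupt, exception, or write by another processor) that makes the store-conditional fail.

```python
class Location:
    """One shared aligned quadword with load-locked reservations (simplified model)."""

    def __init__(self, value=0):
        self.value = value
        self.locked_by = set()      # processors holding a load-locked reservation

    def load_locked(self, cpu):
        self.locked_by.add(cpu)
        return self.value

    def store(self, value):
        """Ordinary store: always succeeds and clears every reservation."""
        self.value = value
        self.locked_by.clear()

    def store_conditional(self, cpu, value):
        """Store and return 1 if cpu's reservation survived; otherwise return 0."""
        if cpu not in self.locked_by:
            return 0
        self.store(value)
        return 1


def atomic_update(loc, cpu, modify, interfere=None, max_tries=10):
    """Replay load-locked / modify / store-conditional until the store succeeds."""
    for attempt in range(1, max_tries + 1):
        old = loc.load_locked(cpu)
        new = modify(old)               # register-to-register work only
        if interfere is not None:
            interfere(attempt)          # stands in for another processor
        if loc.store_conditional(cpu, new):
            return attempt
    raise RuntimeError("too many iterations")


x = Location(5)

def other_cpu(attempt):
    # Hypothetical interference: another processor writes x during the first try.
    if attempt == 1:
        x.store(100)

tries = atomic_update(x, cpu=0, modify=lambda v: v + 1, interfere=other_cpu)
```

Because modify reads only the freshly loaded value and writes only the new one, the sequence can be replayed any number of times. In this run the second attempt succeeds and increments the value written by the interfering store.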

Atomic Update of Data Structures
Before accessing shared writable data structures (those that are not a single aligned longword or
quadword), the programmer can acquire control of the data structure by using an atomic update
to set a software lock variable. Such a software lock can be cleared with an ordinary store
instruction.
A software-critical section, therefore, may look like the sequence:
stq_c_loop:
spin_loop:
        LDQ_L   R1,lock_variable        \
        BLBS    R1,already_set           \
        OR      R1,#1,R2                  > Set lock bit
        STQ_C   R2,lock_variable         /
        BEQ     R2,stq_c_fail           /

        MB
        <critical section: updates various data structures>
        MB
        STQ     R31,lock_variable       Clear lock bit

already_set:
        <code to block or reschedule or test for too many iterations>
        BR      spin_loop
stq_c_fail:
        <code to test for too many iterations>
        BR      stq_c_loop


This code has a number of subtleties:
1. If the lock_variable is already set, the spin loop is done without doing any stores. This
avoidance of stores improves memory subsystem performance and avoids the deadlock
described below.
2. If the lock_variable is actually being changed from 0 to 1, and the STQ_C fails (due to an
interrupt, or because another processor simultaneously changed lock_variable), the entire
process starts over by reading the lock_variable again.
3. Only the fall-through path of the BLBS does a STx_C; some implementations may not allow a
successful STx_C after a branch-taken.
4. Only register-to-register operate instructions are used to do the modify.
5. Both conditional branches are forward branches, so they are properly predicted not to be
taken (to match the common case of no contention for the lock).
6. The OR writes its result to a second register; this allows the OR and the BLBS to be
interchanged if that would give a faster instruction schedule.
7. Other operate instructions (from the critical section) may be scheduled into the
LDQ_L..STQ_C sequence, so long as they do not fault or trap, and they give correct results if
repeated; other memory or operate instructions may be scheduled between the STQ_C and
BEQ.
8. The MB instructions are discussed in Ordering Considerations for Shared Data Structures.
9. An ordinary STQ instruction is used to clear the lock_variable.
It would be a performance mistake to spin-wait by repeating the full LDQ_L..STQ_C sequence (to
move the BLBS after the BEQ) because that sequence may repeatedly change the software
lock_variable from "locked" to "locked," with each write causing extra access delays in all other
caches that contain the lock_variable. In the extreme, spin-waits that contain writes may deadlock
as follows:

If, when one processor spins with writes, another processor is modifying (not changing) the
lock_variable, then the writes on the first processor may cause the STx_C of the modify on the
second processor always to fail.
This deadlock situation is avoided by:
• Having only one processor do a store (no STx_C), or
• Having no write in the spin loop, or
• Doing a write only if the shared variable actually changes state (1 → 1 does not change state).
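
The "no write in the spin loop" rule can be illustrated with a small Python model. It is a sketch under assumptions: LockWord, try_acquire, release, and store_count are hypothetical names, the reservation logic is simplified, and the MB instructions that surround a real critical section are not modeled.

```python
class LockWord:
    """Simplified model of lock_variable with one load-locked reservation per CPU."""

    def __init__(self):
        self.value = 0
        self.reserved = set()
        self.store_count = 0        # writes that other caches would have to observe

    def ldq_l(self, cpu):
        self.reserved.add(cpu)
        return self.value

    def stq(self, value):
        self.value = value
        self.reserved.clear()
        self.store_count += 1

    def stq_c(self, cpu, value):
        if cpu not in self.reserved:
            return 0
        self.stq(value)
        return 1


def try_acquire(lock, cpu):
    """One pass through the LDQ_L..STQ_C sequence; True if the lock was taken."""
    r1 = lock.ldq_l(cpu)
    if r1 & 1:                      # BLBS: already set, so spin without storing
        return False
    return lock.stq_c(cpu, r1 | 1) == 1


def release(lock):
    lock.stq(0)                     # an ordinary store clears the lock


lock = LockWord()
got_a = try_acquire(lock, cpu=0)                        # CPU 0 takes the lock
spins = [try_acquire(lock, cpu=1) for _ in range(3)]    # CPU 1 spins, read-only
stores_while_spinning = lock.store_count
release(lock)
got_b = try_acquire(lock, cpu=1)                        # CPU 1 succeeds after release
```

While CPU 1 spins, store_count stays at the single store made by CPU 0, so the spinning processor causes no writes to lock_variable and cannot make another processor's STx_C fail.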

Ordering Considerations for Shared Data Structures
A critical section sequence, such as shown in Atomic Update of Data Structures, is conceptually
only three steps:

1. Acquire software lock
2. Critical section-read/write shared data
3. Clear software lock


In the absence of explicit instructions to the contrary, the Alpha architecture allows reads and
writes to be reordered. While this may allow more implementation speed and overlap, it can also
create undesired side effects on shared data structures. Normally, the critical section just
described would have two instructions added to it:
<acquire software lock>
MB (memory barrier #1)
<critical section -- read/write shared data>
MB (memory barrier #2)
<clear software lock>

The first memory barrier prevents any reads (from within the critical section) from being
prefetched before the software lock is acquired; such prefetched reads would potentially contain
stale data.
The second memory barrier prevents any reads or writes (from within the critical section) from
being delayed past the clearing of the software lock; such delayed accesses could interact with the
next user of the shared data, defeating the purpose of the software lock entirely.
Software Note
In the VAX architecture, many instructions provide noninterruptable
read-modify-write sequences to memory variables. Most programmers
never regard data sharing as an issue.

In the Alpha architecture, programmers must pay more attention to
synchronizing access to shared data; for example, to AST routines. In the
VAX, a programmer can use an ADDL2 to update a variable that is
shared between a "MAIN" routine and an AST routine, if running on a
single processor. In the Alpha architecture, a programmer must deal with
AST shared data by using multiprocessor shared data sequences.

• Read/Write Ordering
This section does not apply to programs that run on a single processor and do not write to the
instruction stream. On a single processor, all memory accesses appear to happen in the order
specified by the programmer. This section deals entirely with predictable read/write ordering
across multiple processors.
The order of reads and writes done in an Alpha implementation may differ from that specified by
the programmer.
For any two memory references A and B, either A must occur before B in all Alpha implementations, B must occur before A, or they are UNORDERED. In the last case, software cannot depend
upon one occurring first: the order may vary from implementation to implementation, and even
from run to run or moment to moment on a single implementation.

If two references cannot be shown to be ordered by the rules given, they are UNORDERED and
implementations are free to do them in any order that is convenient. Implementations may take
advantage of this freedom to deliver substantially higher performance.


The discussion that follows first defines the architectural issue sequence of memory references on
a single processor, then defines the (partial) ordering on this issue sequence that all Alpha
implementations are required to maintain.
The individual issue sequences on multiple processors are merged into access sequences at each
shared memory location. The discussion defines the (partial) ordering on the individual access
sequences that all Alpha implementations are required to maintain.
The net result is that for any code that executes on multiple processors, one can determine which
memory accesses are required to occur before others on all Alpha implementations and hence can
write useful shared-variable software.
Software writers can force one reference to occur before another by inserting a memory barrier
instruction (MB or IMB) between the references.

Alpha Shared Memory Model
An Alpha system consists of a collection of processors and shared coherent memories that are
accessible by all processors. (There may also be unshared memories, but they are outside the
scope of this section.)
A processor is an Alpha CPU or an I/O device (or anything else that gets added).
A shared memory is the primary storage place for one or more locations.
A location is an aligned quadword, specified by its physical address. Multiple virtual addresses
may map to the same physical address. Ordering considerations are based only on the physical
address.
Implementation Note
An implementation may allow a location to have multiple physical
addresses, but the rules for accesses via mixtures of the addresses are
implementation-specific and outside the scope of this section. Accesses
via exactly one of the physical addresses follow the rules described next.
Each processor may generate accesses to shared memory locations. There are five types of
accesses:
1. Instruction fetch by processor i to location x, returning value a, denoted Pi:I(x,a).
2. Data read by processor i to location x, returning value a, denoted Pi:R(x,a).
3. Data write by processor i to location x, storing value a, denoted Pi:W(x,a).
4. Memory barrier instruction issued by processor i, denoted Pi:MB.
5. I-stream memory barrier instruction issued by processor i, denoted Pi:IMB.
The first access type is also called an I-stream access or I-fetch. The next two are also called
D-stream accesses. The first three types collectively are called read/write accesses, denoted
Pi:*(x,a). The last two types collectively are called barriers.


During actual execution in an Alpha system, each processor has a time-ordered issue sequence of
all the memory references presented by that processor (to all memory locations), and each
location has a time-ordered access sequence of all the accesses presented to that location (from all
processors).

Architectural Definition of Processor Issue Sequence
The issue sequence for a processor is architecturally defined with respect to a hypothetical simple
implementation that contains one processor and a single shared memory, with no caches or
buffers. This is the instruction execution model:
1. I-fetch: An Alpha instruction is fetched from memory.
2. Read/Write: That instruction is executed and runs to completion, including a single data read
from memory for a Load instruction or a single data write to memory for a Store instruction.
3. Update: The PC for the processor is updated.
4. Loop: Repeat the above sequence indefinitely.

If the instruction fetch step gets a memory management fault, the I-fetch is not done and the PC
is updated to point to a PALcode fault handler. If the read/write step gets a memory management
fault, the read/write is not done and the PC is updated to point to a PALcode fault handler.
All memory references are aligned quadwords. For the purpose of defining ordering, aligned
longword references are modeled as quadword references to the containing aligned quadword.
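
As a small illustration of the last rule, the location named by a reference can be computed by clearing the low three address bits. The helper names location_of and same_location are hypothetical and exist only for this sketch.

```python
def location_of(address):
    """Physical address of the aligned quadword (8 bytes) containing `address`."""
    return address & ~0x7


def same_location(addr_a, addr_b):
    """True if two references fall in the same aligned quadword."""
    return location_of(addr_a) == location_of(addr_b)
```

Under this model, longword references to 0x1000 and 0x1004 are references to the same location for ordering purposes, while 0x1004 and 0x1008 are not.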

Definition of Processor Issue Order
A partial ordering, called processor issue order, is imposed on the issue sequence defined in
Architectural Definition of Processor Issue Sequence in this chapter.
For two accesses u and v issued by processor Pi, u is said to PRECEDE v IN ISSUE ORDER (<) if u
occurs earlier than v in the issue sequence for Pi, and either of the following applies:
1. The access types are of the following issue order:
Table 5-1 • Processor Issue Order
1st↓ 2nd→    Pi:I(y,b)    Pi:R(y,b)    Pi:W(y,b)    Pi:MB    Pi:IMB
Pi:I(x,a)    < if x=y                  < if x=y              <
Pi:R(x,a)                 < if x=y     < if x=y     <        <
Pi:W(x,a)                 < if x=y     < if x=y     <        <
Pi:MB                     <            <            <        <
Pi:IMB       <            <            <            <        <

2. Or, u is a TB fill, for example, a PTE read in order to satisfy a TB miss, and v is an I- or
D-stream access using that PTE (see Litmus Tests).


Issue order is thus a partial order imposed on the architecturally specified issue sequence.
Implementations are free to do memory accesses from a single processor in any sequence that is
consistent with this partial order.
Note that accesses to different locations are ordered only with respect to barriers and TB fill. The
table asymmetry for I-fetch allows writes to the I-stream to be incoherent until an IMB is
executed.
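
The issue-order rule can be written as a table lookup. In the Python sketch below, the dictionary ISSUE_ORDER is this sketch's reading of Table 5-1 and should be checked against the table itself. The names ISSUE_ORDER and precedes_in_issue_order, and the (kind, location) representation of an access, are assumptions made for illustration. The TB-fill rule is not modeled.

```python
# "same"   = ordered only if both accesses name the same location (x = y)
# "always" = ordered regardless of location
# Pairs that are absent are not ordered by the access types alone.
ISSUE_ORDER = {
    ("I", "I"): "same",     ("I", "W"): "same",     ("I", "IMB"): "always",
    ("R", "R"): "same",     ("R", "W"): "same",
    ("R", "MB"): "always",  ("R", "IMB"): "always",
    ("W", "R"): "same",     ("W", "W"): "same",
    ("W", "MB"): "always",  ("W", "IMB"): "always",
    ("MB", "R"): "always",  ("MB", "W"): "always",
    ("MB", "MB"): "always", ("MB", "IMB"): "always",
    ("IMB", "I"): "always", ("IMB", "R"): "always", ("IMB", "W"): "always",
    ("IMB", "MB"): "always", ("IMB", "IMB"): "always",
}


def precedes_in_issue_order(first, second):
    """first and second are (kind, location) pairs; first is earlier in the issue sequence.

    Barriers carry None as their location.
    """
    rule = ISSUE_ORDER.get((first[0], second[0]))
    if rule == "always":
        return True
    if rule == "same":
        return first[1] == second[1]
    return False
```

For example, a write to x followed by a read of y is unordered, but the same write followed by an MB and then the read is ordered through the barrier. A write to x followed by an I-fetch of x is unordered until an IMB intervenes, which is the asymmetry noted above.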

Definition of Memory Access Sequence
The access sequence for a location cannot be observed directly, nor fully predicted before an
actual execution, nor reproduced exactly from one execution to another. Nonetheless, some
useful ordering properties must hold in all Alpha implementations.

Definition of Location Access Order
A partial ordering, called location access order, is imposed on the memory access sequence
defined above.
For two accesses u and v to location x, u is said to PRECEDE v IN ACCESS ORDER («) if u
occurs earlier than v in the access sequence for x, and at least one of them is a write:

Table 5-2 • Location Access Order
1st↓ 2nd→    Pi:I(x,b)    Pi:R(x,b)    Pi:W(x,b)
Pi:I(x,a)                              «
Pi:R(x,a)                              «
Pi:W(x,a)    «            «            «

Access order is thus a partial order imposed on the actual access sequence for a given location.
Each location has a separate access order. There is no direct ordering relationship between
accesses to different locations.
Note that reads and I-fetches are ordered only with respect to writes.

Definition of Storage
If u is Pi:W(x,a), and v is either Pj:I(x,b) or Pj:R(x,b), and u « v, and no w Pk:W(x,c) exists
such that u « w « v, then the value b returned by v is exactly the value a written by u.
Conversely, if u is Pi:W(x,a), and v is either Pj:I(x,b) or Pj:R(x,b), and b=a (and a is distinguishable from values written by accesses other than u), then u « v and for any other w Pk:W(x,c) either
w « u or v « w.
The only way to communicate information between different processors is for one to write a
shared location and the other to read the shared location and receive the newly written value. (In
this context, the sending of an interrupt from processor Pi to processor Pj is modeled as Pi
writing to a location INTij, and Pj reading from INTij.)
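
The first half of the storage definition says that a read returns the value of the latest write that precedes it in the access sequence for its location. A minimal Python sketch follows. The function name value_read and the list-of-pairs representation are hypothetical, and the initial value 1 is borrowed from the convention used in the litmus tests.

```python
def value_read(access_sequence, read_index, initial=1):
    """Value returned by the read (or I-fetch) at read_index.

    access_sequence is the time-ordered list of (kind, value) accesses to one
    location; kind is "W", "R", or "I", and value is ignored for non-writes.
    """
    for kind, value in reversed(access_sequence[:read_index]):
        if kind == "W":
            return value            # latest write with no other write in between
    return initial                  # no write precedes the read


# Hypothetical access sequence for one location: W(2), R, W(3), R
sequence = [("W", 2), ("R", None), ("W", 3), ("R", None)]
first_read = value_read(sequence, 1)
second_read = value_read(sequence, 3)
```

Here the first read returns 2 and the second returns 3, because the write of 3 intervenes between the write of 2 and the second read.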


Relationship Between Issue Order and Access Order
If u is Pi:*(x,a), and v is Pi:*(x,b), one of which is a write, and u < v in the issue order for
processor Pi, then u « v in the access order for location x.
In other words, if two accesses to the same location are ordered on a given processor, they are
ordered in the same way at the location.

Definition of Before
For two accesses u and v, u is said to be BEFORE v (<=) if:
u < v or
u « v or
there exists an access w such that:
(u < w and w <= v) or
(u « w and w <= v).
In other words, "before" is the transitive closure over issue order and access order.

Definition of After
If u <= v , then v is said to be AFTER u.
At most one of u <= v and v <= u is true.
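
Since "before" is a transitive closure, it can be computed mechanically from the individual issue-order and access-order pairs. The Python sketch below is illustrative only: the function names and the access labels u, v, and w are hypothetical, and the quadratic fixed-point loop is chosen for clarity rather than speed.

```python
def before_relation(edges):
    """Transitive closure of a set of (u, v) pairs, each meaning u < v or u « v."""
    closure = set(edges)
    changed = True
    while changed:
        changed = False
        for a, b in list(closure):
            for c, d in list(closure):
                if b == c and (a, d) not in closure:
                    closure.add((a, d))
                    changed = True
    return closure


def is_consistent(edges):
    """True if no access ends up BEFORE itself, so u <= v and v <= u never both hold."""
    return all(a != b for a, b in before_relation(edges))


chain = {("u", "v"), ("v", "w")}                   # u before v, v before w
closed = before_relation(chain)
looped = chain | {("w", "u")}                      # adding w before u closes a cycle
```

A set of required orderings that contains a cycle would make some access both before and after another, which the rule above forbids. The litmus tests that follow use this kind of contradiction to show that an outcome is impossible.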

Timeliness
Even in the absence of a barrier after the write, a write by one processor to a given location may
not be delayed indefinitely in the access order for that location.

Litmus Tests
Many issues about writing and reading shared data can be cast into questions about whether a
write is before or after a read. These questions can be answered by rigorously applying the
ordering rules described previously to demonstrate whether the accesses in question are ordered
at all.
Assume, in the litmus tests below, that initially all memory locations contain 1.

Litmus Test 1 (Impossible Sequence)
Pi                   Pj
[U1] Pi:W(x,2)       [V1] Pj:R(x,2)
                     [V2] Pj:R(x,1)

V1 reading 2 implies U1 « V1, by the definition of storage
V2 reading 1 implies V2 « U1, by the definition of storage
V1 < V2, by the definition of issue order
The first two orderings imply that V2 <= V1, whereas the last implies that V1 <= V2.
Both implications cannot be true. Thus, once a processor reads a new value from a location, it
must never see an old value; time must not go backward. V2 must read 2.


Litmus Test 2 (Impossible Sequence)
Pi                   Pj
[U1] Pi:W(x,2)       [V1] Pj:W(x,3)
                     [V2] Pj:R(x,2)
                     [V3] Pj:R(x,3)

V2 reading 2 implies V1 <= U1
V3 reading 3 implies U1 <= V1
Both implications cannot be true. Thus, once a processor reads a new value written by U1, any
other writes that must precede the read must also precede U1. V3 must read 2.

Litmus Test 3 (Impossible Sequence)
Pi                   Pj                   Pk
[U1] Pi:W(x,2)       [V1] Pj:W(x,3)       [W1] Pk:R(x,3)
[U2] Pi:R(x,3)                            [W2] Pk:R(x,2)

U2 reading 3 implies U1 <= V1
W2 reading 2 implies V1 <= U1
Both implications cannot be true. Again, time cannot go backward. If U2 reads 3, then W2 must
read 3. Alternatively, if W2 reads 2, then U2 must read 2.

Litmus Test 4 (Sequence Okay)
Pi                   Pj
[U1] Pi:W(x,2)       [V1] Pj:R(y,2)
[U2] Pi:W(y,2)       [V2] Pj:R(x,1)

There are no conflicts in this sequence. U2 <= V1 and V2 <= U1. U1 and U2 are not ordered with
respect to each other. V1 and V2 are not ordered with respect to each other. There is no
conflicting implication that U1 <= V2.

Litmus Test 5 (Sequence Okay)
Pi                   Pj
[U1] Pi:W(x,2)       [V1] Pj:R(y,2)
                     [V2] Pj:MB
[U2] Pi:W(y,2)       [V3] Pj:R(x,1)

There are no conflicts in this sequence. U2 <= V1 <= V3 <= U1. There is no conflicting
implication that U1 <= U2.


Litmus Test 6 (Sequence Okay)
Pi                   Pj
[U1] Pi:W(x,2)       [V1] Pj:R(y,2)
[U2] Pi:MB
[U3] Pi:W(y,2)       [V2] Pj:R(x,1)

There are no conflicts in this sequence. V2 <= U1 <= U3 <= V1. There is no conflicting implication
that V1 <= V2.
In scenarios 4, 5, and 6, writes to two different locations x and y are observed (by another
processor) to occur in the opposite order than that in which they were performed. An update to y
propagates quickly to Pj, but the update to x is delayed, and Pi and Pj do not both have MBs.

Litmus Test 7 (Impossible Sequence)
Pi                   Pj
[U1] Pi:W(x,2)       [V1] Pj:R(y,2)
[U2] Pi:MB           [V2] Pj:MB
[U3] Pi:W(y,2)       [V3] Pj:R(x,1)

V1 reading 2 implies U3 <= V1
V3 reading 1 implies V3 <= U1
But, by transitivity, U1 <= U3 <= V1 <= V3
Both cannot be true, so if V1 reads 2, then V3 must also read 2.
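
The reasoning in Litmus Test 7 amounts to finding a cycle among the required orderings. The Python sketch below checks this outcome twice: once with both MBs, and once with the MB on Pi removed, which is the situation of Litmus Test 5. The function has_cycle and the edge lists are hypothetical constructions for illustration. The labels follow Litmus Test 7, with U1 to U3 issued by Pi and V1 to V3 issued by Pj.

```python
def has_cycle(edges):
    """Depth-first search for a cycle in the directed graph given by (u, v) edges."""
    graph = {}
    for u, v in edges:
        graph.setdefault(u, []).append(v)
    state = {}                              # node -> "active" or "done"

    def visit(node):
        state[node] = "active"
        for nxt in graph.get(node, []):
            if state.get(nxt) == "active":
                return True                 # reached a node still on the current path
            if nxt not in state and visit(nxt):
                return True
        state[node] = "done"
        return False

    return any(visit(n) for n in list(graph) if n not in state)


# Orderings implied by the values read (all locations initially contain 1)
# and by the MB on Pj:
observed = [
    ("U3", "V1"),                   # V1 reads y = 2, so the write of y precedes it
    ("V3", "U1"),                   # V3 reads x = 1, so it precedes the write of x
    ("V1", "V2"), ("V2", "V3"),     # Pj: read, MB, read
]

# Litmus Test 7: Pi also has an MB between its two writes.
with_both_mbs = observed + [("U1", "U2"), ("U2", "U3")]

# Same outcome with the MB on Pi removed: the two writes go to different
# locations and are not ordered with respect to each other.
without_pi_mb = observed

impossible = has_cycle(with_both_mbs)
allowed = not has_cycle(without_pi_mb)
```

With both barriers the orderings form a cycle, so the outcome cannot occur. Without the barrier on Pi no cycle exists and the outcome is permitted.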

Litmus Test 8 (Impossible Sequence)
Pi                   Pj
[U1] Pi:W(x,2)       [V1] Pj:W(y,2)
[U2] Pi:MB           [V2] Pj:MB
[U3] Pi:R(y,1)       [V3] Pj:R(x,1)

U3 reading 1 implies U3 <= V1
V3 reading 1 implies V3 <= U1
But, by transitivity, U1 <= U3 <= V1 <= V3
Both cannot be true, so if U3 reads 1, then V3 must read 2, and vice versa.


Litmus Test 9 (Impossible Sequence)
Pi                   Pj
[U1] Pi:W(x,2)       [V1] Pj:W(x,3)
[U2] Pi:R(x,2)       [V2] Pj:R(x,3)
[U3] Pi:R(x,3)       [V3] Pj:R(x,2)

V3 reading 2 implies U1 <= V3
V2 <= V3 and V2 reading 3 implies V2 <= U1
V1 <= V2 and V2 <= U1 implies V1 <= U1

U3 reading 3 implies V1 <= U3
U2 <= U3 and U2 reading 2 implies U2 <= V1
U1 <= U2 and U2 <= V1 implies U1 <= V1
Both V1 <= U1 and U1 <= V1 cannot be true. Time cannot go backward. If V3 reads 2, then U3
must read 2. Alternatively, if U3 reads 3, then V3 must read 3.

Implied Barriers
In Alpha, there are no implied barriers. If an implied barrier is needed for functionally correct
access to shared data, it must be written as an explicit instruction. (Software must explicitly
include any needed MB or IMB instructions.)
Alpha transitions such as the following have no built-in implied memory barriers:
• Entry to PALcode
• Sending and receiving interrupts
• Returning from exceptions, interrupts, or machine checks
• Swapping context
• Invalidating the Translation Buffer (TB)
Depending on implementation choices for maintaining cache coherency, some PAL/cache implementations may have an implied IMB in the I-stream TB fill routine, but this is transparent to the
non-PAL programmer.

Implications for Software
Software must explicitly include MB or IMB instructions in the following circumstances.

Single-Processor Data Stream
No barriers are ever needed. A read to physical address x will always return the value written by
the immediately preceding write to x in the processor issue sequence.


Single-Processor Instruction Stream
An I-fetch from virtual or physical address x does not necessarily return the value written by the
immediately preceding write to x in the issue sequence. To make the I-fetch reliably get the newly
written instruction, an IMB is needed between the write and the I-fetch.

Multiple-Processor Data Stream (Including Single Processor with DMA I/O)
The only way to communicate shared data reliably is to write the shared data on one processor,
then do an MB on that processor, then write a flag (equivalently, send an interrupt) signaling the
other processor that the shared data is ready. Each receiving processor must read the new flag
(equivalently, receive the interrupt), then do an MB, then read or update the shared data.
Leaving out the first MB removes the assurance that the shared data is written before the flag is.
Leaving out the second MB removes the assurance that the shared data is read or updated only
after the flag is seen to change; in this case, an early read could see an old value, and an early
update could be overwritten.
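
The role of the first MB can be seen in a toy model of the writing processor. This Python sketch rests on several assumptions: the class Processor, its write buffer that may drain in any order, and the function publish are hypothetical, and the barrier is modeled by flushing the buffer, which is only one way an implementation could satisfy the ordering requirement. The MB on the receiving processor is not modeled.

```python
class Processor:
    """Toy writer whose buffered writes may reach shared memory in any order."""

    def __init__(self, memory):
        self.memory = memory
        self.buffer = []                    # pending (location, value) writes

    def write(self, location, value):
        self.buffer.append((location, value))

    def drain(self, location):
        """Let the pending write to `location` reach shared memory now."""
        for i, (loc, value) in enumerate(self.buffer):
            if loc == location:
                self.memory[loc] = value
                del self.buffer[i]
                return

    def mb(self):
        """Barrier: every earlier write reaches memory before any later write."""
        for loc, value in self.buffer:
            self.memory[loc] = value
        self.buffer.clear()


def publish(use_mb):
    """Write data, optionally MB, write flag; return what a reader then sees."""
    memory = {"data": 1, "flag": 0}
    writer = Processor(memory)
    writer.write("data", 2)
    if use_mb:
        writer.mb()
    writer.write("flag", 1)
    writer.drain("flag")                    # the flag happens to become visible first
    return memory["flag"], memory["data"]


seen_without_mb = publish(use_mb=False)
seen_with_mb = publish(use_mb=True)
```

Without the barrier, the reader can observe the flag set while the data still holds its old value. With the barrier, the new data is in memory by the time the flag becomes visible.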
This implies that after a CPU has prepared some data buffer to be read from memory by a DMA
I/O device (such as writing a buffer to disk), it must do an MB before starting the I/O, and the
I/O device after receiving the start signal must logically do an MB before reading the data buffer.
This also implies that after a DMA I/O device has written some data to memory (such as paging in
a page from disk), the DMA device must logically do an MB before posting a completion
interrupt, and the interrupt handler software must do an MB before the data is guaranteed to be
visible to the interrupted processor. Other processors must also do MBs before they are guaranteed to see the new data.
An important special case occurs when a write is done (perhaps by an I/O device) to some
physical page frame, then an MB, then a previously invalid PTE is changed to be a valid mapping
of the physical page frame that was just written. In this case, all processors that access using the
newly valid PTE must guarantee to deliver the newly written data after the TB miss, for both
I-stream and D-stream accesses.

Multiple-Processor Instruction Stream (Including Single Processor with DMA I/O)
The only way to update the I-stream reliably is to write the shared I-stream on one processor,
then do an IMB (MB if the writing processor is not going to execute the new I-stream) on that
processor, then write a flag (equivalently, send an interrupt) signaling the other processor that the
shared I-stream is ready. Each receiving processor must read the new flag (equivalently, receive
the interrupt), then do an IMB, then fetch the shared I-stream.
Leaving out the first IMB(MB) removes the assurance that the shared I-stream is written before
the flag is.
Leaving out the second IMB removes the assurance that the shared I-stream is read only after the
flag is seen to change; in this case, an early read could see an old value.


This implies that after a DMA I/O device has written some I-stream to memory (such as paging in
a page from disk), the DMA device must logically do an IMB(MB) before posting a completion
interrupt, and the interrupt handler software must do an IMB before the I-stream is guaranteed to
be visible to the interrupted processor. Other processors must also do IMBs before they are
guaranteed to see the new I-stream.

An important special case occurs when a write is done (perhaps by an I/O device) to some
physical page frame, then an IMB(MB), then a previously invalid PTE is changed to be a valid
mapping of the physical page frame that was just written. In this case, all processors that access
using the newly valid PTE must guarantee to deliver the newly written I-stream after the TB miss.

Multiple-Processor Context Switch
If a process migrates from executing on one processor to executing on another, the context
switch operating system code must include a number of barriers.

A process migrates by having its context stored into memory, then eventually having that context
reloaded on another processor. In between, some shared mechanism must be used to communicate that the context saved in memory by the first processor is available to the second processor.
This could be done by using an interrupt, by using a flag bit associated with the saved context, or
by using a shared-memory multiprocessor data structure, as follows:

First Processor                          Second Processor
Save state of current process.
MB [1]
Pass ownership of process context        Pick up ownership of process context
data structure memory.                   data structure memory.
                                         MB [2]
                                         Restore state of new process context
                                         data structure memory.
                                         Make I-stream coherent [3].
                                         Make TB coherent [4].
                                         Execute code for new process that accesses
                                         memory that is not common to all processes.

MB [1] ensures that the writes done to save the state of the current process happen before the
ownership is passed.
MB [2] ensures that the reads done to load the state of the new process happen after the
ownership is picked up and hence are reliably the values written by the processor saving the old
state. Leaving this MB out makes the code fail if an old value of the context remains in the second
processor's cache and invalidates from the writes done on the first processor are not delivered
soon enough.


The TB on the second processor must be made coherent with any write to the page tables that
may have occurred on the first processor just before the save of the process state. This must be
done with a series of TB invalidate instructions to remove any nonglobal page mapping for this
process, or by assigning an ASN that is unused on the second processor to the process. One of
these actions must occur sometime before starting execution of the code for the new process that
accesses memory (instruction or data) that is not common to all processes. A common method is
to assign a new ASN after gaining ownership of the new process and before loading its context,
which includes its ASN.
The D-cache on the second processor must be made coherent with any write to the D-stream that
may have occurred on the first processor just before the save of process state. This is ensured by
MB [2] and does not require any additional instructions.
The I-cache on the second processor must be made coherent with any write to the I-stream that
may have occurred on the first processor just before the save of process state. This can be done
with an IMB PAL call sometime before the execution of any code that is not common to all
processes. More commonly, this can be done by forcing a TB miss (via the new ASN or via TB
invalidate instructions) and using the TB-fill rule (see Multiple-Processor Data Stream (Including
Single Processor with DMA I/O) in this chapter). This latter approach does not require any
additional instruction.
Combining all these considerations gives:
First Processor                          Second Processor
Pick up ownership of process context
data structure memory.
MB
Assign new ASN or invalidate TBs.
Save state of current process.
Restore state of new process.
MB
Pass ownership of process context   =>   Pick up ownership of new process context
data structure memory.                   data structure memory.
                                         MB
                                         Assign new ASN or invalidate TBs.
                                         Save state of current process.
                                         Restore state of new process.
                                         MB
                                         Pass ownership of old process context
                                         data structure memory.
                                         Execute code for new process that accesses
                                         memory that is not common to all processes.

Note that on a single processor there is no need for the barriers.
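The effect of assigning an ASN that is unused on the second processor can be illustrated with a toy model. This is only a sketch: the `ToyTB` class, the dictionary page table, and the ASN values 5 and 6 are assumptions made for illustration and do not describe any real implementation.

```python
from dataclasses import dataclass, field


@dataclass
class ToyTB:
    """Toy translation buffer whose entries are tagged with an ASN (hypothetical)."""
    entries: dict = field(default_factory=dict)  # (asn, vpn) -> pfn

    def translate(self, asn, vpn, page_table):
        key = (asn, vpn)
        if key not in self.entries:
            # TB miss: fill the entry from the current page table.
            self.entries[key] = page_table[vpn]
        return self.entries[key]


def stale_after_migration(reuse_asn):
    """Return True if the second processor translates through a stale TB entry."""
    page_table = {0x10: 0xA000}
    tb = ToyTB()
    old_asn = 5
    # The process ran on this processor earlier and left a TB entry behind.
    tb.translate(old_asn, 0x10, page_table)
    # While running elsewhere, the page table entry was rewritten.
    page_table[0x10] = 0xB000
    # Either reuse the old ASN or assign one that is unused on this processor.
    new_asn = old_asn if reuse_asn else 6
    return tb.translate(new_asn, 0x10, page_table) != page_table[0x10]
```

In this model, reusing the old ASN hits the stale entry, while the fresh ASN forces a TB miss and a fill from the updated page table.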

Multiple-Processor Send/Receive Interrupt
If one processor writes some shared data, then sends an interrupt to a second processor, and that
processor receives the interrupt, then accesses the shared data, the sequence from
Multiple-Processor Data Stream (Including Single Processor with DMA I/O) in this chapter
must be used:
First Processor

    Write data
    MB
    Send interrupt

        ==>

Second Processor

    Receive interrupt
    MB
    Access data

Leaving out the MB at the beginning of the interrupt-receipt routine makes the code fail if an old
value of the context remains in the second processor's cache and invalidates from the writes done
on the first processor are not delivered soon enough.
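The failure mode described above can be mimicked with a toy cache that queues invalidates until a barrier is executed. This is a sketch under stated assumptions: `ToyCache`, its `mb` method, and the single shared location `"data"` are hypothetical names used only to model the behavior, not a description of real hardware.

```python
class ToyCache:
    """Toy per-processor cache with a queue of not-yet-delivered invalidates."""

    def __init__(self, memory):
        self.memory = memory
        self.lines = {}     # cached copies: address -> value
        self.pending = []   # invalidates received but not yet applied

    def read(self, addr):
        if addr not in self.lines:
            self.lines[addr] = self.memory[addr]
        return self.lines[addr]

    def queue_invalidate(self, addr):
        self.pending.append(addr)

    def mb(self):
        # Model of MB: deliver all queued invalidates before later reads.
        for addr in self.pending:
            self.lines.pop(addr, None)
        self.pending.clear()


def receive_interrupt(use_mb):
    memory = {"data": 1}
    second = ToyCache(memory)
    second.read("data")              # an old value is now cached
    memory["data"] = 2               # first processor: write data, then MB
    second.queue_invalidate("data")  # invalidate sent but still queued
    # First processor sends the interrupt; second processor receives it.
    if use_mb:
        second.mb()
    return second.read("data")
```

Without the barrier the model returns the old value 1; with it, the queued invalidate is applied first and the read returns 2.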

Implications for Hardware
The coherency point for physical address x is the place in the memory subsystem at which
accesses to x are ordered. It may be at a main memory board, or at a cache containing x
exclusively, or at the point of winning a common bus arbitration.
The coherency point for x may move with time, as exclusive access to x migrates between main
memory and various caches.

MB and IMB force all preceding writes to at least reach their respective coherency points. This
does not mean that main-memory writes have been done, just that the order of the eventual writes
is committed. For example, on the XMI with retry, this means getting the writes acknowledged as
received with good parity at the inputs to memory board queues; the actual RAM write happens
later.
MB and IMB also force all queued cache invalidates to be delivered to the local caches before
starting any subsequent reads (that may otherwise cache hit on stale data) or writes (that may
otherwise write the cache, only to have the write effectively overwritten by a late-delivered
invalidate).
Implementations may allow reads of x to hit (by physical address) on pending writes in a write
buffer, even before the writes to x reach the coherency point for x. If this is done, it is still true
that no earlier value of x may subsequently be delivered to the processor that took the hit on the
write buffer value.
Virtual data caches are allowed to deliver data before doing address translation, but only if there
cannot be a pending write under a synonym virtual address. Lack of a write-buffer match on
untranslated address bits is sufficient to guarantee this.

Virtual data caches must invalidate or otherwise become coherent with the new value whenever a
PALcode routine is executed that affects the validity, fault behavior, protection behavior, or
virtual-to-physical mapping specified for one or more pages. Becoming coherent can be delayed
until the next subsequent MB instruction or TB fill (using the new mapping), if the implementation of the PALcode routine always forces a subsequent TB fill.

• Arithmetic Traps
Alpha implementations are allowed to execute multiple instructions concurrently and to forward
results from one instruction to another. Thus, when an arithmetic trap is detected, the PC may
have advanced an arbitrarily large number of instructions past the instruction T (calculating result
R) whose execution triggered the trap.
When the trap is detected, any or all of these subsequent instructions may run to completion
before the trap is actually taken. Instruction T and the set of instructions subsequent to T that
complete before the trap is taken are collectively called the trap shadow of T. The PC pushed on
the stack when the trap is taken is the PC of the first instruction past the trap shadow.
The instructions in the trap shadow of T may use the undefined result R of T, they may generate
additional traps, and they may completely change the PC (branches, JSR).
Thus, by the time a trap is taken, the PC pushed on the stack may bear no useful relationship to
the PC of the trigger instruction T, and the state visible to the programmer may have been
updated using the undefined result R. If an instruction in the trap shadow of T uses R to calculate
a subsequent register value, that register value is undefined, even though there may be no trap
associated with the subsequent calculation. Similarly:

• If an instruction in the trap shadow of T stores R or any subsequent undefined result, the stored
value is undefined.

• If an instruction in the trap shadow of T uses R or any subsequent undefined result as the basis of
a conditional or calculated branch, the branch target is undefined.

• If an instruction in the trap shadow of T uses R or any subsequent undefined result as the basis of
an address calculation, the memory address actually accessed is undefined.
Software that is intended to bound how far the PC may advance before taking a trap, or how far
an undefined result may propagate, must insert TRAPB instructions at appropriate points.
Software that is intended to continue from a trap by supplying a well-defined result R within an
arithmetic trap handler, can do so reliably by following the rules for software completion code
sequences given in Floating-Point Trapping Modes in Chapter 4.
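The propagation rules above can be sketched as a small taint-tracking function. The instruction tuples, register names, and the rule that a well-defined overwrite clears the undefined state are assumptions of this toy model, not architectural definitions.

```python
def propagate_undefined(shadow, undefined_reg):
    """Track which values a trap shadow leaves undefined (toy model).

    shadow: list of (op, dest, sources) for instructions after T that completed.
            For op == "store", dest names the memory location written.
    undefined_reg: the register holding the undefined result R of T.
    """
    undefined = {undefined_reg}
    undefined_stores = []
    for op, dest, sources in shadow:
        tainted = any(src in undefined for src in sources)
        if op == "store":
            if tainted:
                undefined_stores.append(dest)
        elif tainted:
            undefined.add(dest)
        else:
            # Assumption: a result computed only from defined inputs is defined.
            undefined.discard(dest)
    return undefined, undefined_stores


shadow = [
    ("add", "R2", ["R1", "R3"]),      # uses R, so R2 becomes undefined
    ("add", "R4", ["R5", "R6"]),      # independent of R
    ("store", "mem[0x100]", ["R2"]),  # stores an undefined value
    ("add", "R1", ["R5", "R6"]),      # overwrites R with a defined value
]
regs, stores = propagate_undefined(shadow, "R1")
```

A TRAPB placed before the second instruction would, in this model, correspond to truncating `shadow` at that point.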

Chapter 6 · Common PALcode Architecture

• PALcode
In a family of machines, both users and operating system implementors require functions to be
implemented consistently. When functions conform to a common interface, the code that uses
those functions can be used on several different implementations without modification.
These functions range from the binary encoding of the instruction and data to the exception
mechanisms and synchronization primitives. Some of these functions can be implemented
cost-effectively in hardware, but others are impractical to implement directly in hardware. These
functions include low-level hardware support functions such as Translation Buffer miss fill
routines, interrupt acknowledge, and vector dispatch. They also include support for privileged
and atomic operations that require long instruction sequences.
In the VAX, these functions are generally provided by microcode. This is not seen as a problem
because the VAX architecture lends itself to a microcoded implementation.
One of the goals of Alpha is that microcode will not be necessary for practical implementation.
However, it is still desirable to provide an architected interface to these functions that will be
consistent across the entire family of machines. The Privileged Architecture Library (PALcode)
provides a mechanism to implement these functions without resorting to a microcoded machine.

• PALcode Environment
The PALcode environment differs from the normal environment in the following ways:
• Complete control of the machine state.
• Interrupts are disabled.
• Implementation-specific hardware functions are enabled, as described below.
• I-stream memory management traps are prevented (by disabling I-stream mapping, mapping
PALcode with a permanent TB entry, or by other mechanisms).
Complete control of the machine state allows all functions of the machine to be controlled.
Disabling interrupts allows the system to provide multi-instruction sequences as atomic operations. Enabling implementation-specific hardware functions allows access to low-level system
hardware. Preventing I-stream memory management traps allows PALcode to implement memory
management functions such as Translation Buffer fill.

• Special Functions Required for PALcode
PALcode uses the Alpha instruction set for most of its operations. A small number of additional
functions are needed to implement the PALcode. There are five opcodes reserved to implement
PALcode functions: PALRES0, PALRES1, PALRES2, PALRES3, and PALRES4. These instructions
produce an Illegal Instruction Trap if executed outside the PALcode environment.
• PALcode needs a mechanism to save the current state of the machine and dispatch into PALcode.
• PALcode needs a set of instructions to access hardware control registers.
• PALcode needs a hardware mechanism to transition the machine from the PALcode environment
to the non-PALcode environment. This mechanism loads the PC, enables interrupts, enables
mapping, and disables PALcode privileges.
An Alpha implementation may also choose to provide additional functions to simplify or improve
performance of some PALcode functions. The following are some examples:
• An Alpha implementation may include a read/write virtual function that allows PALcode to
perform mapped memory accesses using the mapping hardware rather than providing the virtual-to-physical translation in PALcode routines. PALcode may provide a special function to do
physical reads and writes and have the Alpha loads and stores continue to operate on virtual
addresses in the PALcode environment.
• An Alpha implementation may include hardware assists for various functions, for example,
saving the virtual address of a reference on a memory management error rather than having to
generate it by simulating the effective address calculation in PALcode.
• An Alpha implementation may include private registers so it can function without having to save
and restore the native general registers.

• PALcode Effects on System Code
PALcode will have one effect on system code. Because PALcode may be resident in main memory
and maintain privileged data structures in main memory, the operating system code that allocates
physical memory cannot use all of physical memory.
The amount of memory PALcode requires is small, so the loss to the system is negligible.

• PALcode Replacement
Alpha systems are required to support the replacement of Digital-supplied PALcode with an
operating system-specific version. The following functions must be implemented in PALcode, not
directly in hardware, to facilitate replacement with different versions.

1. Translation Buffer fill. Different operating systems will want to replace the Translation Buffer
(TB) fill routines. The replacement routines will use different data structures. Therefore, no
portion of the TB fill flow that would change with a change in page tables may be placed in
hardware, unless it is placed in a manner that can be overridden by PALcode.

2. Process structure. Different operating systems might want to replace the process context
switch routines. The replacement routines will use different data structures. Therefore, no
portion of the context switching flows that would change with a change in process structure
may be placed in hardware.

PALcode must be written in a modular manner that facilitates easy replacement of major
subsections. The subsections that need to be simple to replace are:
• Translation Buffer fill
• Process structure and context switch
• Interrupt and exception frame format and routine dispatch
• Privileged PALcode instructions

• Required PALcode Instructions
The PALcode instructions listed in Table 6-1 and described in the following sections must be
supported by all Alpha implementations:
Table 6-1 · Required PALcode Instructions

Mnemonic    Type            Operation

HALT        Privileged      Halt processor
IMB         Unprivileged    I-stream memory barrier

Halt

Format:

CALL_PAL HALT                    !PALcode format

Operation:

IF PS<CM> NE 0 THEN
    {privileged instruction exception}
CASE {halt_action} OF
    halt:               {halt}
    restart/halt:       {restart/halt}
    restart/boot/halt:  {restart/boot/halt}
    boot/halt:          {boot/halt}
ENDCASE

Exceptions:

Privileged Instruction

Instruction mnemonics:

CALL_PAL HALT                    Halt Processor

Description:

The HALT instruction stops normal instruction processing, and depending on the HALT action
setting, the processor may either enter console mode or the restart sequence.
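The Operation description for HALT can be read as a privilege check followed by a dispatch on the halt action. The following is a minimal sketch of that reading; the function name, the exception class, and the use of strings for the halt action are hypothetical and chosen only for illustration.

```python
class PrivilegedInstructionException(Exception):
    """Models the privileged instruction exception (hypothetical class)."""


# The four halt actions named in the Operation description.
HALT_ACTIONS = ("halt", "restart/halt", "restart/boot/halt", "boot/halt")


def call_pal_halt(ps_cm, halt_action):
    """Model of CALL_PAL HALT: check PS<CM>, then select the halt action."""
    if ps_cm != 0:
        # IF PS<CM> NE 0 THEN {privileged instruction exception}
        raise PrivilegedInstructionException("PS<CM> is nonzero")
    if halt_action not in HALT_ACTIONS:
        raise ValueError(f"unknown halt action: {halt_action!r}")
    # CASE {halt_action} OF ... ENDCASE: the model just reports the action taken.
    return halt_action
```

For example, `call_pal_halt(0, "boot/halt")` returns `"boot/halt"`, while any nonzero current mode raises the exception before the halt action is examined.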

Instruction Memory Barrier

Format:

CALL_PAL IMB                     !PALcode format

Operation:

{Make instruction stream coherent with data stream}

Exceptions:

None

Instruction mnemonics:

CALL_PAL IMB                     I-stream Memory Barrier

Description:

An IMB instruction must be executed after software or I/O devices write into the instruction
stream or modify the instruction stream virtual address mapping, and before the new value is
fetched as an instruction. An implementation may contain an instruction cache that does not
track either processor or I/O writes into the instruction stream. The instruction cache and
memory are made coherent by an IMB instruction.

If the instruction stream is modified and an IMB is not executed before fetching an instruction
from the modified location, it is UNPREDICTABLE whether the old or new value is fetched.
The cache coherency and sharing rules are described in Chapter 5.
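A toy instruction cache that does not track writes shows why the barrier is needed. This is only a sketch: `ToyICache`, the address 0x1000, and the instruction strings are hypothetical, and the model always returns the old value without the barrier, whereas the architecture says only that the result is UNPREDICTABLE.

```python
class ToyICache:
    """Toy instruction cache that does not observe writes to memory."""

    def __init__(self, memory):
        self.memory = memory
        self.lines = {}

    def fetch(self, pc):
        if pc not in self.lines:
            self.lines[pc] = self.memory[pc]
        return self.lines[pc]

    def imb(self):
        # Model of IMB: discard cached instructions so later fetches see memory.
        self.lines.clear()


def patch_and_fetch(use_imb):
    memory = {0x1000: "old_insn"}
    icache = ToyICache(memory)
    icache.fetch(0x1000)          # instruction is now cached
    memory[0x1000] = "new_insn"   # software writes into the instruction stream
    if use_imb:
        icache.imb()
    return icache.fetch(0x1000)
```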

Chapter 7 · Console Subsystem Overview

On an Alpha system, underlying control of the system platform hardware is provided by a
console. The console:
1. Initializes, tests, and prepares the system platform hardware for Alpha system software.
2. Bootstraps (loads into memory and starts the execution of) system software.
3. Controls and monitors the state and state transitions of each processor in a multiprocessor
system.
4. Provides services to system software that simplify system software control of and access to
platform hardware.
5. Provides a means for a console operator to monitor and control the system.
The console interacts with system platform hardware to accomplish the first three tasks. The
actual mechanisms of these interactions are specific to the platform hardware; however, the net
effects are common to all systems.
The console interacts with system software once control of the system platform hardware has
been transferred to that software.
The console interacts with the console operator through a virtual display device or console
terminal. The console operator may be a human being or a management application.

Chapter 8 · Alpha VMS

The following sections specify the Privileged Architecture Library (PALcode) instructions that
are required to support an Alpha VMS system.

• Unprivileged VMS PALcode Instructions
The unprivileged PALcode instructions provide support for system operations to all modes of
operation (Kernel, Executive, Supervisor, and User).
Table 8-1 describes the unprivileged VMS PALcode instructions.
Table 8-1 • Unprivileged VMS PALcode Instruction Summary
Mnemonic

Operation and Description

BPT

Breakpoint
The BPT instruction is provided for program debugging. It switches the
processor to Kernel mode and pushes R2..R7, the updated PC, and PS on the
Kernel stack. It then dispatches to the address in the Breakpoint vector, stored
in a control block.

BUGCHK

Bugcheck
The BUGCHK instruction is provided for error reporting. It switches the
processor to Kernel mode and pushes R2..R7, the updated PC, and PS on the
Kernel stack. It then dispatches to the address in the Bugcheck vector, stored
in a control block.

CHME

Change mode to Executive
The CHME instruction allows a process to change its mode in a controlled
manner.
A change in mode also results in a change of stack pointers: the old pointer is
saved, the new pointer is loaded. Registers R2..R7, PS, and PC are pushed onto
the selected stack. The saved PC addresses the instruction following the CHME
instruction.

CHMK

Change mode to Kernel
The CHMK instruction allows a process to change its mode to Kernel in a
controlled manner.
A change in mode also results in a change of stack pointers: the old pointer is
saved, the new pointer is loaded. R2..R7, PS, and PC are pushed onto the
Kernel stack. The saved PC addresses the instruction following the CHMK
instruction.

CHMS

Change mode to Supervisor

The CHMS instruction allows a process to change its mode in a controlled
manner.
A change in mode also results in a change of stack pointers: the old pointer is
saved, the new pointer is loaded. R2..R7, PS, and PC are pushed onto the
selected stack. The saved PC addresses the instruction following the CHMS
instruction.

CHMU

Change mode to User
The CHMU instruction allows a process to call a routine via the change mode
mechanism.

R2..R7, PS, and PC are pushed onto the current stack. The saved PC addresses
the instruction following the CHMU instruction.

GENTRAP

Generate trap
The GENTRAP instruction is provided for reporting runtime software conditions.
It switches the processor to Kernel mode and pushes registers R2..R7, the
updated PC, and the PS on the Kernel stack. It then dispatches to the address
of the GENTRAP vector, stored in a control block.

IMB

I-stream memory barrier
The IMB instruction ensures that the contents of an instruction cache are
coherent after the instruction stream has been modified by software or I/O
devices.

If the instruction stream is modified and an IMB is not executed before
fetching an instruction from the modified location, it is UNPREDICTABLE
whether the old or new value is fetched.

INSQHIL

Insert into longword queue at header, interlocked
The entry specified in R17 is inserted into the self-relative queue following the
header specified in R16. The insertion is a noninterruptible operation. The
insertion is interlocked to prevent concurrent interlocked insertions or removals at the head or tail of the same queue by another process, in a multiprocessor
environment.

INSQHILR

Insert into longword queue at header, interlocked resident
The entry specified in R17 is inserted into the self-relative queue following the
header specified in R16. The insertion is a noninterruptible operation. The
insertion is interlocked to prevent concurrent interlocked insertions or removals at the head or tail of the same queue by another process, in a multiprocessor
environment.
This instruction requires that the queue be memory-resident and that the
queue header and elements are quadword-aligned.

INSQHIQ

Insert into quadword queue at header, interlocked
The entry specified in R17 is inserted into the self-relative queue following the
header specified in R16. The insertion is a noninterruptible operation. The
insertion is interlocked to prevent concurrent interlocked insertions or removals at the head or tail of the same queue by another process, in a multiprocessor
environment.

INSQHIQR

Insert into quadword queue at header, interlocked resident
The entry specified in R17 is inserted into the self-relative queue following the
header specified in R16. The insertion is a noninterruptible operation. The
insertion is interlocked to prevent concurrent interlocked insertions or removals at the head or tail of the same queue by another process, in a multiprocessor
environment.
This instruction requires that the queue be memory-resident and that the
queue header and elements are octaword-aligned.

INSQTIL

Insert into longword queue at tail, interlocked
The entry specified in R17 is inserted into the self-relative queue preceding the
header specified in R16. The insertion is a noninterruptible operation. The
insertion is interlocked to prevent concurrent interlocked insertions or removals at the head or tail of the same queue by another process, in a multiprocessor
environment.

INSQTILR

Insert into longword queue at tail, interlocked resident
The entry specified in R17 is inserted into the self-relative queue preceding the
header specified in R16. The insertion is a noninterruptible operation. The
insertion is interlocked to prevent concurrent interlocked insertions or removals at the head or tail of the same queue by another process, in a multiprocessor
environment.
This instruction requires that the queue be memory-resident and that the
queue header and elements are quadword-aligned.

INSQTIQ

Insert into quadword queue at tail, interlocked
The entry specified in R17 is inserted into the self-relative queue preceding the
header specified in R16. The insertion is a noninterruptible operation. The
insertion is interlocked to prevent concurrent interlocked insertions or removals at the head or tail of the same queue by another process, in a multiprocessor
environment.

INSQTIQR

Insert into quadword queue at tail, interlocked resident
The entry specified in R17 is inserted into the self-relative queue preceding the
header specified in R16. The insertion is a noninterruptible operation. The
insertion is interlocked to prevent concurrent interlocked insertions or removals at the head or tail of the same queue by another process, in a multiprocessor
environment.
This instruction requires that the queue be memory-resident and that the
queue header and elements are octaword-aligned.

INSQUEL

Insert into longword queue
The entry specified in R17 is inserted into the absolute queue following the
entry specified by the predecessor addressed by R16 for INSQUEL, or following
the entry specified by the contents of the longword addressed by R16 for
INSQUEL/D. The insertion is a noninterruptible operation.

INSQUEQ

Insert into quadword queue
The entry specified in R17 is inserted into the absolute queue following the
entry specified by the predecessor addressed by R16 for INSQUEQ, or following the entry specified by the contents of the quadword addressed by R16 for
INSQUEQ/D. The insertion is a noninterruptible operation.

PROBE

Probe read/write access

PROBE checks the read (PROBER) or write (PROBEW) accessibility of the first
and last byte specified by the base address and the signed offset; the bytes in
between are not checked. System software must check all pages between the
two bytes if they are to be accessed.
PROBE is only intended to check a single datum for accessibility.

RD_PS

Read processor status
RD_PS writes the Processor Status (PS) to register R0.

READ_UNQ

Read unique context

READ_UNQ reads the hardware process (thread) unique context value, if
previously written by WRITE_UNQ, and places that value in R0.

REI

Return from exception or interrupt

The PS, PC, and saved R2..R7 are popped from the current stack and held in
temporary registers. The new PS is checked for validity and consistency. If it is
valid and consistent, the current stack pointer is then saved and a new stack
pointer is selected. Registers R2 through R7 are restored by using the saved
values held in the temporary registers. A check is made to determine if an AST
or interrupt is pending.

If the enabling conditions are present for an interrupt or AST at the completion
of this instruction, the interrupt or AST occurs before the next instruction.

REMQHIL

Remove from longword queue at header, interlocked
The self-relative queue entry following the header, pointed to by R16, is removed
from the queue, and the address of the removed entry is returned in R1. The
removal is interlocked to prevent concurrent interlocked insertions or removals
at the head or tail of the same queue by another process, in a multiprocessor
environment. The removal is a noninterruptible operation.

REMQHILR

Remove from longword queue at header, interlocked resident
The queue entry following the header, pointed to by R16, is removed from the
self-relative queue, and the address of the removed entry is returned in R1. The
removal is interlocked to prevent concurrent interlocked insertions or removals
at the head or tail of the same queue by another process, in a multiprocessor
environment. The removal is a noninterruptible operation.
This instruction requires that the queue be memory-resident and that the
queue header and elements are quadword-aligned.

REMQHIQ

Remove from quadword queue at header, interlocked
The self-relative queue entry following the header, pointed to by R16, is removed
from the queue, and the address of the removed entry is returned in R1. The
removal is interlocked to prevent concurrent interlocked insertions or removals
at the head or tail of the same queue by another process, in a multiprocessor
environment. The removal is a noninterruptible operation.

REMQHIQR

Remove from quadword queue at header, interlocked resident
The queue entry following the header, pointed to by R16, is removed from the
self-relative queue, and the address of the removed entry is returned in R1. The
removal is interlocked to prevent concurrent interlocked insertions or removals
at the head or tail of the same queue by another process, in a multiprocessor
environment. The removal is a noninterruptible operation.
This instruction requires that the queue be memory-resident and that the
queue header and elements are octaword-aligned.

REMQTIL

Remove from longword queue at tail, interlocked
The queue entry preceding the header, pointed to by R16, is removed from the
self-relative queue, and the address of the removed entry is returned in R1. The
removal is interlocked to prevent concurrent interlocked insertions or removals
at the head or tail of the same queue by another process, in a multiprocessor
environment. The removal is a noninterruptible operation.

REMQTILR

Remove from longword queue at tail, interlocked resident
The queue entry preceding the header, pointed to by R16, is removed from the
self-relative queue, and the address of the removed entry is returned in R1. The
removal is interlocked to prevent concurrent interlocked insertions or removals
at the head or tail of the same queue by another process, in a multiprocessor
environment. The removal is a noninterruptible operation.
This instruction requires that the queue be memory-resident and that the
queue header and elements are quadword-aligned.

REMQTIQ

Remove from quadword queue at tail, interlocked
The self-relative queue entry preceding the header, pointed to by R16, is
removed from the queue, and the address of the removed entry is returned in R1.
The removal is interlocked to prevent concurrent interlocked insertions or
removals at the head or tail of the same queue by another process, in a
multiprocessor environment. The removal is a noninterruptible operation.

REMQTIQR

Remove from quadword queue at tail, interlocked resident
The queue entry preceding the header, pointed to by R16, is removed from the
self-relative queue, and the address of the removed entry is returned in R1. The
removal is interlocked to prevent concurrent interlocked insertions or removals
at the head or tail of the same queue by another process, in a multiprocessor
environment. The removal is a noninterruptible operation.
This instruction requires that the queue be memory-resident and that the
queue header and elements are octaword-aligned.

REMQUEL

Remove from longword queue
The queue entry addressed by R16 for REMQUEL or the entry addressed by the
longword addressed by R16 for REMQUEL/D is removed from the longword
absolute queue, and the address of the removed entry is returned in R1. The
removal is a noninterruptible operation.

REMQUEQ

Remove from quadword queue

The queue entry addressed by R16 for REMQUEQ or the entry addressed by
the quadword addressed by R16 for REMQUEQ/D is removed from the
quadword absolute queue, and the address of the removed entry is
returned in R1. The removal is a noninterruptible operation.

RSCC

Read system cycle counter
Register R0 is written with the value of the system cycle counter. This counter
is an unsigned 64-bit integer that increments at the same rate as the process
cycle counter.
The system cycle counter is suitable for timing a general range of intervals to
within 10% error and may be used for detailed performance characterization.

SWASTEN

Swap AST enable
SWASTEN swaps the AST enable bit for the current mode. The new state for
the enable bit is supplied in register R16<0> and the previous state of the enable
bit is returned, zero-extended, in R0.

A check is made to determine if an AST is pending. If the enabling conditions
are present for an AST at the completion of this instruction, the AST occurs
before the next instruction.

WRITE_UNQ

Write unique context
WRITE_UNQ writes the hardware process (thread) unique context value
passed in R16 to internal storage or to the hardware privileged context block.

WR_PS_SW

Write processor status software field
WR_PS_SW writes the Processor Status software field (PS<SW>) with the
low-order three bits of R16 (R16<2:0>).
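The self-relative queues used by the INSQxIx and REMQxIx instructions in Table 8-1 store each link as a displacement from the entry that holds it, rather than as an absolute address. The sketch below models that link arithmetic only. The dictionary used as memory, the function names, and the addresses are assumptions for illustration; interlocking, alignment checks, and register conventions are not modeled.

```python
def make_queue(memory, header):
    """Create an empty queue: both links of the header are zero."""
    memory[header] = [0, 0]  # [forward link, backward link], self-relative


def insert_head(memory, header, entry):
    """Insert entry immediately following the header."""
    succ = header + memory[header][0]
    memory[entry] = [succ - entry, header - entry]
    memory[succ][1] = entry - succ
    memory[header][0] = entry - header


def insert_tail(memory, header, entry):
    """Insert entry immediately preceding the header."""
    pred = header + memory[header][1]
    memory[entry] = [header - entry, pred - entry]
    memory[pred][0] = entry - pred
    memory[header][1] = entry - header


def remove_head(memory, header):
    """Remove and return the entry following the header, or None if empty."""
    entry = header + memory[header][0]
    if entry == header:
        return None
    succ = entry + memory[entry][0]
    memory[header][0] = succ - header
    memory[succ][1] = header - succ
    return entry


memory = {}
make_queue(memory, 0x100)
insert_head(memory, 0x100, 0x140)
insert_tail(memory, 0x100, 0x180)
insert_head(memory, 0x100, 0x120)
removed = [remove_head(memory, 0x100) for _ in range(4)]
```

Because every link is relative, a queue built this way stays valid if the whole structure is mapped at a different base address, which is one reason the form suits data shared between address spaces.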

• Privileged VMS PALcode Instructions
The privileged PALcode instructions can be called in Kernel mode only.
Table 8-2 describes the privileged VMS PALcode instructions.
Table 8-2 · Privileged VMS PALcode Instructions Summary
Mnemonic

Operation and Description

CFLUSH

Cache flush
At least the entire physical page specified by a page frame number in R16 is
flushed from any data caches associated with the current processor. After doing
a CFLUSH, the first subsequent load on the same processor to an arbitrary
address in the target page is fetched from physical memory.

DRAINA

Drain aborts
DRAINA stalls instruction issuing until all prior instructions are guaranteed to
complete without incurring aborts.

HALT

Halt processor
The HALT instruction stops normal instruction processing.

LDQP

Load quadword physical
The quadword-aligned memory operand, whose physical address is in R16, is
fetched and written to R0.

If the operand address in R16 is not quadword-aligned, the result is
UNPREDICTABLE.

MFPR

Move from processor register
The internal processor register specified by the PALcode function field is
written to R0.

MTPR

Move to processor register
The source operands in integer registers R16 (and R17, reserved for future use)
are written to the internal processor register specified by the PALcode function
field. The effect of loading a processor register is guaranteed to be active on the
next instruction.

STQP

Store quadword physical
The quadword contents of R17 are written to the memory location whose
physical address is in R16.

If the operand address in R16 is not quadword-aligned, the result is
UNPREDICTABLE.

SWPCTX

Swap privileged context
The SWPCTX instruction returns ownership of the data structure that contains
the current hardware privileged context (the HWPCB) to the operating system
and passes ownership of the new HWPCB to the processor.

Chapter 9 · Alpha OSF/1

The following sections specify the Privileged Architecture Library (PALcode) instructions that
are required to support an Alpha OSF/1 system.

• Unprivileged OSF/1 PALcode Instructions
Table 9-1 describes the unprivileged OSF/1 PALcode instructions.
Table 9-1 · Unprivileged OSF/1 PALcode Instruction Summary
Mnemonic

Operation and Description

bpt

Break Point Trap
The bpt instruction switches mode to Kernel, builds a stack frame on the
Kernel stack, and dispatches to the breakpoint code.

bugchk

Bugcheck
The bugchk instruction switches mode to Kernel, builds a stack frame on the
Kernel stack, and dispatches to the breakpoint code.

callsys

System Call
The callsys instruction switches mode to Kernel, builds a callsys stack frame,
and dispatches to the system call code.

gentrap

Generate Trap
The gentrap instruction switches mode to Kernel, builds a stack frame on the
Kernel stack, and dispatches to the gentrap code.

imb

I-Stream Memory Barrier
The imb instruction makes the I-cache coherent with main memory.

rdunique

Read Unique
The rdunique instruction returns the process unique value.

wrunique

Write Unique
The wrunique instruction sets the process unique register.

• Privileged OSF/1 PALcode Instructions
The privileged PALcode instructions can be called only from Kernel mode. They provide an
interface to control the privileged state of the machine.
Table 9-2 describes the privileged OSF/1 PALcode instructions.
Table 9-2 · Privileged OSF/1 PALcode Instruction Summary
Mnemonic

Operation and Description

halt

Halt Processor
The halt instruction stops normal instruction processing. Depending on the
halt action setting, the processor can either enter console mode or the restart
sequence.

rdps

Read Processor Status
The rdps instruction returns the current PS.

rdusp

Read User Stack Pointer
The rdusp instruction reads the User stack pointer while in Kernel mode and
returns it.

rdval

Read System Value
The rdval instruction reads a 64-bit per-processor value and returns it.

retsys

Return from System Call
The retsys instruction pops the return address, the User stack pointer, and the
User global pointer from the Kernel stack. It then saves the Kernel stack
pointer, sets mode to User, enables interrupts, and jumps to the address
popped off the stack.

rti

Return from Trap, Fault or Interrupt
The rti instruction pops certain registers from the Kernel stack. If the new
mode is User, the Kernel stack is saved and the User stack restored.

swpctx

Swap Privileged Context
The swpctx instruction saves the current process data in the current process
control block (PCB). Then swpctx switches to the PCB and loads the new
process context.

swpipl

Swap IPL
The swpipl instruction returns the current value of the IPL and sets the IPL.

tbi

TB invalidate
The tbi instruction removes entries from the instruction and data translation
buffers when the mapping entries change.

whami

Who Am I
The whami instruction returns the processor number for the current processor.
The processor number is in the range 0 to the number of processors minus one
(0..numproc-1) that can be configured in the system.

wrent

Write System Entry Address
The wrent instruction sets the virtual address of the system entry points.

wrfen

Write Floating-Point Enable
The wrfen instruction writes a bit to the floating-point enable register.

wrkgp

Write Kernel Global Pointer
The wrkgp instruction writes the Kernel global pointer internal register.

wrusp

Write User Stack Pointer
The wrusp instruction writes a value to the User stack pointer while in Kernel
mode.

wrval

Write System Value
The wrval instruction writes a 64-bit per-processor value.

wrvptptr

Write Virtual Page Table Pointer
The wrvptptr instruction writes a pointer to the virtual page table pointer
(vptptr).

Appendix A · Software Considerations

• Hardware-Software Compact
The Alpha architecture, like all RISC architectures, depends on careful attention to data alignment
and instruction scheduling to achieve high performance.
Since there will be various implementations of the Alpha architecture, it is not obvious how
compilers can generate high-performance code for all implementations. This chapter gives some
scheduling guidelines that, if followed by all compilers and respected by all implementations, will
result in good performance. As such, this section represents a good-faith compact between
hardware designers and software writers. It represents a set of common goals, not a set of
architectural requirements. Thus, an Appendix, not a Chapter.
Many of the performance optimizations discussed below are advantageous only for frequently
executed code. For rarely executed code, they may produce a bigger program that is not any
faster. Some of the branching optimizations also depend on good prediction of which path from a
conditional branch is more frequently executed. These optimizations are best done by using an
execution profile, either an estimate generated by compiler heuristics, or a real profile of a
previous run, such as that gathered by PC-sampling in PCA.
Each computer architecture has a "natural word size." For the PDP-11, it is 16 bits; for VAX,
32 bits; and for Alpha, 64 bits. Other architectures also have a natural word size that varies
between 16 and 64 bits. Except for very low-end implementations, ALU data paths, cache access
paths, chip pin buses, and main memory data paths are all usually the natural word size.
As an architecture becomes commercially successful, high-end implementations inevitably move
to double-width data paths that can transfer an aligned (at an even natural word address) pair of
natural words in one cycle. For Alpha, this means eventual 128-bit wide data paths. It is hard to
get much speed advantage from paired transfers unless the code being executed has instructions
and data appropriately aligned on aligned octaword boundaries. Since this is hard to retrofit to
old code, the following sections sometimes encourage "over-aligning" to octaword boundaries in
anticipation of high-speed Alpha implementations.
In some cases, there are performance advantages in aligning instructions or data to cache-block
boundaries, or putting data whose use is correlated into the same cache block, or trying to avoid
cache conflicts by not having data whose use is correlated placed at addresses that are equal
modulo the cache size. Since the Alpha architecture will have many implementations, an exact
cache design cannot be outlined here. Nonetheless, some expected bounds can be stated.

1. Small (first-level) cache sizes will likely be in the range 2 KB to 64 KB
2. Small cache block sizes will likely be 16, 32, 64, or 128 bytes
3. Large (second- or third-level) cache sizes will likely be in the range 128 KB to 8 MB
4. Large cache block sizes will likely be 32, 64, 128, or 256 bytes
5. TB sizes will likely be in the range 16 to 1024 entries

Thus, if two data items need to go in different cache blocks, it is desirable to make them at least
128 bytes apart (modulo 2 KB). Doing that creates a high probability of allowing both items to be
in a small cache simultaneously, for all Alpha implementations.
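
The 128-byte rule can be checked mechanically. The sketch below is only an illustration: the function name is hypothetical, and it assumes that "apart (modulo 2 KB)" means the circular distance between the two addresses once both are reduced modulo 2048.

```python
SMALL_CACHE_BYTES = 2 * 1024   # smallest first-level cache size expected above
MAX_SMALL_BLOCK = 128          # largest small-cache block size expected above

def far_enough_apart(addr_a, addr_b):
    """Return True if two byte addresses are at least 128 bytes apart,
    measured as circular distance modulo 2 KB (an assumed reading of the rule)."""
    d = (addr_a - addr_b) % SMALL_CACHE_BYTES
    return min(d, SMALL_CACHE_BYTES - d) >= MAX_SMALL_BLOCK
```

Two items exactly 2 KB apart fail the check, because they would map to the same location in a 2 KB direct-mapped cache.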
In each case below, the performance implication is given by an order-of-magnitude number: 1, 3,
10, 30, or 100. A factor of 10 means that the performance difference being discussed will likely
range from 3 to 30 across all Alpha implementations.

• Instruction-Stream Considerations
The following sections describe considerations for the instruction stream.

Instruction Alignment
Code PSECTs should be octaword-aligned. Targets of frequently taken branches should be at
least quadword-aligned, and octaword-aligned for very frequent loops. Compilers could use
execution profiles to identify frequently taken branches.
Most Alpha implementations will fetch aligned quadwords of instruction stream (two instructions), and many will waste an instruction-issue cycle on a branch to an odd longword. High-end
implementations may eventually fetch aligned octawords, and waste up to 3 issue cycles on a
branch to an odd longword. Some implementations may only be able to fetch wide chunks of
instructions every other CPU cycle. Fetching four instructions from an aligned octaword can get
at most one cache miss, while fetching them from an odd longword address can get 2 or even
3 cache misses.
Quadword I-fetch implementors should give first priority to executing aligned quadwords
quickly. Octaword-fetch implementors should give first priority to executing aligned octawords
quickly, and second priority to executing aligned quadwords quickly. Dual-issue implementations
should give first priority to issuing both halves of an aligned quadword in one cycle, and second
priority to buffering and issuing other combinations.

Multiple Instruction Issue-Factor of 3
Some Alpha implementations will issue multiple instructions in a single cycle. To improve the
odds of multiple-issue, compilers should choose pairs of instructions to put in aligned quadwords.
Pick one from column A and one from column B (but only a total of one load/store/branch per
pair).

Column A               Column B

Integer Operate        Floating Operate
Floating Load/Store    Integer Load/Store
Floating Branch        Integer Branch
                       BR/BSR/JSR

Implementors of multiple-issue machines should give first priority to dual-issuing at least the
above pairs, and second priority to multiple-issue of other combinations.
In general, the above rules will give a good hardware-software match, but compilers may want to
implement model-specific switches to generate code tuned more exactly to a specific
implementation.
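
The pairing rule can be written as a small predicate. This is a sketch of the guideline only, not a model of any implementation; the class names are hypothetical labels for the Column A and Column B entries, and it assumes the order of the two instructions within the quadword does not matter.

```python
COLUMN_A = {"integer operate", "floating load/store", "floating branch"}
COLUMN_B = {"floating operate", "integer load/store", "integer branch", "BR/BSR/JSR"}

# Classes that count toward the "one load/store/branch per pair" limit.
MEM_OR_BRANCH = {
    "floating load/store", "integer load/store",
    "floating branch", "integer branch", "BR/BSR/JSR",
}

def can_dual_issue(first, second):
    """True if the pair takes one class from each column and contains
    at most one load, store, or branch in total."""
    one_from_each = (
        (first in COLUMN_A and second in COLUMN_B)
        or (first in COLUMN_B and second in COLUMN_A)
    )
    if not one_from_each:
        return False
    return (first in MEM_OR_BRANCH) + (second in MEM_OR_BRANCH) <= 1
```

For example, an integer operate pairs with an integer load, but a floating load does not pair with an integer load, because that pair would contain two memory instructions.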

Branch Prediction and Minimizing Branch-Taken-Factor of 3
In many Alpha implementations, an unexpected change in I-stream address will result in about 10
lost instruction times. "Unexpected" may mean any branch-taken or may mean a mispredicted
branch. In many implementations, even a correctly predicted branch to a quadword target
address will be slower than straight-line code.
Compilers should follow these rules to minimize unexpected branches:
1. Implementations will predict all forward conditional branches as not-taken, and all backward
conditional branches as taken. Based on execution profiles, compilers should physically rearrange code so that it has matching behavior.

2. Make basic blocks as big as possible. A good goal is 20 instructions on average between
branch-taken. This means unrolling loops so that they contain at least 20 instructions, and
putting subroutines of less than 20 instructions directly in line. It also means using execution
profiles to rearrange code so that the frequent case of a conditional branch falls through. For
very high-performance loops, it will be profitable to move instructions across conditional
branches to fill otherwise wasted instruction issue slots, even if the instructions moved will not
always do useful work. Note that the Conditional Move instructions can sometimes be used to
avoid breaking up basic blocks.
3. In an if-then-else construct whose execution profile is skewed even slightly away from 50%-50% (51-49 is enough), put the infrequent case completely out of line, so that the frequent
case encounters zero branch-takens, and the infrequent case encounters two branch-takens. If
the infrequent case is rare (5%), put it far enough away that it never comes into the I-cache. If
the infrequent case is extremely rare (error message code), put it on a page of rarely executed
code and expect that page never to be paged in.
4. There are two functionally identical branch-format opcodes, BSR and BR.
[Figure: Branch Format layout of the BSR and BR instructions. Field boundaries fall at
bits 31, 26/25, and 21/20; the low-order field is the Displacement.]

Compilers should use the first one for subroutine calls, and the second for GOTOs. Some
implementations may push a stack of predicted return addresses for BSR and not push the
stack for BR. Failure to compile the correct opcode will result in mispredicted return
addresses, and hence make subroutine returns slow.

5. The memory-format JSR instruction has 16 unused bits. These should be used by the compilers
to communicate a hint about expected branch-target behavior (see Chapter 4):
[Figure: Memory Format layout of the JSR instruction, with a field boundary at bits 16/15.]

If the JSR is used for a computed GOTO or a CASE statement, compile bits <15:14> as 00, and
bits <13:0> such that (updated PC+Instr<13:0>*4)<15:0> equals (likely_target_addr)<15:0>.
In other words, pick the low 14 bits so that a normal PC+displacement*4 calculation will
match the low 16 bits of the most likely target longword address. (Implementations will likely
prefetch from the matching cache block.)
If the JSR is used for a computed subroutine call, compile bits <15:14> as 01, and bits <13:0>
as above. Some implementations will prefetch the call target using the prediction and also push
updated PC on a return-prediction stack.

If the JSR is used as a subroutine return, compile bits <15:14> as 10. Some implementations
will pop an address off a return-prediction stack.

If the JSR is used as a coroutine linkage, compile bits <15:14> as 11. Some implementations
will pop an address off a return-prediction stack and also push updated PC on the
return-prediction stack.
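
The hint encoding in rule 5 can be sketched as follows. The function and the kind names ("jump", "call", "return", "coroutine") are hypothetical. The text gives no rule for bits <13:0> of a return or coroutine linkage, so this sketch sets them to zero as an assumption. Both addresses are assumed to be longword-aligned.

```python
HINT_KIND = {"jump": 0b00, "call": 0b01, "return": 0b10, "coroutine": 0b11}

def jsr_hint(kind, updated_pc=0, likely_target=0):
    """Build the 16-bit hint field of a memory-format JSR."""
    top = HINT_KIND[kind] << 14
    if kind in ("jump", "call"):
        # Choose disp so that (updated_pc + disp*4)<15:0> == likely_target<15:0>.
        disp = ((likely_target - updated_pc) >> 2) & 0x3FFF
    else:
        disp = 0  # assumption: low bits are not specified for these kinds
    return top | disp

def predicted_low16(updated_pc, hint):
    """Low 16 bits of the address implied by the hint displacement."""
    return (updated_pc + (hint & 0x3FFF) * 4) & 0xFFFF
```

Only the low 16 bits of the target are recovered, which is enough to select a cache block to prefetch but not to name the full target address.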
Implementors should give first priority to executing straight-line code with no branch-takens as
quickly as possible, second priority to predicting conditional branches based on the sign of the
displacement field (backward taken, forward not-taken), and third priority to predicting subroutine return addresses by running a small prediction stack. (VAX traces show a stack of 2 to 4
entries correctly predicts most branches.)

Improving I-Stream Density-Factor of 3
Compilers should try to use profiles to make sure almost 100 percent of the bytes brought into an
I-cache are actually executed. This means aligning branch targets and putting rarely executed
code out of line. Doing so would consistently make an I-cache appear about two times larger,
compared to current VAX practice.
The example below shows the bytes actually brought into a VAX cache (from part of an address
trace of a DLINPAC). The dots represent bytes brought into the cache but never executed. They
occupy about half of the cache.
Each line shows the use of an aligned 64-byte I-cache block. A portion of DLINPAC and a portion
of VMS 4.x are shown. Uppercase I is the first byte of an instruction, and lowercase i marks
subsequent bytes. Period (.) shows a byte brought into the cache but never executed.

[Figure: I-fetch trace, one row per aligned 64-byte I-cache block (Byte 0 through Byte 63).
Rows cover nine blocks at addresses 000268C0 through 00026AC0 and twelve blocks at
addresses 80004440 through 80008A80.]

Instruction Scheduling-Factor of 3
The performance of Alpha programs will be sensitive to how carefully the code is scheduled to
minimize instruction-issue delays.
"Result latency" is defined as the number of CPU cycles that must elapse between an instruction
that writes a result register and one that uses that register, if execution-time stalls are to be
avoided. Thus, a latency of zero means that the instruction writes a result register and the
instruction that uses that register can be multiple-issued in the same cycle. A latency of 2 means
that if the writing instruction is issued at cycle N, the reading instruction can issue no earlier than
cycle N+2. Latency is implementation-specific.
Most Alpha instructions have a non-zero result latency. Compilers should schedule code so that a
result is not used too soon, at least in frequently executed code (inner loops, as identified by
execution profiles). In general, this will require loop unrolling and short procedure inlining.
"Too soon" is currently ill-defined, since no implementations have been designed yet. For
starters, assume that implementations can dual-issue instructions. Assume that Load and JSR
instructions have a latency of 3, shifts and byte manipulation a latency of 2, integer multiply a
latency of 10, and other integer operates a latency of 1. Assume floating multiply has a latency of
5, floating divide a latency of 10, and other floating operates a latency of 4. Scheduling to these
latencies will give at least reasonable performance on currently anticipated implementations.
Compilers should try to schedule code to match the above latency rules and also to match the
multiple-issue rules. If doing both is impractical for a particular sequence of code, the latency
rules are more important (since they apply even in single-issue implementations).
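
The effect of these latencies on a schedule can be estimated with a toy model. The sketch below is a simplification and not a description of any implementation: it assumes single-issue, in-order execution, uses the starting-point latencies listed above, and uses hypothetical class names and a hypothetical (kind, dest, sources) tuple format.

```python
# Assumed result latencies, taken from the starting-point figures above.
LATENCY = {
    "load": 3, "jsr": 3, "shift": 2, "byte": 2, "imul": 10, "int": 1,
    "fmul": 5, "fdiv": 10, "fp": 4,
}

def issue_cycles(program):
    """Return the issue cycle of each instruction in a single-issue,
    in-order model. program is a list of (kind, dest, sources) tuples."""
    ready = {}        # register -> first cycle at which its value is usable
    cycles = []
    next_cycle = 0
    for kind, dest, sources in program:
        cycle = max([next_cycle] + [ready.get(r, 0) for r in sources])
        cycles.append(cycle)
        if dest is not None:
            ready[dest] = cycle + LATENCY[kind]
        next_cycle = cycle + 1
    return cycles
```

Moving an independent instruction between a load and its first use fills one of the stall cycles: in this model, a load followed by its consumer and then an unrelated add issues at cycles 0, 3, 4, while the reordered sequence issues at 0, 1, 3.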

Implementors should give first priority to minimizing the latency of back-to-back integer operations, of address calculations immediately followed by load/store, of load immediately followed
by branch, and of compare immediately followed by branch. Second priority should be given to
minimizing latencies in general.

• Data-Stream Considerations
The following sections describe considerations for the data stream.

Data Alignment-Factor of 10
Data PSECTs should be at least octaword-aligned, so that aggregates (arrays, some records,
subroutine stack frames) can be allocated on aligned octaword boundaries to take advantage of
any implementations with aligned octaword data paths, and to decrease the number of cache fills
in almost all implementations.
Aggregates (arrays, records, common blocks, and so forth) should be allocated on at least aligned
octaword boundaries whenever language rules allow this. In some implementations, a series of
writes that completely fill a cache block may be a factor of 10 faster than a series of writes that
partially fill a cache block, when that cache block would give a read miss. This is true of
writeback caches that read a partially filled cache block from memory, but optimize away the read
for completely filled blocks.
For such implementations, long strings of sequential writes will be faster if they start on a
cache-block boundary (a multiple of 128 bytes will do well for most, if not all, Alpha implementations). This applies to array results that sweep through large portions of memory, and also to
register-save areas for context switching, graphics frame buffer accesses, and other places where
exactly 8, 16, 32, or more quadwords are stored sequentially. Allocating the targets at multiples of
8, 16, 32, or more quadwords, respectively, and doing the writes in order of increasing address
will maximize the write speed.
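
Rounding an allocation address up to such a boundary is a one-line computation. The helper below is a generic sketch with a hypothetical name; the 128-byte figure in the usage note is the cache-block multiple suggested above.

```python
def align_up(addr, boundary):
    """Round addr up to the next multiple of boundary (a power of two)."""
    if boundary <= 0 or boundary & (boundary - 1):
        raise ValueError("boundary must be a power of two")
    return (addr + boundary - 1) & ~(boundary - 1)
```

For example, align_up(0x1001, 128) gives 0x1080, and an address that is already aligned is returned unchanged.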
Items within aggregates that are forced to be unaligned (records, common blocks) should
generate compile-time warning messages and inline byte extract/insert code. Users must be
educated that the warning message means that they are taking a factor of 30 performance hit.
Compilers should consider supplying a switch that allows the compiler to pad aggregates to avoid
unaligned data.
Compiled code for parameters should assume that the parameters are aligned. Unaligned actuals
will therefore cause runtime alignment traps and very slow fixups. The fixup routine, if invoked,
should generate warning messages to the user, preferably giving the first few statement numbers
that are doing unaligned parameter access, and at the end of a run the total number of alignment
traps (and perhaps an estimate of the performance improvement if the data were aligned). Again,
users must be educated that the trap routine warning message means they are taking a factor of
30 performance hit.

Frequently used scalars should reside in registers. Each scalar datum allocated in memory should
normally be allocated an aligned quadword to itself, even if the datum is only a byte wide. This
allows aligned quadword loads and stores and avoids partial-quadword writes (which may be half
as fast as full-quadword writes, due to such factors as read-modify-write a quadword to do
quadword ECC calculation).
Implementors should give first priority to fast reads of aligned octawords and second priority to
fast writes of full cache blocks. Partial-quadword writes need not have a fast repetition rate.

Shared Data in Multiple Processors-Factor of 3
Software locks are aligned quadwords and should be allocated to large cache blocks that either
contain no other data, or read-mostly data whose usage is correlated with the lock.
Whenever there is high contention for a lock, one processor will have the lock and be using the
guarded data, while other processors will be in a read-only spin loop on the lock bit. Under these
circumstances, any write to the cache block containing the lock will likely cause excess bus traffic
and cache fills, thus having a performance impact on all processors that are involved, and the
buses between them. In some decomposed FORTRAN programs, refills of the cache blocks
containing one or two frequently used locks can account for a third of all the bus bandwidth the
program consumes.
Whenever there is almost no contention for a lock, one processor will have the lock and be using
the guarded data. Under these circumstances, it might be desirable to keep the guarded data in
the same cache block as the lock.
For the high sharing case, compilers should assume that almost all accesses to shared data result
in cache misses all the way back to main memory, for each distinct cache block used. Such
accesses will likely be a factor of 30 slower than cache hits. It is helpful to pack correlated shared
data into a small number of cache blocks. It is helpful also to segregate blocks written by one
processor from blocks read by others.
Therefore, accesses to shared data, including locks, should be minimized. For example, a
4-processor decomposition of some manipulation of a 1000-row array should avoid accessing lock
variables every row, but instead might access a lock variable every 250 rows.
Array manipulation should be partitioned across processors so that cache blocks do not thrash
between processors. Having each of 4 processors work on every fourth array element severely
impairs performance on any implementation with a cache block of 4 elements or larger. The
processors all contend for copies of the same cache blocks and use only 1/4 of the data in each
block. Writes in one processor severely impair cache performance on all processors.
A better decomposition is to give each processor the largest possible contiguous chunk of data to
work on (N/4 consecutive rows for 4 processors and row-major array storage; N/4 columns for
column-major storage). With the possible exception of 3 cache blocks at the partition boundaries,
this decomposition will result in each processor caching data that is touched by no other
processor.
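
A contiguous decomposition of this kind can be sketched as below. The function name is hypothetical, and the handling of a row count that does not divide evenly (the first few processors take one extra row) is an assumption, since the text only discusses the N/4 case.

```python
def contiguous_chunks(n_rows, n_procs):
    """Split row indices 0..n_rows-1 into n_procs contiguous ranges."""
    base, extra = divmod(n_rows, n_procs)
    chunks = []
    start = 0
    for p in range(n_procs):
        size = base + (1 if p < extra else 0)
        chunks.append(range(start, start + size))
        start += size
    return chunks
```

With 1000 rows and 4 processors, each processor receives 250 consecutive rows, which also matches the suggestion above of touching a lock variable once per 250 rows rather than once per row.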

Operating-system scheduling algorithms should attempt to minimize process migration from one
processor to another. Any time migration occurs, there are likely to be a large number of cache
misses on the new processor.
Similarly, operating-system scheduling algorithms should attempt to enforce some affinity
between a given device's interrupts and the processor on which the interrupt-handler runs. I/O
control data structures and locks for different devices should be disjoint. Doing both of these
allows higher cache hit rates on the corresponding I/O control data structures.
Implementors should give first priority to an efficient (low-bandwidth) way of transferring
isolated lock values and other isolated, shared write data between processors.
Implementors should assume that the amount of shared data will continue to increase, so over
time the need for efficient sharing implementations will also increase.

Avoiding Cache/TB Conflicts-Factor of 1
Occasionally, programs that run with a direct-mapped cache or TB will thrash, taking excessive
cache or TB misses. With some work, thrashing can be minimized at compile time.
In a frequently executed loop, compilers could allocate the data items accessed from memory so
that, on each loop iteration, all of the memory addresses accessed are either in exactly the same
aligned 64-byte block, or differ in bits VA<10:6>. For loops that go through arrays in a common
direction with a common stride, this means allocating the arrays, checking that the first-iteration
addresses differ, and if not, inserting up to 64 bytes of padding between the arrays. This rule will
avoid thrashing in small direct-mapped data caches with block sizes up to 64 bytes and total sizes
of 2K bytes or more.
Example:
      REAL*4 A(1000),B(1000)
      DO 60 i=1,1000
 60   A(i) = f(B(i))

BAD allocation (A and B thrash in 8 KB direct-mapped cache):

[Figure: address map showing the placement of A and B; surviving address labels: 12K, 16K.]

BETTER allocation (A and B offset by 64 mod 2 KB, so 16 elements of A and 16 of B can be in
cache simultaneously):

[Figure: address map showing the placement of A and B; surviving address labels: 8K+64, 12K, 16K.]

BEST allocation (A and B offset by 64 mod 2 KB, so 16 elements of A and 16 of B can be in cache
simultaneously, and both arrays fit entirely in 8 KB or bigger cache):

[Figure: address map showing the placement of A and B; surviving address labels: 4K-64, 12K, 16K.]

In a frequently executed loop, compilers could allocate the data items accessed from memory so
that, on each loop iteration, all of the memory addresses accessed are either in exactly the same
8 KB page, or differ in bits VA<17:13>. For loops that go through arrays in a common direction
with a common stride, this means allocating the arrays, checking that the first-iteration addresses
differ, and if not, inserting up to 8K bytes of padding between the arrays. This rule will avoid
thrashing in direct-mapped TBs and in some large direct-mapped data caches, with total sizes of
32 pages (256 KB) or more.
Usually, this padding will mean zero extra bytes in the executable image, just a skip in virtual
address space to the next-higher page boundary.
For large caches, the rule above should be applied to the I-stream, in addition to all the D-stream
references. Some implementations will have combined I-stream/D-stream large caches.
Both of the rules above can be satisfied simultaneously, thus often eliminating thrashing in all
anticipated direct-mapped cache/TB implementations.
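
Both allocation rules have the same shape, so a compiler could test them with one helper. The sketch below is an illustration with hypothetical function names; it checks a list of first-iteration addresses and makes no attempt to model any particular cache.

```python
def field(va, hi, lo):
    """Extract bits VA<hi:lo>."""
    return (va >> lo) & ((1 << (hi - lo + 1)) - 1)

def conflict_free(addresses, hi, lo):
    """True if every pair of addresses is either in the same aligned
    2**lo-byte unit or differs in bits VA<hi:lo>."""
    for i, a in enumerate(addresses):
        for b in addresses[i + 1:]:
            same_unit = (a >> lo) == (b >> lo)
            if not same_unit and field(a, hi, lo) == field(b, hi, lo):
                return False
    return True

def small_cache_ok(addresses):
    return conflict_free(addresses, 10, 6)    # 64-byte block rule

def tb_ok(addresses):
    return conflict_free(addresses, 17, 13)   # 8 KB page rule
```

Arrays starting at 0 and 8K fail the small-cache rule, while starting the second array at 8K+64 passes it, in line with the allocations described above.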

Sequential Read/Write-Factor of 1
All other things being equal, sequences of consecutive reads or writes should use ascending
(rather than descending) memory addresses. Where possible, the memory address for a block of
2**K bytes should be on a 2**K boundary, since this minimizes the number of different cache
blocks used and minimizes the number of partially written cache blocks.
To avoid overrunning memory bandwidth, sequences of more than eight quadword Loads or
Stores should be broken up with intervening instructions (if there is any useful work to be done).
For consecutive reads, implementors should give first priority to prefetching ascending cache
blocks, and second priority to absorbing up to eight consecutive quadword Loads (aligned on a
64-byte boundary) without stalling.
For consecutive writes, implementors should give first priority to avoiding read overhead for fully
written aligned cache blocks, and second priority to absorbing up to eight consecutive quadword
Stores (aligned on a 64-byte boundary) without stalling.

Prefetching-Factor of 3
To use FETCH and FETCH_M effectively, software should follow this programming model:
1. Assume that at most two FETCH instructions can be outstanding at once, and that there are
two prefetch address registers, PREa and PREb, to hold prefetching state. FETCH instructions
alternate between loading PREa and PREb. Each FETCH instruction overwrites any previous
prefetching state, thus terminating any previous prefetch that is still in progress in the register
that is loaded. The order of fetching within a block and the order between PREa and PREb are
UNPREDICTABLE.
Implementation Note
Implementations are encouraged to alternate at convenient intervals
between PREa and PREb.

2. Assume, for maximum efficiency, that there should be about 64 unrelated memory access
instructions (load or store) between a FETCH and the first actual data access to the prefetched
data.
3. Assume, for instruction-scheduling purposes in a multilevel cache hierarchy, that FETCH does
not prefetch data to the innermost cache level, but rather one level out. Schedule loads to bury
the last level of misses.
4. Assume that FETCH is worthwhile if, on average, at least half the data in a block will be
accessed. Assume that FETCH_M is worthwhile if, on average, at least half the data in a block
will be modified.
5. Treat FETCH as a vector load. If a piece of code could usefully prefetch 4 operands, launch the
first two prefetches, do about 128 memory references worth of work, then launch the next two
prefetches, do about 128 more memory references worth of work, then start using the 4 sets of
prefetched data.
6. Treat FETCH as having the same effect on a cache as a series of 64 quadword loads. If the
loads would displace useful data, so will FETCH. If two sets of loads from specific addresses
will thrash in a direct-mapped cache, so will two FETCH instructions using the same pair of
addresses.
Implementation Note
Hardware implementations are expected to provide either no support for
FETCHx or support that closely matches this model.

• Code Sequences
The following section describes code sequences.

Aligned Byte/Word Memory Accesses
The instruction sequences given in Chapter 4 for byte and word accesses are worst-case code. In
the common case of accessing a byte or aligned word field at a known offset from a pointer that is
expected to be at least longword aligned, the common-case code is much shorter.
"Expected" means that the code should run fast for a longword-aligned pointer and trap for
unaligned. The trap handler may at its option fix up the unaligned reference.
For access at a known offset D from a longword-aligned pointer Rx, let D.lw be D rounded down
to a multiple of 4 ((D div 4)*4), and let D.mod be D mod 4.
In the common case, the intended sequence for loading and zero-extending an aligned word is:
    LDL     R1,D.lw(Rx)         ! Traps if unaligned
    EXTWL   R1,#D.mod,R1        ! Picks up word at byte 0 or byte 2

In the common case, the intended sequence for loading and sign-extending an aligned word is:
    LDL     R1,D.lw(Rx)         ! Traps if unaligned
    SLL     R1,#48-8*D.mod,R1   ! Aligns word at high end of R1
    SRA     R1,#48,R1           ! SEXT to low end of R1

Note
The shifts often can be combined with shifts that might surround subsequent arithmetic operations (for example, to produce word overflow
from the high end of a register).

In the common case, the intended sequence for loading and zero-extending a byte is:
    LDL     R1,D.lw(Rx)
    EXTBL   R1,#D.mod,R1

In the common case, the intended sequence for loading and sign-extending a byte is:
    LDL     R1,D.lw(Rx)
    SLL     R1,#56-8*D.mod,R1
    SRA     R1,#56,R1
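
The load sequences above can be checked with a small software model. This is a sketch only: memory is modeled as a little-endian byte string, a Python exception stands in for the alignment trap, and all function names are hypothetical. D is assumed to be non-negative.

```python
MASK64 = (1 << 64) - 1

def sext(value, bits):
    """Sign-extend the low `bits` bits of value."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value

def ldl(memory, addr):
    """LDL: aligned longword load, sign-extended to 64 bits."""
    if addr % 4:
        raise ValueError("unaligned longword access (would trap)")
    return sext(int.from_bytes(memory[addr:addr + 4], "little"), 32)

def load_zext(memory, rx, d, width):
    """LDL + EXTBL (width=1) or EXTWL (width=2)."""
    d_lw, d_mod = (d // 4) * 4, d % 4
    r1 = ldl(memory, rx + d_lw) & MASK64
    return (r1 >> (8 * d_mod)) & ((1 << (8 * width)) - 1)

def load_sext(memory, rx, d, width):
    """LDL + SLL + SRA, with shift counts 56 (byte) or 48 (word)."""
    d_lw, d_mod = (d // 4) * 4, d % 4
    top = 64 - 8 * width
    r1 = ldl(memory, rx + d_lw)
    r1 = (r1 << (top - 8 * d_mod)) & MASK64   # SLL: field moves to the high end
    return sext(r1, 64) >> top                # SRA: arithmetic shift back down
```

With memory bytes 34 12 FE FF, the word at offset 2 loads as FFFE (hex) zero-extended and as -2 sign-extended.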

In the common case, the intended sequence for storing an aligned word R5 is:
    LDL     R1,D.lw(Rx)
    INSWL   R5,#D.mod,R3
    MSKWL   R1,#D.mod,R1
    BIS     R3,R1,R1
    STL     R1,D.lw(Rx)

In the common case, the intended sequence for storing a byte R5 is:
    LDL     R1,D.lw(Rx)
    INSBL   R5,#D.mod,R3
    MSKBL   R1,#D.mod,R1
    BIS     R3,R1,R1
    STL     R1,D.lw(Rx)
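
The store sequences follow a read-modify-write pattern that can be modeled in a few lines. As with any such model, this is a sketch under assumptions: memory is a little-endian bytearray, a Python exception stands in for the alignment trap, and the function name is hypothetical.

```python
def store_field(memory, rx, d, r5, width):
    """Model LDL / INSxL / MSKxL / BIS / STL for a byte (width=1)
    or an aligned word (width=2) at offset d from pointer rx."""
    d_lw, d_mod = (d // 4) * 4, d % 4
    addr = rx + d_lw
    if addr % 4:
        raise ValueError("unaligned longword access (would trap)")
    field_mask = ((1 << (8 * width)) - 1) << (8 * d_mod)
    r1 = int.from_bytes(memory[addr:addr + 4], "little")   # LDL
    r3 = (r5 << (8 * d_mod)) & field_mask                  # INSBL / INSWL
    r1 &= ~field_mask                                      # MSKBL / MSKWL
    r1 |= r3                                               # BIS
    memory[addr:addr + 4] = (r1 & 0xFFFFFFFF).to_bytes(4, "little")  # STL
```

Only the addressed byte or word changes; the other bytes of the containing longword are written back with their original values.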

Division
In all implementations, floating-point division is likely to have a substantially longer result latency
than floating-point multiply; in addition, in many implementations multiplies will be pipelined
and divides will not.
Thus, any division by a constant power of two should be compiled as a multiply by the exact
reciprocal, if it is representable without overflow or underflow. If language rules or surrounding
context allow, other divisions by constants can be closely approximated via multiplication by the
reciprocal.
Integer division does not exist as a hardware opcode. Division by a constant can always be done
via UMULH of another appropriate constant, followed by a right shift. General quadword
division by true variables can be done via a subroutine. The subroutine could test for small
divisors (less than about 1000 in absolute value) and for those, do a table lookup on the exact
constant and shift count for an UMULH/shift sequence. For the remaining cases, a table lookup
on about a 1000-entry table and a multiply can give a linear approximation to 1/divisor that is
accurate to 16 bits. Using this approximation, a multiply and a back-multiply and a subtract can
generate one 16-bit quotient "digit" plus a 48-bit new partial dividend. Three more such steps
can generate the full quotient. Having prior knowledge of the possible sizes of the divisor and
dividend, normalizing away leading bytes of zeros, and performing an early-out test can reduce
the average number of multiplies to about 5 (compared to a best case of 1 and a worst case of 9).
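
As an illustration of division by a constant through UMULH and a right shift, the Python sketch below computes a 64-bit multiplier and shift count for a given divisor. It is one possible construction, not the handbook's own algorithm: it covers only divisors in the range 2 to 2**63 - 1 and non-negative dividends below 2**63, and the function names are hypothetical.

```python
MASK64 = (1 << 64) - 1

def umulh(a, b):
    """Model of UMULH: high 64 bits of the unsigned 128-bit product."""
    return ((a & MASK64) * (b & MASK64)) >> 64

def magic_for_divisor(d):
    """Return (magic, shift) so that umulh(n, magic) >> shift == n // d
    for every dividend 0 <= n < 2**63 (restriction of this sketch)."""
    if not 2 <= d < (1 << 63):
        raise ValueError("sketch handles divisors in 2 .. 2**63 - 1 only")
    shift = (d - 1).bit_length() - 1            # 2**shift < d <= 2**(shift + 1)
    magic = ((1 << (64 + shift)) + d - 1) // d  # ceil(2**(64 + shift) / d)
    assert magic <= MASK64                      # fits in one register
    return magic, shift

def div_by_constant(n, magic, shift):
    """UMULH by the precomputed constant, followed by a right shift."""
    return umulh(n, magic) >> shift
```

For example, magic_for_divisor(3) yields the multiplier AAAAAAAAAAAAAAAB (hex) with a shift count of 1. A compiler would compute the pair once per constant divisor, so the run-time cost is one UMULH and one shift.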

Stylized Code Forms
Using the same stylized code form for a common operation makes compiler output a little more
readable and makes it more likely that an implementation will speed up the stylized form.

NOP
The standard NOP forms are:
    NOP     BIS     R31,R31,R31
    FNOP    CPYS    F31,F31,F31

These generate no exceptions. In most implementations, they should encounter no operand issue
delays, no destination issue delay, and no functional unit issue delay. Implementations are free to
optimize these into no action and zero execution cycles.


Clear a Register
The standard clear register forms are:
    CLR     BIS     R31,R31,Rx
    FCLR    CPYS    F31,F31,Fx

These generate no exceptions. In most implementations, they should encounter no operand issue
delays, and no functional unit issue delay.

Load Literal
The standard load integer literal (ZEXT 8-bit) form is:
    MOV     #lit8,Ry        BIS     R31,lit8,Ry

The Alpha literal construct in Operate instructions creates a canonical longword constant for
values 0..255.
A longword constant stored in an Alpha 64-bit register is in canonical form when bits
<63:32>=bit <31>.
A canonical 32-bit literal can usually be generated with one or two instructions, but sometimes
three instructions are needed. Use the following procedure to determine the offset fields of the
instructions:
    val  = <sign-extended, 32-bit value>
    low  = val<15:0>
    tmp1 = val - SEXT(low)                      ! Account for LDA instruction
    high = tmp1<31:16>
    tmp2 = tmp1 - SHIFT_LEFT( SEXT(high,16) )

    if tmp2 NE 0 then
        ! original val was in range 7FFF8000₁₆..7FFFFFFF₁₆
        extra = 4000₁₆
        tmp1  = tmp1 - 40000000₁₆
        high  = tmp1<31:16>
    else
        extra = 0
    endif

The general sequence is:
    LDA     Rdst, low(R31)
    LDAH    Rdst, extra(Rdst)       ! Omit if extra=0
    LDAH    Rdst, high(Rdst)        ! Omit if high=0
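
The procedure can be checked with a short Python model. The sketch below follows the steps above literally and then emulates the LDA/LDAH sequence to confirm that the original value is rebuilt. The function names split_literal and emulate_sequence are hypothetical, and the model assumes that LDA adds the sign-extended 16-bit displacement while LDAH adds it shifted left by 16 bits.

```python
def sext16(x):
    """Sign-extend a 16-bit field."""
    x &= 0xFFFF
    return x - 0x10000 if x & 0x8000 else x

def split_literal(val):
    """Return (low, extra, high) offset fields for a sign-extended 32-bit value."""
    if not -(1 << 31) <= val < (1 << 31):
        raise ValueError("val must be a sign-extended 32-bit value")
    low = val & 0xFFFF                       # val<15:0>
    tmp1 = val - sext16(low)                 # account for LDA instruction
    high = (tmp1 >> 16) & 0xFFFF             # tmp1<31:16>
    tmp2 = tmp1 - (sext16(high) << 16)
    if tmp2 != 0:
        # original val was in range 0x7FFF8000 .. 0x7FFFFFFF
        extra = 0x4000
        tmp1 -= 0x40000000
        high = (tmp1 >> 16) & 0xFFFF
    else:
        extra = 0
    return low, extra, high

def emulate_sequence(low, extra, high):
    """Emulate LDA / LDAH / LDAH, omitting the LDAHs whose field is zero."""
    rdst = sext16(low)                       # LDA  Rdst, low(R31)
    if extra:
        rdst += sext16(extra) << 16          # LDAH Rdst, extra(Rdst)
    if high:
        rdst += sext16(high) << 16           # LDAH Rdst, high(Rdst)
    return rdst
```

The three-instruction case arises because a high field of 8000 (hex) would be sign-extended by LDAH to a negative value; splitting it into two additions of 4000 (hex) avoids that.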


Register-to-Register Move
The standard register move forms are:
    BIS     RX,RX,RY
    CPYS    FX,FX,FY

These generate no exceptions. In most implementations, these should encounter no functional
unit issue delay.

Negate
The standard register negate forms are:
    NEGz    Rx,Ry       SUBz    R31,Rx,Ry       ! z = L or Q
    NEGz    Fx,Fy       SUBz    F31,Fx,Fy       ! z = F G S or T
    FNEGz   Fx,Fy       CPYSN   Fx,Fx,Fy        ! z = F G S or T

The integer subtract generates no Integer Overflow trap if Rx contains the largest negative
number (SUBz/V would trap). The floating subtract generates a floating-point exception for a
non-finite value in Fx. The CPYSN form generates no exceptions.
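
The difference between the forms can be illustrated with a small Python model. This is a sketch under assumptions: register contents are modeled as 64-bit patterns, a T_floating value is modeled as a Python float with its sign in bit 63, and the function names are hypothetical. It shows that the CPYSN form only flips the sign bit, so it works on any bit pattern including NaNs and infinities, and that a quadword subtract from R31 wraps for the largest negative number.

```python
import math
import struct

MASK64 = (1 << 64) - 1
SIGN = 1 << 63

def double_to_bits(x):
    return struct.unpack("<Q", struct.pack("<d", x))[0]

def bits_to_double(bits):
    return struct.unpack("<d", struct.pack("<Q", bits))[0]

def cpysn(fa_bits, fb_bits):
    """Model of CPYSN Fa,Fb,Fc: complemented sign of Fa joined with
    the exponent and fraction of Fb. No operand is inspected."""
    return ((fa_bits ^ SIGN) & SIGN) | (fb_bits & (SIGN - 1))

def fneg(x):
    """FNEG Fx,Fy expressed as CPYSN Fx,Fx,Fy."""
    bits = double_to_bits(x)
    return bits_to_double(cpysn(bits, bits))

def negq(rx):
    """NEGQ Rx,Ry expressed as SUBQ R31,Rx,Ry (no overflow trap, wraps)."""
    return (0 - rx) & MASK64
```

Negating the largest negative quadword returns the same bit pattern, which is the case that SUBQ/V would report as an Integer Overflow.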

NOT
The standard integer register NOT form is:
    NOT     Rx,Ry       ORNOT   R31,Rx,Ry

This generates no exceptions. In most implementations, this should encounter no functional unit
issue delay.

Booleans
The standard alternative to BIS is:
    OR      Rx,Ry,Rz        BIS     Rx,Ry,Rz
The standard alternative to BIC is:
    ANDNOT  Rx,Ry,Rz        BIC     Rx,Ry,Rz
The standard alternative to EQV is:
    XORNOT  Rx,Ry,Rz        EQV     Rx,Ry,Rz
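
The Boolean forms, and the NOT form built from ORNOT, correspond to the following 64-bit operations. The Python sketch is illustrative only; the function names mirror the mnemonics, and R31 is modeled as the constant zero.

```python
MASK64 = (1 << 64) - 1
R31 = 0   # R31 always reads as zero

def bis(a, b):
    """BIS (OR): logical sum."""
    return (a | b) & MASK64

def bic(a, b):
    """BIC (ANDNOT): a AND NOT b."""
    return a & ~b & MASK64

def eqv(a, b):
    """EQV (XORNOT): a XOR NOT b."""
    return (a ^ ~b) & MASK64

def ornot(a, b):
    """ORNOT: a OR NOT b."""
    return (a | ~b) & MASK64

def not_(rx):
    """NOT Rx,Ry expressed as ORNOT R31,Rx,Ry."""
    return ornot(R31, rx)
```

Because R31 contributes no set bits, ORNOT with R31 as the first operand reduces to the one's complement of the second operand.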

Trap Barrier
The TRAPB instruction guarantees that following instructions do not issue until all possible
preceding traps have been signaled. This does not mean that all preceding instructions have
necessarily run to completion (for example, a Load instruction may have passed all the fault
checks but not yet delivered data from a cache miss).


Pseudo-Operations (Stylized Code Forms)
This section summarizes the pseudo-operations for the Alpha architecture that may be used by
various software components in an Alpha system. Most of these forms are discussed in preceding
sections.

In the context of this section, pseudo-operations all represent a single underlying machine
instruction. Each pseudo-operation represents a particular instruction with either replicated fields
(such as FMOV), or hard-coded zero fields. Since the pattern is distinct, these pseudo-operations
can be decoded by instruction decode mechanisms.
In Table A-1, the pseudo-operation codes can be viewed as macros with parameters. The formal
form is listed in the left column, and the expansion in the code stream is listed in the right column.
Some instruction mnemonics have synonyms. These are different from pseudo-operations in that
each synonym represents the same underlying instruction with no special encoding of operand
fields. As a result, synonyms cannot be distinguished from each other. They are not listed in the
table that follows. Examples of synonyms are: BIC/ANDNOT, BIS/OR, and EQV/XORNOT.
Table A-1 · Decodable Pseudo-Operations (Stylized Code Forms)

Pseudo-Operation in Listing                         Actual Instruction Encoding

No-exception generic floating absolute value:
    FABS     Fx, Fy                                 CPYS     F31, Fx, Fy
Branch to target (21-bit signed displacement):
    BR       target                                 BR       R31, target
Clear integer register:
    CLR      Rx                                     BIS      R31, R31, Rx
Clear a floating-point register:
    FCLR     Fx                                     CPYS     F31, F31, Fx
Floating-point move:
    FMOV     Fx, Fy                                 CPYS     Fx, Fx, Fy
No-exception generic floating negation:
    FNEG     Fx, Fy                                 CPYSN    Fx, Fx, Fy
Floating-point no-op:
    FNOP                                            CPYS     F31, F31, F31
Move Rx/8-bit zero-extended literal to Ry:
    MOV      {Rx/Lit8}, Ry                          BIS      R31, {Rx/Lit8}, Ry

Move 16-bit sign-extended literal to Rx:
    MOV      Lit, Rx                                LDA      Rx, lit(R31)
Move to FPCR:
    MT_FPCR  Fx                                     MT_FPCR  Fx, Fx, Fx
Move from FPCR:
    MF_FPCR  Fx                                     MF_FPCR  Fx, Fx, Fx
Negate F_floating:
    NEGF     Fx, Fy                                 SUBF     F31, Fx, Fy
Negate F_floating, semi-precise:
    NEGF/S   Fx, Fy                                 SUBF/S   F31, Fx, Fy
Negate G_floating:
    NEGG     Fx, Fy                                 SUBG     F31, Fx, Fy
Negate G_floating, semi-precise:
    NEGG/S   Fx, Fy                                 SUBG/S   F31, Fx, Fy
Negate longword:
    NEGL     {Rx/Lit8}, Ry                          SUBL     R31, {Rx/Lit}, Ry
Negate longword with overflow detection:
    NEGL/V   {Rx/Lit8}, Ry                          SUBL/V   R31, {Rx/Lit}, Ry
Negate quadword:
    NEGQ     {Rx/Lit8}, Ry                          SUBQ     R31, {Rx/Lit}, Ry
Negate quadword with overflow detection:
    NEGQ/V   {Rx/Lit8}, Ry                          SUBQ/V   R31, {Rx/Lit}, Ry
Negate S_floating:
    NEGS     Fx, Fy                                 SUBS     F31, Fx, Fy
Negate S_floating, software with underflow detection:
    NEGS/SU  Fx, Fy                                 SUBS/SU  F31, Fx, Fy
Negate S_floating, software with underflow and inexact result detection:
    NEGS/SUI Fx, Fy                                 SUBS/SUI F31, Fx, Fy
Negate T_floating:
    NEGT     Fx, Fy                                 SUBT     F31, Fx, Fy

Negate T_floating, software with underflow detection:
    NEGT/SU  Fx, Fy                                 SUBT/SU  F31, Fx, Fy
Negate T_floating, software with underflow and inexact result detection:
    NEGT/SUI Fx, Fy                                 SUBT/SUI F31, Fx, Fy
Integer no-op:
    NOP                                             BIS      R31, R31, R31
Logical NOT of Rx/8-bit zero-extended literal storing results in Ry:
    NOT      {Rx/Lit8}, Ry                          ORNOT    R31, {Rx/Lit}, Ry
Longword sign-extension of Rx storing results in Ry:
    SEXTL    {Rx/Lit8}, Ry                          ADDL     R31, {Rx/Lit}, Ry
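
Because each pseudo-operation is a macro with parameters, an assembler front end can expand it by simple template substitution. The Python sketch below is an illustration of that view, not an actual assembler: the PSEUDO_OPS dictionary and the expand function are hypothetical, and only a subset of the table is included.

```python
# Templates taken from the pseudo-operation table; {0} and {1} are the
# operands written in the listing.
PSEUDO_OPS = {
    "FABS":  "CPYS F31, {0}, {1}",
    "CLR":   "BIS R31, R31, {0}",
    "FCLR":  "CPYS F31, F31, {0}",
    "FMOV":  "CPYS {0}, {0}, {1}",
    "FNEG":  "CPYSN {0}, {0}, {1}",
    "FNOP":  "CPYS F31, F31, F31",
    "NOP":   "BIS R31, R31, R31",
    "MOV":   "BIS R31, {0}, {1}",
    "NOT":   "ORNOT R31, {0}, {1}",
    "NEGQ":  "SUBQ R31, {0}, {1}",
    "SEXTL": "ADDL R31, {0}, {1}",
}

def expand(pseudo_op, *operands):
    """Return the actual instruction text for a pseudo-operation."""
    return PSEUDO_OPS[pseudo_op.upper()].format(*operands)
```

For example, expand("FMOV", "F1", "F2") gives "CPYS F1, F1, F2". The replicated or hard-coded register fields in the result are what allow a decoder to recognize the stylized form again.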

• Timing Considerations: Atomic Sequences
A sufficiently long instruction sequence between LDx_L and STx_C will never complete, because
periodic timer interrupts will always occur before the sequence completes. The following rules
describe sequences that will eventually complete in all Alpha implementations:
1. At most 40 operate or conditional-branch (not taken) instructions executed in the sequence
between LDx_L and STx_C.
2. At most two I-stream TB-miss faults. Sequential instruction execution guarantees this.
3. No other exceptions triggered during the last execution of the sequence.
Implementation Note
On all expected implementations, this allows for about 50 µsec of execution time, even with 100 percent cache misses. This should satisfy any
requirement for a 1 msec timer interrupt rate.

Appendix B· IEEE Floating-Point Conformance

A subset of IEEE Standard for Binary Floating-Point Arithmetic (754-1985) is provided in the
Alpha floating-point instructions. This appendix describes how to construct a complete IEEE
implementation.
The order of presentation parallels the order of the IEEE specification.

· Alpha Choices for IEEE Options
Alpha supports IEEE single and double formats. Optional extended double is not supported.
Alpha hardware supports normal and chopped IEEE rounding modes. IEEE plus infinity and
minus infinity rounding modes can be implemented in hardware or software.
Alpha hardware does not support optional IEEE software trap enable/disable modes; see the
following discussion about software support.
Alpha hardware supports add, subtract, multiply, divide, convert between floating formats,
convert between floating and integer formats, and compare. Software routines support square
root, remainder, round to integer in floating-point format, and convert binary to/from decimal.
In the Alpha architecture, copying without change of format is not considered an operation.
(LDx, CPYSx, and STx do not check for non-finite numbers; an operation would.) Compilers may
generate ADDx F31,Fx,Fy to get the opposite effect.
Optional operations for differing formats are not provided.
The Alpha choice is that the accuracy provided will meet or exceed IEEE standard requirements.
It is implementation-dependent whether the software binary/decimal conversions beyond 9 or 17
digits treat any excess digits as zeros.
Overflow and underflow, NaNs, and infinities encountered during software binary to decimal
conversion return strings that specify the conditions. Such strings can be truncated to their
shortest unambiguous length.
Alpha hardware supports comparisons of same-format numbers. Software supports comparisons
of different-format numbers.
In the Alpha architecture, results are true-false in response to a predicate.
Alpha hardware supports the required six predicates and the optional unordered predicate. The
other 19 optional predicates can be constructed from sequences of two comparisons and two
branches.
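
As an illustration of building an optional predicate from two comparisons and two branches, the Python sketch below models the hardware compares as functions returning true or false and derives two optional predicates from them. It is a sketch only: traps on non-finite operands are not modeled, and the derived function names are hypothetical.

```python
import math

# Models of the hardware comparisons (results are true-false).
def cmptun(a, b):
    return math.isnan(a) or math.isnan(b)

def cmpteq(a, b):
    return a == b

def cmptlt(a, b):
    return a < b

def unordered_or_equal(a, b):
    if cmptun(a, b):          # first comparison, first branch
        return True
    return cmpteq(a, b)       # second comparison, second branch

def less_or_greater(a, b):
    if cmptlt(a, b):          # first comparison, first branch
        return True
    return cmptlt(b, a)       # second comparison, second branch
```

Each derived predicate tests one hardware comparison, branches on its result, and falls through to a second comparison, which matches the two-compare, two-branch shape described above.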
Alpha hardware supports infinity arithmetic only by trapping when an infinity operand is
encountered and when an infinity is to be created from finite operands by overflow or division by
zero. A software trap handler (interposed between the hardware and the IEEE user) provides
correct infinity arithmetic.


Alpha hardware supports NaNs only by trapping when a NaN operand is encountered and when
a NaN is to be created. A software trap handler (interposed between the hardware and the IEEE
user) provides correct Signaling and Quiet NaN behavior.
In the Alpha architecture, Quiet NaNs do not afford retrospective diagnostic information.
In the Alpha architecture, copying a Signaling NaN without a change of format does not signal an
invalid exception (LDx, CPYSx, and STx do not check for non-finite numbers). Compilers may
generate ADDx F31,Fx,Fy to get the opposite effect.
Alpha hardware fully supports negative zero operands, and follows the IEEE rules for creating
negative zero results.
Alpha hardware does not supply IEEE exception trap behavior; the hardware traps are a superset
of the IEEE-required conditions. A software trap handler (interposed between the hardware and
the IEEE user) provides correct IEEE exception behavior.
In the Alpha architecture, tininess is detected by hardware after rounding, and loss of accuracy is
detected by software as an inexact result.
In the Alpha architecture, user trap handlers will be supported by compilers and a software trap
handler (interposed between the hardware and the IEEE user), as described in the next section.

• Alpha Hardware Support of Software Exception Handlers
In Alpha instructions, hardware trap behavior is determined only at compile time; short of
recompiling, there are no dynamic facilities for changing hardware trap behavior.
There is an essential disparity between the Alpha design goal of fast execution and the IEEE
design goal of exact trap behavior. The Alpha hardware architecture provides means for users to
choose various degrees of IEEE compliance, at appropriate performance cost.
Instructions compiled without the /Software modifier cannot produce IEEE-compliant trap
behavior, nor can they provide IEEE-compliant non-finite arithmetic. Trapping and stopping on
non-finite operands or results (rather than the IEEE default of continuing with NaNs propagated)
is an Alpha value-added behavior that some users prefer.
Instructions compiled without the /Underflow hardware trap enable modifier cannot produce
IEEE-compliant underflow trap behavior, nor can they provide IEEE-compliant denormal results.
They are fast and provide true zero (not minus zero) results whenever underflow occurs. This is
an Alpha value-added behavior that some users prefer.
Instructions compiled without the /Inexact hardware trap enable modifier cannot produce
IEEE-compliant inexact trap behavior. Trapping on Inexact will be painfully slow; few users
appear to prefer this, but they can get it if they really want it.
IEEE floating-point instructions compiled with the /Software modifier produce hardware traps
and unpredictable values; a software trap handler may then produce all IEEE-required behavior.
IEEE floating-point instructions compiled with the /Underflow enable modifier produce hardware traps and true zero values for underflow; a software trap handler may then produce all
IEEE-required behavior.


IEEE floating-point instructions compiled with the /Inexact enable modifier produce hardware
traps that allow a software trap handler to produce all IEEE-required behavior.
Thus, to get full IEEE compliance of all the required features of the standard, users must compile
with all three options enabled.
To get the optional full IEEE user trap handler behavior, a software trap handler must be
provided that implements the five exception flags, dynamic user trap handler disabling, handler
saving and restoring, default behavior for disabled user trap handlers, and linkages that allow a
user handler to return a substitute result.
Also, users must insert a TRAPB in every basic block with a floating operation that can potentially
trap, so that a software handler has an opportunity to scale the true result by 2**192 or 2**1536,
as appropriate for enabled user trap handlers; and to supply the default +/- infinity, +/-MAX,
+/-MIN, denormal, or zero as appropriate for disabled user trap handlers.
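
The scaling step can be illustrated for the double (T_floating) case, where the scale factor is 2**1536; the single-format factor 2**192 works the same way but would also need rounding to single precision, which is not modeled here. The Python sketch below is an assumption-laden illustration: it uses exact rational arithmetic to stand in for the emulated true result, and the function names are hypothetical.

```python
from fractions import Fraction

ALPHA_T = 1536   # exponent adjustment for the double format (192 for single)

def true_product(x, y):
    """Exact product of two finite doubles, as a software handler could
    recompute it from the original operands."""
    return Fraction(x) * Fraction(y)

def scaled_overflow_result(exact):
    """Result delivered to an enabled overflow handler: exact / 2**1536."""
    return float(exact / (1 << ALPHA_T))

def scaled_underflow_result(exact):
    """Result delivered to an enabled underflow handler: exact * 2**1536."""
    return float(exact * (1 << ALPHA_T))
```

Scaling brings the out-of-range true result back near the middle of the exponent range, so the user handler receives a representable value from which the original magnitude can be recovered.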

• Mapping to IEEE Standard
There are five IEEE exceptions, each of which can be "IEEE software trap-enabled" or disabled
(the default condition). Implementing the IEEE software trap-enabled mode is optional in the
IEEE standard.
Our assumption, therefore, is that the only access to IEEE-specified software trap-enabled results
will be generated in assembly language code. The following design allows this, but only if such
assembly language code has TRAPB instructions after each floating-point instruction, and generates the IEEE-specified scaled result in a trap handler by emulating the instruction that was
trapped by hardware overflow/underflow detection, using the original operands.
There is a set of detailed IEEE-specified result values, both for operations that are specified to
raise IEEE traps and those that do not. This behavior is created on Alpha by four layers of
hardware, PALcode, the operating-system trap handler, and the user IEEE trap handler, as shown
in Figure B-1.

[Figure B-1 · IEEE Trap Handling Behavior. Surviving box label: User Condition Handler]


The IEEE-specified trap behavior occurs only with respect to the user IEEE trap handler (the last
layer in Figure B-1); any trap-and-fixup behavior in the first three layers is outside the scope of
the IEEE standard.
The IEEE number system is divided into finite and non-finite numbers:
• The finites are normal numbers:
-MAX..-MIN, -0, 0, +MIN..+MAX
• The non-finites are:
Denormals, +/- Infinity, Signaling NaN, Quiet NaN
Alpha hardware must treat minus zero operands and results as special cases, as required by the
IEEE standard.
Table B-1 specifies, for the IEEE /Software modes, which layer does each piece of trap handling.
See Chapter 4 for more detail on the hardware instruction descriptions.
Table B-1 • IEEE Floating-Point Trap Handling

                                                                    OS Trap         User Software
Alpha Instructions               Hardware                   PAL     Handler         Handler

FBEQ FBNE FBLT FBLE FBGT FBGE    Bits Only-No Exceptions
LDS LDT                          Bits Only-No Exceptions
STS STT                          Bits Only-No Exceptions
CPYS CPYSN                       Bits Only-No Exceptions
FCMOVx                           Bits Only-No Exceptions

ADDx SUBx INPUT Exceptions
Denormal operand                 Trap                               Supply sum
+/-Inf operand                   Trap                               Supply sum
QNaN operand                     Trap                               Supply QNaN
SNaN operand                     Trap                               Supply QNaN     [Invalid Op]
+Inf + -Inf                      Trap                               Supply QNaN     [Invalid Op]


Table B-1 • IEEE Floating-Point Trap Handling (Continued)

Alpha Instructions

OS
Trap
Handler

User
Software
Handler

Supply
+/-Inf
+/-MAX

[Overflow]
Scale by 2**Alpha

Hardware

PAL

Exponent overflow

Trap

Exponent underflow and disabled

Supply +0

Exponent underflow and enabled

Supply +0
and trap

Trap

Denormal operand

Trap

Supply
prod.

+/-Inf operand

Trap

Supply
prod.

QNaN operand

Trap

Supply
QNaN

SNaN operand

Trap

Supply
QNaN

[Invalid Op]

0 * Inf

Trap

Supply
QNaN

[Invalid Op]

Exponent overflow

Trap

Supply
+/-Inf
+/-MAX

[Overflow]
Scale by 2**Alpha

Exponent underflow and disabled

Supply +0

Exponent underflow and enabled

Supply +0
and Trap

Trap

Supply
+/-MIN
denorm
+/-0

[Underflow]
Scale by 2**Alpha

Trap

ADDx SUBx OUTPUT Exceptions

-1

Supply
+/-MIN
denorm
+/-0

[Underflow]
Scale by 2**Alpha

Inexact and disabled in the instruction
Inexact and enabled in the instruction

[Inexact]

MULx INPUT Exceptions

MULx OUTPUT Exceptions

Inexact and disabled
Inexact and enabled
1

[Inexact]

An implementation could choose instead to trap to PALcode and have the PALcode supply a zero result on all
underflows.


Table B-1 • IEEE Floating-Point Trap Handling (Continued)

Alpha Instructions

OS
Trap
Handler

User
Software
Handler

Hardware

PAL

Denormal operand

Trap

Supply
Cvt

+/-Inf operand

Trap

Supply
Cvt

QNaN operand

Trap

Supply
QNaN

SNaN operand

Trap

Supply
QNaN

[Invalid Op]

Exponent overflow

Trap

Supply
+/-Inf
+/-MAX

[Overflow]
Scale by 2**Alpha

Exponent underflow and disabled

Supply +0

Exponent underflow and enabled

Supply +0
and trap

Trap

Supply
+/-MIN
denorm
+/-0

[Underflow]
Scale by 2**Alpha

Trap

CVTff INPUT Exceptions

CVTff OUTPUT Exceptions

Inexact and disabled
Inexact and enabled

[Inexact]

Other IEEE operations (software subroutines or sequences of instructions), are listed here for
completeness:
Remainder
SQRT
Round float to integer-valued float
Convert binary to/from decimal
Compare, other combinations than the four above


Table B-2 shows the IEEE standard charts.

Table B-2 · IEEE Standard Charts

                                   IEEE Software        IEEE Software
                                   TRAP Disabled        TRAP Enabled
Exception                          (IEEE Default)       (Optional)

Invalid Operation
(1) Input signaling NaN            Quiet NaN
(2) Mag. subtract Inf.             Quiet NaN
(3) 0 * Inf.                       Quiet NaN
(4) 0/0 or Inf/Inf                 Quiet NaN
(5) x REM 0 or Inf REM y           Quiet NaN
(6) SQRT(negative non-zero)        Quiet NaN
(7) Cvt to int(ovfl, Inf, NaN)     Quiet NaN
(8) Compare unordered              Quiet NaN

Division by Zero
x/0, x finite <>0                  +/-Inf

Overflow
Round nearest                      +/-Inf.              Res/2**192 or 1536
Round to zero                      +/-MAX               Res/2**192 or 1536
Round to -Inf                      +MAX/-Inf            Res/2**192 or 1536
Round to +Inf                      +Inf/-MAX            Res/2**192 or 1536

Underflow                          0/denorm/+-MIN       Res*2**192 or 1536

Inexact                            Rounded/ovfl         Res
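
The Overflow rows of the chart for the trap-disabled case amount to a small lookup on rounding mode and result sign. The Python sketch below illustrates this for the double format; the mode names and the function name are hypothetical labels, not architectural terms.

```python
import math
import sys

MAX_T = sys.float_info.max   # largest finite double

# (result for positive overflow, result for negative overflow)
DEFAULT_OVERFLOW = {
    "nearest":   (math.inf, -math.inf),
    "zero":      (MAX_T, -MAX_T),
    "minus_inf": (MAX_T, -math.inf),
    "plus_inf":  (math.inf, -MAX_T),
}

def default_overflow_result(negative, rounding):
    """Default result supplied when the IEEE overflow trap is disabled."""
    positive_result, negative_result = DEFAULT_OVERFLOW[rounding]
    return negative_result if negative else positive_result
```

The directed modes return the largest finite value on the side they round away from, and infinity on the side they round toward.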

IEEE software trap handler requirements are as follows:
Result is unpredictable unless supplied by trap handler.
Determine which exceptions occurred.
Determine the kind of operation.
Determine the destination format.
Overflow/underflow/inexact: the correctly rounded result, including parts that do not fit in the
format.
Invalid and divzero: the operand values.

Appendix C · Instruction Encodings

The encodings for the Alpha instruction set are given in the following sections. There is one
section for each instruction format, followed by a summary of all the instruction opcodes in a
single table.

• Memory Format Instructions
Table C-1 lists the hexadecimal values of the 6-bit opcode field for the Memory format
instructions.
Table C-1 • Memory Format Instruction Opcodes

Mnemonic    Opcode      Mnemonic    Opcode
LDL         28          STL         2C
LDQ         29          STQ         2D
LDL_L       2A          STL_C       2E
LDQ_L       2B          STQ_C       2F
LDQ_U       0B          STQ_U       0F
LDA         08          LDAH        09
LDF         20          STF         24
LDG         21          STG         25
LDS         22          STS         26
LDT         23          STT         27

Table C-2 lists the hexadecimal values of the 6-bit opcode field and the 16-bit displacement field
for the Memory format instructions that use the displacement field as a function code. The
notation used is oo.ffff, where oo is the 6-bit opcode and ffff is the 16-bit displacement field.
Table C-2 • Memory Format Instructions with a Function Code

Mnemonic    Code        Mnemonic    Code        Mnemonic    Code
FETCH       18.8000     FETCH_M     18.A000     MB          18.4000
RC          18.E000     RPCC        18.C000     RS          18.F000
TRAPB       18.0000

Programming Note
The code points 18.4400, 18.4800, and 18.4C00 must operate as Memory
Barrier instructions (MB 18.4000). Software will currently only use the
18.4000 code point for MB. This allows a weaker memory barrier to be
added.

Table C-3 lists the hexadecimal values of the high-order two bits of the displacement field for the
Memory format branch instructions. The notation used is oo.h, where oo is the 6-bit opcode and
h is the high-order two bits of the displacement field.
Table C-3 • Memory Format Branch Instruction Opcodes

Mnemonic          Code        Mnemonic          Code
JMP               1A.0        JSR               1A.1
RET               1A.2        JSR_COROUTINE     1A.3

• Branch Format Instructions
Table C-4 lists the hexadecimal values of the 6-bit opcode field for the Branch format
instructions.
Table C-4 • Branch Format Instruction Opcodes

Mnemonic  Opcode    Mnemonic  Opcode    Mnemonic  Opcode    Mnemonic  Opcode
BR        30        FBEQ      31        FBLT      32        FBLE      33
BSR       34        FBNE      35        FBGE      36        FBGT      37
BLBC      38        BEQ       39        BLT       3A        BLE       3B
BLBS      3C        BNE       3D        BGE       3E        BGT       3F

• Operate Format Instructions
Table C-5 lists the hexadecimal values of the 6-bit opcode field and the 7-bit function code field
for the Operate format instructions. The notation used is oo.ff, where oo is the 6-bit opcode and
ff is the 7-bit function code field.
Table C-5 • Operate Format Instruction Opcodes and Function Codes

Mnemonic    Code      Mnemonic    Code      Mnemonic    Code
ADDL        10.00     SUBL        10.09     CMPEQ       10.2D
ADDL/V      10.40     SUBL/V      10.49     CMPLT       10.4D
ADDQ        10.20     SUBQ        10.29     CMPLE       10.6D
ADDQ/V      10.60     SUBQ/V      10.69     CMPULT      10.1D
CMPULE      10.3D     CMPBGE      10.0F
S4ADDL      10.02     S4SUBL      10.0B     S8ADDL      10.12
S4ADDQ      10.22     S4SUBQ      10.2B     S8ADDQ      10.32
S8SUBL      10.1B     S8SUBQ      10.3B
AND         11.00     BIS         11.20     XOR         11.40
BIC         11.08     ORNOT       11.28     EQV         11.48
CMOVEQ      11.24     CMOVLT      11.44     CMOVLE      11.64
CMOVNE      11.26     CMOVGE      11.46     CMOVGT      11.66
CMOVLBS     11.14     CMOVLBC     11.16

SLL         12.39     SRA         12.3C     SRL         12.34
EXTBL       12.06     INSBL       12.0B     MSKBL       12.02
EXTWL       12.16     INSWL       12.1B     MSKWL       12.12
EXTLL       12.26     INSLL       12.2B     MSKLL       12.22
EXTQL       12.36     INSQL       12.3B     MSKQL       12.32
EXTWH       12.5A     INSWH       12.57     MSKWH       12.52
EXTLH       12.6A     INSLH       12.67     MSKLH       12.62
EXTQH       12.7A     INSQH       12.77     MSKQH       12.72
ZAP         12.30     ZAPNOT      12.31
MULL        13.00     MULL/V      13.40     MULQ        13.20
MULQ/V      13.60     UMULH       13.30
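
The oo.ff notation maps directly onto the fields of an instruction word. The Python sketch below assembles a register-form Operate instruction from such a code. It is an illustration that assumes the Operate format field positions opcode <31:26>, Ra <25:21>, Rb <20:16>, function <11:5>, and Rc <4:0>, with bits <15:12> zero for the register form; the function names are hypothetical.

```python
def parse_code(code):
    """Split an 'oo.ff' string into (opcode, function) integers."""
    opcode_text, function_text = code.split(".")
    return int(opcode_text, 16), int(function_text, 16)

def encode_operate(code, ra, rb, rc):
    """Build the 32-bit word for a register-form Operate instruction."""
    opcode, function = parse_code(code)
    assert 0 <= opcode < 64 and 0 <= function < 128
    assert all(0 <= r < 32 for r in (ra, rb, rc))
    return (opcode << 26) | (ra << 21) | (rb << 16) | (function << 5) | rc

def decode_operate(word):
    """Return (opcode, function, ra, rb, rc) from a register-form word."""
    return ((word >> 26) & 0x3F, (word >> 5) & 0x7F,
            (word >> 21) & 0x1F, (word >> 16) & 0x1F, word & 0x1F)
```

With BIS listed as 11.20, the standard integer NOP form BIS R31,R31,R31 encodes under these assumptions as the word 47FF041F (hex).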

• Floating-Point Operate Format
Table C-6 lists the hexadecimal values of the 11-bit function code field for the Floating-point
Operate format instructions that are data type independent. The 6-bit opcode for these instructions is 17₁₆.
Table C-6 • Function Codes for Floating Data Type Independent Operations

Mnemonic    Code      Mnemonic    Code      Mnemonic    Code
CPYS        020       CPYSN       021       CPYSE       022
MF_FPCR     025       MT_FPCR     024       CVTQL/SV    530
CVTLQ       010       CVTQL       030       CVTQL/V     130
FCMOVEQ     02A       FCMOVLT     02C       FCMOVLE     02E
FCMOVNE     02B       FCMOVGE     02D       FCMOVGT     02F

IEEE Floating-Point Instructions
Table C-7 lists the hexadecimal value of the 11-bit function code field for the IEEE floating-point
instructions, with and without qualifiers. The opcode for these instructions is 16₁₆.
Table C-7 • IEEE Floating-Point Instruction Function Codes

         None   /C     /M     /D     /U     /UC    /UM    /UD
ADDS     080    000    040    0C0    180    100    140    1C0
ADDT     0A0    020    060    0E0    1A0    120    160    1E0
CMPTEQ   0A5
CMPTLT   0A6
CMPTLE   0A7
CMPTUN   0A4
CVTQS    0BC    03C    07C    0FC
CVTQT    0BE    03E    07E    0FE
CVTTS    0AC    02C    06C    0EC    1AC    12C    16C    1EC
DIVS     083    003    043    0C3    183    103    143    1C3
DIVT     0A3    023    063    0E3    1A3    123    163    1E3
MULS     082    002    042    0C2    182    102    142    1C2
MULT     0A2    022    062    0E2    1A2    122    162    1E2
SUBS     081    001    041    0C1    181    101    141    1C1
SUBT     0A1    021    061    0E1    1A1    121    161    1E1

         /SU    /SUC   /SUM   /SUD   /SUI   /SUIC  /SUIM  /SUID
ADDS     580    500    540    5C0    780    700    740    7C0
ADDT     5A0    520    560    5E0    7A0    720    760    7E0
CMPTEQ   5A5
CMPTLT   5A6
CMPTLE   5A7
CMPTUN   5A4
CVTQS                                7BC    73C    77C    7FC
CVTQT                                7BE    73E    77E    7FE
CVTTS    5AC    52C    56C    5EC    7AC    72C    76C    7EC
DIVS     583    503    543    5C3    783    703    743    7C3
DIVT     5A3    523    563    5E3    7A3    723    763    7E3
MULS     582    502    542    5C2    782    702    742    7C2
MULT     5A2    522    562    5E2    7A2    722    762    7E2
SUBS     581    501    541    5C1    781    701    741    7C1
SUBT     5A1    521    561    5E1    7A1    721    761    7E1

         None   /C     /V     /VC    /SV    /SVC   /SVI   /SVIC
CVTTQ    0AF    02F    1AF    12F    5AF    52F    7AF    72F

         /D     /VD    /SVD   /SVID  /M     /VM    /SVM   /SVIM
CVTTQ    0EF    1EF    5EF    7EF    06F    16F    56F    76F

Programming Note
Since underflow cannot occur for CMPTxx, there is no difference in
function or performance between CMPTxx/S and CMPTxx/SU. It is
intended that software generate CMPTxx/SU in place of CMPTxx/S.
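
The arithmetic entries in the IEEE function code table follow a regular pattern: a base code for the operation, a rounding-mode contribution, and a trap-qualifier contribution. The Python sketch below reproduces that pattern for the add, subtract, multiply, and divide instructions. The decomposition is inferred from the table values rather than stated by the handbook, and the dictionary and function names are hypothetical.

```python
# Base function codes with chopped rounding and no trap qualifiers.
BASE = {
    "ADDS": 0x000, "SUBS": 0x001, "MULS": 0x002, "DIVS": 0x003,
    "ADDT": 0x020, "SUBT": 0x021, "MULT": 0x022, "DIVT": 0x023,
}

# Rounding qualifier: /C chopped, /M minus infinity, none normal, /D dynamic.
ROUNDING = {"C": 0x000, "M": 0x040, "": 0x080, "D": 0x0C0}

# Trap qualifier combinations listed for these instructions.
TRAPPING = {"": 0x000, "U": 0x100, "SU": 0x500, "SUI": 0x700}

def ieee_function_code(mnemonic, trapping="", rounding=""):
    """Compose the 11-bit function code for an IEEE arithmetic instruction."""
    return TRAPPING[trapping] | ROUNDING[rounding] | BASE[mnemonic]
```

For instance, MULT with /SUI and /M composes to 762 (hex), which matches the MULT/SUIM entry.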


VAX Floating-Point Instructions
Table C-8 lists the hexadecimal value of the 11-bit function code field for the VAX floating-point
instructions. The opcode for these instructions is 15₁₆.
Table C-8 • VAX Floating-Point Instruction Function Codes

         None   /C     /U     /UC    /S     /SC    /SU    /SUC
ADDF     080    000    180    100    480    400    580    500
CVTDG    09E    01E    19E    11E    49E    41E    59E    51E
ADDG     0A0    020    1A0    120    4A0    420    5A0    520
CMPGEQ   0A5                         4A5
CMPGLT   0A6                         4A6
CMPGLE   0A7                         4A7
CVTGF    0AC    02C    1AC    12C    4AC    42C    5AC    52C
CVTGD    0AD    02D    1AD    12D    4AD    42D    5AD    52D
CVTQF    0BC    03C
CVTQG    0BE    03E
DIVF     083    003    183    103    483    403    583    503
DIVG     0A3    023    1A3    123    4A3    423    5A3    523
MULF     082    002    182    102    482    402    582    502
MULG     0A2    022    1A2    122    4A2    422    5A2    522
SUBF     081    001    181    101    481    401    581    501
SUBG     0A1    021    1A1    121    4A1    421    5A1    521

         None   /C     /V     /VC    /S     /SC    /SV    /SVC
CVTGQ    0AF    02F    1AF    12F    4AF    42F    5AF    52F

• Required PALcode Function Codes
The opcodes listed in Table C-9 are required for all Alpha implementations. The notation used is
oo.ffff, where oo is the hexadecimal 6-bit opcode and ffff is the hexadecimal 26-bit function code.
Table C-9 • Required PALcode Function Codes

Mnemonic    Type            Function Code
HALT        Privileged      00.0000
IMB         Unprivileged    00.0086


• Opcodes Reserved to PALcode
The opcodes listed in Table C-10 are reserved for use in implementing PALcode.

Table C-10 • Opcodes Reserved for PALcode

Mnemonic    Opcode
PAL19       19
PAL1B       1B
PAL1D       1D
PAL1E       1E
PAL1F       1F

• Opcodes Reserved to Digital
The opcodes listed in Table C-11 are reserved to Digital.

Table C-11 · Opcodes Reserved for Digital

Mnemonic  Opcode    Mnemonic  Opcode    Mnemonic  Opcode    Mnemonic  Opcode
OPC01     01        OPC02     02        OPC03     03        OPC04     04
OPC05     05        OPC06     06        OPC07     07        OPC0A     0A
OPC0C     0C        OPC0D     0D        OPC0E     0E        OPC14     14
OPC1C     1C

• Opcode Summary
Table C-12 lists all Alpha opcodes from 00 (CALL_PAL) through 3F (BGT). In the table, the
column headings appearing over the instructions have a granularity of 8₁₆. The rows beneath the
leftmost column supply the individual hex number to resolve that granularity.

If an instruction column has a 0 in the right (low) hex digit, replace that 0 with the number to the
left of the backslash in the leftmost column on the instruction's row. If an instruction column has
an 8 in the right (low) hexadecimal digit, replace that 8 with the number to the right of the
backslash in the leftmost column.
For example, the third row (2/A) under the 10₁₆ column contains the symbol INTS*, representing
all the integer shift instructions. The opcode for those instructions would then be 12₁₆
because the 0 in 10 is replaced by the 2 in the leftmost column. Likewise, the third row under the
18₁₆ column contains the symbol JSR*, representing all jump instructions. The opcode for those
instructions is 1A because the 8 in the heading is replaced by the number to the right of the
backslash in the leftmost column.
The instruction format is listed under the instruction symbol.
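
The lookup rule can be stated as a few lines of Python. The sketch below is illustrative; the function name and the string form of the row label (for example "2/A") are assumptions made for the example.

```python
def opcode_from_summary(column_heading, row_label):
    """Combine a column heading (00, 08, 10, ... 38 hex) with a row
    label such as '2/A' to obtain the 6-bit opcode."""
    left, right = row_label.split("/")
    low_digit = column_heading & 0xF
    if low_digit == 0:
        digit = int(left, 16)     # heading ends in 0: use the left number
    elif low_digit == 8:
        digit = int(right, 16)    # heading ends in 8: use the right number
    else:
        raise ValueError("column headings end in 0 or 8")
    return (column_heading & ~0xF) | digit
```

Applied to the two cases in the text, the 10 column with row 2/A gives opcode 12 (hex), and the 18 column with the same row gives 1A (hex).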


The symbols in Table C-12 are explained in Table C-13.
Table C-12 · Opcode Summary

        00       08       10       18       20       28       30       38

0/8     PAL*     LDA      INTA*    MISC*    LDF      LDL      BR       BLBC
        (pal)    (mem)    (op)     (mem)    (mem)    (mem)    (br)     (br)

1/9     Res      LDAH     INTL*    \PAL\    LDG      LDQ      FBEQ     BEQ
                 (mem)    (op)              (mem)    (mem)    (br)     (br)

2/A     Res      Res      INTS*    JSR*     LDS      LDL_L    FBLT     BLT
                          (op)     (mem)    (mem)    (mem)    (br)     (br)

3/B     Res      LDQ_U    INTM*    \PAL\    LDT      LDQ_L    FBLE     BLE
                 (mem)    (op)              (mem)    (mem)    (br)     (br)

4/C     Res      Res      Res      Res      STF      STL      BSR      BLBS
                                            (mem)    (mem)    (br)     (br)

5/D     Res      Res      FLTV*    \PAL\    STG      STQ      FBNE     BNE
                          (op)              (mem)    (mem)    (br)     (br)

6/E     Res      Res      FLTI*    \PAL\    STS      STL_C    FBGE     BGE
                          (op)              (mem)    (mem)    (br)     (br)

7/F     Res      STQ_U    FLTL*    \PAL\    STT      STQ_C    FBGT     BGT
                 (mem)    (op)              (mem)    (mem)    (br)     (br)

Table C-13 · Key to Opcode Summary (Table C-12)

Symbol      Meaning
FLTI*       IEEE floating-point instruction opcodes
FLTL*       Floating-point Operate instruction opcodes
FLTV*       VAX floating-point instruction opcodes
INTA*       Integer arithmetic instruction opcodes
INTL*       Integer logical instruction opcodes
INTM*       Integer multiply instruction opcodes
INTS*       Integer shift instruction opcodes
JSR*        Jump instruction opcodes
MISC*       Miscellaneous instruction opcodes
PAL*        PALcode instruction (CALL_PAL) opcodes
\PAL\       Reserved for PALcode
Res         Reserved for Digital

Index

A
Add instructions
See also Floating-point Operate
add longword, 4-22
add quadword, 4-24
add scaled longword, 4-23
add scaled quadword, 4-25
ADDF instruction, 4-83
ADDG instruction, 4-83
ADDL instruction, 4-22
ADDQ instruction, 4-24
Address Space Match (ASM), virtual cache
coherency, 5-4
Address Space Number (ASN), virtual cache
coherency, 5-4
ADDS instruction, 4-84
ADDT instruction, 4-84
Aligned byte/word memory accesses, A-11
ALIGNED data objects, 1-8
Alignment
atomic longword, 5-2
atomic quadword, 5-2
data considerations, A-6
double-width data paths, A-1
D_floating, 2-6
F_floating, 2-4
G_floating, 2-5
instruction, A-2
longword, 2-2
memory accesses, A-11
quadword, 2-2
S_floating, 2-8
T_floating, 2-9
Alpha architecture
See also Conventions
addressing, 2-1
overview, 1-1
porting operating systems to, 1-1
programming implications, 5-1
registers, 3-1
security, 1-6

Alpha Privileged Architecture Library
See PALcode
AND instruction, 4-36
Arithmetic instructions, 4-21
See also specific arithmetic instructions
Arithmetic left shift instruction, using logical
shift for, 4-35
Arithmetic traps
Division by Zero, 4-60
Inexact Result, 4-60
Integer Overflow, 4-60
Invalid Operation, 4-59
Overflow, 4-60
programming implications for, 5-20
TRAPB instruction with, 4-105
Underflow, 4-60
Atomic access, 5-2
Atomic operations
accessing longword datum, 5-2
accessing quadword datum, 5-2
updating shared data structures, 5-6
using load locked and store conditional, 5-7
Atomic sequences, A-17

B
BEQ instruction, 4-17
BGE instruction, 4-17
BGT instruction, 4-17
BIC instruction, 4-36
BIS instruction, 4-36
BLBC instruction, 4-17
BLBS instruction, 4-17
BLE instruction, 4-17
BLT instruction, 4-17
BNE instruction, 4-17
Boolean instructions, 4-35
logical functions, 4-36
Boolean stylized code forms, A-14
bpt (PALcode) instruction, 9-1,
BPT (PALcode) instruction, 8-1,

BR instruction, 4-18
Branch instruction format, 3-9
Branch instructions, 4-16
See also Control instructions
backward conditional, 4-17
conditional branch, 4-17
displacement, 4-17
floating-point, summarized, 4-74
forward conditional, 4-17
opcodes for, C-2
unconditional branch, 4-18
Branch prediction model, 4-15
Branch prediction stack, with BSR instruction,
4-18
BSR instruction, 4-18
bugchk (PALcode) instruction, 9-1
BUGCHK (PALcode) instruction, 8-1
Byte data type, 2-1
Byte manipulation instructions, 4-41
See also Extract instructions; Insert
instructions; Mask instructions

C
/C opcode qualifier
IEEE floating-point, 4-56
VAX floating-point, 4-56
Cache coherency, 5-1, 5-19
in multiprocessor environment, 5-5
Caches
design considerations, A-1
I-stream considerations, A-4
MB and IMB instructions with, 5-19
requirements for, 5-4
Translation Buffer conflicts, A-8
with powerfail/recovery, 5-4
callsys (PALcode) instruction, 9-1
CALL_PAL (Call Privileged Architecture
Library) instruction, 4-100
Canonical form, 4-61
CFLUSH (PALcode) instruction, 8-8
Changed datum, 5-5
CHME (PALcode) instruction, 8-1
CHMK (PALcode) instruction, 8-1
CHMS (PALcode) instruction, 8-2

CHMU (PALcode) instruction, 8-2
Clear a register, A-13
CMOVEQ instruction, 4-37
CMOVGE instruction, 4-37
CMOVGT instruction, 4-37
CMOVLBC instruction, 4-37
CMOVLBS instruction, 4-37
CMOVLE instruction, 4-37
CMOVLT instruction, 4-37
CMOVNE instruction, 4-37
CMPBGE instruction, 4-42
CMPEQ instruction, 4-26
CMPGEQ instruction, 4-85
CMPGLE instruction, 4-85
CMPGLT instruction, 4-85
CMPLE instruction, 4-26
CMPLT instruction, 4-26
CMPTEQ instruction, 4-86
CMPTLE instruction, 4-86
CMPTLT instruction, 4-86
CMPTUN instruction, 4-86
CMPULE instruction, 4-27
CMPULT instruction, 4-27
Code forms, stylized, A-12
Boolean, A-14
load literal, A-13
negate, A-14
NOP, A-12
NOT, A-14
register, clear, A-13
register-to-register move, A-14
Code sequences, A-11
Coherency
cache, 5-1
defined, 5-1
Compare instructions
See also Floating-point Operate
compare byte, 4-42
compare integer signed, 4-26
compare integer unsigned, 4-27
Conditional move instructions, 4-37
See also Floating-point Operate
Console, overview, 7-1
Control instructions, 4-15

Conventions
code examples, 1-9
extents, 1-8
figures, 1-9
instruction format, 3-8
notation, 3-8
numbering, 1-6
ranges, 1-8
CPYS instruction, 4-78
CPYSN instruction, 4-78
CPYSE instruction, 4-78
CVTDG instruction, 4-89
CVTGD instruction, 4-89
CVTGF instruction, 4-89
CVTGQ instruction, 4-87
CVTLQ instruction, 4-79
CVTQF instruction, 4-88
CVTQG instruction, 4-88
CVTQL instruction, 4-79
CVTQS instruction, 4-91
CVTQT instruction, 4-91
CVTTQ instruction, 4-90
CVTTS instruction, 4-92

D
/D opcode qualifier
FPCR (Floating-point Control Register),
4-61
IEEE floating-point, 4-56
D-stream considerations, A-6
Data alignment, A-6
Data format, overview, 1-3
Data sharing (multiprocessor), A-7
synchronization requirement, 5-5
Data stream
See D-stream
Data types
byte, 2-1
IEEE floating-point, 2-6
longword, 2-2
longword integer, 2-9
quadword, 2-2
quadword integer, 2-10
unsupported in hardware, 2-11

VAX floating-point, 2-3
word, 2-1
Denormal, defined for floating-point, 4-54
Dirty zero, defined for floating-point, 4-54
DIVF instruction, 4-93
DIVG instruction, 4-93
Division
integer, A-12
performance impact of, A-12
DIVS instruction, 4-94
DIVT instruction, 4-94
DRAINA (PALcode) instruction, 8-8
Dual-issue instruction considerations, A-2
D_floating data type, 2-5
alignment of, 2-6
mapping, 2-5
restricted, 2-6

E
EQV instruction, 4-36
Exception handlers, B-2
TRAPB instruction with, 4-105
EXTBL instruction, 4-44
EXTLH instruction, 4-44
EXTLL instruction, 4-44
EXTQH instruction, 4-44
EXTQL instruction, 4-44
Extract instructions (list), 4-44
EXTWH instruction, 4-44
EXTWL instruction, 4-44

F
FBEQ instruction, 4-75
FBGE instruction, 4-75
FBGT instruction, 4-75
FBLE instruction, 4-75
FBLT instruction, 4-75
FBNE instruction, 4-75
FCMOVEQ instruction, 4-80
FCMOVGE instruction, 4-80
FCMOVGT instruction, 4-80
FCMOVLE instruction, 4-80
FCMOVLT instruction, 4-80

FCMOVNE instruction, 4-80
FETCH (Prefetch Data) instruction, 4-101
performance optimization, A-10
FETCH_M (Prefetch Data, Modify Intent)
instruction, 4-101
performance optimization, A-10
Finite number, Alpha, contrasted with VAX,
4-54
Floating-point branch instructions, 4-74
Floating-point Control Register (FPCR), 4-61
accessing, 4-63
at processor initialization, 4-63
bit descriptions, 4-62
instructions to read/write, 4-82
Operate instructions that use, 4-76
saving and restoring, 4-64
Floating-point Convert instructions, 3-12
Floating-point division, performance impact
of, A-12
Floating-point format, number representation
(encodings), 4-55
Floating-point instructions
Branch (list), 4-74
faults, 4-53
introduced, 4-53
Memory format (list), 4-65
Operate (list), 4-76
rounding modes, 4-55
terminology, 4-54
trapping modes, 4-57
traps, 4-53
Floating-point load instructions, 4-65
load F_floating, 4-66
load G_floating, 4-67
load S_floating, 4-68
load T_floating, 4-69
with nonfinite values, 4-65
Floating-point operate instructions, 4-76
add (IEEE), 4-84
add (VAX), 4-83
compare (IEEE), 4-86
compare (VAX), 4-85
conditional move, 4-80
convert IEEE floating to IEEE floating,
4-92

convert IEEE floating to integer, 4-90
convert integer to IEEE floating, 4-91
convert integer to integer, 4-79
convert integer to VAX floating, 4-88
convert VAX floating to integer, 4-87
convert VAX floating to VAX floating,
4-89
copy sign, 4-78
divide (IEEE), 4-94
divide (VAX), 4-93
format of, 3-11
move from/to FPCR, 4-82
multiply (IEEE), 4-96
multiply (VAX), 4-95
opcodes for, C-3
subtract (IEEE), 4-98
subtract (VAX), 4-97
Floating-point registers, 3-2
Floating-point rounding modes
IEEE, 4-56
VAX, 4-56
Floating-point single-precision operations,
4-61
Floating-point store instructions, 4-65
store F_floating, 4-70
store G_floating, 4-71
store S_floating, 4-72
store T_floating, 4-73
with nonfinite values, 4-65
Floating-point support
FPCR (Floating-point Control Register),
4-61
IEEE, 2-6
IEEE standard 754-1985, 4-64
instruction overview, 4-53
longword integer, 2-10
Operate instructions, 4-76
optional with Alpha, 4-2
quadword integer, 2-10
rounding modes, 4-55
single-precision operations, 4-61
trap modes, 4-57
VAX, 2-3

Floating-point trapping modes, 4-57
See also Arithmetic traps
imprecision from pipelining, 4-58
FPCR (Floating-point Control Register)
See Floating-point Control Register (FPCR)
F_floating data type, 2-3
alignment of, 2-4
compared to IEEE S_floating, 2-8
MAX/MIN, 4-55
operations, 4-61

G
gentrap (PALcode) instruction, 9-1
GENTRAP (PALcode) instruction, 8-2
G_floating data type, 2-4
alignment of, 2-5
mapping, 2-4
MAX/MIN, 4-55

H
halt (PALcode) instruction, 9-2
HALT (PALcode) instruction, 6-4, 8-8

I
/I opcode qualifier, IEEE floating-point, 4-58
I-stream
design considerations, A-2
modifying physical, 5-5
modifying virtual, 5-4
PALcode with, 6-1
with caches, 5-4
IEEE convert-to-integer trap mode,
instruction notation for, 4-58
IEEE floating-point
See also Floating-point instructions
exception handlers, B-2
format, 2-6
FPCR (Floating-point Control Register),
4-61
hardware support, B-1
NaN, 2-6
options, B-1
standard charts, B-9
standard, mapping to, B-3

S_floating, 2-7
trap handling, B-4
trap modes, 4-58
T_floating, 2-8
IEEE floating-point instructions
add instructions, 4-84
compare instructions, 4-86
convert from integer instructions, 4-91
convert IEEE floating format instructions,
4-92
convert to integer instructions, 4-90
divide instructions, 4-94
multiply instructions, 4-96
opcodes for, C-3
Operate instructions, 4-76
qualifiers, summarized, C-3
subtract instructions, 4-98
IEEE rounding modes, 4-56
IEEE standard
conformance to, B-1
mapping to, B-3
support for, 4-64
IEEE trap modes, required instruction
notation, 4-58
IGN (Ignore), 1-8
imb (PALcode) instruction, 9-1
IMB (PALcode) instruction, 5-16, 6-5, 8-2
virtual I-cache coherency, 5-5
IMP (Implementation Dependent), 1-9
Infinity, defined for floating-point, 4-54
INSBL instruction, 4-47
Insert instructions (list), 4-47
INSLH instruction, 4-47
INSLL instruction, 4-47
INSQH instruction, 4-47
INSQHIL (PALcode) instruction, 8-2
INSQHILR (PALcode) instruction, 8-3
INSQHIQ (PALcode) instruction, 8-3
INSQHIQR (PALcode) instruction, 8-3
INSQL instruction, 4-47
INSQTIL (PALcode) instruction, 8-3
INSQTILR (PALcode) instruction, 8-3
INSQTIQ (PALcode) instruction, 8-4
INSQTIQR (PALcode) instruction, 8-4

INSQUEL (PALcode) instruction, 8-4
INSQUEQ (PALcode) instruction, 8-4
Instruction encodings
floating-point format, C-3
summarized, C-1
Instruction formats
Branch, 3-9
conventions, 3-8
Floating-point Convert, 3-12
Floating-point operate, 3-11
Memory, 3-8
Memory jump, 3-9
operand values, 3-8
operands, 3-8
Operate, 3-10
operators, 3-5
overview, 1-4
PALcode, 3-12
registers, 3-1
Instruction overview, 1-4
Instruction set
See also Floating-point instructions;
PALcode instructions
access type field, 3-4
Boolean (list), 4-35
branch (list), 4-16
byte (list), 4-41
conditional move (integer), 4-37
data type field, 3-5
extract (list), 4-41
floating-point subsetting, 4-2
insert (list), 4-41
integer arithmetic (list), 4-21
introduced, 1-6
jump (list), 4-16
load memory integer (list), 4-4
mask (list), 4-41
miscellaneous (list), 4-99
name field, 3-4
opcode qualifiers, 4-3
operand notation, 3-4
overview, 4-1
shift, arithmetic, 4-40
shift, logical, 4-39
software emulation rules, 4-2

store memory integer (list), 4-4
VAX compatibility, 4-106
Instruction stream
See I-stream
INSWH instruction, 4-47
INSWL instruction, 4-47
Integer arithmetic instructions
See Arithmetic instructions
Integer division, A-12
Integer registers
defined, 3-1
R31 restrictions, 3-1

J
JMP instruction, 4-19
JSR instruction, 4-19
JSR_COROUTINE instruction, 4-19
Jump instructions, 4-16, 4-19
See also Control instructions
branch prediction logic, 4-20
coroutine linkage, 4-20
return from subroutine, 4-19
unconditional long jump, 4-20

L
LDA instruction, 4-5
LDAH instruction, 4-5
LDF instruction, 4-66
LDG instruction, 4-67
LDL instruction, 4-6
LDL_L instruction, 4-8
restrictions, 4-9
with processor lock register/flag, 4-8
with STx_C instruction, 4-8
LDQ instruction, 4-6
LDQP (PALcode) instruction, 8-8
LDQ_L instruction, 4-8
restrictions, 4-9
with processor lock register/flag, 4-8
with STx_C instruction, 4-8
LDQ_U instruction, 4-7
LDS instruction, 4-68
LDT instruction, 4-69
Literals, operand notation, 3-4

Load instructions
See also Floating-point load instructions
emulation of, 4-2
FETCH instruction, 4-101
load address, 4-5
load address high, 4-5
load quadword, 4-6
load quadword locked, 4-8
load sign-extended longword, 4-6
load sign-extended longword locked, 4-8
load unaligned quadword, 4-7
multiprocessor environment, 5-5
serialization, 4-103
Load literal, A-13
Load memory integer instructions (list), 4-4
Location, 5-9
Location access order
defined, 5-11
with processor issue order, 5-12
Lock flag, per-processor
defined, 3-2
with load locked instructions, 4-8
with store conditional instructions, 4-11
Lock registers, per-processor
defined, 3-2
with load locked instructions, 4-8
with store conditional instructions, 4-11
Logical instructions
See Boolean instructions
Longword data type, 2-2
atomic access of, 5-2
integer floating-point format, 2-10
LSB (least significant bit), defined for
floating-point, 4-54

M
/M opcode qualifier, IEEE floating-point,
4-56
Mask instructions (list), 4-49
MAX, defined for floating-point, 4-55
MB (Memory Barrier) instruction, 4-103
See also IMB
multiprocessors only, 4-103
using, 5-17
with DMA I/O, 5-16

with multiprocessor D-stream, 5-16
MBZ (Must be Zero), 1-8
Memory access
aligned byte/word, A-11
coherency of, 5-1
granularity of, 5-2
width of, 5-2
Memory access sequence, 5-11
Memory alignment, requirement for, 5-2
Memory format instructions
function codes, summarized, C-1
opcodes for, C-1
Memory instruction format, 3-8
with function code, 3-9
Memory jump instruction format, 3-9
Memory management, support in PALcode,
6-1
Memory prefetch registers, A-10
defined, 3-2
Memory-like behavior, 5-3
MFPR (PALcode) instruction, 8-8
MF_FPCR instruction, 4-82
MIN, defined for floating-point, 4-55
Miscellaneous instructions (list), 4-99
Move instructions (conditional)
See Conditional move instructions
Move, register-to-register, A-14
MSKBL instruction, 4-49
MSKLH instruction, 4-49
MSKLL instruction, 4-49
MSKQL instruction, 4-49
MSKWH instruction, 4-49
MSKWL instruction, 4-49
MTPR (PALcode) instruction, 8-8
MT_FPCR instruction, 4-82
synchronization requirement, 4-63
MULF instruction, 4-95
MULG instruction, 4-95
MULL instruction, 4-28
with MULQ, 4-28
MULQ instruction, 4-29
with MULL, 4-28
with UMULH, 4-29
MULS instruction, 4-96
MULT instruction, 4-96

Multiple instruction issue, A-2
Multiply instructions
See also Floating-point Operate
multiply longword, 4-28
multiply quadword, 4-29
multiply unsigned quadword high, 4-30
Multiprocessor environment
See also Data sharing
cache coherency in, 5-5
context switching, 5-17
I-stream reliability, 5-16
MB instruction, 5-16
no implied barriers, 5-15
read/write ordering, 5-8
serialization requirements in, 4-103
shared data, 5-5, A-7

N
NaN (Not-a-Number)
defined, 2-6
Quiet, 4-54
Signaling, 4-54
NATURALLY ALIGNED data objects
See ALIGNED data objects
Negate stylized code form, A-14
Non-memory-like behavior, 5-3
NOP, A-12
NOT instruction, ORNOT with zero, 4-36
NOT stylized code form, A-14

O
Opcode qualifiers
See also specific qualifiers
default values, 4-3
notation (list), 4-3
Opcodes
reserved, C-6
summarized, C-6
Operand expressions, 3-3
Operand notation
defined, 3-2
from VAX architecture standard, 3-4
Operand values, 3-3

Operate format instructions, opcodes for, C-2
Operate instruction format, 3-10
Floating-point, 3-11
Floating-point Convert, 3-12
Operators, instruction format, 3-5
Optimization
See Performance optimizations
ORNOT instruction, 4-36
OSF/1 privileged PALcode instructions, 9-2
OSF/1 unprivileged PALcode instructions, 9-1

P
PALcode
barriers with, 5-15
CALL_PAL instruction, described, 4-100
compared to hardware instructions, 6-1
Digital-defined for Alpha OSF/1, 9-1
Digital-defined for Alpha VMS, 8-1
implementation-specific, 6-1
instead of microcode, 6-1
instruction format, 3-12
overview, 6-1
privileged Alpha OSF/1, 9-2
privileged VAX VMS, 8-8
replacing, 6-2
required function support, 6-2
required instructions, 6-3
running environment, 6-1
special functions, 6-2
unprivileged Alpha OSF/1, 9-1
unprivileged Alpha VMS, 8-1
PALcode instructions
opcodes for required, C-5
opcodes reserved for, C-6
PALRES0, 6-2
PALRES1, 6-2
PALRES2, 6-2
PALRES3, 6-2
PALRES4, 6-2
PC
See Program Counter register
PCC
See Process Cycle Counter

Performance optimizations
branch prediction, A-3
code sequences, A-11
D-stream, A-6
for frequently executed code, A-1
for I-streams, A-2
I-stream density, A-4
instruction alignment, A-2
instruction scheduling, A-5
multiple instruction issue, A-2
shared data, A-7
Prefetch data (FETCH instruction), 4-101
Prefetch data registers, A-10
Prefetching data, considerations, A-10
Privileged Architecture Library
See PALcode
PROBE (PALcode) instruction, 8-4
Process Cycle Counter (PCC), RPCC
instruction with, 4-104
Processor issue order
defined, 5-10
with location access order, 5-12
Processor issue sequence, 5-10
Program Counter (PC) register, 3-1
Pseudo-ops, A-15

Q
Quadword data type, 2-2
alignment of, 2-2
atomic access of, 5-2
integer floating-point format, 2-10
T_floating with, 2-10

R
R31, restrictions, 3-1
RAZ (Read as Zero), 1-8
RC (Read and Clear) instruction, 4-107
rdps (PALcode) instruction, 9-2
rdunique (PALcode) instruction, 9-1
rdusp (PALcode) instruction, 9-2
rdval (PALcode) instruction, 9-2

RD_PS (PALcode) instruction, 8-4
Read/write ordering (multiprocessor), 5-8
determining requirements, 5-8
memory location defined, 5-9
Read/write, sequential, A-9
READ_UNQ (PALcode) instruction, 8-4
Register-to-register move, A-14
Registers, 3-1
floating-point, 3-2
integer, 3-1
lock, 3-2
memory prefetch, 3-2
optional, 3-2
Program Counter (PC), 3-1
value when unused, 3-8
VAX compatibility, 3-2
REI (PALcode) instruction, 8-5
REMQHIL (PALcode) instruction, 8-5
REMQHILR (PALcode) instruction, 8-5
REMQHIQ (PALcode) instruction, 8-5
REMQHIQR (PALcode) instruction, 8-5
REMQTIL (PALcode) instruction, 8-6
REMQTILR (PALcode) instruction, 8-6
REMQTIQ (PALcode) instruction, 8-6
REMQTIQR (PALcode) instruction, 8-6
REMQUEL (PALcode) instruction, 8-6
REMQUEQ (PALcode) instruction, 8-7
Representative result, defined for
floating-point, 4-54
Reserved instructions, opcodes for, C-6
Reserved operand, defined for floating-point,
4-55
Result latency, A-5
RET instruction, 4-19
retsys (PALcode) instruction, 9-2
Rounding modes
See Floating-point rounding modes
RPCC (Read Process Cycle Counter)
instruction, 4-104
RS (Read and Set) instruction, 4-107
RSCC (PALcode) instruction, 8-7
rti (PALcode) instruction, 9-2

S
/S opcode qualifier
IEEE floating-point, 4-58
VAX floating-point, 4-57
S4ADDL instruction, 4-23
S4ADDQ instruction, 4-25
S4SUBL instruction, 4-32
S4SUBQ instruction, 4-34
S8ADDL instruction, 4-23
S8ADDQ instruction, 4-25
S8SUBL instruction, 4-32
S8SUBQ instruction, 4-34
SBZ (Should be Zero), 1-8
Security holes, 1-6
with UNPREDICTABLE results, 1-7
Sequential read/write, A-9
Serialization, MB instruction with, 4-103
Shared data (multiprocessor), A-7
changed vs. updated datum, 5-5
Shared data structures
atomic update, 5-6
ordering considerations, 5-7
using Memory Barrier (MB) instruction, 5-8
Shared memory
access sequence, 5-10
accessing, 5-10
defined, 5-9
issue sequence, 5-10
Shift arithmetic instructions, 4-40
Shift logical instructions, 4-39
Single-precision floating-point, 4-61
SLL instruction, 4-39
Software considerations, A-1
See also Performance optimizations
SRA instruction, 4-40
SRL instruction, 4-39
STF instruction, 4-70
STG instruction, 4-71
STL instruction, 4-13
STL_C instruction, 4-11
with LDx_L instruction, 4-11
with processor lock register/flag, 4-11

Store instructions
See also Floating-point store instructions
emulation of, 4-2
FETCH instruction, 4-101
multiprocessor environment, 5-5
serialization, 4-103
store longword, 4-13
store longword conditional, 4-11
store quadword, 4-13
store quadword conditional, 4-11
store unaligned quadword, 4-14
Store memory integer instructions (list), 4-4
STQ instruction, 4-13
STQP (PALcode) instruction, 8-8
STQ_C instruction, 4-11
with LDx_L instruction, 4-11
with processor lock register/flag, 4-11
STQ_U instruction, 4-14
STS instruction, 4-72
STT instruction, 4-73
SUBF instruction, 4-97
SUBG instruction, 4-97
SUBL instruction, 4-31
SUBQ instruction, 4-33
SUBS instruction, 4-98
SUBT instruction, 4-98
Subtract instructions
See also Floating-point Operate
subtract longword, 4-31
subtract quadword, 4-33
subtract scaled longword, 4-32
subtract scaled quadword, 4-34
SWASTEN (PALcode) instruction, 8-7
swpctx (PALcode) instruction, 9-2
SWPCTX (PALcode) instruction, 9-2
swpipl (PALcode) instruction, 9-2
S_floating data type
alignment of, 2-8
compared to F_floating, 2-8
exceptions, 2-8
format, 2-7
mapping, 2-7
MAX/MIN, 4-55
operations, 4-61

T
tbi (PALcode) instruction, 9-2
Timing considerations, atomic sequences,
A-17
Trap handler, with non-finite arithmetic
operands, 4-59
Trap handling, IEEE floating-point, B-4
Trap modes
Floating-point, 4-57
IEEE, 4-58
IEEE convert-to-integer, 4-58
VAX, 4-57
VAX convert-to-integer, 4-58
Trap shadow
defined, 4-58
defined for floating-point, 4-55
trap handler requirement for, 4-58
TRAPB (Trap Barrier) instruction, A-14
described, 4-105
with MT_FPCR, 4-63
with trap shadow, 4-58
True result, defined for floating-point, 4-54
True zero, defined for floating-point, 4-54
T_floating data type
alignment of, 2-9
exceptions, 2-9
format, 2-9
MAX/MIN, 4-55

U
/U opcode qualifier
IEEE floating-point, 4-58
VAX floating-point, 4-57
UMULH instruction, 4-30
with MULQ, 4-29
UNALIGNED data objects, 1-8
Unconditional long jump, 4-20
UNDEFINED results, 1-7
UNORDERED memory references, 5-8
UNPREDICTABLE results, 1-7
Updated datum, 5-5

V
/V opcode qualifier
IEEE floating-point, 4-58
VAX floating-point, 4-58
VAX compatibility instructions, restrictions
for, 4-106
VAX compatibility register, 3-2
VAX convert-to-integer trap mode, 4-58
VAX floating-point
See also Floating-point instructions
D_floating, 2-5
F_floating, 2-3
G_floating, 2-4
trap modes, 4-58
VAX floating-point instructions
add instructions, 4-83
compare instructions, 4-85
convert from integer instructions, 4-88
convert to integer instructions, 4-87
convert VAX floating format instructions,
4-89
divide instructions, 4-93
multiply instructions, 4-95
opcodes for, C-5
Operate instructions, 4-76
qualifiers, summarized, C-5
subtract instructions, 4-97
VAX rounding modes, 4-56
VAX trap modes, required instruction
notation, 4-58
VAX VMS privileged PALcode instructions,
8-8
Virtual D-cache, 5-3
maintaining coherency of, 5-3
Virtual I-cache, 5-3
maintaining coherency of, 5-5
VMS unprivileged PALcode instructions, 8-1

W
whami (PALcode) instruction, 9-3
Word data type, 2-1
wrent (PALcode) instruction, 9-3
wrfen (PALcode) instruction, 9-3
Write buffers, requirements for, 5-4
Write-back caches, requirements for, 5-4
WRITE_UNQ (PALcode) instruction, 8-7
wrkgp (PALcode) instruction, 9-3
wrunique (PALcode) instruction, 9-1
wrusp (PALcode) instruction, 9-3
wrval (PALcode) instruction, 9-3
wrvptptr (PALcode) instruction, 9-3
WR_PS_SW (PALcode) instruction, 8-7

X
XOR instruction, 4-36

Z
ZAP instruction, 4-52
ZAPNOT instruction, 4-52
Zero byte instructions (list), 4-52

Alpha Architecture Handbook
Reader's Comments

Your comments and suggestions will help us in our continuous effort to improve the quality
and usefulness of our handbooks.
What is your general reaction to this handbook? (Format, accuracy, completeness,
organization, etc.)

What features are most useful?

Does the publication satisfy your needs?

What errors have you found?

Additional Comments

Name

Title

Company

Address

City

State

Zip

EC-H1689-10

(please tape here)

(please fold here)

NO POSTAGE
NECESSARY IF
MAILED IN THE
UNITED STATES

BUSINESS REPLY MAIL
FIRST CLASS PERMIT NO. 33 MAYNARD, MASS.
POSTAGE WILL BE PAID BY ADDRESSEE

DIGITAL EQUIPMENT CORPORATION
85 Swanson Road (BXB1-1/F04)
Boxboro, MA 01719-9960
