EK-ES450-SV-A01
July 2001
448 pages
Document: AlphaServer ES45 Service Guide
Order Number: EK-ES450-SV
Revision: A01
Pages: 448
OCR Text
AlphaServer ES45 Service Guide

Order Number: EK-ES450-SV. A01

This manual is for service providers and self-maintenance customers responsible for ES45 systems.

Compaq Computer Corporation

First Printing, July 2001

© 2001 Compaq Computer Corporation

Compaq, the Compaq logo, Compaq Insight Manager, AlphaServer, StorageWorks, and TruCluster Registered in U.S. Patent and Trademark Office. OpenVMS and Tru64 are trademarks of Compaq Information Technologies Group, L.P. in the United States and other countries.

Linux is a registered trademark of Linus Torvalds in several countries. UNIX is a trademark of The Open Group in the United States and other countries. All other product names mentioned herein may be trademarks of their respective companies.

Compaq shall not be liable for technical or editorial errors or omissions contained herein. The information in this document is provided “as is” without warranty of any kind and is subject to change without notice. The warranties for Compaq products are set forth in the express limited warranty statements accompanying such products. Nothing herein should be construed as constituting an additional warranty.

FCC Notice

This equipment generates, uses, and may emit radio frequency energy. The equipment has been type tested and found to comply with the limits for a Class A digital device pursuant to Part 15 of FCC rules, which are designed to provide reasonable protection against such radio frequency interference.

Operation of this equipment in a residential area may cause interference in which case the user at his own expense will be required to take whatever measures may be required to correct the interference.

Any modifications to this device—unless expressly approved by the manufacturer—can void the user’s authority to operate this equipment under part 15 of the FCC rules.
Modifications

The FCC requires the user to be notified that any changes or modifications made to this device that are not expressly approved by Compaq Computer Corporation may void the user's authority to operate the equipment.

Cables

Connections to this device must be made with shielded cables with metallic RFI/EMI connector hoods in order to maintain compliance with FCC Rules and Regulations.

Taiwanese Notice

Japanese Notice

Canadian Notice

This Class A digital apparatus meets all requirements of the Canadian Interference-Causing Equipment Regulations.

Avis Canadien

Cet appareil numérique de la classe A respecte toutes les exigences du Règlement sur le matériel brouilleur du Canada.

European Union Notice

Products with the CE Marking comply with both the EMC Directive (89/336/EEC) and the Low Voltage Directive (73/23/EEC) issued by the Commission of the European Community.

Compliance with these directives implies conformity to the following European Norms (in brackets are the equivalent international standards):

EN55022 (CISPR 22) - Electromagnetic Interference
EN50082-1 (IEC801-2, IEC801-3, IEC801-4) - Electromagnetic Immunity
EN60950 (IEC950) - Product Safety

Warning! This is a Class A product. In a domestic environment this product may cause radio interference in which case the user may be required to take adequate measures.

Achtung! Dieses ist ein Gerät der Funkstörgrenzwertklasse A. In Wohnbereichen können bei Betrieb dieses Gerätes Rundfunkstörungen auftreten, in welchen Fällen der Benutzer für entsprechende Gegenmaßnahmen verantwortlich ist.

Attention! Ceci est un produit de Classe A. Dans un environnement domestique, ce produit risque de créer des interférences radioélectriques, il appartiendra alors à l'utilisateur de prendre les mesures spécifiques appropriées.

Contents

Preface ..... xvii

Chapter 1 System Overview
1.1 System Architecture ..... 1-2
1.2 System Enclosures ..... 1-4
1.3 System Chassis—Front View/Top View ..... 1-6
1.4 System Chassis—Rear View ..... 1-7
1.5 Hot Swap Module ..... 1-9
1.6 I/O Ports and Slots ..... 1-10
1.7 Control Panel ..... 1-12
1.8 System Motherboard ..... 1-14
1.9 CPU Card ..... 1-16
1.10 Memory Architecture and Options ..... 1-17
1.11 PCI Backplane ..... 1-19
1.12 Remote System Management Logic ..... 1-21
1.12.1 System Power Controller (SPC) ..... 1-23
1.12.2 Remote Management Console (RMC) ..... 1-24
1.13 Power Supplies ..... 1-25
1.14 Fans ..... 1-27
1.15 Removable Media Storage ..... 1-29
1.16 Hard Disk Drive Storage ..... 1-30
1.17 System Access ..... 1-31
1.18 Console Terminal ..... 1-33

Chapter 2 Troubleshooting
2.1 Questions to Consider ..... 2-2
2.2 Diagnostic Tables ..... 2-3
2.3 Service Tools and Utilities ..... 2-9
2.3.1 Error Handling/Logging Tools (Compaq Analyze) ..... 2-9
2.3.2 Loopback Tests ..... 2-9
2.3.3 SRM Console Commands ..... 2-9
2.3.4 Remote Management Console (RMC) ..... 2-10
2.3.5 Crash Dumps ..... 2-10
2.3.6 Revision and Configuration Management Tool (RCM) ..... 2-10
2.4 Q-Vet Installation Verification ..... 2-11
2.4.1 Installing Q-Vet ..... 2-13
2.4.2 Running Q-Vet ..... 2-15
2.4.3 Reviewing Results of the Q-Vet Run ..... 2-17
2.4.4 De-Installing Q-Vet ..... 2-19
2.5 Information Resources ..... 2-20
2.5.1 Compaq Service Tools CD ..... 2-20
2.5.2 ES45 Service HTML Help File ..... 2-20
2.5.3 Alpha Systems Firmware Updates ..... 2-20
2.5.4 Fail-Safe Loader ..... 2-21
2.5.5 Software Patches ..... 2-21
2.5.6 Learning Utility ..... 2-21
2.5.7 Late-Breaking Technical Information ..... 2-21
2.5.8 Supported Options ..... 2-22

Chapter 3 Power-Up Diagnostics and Display
3.1 Overview of Power-Up Diagnostics ..... 3-2
3.2 System Power-Up Sequence ..... 3-3
3.3 Power-Up Displays ..... 3-6
3.3.1 SROM Power-Up Display ..... 3-6
3.3.2 SRM Console Power-Up Display ..... 3-10
3.3.3 SRM Console Event Log ..... 3-13
3.4 Power-Up Error Messages ..... 3-14
3.4.1 SROM Messages with Beep Codes ..... 3-14
3.4.2 Checksum Error ..... 3-16
3.4.3 No MEM Error ..... 3-18
3.4.4 RMC Error Messages ..... 3-19
3.4.5 SROM Error Messages ..... 3-22
3.5 Forcing a Fail-Safe Floppy Load ..... 3-24
3.6 Updating the RMC ..... 3-26

Chapter 4 SRM Console Diagnostics
4.1 Diagnostic Command Summary ..... 4-2
4.2 buildfru ..... 4-6
4.3 cat el and more el ..... 4-10
4.4 clear_error ..... 4-11
4.5 crash ..... 4-12
4.6 deposit and examine ..... 4-13
4.7 exer ..... 4-17
4.8 floppy_write ..... 4-22
4.9 grep ..... 4-23
4.10 hd ..... 4-25
4.11 info ..... 4-27
4.12 Kill and Kill_diags ..... 4-41
4.13 memexer ..... 4-42
4.14 memtest ..... 4-44
4.15 net ..... 4-49
4.16 nettest ..... 4-51
4.17 set sys_serial_num ..... 4-55
4.18 show error ..... 4-56
4.19 show fru ..... 4-59
4.20 show_status ..... 4-62
4.21 sys_exer ..... 4-64
4.22 test ..... 4-66

Chapter 5 Error Logs
5.1 Error Log Analysis with Compaq Analyze ..... 5-2
5.1.1 WEB Enterprise Service (WEBES) Director ..... 5-4
5.1.2 Using Compaq Analyze ..... 5-5
5.1.3 Bit to Test ..... 5-10
5.2 Fault Detection and Reporting ..... 5-17
5.3 Machine Checks/Interrupts ..... 5-19
5.3.1 Error Logging and Event Log Entry Format ..... 5-21
5.4 Environmental Errors Captured by SRM ..... 5-23

Chapter 6 System Configuration and Setup
6.1 System Consoles ..... 6-2
6.1.1 Selecting the Display Device ..... 6-3
6.1.2 Setting the Control Panel Message ..... 6-4
6.2 Displaying the Hardware Configuration ..... 6-5
6.3 Setting Environment Variables ..... 6-6
6.4 Setting Automatic Booting ..... 6-16
6.4.1 Setting the Operating System to Auto Start ..... 6-16
6.5 Changing the Default Boot Device ..... 6-17
6.6 Setting SRM Security ..... 6-18
6.7 Configuring Devices ..... 6-20
6.7.1 CPU Configuration ..... 6-21
6.7.2 Memory Configuration ..... 6-23
6.7.3 PCI Configuration ..... 6-29
6.7.4 PCI Module LEDs ..... 6-31
6.7.5 Power Supply Configurations ..... 6-32
6.9 Booting Linux ..... 6-34

Chapter 7 Using the Remote Management Console
7.1 RMC Overview ..... 7-2
7.2 Operating Modes ..... 7-4
7.2.1 Bypass Modes ..... 7-6
7.3 Terminal Setup ..... 7-9
7.4 Connecting to the RMC CLI ..... 7-10
7.5 SRM Environment Variables for COM1 ..... 7-12
7.6 RMC Command-Line Interface ..... 7-13
7.6.1 Defining the COM1 Data Flow ..... 7-15
7.6.2 Displaying the System Status ..... 7-16
7.6.3 Displaying the System Environment ..... 7-18
7.6.4 Dumping DPR Data ..... 7-20
7.6.5 Power On and Off, Reset, and Halt ..... 7-22
7.6.6 Configuring Remote Dial-In ..... 7-24
7.6.7 Configuring Dial-Out Alert ..... 7-26
7.6.8 Resetting the Escape Sequence ..... 7-29
7.7 Resetting the RMC to Factory Defaults ..... 7-30
7.8 Troubleshooting Tips ..... 7-32

Chapter 8 FRU Removal and Replacement
8.1 FRUs ..... 8-3
8.1.1 Power Cords ..... 8-6
8.1.2 FRU Locations ..... 8-7
8.1.3 Important Information Before Replacing FRUs ..... 8-9
8.2 Removing Enclosure Panels ..... 8-11
8.3 Accessing the System Chassis in a Cabinet ..... 8-15
8.4 Removing Covers from the System Chassis ..... 8-17
8.5 Power Supply ..... 8-21
8.6 Fans ..... 8-23
8.7 Universal Hard Disk Drives ..... 8-25
8.8 Removing the Shipping Bracket ..... 8-28
8.9 CPUs ..... 8-30
8.10 Memory DIMMs ..... 8-33
8.10.1 Determining Memory Configuration ..... 8-37
8.11 PCI Cards ..... 8-39
8.11.1 Replacing the PCI Hot Swap Module ..... 8-43
8.12 OCP Assembly ..... 8-44
8.13 Installing Disk Cages ..... 8-45
8.13 Cabling a Second Disk Drive Cage ..... 8-49
8.13 Adding or Replacing Removable Media ..... 8-50
8.14 Floppy Drive ..... 8-53
8.15 I/O Connector Assembly ..... 8-55
8.16 PCI Backplane ..... 8-57
8.17 System Motherboard ..... 8-62
8.18 Power Harness ..... 8-65

Appendix A SRM Console Commands

Appendix B Jumpers and Switches
B.1 RMC and SPC Jumpers on System Motherboard ..... B-2
B.2 TIG/FSL Jumpers on System Motherboard ..... B-4
B.3 Clock Generator Switch Settings ..... B-7
B.4 Jumper on PCI Board ..... B-9
B.5 Setting Jumpers ..... B-11

Appendix C DPR Address Layout

Appendix D Registers
D.1 Ibox Status Register (I_STAT) ..... D-2
D.2 Memory Management Status Register (MM_STAT) ..... D-5
D.3 Dcache Status Register (DC_STAT) ..... D-6
D.4 Cbox Read Register ..... D-7
D.5 Exception Address Register (EXC_ADDR) ..... D-9
D.6 Interrupt Enable and Current Processor Mode Register (IER_CM) ..... D-10
D.7 Interrupt Summary Register (ISUM) ..... D-12
D.8 PAL Base Register (PAL_BASE) ..... D-14
D.9 Ibox Control Register (I_CTL) ..... D-15
D.10 Process Context Register (PCTX) ..... D-20
D.11 21274 Cchip Miscellaneous Register (MISC) ..... D-23
D.12 21274 Cchip CPU Device Interrupt Request Register (DIRn, n=0,1,2,3) ..... D-26
D.13 21274 Array Address Registers (AAR0–AAR3) ..... D-27
D.14 Pchip System Error Register (SERROR) ..... D-29
D.15 Pchip A/G PCI Error Register (GPERROR, APERROR) ..... D-31
D.16 Pchip AGP Error Register (AGPERROR) ..... D-35
D.17 DPR Registers for 680 Correctable Machine Check Logout Frames ..... D-37
D.18 DPR Power Supply Status Registers ..... D-40
D.19 DPR 680 Fatal Registers ..... D-41
D.20 CPU and System Uncorrectable Machine Check Logout Frame ..... D-42
D.21 Console Data Log Event Environmental Error Logout Frame (680 Uncorrectable) ..... D-43
D.22 CPU and System Correctable Machine Check Logout Frame ..... D-44
D.23 Environmental Error Logout Frame (680 Correctable) ..... D-45
D.24 Platform Logout Frame Register Translation ..... D-46

Appendix E Isolating Failing DIMMs
E.1 Information for Isolating Failures ..... E-2
E.2 DIMM Isolation Procedure ..... E-3
E.3 EV68 Single-Bit Errors ..... E-19

Examples
3–1 Sample SROM Power-Up Display ..... 3-6
3–2 SRM Power-Up Display ..... 3-10
3–3 Sample Console Event Log ..... 3-13
3–4 Checksum Error and Fail-Safe Load ..... 3-16
4–1 buildfru ..... 4-6
4–2 more el ..... 4-10
4–3 clear_error ..... 4-11
4–4 deposit and examine ..... 4-13
4–5 exer ..... 4-17
4–6 floppy_write ..... 4-22
4–7 grep ..... 4-23
4–8 hd ..... 4-25
4–9 info 0 ..... 4-27
4–10 info 1 ..... 4-29
4–11 info 2 ..... 4-30
4–12 info 3 ..... 4-31
4–13 info 4 ..... 4-33
4–14 info 5 ..... 4-34
4–15 info 6 ..... 4-36
4–16 info 7 ..... 4-38
4–17 info 8 ..... 4-40
4–18 kill and kill_diags ..... 4-41
4–19 memexer ..... 4-42
4–20 memtest ..... 4-44
4–21 net -ic and net -s ..... 4-49
4–22 nettest ..... 4-51
4–23 set sys_serial_num ..... 4-55
4–24 show error ..... 4-56
4–25 show fru ..... 4-59
4–26 show status ..... 4-62
4–27 sys_exer ..... 4-64
4–28 test -lb ..... 4-66
5–1 Console Level Environmental Error Logout Frame ..... 5-23
6–1 set ocp_text ..... 6-4
6–2 Set Password ..... 6-18
6–3 set secure ..... 6-19
6–4 clear password ..... 6-19
6–5 Linux Boot Output ..... 6-35
7–1 set com1_mode ..... 7-15
7–2 status ..... 7-16
7–3 env ..... 7-18
7–4 dump ..... 7-20
7–5 power on/off ..... 7-22
7–6 halt in/out ..... 7-23
7–7 reset ..... 7-23
7–8 Dial-In Configuration ..... 7-24
7–9 Dial-Out Alert Configuration ..... 7-26
7–10 set escape ..... 7-29

Figures
1–1 System Block Diagram ..... 1-2
1–2 ES45 Systems ..... 1-4
1–3 Components Top/Front View (Pedestal/Rackmount Orientation) ..... 1-6
1–4 Rear Components (Pedestal/Rackmount Orientation) ..... 1-7
1–5 Hot Swap Module ..... 1-9
1–6 Rear Connectors ..... 1-10
1–7 Control Panel ..... 1-12
1–8 Component and Connector Locations ..... 1-14
1–9 CPU Card ..... 1-16
1–10 Memory Architecture ..... 1-17
1–11 I/O Control Logic ..... 1-19
1–12 Remote System Management Logic Diagram ..... 1-21
1–13 Power Supplies ..... 1-25
1–14 System Fans ..... 1-27
1–15 Removable Media Drive Area ..... 1-29
1–16 Hard Disk Storage Cage with Drives (Pedestal/Rack) ..... 1-30
1–17 System Lock and Key ..... 1-31
1–18 Console Terminal Connections (Local) ..... 1-33
3–1 Power-Up Sequence ..... 3-4
3–2 Function Jumpers ..... 3-24
5–1 Compaq Analyze Initial Screen ..... 5-5
5–2 Problem Reports Screen ..... 5-6
5–3 Compaq Analyze Problem Report Details ..... 5-7
5–4 Compaq Analyze Problem Report Details (Continued) ..... 5-8
6–1 CPU Slot Locations (Pedestal/Rack) ..... 6-21
6–2 CPU Slot Locations (Tower) ..... 6-22
6–3 Stacked and Unstacked DIMMs ..... 6-24
6–4 Memory Configuration (Pedestal/Rack) ..... 6-27
6–5 Memory Configuration (Tower) ..... 6-28
6–6 PCI Slot Locations (Pedestal/Rack) ..... 6-29
6–7 PCI Slot Voltages and Hose Numbers ..... 6-30
6–8 PCI Slot Locations (Tower) ..... 6-31
6–9 PCI Status LEDs ..... 6-32
6–10 Power Supply Locations ..... 6-33
7–1 Data Flow in Through Mode ..... 7-4
7–2 Data Flow in Bypass Mode ..... 7-6
7–3 Terminal Setup for RMC (Tower View) ..... 7-9
7–4 RMC Jumpers (Default Positions) ..... 7-30
8–1 FRUs — Front/Top (Pedestal/Rack View) ..... 8-7
8–2 FRUs — Rear (Pedestal/Rack View) ..... 8-8
8–3 Enclosure Panel Removal (Tower) ..... 8-11
8–4 Enclosure Panel Removal (Pedestal) ..... 8-13
8–5 Accessing the Chassis in a Cab ..... 8-15
8–6 Moving the Inner Race Forward ..... 8-16
8–7 Covers on the System Chassis (Tower) ..... 8-19
8–8 Covers on the System Chassis (Pedestal/Rack) ..... 8-20
8–9 Replacing or Adding a Power Supply ..... 8-21
8–10 Replacing Fans ..... 8-23
8–11 Replacing or Adding a Hard Drive ..... 8-26
8–12 Removing the Shipping Bracket ..... 8-28
8–13 Adding or Replacing CPU Cards ..... 8-30
8–14 Installing and Removing MMBs and DIMMs ..... 8-34
8–15 Aligning DIMM in MMB ..... 8-36
8–16 Pedestal/Rack Memory Configuration ..... 8-37
8–17 Tower Memory Configuration ..... 8-38
8–18 Installing or Replacing a PCI Card ..... 8-40
8–19 PCI Module Hot Swap Assembly ..... 8-42
8–20 Replacing the OCP Assembly ..... 8-44
8–21 Cabling and Preparation for Installing Disk Cages ..... 8-45
8–22 Disk Cage Installation ..... 8-47
8–23 Cabling a Second Disk Cage ..... 8-49
8–24 Adding a 5.25-Inch Device ..... 8-50
8–25 Replacing the Floppy Drive ..... 8-53
8–26 Replacing the I/O Connector Assembly ..... 8-55
8–27 Cables Connected to PCI Backplane ..... 8-57
8–28 Removing the Separators ..... 8-59
8–29 Replacing the PCI Backplane ..... 8-60
8–30 Replacing the System Motherboard ..... 8-62
8–31 Replacing the Power Harness ..... 8-65
B–1 RMC and SPC Jumpers ..... B-2
B–2 TIG/SROM Jumpers ..... B-4
B–3 CSB Switchpack E16 ..... B-7
B–4 PCI Board Jumpers ..... B-9

Tables
1 Compaq AlphaServer ES45 Documentation ..... xviii
1–1 Fan Descriptions ..... 1-28
2–1 Power Problems ..... 2-4
2–2 Problems Getting to Console Mode ..... 2-5
2–3 Problems Reported by the Console ..... 2-6
2–4 Boot Problems ..... 2-7
2–5 Errors Reported by the Operating System ..... 2-8
3–1 Error Beep Codes ..... 3-14
3–2 RMC Fatal Error Messages ..... 3-19
3–3 RMC Warning Messages ..... 3-20
3–4 SROM Error Messages ..... 3-22
4–1 Summary of Diagnostic and Related Commands ..... 4-2
4–2 Show Error Message Translation ..... 4-58
4–3 Bit Assignments for Error Field ..... 4-61
5–1 Common Event Header Example Table (CEH) V2.0 ..... 5-11
5–2 5–3 5–4 6–1 7–1 7–2 7–3 ES45 Fault Detection and Correction .....
5-18 Machine Checks/Interrupts................................................................. 5-19 Sample Error Log Event Structure Map (ES45 with 10 PCI Slots).... 5-22 SRM Environment Variables ................................................................ 6-8 Status Command Fields...................................................................... 7-17 Elements of Dial String and Alert String ........................................... 7-28 RMC Troubleshooting ......................................................................... 7-32 8–1 8–2 A–1 B–1 B–2 B–3 B–4 B–5 C–1 D–1 D–2 D–3 D–4 D–5 D–6 D–7 D–8 D–9 D–10 D–11 D–12 D–13 D–14 D–15 D–16 D–17 D–18 D–19 D–20 D–21 D–22 D–23 FRU List................................................................................................ 8-3 Country-Specific Power Cords............................................................... 8-6 SRM Commands Used on ES45 Systems..............................................A-1 RMC/SPC Jumper Settings...................................................................B-3 TIG/FSL Jumper Descriptions..............................................................B-5 Firmware Function Table (FIR_FUNC.................................................B-5 Clock Generator Settings ......................................................................B-8 PCI Board Jumper Descriptions .........................................................B-10 DPR Address Layout.............................................................................C-2 Ibox Status Register Fields.................................................................. D-2 Memory Management Status Register Fields...................................... D-5 Dcache Status Register Fields ............................................................. D-6 Cbox Read Register Fields ................................................................... 
D-7 IER_CM Register Fields .................................................................... D-11 ISUM Register Fields......................................................................... D-13 PAL_BASE Register Fields................................................................ D-14 I_CTL Register Fields ........................................................................ D-16 PCTX Register Fields......................................................................... D-21 21274 Cchip Miscellaneous Register Fields....................................... D-24 21274 Device Interrupt Request Register Fields ............................... D-26 21274 Array Address Register (AAR) ................................................ D-27 Pchip System Error Register.............................................................. D-29 Pchip Error Register .......................................................................... D-32 Pchip AGP Error Register.................................................................. D-35 DPR Locations A0:A9......................................................................... D-37 Nine Bytes Read from Power Supply ................................................. D-40 DPR 680 Fatal Registers.................................................................... D-41 CPU and System Uncorrectable Machine Check Logout Frame ....... D-42 Console Data Log Event Environmental Error Logout Frame (680 Uncorrectable).................................................................................... D-43 CPU and System Correctable Machine Check Logout Frame ......... D-44 Environmental Error Logout Frame.................................................. D-45 Bit Definition of Logout Frame Registers .......................................... 
D-46 xv E–1 E–2 E–3 E–4 E–5 E–6 xvi Information Needed to Isolate Failing DIMMs.....................................E-2 Determining the Real Failed Array for 4-Way Interleaving.................E-3 Determining the Real Failed Array for 2-Way Interleaving.................E-3 Description of DPR Locations 80, 82, 84, and 86 ..................................E-4 Failing DIMM Lookup Table.................................................................E-6 Syndrome to Data Check Bits Table ...................................................E-19 Preface Intended Audience This manual is for service providers and self-maintenance customers who are responsible for servicing ES45 systems. Document Structure This manual uses a structured documentation design. Topics are organized into small sections, usually consisting of two facing pages. Most topics begin with an abstract that provides an overview of the section, followed by an illustration or example. The facing page contains descriptions, procedures, and syntax definitions. This manual has eight chapters and five appendixes. • Chapter 1, System Overview, provides system overview information. • Chapter 2, Troubleshooting, describes the starting points for diagnosing problems. • Chapter 3, Power-up Diagnostics and Display, describes the power-up process and RMC, SROM, and SRM power-up diagnostics. • Chapter 4, SRM Console Diagnostics, provides troubleshooting information with the SRM console. • Chapter 5, Error Logs, provides information for interpreting error logs. • Chapter 6, System Configuration and Setup, describes how to configure and set up systems. • Chapter 7, Using the Remote Management Console, provides information for managing the system through remote management console. • Chapter 8, FRU Removal and Replacement, describes the procedures for removing and replacing FRUs. xvii • Appendix A, SRM Console Commands, lists the SRM console commands most frequently used with the ES4x family of systems. 
• Appendix B, Jumpers and Switches, lists and describes the configuration jumpers and switches on the system motherboard and PCI board.
• Appendix C, DPR Address Layout, shows the address layout of the dual-port RAM.
• Appendix D, Registers, describes 21264 EV68 internal processor registers.
• Appendix E, Isolating Failing DIMMs, explains how to manually isolate a failing DIMM from the failing address and failing data bits.

Documentation Titles

Table 1 Compaq AlphaServer ES45 Documentation

Title                                      Order Number
User Documentation Kit                     QA–6NUAB–G8
  Owner’s Guide                            EK–ES450–UG
  Documentation CD (6 languages)           AG–RPJ8A–TS
Maintenance Kit                            QA–6NUAA–G8
  Service Guide                            EK–ES450–SV
  Service Guide HTML CD (IPBs included)    AG–RPJ5A–TS
Loose Piece Items
  Basic Installation Card                  EK–ES450–PD
  Rackmount Installation Guide             EK–ES450–RG
  Rackmount Installation Template          ES–ES450–TP

Information on the Internet

Visit the Compaq Web site at www.compaq.com for service tools and more information about the AlphaServer ES45 system.

Chapter 1 System Overview

This chapter provides an overview of the system in these sections:
• System Architecture
• System Enclosures
• System Chassis—Front View/Top View
• System Chassis—Rear View
• Hot Swap Module
• I/O Ports and Slots
• Control Panel
• System Motherboard
• CPU Card
• Memory Architecture and Options
• PCI Backplane
• Remote System Management Logic
• Power Supplies
• Fans
• Removable Media Storage
• Hard Disk Drive Storage
• System Access
• Console Terminal

1.1 System Architecture

The system uses a switch-based interconnect system that maintains constant performance even as the number of transactions multiplies.
Figure 1–1 System Block Diagram
(Blocks shown: CPUs with B-cache, C-chip, eight D-chips, two P-chips with two 64-bit PCI buses each, and memory arrays, connected by the CAPbus, PADBus, CPU data bus, and memory data bus.)

This system is designed to fully exploit the potential of the Alpha EV68 CB chip by using a switch-based (or point-to-point) interconnect system. With a traditional bus design, the processors, memory, and I/O modules share the bus. As the number of bus users increases, the transactions interfere with one another, increasing latency and decreasing aggregate bandwidth. With a switch-based system, speed is maintained and little degradation in performance occurs as the number of CPUs, memory, and I/O users increases.

The switched system interconnect uses a set of complex microprocessor support chips that route the traffic over multiple paths. This chipset consists of one C-chip, two P-chips, and eight D-chips.
• C-chip. Provides the command interface from the CPUs and main memory. The C-chip allows each CPU to do transactions simultaneously.
• D-chips. Provide the data path for the CPUs, main memory, and I/O.
• P-chips. Provide the interface to the I/O with two buses per chip.

The chipset supports up to four CPUs and up to 32 Gbytes of memory. Interleaving occurs when at least two sibling or nonsibling memory arrays are used. Two 256-bit memory buses support four memory arrays, yielding a maximum 8 Gbytes/sec system bandwidth.

Transactions are ECC protected. Upon the receipt of data, the receiver checks for data integrity and corrects any errors.

1.2 System Enclosures

The ES45 family consists of a standalone tower, a pedestal with expanded storage capacity, and a cabinet.
Figure 1–2 ES45 Systems
(Shows the cabinet, pedestal, and tower enclosures.)

The ES45 system provides connectors for eight DIMMs on each of the memory motherboards (MMBs) and connectors for ten PCI options on the PCI backplane. The system comes with the following:
• 1–4 CPUs
• Up to 32 DIMMs (8 DIMMs on each MMB)
• 10 PCI slots (four at 33 MHz and six at 66 MHz)

The following components are common to all ES45 systems:
• Up to four CPUs, based on the EV68 Alpha chip
• Memory DIMMs (200-pin)
• Floppy diskette drive (3.5-inch, high density)
• CD-ROM drive
• Two half-height or one full-height removable media bays
• Up to two storage drive cages that house up to six 1-inch drives per cage
• Up to three power supplies, offering N+1 power
• A 25-pin parallel port, two 9-pin serial ports, mouse and keyboard ports, and one MMJ connector for a local console terminal
• An operator control panel with a 16-character back-lit display and a Power button, Halt button, and Reset button

1.3 System Chassis—Front View/Top View

Figure 1–3 Components Top/Front View (Pedestal/Rackmount Orientation)
(The figure callouts identify the following components.)
• Operator control panel
• CD-ROM drive
• Removable media bays
• Floppy diskette drive
• Storage drive bays
• Fans
• CPUs
• Memory
• PCI cards

1.4 System Chassis—Rear View

Figure 1–4 Rear Components (Pedestal/Rackmount Orientation)
(The figure callouts identify the following components.)
• Power supplies
• PCI bulkhead
• I/O ports
• Power harness access cover
• Speaker

1.5 Hot Swap Module

WARNING: Modules have parts that operate at high temperatures. Wait 2 minutes after power is removed before touching any module.

WARNING: To prevent injury, unplug the power cord from each power supply before installing components.

CAUTION: Hot swap is not currently supported by the operating systems. Do not press the switches on the hot swap module while the system is powered. Pressing the switches may result in loss of data.
Figure 1–5 Hot Swap Module
(Shows the module in the open and closed positions. The figure callouts identify the following parts.)
• Module release button
• Momentary hot swap power switch (not supported)
• Communication connector
• Module release button connection

1.6 I/O Ports and Slots

Figure 1–6 Rear Connectors
(Shows the rear connectors in the pedestal/rack and tower orientations.)

Rear Panel Connections
• Modem port: Dedicated 9-pin port for connection by modem to remote management console.
• COM2 serial port: Extra port to modem or any serial device.
• Keyboard port: To PS/2-compatible keyboard.
• Mouse port: To PS/2-compatible mouse.
• COM1 MMJ-type serial port/terminal port: For connecting a console terminal.
• Parallel port: To parallel device such as a printer.
• SCSI breakouts.
• PCI slots: For option cards for high-performance network, video, or disk controllers.

1.7 Control Panel

The control panel provides system controls and status indicators. The controls are the Power, Halt, and Reset buttons. A 16-character back-lit alphanumeric display indicates system state. The panel has two LEDs: a green Power OK indicator and an amber Halt indicator.

Figure 1–7 Control Panel
(The figure callouts identify the following controls and indicators.)

Control panel display. A one-line, 16-character alphanumeric display that indicates system status during power-up and testing.

Power button. Powers the system on and off. If a failure occurs that causes the system to shut down, pressing the power button off and then on clears the shutdown condition and attempts to power the system back on. Conditions that prevent the system from powering on can be determined by entering the env command from the remote management console (RMC) command line. The RMC is powered separately from the rest of the system and can operate as long as one power supply is plugged in. (See Chapter 7.)

Power LED (green). Lights when the power button is depressed and system power passes initial checks.

Reset button.
A momentary contact switch that restarts the system and reinitializes the console firmware. Power-up messages are displayed, and then the console prompt is displayed or the operating system boot messages are displayed, depending on how the startup sequence has been defined.

Halt LED (amber). Lights when you press the Halt button.

Halt button. Halts the system.
• If the operating system is running, pressing the Halt button halts the operating system and returns to the SRM console.
• If the Halt button is latched when the system is reset or powered up, the system halts in the SRM console. Systems that are configured to autoboot cannot boot until the Halt button is unlatched.

Commands issued from the remote management console (RMC) can be used to reset, halt, and power the system on or off.

RMC Command        Function
Power {off, on}    Equivalent to pressing the Power button on the system. If the Power button is in the Off position, the RMC power on command has no effect.
Halt {in, out}     Equivalent to pressing the Halt button on the control panel to cause a halt (halt in) or releasing it from the latched position to deassert the halt (halt out).
Reset              Equivalent to pressing the Reset button on the control panel.

1.8 System Motherboard

The system motherboard is located on the floor of the system card cage. It has slots for the CPUs and memory motherboards (MMBs) and has the PCI backplane interconnect.

Figure 1–8 Component and Connector Locations
(Labels shown: connector to PCI backplane, FSL and RMC jumpers, CPU0 through CPU3 on connectors J15 through J18, MMB0 through MMB3 on connectors J31 through J34, Cterm (VTerm_chk), and Vterm (VTerm_data).)

The system motherboard has the majority of the logic for the system, including:
• CPU connectors
• MMB connectors
• Connector to PCI backplane
• RMC jumpers
• Fail-safe loader (FSL) jumpers
• Vterm and Cterm regulators

Figure 1–8 shows the location of components and connectors on the system motherboard.
1.9 CPU Card

An ES45 can have up to four CPU cards. The CPU card has an 8-Mbyte second-level cache and a DC-to-DC converter that provides the required voltage to the Alpha chip. Power-up diagnostics are stored in a flash SROM on the card.

Figure 1–9 CPU Card

The EV68 CB microprocessor is a superscalar CPU with out-of-order execution and speculative execution to maximize speed and performance. It contains four integer execution units and dedicated execution units for floating-point add, multiply, and divide. It has an instruction cache and a data cache on the chip. Each cache is a 64 KB, two-way set-associative, virtually addressed cache that has 64-byte blocks. The data cache is a physically tagged, write-back cache.

Each CPU card has an 8 MB secondary B-cache (backup cache) consisting of dual data rate (DDR) static RAMs (SRAMs) that provide low latency and high bandwidth. Each CPU card also has a DC-to-DC converter that provides the required voltage to the Alpha chip.

See Chapter 6 for CPU configuration.

1.10 Memory Architecture and Options

The system has two 256-bit wide memory data buses, which can move large amounts of data simultaneously.

Figure 1–10 Memory Architecture
(Shows MMB0 through MMB3, the C-chip address lines for arrays 0 and 2 and for arrays 1 and 3, and data bus 0 and data bus 1, each carrying 256 data bits plus 32 check bits to all eight D-chips.)

Memory Architecture

Memory throughput in this system is maximized by the following features:
• Two independent, wide memory data buses
• Very low memory latency (120 ns) and high bandwidth with 125 MHz clock
• ECC memory

Each data bus is 256 bits wide (32 bytes). The memory bus speed is 125 MHz. This yields 4 GB/sec bandwidth per bus (32 bytes x 125 MHz = 4 GB/sec). The maximum bandwidth is 8 GB/sec. The switch interconnect design takes full advantage of the capabilities of the two wide data buses.
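The per-bus and aggregate bandwidth figures quoted here follow from the bus width and clock rate alone. The short Python sketch below reproduces that arithmetic; it is only an illustration, the constant and function names are hypothetical, and it assumes one transfer per clock and decimal gigabytes, which is what the manual's 4 GB/sec figure implies.

```python
# Illustrative check of the memory bandwidth figures quoted in the text.
# Names are hypothetical; the numeric values come from the manual.

BUS_WIDTH_BITS = 256   # each memory data bus is 256 bits wide
BUS_CLOCK_MHZ = 125    # memory bus speed
NUM_DATA_BUSES = 2     # two independent memory data buses


def bus_bandwidth_gb_per_sec(width_bits: int, clock_mhz: int) -> float:
    """Return peak bandwidth in GB/sec, assuming one transfer per clock."""
    bytes_per_transfer = width_bits // 8         # 256 bits -> 32 bytes
    mb_per_sec = bytes_per_transfer * clock_mhz  # bytes x MHz = MB/sec
    return mb_per_sec / 1000                     # decimal GB (assumption)


per_bus_gb = bus_bandwidth_gb_per_sec(BUS_WIDTH_BITS, BUS_CLOCK_MHZ)
total_gb = per_bus_gb * NUM_DATA_BUSES

print(f"Per bus: {per_bus_gb} GB/sec")
print(f"Both buses: {total_gb} GB/sec")
```

These are peak figures; sustained throughput depends on how well accesses are spread across the two buses.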
The 256 data bits are distributed equally over two memory motherboards (MMBs). Simultaneously, in a read operation, 128 bits come from one MMB and the other 128 bits come from another MMB, to make one 256-bit read. Another 256-bit read operation can occur at the same time on the other independent data bus.

In addition, two address buses per MMB (one for each array) allow overlapping/pipelined accesses to maximize use of each data bus. When all arrays are identical (same size), the memory is interleaved; that is, sequential blocks of memory are distributed across all four arrays.

Memory Options

Each memory option consists of a set of four 125 MHz, 200-pin JEDEC-standard DIMMs with PECL clocks. The DIMMs are synchronous DRAMs. Memory options are available in the following sizes:
• 512 Mbytes (128 MB DIMMs)
• 1 Gbyte (256 MB DIMMs)
• 2 Gbytes (512 MB DIMMs)
• 4 Gbytes (1 GB DIMMs)

Memory options are installed into memory motherboards (MMBs) located on the system motherboard (see Figure 1–8). There are four MMBs. The MMBs have either four or eight slots for installing DIMMs.

See Chapter 6 for memory configuration.

1.11 PCI Backplane

The PCI backplane has four independent 64-bit PCI buses, one at 33 MHz and three at 66 MHz. The PCI buses support 3.3 volt and 5 volt options.

Figure 1–11 I/O Control Logic
(Shows the C-chip with interrupt and configuration logic and the two P-chips. P-chip 0: PCI 0, 33 MHz, 4 slots, 3 with hot plug, plus the Acer/Yukon HPC serving COM1, COM2, modem, printer, floppy, flash ROM (NVRAM functions), keyboard, mouse, and CD-ROM; and PCI 2, 66 MHz, 2 slots, no hot plug. P-chip 1: PCI 1 and PCI 3, 66 MHz, 2 slots each, with hot plug.)

PCI modules are either designed specifically for 5.0 or 3.3 volt slots, or are universal in design and can plug into either 3.3 or 5.0 volt slots.

CAUTION: PCI modules designed specifically for 5.0 volts or 3.3 volts are keyed differently. Check the keying before you install the PCI module and do not force it in.
Plugging a module into a wrong slot can damage it.

PCI Bus Implementation
• Is fully compliant with the PCI Version 2.2 Specification.
• Delivers a peak bandwidth of 1.8 GB/sec: over 533 Mbytes/sec for each 66 MHz bus and over 266 Mbytes/sec for the 33 MHz bus.
• Has ten option slots (seven of which are hot swap).
• Supports three address spaces: PCI I/O, PCI memory, and PCI configuration space.
• Supports byte/word, tri-byte, quadword, and longword operations.
• Exists in noncached address space only.

I/O Implementation

The system has 10 I/O slots, with six slots at 66 MHz and four slots at 33 MHz. The Acer Labs 1543C chip provides the bridge from PCI 0 to ISA. The C-chip controls accesses to memory on behalf of both P-chips.

I/O Ports

The I/O ports are shown in Section 1.6.

1.12 Remote System Management Logic

The remote system management logic consists of two major elements: the system power controller (SPC), used to monitor and control system power supplies, regulators, and cooling apparatus; and the remote management console (RMC), which facilitates remote interrogation and control of the system. The components used within the remote system management logic are powered by the AUX_5V supply, which is always present whenever AC input power is available to the system.

Figure 1–12 Remote System Management Logic Diagram
(Blocks shown: RMC PIC, SPC PIC, address latch, DUART with COM1 modem port, system COM1 UART, dual-port SRAM, bus isolator, RMC flash RAM, TIG, and SPC register array, powered from AUX5.)

Dual-Port RAM (DPR)

The ES45 system features a dual-port RAM, which is RAM that is shared between the RMC and the system motherboard logic, to ease communication between the system and the RMC. This book refers to the dual-port RAM as the DPR.

The RMC reads 256 bytes of data from each FRU EEPROM at power-up and stores it in the DPR.
This data contains configuration and possibly error log information. The data is accessible via the TIG chip to the firmware for configuration information during start-up. Remote or local applications can read the error log and configuration information. The error log information is written to the DPR by Compaq Analyze (see Chapter 5) and then written back to the EEPROMs by the RMC. This ensures that the error log is available on a FRU after power has been lost.
• Section 1.12.1 describes the SPC logic.
• Section 1.12.2 describes the RMC logic.

1.12.1 System Power Controller (SPC)

The system power controller (SPC) is responsible for sequencing the turn-on/turn-off of all power supplies and regulators, monitoring all system power supplies and regulators, generating hardware resets to all logic elements, and generating power system status signals for use by other functional units within the system. Additionally, it is responsible for emergency shutdown if the internal system temperature exceeds permissible limits.

An 8-bit CMOS microprocessor (PIC 17C44) with associated programming controls the functions of the SPC. The PIC processor receives inputs from:
• Operator control panel (power-on, reset)
• Power supplies and DC/DC regulators (Power-OK)
• Thermal sensors (temperature failure)
• TIG chip (command bus from the firmware)
• Remote management console logic (remote power up/down, reset)

It provides outputs to:
• Power supplies and DC/DC regulators (power supply enables)
• Processors (DC_OK, reset)
• TIG bus chip (handshake)
• Remote management console (power status)

1.12.2 Remote Management Console (RMC)

The remote management console (RMC) provides a mechanism for remotely monitoring a system and manipulating it on a very low level. It also provides access to the repository for all error information in the system.
This provides the operator, either remotely or locally, with the ability to monitor the system (voltages, temperature, fans, error status) and manipulate it (reset, power on/off, halt) without any interaction on the part of the operating system. The RMC can also detect alert conditions such as overtemperature, fan failure, and power supply failure and automatically dial a user-defined pager phone number or another computer system to make the remote operator aware of the alert condition.

The RMC logic is implemented using an 8-bit microprocessor (PIC 17C44) as the primary control device. Support devices include:
• Flash RAM (for code storage)
• Address latch
• Dual universal asynchronous receiver/transmitter (DUART)
• 8-bit I2C port expanders
• I2C temperature sensors
• I2C nonvolatile memories (NVRAM)
• Programmable array logic (PAL)
• Dual-port RAM (DPR)
• RS232 drivers and receivers

Chapter 7 describes the operation and use of the RMC.

1.13 Power Supplies

The power supplies provide power to components in the system box. The number of power supplies required depends on the system configuration.

Figure 1–13 Power Supplies
(Shows the locations of power supplies 0, 1, and 2 in the tower and pedestal/rack orientations.)

Two to three power supplies provide power to components in the system box. The system supports redundant power configurations to ensure continued system operation if a power supply fails. See Chapter 6 for power supply configurations.

The power supplies select line voltage automatically (100V to 240V and 50 Hz or 60 Hz).

Power Supply LEDs

Each power supply has two green LEDs that indicate the state of power to the system.

POK (Power OK): Indicates that the power supply is providing power. The POK LED is on when the system is running. When the system power is on and a POK LED is off, that supply is not contributing to powering the system.

+5 V Auxiliary: Indicates that AC power is flowing from the wall outlet.
As long as the power supply cord is plugged into the wall outlet, the +5V Aux LED is always on, even when the system power is off.

1.14 Fans

The system has six hot-plug fans that provide front-to-back airflow.

Figure 1–14 System Fans
(Callouts 1 through 6 identify the six fans.)

The system fans are shown in Figure 1–14 and described in Table 1–1.

Table 1–1 Fan Descriptions

Fan: Two 4.5-in. fans
Area cooled: PCI card cage, removable media, right drive cage
Fan failure scenario: Both fans are powered at all times. If one fan fails, all other system fans run at maximum speed to provide adequate cooling. You can replace either fan while the system is running.

Fan: Two 4.5-in. fans
Area cooled: Power supplies, left drive cage
Fan failure scenario: Both fans are powered at all times. If one fan fails, all other system fans run at maximum speed to provide adequate cooling. You can replace either fan while the system is running.

Fan: 4.5-in. redundant fan
Area cooled: CPU and memory card cage
Fan failure scenario: Not powered except during a fan failure.

Fan: 6.75-in. main fan
Area cooled: CPU and memory card cage
Fan failure scenario: Fan is powered at all times. If it fails, all other system fans run at maximum speed to provide adequate cooling. You can replace the fan while the system is running.

1.15 Removable Media Storage

The system box houses a CD-ROM drive and a high-density 3.5-inch floppy diskette drive and supports two additional 5.25-inch half-height drives or one additional full-height drive. The 5.25-inch half-height area has a divider that can be removed to mount one full-height 5.25-inch device.

Figure 1–15 Removable Media Drive Area

1.16 Hard Disk Drive Storage

The system chassis can house up to two storage disk cages. The storage subsystem supports “hot pluggable” universal hard disk drives that can be replaced while the storage backplane is powered and operating.

You can install six 1-inch universal hard drives in each storage disk cage. See Chapter 8 for information on replacing hard disk drives.
Figure 1–16 Hard Disk Storage Cage with Drives (Pedestal/Rack)

1.17 System Access

At the time of delivery, the system keys are taped inside the small front door that provides access to the operator control panel and removable media devices.

Figure 1–17 System Lock and Key
(Shows the lock locations on the tower and pedestal.)

Both the tower and pedestal systems have a small front door through which the control panel and removable media devices are accessible. At the time of delivery, the system keys are taped inside this door.

The tower front door has a lock that lets you secure access to the universal disk drives and to the rest of the system.

The pedestal has two front doors, both of which can be locked. The upper door secures the disk drives and access to the rest of the system, and the lower door secures the expanded storage disk cages.

1.18 Console Terminal

The console terminal can be a serial (character cell) terminal connected to the COM1 or COM2 port or a VGA monitor connected to a VGA adapter. A VGA monitor requires a keyboard and mouse.

Figure 1–18 Console Terminal Connections (Local)
(Shows a VT terminal connected to the tower and to the pedestal/rack system.)

Chapter 2 Troubleshooting

This chapter describes the starting points for diagnosing problems on ES45 systems. The chapter also provides information resources.
• Questions to Consider
• Diagnostic Tables
• Service Tools and Utilities
• Q-Vet Installation Verification
• Information Resources

2.1 Questions to Consider

Before troubleshooting any system problem, first check the site maintenance log for the system's service history. Be sure to ask the system manager the following questions:
• Has the system been used before, and did it work correctly?
• Have changes to hardware or updates to firmware or software been made to the system recently? If so, are the revision numbers compatible for the system? (Refer to the system release notes.)
• What is the current state of the system?

If the operating system is down, but you are able to access the SRM console, use the console environment diagnostic tools, including the OCP display, power-up display, and SRM commands.

If you are unable to access the SRM console, enter the RMC CLI and issue commands to determine the hardware status. See Chapter 7.

If the operating system has crashed and rebooted, the CCAT (Compaq Crash Analysis Tool), the Compaq Analyze service tools (to interpret error logs), the SRM crash command, and operating system exercisers can be used to diagnose system problems.

2.2 Diagnostic Tables

System problems can be classified into the following five categories. Using these categories, you can quickly determine a starting point for diagnosis and eliminate the unlikely sources of the problem.

1. Power problems: Table 2–1
2. No access to console mode: Table 2–2
3. Console-reported failures: Table 2–3
4. Boot problems: Table 2–4
5. Errors reported by the operating system: Table 2–5

WARNING: To prevent injury, access is limited to persons who have appropriate technical training and experience. Such persons are expected to understand the hazards of working within this equipment and take measures to minimize danger to themselves or others. These measures include:
1. Remove any jewelry that may conduct electricity.
2. If accessing the system card cage, power down the system and wait 2 minutes to allow components to cool.
3. Wear an anti-static wrist strap when handling internal components.

Table 2–1 Power Problems

Symptom: System does not power on.
Action:
• Check error messages on the OCP.
• Check that AC power is plugged in.
• Check that the ambient room temperature is within environmental specifications (10–35° C, 50–95° F).
• Check the Power setting on the control panel. Toggle the Power button to off, then back on to clear a remote power disable.
• Check that internal power supply cables are plugged in at the system motherboard.

Symptom: Power supply shuts down after a few seconds.
Action: The system may be powered off by one of the following:
  Loss of AC power
  RMC power off command
  System software
  Multiple fan failure
  Overtemperature condition
  Power supply failure (if N+1 configuration, multiple power supply failure)
  Faulty CPU (CPU DC/DC converter failure)
Action: If AC power is present, use the RMC env command to check environmental status.
Reference: Chapter 7
Action: Check jumper J5. If the system must be kept running, this jumper can be positioned to override an overtemperature condition.
Reference: Appendix B

Table 2–2 Problems Getting to Console Mode

Symptom: Power-up screen is not displayed at system console.
Action: Note any error beep codes and observe the OCP display for a failure detected during self-tests.
Reference: Chapter 3
Action: Check keyboard and monitor connections.
Reference: Chapter 1
Action: Press the Return key. If the system enters console mode, check that the console environment variable is set correctly. If the console terminal is a VGA monitor, the console variable should be set to graphics. If it is a serial terminal, the console environment variable should be set to serial.
Reference: Chapter 6
Action: If console is set to serial, the power-up screen is routed to the COM1 serial communication port or MMJ port and cannot be viewed from the VGA monitor. Try connecting a console terminal to the COM1 serial communication port. When using the COM1 port, set the console environment variable to serial.
Reference: Chapter 6
Action: Use RMC commands to determine status.
Reference: Chapter 7

Table 2–3 Problems Reported by the Console

Symptom: No SRM messages are displayed after the “jump to console” message.
Action: Console firmware is corrupted. Load new firmware with fail-safe loader.
Reference: Chapter 3

Symptom: The system attempts to boot from the floppy drive after a checksum error is reported.
Action: The system automatically reverts to the fail-safe loader to load new firmware.
If the fail-safe load does not work, replace the system motherboard.
Reference: Chapter 3 and Chapter 8

Symptom: Console program reports error:
• Error beep codes report an error at power-up.
  Action: Use the error beep codes and OCP messages to determine the error.
  Reference: Chapter 3
• Power-up screen includes error messages.
  Action: Examine the console event log (more el command).
  Reference: Chapter 4
• Power-up screen or console event log indicates problems with mass storage devices.
  Action: Check cables and seating of drives. Check power to an external storage box.
• Storage devices are missing from the show config display.
  Action: Check cables and seating of drives. Check power to an external storage box.
• PCI devices are missing from the show config display.
  Action: Check seating of modules.

Table 2–4 Boot Problems

Symptom: System cannot find boot device.
Action: Use the show config and show device commands to check the system configuration for the correct device parameters (node ID, device name, and so on).
Reference: Chapter 6
Action: Examine the auto_action, bootdef_dev, boot_osflags, and os_type environment variables. For network boots, make sure ei*0_protocols or ew*0_protocols is set to bootp for Tru64 UNIX or mop for OpenVMS.

Symptom: Device does not boot.
Action: For problems booting over a network, make sure ei*0_protocols or ew*0_protocols is set to bootp for Tru64 UNIX or mop for OpenVMS.
Reference: Chapter 6
Action: Run the test command to see if the boot device is operating.
Reference: Chapter 4

Table 2–5 Errors Reported by the Operating System

Symptom: System is hung, but SRM console is operating.
Action: Press the Halt button and enter the crash command to provide a crash dump file for analysis.
Reference: Chapter 4
• Refer to OpenVMS Alpha System Dump Analyzer Utility Manual for information on how to interpret OpenVMS crash dump files.
• Refer to the Guide to Kernel Debugging for information on using the UNIX Krash Utility.
Action: Use the SRM info command to display registers and data structures.
If the problem is intermittent, run the SRM test and sys_exer commands.

Symptom: System is hung and SRM console is not operating.
Action: Invoke the RMC CLI and enter the dump command to access DPR locations.
Reference: Chapter 7

Symptom: Operating system has crashed and rebooted.
Action: Examine the operating system error log files to isolate the problem. If the problem is intermittent, ensure that Compaq Analyze has been installed and is running in background mode (GUI does not have to be running) to determine the defective FRU.
Reference: Chapter 5

2.3 Service Tools and Utilities

This section lists some of the tools and utilities available for acceptance testing and diagnosis and gives recommendations for their use.

2.3.1 Error Handling/Logging Tools (Compaq Analyze)

The operating systems provide fault management error detection, handling, notification, and logging. The primary tool for error handling is Compaq Analyze, a fault analysis utility designed to analyze both single and multiple error/fault events. Compaq Analyze uses error/fault data sources other than the traditional binary error log. See Chapter 5.

2.3.2 Loopback Tests

Internal and external loopback tests are used to test the components on the I/O connector assembly (“junk I/O”) and to test Ethernet cards. The loopback tests are a subset of the SRM diagnostics. Use loopback tests to isolate problems with the COM2 serial port, the parallel port, and Ethernet controllers. See the test command in Chapter 4 for instructions on performing loopback tests.

2.3.3 SRM Console Commands

SRM console commands are used to set and examine environment variables and device parameters. For example, the show configuration and show device commands are used to examine the configuration, and the set envar and show envar commands are used to set and view environment variables.

SRM commands are also used to invoke ROM-based diagnostics and to run native exercisers. For example, the test and sys_exer commands are used to test the system.
See Chapter 6 for information on configuration-related console commands and environment variables. See Chapter 4 for information on running console exercisers. See Appendix A for a list of console commands used most often on ES45 systems.

2.3.4 Remote Management Console (RMC)

The remote management console (RMC) is used for managing the server either locally or remotely. It also plays a key role in error analysis by passing error log information to the dual-port RAM (DPR), which is shared between the RMC and the system motherboard logic, so that this information can be accessed by the system. RMC also controls the control panel display.

RMC has a command-line interface from which you can enter a few diagnostic commands. RMC can be accessed as long as the power cord for a working supply is plugged into the AC wall outlet and a console terminal is attached to the system. This feature ensures that you can gather information when the operating system is down and the SRM console is not accessible. See Chapter 7.

2.3.5 Crash Dumps

For fatal errors, the operating systems save the contents of memory to a crash dump file. This file can be used to determine why the system crashed. CCAT, the Compaq Crash Analysis Tool, is the primary tool for analyzing crash dumps on Alpha systems. CCAT compares the results of a crash dump with a set of rules. If the results match one or more rules, CCAT notifies the system user of the cause of the crash and provides information to avoid similar crashes in the future.

2.3.6 Revision and Configuration Management Tool (RCM)

RCM is a tool to assist with revision and configuration management for hardware, firmware, operating system, and software products. It collects configuration and revision data from a system and stores it. A report generator produces configuration, change, and comparison reports that are useful in finding revision incompatibilities. RCM also helps you verify service actions.
For example, if a new board was supposed to be installed, you can use RCM to verify that the installation was done.

RCM is accessible from the following Web site:
http://smsat-www.ilo.dec.com/products/rcm/service/index.htm

2.4 Q-Vet Installation Verification

CAUTION: Customers are not authorized to access, download, or use Q-Vet. Q-Vet is for use by Compaq engineers to verify the system installation. Misuse of Q-Vet may result in loss of customer data.

Q-Vet is the Qualification Verifier Exerciser Tool that is used by Compaq engineers to exercise systems under development. Compaq recommends running the latest Q-Vet released version to verify that hardware is installed correctly and is operational. Q-Vet does not verify specific operating system or layered product configurations.

The latest Q-Vet release, information, Release Notes, and documentation are located at http://chump2.mro.cpqcorp.net/qvet/.

If the system has been partitioned, Q-Vet must be installed and run separately on each partition to verify the complete system.

Compaq recommends that Compaq Analyze be installed on the operating system prior to running Q-Vet.

CAUTION: Do not install the Digital System Verification Software (DECVET) on the system; use Q-Vet instead.

Non-IVP Q-Vet scripts verify disk operation for some drives with "write enabled" techniques. These are intended for Engineering and Manufacturing Test.

Run ONLY IVP scripts on systems that contain customer data or any other items that must not be written over. See the Q-Vet Disk Testing Policy Notice on the Q-Vet Web site for details. All Q-Vet IVP scripts use Read Only and/or File I/O to test hard drives. Floppy and tape drives are always write tested and should have scratch media installed.

Q-Vet must be de-installed upon completion of system verification.

Swap or Pagefile Space

The system must have adequate swap space (on Tru64 UNIX) or pagefile space (on OpenVMS) for proper Q-Vet operation.
You can set this up either before or after Q-Vet installation.

During initialization, Q-Vet will display a message indicating the minimum amount of swap/pagefile needed, if it determines that the system does not have enough. You can then reconfigure the system. If you wish to address the swap/pagefile size before running Q-Vet, see the Swap/Pagefile Estimates on the Q-Vet Web site.

2.4.1 Installing Q-Vet

The procedures for installation of Q-Vet differ between operating systems. You must install Q-Vet on each partition in the system.

Install and run Q-Vet from the SYSTEM account on VMS and the root account on UNIX. Remember to install Q-Vet in each partition.

Tru64 UNIX

1. Make sure that there are no old Q-Vet or DECVET kits on the system by using the following command:
    setld -i | grep VET
   Note the names of any listed kits, such as OTKBASExxx etc., and remove the kits using qvet_uninstall if possible. Otherwise use the command
    setld -d kit1_name kit2_name kit3_name
2. Copy the kit tar file (QVET_Vxxx.tar) to your system.
3. Be sure that there is no directory named output. If there is, move to another directory or remove the output directory:
    rm -r output
4. Untar the kit with the command
    tar xvf QVET_Vxxx.tar
   Note: The case of the file name may be different depending upon how it was stored on the system. Also, you may need to enclose the file name in quotation marks if a semi-colon is used.
5. Install the kit with the command
    setld -l output
6. During the install, if you intend to use the GUI you must select the optional GUI subset (QVETXOSFxxx).
7. The Q-Vet installation will size your system for devices and memory. It also runs qvet_tune. You should answer 'y' to the questions that are asked about setting parameters. If you do not, you may have trouble running Q-Vet. After the installation completes, you should delete the output directory with rm -r output. You can also delete the kit tar file.
8. You must reboot the system before starting Q-Vet.
9. On reboot you can start the Q-Vet GUI via vet& or you can run the non-GUI (command-line) interface via vet -nw.

OpenVMS

1. Delete any QVETAXPxxx.A or QVETAXPxxx.EXE file from the current directory.
2. Copy the self-extracting kit image file (QVETAXPxxx.EXE) to the current directory.
3. It is highly recommended, but not required, that you purge the system disk before installing Q-Vet. This will free up space that may be needed for pagefile expansion during the AUTOGEN phase.
    $purge sys$sysdevice:[*...]*.*
4. Extract the kit saveset with the command
    $run QVETAXPxxx.EXE
   and verify that the kit saveset was extracted by checking for the "Successful decompression" message.
5. Use @sys$update:vmsinstal for the Q-Vet installation. The installation will size your system for devices and memory. You should choose all the default answers during the Q-Vet installation. This will verify the Q-Vet installation, tune the system, and reboot. During the install, if you do not intend to use the GUI, you can answer no to the question "Do you want to install Q-Vet with the DECwindows Motif interface?"
6. After the installation completes you should delete the QVETAXP0xx.A file and the QVETAXPxxx.EXE file.
7. On reboot you can start the Q-Vet GUI via $vet or the command interface via $vet/int=char.

2.4.2 Running Q-Vet

You must run Q-Vet on each partition in the system to verify the complete system.

Compaq recommends that you review the Special Notices and the Testing Notes section of the Release Notes located at http://chump2.mro.cpqcorp.net/qvet/ before running Q-Vet.

Follow the instructions listed for your operating system to run Q-Vet in each partition.

Tru64 UNIX

Graphical Interface
1. From the Main Menu, select IVP, Load Script and select Long IVP (the IVP tests will then load into the Q-Vet process window).
2. Click the Start All button to begin IVP testing.
Command-Line Interface
    > vet -nw
    Q-Vet_setup> execute .Ivp.scp
    Q-Vet_setup> start
Note that there is a "." in front of the script name, and that commands are case sensitive.

OpenVMS

Graphical Interface
1. From the Main Menu, select IVP, Load Script and select Long IVP (the IVP tests will then load into the Q-Vet process window).
2. Click the Start All button to begin IVP testing.

Command-Line Interface
    $ vet /int=char
    Q-Vet_setup> execute ivp.vms
    Q-Vet_setup> start
Note that commands are case sensitive.

NOTE: A short IVP script is provided for a simple verification of device setup. It is selectable from the GUI IVP menu, and the script is called .Ivp_short.scp (ivp_short.vms). This script will run for 15 minutes and then terminate with a Summary log. The short script may be run prior to the long IVP script if desired, but not in place of the long IVP script, which is the full IVP test.

The long IVP will run until the slowest device has completed one pass (typically 2 to 12 hours). This is called a Cycle of Testing.

2.4.3 Reviewing Results of the Q-Vet Run

After running Q-Vet, check the results of the run by reviewing the summary log.

If you follow the above steps, Q-Vet will run all exercisers until the slowest device has completed one full pass. Depending on the size of the system (number of CPUs and disks), this will typically take 2 to 12 hours. Q-Vet will then terminate testing and produce a summary log. The termination message will tell you the name and location of this file.

All exerciser processes can also be manually terminated with the Suspend and Terminate buttons (stop and terminate commands).

After all exercisers report “Idle,” the summary log is produced containing Q-Vet specific results and statuses.

A. If there are no Q-Vet errors, no system event appendages, and testing ran to the specified completion time, the following message will be displayed:
    “Q-Vet Tests Complete: Passed”
B. Otherwise, a message will indicate:
    “Additional information may be available from Compaq Analyze”

It is recommended that you run Compaq Analyze to review test results. The testing times (for use with Compaq Analyze) are printed to the Q-Vet run window and are available in the summary log.

2.4.4 De-Installing Q-Vet

The procedures for de-installation of Q-Vet differ between operating systems. You must de-install Q-Vet from each partition in the system. Failure to do so may result in the loss of customer data at a later date if Q-Vet is misused.

Follow the instructions listed under your operating system to de-install Q-Vet from a partition. The qvet_uninstall programs will remove the Q-Vet supplied tools and restore the original system tuning/configuration settings.

Tru64 UNIX

1. Stop, Terminate, and Exit from Q-Vet testing.
2. Execute the command qvet_uninstall. This will also restore the system configuration/tuning file sysconfigtab.
3. Note: log files are retained in /usr/field/tool_logs
4. Reboot the system. You must reboot in any case, even if Q-Vet is to be reinstalled.

OpenVMS

1. Stop, Terminate, and Exit from any Q-Vet testing.
2. Execute the command @sys$manager:qvet_uninstall. This will restore system tuning (modparams.dat) and the original UAF settings.
3. Note: log files are retained in sys$specific:[sysmgr.tool_logs]
4. Reboot the system. You must reboot in any case, even if Q-Vet is to be reinstalled.

2.5 Information Resources

Many information resources are available, including tools that can be downloaded from the Internet, firmware updates, a supported options list, and more.

2.5.1 Compaq Service Tools CD

The Compaq Service Tools CD-ROM enables field engineers to upgrade customer systems with the latest version of software when the customer does not have access to Compaq Web pages.
The Web site is:
http://caspian1.zko.dec.com/service_tools/

2.5.2 ES45 Service HTML Help File

The information contained in this guide, including the FRU procedures and illustrations, is available in HTML Help format as part of the Maintenance Kit. It can also be accessed from the Learning Utility and ProSIC Web sites.

2.5.3 Alpha Systems Firmware Updates

The firmware resides in the flash ROM on the system motherboard. You can obtain the latest system firmware from CD-ROM or over the network.

Quarterly Update Service

The Alpha Systems Firmware Update Kit CD-ROM is available by subscription.

Alpha Firmware Internet Access

• You can obtain Alpha firmware updates from the following Web site:
  http://ftp.digital.com/pub/Digital/Alpha/firmware/readme.html
  The README file describes the firmware directory structure and how to download and use the files.
• If you do not have a Web browser, you can download the files using anonymous ftp:
  http://gatekeeper.research.compaq.com/pub/Digital/Alpha/firmware/
  Individual Alpha system firmware releases that occur between releases of the firmware CD are located in the interim directory:
  http://gatekeeper.research.compaq.com/pub/Digital/Alpha/firmware/interim/

2.5.4 Fail-Safe Loader

The fail-safe loader (FSL) allows you to boot the firmware update utility in an attempt to repair corrupted console files that reside within the flash ROMs on the system motherboard. You can download the fail-safe loader from the Internet (using the firmware update URL) to create your own fail-safe loader diskettes. See Chapter 3 for information on forcing a fail-safe floppy load.

2.5.5 Software Patches

Software patches for the supported operating systems are available from:
http://www.compaq.com/alphaserver/

2.5.6 Learning Utility

The Learning Utility provides information about various technical topics.
http://learning1.americas.cpqcorp.net/mcsl-html/home.asp

2.5.7 Late-Breaking Technical Information

You can download up-to-date files and late-breaking technical information from the Internet. The information includes firmware updates, the latest configuration utilities, software patches, lists of supported options, and more.
http://www.compaq.com/alphaserver/es40/es40.html

2.5.8 Supported Options

A list of options supported on the system is available on the Internet:
http://www.compaq.com/alphaserver/es40/

Chapter 3 Power-Up Diagnostics and Display

This chapter describes the power-up process and RMC, SROM, and SRM power-up diagnostics. The following topics are covered:

• Overview of Power-Up Diagnostics
• System Power-Up Sequence
• Power-Up Displays
• Power-Up Error Messages
• Forcing a Fail-Safe Floppy Load
• Updating the RMC

3.1 Overview of Power-Up Diagnostics

The power-up process begins with the power-on of the power supplies. After the AC and DC power-up sequences are completed, the remote management console (RMC) reads EEROM information and deposits it into the DPR. The SROM minimally tests the CPUs, initializes and tests backup cache, and minimally tests memory. Finally, the SROM loads the SRM console program into memory and jumps to the first instruction in the console program.

There are three distinct sets of power-up diagnostics:

1. System power controller and remote management console diagnostics: These diagnostics check the power regulators, temperature, and fans. Failures are reported in the dual-port RAM (DPR) and on the OCP display. Certain failures may prevent the system from powering on.
2. Serial ROM (SROM) diagnostics: SROM tests check the basic functionality of the system and load the console code from the FEPROM on the system motherboard into system memory.
Failures during SROM tests are indicated by error beep codes and messages on the serial console terminal and the OCP.
3. Console firmware diagnostics: These tests are executed by the SRM console code. They test the core system, including boot path devices. Failures during these tests are reported to the console terminal through the power-up screen or console event log.

3.2 System Power-Up Sequence

The power-up sequence is described below and illustrated in Figure 3–1.

1. When the power cord is plugged into the wall outlet, 5V auxiliary AC voltage is enabled. The 5 V AUX LEDs on the power supplies are lit, and the system power controller and RMC are initialized.
2. Pressing the Power button on the control panel or subsequently issuing the power-on command from the RMC turns on power to the power supplies, CPU converters, and VTERM regulators. The POK LEDs on the power supplies are lit and the power supplies are tested. If all power supplies are bad, power-up stops. All DC/DC converters and regulators are then tested. If any converter or regulator is bad, power-up stops.
3. CPU_DCOK and SYS_DC_OK are set to “true,” which means that DC power on the CPUs and system is okay. All CPUs load the initial Y divisor (clock multiplier). The OCP power LED is lit.
4. SYS_RESET is set to “false.” This setting releases the system motherboard logic and PCI backplane logic from the Reset state.
5. The primary CPU is selected and CPU_(P)_RESET is set to “false.” This allows the primary CPU to attempt to load flash SROM code.
6. If the primary CPU is good, it loads flash SROM. If it is bad, the system tries the next available CPU, and if that CPU is good, it becomes the primary. The remaining CPUs load flash SROM. The SROM power-up then continues, as described in Section 3.3.
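The stop conditions in steps 1 through 6 can be summarized in a short sketch. This is only an illustration of the documented decision points, not firmware code: the PowerState class, the power_up function, and the returned strings are hypothetical names invented for this example, and the behavior when no CPU is good is an assumption, since the steps above do not state it.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PowerState:
    """Hypothetical snapshot of the conditions checked during power-up."""
    ac_present: bool
    power_requested: bool        # Power button pressed or RMC power-on issued
    supplies_ok: List[bool]      # one entry per power supply
    converters_ok: List[bool]    # DC/DC converters and regulators
    cpus_ok: List[bool]          # index 0 is CPU0, and so on


def power_up(state: PowerState) -> str:
    """Walk the documented steps and report where the sequence ends."""
    if not state.ac_present:
        return "no AC power"
    # Step 1: 5 V AUX is on; power controller and RMC are initialized.
    if not state.power_requested:
        return "standby: 5 V AUX only"
    # Step 2: power-up stops only if ALL power supplies are bad ...
    if not any(state.supplies_ok):
        return "stopped: all power supplies bad"
    # ... but it stops if ANY converter or regulator is bad.
    if not all(state.converters_ok):
        return "stopped: converter or regulator bad"
    # Steps 3-6: resets are released and the first good CPU becomes primary.
    for number, good in enumerate(state.cpus_ok):
        if good:
            return f"SROM power-up continues, primary is CPU{number}"
    # Assumption: the text does not say what happens if every CPU is bad.
    return "stopped: no good CPU"


# Example: one bad supply is tolerated, and CPU1 takes over from a bad CPU0.
print(power_up(PowerState(True, True, [True, False], [True, True], [False, True])))
```

Note the asymmetry the sketch makes explicit: a single good power supply is enough to continue, while a single bad converter or regulator is enough to stop.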
Figure 3–1 Power-Up Sequence
[Flowchart boxes, in order: Apply AC power; 5 V AUX LEDs on PS are lit; OCP Power button = IN; Turn on power supplies; Turn on CPU converters; Turn on VTERM regulators; Set all CPU_DCOK = True; Set SYS_DC_OK = True; Set SYS_RESET = False; Set CPU(n)_RESET = False; CPU = "Alive"? (No / Yes); Disable CPU; All CPUs reload initial Y divisor; Continue SROM power-up]

Figure 3–1 Power-Up Sequence (Continued)
[Flowchart boxes, in order: SROM Power-Up; Init EV68; Test PCI; Determine Config (Bad / Good); Reload Using Flash SROM; Init EV68; Test PCI; Release CPUs; B-Cache Tests; Memory Config and Tests; Load SRM]

3.3 Power-Up Displays

Power-up information is displayed on the operator control panel and on the console terminal startup screen. Messages sent from the RMC and SROM programs are displayed first, followed by messages from the SRM console.

3.3.1 SROM Power-Up Display

The following example describes the SROM power-up sequence and shows the SROM power-up messages and corresponding OCP messages.

Example 3–1 Sample SROM Power-Up Display

    SROM V2.15 CPU # 00 @ 1000 MHz
    SROM program starting
    Reloading SROM
    SROM V2.15 CPU # 00 @ 1000 MHz
    System Bus Speed @ 0125 MHz
    SROM program starting
    PCI66 bus speed check
    Reloading SROM
    SROM V2.15 CPU # 00 @ 1000 MHz
    System Bus Speed @ 0125 MHz
    SROM program starting
    PCI66 bus speed check
    Starting secondary on CPU #1
    Starting secondary on CPU #2
    Starting secondary on CPU #3
    Bcache data tests in progress
    Bcache address test in progress
    CPU parity and ECC detection in progress

Corresponding OCP messages: PCI Test, Power on, RelCPU, BC Data

SROM Power-Up Sequence

When the system powers up, the SROM code is loaded into the I-cache (instruction cache) on the first available CPU, which becomes the primary CPU. The order of precedence is CPU0, CPU1, and so on. The primary CPU attempts to access the PCI bus.
If it cannot, either a hang or a failure occurs, and this is the only message displayed.

The primary CPU interrogates the I2C EEROM on the system board and CPU modules through shared RAM. The primary CPU determines the CPU and system configuration to jump to.

The primary CPU next checks the SROM checksum to determine the validity of the flash SROM sectors.

If flash SROM is invalid, the primary CPU reports the error and continues the execution of the SROM code. Invalid flash SROM must be reprogrammed.

If flash SROM is good, the primary CPU programs appropriate registers with the values from the flash data and selects itself as the target CPU to be loaded.

The primary CPU (usually CPU0) initializes and tests the B-cache and memory, then loads the flash SROM code to the next CPU. That CPU then initializes the EV68 chip and marks itself as the secondary CPU. Once the primary CPU sees the secondary, it loads the flash SROM code to the next CPU until all remaining CPUs are loaded.

The flash SROM performs B-cache tests. For example, the ECC data test verifies the detection logic for single- and double-bit errors.
Example 3–1 Sample SROM Power-Up Display (Continued)

    Bcache ECC data tests in progress
    Bcache TAG lines tests in progress
    Memory sizing in progress
    Memory configuration in progress
    Testing AAR3
    Memory data test in progress
    Memory address test in progress
    Memory pattern test in progress
    Testing AAR2
    Memory data test in progress
    Memory address test in progress
    Memory pattern test in progress
    Testing AAR1
    Memory data test in progress
    Memory address test in progress
    Memory pattern test in progress
    Testing AAR0
    Memory data test in progress
    Memory address test in progress
    Memory pattern test in progress
    Memory thrashing test in progress
    Memory initialization
    Loading console
    Code execution complete (transfer control)

Corresponding OCP messages: Size Mem, Load ROM, Jump to Console

NOTE: The power-up text that is displayed on the screen depends on what kind of terminal is connected as the console terminal: VT or VGA. If the SRM console environment variable is set to serial, the entire power-up display, consisting of the SROM and SRM power-up messages, is displayed on the VT terminal screen. If console is set to graphics, no SROM messages are displayed, and the SRM messages are delayed until VGA initialization has been completed.

SROM Power-Up Sequence

The primary CPU initiates all memory tests. The memory is tested for address and data errors for the first 32 MB of memory in each array. It also initializes all the “sized” memory in the system.

If a memory failure occurs, an error is reported. An untested memory array is assigned to address 0 and the failed memory array is de-assigned. The memory tests are rerun on the first 32 MB of memory in the remaining arrays. If all memory fails, the “No Memory Available” message is reported and the system halts.

If all memory passes, the primary CPU loads the console and transfers control to it.
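The memory fallback behavior described above can be pictured with a minimal sketch. It is a simplification under stated assumptions: size_and_test and its passes callback are hypothetical names, the callback stands in for the SROM data, address, and pattern tests, and the sketch does not model how an array is remapped to address 0.

```python
from typing import Callable, List, Tuple

TEST_WINDOW_MB = 32  # only the first 32 MB of each array is tested


def size_and_test(
    arrays: List[int],
    passes: Callable[[int, int], bool],
) -> Tuple[List[int], List[int]]:
    """Return (usable arrays, de-assigned arrays).

    `passes(array, megabytes)` is a hypothetical stand-in for the SROM
    memory tests run on the first `megabytes` of the given array.
    """
    usable: List[int] = []
    deassigned: List[int] = []
    for array in arrays:
        if passes(array, TEST_WINDOW_MB):
            usable.append(array)
        else:
            # An error is reported and the failed array is de-assigned.
            deassigned.append(array)
    if not usable:
        # All memory failed: the system reports this message and halts.
        raise RuntimeError("No Memory Available")
    return usable, deassigned


# Example: arrays are tested in the order shown in Example 3-1 (AAR3 to AAR0),
# with array 2 failing its test in this made-up run.
good, bad = size_and_test([3, 2, 1, 0], lambda array, mb: array != 2)
print(good, bad)
```

The point of the sketch is that a single failed array does not stop power-up; only the case where no array passes ends with the “No Memory Available” halt.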
3.3.2 SRM Console Power-Up Display

When SROM power-up is complete, the primary CPU transfers control to the SRM console program. The console program continues the system initialization. Failures are reported to the console terminal through the power-up screen and a console event log.

The following section shows the messages that are displayed once the SROM has transferred control to the SRM console.

Example 3–2 SRM Power-Up Display

    OpenVMS PALcode V1.88-28, Tru64 UNIX PALcode V1.83-24
    starting console on CPU 0
    initialized idle PCB
    initializing semaphores
    initializing heap
    initial heap 240c0
    memory low limit = 1e6000
    heap = 240c0, 17fc0
    initializing driver structures
    initializing idle process PID
    initializing file system
    initializing timer data structures
    lowering IPL
    CPU 0 speed is 1000 MHz
    create dead_eater
    create poll
    create timer
    create powerup
    access NVRAM
    4096 MB of System Memory
    Testing Memory
    ...
    probe I/O subsystem

Example 3–2 SRM Power-Up Display (Continued)

    Hose 0 - PCI bus running at 33Mhz
    entering idle loop
    probing hose 0, PCI
    probing PCI-to-ISA bridge, bus 1
    probing PCI-to-PCI bridge, bus 2
    bus 0, slot 8 -- pka -- NCR 53C895
    bus 0, slot 9 -- eia -- DE600-AA
    bus 2, slot 0 -- pkb -- NCR 53C875
    bus 2, slot 1 -- pkc -- NCR 53C875
    bus 2, slot 2 -- ewa -- DE500-AA Network Controller
    bus 0, slot 16 -- dqa -- Acer Labs M1543C IDE
    bus 0, slot 16 -- dqb -- Acer Labs M1543C IDE
    Hose 1 - PCI bus running at 66Mhz
    probing hose 1, PCI
    bus 0, slot 2 -- vga -- 3Dlabs OXYGEN VX1
    Hose 2 - AGP bus
    probing hose 2, PCI
    Hose 3 - PCI bus running at 33Mhz
    probing hose 3, PCI
    probing PCI-to-PCI bridge, bus 2
    bus 2, slot 4 -- eib -- DE602-AA
    bus 2, slot 5 -- eic -- DE602-AA
    bus 2, slot 6 -- eid -- DE602-FA
    bus 0, slot 2 -- fwa -- DEFPA
    starting drivers

SRM Power-Up Sequence

The primary CPU prints a message indicating that it is running the console.
Starting with this message, the power-up display is sent to any console terminal, regardless of the state of the console environment variable. If console is set to graphics, the display from this point on is saved in a memory buffer and displayed on the VGA monitor after the PCI buses are sized and the VGA device is initialized. The memory size is determined and memory is tested. The I/O subsystem is probed and I/O devices are reported. I/O adapters are configured. Device drivers started. Power-Up Diagnostics and Display 3-11 Example 3–2 SRM Power-Up Display (Continued) initializing keyboard starting console on CPU 1 initialized idle PCB initializing idle process PID lowering IPL CPU 1 speed is 1000 MHz create powerup entering idle loop starting console on CPU 2 initialized idle PCB initializing idle process PID lowering IPL CPU 2 speed is 1000 MHz create powerup starting console on CPU 3 initialized idle PCB initializing idle process PID lowering IPL CPU 3 speed is 1000 MHz create powerup initializing GCT/FRU at 220000 initializing pka pkb pkc ewa fwa dqa dqb eia eia0: link up : Negotiated 100Basx eib eic eid Memory Testing and Configuration Status Array Size Base Address Intlv Mode --------- ---------- ---------------- ---------0 4096Mb 0000000000000000 2-Way 1 1024Mb 0000000200000000 2-Way 2 4096Mb 0000000100000000 2-Way 3 1024Mb 0000000240000000 2-Way 10240 MB of System Memory AlphaServer ES45 Console V5.9-9, built on June 2001 at 17:09:49 The console is started on the secondary CPUs. The example shows a fourprocessor system. Various diagnostics are performed. The console terminal displays the SRM console banner and the prompt, Pnn>>>. The number n indicates the primary processor. In a multiprocessor system, the prompt could be P00>>>, P01>>>, P02>>>, or P03>>>. From the SRM prompt, you can boot the operating system. NOTE: If the console requires the heap to be expanded, it restarts. 
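
When reading a captured power-up log, it can help to cross-check the "Memory Testing and Configuration Status" table against the reported system memory total. The Python sketch below is one possible way to do that, under stated assumptions: the rows are copied from Example 3–2, "Mb" is taken to mean megabytes, and the function names (parse_memory_table, total_mb, is_contiguous) are made up for illustration. The contiguity check reflects what this particular example shows, not a rule stated in the guide.

```python
import re

# Rows copied from the memory table in Example 3-2.
TABLE = """\
0 4096Mb 0000000000000000 2-Way
1 1024Mb 0000000200000000 2-Way
2 4096Mb 0000000100000000 2-Way
3 1024Mb 0000000240000000 2-Way
"""

ROW = re.compile(r"^\s*(\d+)\s+(\d+)Mb\s+([0-9A-Fa-f]{16})\s+(\S+)\s*$")
MB = 1024 * 1024  # assumption: "Mb" in the display means megabytes


def parse_memory_table(text):
    """Turn each table row into a dict; lines that do not match are skipped."""
    rows = []
    for line in text.splitlines():
        match = ROW.match(line)
        if match:
            rows.append({
                "array": int(match.group(1)),
                "size_mb": int(match.group(2)),
                "base": int(match.group(3), 16),
                "interleave": match.group(4),
            })
    return rows


def total_mb(rows):
    return sum(row["size_mb"] for row in rows)


def is_contiguous(rows):
    """True if, sorted by base address, each array starts where the previous one ends."""
    expected = 0
    for row in sorted(rows, key=lambda r: r["base"]):
        if row["base"] != expected:
            return False
        expected = row["base"] + row["size_mb"] * MB
    return True


if __name__ == "__main__":
    rows = parse_memory_table(TABLE)
    print(total_mb(rows))       # 10240, matching "10240 MB of System Memory"
    print(is_contiguous(rows))  # True for this example
```

Note that the arrays are not laid out in array-number order: array 2 sits at the second base address and array 1 at the third, so sorting by base address is needed before checking the layout.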
3.3.3 SRM Console Event Log

The SRM console event log helps you troubleshoot problems that do not prevent the system from coming up to the SRM console.

The console event log consists of status messages received during power-up self-tests.

Example 3–3 Sample Console Event Log

    >>> more el
    *** Error - CPU 1 failed powerup diagnostics ***
    Secondary start error
    EV6 BIST = 1
    STR status = 1
    CSC status = 1
    PChip0 status = 1
    PChip1 status = 1
    DIMx status = 0
    TIG Bus status = 1
    DPR status = 0
    CPU speed status = 0
    CPU speed = 0
    Powerup time = 00-00-00 00:00:00
    CPU SROM sync = 0
    *** Error - Fan 1 failed ***
    *** Error - Fan 2 failed ***

If problems occur during power-up, error messages indicated by asterisks (***) may be embedded in the console event log. To display the console event log one screen at a time, use the more el command.

Example 3–3 shows a console event log that contains errors. The console reported that CPU 1 did not power up and fans 1 and 2 failed.

3.4 Power-Up Error Messages

Error messages at power-up may be displayed by the RMC, SROM, and SRM. A few SROM messages are announced by beep codes.

3.4.1 SROM Messages with Beep Codes

Table 3–1 Error Beep Codes

Beep Code   Associated Messages    Meaning
1           Jump to Console        SROM code has completed execution. System
                                   jumps to SRM console. SRM messages should
                                   start to be displayed. If no SRM messages
                                   are displayed, it may indicate corrupted
                                   firmware. See Section 3.4.2.
1-1-4       ROM err                The ROM err message is displayed briefly,
                                   then a single beep is emitted, and Jump to
                                   Console is displayed. The SROM code is
                                   unable to load the console code; a flash
                                   ROM header area or checksum error has been
                                   detected. See Section 3.4.2.
2-1-2       Cfg ERR n              Configuration error on CPU n (n is 0, 1, 2,
            Cfg ERR s              or 3) or a system configuration error. The
                                   system will still power up.
Table 3–1 Error Beep Codes (Continued)

Beep Code   Associated Messages    Meaning
1-2-4       BC error               Backup cache (B-cache) error. Indicates a
            CPU error              bad CPU.
            BC bad
1-3-3       No mem                 No usable memory detected. Some memory
                                   DIMMs may not be properly seated or some
                                   DIMM sets may be faulty. See Section 3.4.3.

A few SROM error messages that appear on the operator control panel are announced by audible error beep codes, as indicated in Table 3–1. For example, a 1-1-4 beep code consists of one beep, a pause (indicated by the hyphen), one beep, a pause, and a burst of four beeps. This beep code is accompanied by the message “ROM err.”

Related messages are also displayed on the console terminal if the console device is connected to the serial line and the SRM console environment variable is set to serial.

3.4.2 Checksum Error

If no messages are displayed on the operator control panel after the Jump to Console message, the console firmware is corrupted. When the system detects the error, it attempts to load a utility called the fail-safe loader (FSL) so that you can load new console firmware images. A sequence similar to the one in Example 3–4 occurs.

Example 3–4 Checksum Error and Fail-Safe Load

    Loading console
    Console ROM checksum error
    Expect: 00000000.000000FE
    Actual: 00000000.000000FF
    XORval: 00000000.00000001
    Loading program from floppy
    Code execution complete (transfer control)
    OpenVMS PALcode V1.91-33, Tru64 UNIX PALcode V1.87-27
    starting console on CPU 0
    .
    .
    .
    starting drivers
    entering idle loop
    P00>>> boot update_cd

The sequence shown in Example 3–4 is as follows:

The system detects the checksum error and writes a message to the console screen.

The system attempts to automatically load the FSL program from the floppy drive.

As the FSL program is initialized, messages similar to the console power-up messages are displayed. This example shows the beginning and ending messages.
At the P00>>> console prompt, boot the Loadable Firmware Update Utility (LFU) from the Alpha Systems Firmware CD (shown in the example as the variable update_cd).

NOTE: For more information on the LFU, see the Firmware Updates Web site: http://ftp.digital.com/pub/digital/Alpha/firmware/

3.4.3 No MEM Error

If the SROM code cannot find any usable memory, a 1-3-3 beep code is issued (one beep, a pause, a burst of three beeps, a pause, and another burst of three beeps), and the message “No MEM” is displayed on the OCP. The system does not come up to the console program. This error indicates missing or bad DIMMs.

The OCP and console terminal display text similar to the following:

    Failed MMB-2 J9
    Failed MMB-2 J5
    Failed MMB-0 J9
    Failed MMB-0 J5
    Incmpat MMB-3 J9
    Incmpat MMB-3 J5
    Incmpat MMB-1 J9
    Incmpat MMB-1 J5
    Missing MMB-2 J6
    Incmpat MMB-2 J2
    Illegal MMB-0 J6
    Incmpat MMB-0 J2
    No usable memory detected

Indicates failed DIMMs. M identifies the MMB; D identifies the DIMM. In this line, DIMM 2 on MMB1 failed.

Indicates that some DIMMs in this array are mismatched. All DIMMs in the affected array are marked as incompatible (incmpat).

Indicates that a DIMM in this array is missing. All missing DIMMs in the affected array are marked as missing.

Indicates that the DIMM data for this array is unreadable. All unreadable DIMMs in the affected array are marked as illegal.

See Chapter 6 for memory configuration rules.

3.4.4 RMC Error Messages

Table 3–2 lists the fatal error messages that could potentially be displayed on the OCP by the remote management console during power-up. Most fatal error messages prevent the system from completing power-up. The warning messages listed in Table 3–3 require prompt attention but might not prevent the system from completing power-up or booting the operating system.

NOTE: The VTERM and CTERM regulators referenced in Table 3–2 are located on the system motherboard.
The “CPUn failed” message does not necessarily prevent the completion of power-up. If the system finds a good CPU, it continues the power-up process.

Table 3–2 RMC Fatal Error Messages

Message             Meaning
AC loss             No AC power to the system.
CPUn failed         CPU failed. “n” is 0, 1, 2, or 3.
VTERM failed        No VTERM voltage to CPUs.
CTERM failed        No CTERM voltage to CPUs.
Fan5, 6 failed      Main fan (6) and redundant fan (5) failed.
OverTemp failure    System temperature has passed the high threshold.
No CPU in slot 0    Configuration requires that a CPU be installed in
                    slot 0.
CPU door opened     System card cage cover off. Reinstall cover.
TIG error           Code essential to system operation is not loaded
                    and/or running or TIG flash is corrupt.
Mixed CPU types     Different types of CPU are installed. Configuration
                    requires that all CPUs be the same type.
Bad CPU ROM data    Invalid data in EEROM on the CPU.
2.5V bulk failed    2.5V regulator failed.
AGP config error    Power consumption requirement for AGP failed.

Table 3–3 RMC Warning Messages

Message             Meaning
PSn failed          Power supply failed. “n” is 0, 1, or 2.
OverTemp Warning    System temperature is near the high threshold.
Fann failed         Fan failed. “n” is 0 through 6.
PCI door opened     Cover to PCI card cage is off. Reinstall cover.
Fan door opened     Cover to main fan area (fans 5 and 6) is off.
                    Reinstall cover.
3.3V bulk warn      Power supply voltage over or under threshold.
5V bulk warn        Power supply voltage over or under threshold.
12V bulk warn       Power supply voltage over or under threshold.
–12V bulk warn      Power supply voltage over or under threshold.
VTERM warn          Voltage regulator over or under threshold.
CTERM warn          Voltage regulator over or under threshold.
CPUn VCORE warn     CPU core voltage over or under threshold. “n” is
                    0, 1, 2, or 3.
CPUn VIO warn       I/O voltage on CPU over or under threshold. “n” is
                    0, 1, 2, or 3.
CPUn VCACHE warn    Cache voltage on CPU over or under threshold. “n”
                    is 0, 1, 2, or 3.
1.5V bulk warn      Power supply voltage over or under threshold. For
                    AGP backplane only.
2.5V bulk warn      Power supply voltage over or under threshold.

3.4.5 SROM Error Messages

The SROM power-up identifies errors that may or may not prevent the system from coming up to the console. It is possible that these errors may prevent the system from successfully booting the operating system.

Errors encountered during SROM power-up are displayed on the OCP. Some errors are also displayed on the console terminal screen if the console output is set to serial.

Table 3–4 lists the SROM error messages. The code numbers shown in the Code column are displayed in place of OCP or SROM messages if the SROM flash is invalid.

Table 3–4 SROM Error Messages

Code   SROM Message                           OCP Message
FD     PCI data path error                    PCI Err
FA     No usable memory detected              No Mem
EF     Bcache data lines test error           BC Error
EE     Bcache data march test error           BC Error
ED     Bcache address test error              BC Error
EC     CPU parity detection error             CPU Err
EB     CPU ECC detection error                CPU Err
EA     Bcache ECC data lines test error       BC Error
E9     Bcache ECC data march test error       BC Error
E8     Bcache TAG lines test error            BC Error
E7     Bcache TAG march test error            BC Error
E6     Console ROM checksum error             ROM Err
E5     Floppy driver error                    Flpy Err
E4     No real-time clock (TOY)               TOY Err
E3     Memory data path error                 Mem Err
E2     Memory address line error              Mem Err
E1     Memory pattern error                   Mem Err
E0     Memory pattern ECC error               Mem Err
7F     Configuration error on CPU #3          CfgERR 3
7E     Configuration error on CPU #2          CfgERR 2
7D     Configuration error on CPU #1          CfgERR 1
7C     Configuration error on CPU #0          CfgERR 0
7B     Bcache failed on CPU #3 error          BC Bad 3
7A     Bcache failed on CPU #2 error          BC Bad 2
79     Bcache failed on CPU #1 error          BC Bad 1
78     Bcache failed on CPU #0 error          BC Bad 0
77     Memory thrash error on CPU #3          MtrERR 3
76     Memory thrash error on CPU #2          MtrERR 2
75     Memory thrash error on CPU #1          MtrERR 1
74     Memory thrash error on CPU #0          MtrERR 0
73     Starting secondary on CPU #3 error     RCPU 3 E
72     Starting secondary on CPU #2 error     RCPU 2 E
71     Starting secondary on CPU #1 error     RCPU 1 E
70     Starting secondary on CPU #0 error     RCPU 0 E
6F     Configuration error with system        CfgERR S

3.5 Forcing a Fail-Safe Floppy Load

Under some circumstances, you may need to force the activation of the FSL. For example, if you install a system motherboard that has an older version of the firmware than your system requires, you may not be able to bring up the SRM console. In that case you need to force a floppy load so that you can update the SRM firmware.

Figure 3–2 Function Jumpers (FSL)
[Figure: function jumpers J19 through J24 (pins 1, 2, and 3 on each; the FSL jumper is marked) and the SW2 configuration switch (positions 1 through 10, ON/OFF) on the system motherboard.]

1. Turn off the system. Unplug the power cord from each power supply and wait for the 5V AUX indicators to extinguish.
2. Remove enclosure covers (tower and pedestal) or the front bezel (rackmount) to access the system chassis. See Chapter 8 for illustrations.
3. Remove the fan cover and the system card cage cover to gain access to the system motherboard. See Chapter 8 for illustrations.
4. Remove CPU 2 (closest to the PCI backplane) so that you can access the function jumpers.
5. Locate the J22 function jumper on the system motherboard. See Figure 3–2.
6. Enable the fail-safe loader by moving the J22 jumper from pins 1 and 2 to pins 2 and 3.
   NOTE: The J21 and J23 function jumpers must be in their default positions over pins 1 and 2.
7. Replace the chassis covers and enclosure covers. Plug in the power supplies.
8. Insert the LFU diskette into the floppy drive, and insert the update CD into the CD-ROM drive.
9. Power up the system and check the control panel display for progress messages.
10. At the P00>>> prompt, boot the update CD. Enter update at the UPD> prompt and press Return. Enter yes at the “Confirm update” prompt.
11. After the update is complete, turn off the system and unplug the power supplies.
12. Place J22 over pins 1 and 2.
13. Replace CPU 2.
14. Replace the chassis covers and enclosure covers, plug in the power supplies, and power up the system.

NOTE: For more information on the LFU, see the Firmware Updates Web site: http://ftp.digital.com/pub/digital/Alpha/firmware/

3.6 Updating the RMC

Under certain circumstances, the RMC will not function. If the problem is caused by corrupted RMC flash ROM, you need to update RMC firmware.

The RMC will not function if:
• No AC power is provided to any of the power supplies.
• DPR does not pass its self-test (DPR is corrupted).
• RMC flash ROM is corrupted.

If the RMC is not working, the control panel displays the following message:

    Bad RMC flash

The SRM console also sends a message to the terminal screen:

    *** Error - RMC detected power up error - RMC Flash corrupted ***

NOTE: If the RMC is corrupt, you may not get an output from Com 1 (MMJ). If this occurs, remove J4 and move the serial line to the modem port.

You can update the remote management console firmware from flash ROM using the LFU.

1. Load the update medium.
2. At the UPD> prompt, exit from the update utility, and answer y to the manual update prompt. Enter update RMC to update the firmware.

    UPD> exit
    Do you want to do a manual update [y/(n)] y

    ***** Loadable Firmware Update Utility *****
    -------------------------------------------------------------
    Function     Description
    -------------------------------------------------------------
    Display      Displays the system's configuration table.
    Exit         Done exit LFU (reset).
    List         Lists the device, revision, firmware name, and update revision.
    Readme       Lists important release information.
    Update       Replaces current firmware with loadable data image.
    Verify       Compares loadable and hardware images.
    ? or Help    Scrolls this function table.
    -------------------------------------------------------------
    UPD> update RMC
    .
    .
    .

NOTE: For more information on the LFU, see the Firmware Updates Web site: http://ftp.digital.com/pub/digital/Alpha/firmware/

Chapter 4 SRM Console Diagnostics

This chapter describes troubleshooting with the SRM console.

The SRM console firmware contains ROM-based diagnostics that allow you to run system-specific or device-specific exercisers. The exercisers run concurrently to provide maximum bus interaction between the console drivers and the target devices.

Run the diagnostics by using commands from the SRM console. To run the diagnostics in the background, use the background operator “&” at the end of the command. Errors are reported to the console terminal, the console event log, or both.

If you are not familiar with the SRM console, see the ES45 Owner’s Guide.

4.1 Diagnostic Command Summary

Diagnostic commands are used to test the system and help diagnose failures.

Table 4–1 gives a summary of the SRM diagnostic commands and related commands. See Chapter 6 for a list of SRM environment variables, and see Appendix A for a list of SRM commands most commonly used for the ES45 system.

Table 4–1 Summary of Diagnostic and Related Commands

Command               Function
buildfru              Initializes I2C bus EEPROM data structures for the
                      named FRU.
cat el                Displays the console event log. Same as more el, but
                      scrolls rapidly. The most recent errors are at the
                      end of the event log and are visible on the terminal
                      screen.
clear_error           Clears errors logged in the FRU EEPROMs as reported
                      by the show error command.
crash                 Forces a crash dump at the operating system level.
deposit               Writes data to the specified address of a memory
                      location, register, or device.
examine               Displays the contents of a memory location, register,
                      or device.
exer                  Exercises one or more devices by performing specified
                      read, write, and compare operations.
floppy_write          Runs a write test on the floppy drive to determine
                      whether you can write on the diskette.
grep                  Searches for “regular expressions” (specific strings
                      of characters) and prints any lines containing
                      occurrences of the strings.
hd                    Dumps the contents of a file (byte stream) in
                      hexadecimal and ASCII.
hose_x_default_speed  Controls the default PCI bus speed for the specified
                      hose when no PCI devices are present. If PCI devices
                      are present on the specified hose, this EV is ignored
                      and the bus speed is negotiated based on whether each
                      device is 66 MHz capable or not.
info                  Displays registers and data structures.
kill                  Terminates a specified process.
kill_diags            Terminates all executing diagnostics.
more el               Same as cat el, but displays the console event log
                      one screen at a time.
memexer               Runs a requested number of memory tests in the
                      background.
memtest               Tests a specified section of memory.
net -ic               Initializes the MOP counters for the specified
                      Ethernet port.
net -s                Displays the MOP counters for the specified Ethernet
                      port.
nettest               Runs loopback tests for PCI-based Ethernet ports.
                      Also used to test a port on a “live” network.
php_led_test          Tests the LEDs of all hot-plug slots on a specified
                      I/O hose. Each LED is tested with four available
                      patterns (blink A, blink B, On and Off for 5 seconds
                      in each pattern).
php_button_test       Tests the attention switch of each hot-plug slot on a
                      specified I/O hose.
                      The user is prompted to press the attention switch
                      for each slot that has a blinking green LED. The user
                      has 10 seconds to press the switch before the test
                      declares a failure and moves on to the next slot.
scsi_poll             Controls whether or not a particular SCSI device
                      driver polls for devices on the bus when the driver
                      is started. This EV is supported by some, but not
                      all, console SCSI device drivers.
scsi_reset            Controls whether or not a particular SCSI device
                      driver resets the SCSI bus when the driver is
                      started. This EV is supported by some, but not all,
                      console SCSI device drivers.
set sys_serial_num    Sets the system serial number, which is then
                      propagated to all FRUs that have EEPROMs.
sys_com1_rmc          Enables/disables internal COM1 access to the RMC.
show error            Reports errors logged in the FRU EEPROMs.
show fru              Displays information about field replaceable units
                      (FRUs), including CPUs, memory DIMMs, and PCI cards.
show_status           Displays the progress of diagnostic tests. Reports
                      one line of information for each executing
                      diagnostic.
sys_exer              Exercises the devices displayed with the show config
                      command.
sys_exer -lb          Runs console loopback tests for the COM2 serial port
                      and the parallel port during the sys_exer test
                      sequence.
test                  Verifies the configuration of the devices in the
                      system.
test -lb              Runs loopback tests for the COM2 serial port and the
                      parallel port in addition to verifying the
                      configuration of devices.

4.2 buildfru

The buildfru command initializes I2C bus EEPROM descriptive data structures for the named FRU and initializes its SDD and TDD error logs. This command uses data supplied on the command line to build the FRU descriptor. Buildfru is used by Manufacturing, FRU repair operations, or Field Service.
Example 4–1 buildfru

    P00>>> buildfru smb0.mmb0.J3 54-24941-EA NI90200100
    P00>>> buildfru smb0.cpu0 30-30158-05.AX05 NI94060554 Compaq
    P00>>> buildfru -s smb0.mmb0.J3 80 45
    P00>>> buildfru -s smb0.mmb0.J3 80 47 46 45 44 43 42 41

In the order shown, the commands in the example do the following:

Build the FRU descriptor on a DIMM, passing a part number and a serial number.

Build the FRU descriptor on a CPU, passing a part number, serial number, and miscellaneous string.

Build the FRU descriptor on a DIMM with the -s qualifier, passing offset 80 and a value of 45.

Build the FRU descriptor on a DIMM with the -s qualifier, passing offset 80 and many sequential data bytes.

The buildfru command is used for several purposes:
• By Manufacturing to build a FRU table containing a description of each FRU in the system
• By FRU repair operations for initializing good stocking spares
• By Field Service to make any FRU descriptor adjustments required by the customer.

The information supplied on the buildfru command line includes the console name for the FRU, part number, serial number, model number, and optional information. The buildfru command facilitates writing the FRU information to the EEPROM on the device.

Use the show fru command to display the FRU table created with buildfru. Use the show error command to display FRUs that have errors logged to them.

Typically, you only need to use buildfru in Field Service if you replace a device for which the information displayed with the show fru command is inaccurate or missing. After replacing the device, use buildfru to build the new FRU descriptor.

NOTE: Be sure to enter the FRU information carefully. If you enter incorrect information, the callout used by Compaq Analyze will not be accurate.

Three areas of the EEPROM can be initialized: the FRU generic data, the FRU specific data, and the system specific data. Each area has its own checksum, which is recalculated any time that segment of the EEPROM is written.
When the buildfru command is executed, the FRU EEPROM is first flooded with zeros and then the generic data, the system specific data, and EEPROM format version information are written and checksums are updated. For certain FRUs, such as CPU modules, additional FRU “specific” data can be entered using the -s option. This data is written to the appropriate region, and its corresponding checksum is updated.

FRU Assembly Hierarchy

Alpha-based systems can be decomposed into a collection of FRUs. Some FRUs carry various levels of nested FRUs. For instance, the system motherboard is a FRU that carries a number of “child” FRUs. A child, such as a memory motherboard (MMB), may carry a number of its own children, DIMMs. The naming convention for FRUs represents the assembly hierarchy.

The following is the general form of a FRU name:

    <frun>[.<frun>[.<frun>]]

Here fru is a placeholder for the appropriate FRU type at that level, and n is the number of that FRU instance on that branch of the system hierarchy.

The ES45 FRU assembly hierarchy has three levels. The FRU types from the top to the bottom of the hierarchy are as follows:

Level          FRU Type     Meaning
First Level    SMB          System motherboard
               JIO          I/O connector module (junk I/O)
               OCP          Operator control panel
               PWR (0–2)    Power supplies
               FAN          Fans
Second Level   CPU (0–3)    CPUs
               MMB (0–3)    Memory motherboards
               CPB          PCI backplane
Third Level    J (2–9)      Memory DIMMs
               PCI (0–9)    PCI slots
               SBM (0–1)    SCSI backplane

To build a FRU descriptor for a lower level FRU, point back to the higher level FRUs with which it is associated. For example, to build a descriptor for a DIMM, point back to the MMB on which it resides and then to the system motherboard. All fields are automatically set to uppercase before writing to EEPROM. See Example 4–1.

If you enter the buildfru data correctly for a device that has an EEPROM to program, nothing is displayed after you enter the command.
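
The FRU naming convention lends itself to a quick sanity check before a name is typed at the console. The Python sketch below is a hypothetical helper, not part of the SRM console: parse_fru_name and the LEVELS table are names invented here, with the FRU types taken from the hierarchy table above. It only checks that each component has a known type for its level; it does not verify instance-number ranges or that a child actually belongs under the given parent.

```python
import re

# FRU types allowed at each level, taken from the hierarchy table above.
LEVELS = [
    {"SMB", "JIO", "OCP", "PWR", "FAN"},  # first level
    {"CPU", "MMB", "CPB"},                # second level
    {"J", "PCI", "SBM"},                  # third level
]

COMPONENT = re.compile(r"^([A-Za-z]+)(\d+)$")


def parse_fru_name(name):
    """Split a name such as 'smb0.mmb0.J3' into (type, instance) pairs.

    Types are uppercased, mirroring the guide's statement that fields are
    set to uppercase before being written to the EEPROM.
    """
    parts = name.split(".")
    if not 1 <= len(parts) <= len(LEVELS):
        raise ValueError(f"expected 1 to {len(LEVELS)} components: {name!r}")
    parsed = []
    for level, part in enumerate(parts):
        match = COMPONENT.match(part)
        if match is None:
            raise ValueError(f"component {part!r} is not of the form <fru><n>")
        fru_type = match.group(1).upper()
        if fru_type not in LEVELS[level]:
            raise ValueError(f"{fru_type} is not a level {level + 1} FRU type")
        parsed.append((fru_type, int(match.group(2))))
    return parsed


if __name__ == "__main__":
    print(parse_fru_name("smb0.mmb0.J3"))  # [('SMB', 0), ('MMB', 0), ('J', 3)]
    print(parse_fru_name("smb0.cpu0"))     # [('SMB', 0), ('CPU', 0)]
```

A name that passes this check can still be rejected by the console, for example when the device has no EEPROM to program.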
If you enter incorrect data or the device does not have an EEPROM to program, an error message similar to the following is displayed:

    P00>>>
    P00>>> buildf fan4 54-12345-01.a001 ay84412345
    Device FAN4 does not support setting FRU values
    P00>>>

Syntax

    buildfru ( <fru_name> <part_num> <serial_num> [<misc> [<other>]]
               or
               -s <fru_name> <offset> <byte> [<byte>...] )

Arguments

<fru_name>     Console name for this FRU. This name reflects the position
               of the FRU in the assembly hierarchy.
<part_num>     The FRU's 2-5-2.4 part number. This ASCII string should be
               16 characters (extra characters are truncated). This field
               should not contain any embedded spaces. If a space must be
               inserted, enclose the entire argument string in double
               quotes. This field contains the FRU revision, and in some
               cases an embedded space is allowed between the part number
               and the revision.
<serial_num>   The FRU's serial number. This ASCII string must be 10
               characters (extra characters are truncated). The
               manufacturing location and date are extracted from this
               field.
<misc>         The FRU's model name or number or the common name for the
               FRU. This ASCII string may be up to 10 characters (extra
               characters are truncated). This field is optional, unless
               <other> (the alias) is specified.
<other>        The FRU's Compaq alias number, if one exists. This ASCII
               string may be up to 16 characters (extra characters are
               truncated). This field is optional.
<offset>       The beginning byte offset (0–255 hex) within this FRU's
               EEPROM, where the following supplied data bytes are to be
               written.
<byte>...      The data bytes to be written. At least one data byte must
               be supplied after the offset.

Options

-s             Writes raw data to the EEPROM. This option is typically
               used to apply any FRU specific data.

4.3 cat el and more el

The cat el and more el commands display the contents of the console event log.

In Example 4–2, the console reports that CPU 1 did not power up and fans 1 and 2 failed.
Example 4–2 more el

    >>> more el
    *** Error - CPU 1 failed powerup diagnostics ***
    Secondary start error
    EV6 BIST = 1
    STR status = 1
    CSC status = 1
    PChip0 status = 1
    PChip1 status = 1
    DIMx status = 0
    TIG Bus status = 1
    DPR status = 0
    CPU speed status = 0
    CPU speed = 0
    Powerup time = 00-00-00 00:00:00
    CPU SROM sync = 0
    *** Error - Fan 1 failed ***
    *** Error - Fan 2 failed ***

CPU 1 failed.

Fan 1 and Fan 2 failed.

Status and error messages are logged to the console event log at power-up, during normal system operation, and while running system tests. Standard error messages are indicated by asterisks (***).

When cat el is used, the contents of the console event log scroll by. Use the Ctrl/S key combination to stop the screen from scrolling, and use Ctrl/Q to resume scrolling.

The more el command allows you to view the console event log one screen at a time.

Syntax

    cat el
    or
    more el

4.4 clear_error

The clear_error command clears errors logged in the FRU EEPROMs as reported by the show error command.

Example 4–3 clear_error

    P00>>> clear_error smb0
    P00>>>
    P00>>> clear_error all
    P00>>>

The first command clears all errors logged in the FRU EEPROM on the system motherboard (SMB0).

The second command clears all errors logged to all FRU EEPROMs in the system.

The clear_error command clears TDD, SDD, and checksum errors. Hardware failures and unreadable EEPROM errors are not cleared. See Table 4–2.

Syntax

clear_error <fruname>
    Clears all errors logged to a specific FRU. Fruname is the name of the specified FRU. If you do not specify a FRU, you must use clear_error all to clear errors.

clear_error all
    Clears all errors logged to all system FRUs.

See the show error command for information on the types of errors that might be logged to the FRU EEPROMs.

4.5 crash

The SRM crash command forces a crash dump to the selected device for UNIX and OpenVMS systems.

    P00>>> crash
    CPU 0 restarting
    DUMP: 19837638 blocks available for dumping.
    DUMP: 118178 wanted for a partial compressed dump.
    DUMP: Allowing 2060017 of the 2064113 available on 0x800001
    device string for dump = SCSI 1 1 0 0 0 0 0.
    DUMP.prom: dev SCSI 1 1 0 0 0 0 0, block 2178787
    DUMP: Header to 0x800001 at 2064113 (0x1f7ef1)
    device string for dump = SCSI 1 1 0 0 0 0 0.
    DUMP.prom: dev SCSI 1 1 0 0 0 0 0, block 2178787
    DUMP: Dump to 0x800001: .......: End 0x800001
    device string for dump = SCSI 1 1 0 0 0 0 0.
    DUMP.prom: dev SCSI 1 1 0 0 0 0 0, block 2178787
    DUMP: Header to 0x800001 at 2064113 (0x1f7ef1)
    succeeded
    halted CPU 0
    halt code = 5
    HALT instruction executed
    PC = fffffc0000568704
    P00>>>

Use the crash command when the system has hung and you are able to halt it with the Halt button or the RMC halt in command. The crash command restarts the operating system and forces a crash dump to the selected device.

• See the OpenVMS Alpha System Dump Analyzer Utility Manual for information on how to interpret OpenVMS crash dump files.
• See the Guide to Kernel Debugging for information on using the Tru64 UNIX Krash Utility.

4.6 deposit and examine

The deposit command writes data to the specified address of a memory location, register, or device. The examine command displays the contents of a memory location, register, or a device.

Example 4–4 deposit and examine

deposit

    P00>>> dep -b -n 1ff pmem:0 0
    P00>>> d -l -n 3 vmem:1234 5
    P00>>> d -n 8 r0 ffffffff
    P00>>> d -l -n 10 -s 200 pmem:0 8
    P00>>> d -l pmem:0 0
    P00>>> d + ff
    P00>>> d scbb 820000

examine

    P00>>> e dpr:34f0 -l -n 5
    dpr: 34F0 00000000
    dpr: 34F4 00000000
    dpr: 34F8 00000000
    dpr: 34FC 00000000
    dpr: 3500 204D5253
    dpr: 3504 352E3558
    P00>>>

Deposit

The deposit command stores data in the location specified. If no options are given, the system uses the options from the preceding deposit command.

If the specified value is too large to fit in the data size listed, the console ignores the command and issues an error.
If the data is smaller than the data size, the higher order bits are filled with zeros.

In Example 4–4, the deposit commands, in the order shown, do the following:

Clear the first 512 bytes of physical memory.

Deposit 5 into four longwords starting at virtual memory address 1234.

Load GPRs R0 through R8 with -1.

Deposit 8 in the first longword of the first 17 pages in physical memory.

Deposit 0 to physical memory address 0.

Deposit FF to physical memory address 4.

Deposit 820000 to SCBB.

Examine

The examine command displays the contents of a memory location, a register, or a device. If no options are given, the system uses the options from the preceding examine command.

If conflicting address space or data sizes are specified, the console ignores the command and issues an error.

For data lengths longer than a longword, each longword of data should be separated by a space.

In Example 4–4, the examine command does the following:

Examine the DPR starting at location 34f0 and continuing through the next 5 locations, and display the data in longwords.

Syntax

    deposit [-{b,w,l,q,o,h}] [-{n value, s value}] [space:] address data
    examine [-{b,w,l,q,o,h}] [-{n value, s value}] [space:] address

-b             Defines data size as byte.
-w             Defines data size as word.
-l (default)   Defines data size as longword.
-q             Defines data size as quadword.
-o             Defines data size as octaword.
-h             Defines data size as hexword.
-d             Instruction decode (examine command only).
-n value       The number of consecutive locations to modify.
-s value       The address increment size. The default is the data size.
dev_name       Device name (address space) of the device to access.
               Device names are:
               dpr      Dual-port RAM. See Appendix C for the DPR address
                        layout.
               eerom    Nonvolatile ROM used for EV storage.
               fpr      Floating-point register set; name is F0 to F31.
                        Alternatively, can be referenced by name.
               gpr      General register set; name is R0 to R31.
                        Alternatively, can be referenced by name.
               ipr      Internal processor registers. Alternatively, some
                        IPRs can be referenced by name.
               pcicfg   PCI configuration space.
  pciio   PCI I/O space.
  pcimem  PCI memory space.
  pt      The PALtemp register set; name is PT0 to PT23.
  pmem    Physical memory (default).
  vmem    Virtual memory.
offset        Offset within a device to which data is deposited.
data          Data to be deposited.

Symbolic forms can be used for the address. They are:

pc  The program counter. The address space is set to GPR.
+   The location immediately following the last location referenced in a deposit or examine command. For physical and virtual memory, the referenced location is the last location plus the size of the reference (1 for byte, 2 for word, 4 for longword). For other address spaces, the address is the last referenced address plus 1.
-   The location immediately preceding the last location referenced in a deposit or examine command. Memory and other address spaces are handled as above.
*   The last location referenced in a deposit or examine command.
@   The location addressed by the last location referenced in a deposit or examine command.

4.7 exer

The exer command exercises one or more devices by performing specified read, write, and compare operations. Typically exer is run from the built-in console script. Advanced users may want to use the specific options described here. Note that running exer on disks can be destructive.

Optionally, exer reports performance statistics. The three operations work as follows:

• A read operation reads from a device that you specify into a buffer.
• A write operation writes from a buffer to a device that you specify.
• A compare operation compares the contents of the two buffers.

The exer command uses two buffers, buffer1 and buffer2, to carry out the operations. A read or write operation can be performed using either buffer. A compare operation uses both buffers.

Example 4–5 exer

P00>>> exer dk*.* -p 0 -secs 36000

Read SCSI disks for the entire length of each disk. Repeat this until 36000 seconds (10 hours) have elapsed. All disks will be read concurrently.
Each block read will occur at a random block number on each disk.

P00>>> exer -l 2 dka0

Read block numbers 0 and 1 from device dka0.

P00>>> exer -sb 1 -eb 3 -bc 4 -a 'w' -d1 '0x5a' dka0

Write hex 5a's to every byte of blocks 1, 2, and 3. The packet size is bc * bs = 4 * 512 = 2048 bytes for all writes.

P00>>> ls -l dk*.*
r--dk 0/0 0 dka0.0.0.0.0
P00>>> exer dk*.* -bc 10 -sec 20 -m -a 'r'
dka0.0.0.0.0 exer completed
packet IOs elapsed idle
8192 3325 27238400 0 166 1360288 20 19

P00>>> exer -eb 64 -bc 4 -a '?w-Rc' dka0

A destructive write test over block numbers 0 through 100 on disk dka0. The packet size is 2048 bytes. The action string specifies the following sequence of operations:

1. Set the current block address to a random block number on the disk between 0 and 97. A four-block packet starting at block number 98, 99, or 100 would access blocks beyond the end of the length to be processed, so 97 is the largest possible starting block address of a packet.
2. Write a packet of hex 5a's from buffer1 to the current block address.
3. Set the current block address to what it was just prior to the previous write operation.
4. From the current block address, read a packet into buffer2.
5. Compare buffer1 with buffer2 and report any discrepancies.
6. Repeat steps 1 through 5 until enough packets have been written to satisfy the length requirement of 101 blocks.

P00>>> exer -a '?r-w-Rc' dka0

A nondestructive write test with packet sizes of 512 bytes. Use this test only if the customer has a current backup of any disks being tested. The action string specifies the following sequence of operations:

1. Set the current block address to a random block number on the disk.
2. From the current block address on the disk, read a packet into buffer1.
3. Set the current block address to the device address where it was just before the previous read operation occurred.
4. Write the contents of buffer1 back to the current block address.
5. Set the current block address to what it was just prior to the previous write operation.
6. From the current block address on the disk, read a packet into buffer2.
7. Compare buffer1 with buffer2 and report any discrepancies.
8. Repeat the above steps until each block on the disk has been written once and read twice.

You can tailor the behavior of exer by using options to specify the following:

• An address range to test within the test device(s)
• The packet size, also known as the I/O size, which is the number of bytes read or written in one I/O operation
• The number of passes to run
• How many seconds to run
• A sequence of individual operations performed on the test devices. The qualifier is called the action string qualifier.

Syntax

exer ( [-sb <start_block>] [-eb <end_block>] [-p <pass_count>] [-l <blocks>] [-bs <block_size>] [-bc <block_per_io>] [-d1 <buf1_string>] [-d2 <buf2_string>] [-a <action_string>] [-sec <seconds>] [-m] [-v] [-delay <milliseconds>] <device_name>... )

Arguments

<device_name>  Specifies the names of the devices or filestreams to be exercised.

Options

-sb <start_block>   Specifies the starting block number (hex) within filestream. The default is 0.
-eb <end_block>     Specifies the ending block number (hex) within filestream. The default is 0.
-p <pass_count>     Specifies the number of passes to run the exerciser. If 0, then run forever or until Ctrl/C. The default is 1.
-l <blocks>         Specifies the number of blocks (hex) to exercise. -l has precedence over -eb. If only reading, then specifying neither -l nor -eb defaults to read till eof. If writing, and neither -l nor -eb is specified, then exer will write for the size of the device. The default is 1.
-bs <block_size>    Specifies the block size (hex) in bytes. The default is 200 (hex).
-bc <block_per_io>  Specifies the number of blocks (hex) per I/O. On devices without length (tape), use the specified packet size or default to 2048.
The maximum block size allowed with variable length block reads is 2048 bytes. The default is 1.
-d1 <buf1_string>   String argument for eval to generate buffer1 data pattern from. Buffer1 is initialized only once before any I/O occurs. Default = all bytes set to hex 5A's.
-d2 <buf2_string>   String argument for eval to generate buffer2 data pattern from. Buffer2 is initialized only once before any I/O occurs. Default = all bytes set to hex 5A's.
-a <action_string>  Specifies an exerciser action string, which determines the sequence of reads, writes, and compares to various buffers. The default action string is ?r. The action string characters are:
  • r  Read into buffer1.
  • w  Write from buffer1.
  • R  Read into buffer2.
  • W  Write from buffer2.
  • n  Write without lock from buffer1.
  • N  Write without lock from buffer2.
  • c  Compare buffer1 with buffer2.
  • -  Seek to file offset prior to last read or write.
  • ?  Seek to a random block offset within the specified range of blocks. exer calls the program, random, to “deal” each of a set of numbers once. exer chooses a set that is a power of two and is greater than or equal to the block range. Each call to random results in a number that is then mapped to the set of numbers that are in the block range, and exer seeks to that location in the filestream. Since exer starts with the same random number seed, the set of random numbers generated will always be over the same set of block range numbers.
  • s  Sleep for a number of milliseconds specified by the delay qualifier. If no delay qualifier is present, sleep for 1 millisecond. Times as reported in verbose mode will not necessarily be accurate when this action character is used.
  • z  Zero buffer 1.
  • Z  Zero buffer 2.
  • b  Add constant to buffer 1.
  • B  Add constant to buffer 2.
-sec <seconds>      Terminates the exercise after the specified number of seconds has elapsed.
By default the exerciser continues until the specified number of blocks or pass count is processed.
-m                  Specifies metrics mode. At the end of the exerciser a total throughput line is displayed.
-v                  Specifies verbose mode. Data read is also written to stdout. This is not applicable on writes or compares. The default is verbose mode off.
-delay <milliseconds>  Specifies the number of milliseconds to delay when s appears as a character in the action string.

4.8 floppy_write

The floppy_write script runs a write test on the floppy drive to determine whether or not you can write on the diskette. Use this script if a customer is unable to write data to the floppy. This is a destructive test, so use a blank floppy.

Example 4–6 floppy_write

P00>>> floppy_write
Destructive Test of the Floppy started
P00>>> show_status
ID       Program      Device       Pass   Hard/Soft Bytes Written Bytes Read
-------- ------------ ------------ ------ --------- ------------- ----------
00000001 idle         system       0      0    0    0             0
00000c37 exer_kid     dva0.0.0.100 0      0    0    6656          6656

The floppy_write script uses exer to run a write test on the floppy. The test runs in the background. Use the show_status command to display the progress of the test. Use the kill or kill_diags command to terminate the test.

4.9 grep

The grep command is very similar to the UNIX grep command. It searches for “regular expressions” (specific strings of characters) and prints any lines containing occurrences of the strings. Using grep is similar to using wildcards.

Example 4–7 grep

P00>>> show fru | grep PCI
SMB0.CPB0.PCI1 0 DE500-BA Network Cont
SMB0.CPB0.PCI4 0 DEC PowerStorm
SMB0.CPB0.PCI5 0 NCR 53C895
P00>>>

In Example 4–7 the output of the show fru command is piped into grep (the vertical bar is the piping symbol), which passes through only the lines containing “PCI.”

grep supports the following metacharacters:

^   Matches beginning of line.
$   Matches end of line.
.   Matches any single character.
[]  Set of characters; [ABC] matches either 'A' or 'B' or 'C'. A dash (other than first or last of the set) denotes a range of characters: [A-Z] matches any uppercase letter. If the first character of the set is '^', the sense of the match is reversed: [^0-9] matches any non-digit. Several characters need to be quoted with a backslash (\) if they occur in a set: '\', ']', '-', and '^'.
*   Repeated matching; when placed after a pattern, indicates that the pattern should match any number of times. For example, '[a-z][0-9]*' matches a lowercase letter followed by zero or more digits.
+   Repeated matching; when placed after a pattern, indicates that the pattern should match one or more times. For example, '[0-9]+' matches any non-empty sequence of digits.
?   Optional matching; indicates that the pattern can match zero or one times. For example, '[a-z][0-9]?' matches a lowercase letter alone or followed by a single digit.
\   Quote character; prevents the character that follows from having special meaning.

Syntax

grep ( [-{c|i|n|v}] [-f <file>] [<expression>] [<file>...] )

Arguments

<expression>  Specifies the target regular expression. If any regular expression metacharacters are present, the expression should be enclosed in quotes to avoid interpretation by the shell.
<file>...     Specifies the files to be searched. If none are present, then standard input is searched.

Options

-c         Print only the number of lines matched.
-i         Ignore case. By default grep is case sensitive.
-n         Print the line numbers of the matching lines.
-v         Print all lines that do not contain the expression.
-f <file>  Take regular expressions from a file instead of from the command line.

4.10 hd

The hd command dumps the contents of a file (byte stream) in hexadecimal and ASCII.
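As an illustration of the hexadecimal-plus-ASCII layout (not the console's own implementation), the following Python sketch formats a byte string into rows of an offset, 16 hex bytes, and an ASCII column. The function name hexdump_rows is hypothetical, and the rule that bytes outside 0x20 to 0x7E print as a period is an assumption inferred from the sample output, where FF appears as "." and 3A as ":".

```python
def hexdump_rows(data, width=16):
    """Return hd-style text rows: offset, hex bytes, ASCII column.

    Assumption: bytes outside printable ASCII (0x20-0x7E) are shown as '.'.
    """
    rows = []
    for offset in range(0, len(data), width):
        chunk = data[offset:offset + width]
        # Two uppercase hex digits per byte, separated by spaces.
        hex_part = " ".join(f"{b:02X}" for b in chunk)
        # Pad a short final row so the ASCII column stays aligned.
        hex_part = hex_part.ljust(width * 3 - 1)
        ascii_part = "".join(chr(b) if 0x20 <= b <= 0x7E else "." for b in chunk)
        rows.append(f"{offset:08x} {hex_part} {ascii_part}")
    return rows


if __name__ == "__main__":
    sample = b"HELLO" + b"\xff" * 11
    for row in hexdump_rows(sample):
        print(row)
```

Run on the 16 bytes "HELLO" followed by eleven FF bytes, the sketch produces one row with the same offset, hex, and ASCII fields as the first row of the hd output.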
Example 4–8 hd

P00>>> hd -eb 0 dpr:2b00
block 0
00000000 48 45 4C 4C 4F FF FF FF FF FF FF FF FF FF FF FF HELLO...........
00000010 FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF ................
00000020 FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF ................
00000030 FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF 3A ...............:
00000040 FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF ................
00000050 FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF ................
00000060 FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF ................
00000070 FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF ................
00000080 FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF ................
00000090 FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF ................
000000a0 FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF ................
000000b0 FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF ................
000000c0 FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF ................
000000d0 FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF ................
000000e0 FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF ................
000000f0 FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF ................
00000100 48 45 4C 4C 4F FF FF FF FF FF FF FF FF FF FF FF HELLO...........
00000110 FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF ................
00000120 FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF ................
00000130 FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF 3A ...............:
00000140 FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF ................
00000150 FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF ................
00000160 FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF ................
00000170 FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF ................
00000180 FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF ................
00000190 FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF ................
000001a0 FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF ................
000001b0 FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF ................
000001c0 FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF ................
000001d0 FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF ................
000001e0 FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF ................
000001f0 FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF FF ................
P00>>>

Example 4–8 shows a hex dump of DPR location 2b00, ending at block 0.

Syntax

hd [-{byte|word|long|quad}] [-{sb|eb} <n>] <file>[:<offset>]

Arguments

<file>[:<offset>]  Specifies the file (byte stream) to be displayed.

Options

-byte    Print out data in byte sizes.
-word    Print out data by word.
-long    Print out data by longword.
-quad    Print out data by quadword.
-sb <n>  Start block.
-eb <n>  End block.

4.11 info

The info command displays registers and data structures. You can enter the command by itself or followed by a number (0 to 8). If you do not specify a number, a list of selections is displayed and you are prompted to enter a selection.

Example 4–9 info 0

P00>>> info
0. HWRPB MEMDSC
1. Console PTE
2. GCT/FRU 5
3. Dump System CSRs
4. IMPURE area (abbreviated)
5. IMPURE area (full)
6. LOGOUT area
7. Dump Error Log
8. Clear Error Log
Enter selection: 0

HWRPB: 2000 MEMDSC:2d40 Cluster count: 3

Cluster: 0, Usage: Console
START_PFN: 00000000 PFN_COUNT: 0000016e PFN_TESTED: 00000000
366 pages from 0000000000000000 to 00000000002dbfff

Cluster: 1, Usage: System
START_PFN: 0000016e PFN_COUNT: 0013fe75 PFN_TESTED: 00003e92
BITMAP_VA: 0000000000000000 BITMAP_PA: 000000027ffd8000
1310327 good pages from 00000000002dc000 to 00000027fffc5fff

Cluster: 2, Usage: Console
START_PFN: 0011ffe3 PFN_COUNT: 0000001d PFN_TESTED: 00000000
29 pages from 000000027ffc6000 to 000000027fffffff
P00>>>

For information about the data displayed by the info commands, see the following documents:

• For info 0, info 1, and info 4, see the Alpha System Reference Manual, Third Edition (EY-W938E-DP), available from Digital Press, an imprint of Butterworth-Heinemann.
• For info 2, see the Galaxy Console and Alpha Systems V5.0 FRU Configuration Tree Specification.
• For info 3, see the Titan 21274 Chipset Functional Specification.

info 0  Displays the SRM memory descriptors as described in the Alpha System Reference Manual.
info 1  Displays the page table entries (PTE) used by the console and operating system to map virtual to physical memory. Valid data is displayed only after a boot operation.
info 2  Dumps the Galaxy Configuration Tree (GCT) FRU table. Galaxy is a software architecture that allows multiple instances of OpenVMS to execute cooperatively on a single computer.
info 3  Dumps the contents of the system control status registers (CSRs) for the C-chip, D-chip, and P-chips.
info 4  Displays the per CPU impure area in abbreviated form. The console uses this scratch area to save processor context.
info 5  Displays the per CPU impure area in full form.
info 6  Displays the per CPU machine check logout area.
info 7  Displays the contents of the Console Data Log.
info 8  Clears all event frames in the Console Data Log.
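The relationship between the START_PFN and PFN_COUNT fields and the byte range printed for a memory cluster in the info 0 display can be checked with a short calculation. The Python sketch below is illustrative only: the function name cluster_byte_range is hypothetical, and the 8 KB page size is an assumption, chosen because it reproduces the "366 pages from 0000000000000000 to 00000000002dbfff" line shown for cluster 0.

```python
PAGE_SIZE = 8192  # assumption: 8 KB pages, consistent with the cluster 0 line


def cluster_byte_range(start_pfn, pfn_count, page_size=PAGE_SIZE):
    """Return (first_byte, last_byte) covered by a memory cluster.

    start_pfn and pfn_count are the page frame numbers printed (in hex)
    as START_PFN and PFN_COUNT in the info 0 display.
    """
    first_byte = start_pfn * page_size
    last_byte = (start_pfn + pfn_count) * page_size - 1
    return first_byte, last_byte


if __name__ == "__main__":
    # Cluster 0 from the example: START_PFN 00000000, PFN_COUNT 0000016e.
    start_pfn, pfn_count = 0x00000000, 0x0000016E
    first, last = cluster_byte_range(start_pfn, pfn_count)
    print(f"{pfn_count} pages from {first:016x} to {last:016x}")
```

With the cluster 0 values, the sketch prints the same page count and address range as the console output. The same function gives 00000000002dc000 as the first byte of a cluster whose START_PFN is 0000016e.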
For more information on: • info 0, 1, 4, and 5 see the Alpha System Reference Manual • info 2 see the Galaxy console and Alpha Systems V5.0 FRU configuration Tree Specification. • info 3 see the Titan Chipset Engineering Specification. • info 6 and 7 see the AlphaServer ES45 Platform Fault Management Specification. 4-28 ES45 Service Guide Example 4–10 shows an info 1 display. Example 4–10 info 1 P00>>> info 1 pte 000000003FFA8000 0000000100001101 pte 000000003FFA8008 0000000200001101 pte 000000003FFA8010 0000000300001101 pte 000000003FFA8018 0000000400001101 pte 000000003FFA8020 0000000500001101 pte 000000003FFA8028 0000000600001101 pte 000000003FFA8030 0000000700001101 pte 000000003FFA8038 0000000800001101 pte 000000003FFA8040 0000000900001101 pte 000000003FFA8048 0000000A00001101 pte 000000003FFA8050 0000000B00001101 pte 000000003FFA8058 0000000C00001101 pte 000000003FFA8060 0000000D00001101 pte 000000003FFA8068 0000000E00001101 pte 000000003FFA8070 0000000F00001101 pte 000000003FFA8078 0000001000001101 pte 000000003FFA8080 0000001100001101 pte 000000003FFA8088 0000001200001101 pte 000000003FFA8090 0000001300001101 pte 000000003FFA8098 0000001400001101 pte 000000003FFA80A0 0000001500001101 pte 000000003FFA80A8 0000001600001101 pte 000000003FFA80B0 0000001700001101 pte 000000003FFA80B8 0000001800001101 pte 000000003FFA80C0 0000001900001101 pte 000000003FFA80C8 0000001A00001101 pte 000000003FFA80D0 0000001B00001101 pte 000000003FFA80D8 0000001C00001101 pte 000000003FFA80E0 0000001D00001101 pte 000000003FFA80E8 0000001E00001101 va 0000000010000000 va 0000000010002000 va 0000000010004000 va 0000000010006000 va 0000000010008000 va 000000001000A000 va 000000001000C000 va 000000001000E000 va 0000000010010000 va 0000000010012000 va 0000000010014000 va 0000000010016000 va 0000000010018000 va 000000001001A000 va 000000001001C000 va 000000001001E000 va 0000000010020000 va 0000000010022000 va 0000000010024000 va 0000000010026000 va 0000000010028000 va 000000001002A000 va 
000000001002C000 va 000000001002E000 va 0000000010030000 va 0000000010032000 va 0000000010034000 va 0000000010036000 va 0000000010038000 va 000000001003A000 pa 0000000000002000 pa 0000000000004000 pa 0000000000006000 pa 0000000000008000 pa 000000000000A000 pa 000000000000C000 pa 000000000000E000 pa 0000000000010000 pa 0000000000012000 pa 0000000000014000 pa 0000000000016000 pa 0000000000018000 pa 000000000001A000 pa 000000000001C000 pa 000000000001E000 pa 0000000000020000 pa 0000000000022000 pa 0000000000024000 pa 0000000000026000 pa 0000000000028000 pa 000000000002A000 pa 000000000002C000 pa 000000000002E000 pa 0000000000030000 pa 0000000000032000 pa 0000000000034000 pa 0000000000036000 pa 0000000000038000 pa 000000000003A000 pa 000000000003C000 . . . SRM Console Diagnostics 4-29 Example 4–11 shows an info 2 display. Example 4–11 info 2 P00>>> info 2 GCT_ROOT_NODE GCT_NODE Type Subtype Hd_extension Size Rev_major Rev_minor Id node_flags saved_owner affinity parent child fw_usage Root->lock Root->transient_level Root->current_level Root->console_req Root->min_alloc Root->min_align Root->base_alloc Root->base_align Root->max_phys_addr Root->mem_size Root->platform_type Root->platform_name Root->primary_instance Root->first_free Root->high_limit Root->lookaside Root->available Root->max_partition Root->partitions Root->communities Root->bindings Root->max_plat_partition Root->max_desc Root->galaxy_id Root->root_flags 21e000 1 0 0 c000 6 0 0000000000000000 0 0 0 0 2c0 0 ffffffff 1f 1f 200000 100000 100000 2000000 2000000 7fffffffff 27fffffff 40500000022 0000000000000280 0 69c0 bcc0 0 2d00 1 0000000000000180 00000000000001c0 0000000000000200 1 1 21e128 1 dump depth view ? (Y/<N> N dump each node ? (Y/<N>) N dump binary ? (Y/<N>) N show flags ? (Y/<N>) N Dump a Node – Enter Handle (hex) ? P00>>> P00>>> 4-30 ES45 Service Guide Example 4–12 shows an info 3 display. Example 4–12 info 3 P00>>> info 0. HWRPB MEMDSC 1. Console PTE 2. GCT/FRU 5 3. Dump System CSRs 4. 
IMPURE area (abbreviated) 5. IMPURE area (full) 6. LOGOUT area 7. Dump Error Log 8. Clear Error Log Enter selection: 3 CCHIP CSC MTR MISC AAR0 AAR1 AAR2 AAR3 DIM0 DIM1 DIM2 DIM3 DIR0 DIR1 DIR2 DIR3 DRIR TTR TDR CSRs: 801a0000000 705B80000919792F : 0000 00002F641E001225 : 0040 0000001100000000 : 0080 0000000000009305 : 0100 0000000200007105 : 0140 0000000100009305 : 0180 0000000240007105 : 01c0 F884003010011000 : 0200 0000000000000000 : 0240 0000000000000000 : 0600 0000000000000000 : 0640 0000000000000000 : 0280 0000000000000000 : 02c0 0000000000000000 : 0680 0000000000000000 : 06c0 0300000000000000 : 0300 000000000000077F : 0580 F7FFF7FFF7FFF7FF : 05c0 DCHIP DSC DSC2 STR DREV CSRs: 801b0000000 7F7F7F7F7F7F7F7F : 0800 7F7F7F7F7F7F7F7F : 08c0 3939393939393939 : 0840 1111111111111111 : 0880 PCHIP 0 CSRs: GWSBA0 GWSBA1 GWSBA2 GWSBA3 GWSM0 GWSM1 GWSM2 GWSM3 GTBA0 GTBA1 GTBA2 GTBA3 GPCTL GPLAT SERROR 80180000000 0000000000800000 : 0000 0000000080000001 : 0040 0000000000000000 : 0080 0000000000000002 : 00c0 0000000000700000 : 0100 000000003FF00000 : 0140 0000000000000000 : 0180 0000000000000000 : 01c0 0000000000000000 : 0200 0000000000000000 : 0240 0000000000000000 : 0280 0000000000000000 : 02c0 00000004C38000C0 : 0300 000000000000FF00 : 0340 0000000000000000 : 0400 SRM Console Diagnostics 4-31 SERREN GPERROR GPERREN SCTL AWSBA0 AWSBA1 AWSBA2 AWSBA3 AWSM0 AWSM1 AWSM2 AWSM3 ATBA0 ATBA1 ATBA2 ATBA3 APCTL APLAT AGPERROR AGPERREN APERROR APERREN 000000000000000E 0440 0000000000400000 0500 00000000000007F6 0540 0000000002831611 0700 0000000000800000 : 0000 0000000080000001 : 0040 0000000000000000 : 0080 0000000000000002 : 10c0 0000000000700000 : 1100 000000003FF00000 : 1140 0000000000000000 : 1180 0000000000000000 : 11c0 0000000000000000 : 1200 0000000000000000 : 1240 0000000000000000 : 1280 0000000000000000 : 12c0 48000004C2C200C0 : 1300 000000000000FF00 : 1340 0000054112908000 : 1400 0000000000000010 1440 0000000000000000 1500 00000000000007F6 1540 PCHIP 1 CSRs: GWSBA0 
GWSBA1 GWSBA2 GWSBA3 GWSM0 GWSM1 GWSM2 GWSM3 GTBA0 GTBA1 GTBA2 GTBA3 GPCTL GPLAT SERROR SERREN GPERROR GPERREN SCTL AWSBA0 AWSBA1 AWSBA2 AWSBA3 AWSM0 AWSM1 AWSM2 AWSM3 ATBA0 ATBA1 ATBA2 ATBA3 APCTL APLAT AGPERROR AGPERREN APERROR APERREN 80380000000 0000000000800000 0000 0000000080000001 0040 0000000000000000 0080 0000000000000002 00c0 0000000000700000 0100 000000003FF00000 0140 0000000000000000 0180 0000000000000000 01c0 0000000000000000 0200 0000000000000000 0240 0000000000000000 0280 0000000000000000 02c0 00000004C34000C0 0300 000000000000FF00 0340 0000000000000000 0400 000000000000000E 0440 0000080000004000 0500 00000000000007F6 0540 0000000002831711 0700 0000000000800000 : 1000 0000000080000001 : 1040 0000000000000000 : 1080 0000000000000002 : 10c0 0000000000700000 : 1100 000000003FF00000 : 1140 0000000000000000 : 1180 0000000000000000 : 12c0 0000000000000000 : 1200 0000000000000000 : 1240 0000000000000000 : 1280 0000000000000000 : 12c0 00000004C1C200C0 : 1300 000000000000FF00 : 1340 0008000000000000 : 1400 0000000000000010 1440 0000000000010000 1500 00000000000007F6 1540 4-32 ES45 Service Guide Example 4–13 shows an info 4 display. 
Example 4–13 info 4 P00>>> info 4 cpu00 cpu01 cpu02 cpu03 per_cpu impure area 00004200 00004800 00004e00 00005400 cns$flag 00000001 00000001 00000001 00000001 : 0000 cns$flag+4 00000000 00000000 00000000 00000000 : 0004 cns$hlt 00000000 00000000 00000000 00000000 : 0008 cns$hlt+4 00000000 00000000 00000000 00000000 : 000c cns$mchkflag 00000210 00000210 00000210 00000210 : 0210 cns$mchkflag+4 00000000 00000000 00000000 00000000 : 0214 cns$fpcr 00000000 00000000 00000000 00000000 : 0318 cns$fpcr+4 8ff00000 8ff00000 8ff00000 8ff00000 : 031c cns$va ffffffec fe00385f fe00385f fe00385f : 0320 cns$va+4 ffffffff 00000801 00000801 00000801 : 0324 cns$va_ctl 00000000 00000000 00000000 00000000 : 0328 cns$va_ctl+4 00000000 00000000 00000000 00000000 : 032c cns$exc_addr 00600930 00000000 00000000 00000000 : 0330 cns$exc_addr+4 00000000 00000000 00000000 00000000 : 0334 cns$ier_cm 00000000 00000000 00000000 00000000 : 0338 cns$ier_cm+4 00000020 00000020 00000020 00000020 : 033c cns$sirr 00000000 00000000 00000000 00000000 : 0340 cns$sirr+4 00000000 00000000 00000000 00000000 : 0344 cns$isum 00000000 00000000 00000000 00000000 : 0348 cns$isum+4 00000020 00000020 00000020 00000020 : 034c cns$exc_sum 00001fc0 000010c0 000010c0 000010c0 : 0350 cns$exc_sum+4 00000000 00000000 00000000 00000000 : 0354 cns$pal_base 00008000 00008000 00008000 00008000 : 0358 cns$pal_base+4 00000000 00000000 00000000 00000000 : 035c cns$i_ctl 16300386 16300386 16300386 16300386 : 0360 cns$i_ctl+4 00000000 00000000 00000000 00000000 : 0364 cns$pctr_ctl 00000000 00000000 00000000 00000000 : 0368 cns$pctr_ctl+4 00000000 00000000 00000000 00000000 : 036c cns$process_context 00000004 00000004 00000004 00000004 : 0370 cns$process_context+ 00000000 00000000 00000000 00000000 : 0374 cns$i_stat c0000000 80000000 00000000 80000000 : 0378 cns$i_stat+4 0000013d 00000142 00000174 0000017F : 037c cns$dtb_alt_mode 00000000 00000000 00000000 00000000 : 0380 cns$dtb_alt_mode+4 00000000 00000000 00000000 00000000 : 0384 
cns$mm_stat 00000290 000000e1 000000e1 000000e1 : 0388 cns$mm_stat+4 00000000 00000000 00000000 00000000 : 038c cns$m_ctl 00000020 00000020 00000020 00000020 : 0390 cns$m_ctl+4 00000000 00000000 00000000 00000000 : 0394 cns$dc_ctl 000000c3 000000c3 000000c3 000000c3 : 0398 cns$dc_ctl+4 00000000 00000000 00000000 00000000 : 039c cns$dc_stat 00000000 00000000 00000000 00000000 : 03a0 cns$dc_stat+4 00000000 00000000 00000000 00000000 : 03a4 cns$write_many 00000000 00000000 00000000 00000000 : 03a8 cns$write_many+4 00000000 00000000 00000000 00000000 : 03ac cns$virbnd 00000000 00000000 00000000 00000000 : 03b0 cns$virbnd+4 00000000 00000000 00000000 00000000 : 03b4 cns$sysptbr 00000000 00000000 00000000 00000000 : 03b8 cns$sysptbr+4 00000000 00000000 00000000 00000000 : 03bc cns$report_1am 00000000 00000000 00000000 00000000 : 03c0 cns$report_1am+4 00000000 00000000 00000000 00000000 : 03c4 P00>>> SRM Console Diagnostics 4-33 Example 4–14 shows an info 5 display. Example 4–14 info 5 P00>>> info 5 per_cpu impure area cns$flag cns$flag+4 cns$hlt cns$hlt+4 cns$gpr[0] cns$gpr[0]+4 cns$gpr[1] cns$gpr[1]+4 cns$gpr[2] cns$gpr[2]+4 cns$gpr[3] cns$gpr[3]+4 cns$gpr[4] cns$gpr[4]+4 cns$gpr[5] cns$gpr[5]+4 cns$gpr[6] cns$gpr[6]+4 cns$gpr[7] cns$gpr[7]+4 cns$gpr[8] cns$gpr[8]+4 cns$gpr[9] cns$gpr[9]+4 cns$gpr[10] cns$gpr[10]+4 cns$gpr[11] cns$gpr[11]+4 cns$gpr[12] cns$gpr[12]+4 cns$gpr[13] cns$gpr[13]+4 cns$gpr[14] cns$gpr[14]+4 cns$gpr[15] cns$gpr[15]+4 cns$gpr[16] cns$gpr[16]+4 cns$gpr[17] cns$gpr[17]+4 cns$gpr[18] cns$gpr[18]+4 cns$gpr[19] cns$gpr[19]+4 cns$gpr[20] cns$gpr[20]+4 cns$gpr[21] cns$gpr[21]+4 . . . 
4-34 ES45 Service Guide cpu00 cpu01 cpu02 cpu03 00004200 00004800 00004e00 00005400 00000001 00000001 00000001 00000001 : 0000 00000000 00000000 00000000 00000000 : 0004 00000000 00000000 00000000 00000000 : 0008 00000000 00000000 00000000 00000000 : 000c 00018000 00018000 00018000 00018000 : 0010 00000000 00000000 00000000 00000000 : 0014 0000001f 0000001f 0000001f 0000001f : 0018 00000000 00000000 00000000 00000000 : 001c 00004180 00004180 00004180 00004180 : 0020 00000000 00000000 00000000 00000000 : 0024 00001101 00001101 00001101 00001101 : 0028 00000000 00000000 00000000 00000000 : 002c 00000000 00000000 00000000 00000000 : 0030 00000000 00000000 00000000 00000000 : 0034 00000000 00000000 00000000 00000000 : 0038 00000000 00000000 00000000 00000000 : 003c 00000000 00000000 00000000 00000000 : 0040 00000000 00000000 00000000 00000000 : 0044 00000000 00000000 00000000 00000000 : 0048 00000000 00000000 00000000 00000000 : 004c 00000000 00000000 00000000 00000000 : 0050 00000000 00000000 00000000 00000000 : 0054 00000000 00008000 00008000 00008000 : 0058 00000000 00000000 00000000 00000000 : 005c 00008000 00008000 00008000 00008000 : 0060 00000000 00000000 00000000 00000000 : 0064 00000008 00000008 00000008 00000008 : 0068 00000000 00000000 00000000 00000000 : 006c 00004200 00004800 00004e00 00005400 : 0070 00000000 00000000 00000000 00000000 : 0074 00000000 ffffffff fffffffe fffffffd : 0078 00000000 ffffffff ffffffff ffffffff : 007c 00000000 00000000 00000000 00000000 : 0080 00000000 00000000 00000000 00000000 : 0084 00000001 000048d8 000048d8 000048d8 : 0088 00000000 00000000 00000000 00000000 : 008c 00000000 00000000 00000000 00000000 : 0090 00000000 00000000 00000000 00000000 : 0094 00000000 00000000 00000000 00000000 : 0098 00000000 2e313300 2e313300 2e313300 : 009c 000000b2 00000000 00000000 00000000 : 00a0 00000000 00000000 00000000 00000000 : 00a4 00000091 00800000 00800000 00800000 : 00a8 00000000 00000008 00000008 00000008 : 00ac 00000005 00000000 
00000000 00000000 : 00b0 00000000 00000000 00000000 00000000 : 00b4 00000000 00000000 00000000 00000000 : 00b8 00000000 00000000 00000000 00000000 : 00bc cns$shadow23+4 cns$fpcr cns$fpcr+4 cns$va cns$va+4 cns$va_ctl cns$va_ctl+4 cns$exc_addr cns$exc_addr+4 cns$ier_cm cns$ier_cm+4 cns$sirr cns$sirr+4 cns$isum cns$isum+4 cns$exc_sum cns$exc_sum+4 cns$pal_base cns$pal_base+4 cns$i_ctl cns$i_ctl+4 cns$pctr_ctl cns$pctr_ctl+4 cns$process_context cns$process_context+ cns$i_stat cns$i_stat+4 cns$dtb_alt_mode cns$dtb_alt_mode+4 cns$mm_stat cns$mm_stat+4 cns$m_ctl cns$m_ctl+4 cns$dc_ctl cns$dc_ctl+4 cns$dc_stat cns$dc_stat+4 cns$write_many cns$write_many+4 cns$virbnd cns$virbnd+4 cns$sysptbr cns$sysptbr+4 cns$report_lam cns$report_lam+4 00000000 00000000 00000000 00000000 : 0314 00000000 00000000 00000000 00000000 : 0318 8ff00000 8ff00000 8ff00000 8ff00000 : 031c ffffffec fe00385f fe00385f fe00385f : 0320 ffffffff 00000801 00000801 00000801 : 0324 00000000 00000000 00000000 00000000 : 0328 00000000 00000000 00000000 00000000 : 032c 00600930 00000000 00000000 00000000 : 0330 00000000 00000000 00000000 00000000 : 0334 00000000 00000000 00000000 00000000 : 0338 00000020 00000020 00000020 00000020 : 033c 00000000 00000000 00000000 00000000 : 0340 00000000 00000000 00000000 00000000 : 0344 00000000 00000000 00000000 00000000 : 0348 00000020 00000020 00000020 00000020 : 034c 00001fc0 000010c0 000010c0 000010c0 : 0350 00000000 00000000 00000000 00000000 : 0354 00008000 00008000 00008000 00008000 : 0358 00000000 00000000 00000000 00000000 : 035c 16300386 16300386 16300386 16300386 : 0360 00000000 00000000 00000000 00000000 : 0364 00000000 00000000 00000000 00000000 : 0368 00000000 00000000 00000000 00000000 : 036c 00000004 00000004 00000004 00000004 : 0370 00000000 00000000 00000000 00000000 : 0374 c0000000 80000000 00000000 80000000 : 0378 0000013d 00000142 00000174 0000017f : 037c 00000000 00000000 00000000 00000000 : 0380 00000000 00000000 00000000 00000000 : 0384 00000290 
000000e1 000000e1 000000e1 : 0388 00000000 00000000 00000000 00000000 : 038c 00000020 00000020 00000020 00000020 : 0390 00000000 00000000 00000000 00000000 : 0394 000000c3 000000c3 000000c3 000000c3 : 0398 00000000 00000000 00000000 00000000 : 039c 00000000 00000000 00000000 00000000 : 03a0 00000000 00000000 00000000 00000000 : 03a4 00000000 00000000 00000000 00000000 : 03a8 00000000 00000000 00000000 00000000 : 03ac 00000000 00000000 00000000 00000000 : 03b0 00000000 00000000 00000000 00000000 : 03b4 00000000 00000000 00000000 00000000 : 03b8 00000000 00000000 00000000 00000000 : 03bc 00000000 00000000 00000000 00000000 : 03c0 00000000 00000000 00000000 00000000 : 03c4 SRM Console Diagnostics 4-35 Example 4–15 show an info 6 display. Example 4–15 info 6 P00>>> info 6 per_cpu logout area mchk_crd__flag_frame mchk_crd__flag_frame+4 mchk_crd__offsets mchk_crd__offsets+4 mchk_crd__mchk_code mchk_crd__mchk_code+4 mchk_crd__i_stat mchk_crd__i_stat+4 mchk_crd__dc_stat mchk_crd__dc_stat+4 mchk_crd__c_addr mchk_crd__c_addr+4 mchk_crd__dc1_syndrome mchk_crd__dc1_syndrome+4 mchk_crd__dc0_syndrome mchk_crd__dc0_syndrome+4 mchk_crd__c_stat mchk_crd__c_stat+4 mchk_crd__c_sts mchk_crd__c_sts+4 mchk_crd__mm_stat mchk_crd__mm_stat+4 mchk_crd__os_flags mchk_crd__os_flags+4 mchk_crd__cchip_dirx mchk_crd__cchip_dirx+4 mchk_crd__cchip_misc mchk_crd__cchip_misc+4 mchk_crd__pachip0_serror mchk_crd__pachip0_serror+ mchk_crd__pachip0_aperror mchk_crd__pachip0_aperror mchk_crd__pachip0_gperror mchk_crd__pachip0_gperror mchk_crd__pachip0_agperro mchk_crd__pachip0_agperro mchk_crd__pachip1_serror mchk_crd__pachip1_serror+ mchk_crd__pachip1_aperror mchk_crd__pachip1_aperror mchk_crd__pachip1_gperror mchk_crd__pachip1_gperror mchk_crd__pachip1_agperro mchk_crd__pachip1_agperro mchk__flag_frame mchk__flag_frame+4 mchk__offsets mchk__offsets+4 mchk__mchk_code mchk__mchk_code+4 mchk__i_stat mchk__i_stat+4 mchk__dc_stat 4-36 ES45 Service Guide cpu00 00006000 00000000 : 0000 00000000 : 0004 
00000000 : 0008 00000000 : 000c 00000000 : 0010 00000000 : 0014 00000000 : 0018 00000000 : 001c 00000000 : 0020 00000000 : 0024 00000000 : 0028 00000000 : 002c 00000000 : 0030 00000000 : 0034 00000000 : 0038 00000000 : 003c 00000000 : 0040 00000000 : 0044 00000000 : 0048 00000000 : 004c 00000000 : 0050 00000000 : 0054 00000000 : 0058 00000000 : 005c 00000000 : 0060 00000000 : 0064 00000000 : 0068 00000000 : 006c 00000000 : 0070 00000000 : 0074 00000000 : 0080 00000000 : 0084 00000000 : 0078 00000000 : 007c 00000000 : 0088 00000000 : 008c 00000000 : 0090 00000000 : 0094 00000000 : 00a0 00000000 : 00a4 00000000 : 0098 00000000 : 009c 00000000 : 00a8 00000000 : 00ac 000000f8 : 00b0 00000000 : 00b4 00000018 : 00b8 000000a0 : 00bc 00000202 : 00c0 00000001 : 00c4 00000000 : 00c8 00000000 : 00cc 00000000 : 00d0 mchk__dc_stat+4 mchk__c_addr mchk__c_addr+4 mchk__dc1_syndrome mchk__dc1_syndrome+4 mchk__dc0_syndrome mchk__dc0_syndrome+4 mchk__c_stat mchk__c_stat+4 mchk__c_sts mchk__c_sts+4 mchk__mm_stat mchk__mm_stat+4 mchk__exc_addr mchk__exc_addr+4 mchk__ier_cm mchk__ier_cm+4 mchk__isum mchk__isum+4 mchk__reserved_0 mchk__reserved_0+4 mchk__pal_base mchk__pal_base+4 mchk__i_ctl mchk__i_ctl+4 mchk__process_context mchk__process_context+4 mchk__reserved_1 mchk__reserved_1+4 mchk__reserved_2 mchk__reserved_2+4 mchk__os_flags mchk__os_flags+4 mchk__cchip_dirx mchk__cchip_dirx+4 mchk__cchip_misc mchk__cchip_misc+4 mchk__pachip0_serror mchk__pachip0_serror+4 mchk__pachip0_aperror mchk__pachip0_aperror+4 mchk__pachip0_gperror mchk__pachip0_gperror+4 mchk__pachip0_agperror mchk__pachip0_agperror+4 mchk__pachip1_serror mchk__pachip1_serror+4 mchk__pachip1_aperror mchk__pachip1_aperror+4 mchk__pachip1_gperror mchk__pachip1_gperror+4 mchk__pachip1_agperror mchk__pachip1_agperror+4 00000000 : 00d4 00000000 : 00d8 00000000 : 00dc 00000000 : 00e0 00000000 : 00e4 00000000 : 00e8 00000000 : 00ec 00000000 : 00f0 00000000 : 00f4 00000000 : 00f8 00000000 : 00fc 00000000 : 0100 00000000 : 0104 
0009c250 : 0108 00000000 : 010c 80000000 : 0110 00000022 : 0114 00000000 : 0118 00000002 : 011c 00000000 : 0120 00000000 : 0124 00008000 : 0128 00000000 : 012c 16304386 : 0130 00000000 : 0134 00000004 : 0138 00000000 : 013c 00000000 : 0140 00000000 : 0144 00000000 : 0148 00000000 : 014c 00000001 : 0150 00000000 : 0154 00000000 : 0158 40000000 : 015c 00000000 : 0160 00000011 : 0164 00000000 : 0168 00000000 : 016c 00000000 : 0178 00000000 : 017c 00400002 : 0170 00000000 : 0174 00000000 : 0180 00000000 : 0184 00000000 : 0188 00000000 : 018c 00000000 : 0198 00000000 : 019c 00000000 : 0190 00000000 : 0194 00000000 : 01a0 00000000 : 01a4 SRM Console Diagnostics 4-37 Example 4–16 shows as info 7 display. Example 4–16 info 7 P00>>> info 7 Number of Errors Saved = 3 Error 1 0000 : 0001000400050018 0008 : 0000300a190f1324 0010 : 0000000300000170 0000 : 00010001000c0108 0008 : 0000000000000000 0010 : 00000000000000f8 0018 : 000000a000000018 0020 : 0000000100000098 0028 : 0000000020000000 0030 : 0000000000000000 0038 : 0000000000040000 0040 : 0000000000000000 0048 : 0000000000000000 0050 : 0000000000000000 0058 : 000000000000000d 0060 : 00000000000002d1 0068 : 00000000001caf00 0070 : 0000002280000000 0078 : 0000000200000000 0080 : 0000000000000000 0088 : 0000000000008000 0090 : 0000000016304386 0098 : 0000000000000004 00a0 : 0000000000000000 00a8 : 0000000000000000 00b0 : 0000000000000004 00b8 : 0000000000000000 00c0 : 0000000000000000 00c8 : 0000000000000000 00d0 : 0000000000000000 00d8 : 0000000000000000 00e0 : 0000000000000000 00e8 : 0000000000000000 00f0 : 0000000000000000 00f8 : 0000000000000000 0100 : 0000000000000000 4-38 ES45 Service Guide Console Uncorrectable Error Frame Header OCT 25 15:19:36 Processor Machine Check Frame CPU ID Frame Flag/Size Frame Offsets Frame Revision/Code I_STAT DC_STAT C_ADDR DC1_SYNDROME DC0_SYNDROME C_STAT C_STS MM_STAT EXC_ADDR IER_CM ISUM RESERVED PAL_BASE I_CTL PROCESS_CONTEXT Reserved Reserved OS Flags Cchip DIRx Cchip MISC PChip 0 
SERROR PChip 0 GPERROR PChip 0 APERROR PChip 0 AGPERROR PChip 1 SERROR PChip 1 GPERROR PChip 1 APERROR PChip 1 AGPERROR 0000 : 00010004000c0010 0008 : 0009 : 000a : 000b : 000c : 000d : 000e : 000f : Clipper DPR Extended Memory Frame f0 DPR AAR0 Config 40 DPR AAR0 Size d2 DPR AAR1 Config 10 DPR AAR1 Size f1 DPR AAR2 Config 40 DPR AAR2 Size d3 DPR AAR3 Config 10 DPR AAR3 Size 0000 : 0001000a000c0058 0008 : 0000000000009305 0010 : 0000000200007105 0018 : 0000000100009305 0020 : 0000000240007105 0028 : 0000000002831611 0030 : 00000004c38000c0 0038 : 48000004c2c200c0 0040 : 0000000002831711 0048 : 00000004c34000c0 0050 : 00000004c1c200c0 Titan Extended Memory Frame AAR0 AAR1 AAR2 AAR3 SCTL GPCTL APCTL SCTL GPCTL APCTL Error 2 0000 : 0001000400050018 0008 : 0000300a190f1426 0010 : 0000000200000218 Console Uncorrectable Error Frame Header OCT 25 15:20:38 0000 : 00010002000c0108 0008 : 0000000000000000 0010 : 00000000000000f8 System Machine Check Frame CPU ID Frame Flag/Size . . . 0108 : 0000000000000000 GTBA3 Error 3 0000 : 0001000200050018 0008 : 0000300a190f1523 0010 : 0000000100000080 System Event Frame Header OCT 25 15:21:35 0000 : 00010003000c0080 0008 : 0000000000000000 0010 : 0000000000000070 0018 : 0000001800000018 0020 : 0000000100000206 0028 : 0000000000000000 0030 : 0000000000000680 0038 : 0000000000000060 0040 : 000000000000000f 0048 : 0000000000000007 0050 : 0000000000000000 0058 : 0000000000000000 0060 : 0000000000000000 0068 : 0000000000000000 0070 : 0000000000000000 0078 : 0000000000000000 System Event Frame CPU ID Frame Flag/Size Frame Offsets Frame Revision/Code OS Flags Cchip DIRx TIG SMIR TIG CPUIR TIG PSIR LM78 ISR Door Open Temperature Warning Fan Fault Power Down Code Reserved SRM Console Diagnostics 4-39 Example 4–17 shows an info 8. Example 4–17 info 8 P00>>> info 8 0. HWRPB MEMDSC 1. Console PTE 2. GCT/FRU 5 3. Dump System CSRs 4. IMPURE area (abbreviated) 5. IMPURE area (full) 6. LOGOUT area 7. Dump Error Log 8. 
Clear Error Log
Enter selection:

4.12 kill and kill_diags

The kill and kill_diags commands terminate diagnostics that are currently executing.

Example 4–18 kill and kill_diags

P00>>> memexer 3
P00>>> show_status
ID       Program      Device       Pass   Hard/Soft Bytes Written Bytes Read
-------- ------------ ------------ ------ --------- ------------- ----------
00000001 idle         system            0    0    0             0          0
0000125e memtest      memory           12    0    0    6719275008 6719275008
00001261 memtest      memory           12    0    0    6689914880 6689914880
00001268 memtest      memory           11    0    0    6689914880 6689914880
0000126f exer_kid     dka0.0.0.2.1      0    0    0             0    8612352
00001270 exer_kid     dka100.1.0.2      0    0    0             0    8649728
00001271 exer_kid     dka200.2.0.2      0    0    0             0    8649728
00001278 exer_kid     dqa0.0.0.15.      0    0    0             0    3544064
00001280 exer_kid     dfa0.0.0.2.1     84    0    0             0    8619520
00001281 exer_kid     dfb0.0.0.102   1066    0    0             0  109256192
0000128e exer_kid     dva0.0.0.100      0    0    0             0     980992
00001381 nettest      ewa0.0.0.4.1    362    0    1       1018720    1018496
P00>>> kill_diags
dva0.0.0.1000.0 exer completed

packet                                     IOs              elapsed  idle
size    IOs  bytes read  bytes written    /sec  bytes/sec   seconds  secs
 512    112       28672          28672       5       2748        21    16

The kill command terminates a specified process. The kill_diags command terminates all diagnostics.

Syntax

kill_diags
kill [PID. . . ]

Arguments

[PID. . . ]   The process ID of the diagnostic to terminate. Use the show_status command to determine the process ID.

4.13 memexer

The memexer command runs a specified number of memory exercisers in the background. Nothing is displayed unless an error occurs. On each pass, each exerciser tests all available memory in blocks twice the size of the backup cache.

The following example shows no errors.
Example 4–19 memexer

P00>>> memexer 3
P00>>> show_status
ID       Program      Device       Pass   Hard/Soft Bytes Written Bytes Read
-------- ------------ ------------ ------ --------- ------------- ----------
00000001 idle         system            0    0    0             0          0
0000125e memtest      memory           12    0    0    6719275008 6719275008
00001261 memtest      memory           12    0    0    6689914880 6689914880
00001268 memtest      memory           11    0    0    6689914880 6689914880
0000126f exer_kid     dka0.0.0.2.1      0    0    0             0    8612352
00001270 exer_kid     dka100.1.0.2      0    0    0             0    8649728
00001271 exer_kid     dka200.2.0.2      0    0    0             0    8649728
00001278 exer_kid     dqa0.0.0.15.      0    0    0             0    3544064
00001280 exer_kid     dfa0.0.0.2.1     84    0    0             0    8619520
00001281 exer_kid     dfb0.0.0.102   1066    0    0             0  109256192
0000128e exer_kid     dva0.0.0.100      0    0    0             0     980992
00001381 nettest      ewa0.0.0.4.1    362    0    1       1018720    1018496

The following example shows a memory compare error indicating bad DIMMs. In most cases, the failing bank and DIMM position are specified in the error message.

P00>>> memexer 3
*** Hard Error - Error #41 - Memory compare error

Diagnostic Name   ID         Device   Pass   Test   Hard/Soft   11-FEB-1999
memtest           00000193   brd0      114      1   0           12:00:01

Expected value:   25c07
Received value:   35c07
Failing addr:     a11848

*** ERROR - DIMM J2 on MMB 1 Failed ***
P00>>> kill_diags
P00>>>

If the memory configuration is very large, the console might not test all of the memory. The upper limit is 1 GB.

Use the show_status command to display the progress of the tests. Use the kill or kill_diags command to terminate the test.

Syntax

memexer [number]

Arguments

[number]   Number of memory exercisers to start. The default is 1. The number of exercisers, as well as the length of time for testing, depends on the context of the testing.

4.14 memtest

The memtest command exercises a specified section of memory. Typically memtest is run from the built-in console script. Advanced users may want to use the specific options described here.
Example 4–20 memtest

P00>>> sh mem
Array      Size        Base Address      Intlv Mode
---------  ----------  ----------------  ----------
    0        256Mb     0000000060000000    2-Way
    1        512Mb     0000000040000000    2-Way
    2        256Mb     0000000070000000    2-Way
    3       1024Mb     0000000000000000    2-Way

2048 MB of System Memory
P00>>> memtest -sa 400000 -l 2000000 -p 10&
*** Hard Error - Error #43 - Memory compare error

Diagnostic Name   ID         Device   Pass   Test   Hard/Soft   2-JAN-2000
memtest           00000118   brd0        1      1    1    0     12:00:01

Expected value:   fffffffe
Received value:   ffffffff
Failing addr:     400004

*** Error - DIMM 3 on MMB 2 Failed ***

The elements of the example are as follows:

sh mem          Use the show memory command or an info 0 command to see where memory is located.
-sa 400000      Starting address.
-l 2000000      Length of the section to test, in bytes.
-p 10           Passcount. In this example, the test will run for 10 passes.
Error message   The test detected a failure on DIMM 3, which is located on MMB 2.

Use the show_status command to display the progress of the test. Use the kill or kill_diags command to terminate the test.

Memtest provides a graycode memory test. The test writes to memory and then reads the previously written value for comparison. The data in the section of memory under test is destroyed. The -z option allows testing outside of the main memory pool. Use caution because this option can overwrite the console.

Memtest may be run on any specified address. If the -z option is not included (default), the address is verified and allocated from the firmware's memory zone. If the -z qualifier is included, the test is started without verification of the starting address.

When a starting address is specified, the memory is allocated beginning 32 bytes below the starting address and extending for the length specified. The extra 32 bytes are reserved for the allocation header information. Therefore, if a starting address of 0xa00000 and a length of 0x100000 are requested, the area from 0x9fffe0 through 0xb00000 is reserved.
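The reservation arithmetic described above can be sketched in a few lines of Python. This is only an illustration, not console firmware code: the function names are hypothetical, and treating the reserved area as a half-open range (so that a request whose header begins exactly at 0xb00000 does not collide) is an assumption made for the sketch.

```python
HEADER_BYTES = 32  # size of the allocation header, as stated in the text


def reserved_range(start, length):
    """Return (first, end) of the area reserved for a memtest request.

    The header sits immediately below the starting address. The range is
    treated as half-open [first, end), which is an assumption of this sketch.
    """
    return start - HEADER_BYTES, start + length


def conflicts(request_a, request_b):
    """Return True if two (start, length) requests would reserve overlapping memory."""
    a_first, a_end = reserved_range(*request_a)
    b_first, b_end = reserved_range(*request_b)
    return a_first < b_end and b_first < a_end


first, end = reserved_range(0xA00000, 0x100000)
print(f"reserved: 0x{first:x} through 0x{end:x}")
```

Under these assumptions, a second request that starts at 0xb00000 collides with the first one because its header would fall inside the area already reserved, while a request that starts at 0xb00020 does not.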
The 32-byte header can cause confusion if you try to start two memtest processes simultaneously, one beginning at 0xa00000 for a length of 0x100000 and the other at 0xb00000 for a length of 0x100000. The second memtest process reports that it is “Unable to allocate memory of length 100000 at starting address b00000.” Instead, the second process should use the starting address 0xb00020.

NOTE: If memtest is used to test large sections of memory, testing may take a while to complete. If you issue a Ctrl/C or kill PID in the middle of testing, memtest may not abort right away. For speed reasons, the check for a Ctrl/C or kill is done outside of any test loops. If this is not satisfactory, you can run concurrent memtest processes in the background with shorter lengths within the target range.

Memtest Test 1: Graycode Test

Memtest Test 1 uses a graycode algorithm to test a specified section of memory. The graycode algorithm used is:

data = (x>>1)^x

where x is an incrementing value. Three passes are made over the memory under test.

• The first pass writes graycode and inverse graycode alternately, four longwords at a time. This causes many data bits to toggle between each 16-byte write. For example, the graycode patterns for a 32-byte block would be:

  Graycode(0)           00000000
  Graycode(1)           00000001
  Graycode(2)           00000003
  Graycode(3)           00000002
  Inverse Graycode(4)   FFFFFFF9
  Inverse Graycode(5)   FFFFFFF8
  Inverse Graycode(6)   FFFFFFFA
  Inverse Graycode(7)   FFFFFFFB

• The second pass reads each location, verifies the data, and writes the inverse of the data, one longword at a time. This causes every data bit to be written as both a one and a zero.

• The third pass reads and verifies each location.

You can specify the -f (fast) option so that the explicit data verify sections of the second and third loops are not performed. This does not catch address shorts but stresses memory with a higher throughput. The ECC/EDC logic can be used to detect failures.
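As a rough illustration of the first-pass data pattern, the following Python sketch applies the formula data = (x>>1)^x and inverts every second group of four longwords. The function names are hypothetical, and the assumption that the graycode/inverse alternation simply continues in groups of four beyond the first 32 bytes is ours, not a statement about the firmware.

```python
MASK32 = 0xFFFFFFFF  # a longword is 32 bits


def graycode(x):
    """Graycode of x, using the formula given in the text: (x >> 1) ^ x."""
    return ((x >> 1) ^ x) & MASK32


def first_pass_pattern(n_longwords):
    """Longword values for the first pass: four graycode, four inverse, repeating."""
    values = []
    for x in range(n_longwords):
        value = graycode(x)
        if (x // 4) % 2 == 1:  # odd groups of four hold the inverse graycode
            value = ~value & MASK32
        values.append(value)
    return values


# Print the pattern for one 32-byte block (eight longwords).
for x, value in enumerate(first_pass_pattern(8)):
    print(f"{x}: {value:08X}")
```

For eight longwords this reproduces the values listed above, from 00000000 for Graycode(0) through FFFFFFFB for Inverse Graycode(7).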
Syntax

memtest ( [-sa <start_address>] [-ea <end_address>] [-l <length>]
          [-bs <block_size>] [-i <address_inc>] [-p <pass_count>]
          [-d <data_pattern>] [-rs <random_seed>] [-ba <block_address>]
          [-t <test_mask>] [-se <soft_error_threshold>] [-g <group_name>]
          [-rb] [-f] [-m] [-z] [-h] [-mb] )

Options

-sa   Start address. Default is first free space in memzone.
-ea   End address. Default is start address plus length size.
-l    Length of the section to test, in bytes. Default is the zone size with the -rb option and the block_size for all other tests. -l has precedence over -ea.
-bs   Block (packet) size in bytes, in hex. Default is 8192 bytes. This is used only for the random block test. For all other tests the block size equals the length.
-i    Specifies the address increment value in longwords. This value is used to increment the address through the memory to be tested. The default is 1 (longword). This is implemented only for the graycode test. An address increment of 2 tests every other longword. This option is useful for multiple CPUs testing the same physical memory.
-p    Passcount. If 0, run indefinitely or until Ctrl/C is issued. Default = 1.
-t    Test mask. Default = run all tests in selected group.
-g    Group name.
-se   Soft error threshold.
-f    Fast. If -f is included in the command line, the data compare is omitted. Detects only ECC/EDC errors.
-m    Timer. Prints out the run time of the pass. Default = off.
-z    Tests the specified memory address without allocation. Bypasses all checking but allows testing in addresses outside of the main memory heap. Also allows unaligned input.
      CAUTION: This flag can overwrite the console. If the system hangs, press the Reset button.
-d    Used only for march test (2). Uses this pattern as test pattern. Default = 5's.
-h    Allocates test memory from the firmware heap.
-rs   Used only for random test (3). Uses this data as the random seed to vary random data patterns generated. Default = 0.
-rb   Randomly allocates and tests all of the specified memory address range. Allocations are made in units of block_size.
-mb   Memory barrier flag. Used only in the -f graycode test. When set, an mb is done after every memory access. This guarantees serial access to memory.
-ba   Used only for block test (4). Uses the data stored at this address to write to each block.

4.15 net

The net command performs maintenance operations on a specified Ethernet port. Net -ic initializes the MOP counters for the specified Ethernet port, and net -s displays the current status of the port, including the contents of the MOP counters.

Example 4–21 net -ic and net -s

P00>>> net -ic ewa0
P00>>> net -s ewa0

Status counts:
ti: 72 tps: 0 tu: 47 tjt: 0 unf: 0 ri: 70 ru: 0
rps: 0 rwt: 0 at: 0 fd: 0 lnf: 0 se: 0 tbf: 0
tto: 1 lkf: 1 ato: 1 nc: 71 oc: 0

MOP BLOCK:
 Network list size: 0

MOP COUNTERS:
Time since zeroed (Secs): 3

TX:
 Bytes: 0 Frames: 0
 Deferred: 0 One collision: 0 Multi collisions: 0
TX Failures:
 Excessive collisions: 0 Carrier check: 0 Short circuit: 0
 Open circuit: 0 Long frame: 0 Remote defer: 0
 Collision detect: 0

RX:
 Bytes: 0 Frames: 0
 Multicast bytes: 0 Multicast frames: 0
RX Failures:
 Block check: 0 Framing error: 0 Long frame: 0
 Unknown destination: 0 Data overrun: 0 No system buffer: 0
 No user buffers: 0
P00>>>

Syntax

net [-ic]
net [-s]

Arguments

<port_name>   Specifies the Ethernet port on which to operate, either ei*0 or ew*0.

4.16 nettest

The nettest command tests the network ports using MOP loopback. Typically nettest is run from the built-in console script. Advanced users may want to use the specific options and environment variables described here.
Example 4–22 nettest

P00>>> nettest ei*
P00>>> nettest -mode in ew*
P00>>> nettest -mode ex -w 10 e*

The commands in the example do the following:

nettest ei*                  Internal loopback test on port ei*0
nettest -mode in ew*         Internal loopback test on ports ewa0/ewb0
nettest -mode ex -w 10 e*    External loopback test on port eia0 or ewa0; wait 10 seconds between tests

Nettest performs a network test. It can test the ei* or ew* ports in internal loopback, external loopback, or live network loopback mode.

Nettest contains the basic options to run MOP loopback tests. Many environment variables can be set from the console to customize nettest before it is started. The environment variables, a brief description, and their default values are listed in the syntax table in this section. Each variable name is preceded by e*a0_ or e*b0_ to specify the desired port.

You can change other network driver characteristics by modifying the port mode. See the -mode option.

Use the show_status display to determine the process ID when terminating an individual diagnostic test. Use the kill or kill_diags command to terminate tests.

Syntax

nettest ( [-f <file>] [-mode <port_mode>] [-p <pass_count>]
          [-sv <mop_version>] [-to <loop_time>] [-w <wait_time>]
          [<port>] )

Arguments

<port>   Specifies the Ethernet port on which to run the test.

Options

-f <file>
      Specifies the file containing the list of network station addresses to loop messages to. The default file name is lp_nodes_e*a0 for port e*a0 and lp_nodes_e*b0 for port e*b0. By default, each file contains the port's own station address.

-mode <port_mode>
      Specifies the mode to set the port adapter (TGEC). The default is ex (external loopback).
      Allowed values are:
        df   : default, use environment variable values
        ex   : external loopback
        in   : internal loopback
        nm   : normal mode
        nf   : normal filter
        pr   : promiscuous
        mc   : multicast
        ip   : internal loopback and promiscuous
        fc   : force collisions
        nofc : do not force collisions
        nc   : do not change mode

-p <pass_count>
      Specifies the number of times to run the test. If 0, then run until terminated by a kill or kill_diags command. The default is 1.
      NOTE: This is the number of passes for the diagnostic. Each pass will send the number of loop messages as set by the environment variable, eia*_loop_count or ewa*_loop_count.

-sv <mop_version>
      Specifies which MOP version protocol to use. If 3, then MOP V3 (DECNET Phase IV) packet format is used. If 4, then MOP V4 (DECNET Phase V IEEE 802.3) format is used.

-to <loop_time>
      Specifies the time in seconds allowed for the loop messages to be returned. The default is 2 seconds.

-w <wait_time>
      Specifies the time in seconds to wait between passes of the test. The default is 0 (no delay). The network device can be very CPU intensive. This option will allow other processes to run.

Environment Variables

e*a*_loop_count
      Specifies the number (hex) of loop requests to send. The default is 0x3E8 loop packets.

e*a*_loop_inc
      Specifies the number (hex) of bytes the message size is increased on successive messages. The default is 0xA bytes.

e*a*_loop_patt
      Specifies the data pattern (hex) for the loop messages. The following are legitimate values.
        0 : all zeros
        1 : all ones
        2 : all fives
        3 : all 0xAs
        4 : incrementing data
        5 : decrementing data
        ffffffff : all patterns

e*a*_loop_size
      Specifies the size (hex) of the loop message. The default packet size is 0x2E.

4.17 set sys_serial_num

The set sys_serial_num command sets the system serial number. This command is used by Manufacturing for establishing the system serial number, which is then propagated to all FRU devices that have EEPROMs.
The sys_serial_num environment variable can be read by the operating system. IMPORTANT: The system serial number must be set correctly. Compaq Analyze will not work with an incorrect serial number. Example 4–23 set sys_serial_num P00>>> set sys_serial_num NI900100022 When the system motherboard (SMB) is replaced, you must use the set sys_serial_num command to restore the master setting. Syntax set sys_serial_num value Value is the system serial number, which is on a sticker on the back of the system chassis. SRM Console Diagnostics 4-55 4.18 show error The show error command reports errors logged to the FRU EEPROMs. Example 4–24 show error P00>>> show error SMB0 001f8408 SMB0 001f8408 001f8418 001f8428 001f8438 001f8448 001f8458 SMB0 001f8408 001f8418 001f8428 001f8438 SMB0 001f8408 001f8418 001f8428 001f8438 SMB0 001f8408 001f8418 001f8428 001f8438 001f8408 001f8418 001f8428 001f8438 SMB0 P00>>> 4-56 TDD - Type: 15 Test: 15 SubTest: 15 Error: 15 0F 0F 0F 0F 0F 0F 0F 0F 0F 0F 0F 0F 0F 0F 0F 0F ................ SDD - Type: 14 LastLog: 0 Overwrite: 0 0F 0F 0F 0F 0F 0F 0F 0F 0F 0F 0F 0F 0F 0F 0F 0F ................ 0F 0F 0F 0F 0F 0F 0F 0F 0F 00 00 00 00 00 00 00 ................ 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ 00 00 00 00 00 00 00 00 00 00 00 FF 00 00 00 00 ................ 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ 00 00 00 00 00 00 00 00 ........ Bad checksum 0 to 64 EXP:dc RCV:dd 80 08 00 01 53 00 01 00 00 00 00 00 00 00 00 00 ....S........... 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ FF 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 DD ...............Y Bad checksum 64 to 126 EXP:e1 RCV:0f 4A FF FF FF FF FF FF FF 02 35 34 2D 31 32 33 34 J........54-1234 35 2D 30 31 2E 41 30 30 31 20 20 00 00 09 44 91 5-01.A001 ...D. 
34 51 15 41 41 41 41 41 41 41 41 41 41 41 41 41 4Q.AAAAAAAAAAAAA 0F 0F 0F 0F 0F 0F 0F 0F 0F 0F 0F 0F 0F 0F 0F 0F ................ Bad checksum 128 to 254 EXP:0c RCV:0d 0F 0F 0F 0F 0F 0F 0F 0F 0F 0F 0F 0F 0F 0F 0F 0F ................ 0F 0F 0F 0F 0F 0F 0F 0F 0F 0F 0F 0F 0F 0F 0F 0F ................ 0F 0F 0F 0F 0F 0F 0F 0F 0F 0F 0F 0F 0F 0F 0F 0F ................ 0F 0F 0F 0F 0F 0F 0F 0F 0F 0F 0F 0F 0F 0F 00 00 ................ 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ FF 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ 00 00 00 00 00 00 00 00 00 00 00 00 00 4A 21 0D .............J!. SYS_SERIAL_NUM Mismatch ES45 Service Guide The output of the show error command is based on information logged to the serial control bus EEPROMs on the system FRUs. Both the operating system and the ROM-based diagnostics log errors to the EEPROMs. This functionality allows you to generate an error log from the console environment. No errors are displayed for fans or the OCP because these components do not have an EEPROM. Syntax show error All FRUs with errors are displayed. If no errors are logged, nothing is displayed and you are returned to the SRM console prompt. Example 4–24 shows TDD, SDD, checksum, and sys_serial_num mismatch errors logged to the EEPROM on the system motherboard (SMB0). Table 4–2 shows a reference to these errors. The bit masks correspond to the bit masks that would be displayed in the E field of the show fru command. FRU to which errors are logged; in this example the system motherboard, SMB0. A TDD error has been logged. TDDs (test-directed diagnostics) test specific functions sequentially. Typically, nothing else is running during the test. TDDs are performed in SROM or XSROM or early in the console power-up flow. An SDD error has been logged. 
SDDs (symptom-directed diagnostics) are generic diagnostic exercisers that try to cause random behavior and look for failures or “symptoms.” All SDDs are logged by Compaq Analyze.

Three checksum errors have been logged.

There was a mismatch between the serial number on the system motherboard and the system serial number. This could occur if a motherboard from a system with a different serial number was swapped into this system.

Table 4–2 Show Error Message Translation

Columns: Bit Mask (E Field), Text Message, Meaning and Action

01   <fruname> Hardware Failure
     Module failure. FRUs that are known to be connected but are unreadable are considered hardware failures. An example is power supplies.

02   <fruname> TDD - Type:0 Test: 0 SubTest: 0 Error: 0
     Serious error. Run the Compaq Analyze GUI, if necessary, to determine what action to take. If you cannot run Compaq Analyze, replace the module.

04   <fruname> SDD - Type:0 LastLog: 0 Overwrite: 0
     Serious error. Compaq Analyze (CA) has written a FRU callout into the SDD area and DPR global area. Follow the instructions given by Compaq Analyze.

08   <fruname> EEPROM Unreadable
     Reserved.

10   <fruname> Bad checksum 0 to 64 EXP:01 RCV:02
     Informational. Use the clear_error command to clear the error unless TDD or SDD is also set.

20   <fruname> Bad checksum 64 to 126 EXP:01 RCV:02
     Informational. Use the clear_error command to clear the error unless TDD or SDD is also set.

40   <fruname> Bad checksum 128 to 254 EXP:01 RCV:02
     Informational. Use the clear_error command to clear the error unless TDD or SDD is also set.

80   <fruname> SYS_SERIAL_NUM Mismatch
     Informational. Use the clear_error command to clear the error unless TDD or SDD is also set.

4.19 show fru

The show fru command displays the physical configuration of FRUs. Use show fru -e to display FRUs with errors.
Example 4–25 show fru P00>>> build smb0 54-25385-01.e01 ay94412345 P00>>> show fru FRUname SMB0 SMB0.CPU0 SMB0.CPU1 SMB0.CPU2 SMB0.CPU3 SMB0.MMB0 SMB0.MMB0.J4 SMB0.MMB0.J8 SMB0.MMB0.J5 SMB0.MMB0.J9 SMB0.MMB0.J2 SMB0.MMB0.J6 SMB0.MMB0.J3 SMB0.MMB0.J7 SMB0.MMB1 SMB0.MMB1.J4 SMB0.MMB1.J8 SMB0.MMB1.J5 SMB0.MMB1.J9 SMB0.MMB1.J2 SMB0.MMB1.J6 SMB0.MMB1.J3 SMB0.MMB1.J7 SMB0.MMB2 SMB0.MMB2.J4 SMB0.MMB2.J8 SMB0.MMB2.J5 SMB0.MMB2.J9 SMB0.MMB2.J2 SMB0.MMB2.J6 SMB0.MMB2.J3 SMB0.MMB2.J7 SMB0.MMB3 SMB0.MMB3.J4 SMB0.MMB3.J8 SMB0.MMB3.J5 SMB0.MMB3.J9 SMB0.MMB3.J2 SMB0.MMB3.J6 SMB0.MMB3.J3 SMB0.MMB3.J7 SMB0.CPB0 E Part# Serial# Model/Other Alias/Misc 00 54-30292-02.A01 SW03300011 00 54-30466-01.A01 SW03200044 00 54-30466-01.A01 SW03200042 00 54-30466-01.A01 SW03200037 00 54-30466-02.A01 SW04300170 00 54-30348-02.A01 SW02800955 00 54-30350-DB.A02 CP SW03600466 00 54-30350-DB.A02 CP SW03600482 00 54-30350-DB.A02 CP SW03600472 00 54-30350-DB.A02 CP SW03600525 00 54-30350-DB.A02 CP SW03600468 00 54-30350-DB.A02 CP SW03600471 00 54-30350-DB.A02 CP SW03600473 00 54-30350-DB.A02 CP SW03600522 00 54-30348-02.A01 SW02800940 00 54-30350-AA.A01 CP SW02100172 00 54-30350-AA.A01 CP SW02100172 00 54-30350-AA.A01 CP NI01400121 00 54-30350-AA.A01 CP SW01400169 00 54-30350-AA.A01 CP SW01400214 00 54-30350-AA.A01 CP SW01400205 00 54-30350-AA.A01 CP SW02100172 00 54-30350-AA.A01 CP SW02100172 00 54-30348-02.A01 SW02800932 00 54-30350-DB.A02 CP SW03600523 00 54-30350-DB.A02 CP SW03600519 00 54-30350-DB.A02 CP SW03600527 00 54-30350-DB.A02 CP SW03600469 00 54-30350-DB.A02 CP SW03600465 00 54-30350-DB.A02 CP SW03600467 00 54-30350-DB.A02 CP SW03600479 00 54-30350-DB.A02 CP SW03600477 00 54-30348-02.A01 SW02800927 00 54-30350-AA.A01 CP SW02100172 00 54-30350-AA.A01 CP SW02100172 02 54-30350-AA.A01 CP NI01400118 00 54-30350-AA.A01 CP SW01400170 02 54-30350-AA.A01 CP NI01400122 00 54-30350-AA.A01 CP SW01400210 00 54-30350-AA.A01 CP SW02100172 00 54-30350-AA.A01 CP SW02100172 00 54-30418-01.A01 SW03000307 
SRM Console Diagnostics 4-59 JIO0 SMB0.CPB0.PCI1 SMB0.CPB0.PCI2 SMB0.CPB0.PCI3 SMB0.CPB0.PCI5 SMB0.CPB0.PCI7 SMB0.CPB0.PCI8 SMB0.CPB0.PCI9 OCP0 PWR0 PWR1 PWR2 FAN1 FAN2 FAN3 FAN4 FAN5 FAN6 00 54-25575-01 00 00 00 00 00 00 00 00 70-33894-0x 00 30-49448-01. C05 00 30-49448-01. C05 00 30-49448-01. C05 00 70-40073-01 00 70-40073-01 00 70-40072-01 00 70-40071-01 00 70-40073-02 00 70-40074-01 - Junk I/O NCR 53C895 DECchip 21 DE500-AA N ELSA GLori DEGPA-SA DEGPA-SA NCR 53C895 OCP 30-49448-0 API-76 30-49448-0 API-76 30-49448-0 API-76 Fan Fan Fan Fan Fan Fan 7f 7f 7f P00>>> FRUname SMB = system motherboard; CPU = CPUs; MMB = memory motherboard; DIM = DIMMs; CPB = PCI backplane; PCI = PCI option; SBM = SCSI backplane; PWR = power supply; FAN = fans; JIO= I/O connector module (junk I/O). E Part # Serial # 4-60 The FRU name recognized by the SRM console. The name also indicates the location of that FRU in the physical hierarchy. Error field. Indicates whether the FRU has any errors logged against it. FRUs without errors show 00 (hex). FRUs with errors have a non-zero value that represents a bit mask of possible errors. See Table 4–3. The part number of the FRU in ASCII, either a Compaq part number or a vendor part number. The serial number. For Compaq FRUs, the serial number has the form XXYWWNNNNN. XX = manufacturing location code YWW = year and week NNNNN = sequence number. For vendor FRUs, the 4-byte sequence number is displayed in hex. ES45 Service Guide Model/Other Optional data. For Compaq FRUs, the Compaq part alias Alias/Misc number (if one exists). For vendor FRUs, the year and week of manufacture. Miscellaneous information about the FRUs. For Compaq FRUs, a model name, number, or the common name for the entry in the Part # field. For vendor FRUs, the manufacturer's name. Table 4–3 lists bit assignments for failures that could potentially be listed in the E (error) field of the show fru command. 
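As a hedged illustration of how the E field can be read, the Python sketch below expands a two-character hex value into the individual conditions, using the bit assignments from Table 4–3. The dictionary and function names are hypothetical and exist only for this example; the console itself provides no such helper.

```python
# Bit assignments copied from Table 4-3 (E field of the show fru display).
E_FIELD_BITS = {
    0x01: "Hardware failure",
    0x02: "TDD error has been logged",
    0x04: "SDD error has been logged",
    0x08: "Reserved",
    0x10: "Checksum failure on bytes 0-62",
    0x20: "Checksum failure on bytes 64-126",
    0x40: "Checksum failure on bytes 128-254",
    0x80: "FRU's system serial number does not match system's",
}


def decode_e_field(text):
    """Return the list of conditions encoded in a two-character hex E field."""
    value = int(text, 16)
    return [meaning for bit, meaning in sorted(E_FIELD_BITS.items()) if value & bit]


for field in ("00", "02", "06"):
    print(field, decode_e_field(field))
```

A value of 00 decodes to an empty list (no errors logged), and 06 decodes to both the TDD and SDD entries.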
Because the E field is only two characters wide, bits are “or’ed” together if the device has multiple errors. For example, the E field for a FRU with both TDD (02) and SDD (04) errors would be 06:

010 | 100 = 110 (6)

Table 4–3 Bit Assignments for Error Field

Bit Mask (E Field)   Meaning
01                   Hardware failure
02                   TDD error has been logged
04                   SDD error has been logged
08                   Reserved
10                   Checksum failure on bytes 0-62
20                   Checksum failure on bytes 64-126
40                   Checksum failure on bytes 128-254
80                   FRU’s system serial number does not match system’s

4.20 show_status

The show_status command displays the progress of diagnostics. The command reports one line of information per executing diagnostic. Many of the diagnostics run in the background and provide information only if an error occurs.

Example 4–26 show_status

P00>>> show_status
ID       Program      Device       Pass   Hard/Soft Bytes Written Bytes Read
-------- ------------ ------------ ------ --------- ------------- ----------
00000001 idle         system            0    0    0             0          0
0000125e memtest      memory           12    0    0    6719275008 6719275008
00001261 memtest      memory           12    0    0    6689914880 6689914880
00001268 memtest      memory           11    0    0    6689914880 6689914880
0000126f exer_kid     dka0.0.0.2.1      0    0    0             0    8612352
00001270 exer_kid     dka100.1.0.2      0    0    0             0    8649728
00001271 exer_kid     dka200.2.0.2      0    0    0             0    8649728
00001278 exer_kid     dqa0.0.0.15.      0    0    0             0    3544064
00001280 exer_kid     dfa0.0.0.2.1     84    0    0             0    8619520
00001281 exer_kid     dfb0.0.0.102   1066    0    0             0  109256192
0000128e exer_kid     dva0.0.0.100      0    0    0             0     980992
00001381 nettest      ewa0.0.0.4.1    362    0    1       1018720    1018496
P00>>>

The columns of the display are as follows:

ID              Process ID.
Program         The SRM diagnostic for the particular device.
Device          The ID of the device under test.
Pass            Number of diagnostic passes that have been completed.
Hard/Soft       Error count (hard and soft). Soft errors are not usually fatal; hard errors halt the system or prevent completion of the diagnostics.
Bytes Written   Bytes successfully written by the diagnostic.
Bytes Read      Bytes successfully read by the diagnostic.
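When show_status output is captured from a console log, each line can be split into the fields described above. The following Python sketch is one possible way to do that, under the assumptions that fields are separated by whitespace, that device names contain no spaces, and that the ID is hexadecimal; the StatusLine type and parse_status_line function are hypothetical names, not part of any Compaq tool.

```python
from typing import NamedTuple


class StatusLine(NamedTuple):
    pid: int            # process ID (shown in hex by the console)
    program: str        # diagnostic name
    device: str         # device under test
    passes: int         # completed passes
    hard: int           # hard error count
    soft: int           # soft error count
    bytes_written: int
    bytes_read: int


def parse_status_line(line):
    """Parse one data line of show_status output into a StatusLine."""
    fields = line.split()
    if len(fields) != 8:
        raise ValueError(f"expected 8 fields, got {len(fields)}: {line!r}")
    return StatusLine(
        int(fields[0], 16), fields[1], fields[2],
        int(fields[3]), int(fields[4]), int(fields[5]),
        int(fields[6]), int(fields[7]),
    )


entry = parse_status_line("00001381 nettest ewa0.0.0.4.1 362 0 1 1018720 1018496")
if entry.hard or entry.soft:
    print(f"{entry.program} on {entry.device}: {entry.hard} hard, {entry.soft} soft errors")
```

Parsing the output this way makes it easy to flag any diagnostic whose hard or soft error count is non-zero, such as the nettest entry in the example.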
The following command string is useful for periodically displaying diagnostic status information for diagnostics running in the background:

P00>>> while true;show_status;sleep n;done

where n is the number of seconds between show_status displays.

Syntax

show_status

4.21 sys_exer

The sys_exer command exercises the devices displayed with the show config command. Tests are run concurrently and in the background. Nothing is displayed after the initial test startup messages unless an error occurs.

Example 4–27 sys_exer

P00>>> sys_exer
Default zone extended at the expense of memzone.
Use INIT before booting
Exercising the Memory
Exercising the DK* Disks(read only)
Exercising the DQ* Disks(read only)
Exercising the DF* Disks(read only)
Exercising the Floppy(read only)
Testing the VGA (Alphanumeric Mode only)
Exercising the EWA0 Network
Type "show_status" to display testing progress
Type "cat el" to redisplay recent errors
Type "init" in order to boot the operating system
P00>>> show_status
ID       Program      Device       Pass   Hard/Soft Bytes Written Bytes Read
-------- ------------ ------------ ------ --------- ------------- -----------
00000001 idle         system            0    0    0             0           0
0000125e memtest      memory           12    0    0    6719275008  6719275008
00001261 memtest      memory           12    0    0    6689914880  6689914880
00001268 memtest      memory           11    0    0    6689914880  6689914880
0000126f exer_kid     dka0.0.0.2.1      0    0    0             0     8612352
00001270 exer_kid     dka100.1.0.2      0    0    0             0     8649728
00001271 exer_kid     dka200.2.0.2      0    0    0             0     8649728
00001278 exer_kid     dqa0.0.0.15.      0    0    0             0     3544064
00001280 exer_kid     dfa0.0.0.2.1     84    0    0             0     8619520
00001281 exer_kid     dfb0.0.0.102   1066    0    0             0   109256192
0000128e exer_kid     dva0.0.0.100      0    0    0             0      980992
00001381 nettest      ewa0.0.0.4.1    362    0    1       1018720     1018496
P00>>> init
OpenVMS PALcode V1.91-33, Tru64 UNIX PALcode V1.87-27
...
starting console on CPU 0

Use the show_status command to display the progress of diagnostic tests.
The diagnostics started by the sys_exer command automatically reallocate memory resources, because these tests require additional resources. Use the init command to reconfigure memory before booting an operating system.

Because the sys_exer tests are run concurrently and indefinitely (until you stop them with the init command), they are useful in flushing out intermittent hardware problems.

When using the sys_exer command after shutting down an operating system, you must initialize the system to a quiescent state. Enter the following commands at the SRM console:

P00>>> init
.
.
.
P00>>> sys_exer

By default, no write tests are performed on disk and tape drives. Media must be installed to test the floppy drive and tape drives. When the -lb argument is used, a loopback connector is required for the COM2 port (9-pin loopback connector, 12-27351-01) and the parallel port (25-pin loopback connector).

Syntax

sys_exer [-lb] [-t]

Arguments

[-lb]   The loopback option runs console loopback tests for the COM2 serial port and the parallel port during the test sequence.
[-t]    Number of seconds to run. The default is to run until terminated by a kill or kill_diags command.

4.22 test

The test command verifies all the devices in the system. This command can be used on all supported operating systems.

Example 4–28 test -lb

P00>>> test -lb
Testing the Memory
Testing the DK* Disks(read only)
No DU* Disks available for testing
No DR* Disks available for testing
Testing the DQ* Disks(read only)
Testing the DF* Disks(read only)
No MK* Tapes available for testing
No MU* Tapes available for testing
Testing the DV* Floppy Disks(read only)
Testing the Serial Port 1(external loopback)
Testing the parallel Port(external loopback)
Testing the VGA (Alphanumeric Mode only)
Testing the EW* Network
P00>>>

The test command also does a quick test on the system speaker. A beep is emitted as the command starts to run.
The tests are run sequentially, and the status of each subsystem test is displayed to the console terminal as the tests progress. If a particular device is not available to test, a message is displayed. The test script does no destructive testing; that is, it does not write to disk drives.

Syntax

test [argument]

Use the -lb (loopback) argument for console loopback tests.

To run a complete diagnostic test using the test command, the system configuration must include:

• A serial loopback connected to the COM2 port (not included)
• A parallel loopback connected to the parallel port (not included)
• A trial diskette with files installed
• A trial CD-ROM with files installed

The test script tests devices in the following order:

1. Memory tests (one pass)
2. Read-only tests: DK* disks, DR* disks, DQ* disks, MK* tapes, DV* floppy.
   NOTE: You must install media to test disks, tapes, and the floppy drive. Since no write tests are performed, it is safe to test disks and tapes that contain data.
3. Console loopback tests if -lb argument is specified: COM2 serial port and parallel port.
4. VGA console tests: These tests are run only if the console environment variable is set to serial. The VGA console test displays rows of the word compaq.
5. Network internal loopback tests for EW* networks.

Chapter 5 Error Logs

This chapter tells how to interpret error logs reported by the operating system. The following topics are covered:

• Error Log Analysis with Compaq Analyze
• Fault Detection and Reporting
• Machine Checks/Interrupts
• Environmental Errors Captured by SRM

5.1 Error Log Analysis with Compaq Analyze

Compaq Analyze (CA) is a fault management diagnostic tool that is used to determine the cause of hardware failures. Compaq Analyze performs system diagnostic processing of both single and multiple error/fault events.
Compaq Analyze may or may not be installed on the customer's system with the operating system, depending on the release cycle. If CA is installed, the Compaq Analyze Director starts automatically as part of the system start-up.

CA provides automatic background analysis. When an error event occurs, it triggers the firing of an analysis rule. The analysis engine collects and processes the information and typically generates a “problem found” report, if appropriate. The report can be automatically sent to users on a notification mailing list and, if DSNlink is installed, a call can be logged with the customer support center.

Compaq Analyze supports the Tru64 UNIX and OpenVMS operating systems on Alpha platforms.

NOTE: Compaq Analyze is a successor tool to DECevent and typically does not support the same systems as DECevent.

UNIX Indictment

For each CPU indictment that is sent to the operating system, a callout report is generated. After the bad component is replaced, the following commands must be executed to bring the new component on-line for use.

The following is an example of using the indictment commands.

# hwmgr -status comp -ngood

                   STATUS    ACCESS               INDICT
HWID:  HOSTNAME    SUMMARY   STATE     STATE      LEVEL    NAME
   2:  mcsse1      critical  offline   available  high     CPU0

# hwmgr -online -name CPU0
hwmgr:CPU0 is now online

NOTE: The indicted problem state attached to the previous component is still in effect even though a new component may have been inserted. Use the command “hwmgr -unindict -id <hwid>” to clear the problem state when the component is operating properly.

The following is an example of a CPU unindict command.

# hwmgr -unindict -id 2
hwmgr:Unindict operation was successful

5.1.1 WEB Enterprise Service (WEBES) Director

Compaq Analyze uses the functionality contained in the WEBES Director, a process that manages all other WEBES processes and executes continuously on the machine when configured to do so.
The Director manages the decomposition processing of system error events, provides required information to the analysis engine, and performs notification message routing for the system. Compaq Analyze provides the functionality for system event analysis and Bit-To-Text (BTT) translation.

Compaq Analyze includes common WEBES code. Subsequent releases of Compaq Analyze will continue to ship with the common WEBES code.

The Director is started when the system is booted. Normally you do not need to start the Director. If the Director has stopped running, restart it by following the instructions in the WEBES Compaq Analyze User Guide documentation.

Compaq Analyze includes a Web-based graphical user interface (WUI) that allows the user to interact with the Director. While only one Director process executes on the machine at any time, many WUI processes can run at the same time, connected to the single Director. To launch the Compaq Analyze WUI, refer to the Compaq Analyze installation and user manuals for the respective operating system.

The Compaq service tools Web site available to customers is:

http://www.support.compaq.com/svctools

The applicable Compaq Analyze documentation includes the following:

• Compaq Analyze User's Guide
• Compaq Analyze Installation Guide for Tru64 UNIX
• Compaq Analyze Installation Guide for OpenVMS
• Compaq Analyze Release Notes

5.1.2 Using Compaq Analyze

After you have logged on to Compaq Analyze, the following screen appears. If an event has occurred, it is listed under “localhost” events. See Figure 5–1.

Figure 5–1 Compaq Analyze Initial Screen

1. In this example, the Other Logs file is selected and the Problem Reports display in Figure 5–2 appears.

Figure 5–2 Problem Reports Screen

2. Cpu_Mem_630.sys is selected and the problem reports are listed. You may select any log listed in Other Logs to view a list of all problems found.
You may also view each report by clicking on the underlined hot link under Problem Reports.

3. Figure 5–3 provides an example problem report.

Figure 5–3 Compaq Analyze Problem Report Details

Managed Entity
    The Managed Entity designator includes the system host name (typically a computer name for networking purposes), the type of computer system (“Compaq AlphaServer ES45”), and the error event identification. The error event identification uses new common event header Event_ID_Prefix and Event_ID_Count components. The Event_ID_Prefix refers to an OS specific identification for this event type. The Event_ID_Count indicates the number of this event and the event type.

Service Obligation Data
    Provides Obligation number and validity, system serial number, and company name of service provider.

Brief Description
    The Brief Description designator indicates whether the error event is related to the CPU, system (PCI, storage, and so on), or environmental subsystem.

Callout ID
    The last 12 characters of the Callout ID designator can be used to determine the revision level of the analysis rule-set that is being used.

Full Description
    The Full Description designator provides detailed error information, which can include a description of the detected fault or error condition, the specific address or data bit where this fault or error occurred, the probable FRU list, and service related information.

FRU List
    The FRU List designator lists the most probable defective FRUs. This list indicates that one or more of these FRUs needs to be serviced. The information typically includes the FRU probability, manufacturer, system device type, system physical location, part number, serial number, and firmware revision level (if applicable).

5.1.3 Bit to Text

The following table is an example of the Common Event Header (CEH) for Cpu_Mem_630.sys.
To access the CEH, select the Events tab for the problem report selected. Table 5–1 Common Event Header Example Table (CEH) V2.0 OS_Type Hardware_Arch CEH_Vendor_ID Hdwr_Sys_Type Logging_CPU CPUs_In_Active_Set Entry_Type 2 4 3,564 38 0 2 630 DSR_Msg_Num 1,972 Chip_Type CEH_Device CEH_Device_ID_0 CEH_Device_ID_1 CEH_Device_ID_2 Unique_ID_Count Unique_ID_Prefix 12 0 x0000 0000 x0000 0000 x0000 0000 79 1 TLV Section of CEH TLV_DDR_String TLV_DSR_String TLV_Sys_Serial_Num TLV_Time_as_Local TLV_OS_Version TLV_Computer_Name Generic IDE/ATAPI disk Compaq AlphaServer ES45 prqv Tue, 30 Jan 2001 14:20:17 -0500 X765-SSB MCSSE1 5-10 ES45 Service Guide -- OpenVMS AXP -- Alpha -- Compaq Computer Corp -- Titan Corelogic -- CPU Logging this Event -- Correctable Processor Event -- Compaq AlphaServer ES45 .... CPU Slots: 2 (1000 Mhz) .... AGP Slots: 1 .... PCI Slots: 8 .... MMB Slots: 8 (DIMMs) -- EV68CB - 21264C Logout_Frame_CPU_Section Frame_Size x0000 00B0 Frame_Flags x8000 0000 CPU_Area_Offset x0000 0018 System_Area_Offset x0000 0058 Mchk_Error_Code x0000 0086 Machine Check Logout Frame Error Code CPU Non-Fatal Value[31:0] x86 Frame_Rev I_STAT DC_STAT x0000 0001 x0000 0000 0000 0000 Ibox Status Register x0000 0000 0000 0008 Dcache Status Register Dcache ECC during load x1 instruction ECC_Err_Ld[3] Read Erred x0000 0000 3BDC 63C0 Cbox Address Register Access Reference xEF 718F Location x0 System Memory Access C_ADDR Error_Ref[42:6] Io_M[43] C_SYNDROME_1 x0000 0000 0000 0098 High QW Data Syndrome Data Bit 53 QW_Upper[7:0] x98 C_SYNDROME_0 QW_Lower[7:0] x0000 0000 0000 0000 Low QW Data Syndrome x0 No Syndrome C_STAT Read Status x0000 0000 0000 000C Cbox Register Single-bit Bcache ECC xC Fill to Icache Cbox_Error[4:0] C_STS Cblock_Status[3:0] MM_STAT x0000 0000 0000 000D xD Cache Block Access Status Register Shared, Valid, Parity x0000 0000 0000 0000 Memory Management Status Register Logout_Frame_System_Section SW_Error_Sum_Flags x0000 0000 0000 0004 Pchip0_PCI_Error[0] x0 
Pchip1_PCI_Error[1] x0 Pchip_Mem_Error[2] x1 Hot_Plug_Slot[39:32] x0 Cchip_DIRx Cchip_MISC x0000 0000 0000 0000 x0000 0000 0000 0000 No Pchip0 PCI Error Detected No Pchip1 PCI Error Detected Pchip or CPU Memory Error Detected No PCI Hot Plug Slot Intervention Cchip Device Interrupt Request Register Cchip Miscellaneous Error Logs 5-11 Nxs[31:29] x0 Register CPU 0 Source Device P0_Serror x0000 0000 0000 0000 Bus_Source[53:52] x0 TransAction_Cmd[55:54] x0 ECC_Syndrome[63:56] x0 No Error Detected GPCI Bus DMA Read No Data Bit Error P0_GPerror PCI_Cmd[55:52] x0000 0000 0000 0000 x0 No Error Detected Interrupt Acknowledge P0_APerror PCI_Cmd[55:52] x0000 0000 0000 0000 x0 No Error Detected Interrupt Acknowledge P0_AGPerror AGP_Lost_Err[0] AGP_Cmd[52:50] x0000 0000 0000 0000 x0 x0 No Error Detected P1_Serror x0000 0000 0000 0000 Bus_Source[53:52] x0 TransAction_Cmd[55:54] x0 ECC_Syndrome[63:56] x0 No Error Detected GPCI Bus DMA Read No ECC Error P1_GPerror PCI_Cmd[55:52] x0000 0000 0000 0000 x0 No Error Detected Interrupt Acknowledge P1_APerror PCI_Cmd[55:52] x0000 0000 0000 0000 x0 No Error Detected Interrupt Acknowledge P1_AGPerror x0000 0000 0000 0000 No Error Detected Read START OF SUBPACKETS IN THIS EVENT ES4X Dual Port RAM Subpacket, Version 1 Non - Split, Set0 - 4 Dimms DPR_0 x40 Only, configured as lowest array Non - Split, Set0 - 4 Dimms DPR_2 x41 Only, configured as next lowest array Non - Split, Set0 - 4 Dimms DPR_4 x42 Only, configured as next highest array Non - Split, Set0 - 4 Dimms DPR_6 x43 Only, configured as highest array 5-12 ES45 Service Guide System Memory / IO Configuration Subpacket, Version 1 Array 0 AAR_0 x0000 0000 0000 6005 Memory Configuration Register Sa0[8] x0 Non - Split Array Asiz0[15:12] x6 512 Mb Base Address [34:24] Array0 Addr0[34:24] x0 Bits AAR_1 Sa1[8] Asiz1[15:12] Addr1[34:24] x0000 0000 2000 6005 Memory Array 1 Configuration Register x0 Non - Split Array x6 512 Mb Array1 Base Address [34:24] x20 Bits AAR_2 Sa2[8] Asiz2[15:12] 
Addr2[34:24] Array 2 x0000 0000 4000 6005 Memory Configuration Register x0 Non - Split Array x6 512 Mb Array2 Base Address [34:24] x40 Bits AAR_3 Sa3[8] Asiz3[15:12] Addr3[34:24] x0000 0000 6000 6005 Memory Array 3 Configuration Register x0 Non - Split Array x6 512 Mb Array3 Base Address [34:24] x60 Bits P0_SCTL REV[7:0] PID[8] RPP[9] ECCEN[10] SWARB[12:11] CRQMAX[19:16] CDQMAX[23:20] PTPMAX[27:24] INUM[28] PTPWAR[30] System Control x7265 5361 6870 6C41 Pchip0 Register x41 x0 Pchip ID Value x0 x1 DMA ECC Enabled x1 GPCI/APCI (RR) > AGPX x0 x7 x8 256K Max Downstream PTP/PIO x0 Writes to bypass PIO Read GPCI Enabled to Perform PTE x1 Fetch Xactions PTP Writes Enabled During x1 Pending Reads P0_GPCTL FBTB[0] THDIS[1] CHAINDIS[2] TGTLAT[4:3] Win_HOLE[5] MnStr_WIN_Enable[6] ARBENA[7] x3534 5345 2072 6576 Pchip 0 Gport Control Register x0 x1 TLB Anti-Thrash Disabled x1 PIO Write Chaining Disabled Target Latency Timer = 32 x2 PCI Clocks x1 512K - 1Mb Win-Hole ENabled x1 Monster Window Enabled x0 NEWAMU[29] Error Logs 5-13 PRIGRP[15:8] PPRI[16] PCISPD66[17] CNGSTLT[21:18] PTPDESTEN[29:22] DPCEN[30] x65 x0 x1 xC x81 x0 APCEN[31] DCR_Timer[33:32] EN_Stepping[34] x0 x1 x1 P0_APCTL PTPDESTEN[29:22] DPCEN[30] APCEN[31] DCR_Timer[33:32] EN_Stepping[34] AGP_Rate[53:52] AGP_SBA_Enabled[54] AGP_Enabled[55] AGP_Present[57] AGP_HP_RD[60:58] x3320 6C65 646F 4D20 Pchip0 Aport Control Register x0 x0 x0 x0 TGLAT = 128 PCI Clocks x1 Window-Hole Enabled x0 x0 x4D x1 APCI = 66MHz 11 DMA Reads Retry w/no xB delayed Completion x91 x1 Address Parity Error x0 Checking Disabled x1 DCRT Count = 2^11 PCI Config Address Stepping x1 Enabled x2 AGP Rate = 4X x0 x0 x1 AGP Bus Enabled 4 Cchip HP Outstanding x4 Reads AGP_LP_RD[63:61] x1 P1_SCTL x3A54 4556 2D51 0046 Pchip1 System Control Register x46 x0 Pchip PID = 0 x0 x0 Pchip1 ECC Disabled x0 GPCI > APCI > AGPX Max Cchip Requests from x1 both Pchips Max Dchip Data Xfrs from x5 both Pchips Max PTP Reqs from both xD Pchips 256K MAX PTP/PIO Writes x0 
Enabled to bypass PIO Read FBTB[0] THDIS[1] CHAINDIS[2] TGLAT[4:3] HOLE_Enable[5] MWIN_Enable[6] ARBENA[7] PRIGRP[15:8] PCISPD66[17] CNGSTLT[21:18] REV[7:0] PID[8] RPP[9] ECCEN[10] SWARB[12:11] CRQMAX[19:16] CDQMAX[23:20] PTPMAX[27:24] INUM[28] 5-14 ES45 Service Guide GPCI Frequency = 66 MHz 12 DMA Reads Retry w/no delayed Completion Data Parity Checking Disabled Address Parity Checking Disabled DCR Timer Count = 2^11 Address Stepping Enabled 1 Cchip LP Outstanding Read AMU Enabled to Perform PTE Fetch Xactions PTP Writes Disabled During Pending Reads NEWAMU[29] x1 PTPWAR[30] x0 P1_GPCTL Gport Control x7261 7473 2E2E 2E0A Pchip1 Register PCI Fast Back-To_Back x0 Xactions Disabled x1 TLB Anti-Thrashing Enabled GPCI PIO Write Chaining x0 Disabled Target RetryTimer = 64 PCI x1 Clocks 512Kb - 1Mb Window Hole x0 Disabled x0 Monster Window Disabled x0 Internal Arbitor Disabled x2E x1 GPCI Frequency = 66 Mhz 11 DMA Reads Retry w/No xB Delayed Completion Enabled xB8 Data Parity Error Detection x0 Disabled Address Parity Error x0 Detection Disabled x3 DCR Timer = 2^8 Counts x0 Address Stepping Disabled FBTB[0] THDIS[1] CHAINDIS[2] TGLAT[4:3] WIN_Hole[5] Mnstr_Win_Enable[6] ARBENA[7] PRIGRP[15:8] PCISPD66[17] CNGSTLT[21:18] PTPDESTEN[29:22] DPCEN[30] APCEN[31] DCRTV[33:32] EN_Stepping[34] P1_APCTL FBTB[0] THDIS[1] CHAINDIS[2] TGLAT[4:3] Win_Hole[5] Mnstr_Win_Enable[6] ARBENA[7] PRIGRP[15:8] PPRI[16] PCISPD66[17] CNGSLT[21:18] PTPDESTEN[29:22] DPCEN[30] APCEN[31] DCRTV[33:32] EN_Stepping[34] AGP_Rate[53:52] x7250 5B20 676E 6974 Pchip1 Aport Control Register PCI Fast Back-To-Back x0 Xactions Disabled x0 TLB Anti-Thrashing Enabled APCI PIO Write Chaining x1 Disabled Target Latency Timer = 32 x2 PCI Clocks x1 512Kb - 1Mb Hole Enabled x1 Monster Window Enabled x0 Arbitor Disabled x69 x0 x1 APCI Frequency = 66 Mhz 11 DMA Read Retry w/No xB Delayed Completion Enabled x9D Data Parity Error Detection x1 Enabled Address Command Parity x0 Error Detection Disabled x0 DCR Timer = 2^15 
Counts x0 Address Stepping Disabled x1 AGP Rate = 2X

AGP_SBA_EN[54]      x1   SideBand Addressing Enabled
AGP_EN[55]          x0   AGP Xactions Disabled
AGP_Present[57]     x1   agp_present = 1
AGP_HP_RD[60:58]    x4   4 Cchip Pending HP Reads
AGP_LP_RD[63:61]    x3   3 Cchip Pending LP Reads

5.2 Fault Detection and Reporting

Table 5–2 provides a summary of the fault detection and correction components of ES45 systems.

Generally, PALcode handles exceptions/interrupts as follows:

1. The PALcode determines the cause of the exception/interrupt.
2. If possible, it corrects the problem and passes control to the operating system for error notification, reporting, and logging before returning the system to normal operation. If PALcode is unable to correct the problem, it:
   • Logs double error halt error frames into the flash ROM
   • Logs uncorrectable error logout frames to the DPR
   • For single error halts, logs the uncorrectable logout frame into the DPR.
3. If error/event logging is required, control is passed through the OS Privileged Architecture Library (PAL) handler. The operating system error handler logs the error condition into the binary error log. Compaq Analyze should then diagnose the error to the defective FRU.

Table 5–2 ES45 Fault Detection and Correction

Component: Alpha 21264 (EV68) microprocessor
Fault Detection/Correction Capability: Contains error checking and correction (ECC) logic for data cycles. Check bits are associated with all data entering and exiting the microprocessor. A single-bit error on any of the four longwords being read can be corrected (per cycle). A double-bit error on any of the four longwords being read can be detected (per cycle).

Component: Backup cache (B-cache)
Fault Detection/Correction Capability: ECC check bits on the data store, and parity on the tag address store and tag control store.

Component: Memory DIMMs
Fault Detection/Correction Capability: ECC logic protects data by detecting and correcting data cycle errors. A single-bit error on any of the four longwords can be corrected (per cycle). A double-bit error on any of the four longwords being read can be detected (per cycle).

Component: PCI SCSI controller adapter
Fault Detection/Correction Capability: SCSI data parity is generated.

5.3 Machine Checks/Interrupts

The exceptions that result from hardware system errors are called machine checks/interrupts. They occur when a system error is detected during the processing of a data request.

During the error-handling process, errors are first handled by the appropriate PALcode error routine and then by the associated operating system error handler. PALcode transfers control to the operating system through the PAL handler.

Table 5–3 lists the machine checks/interrupts that are related to error events. The designations (630, 670, 620, 660, and 680) indicate a system control block (SCB) offset to the fatal system error handler for Tru64 UNIX and OpenVMS.

Table 5–3 Machine Checks/Interrupts

Error Type: CPU Correctable Error (630)
Error Descriptions: Generic Alpha 21264 (EV68) correctable errors.
   • B-cache probe hit single-bit ECC error
   • D-cache tag parity error on issue
   • I-cache tag or data parity error
   • D-cache victim single-bit ECC error
   • B-cache single-bit ECC fill error to I-stream or D-stream
   • Memory single-bit ECC fill error to I-stream or D-stream

Error Type: CPU Uncorrectable Error (670)
Error Descriptions: Fatal microprocessor machine check errors that result in a system crash.
   • PAL detected bugcheck error
   • Operating system detected bugcheck error
   • EV68 detected second D-cache store ECC error
   • EV68 detected D-cache tag parity error in pipeline 0 or 1
   • EV68 detected duplicate D-cache tag parity error
   • EV68 detected double-bit ECC memory fill error
   • EV68 detected double-bit probe hit ECC error
   • EV68 detected B-cache tag parity error

Error Type: System Correctable Error (620)
Error Descriptions: ES45-specific correctable errors.
   • System detected ECC single-bit error

Error Type: System Uncorrectable Error (660)
Error Descriptions: A system-detected machine check that occurred as a result of an “off-chip” request to the system.
   • Uncorrectable ECC error
   • Nonexistent memory reference
   • PCI system bus error (SERR)
   • PCI read data parity error (RDPE)
   • PCI address/command parity error (APE)
   • PCI no device select (NDS)
   • PCI target abort (TA)
   • Invalid scatter/gather page table entry (SGE) error
   • PCI data parity error (PERR)
   • Flash ROM write error
   • PCI target delayed completion retry time-out (DCRTO)
   • PCI master retry time-out (RTO 2**24) error
   • PCI-ISA software NMI error

Error Type: System Environmental Error (680)
Error Descriptions: System-detected machine check caused by an overtemperature condition, fan failure, or power supply failure.
   • Overtemperature failure (>50° C) (see Note)
   • Uncorrectable Fan 5 failure
   • Complete power supply failure
   • Fan failure (redundant fan)
   • Power supply failure (redundant supply)
   • High temperature warning (>45° C and <50° C)

NOTE: For overtemperature failure, the position of jumper J26 determines whether the failure is fatal or nonfatal. See Appendix B.

5.3.1 Error Logging and Event Log Entry Format

The operating system error handlers generate several entry types. Entries can be of variable length based on the number of registers within the entry. Each entry consists of an operating system header, several device frames, and an end frame. Most entries have a PAL-generated logout frame, and may contain frames for CPU, memory, and I/O.

Table 5–4 shows an event structure map for a system uncorrectable PCI target abort error.

NOTE: See Appendix D for the source data Compaq Analyze uses to isolate to the FRUs.
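The five designations in Table 5–3 can be kept as a small lookup when sorting error log entries by type. The Python sketch below copies the error type names from the table; the dictionary and the describe_vector function are hypothetical names for illustration and do not correspond to any Compaq Analyze interface. The designation is used purely as a label here, so no assumption is made about its radix.

```python
# Error type names copied from Table 5-3. The helper below is illustrative only.
MACHINE_CHECK_TYPES = {
    "620": "System Correctable Error",
    "630": "CPU Correctable Error",
    "660": "System Uncorrectable Error",
    "670": "CPU Uncorrectable Error",
    "680": "System Environmental Error",
}


def describe_vector(designation):
    """Map a machine check designation (for example 630) to its error type."""
    # Accept either 630 or "630"; the value is treated as a label, not a number.
    key = str(designation).strip()
    return MACHINE_CHECK_TYPES.get(key, "Unknown designation")


if __name__ == "__main__":
    for vector in (620, 630, 660, 670, 680):
        print(vector, describe_vector(vector))
```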
Table 5–4 Sample Error Log Event Structure Map (ES45 with 10 PCI Slots)

OFFSET(hex)     63:56  55:48  47:40  39:32  31:24  23:16  15:8  7:0
ech0000         NEW COMMON OS HEADER
ech+nnnn
lfh0000         STANDARD LOGOUT FRAME HEADER
lfh+nnnn
lfEV680000      COMMON PAL EV68 SECTION (first 8 QWs Zeroed)
lfEV68+nnnn
lfctt_A0[u]     SESF<63:32> = Reserved(MBZ) <39:32>= (MBZ) SESF<31:16> = Reserved(MBZ) SESF<15:0>= 0002(hex)
lfctt_A8[u]     Cchip CPUx Device Interrupt Request Register (DIRx<61> = 1)
lfctt_B0[u]     Cchip Miscellaneous Register (MISC)
lfctt_B8[u]     Pchip0 Error Register (P0_PERROR<63:0> = 0)
lfctt_C0[u]     Pchip1 Error Register (P1_PERROR<51>=0;<47:18>=PCI Addr;<17:16>=PCI Opn; <6>=1)
lfett_C8[u]     Pchip1 Extended Titan/Typhoon System Packet
lfett_138[u]
eelcb_140       Pchip 1 PCI Slot 4 Single Device Bus Snapshot Packet
eelcb_190       Pchip 1 PCI Slot 5 Single Device Bus Snapshot Packet
eelcb_1E0       Pchip 1 PCI Slot 6 Single Device Bus Snapshot Packet
eelcb_230       Pchip 1 PCI Slot 7 Single Device Bus Snapshot Packet
eelcb_280       Pchip 1 PCI Slot 8 Single Device Bus Snapshot Packet
eelcb_2D0       Pchip 1 PCI Slot 9 Single Device Bus Snapshot Packet
2D8             Termination or End Packet

5.4 Environmental Errors Captured by SRM

If an environmental error occurs while the SRM console is running, a logout frame similar to Example 5–1 is sent to the console output device.

The logout frame is preceded by the message “***unexpected system event through vector 680 on CPU n” (usually CPU 0). For register definitions, see Appendix D.

Example 5–1 Console Level Environmental Error Logout Frame

P00>>>
*** unexpected system event through vector 680 on CPU 0
os_flags         0000000000000000
cchip_dirx       0004000000000000
tig_smir         0000000000000008
tig_cpuir        000000000000000f
tig_psir         0000000000000003
lm78_isr         0000000000000000
door_open        0000000000000004
temp_warning     0000000000000000
fan_ctrl_fault   0000000000000000
power_down_code  0000000000000000
reserved_1       0000000000000000

This example shows a fan door open event.
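A captured logout frame such as the one above can be turned into name/value pairs for comparison between events. The Python sketch below assumes the layout shown in Example 5–1: a header naming the vector and CPU, followed by register names that are each paired with a 16-digit hexadecimal value. The function name parse_logout_frame is hypothetical, and the sketch does not interpret individual bits; the register definitions are in Appendix D.

```python
import re

# Patterns assume the text layout of Example 5-1; they are not an SRM specification.
HEADER_RE = re.compile(r"unexpected system event through vector (\w+) on CPU (\d+)")
FIELD_RE = re.compile(r"\b([a-z][a-z0-9_]*)\s+([0-9a-fA-F]{16})\b")


def parse_logout_frame(text):
    """Return (vector, cpu, registers) from a console logout frame."""
    header = HEADER_RE.search(text)
    if header is None:
        raise ValueError("no logout frame header found")
    # Each register name is followed by a 64-bit value printed as 16 hex digits.
    registers = {name: int(value, 16) for name, value in FIELD_RE.findall(text)}
    return header.group(1), int(header.group(2)), registers


if __name__ == "__main__":
    sample = (
        "*** unexpected system event through vector 680 on CPU 0 "
        "os_flags 0000000000000000 cchip_dirx 0004000000000000 "
        "door_open 0000000000000004"
    )
    vector, cpu, registers = parse_logout_frame(sample)
    changed = {name: hex(value) for name, value in registers.items() if value}
    print(vector, cpu, changed)
```

Parsing two frames this way makes it straightforward to list only the registers whose values differ between them.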
P00>>>
*** unexpected system event through vector 680 on CPU 0
os_flags         0000000000000000
cchip_dirx       0004000000000000
tig_smir         0000000000000008
tig_cpuir        000000000000000f
tig_psir         0000000000000003
lm78_isr         0000000000000000
door_open        0000000000000040
temp_warning     0000000000000000
fan_ctrl_fault   0000000000000000
power_down_code  0000000000000000
reserved_1       0000000000000000

This example shows a fan door closing event.

Chapter 6 System Configuration and Setup

This chapter describes how to configure and set up ES45 systems. The following topics are covered:

• System Consoles
• Displaying the Hardware Configuration
• Setting Environment Variables
• Setting Automatic Booting
• Changing the Default Boot Device
• Setting SRM Security
• Configuring Devices
• Booting Linux

6.1 System Consoles

The SRM console program is located in a flash ROM on the system motherboard. From the console interface, you can set up and boot the operating system, display the system configuration, and run diagnostics. For complete information, see the ES45 Owner’s Guide.

SRM Console

Systems running the Tru64 UNIX or OpenVMS operating systems are configured from the SRM console, a command-line interface (CLI). From the CLI you can enter commands to configure the system, view the system configuration, boot the system, and run ROM-based diagnostics.

NOTE: The operating systems use different algorithms for system time. If you switch between operating systems (for example, between UNIX and OpenVMS), be sure to reset the time at the operating system level.

Linux

The procedure for installing Linux on an Alpha system is described in the Alpha Linux installation document for your Linux distribution. The installation document can be downloaded from the following Web site:

http://www.compaq.com/alphaserver/linux

RMC CLI

The remote management console (RMC) provides a command-line interface (CLI) for controlling the system.
You can use the CLI either locally or remotely (modem connection) to power the system on and off, halt or reset the system, and monitor the system environment. You can also use the dump, env, and status commands to help diagnose errors. See Chapter 7 for details.

6.1.1 Selecting the Display Device

The SRM console environment variable determines to which display device (VT-type terminal or VGA monitor) the console display is sent.

The console terminal that displays the SRM user interface can be either a serial terminal (VT320 or higher, or equivalent) or a VGA monitor. The SRM console environment variable determines the display device.

• If console is set to serial, and a VT-type device is connected, the SRM console powers on in serial mode and sends power-up information to the VT device. The VT device can be connected to the MMJ port or to COM2.
• If console is set to graphics, the SRM console expects to find a VGA card and, if so, displays power-up information on the VGA monitor after VGA initialization has been completed.

You can verify the display device with the SRM show console command and change the display device with the SRM set console command. If you change the display device setting, you must reset the system (with the Reset button or the init command) to put the new setting into effect.

In the following example, the user displays the current console device (a graphics device) and then resets it to a serial device. After the system initializes, output will be displayed on the serial terminal.

P00>>> show console
console                 graphics
P00>>> set console serial
P00>>> init
.
.
.

6.1.2 Setting the Control Panel Message

You can create a customized message to be displayed on the operator control panel after startup self-tests and diagnostics have been completed.

When the operating system is running, the control panel displays the console revision.
It is useful to create a customized message if you have a number of systems and you want to identify each system by a node name. You can use the SRM set ocp_text command to change this message (see Example 6–1). The message can be up to 16 characters and must be entered in quotation marks.

Example 6–1 set ocp_text

P00>>> set ocp_text "Node Alpha1"

6.2 Displaying the Hardware Configuration

View the system hardware configuration by entering commands from the SRM console. It is useful to view the hardware configuration to ensure that the system recognizes all devices, memory configuration, and network connections.

Use the following SRM console commands to view the system configuration. See the Owner’s Guide for details.

show boot*
    Displays the boot environment variables.
show config
    Displays the logical configuration of interconnects and buses on the system and the devices found on them.
show device
    Displays the bootable devices and controllers in the system.
show fru
    Displays the physical configuration of FRUs (field-replaceable units).
show memory
    Displays configuration of main memory.

6.3 Setting Environment Variables

Environment variables pass configuration information between the console and the operating system. Their settings determine how the system powers up, boots the operating system, and operates.

• To check the setting for a specific environment variable, enter the show envar command, where the name of the environment variable is substituted for envar.
• To reset an environment variable, use the set envar command, where the name of the environment variable is substituted for envar.

set envar

The set command sets or modifies the value of an environment variable. It can also be used to create a new environment variable if the name used is unique. Environment variables pass configuration information between the console and the operating system.
Their settings determine how the system powers up, boots the operating system, and operates. The syntax is:

set envar value

envar    The name of the environment variable to be modified.
value    The new value of the environment variable.

New values for the following environment variables take effect only after you reset the system by pressing the Reset button or issuing the init command.

auto_action
console
cpu_enabled
os_type
pk*0_fast
pk*0_host_id
pk*0_soft_term
console_memory_allocation

show envar

The show envar command displays the current value (or setting) of an environment variable. The syntax is:

show envar

envar    The name of the environment variable to be displayed. The wildcard * displays all environment variables.

Table 6–1 summarizes the SRM environment variables used most often on the ES45 system.

Table 6–1 SRM Environment Variables

Variable    Attributes    Description

auto_action    NV,W (see footnote 1)
    Action the console should take following an error halt or power failure. Defined values are:
    boot—Attempt bootstrap.
    halt—Halt, enter console I/O mode.
    restart—Attempt restart. If restart fails, try boot.

bootdef_dev    NV,W
    Device or device list from which booting is to be attempted when no path is specified. Set at factory to disk with factory-installed software; otherwise NULL.

boot_file    NV,W
    Default file name used for the primary bootstrap when no file name is specified by the boot command. The default value is NULL.

boot_osflags    NV,W
    Default parameters to be passed to system software during booting if none are specified by the boot command.
    OpenVMS: Additional parameters are the root_number and boot flags. The default value is NULL.
    root_number: Directory number of the system disk on which OpenVMS files are located.
        0 (default)—[SYS0.SYSEXE]
        1—[SYS1.SYSEXE]
        2—[SYS2.SYSEXE]
        3—[SYS3.SYSEXE]

1 NV—Nonvolatile.
The last value saved by system software or set by console commands is preserved across cold bootstraps (when the system goes through a full initialization), and long power outages.
W—Warm nonvolatile. The last value set by system software is preserved across warm bootstraps (UNIX shutdown -r command, OpenVMS REBOOT command, or a crash and reboot; not all of the SRM initialization is run) and restarts.

boot_osflags (continued)    NV,W
    boot_flags: The hexadecimal value of the bit number or numbers to set. To specify multiple boot flags, add the flag values (logical OR).
        1—Bootstrap conversationally (enables you to modify SYSGEN parameters in SYSBOOT).
        2—Map XDELTA to running system.
        4—Stop at initial system breakpoint.
        8—Perform a diagnostic bootstrap.
        10—Stop at the bootstrap breakpoints.
        20—Omit header from secondary bootstrap file.
        80—Prompt for the name of the secondary bootstrap file.
        100—Halt before secondary bootstrap.
        10000—Display debug messages during booting.
        20000—Display user messages during booting.
    Tru64 UNIX: The following parameters are used with this operating system:
        a—Autoboot. Boots /vmunix from bootdef_dev, goes to multi-user mode. Use this for a system that should come up automatically after a power failure.
        s—Stop in single-user mode. Boots /vmunix to single-user mode and stops at the # (root) prompt.
        i—Interactive boot. Requests the name of the image to boot from the specified boot device. Other flags, such as -kdebug (to enable the kernel debugger), may be entered using this option.
        D—Full dump; implies s as well. By default, if Tru64 UNIX crashes, it completes a partial memory dump. Specifying D forces a full dump at system crash.
boot_osflags (continued)
    Common settings are a (autoboot) and Da (autoboot and create full dumps if the system crashes).

com1_baud    NV,W
    Sets the baud rate of the COM1 (MMJ) port. The default baud rate is 9600. Baud rate values are 1800, 2000, 2400, 3600, 4800, 7200, 9600, 19200, 38400, 57600.

com2_baud    NV,W
    Sets the baud rate of the COM2 port. The default baud rate is 9600. Baud rate values are 1800, 2000, 2400, 3600, 4800, 7200, 9600, 19200, 38400, 57600.

com1_flow, com2_flow    NV,W
    The com1_flow and com2_flow environment variables indicate the flow control on the serial ports. Defined values are:
    none—No flow control is applied to data on the serial ports. Use this setting for devices that do not recognize XON/XOFF or that would be confused by these signals.
    software—Use XON/XOFF (default). This is the setting for a standard serial terminal.
    hardware—Use modem signals CTS/RTS. Use this setting if you are connecting a modem to a serial port.

com1_mode    NV
    Specifies the COM1 data flow paths so that data either flows through the RMC or bypasses it.

com1_modem, com2_modem    NV,W
    Used to tell the operating system whether a modem is present on the COM1 or COM2 ports, respectively.
    On—Modem is present.
    Off—Modem is not present (default value).

console    NV
    Sets the device on which power-up output is displayed.
    Graphics—Sets the power-up output to be displayed at a VGA monitor or device connected to the VGA module.
    Serial—Sets the power-up output to be displayed on the device that is connected to the COM1 (MMJ) port.

console_memory_allocation    NV
    Determines which memory locations the SRM console will allocate for its private use.
    Old—For 1 gigabyte or less, the console carves memory from 0–2 megabytes and at the end of memory, leaving all memory in between available to the operating system.
    If there is more than 1 gigabyte, the console creates a “memory hole” for the operating system just under 1 gigabyte.
    New—The console takes all needed memory from 0 megabytes to whatever amount is needed. It does not matter how much memory is installed and no holes are ever created.

cpu_enabled    NV
    Enables or disables a specific secondary CPU. All CPUs are enabled by default. The primary CPU cannot be disabled. The primary CPU is the lowest numbered working CPU.

ei*0_inet_init or ew*0_inet_init    NV
    Determines whether the interface's internal Internet database is initialized from nvram or from a network server (via the bootp protocol).

ei*0_mode or ew*0_mode    NV
    Sets the Ethernet controller to the default Ethernet device type.
    aui—Sets the default device to AUI.
    bnc—Sets the default device to ThinWire.
    fast—Sets the default device to fast 100BaseT.
    fastfd—Sets the default device to fast full duplex 100BaseT.
    full—Sets the default device to full duplex twisted pair.
    Twisted-pair—Sets the default device to 10BaseT (twisted-pair).

ei*0_protocols or ew*0_protocols    NV
    Determines which network protocols are enabled for booting and other functions.
    Mop—Sets the network protocol to MOP for systems using the OpenVMS operating system.
    Bootp—Sets the network protocol to bootp for systems using the Tru64 UNIX operating system.
    Bootp,mop—When the settings are used in a list, the mop protocol is attempted first, followed by bootp.

heap_expand    NV
    Increases the amount of memory available for the SRM console's heap. Valid selections are:
    NONE (default), 64KB, 128KB, 256KB, 512KB, 1MB, 2MB, 3MB, 4MB

kbd_hardware_type    NV
    Sets the keyboard hardware type as either PCXAL or LK411 and enables the system to interpret the terminal keyboard layout correctly.
kzpsa_host_id    W
    Specifies the default value for the KZPSA host SCSI bus node ID.

language    NV
    Specifies the console keyboard layout. The default is English (American).

memory_test    NV
    Specifies the extent to which memory will be tested on Tru64 UNIX. The options are:
    Full—Full memory test will be run. Required for OpenVMS.
    Partial—First 256 MB of memory will be tested.
    None—Only first 32 MB will be tested.

ocp_text    NV
    Overrides the default control panel display text with specified text.

os_type    NV
    Sets the default operating system.
    vms or unix—Sets system to boot the SRM firmware.

password    NV
    Sets a console password. Required for placing the SRM into secure mode.

pci_parity    NV
    Disable or enable parity checking on the PCI bus.
    On—PCI parity enabled (default value)
    Off—PCI parity disabled
    Some PCI devices do not implement PCI parity checking, and some have a parity-generating scheme in which the parity is sometimes incorrect or is not fully compliant with the PCI specification. In such cases, the device functions properly so long as parity is not checked.

pk*0_fast    NV
    Enables fast SCSI devices on a SCSI controller to perform in standard or fast mode.
    0—Sets the default speed for devices on the controller to standard SCSI. If a controller is set to standard SCSI mode, both standard and fast SCSI devices will perform in standard mode.
    1—Sets the default speed for devices on the controller to fast SCSI mode. Devices on a controller that connects to both standard and Fast SCSI devices will automatically perform at the appropriate rate for the device, either fast or standard mode.

pk*0_host_id    NV
    Sets the controller host bus node ID to a value between 0 and 7.
    0 to 7—Assigns bus node ID for specified host adapter.
pk*0_soft_term    NV
    Enables or disables SCSI terminators for optional SCSI controllers. This environment variable applies to systems using the Qlogic SCSI controller, though it does not affect the onboard controller. The Qlogic SCSI controller implements the 16-bit wide SCSI bus. The Qlogic module has two terminators, one for the 8 low bits and one for the high 8 bits. There are five possible values:
    off—Turns off both low 8 bits and high 8 bits.
    Low—Turns on low 8 bits and turns off high 8 bits.
    High—Turns on high 8 bits and turns off low 8 bits.
    On—Turns on both low 8 bits and high 8 bits.

sys_serial_num    NV
    Sets the system serial number, which is then propagated to all FRUs that have EEPROMs. The serial number can be read by the operating system.

tt_allow_login    NV
    Enables or disables login to the SRM console firmware on alternative console ports.
    0—Disables login on alternative console ports.
    1—Enables login on alternative console ports (default setting).
    If the console output device is set to serial, set tt_allow_login 1 allows you to log in on the primary COM1 (MMJ) port, or alternate COM2 port, or the VGA monitor. If the console output device is set to graphics, set tt_allow_login 1 allows you to log in through either the COM1 (MMJ) or COM2 console port.

6.4 Setting Automatic Booting

Tru64 UNIX and OpenVMS systems are factory set to halt in the SRM console. You can change these defaults, if desired.

Systems can boot automatically (if set to autoboot) from the default boot device under the following conditions:

• When you first turn on system power
• When you power cycle or reset the system
• When system power comes on after a power failure
• After a bugcheck (OpenVMS) or panic (Linux or Tru64 UNIX)

6.4.1 Setting the Operating System to Auto Start

The SRM auto_action environment variable determines the default action the system takes when the system is power cycled, reset, or experiences a failure.
The factory setting for auto_action is halt. The halt setting causes the system to stop in the SRM console. You must then boot the operating system manually.

For maximum system availability, auto_action can be set to boot or restart.

• With the boot setting, the operating system boots automatically after the SRM init command is issued or the Reset button is pressed.
• With the restart setting, the operating system boots automatically after the SRM init command is issued or the Reset button is pressed, and it also reboots after an operating system crash.

To set the default action to boot, enter the following SRM commands:

P00>>> set auto_action boot
P00>>> init

See the Owner’s Guide for more information.

6.5 Changing the Default Boot Device

You can designate a default boot device and change it with the set bootdef_dev SRM console command. For example, to set the boot device to the IDE CD-ROM, enter commands similar to the following:

P00>>> show bootdef_dev
bootdef_dev             dka400.4.0.1.1
P00>>> set bootdef_dev dqa500.5.0.1.1
P00>>> show bootdef_dev
bootdef_dev             dqa500.5.0.1.1

See the Owner’s Guide for more information.

6.6 Setting SRM Security

The set password and set secure commands set SRM security. The login command turns off security for the current session. The clear password command returns the system to user mode.

The SRM console has two modes, user mode and secure mode.

• User mode allows you to use all SRM console commands. User mode is the default mode.
• Secure mode allows you to use only the boot and continue commands. The boot command cannot take command-line parameters when the console is in secure mode. The console boots the operating system using the environment variables stored in NVRAM (boot_file, bootdef_dev, boot_osflags).
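The two console modes can be pictured with a few lines of Python. This is only an illustrative model of the rules stated above, not SRM firmware behavior. The function name command_allowed is hypothetical, login is included because it is the command that turns security off for the session, and command abbreviations (such as b for boot) are not handled.

```python
# Illustrative model of SRM user mode and secure mode (not firmware code).
# login is accepted in secure mode because it is how security is turned off.
SECURE_MODE_COMMANDS = {"boot", "continue", "login"}


def command_allowed(command_line: str, secure: bool) -> bool:
    """Hypothetical helper: would the console accept this command line?"""
    words = command_line.split()
    if not words:
        return False          # nothing to run
    if not secure:
        return True           # user mode: all SRM commands are available
    if words[0] not in SECURE_MODE_COMMANDS:
        return False          # secure mode: everything else is refused
    # In secure mode the boot command cannot take command-line parameters.
    return not (words[0] == "boot" and len(words) > 1)


if __name__ == "__main__":
    for line in ("boot", "boot dkb0", "show config"):
        print(line, "->", command_allowed(line, secure=True))
```

In this model, boot dkb0 is refused while the console is secure and accepted again once the session is back in user mode, which mirrors the sequence in Example 6–3.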
Example 6–2 set password

P00>>> set password
Please enter the password:
Please enter the password again:
P00>>>

P00>>> set password
Please enter the password:
Please enter the password again:
Now enter the old password:
P00>>>

P00>>> set password
Please enter the password:
Password length must be between 15 and 30 characters
P00>>>

Setting a password. If a password has not been set and the set password command is issued, the console prompts for a password and verification. The password and verification are not echoed.

Changing a password. If a password has been set and the set password command is issued, the console prompts for the new password and verification, then prompts for the old password. The password is not changed if the validation password entered does not match the existing password stored in NVRAM.

The password length must be between 15 and 30 alphanumeric characters. Any characters entered after the 30th character are not stored.

Example 6–3 set secure

P00>>> set secure
Console is secure. Please login.
P00>>> login
Please enter the password:
P00>>> b dkb0

The set secure command puts the console into secure mode. A password must be set before you can issue set secure. Once the console is secure, only the boot and continue commands can be used. The boot command cannot take command-line parameters.

Entering the login command turns off security features for the current console session. This allows the operator to enter any SRM command, in this case a boot command with command-line parameters.

Example 6–4 clear password

P00>>> clear password
Please enter the password:
Password successfully cleared.
P00>>>

Clearing the password returns the system to user mode.

If You Forget the Password

If you forget the current password, use the login command in conjunction with the control panel Halt button to clear the password, as follows:

1. Enter the login command:

P00>>> login

2.
When prompted for the password, press the Halt button to the latched position and then press the Return (or Enter) key.
3. Press the Halt button to release the halt. The password is now cleared and the console cannot be put into secure mode unless you set a new password.

6.7 Configuring Devices

Become familiar with the configuration requirements for CPUs and memory before removing or replacing those components. See Chapter 8 for removal and replacement procedures.

WARNING: To prevent injury, access is limited to persons who have appropriate technical training and experience. Such persons are expected to understand the hazards of working within this equipment and take measures to minimize danger to themselves or others. These measures include:
1. Remove any jewelry that may conduct electricity.
2. If accessing the system card cage, power down the system and wait 2 minutes to allow components to cool.
3. Wear an anti-static wrist strap when handling internal components.

6.7.1 CPU Configuration

Figure 6–1 CPU Slot Locations (Pedestal/Rack)
[Illustration; slot labels as printed: CPU 3, CPU 1, CPU 0, CPU 2]

Figure 6–2 CPU Slot Locations (Tower)
[Illustration; slot labels as printed: CPU 2, CPU 0, CPU 1, CPU 3]

CPU Configuration Rules

1. A CPU must be installed in slot 0. The system will not power up without a CPU in slot 0.
2. CPU cards must be installed in numerical order, starting at CPU slot 0. See Figure 6–1 and Figure 6–2.
3. CPUs must be identical in speed.

6.7.2 Memory Configuration

Become familiar with the rules for memory configuration before adding DIMMs to the system.

Refer to Figure 6–4 or Figure 6–5 and observe the following rules for installing DIMMs.

• You can install up to 16 DIMMs or up to 32 DIMMs, depending on the system type.
• An option consists of a set of 4 DIMMs. You must install all 4 DIMMs to populate a set.
• Fill sets in numerical order. Populate all 4 slots in Set 0, then populate Set 1, and so on.
• An “array” is one set for systems that support 16 DIMMs and two sets for systems that support 32 DIMMs.
• DIMMs in an array must be the same capacity and type. For example, suppose you have populated Sets 0, 1, 2, and 3. When you populate Set 4, the DIMMs must be the same capacity and type as those installed in Set 0. Similarly, Set 5 must be populated with DIMMs of the same capacity and type as are in Set 1, and so on, as indicated in the following table.

Array    Systems Supporting 32 DIMMs    Systems Supporting 16 DIMMs
0        Set 0 and Set 4                Set 0
1        Set 1 and Set 5                Set 1
2        Set 2 and Set 6                Set 2
3        Set 3 and Set 7                Set 3

CAUTION: Using different DIMMs may result in loss of data.

DIMM Information for Two System Types

DIMMs are manufactured with two types of SRAMs, stacked and unstacked (see Figure 6–3). Stacked DIMMs provide twice the capacity of unstacked DIMMs, and, at the time of shipment, are the highest capacity DIMMs offered by Compaq. The system may have either stacked or unstacked DIMMs.

A memory option consists of a “set” of four DIMMs. The system supports two sets per “array” and four arrays per system. You can mix stacked and unstacked DIMMs within the system, but not within an array. The DIMMs within an array must be of the same capacity and type (stacked or unstacked) because of different memory addressing.

When installing sets 0, 1, 2, and 3, an incorrect mix will not occur. When installing sets 4, 5, 6, or 7, however, you must ensure that the four DIMMs being installed match the capacity and type of DIMMs in the existing array. If necessary, rearrange DIMMs for proper configuration.

Figure 6–3 Stacked and Unstacked DIMMs
[Illustration comparing an unstacked DIMM with a stacked DIMM]

Only the following DIMMs and DIMM options can be used in the ES45 system.
Density    DIMM           DIMM Option (4 DIMMs per option)
128 MB     20-01CBA-09    MS620-AA (512 MB)
256 MB     20-01DBA-09    MS620-BA (1 GB)
512 MB     20-01EBA-09    MS620-CA (2 GB)
1 GB       20-L0FBA-09    MS620-DA (4 GB)*

* Toshiba specific DIMM and option.

CAUTION: Using different DIMMs may result in loss of data.

Memory Performance Considerations

Interleaved operations reduce the average latency and increase the memory throughput over non-interleaved operations. With one memory option (4 DIMMs) installed, memory interleaving will not occur. For 2-way interleaving, arrays 0 and 2 must have the same size memory, and arrays 1 and 3 must have the same size memory. For 4-way interleaving, arrays 0 through 3 must all have the same size memory.

The output of the show memory command provides the memory interleaving status of the system.

P00>>> show memory

Array    Size      Base Address        Intlv Mode
-----    ------    ----------------    ----------
0        4096Mb    0000000000000000    2-Way
1        1024Mb    0000000200000000    2-Way
2        4096Mb    0000000100000000    2-Way
3        1024Mb    0000000240000000    2-Way

10240 MB of System Memory

The show memory display does not indicate the number of DIMMs or their size. Array 3 could consist of two sets of 128 MB DIMMs (eight DIMMs) or one set of 256 MB DIMMs (four DIMMs). Either combination provides 1024 MB of memory.

For optimum memory utilization and performance, load memory sets in the following order: 0, 1, 2, 3, 4, 6, 5, and 7. See Figure 6–4 for array locations.
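The array and interleaving rules above can be sketched in Python. This is a minimal illustration, not the firmware's algorithm. The function names are hypothetical, the label "None" for a non-interleaved array is a made-up placeholder, and treating each array pair (0 and 2, 1 and 3) independently is an assumption based on show memory reporting a mode per array.

```python
# Sketch of the memory array rules described in this section (illustrative only).

def array_for_set(set_number: int) -> int:
    """Map a DIMM set (0-7) to its array (0-3): Set n and Set n+4 share Array n."""
    if not 0 <= set_number <= 7:
        raise ValueError("set number must be 0 through 7")
    return set_number % 4


def interleave_modes(array_sizes_mb):
    """Return one mode label per array for four array sizes in MB (0 = empty)."""
    sizes = list(array_sizes_mb)
    if len(sizes) != 4:
        raise ValueError("expected sizes for arrays 0 through 3")
    # 4-way: arrays 0 through 3 all populated with the same size memory.
    if all(size > 0 for size in sizes) and len(set(sizes)) == 1:
        return ["4-Way"] * 4
    # 2-way: arrays 0 & 2 match, arrays 1 & 3 match (pairs handled separately,
    # which is an assumption of this sketch).
    modes = ["None"] * 4
    for first, second in ((0, 2), (1, 3)):
        if sizes[first] > 0 and sizes[first] == sizes[second]:
            modes[first] = modes[second] = "2-Way"
    return modes


if __name__ == "__main__":
    sizes = [4096, 1024, 4096, 1024]      # array sizes from the show memory example
    print(interleave_modes(sizes), sum(sizes), "MB")
```

Run against the sizes in the show memory example, the sketch reports 2-way interleaving for every array and a total of 10240 MB, matching the display. A system with a single memory option comes out as not interleaved.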
Figure 6–4 Memory Configuration (Pedestal/Rack)
[Illustration of DIMM connectors J2 through J9 on memory motherboards MMB 0 through MMB 3. As labeled, MMB 0 and MMB 2 carry Sets 0, 2, 4, and 6; MMB 1 and MMB 3 carry Sets 1, 3, 5, and 7. Array 0 = Sets 0 & 4; Array 1 = Sets 1 & 5; Array 2 = Sets 2 & 6; Array 3 = Sets 3 & 7.]

Figure 6–5 Memory Configuration (Tower)
[Illustration of the same layout in tower orientation: connectors J2 through J9 on MMB 0 through MMB 3, with the same set and array assignments as Figure 6–4.]

6.7.3 PCI Configuration

PCI modules are designed for either 5.0-volt or 3.3-volt slots, or are universal in design and can plug into either 3.3-volt or 5.0-volt slots.

Figure 6–6 PCI Slot Locations (Pedestal/Rack)
[Illustration of the 10-slot PCI card cage; slots numbered 1 through 10]

CAUTION: Check the keying before you install the PCI module and do not force it in. Plugging a module into a wrong slot can damage it.

The PCI slots are split across four independent 64-bit PCI buses, three buses at 66 MHz and one bus at 33 MHz. These buses correspond to Hose 0 through Hose 3 in the system logical configuration. The slots on each bus are listed below.

Some PCI options require drivers to be installed and configured. These options come with a floppy or a CD-ROM. Refer to the installation document that came with the option and follow the manufacturer's instructions.

There is no direct correspondence between the physical numbers of the slots on the I/O backplane and the logical slot identification reported with the SRM console show config command (described in Chapter 2). The table in Figure 6–7 maps the physical slot numbers to the SRM logical ID numbers for the 10-slot backplane.
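When reading show config output, it can help to keep the Figure 6–7 mapping in a small lookup table. The Python sketch below copies the values from that figure; the names PCI_SLOTS, SRM_TO_PHYSICAL, and physical_slot are hypothetical and exist only for this illustration.

```python
# Physical slot -> (hose, SRM slot ID, max MHz, volts, hot-plug), per Figure 6-7.
PCI_SLOTS = {
    1: (2, 1, 66, 3.3, False),
    2: (2, 2, 66, 3.3, False),
    3: (0, 11, 33, 5.0, False),
    4: (3, 2, 66, 3.3, True),
    5: (3, 1, 66, 3.3, True),
    6: (0, 10, 33, 5.0, True),
    7: (1, 2, 66, 3.3, True),
    8: (1, 1, 66, 3.3, True),
    9: (0, 9, 33, 5.0, True),
    10: (0, 8, 33, 5.0, True),
}

# Reverse index: (hose, SRM slot ID) -> physical slot number.
SRM_TO_PHYSICAL = {
    (hose, slot_id): slot
    for slot, (hose, slot_id, _mhz, _volts, _hot_plug) in PCI_SLOTS.items()
}


def physical_slot(hose: int, slot_id: int) -> int:
    """Translate an SRM hose/slot ID pair to the physical backplane slot."""
    return SRM_TO_PHYSICAL[(hose, slot_id)]


if __name__ == "__main__":
    print(physical_slot(0, 8))    # Hose 0, Slot ID 8
    print(physical_slot(3, 1))    # Hose 3, Slot ID 1
```

Keeping the forward table as the single source and deriving the reverse index from it avoids the two directions drifting apart if an entry is corrected.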
Figure 6–7 PCI Slot Voltages and Hose Numbers

10-Slot PCI I/O Backplane

Physical Slot    Max Speed    Voltage    Hot-Plug    SRM Console
1                66 MHz       3.3V       No          Hose 2 Slot ID 1
2                66 MHz       3.3V       No          Hose 2 Slot ID 2
3                33 MHz       5.0V       No          Hose 0 Slot ID 11
4                66 MHz       3.3V       Yes         Hose 3 Slot ID 2
5                66 MHz       3.3V       Yes         Hose 3 Slot ID 1
6                33 MHz       5.0V       Yes         Hose 0 Slot ID 10
7                66 MHz       3.3V       Yes         Hose 1 Slot ID 2
8                66 MHz       3.3V       Yes         Hose 1 Slot ID 1
9                33 MHz       5.0V       Yes         Hose 0 Slot ID 9
10               33 MHz       5.0V       Yes         Hose 0 Slot ID 8

Quick Reference: SRM Console to Physical Slot Location

SRM Console          Physical Slot
Hose 0 Slot ID 8     10
Hose 0 Slot ID 9     9
Hose 0 Slot ID 10    6
Hose 0 Slot ID 11    3
Hose 1 Slot ID 1     8
Hose 1 Slot ID 2     7
Hose 2 Slot ID 1     1
Hose 2 Slot ID 2     2
Hose 3 Slot ID 1     5
Hose 3 Slot ID 2     4

For more information, see http://www.compaq.com/alphaserver/.

Figure 6–8 PCI Slot Locations (Tower)
[Illustration of the 10-slot PCI card cage in tower orientation; slots numbered 1 through 10]

CAUTION: Check the keying before you install the PCI module and do not force it in. Plugging a module into a wrong slot can damage it.

6.7.4 PCI Module LEDs

CAUTION: Hot plug is not currently supported by the operating systems.

Figure 6–9 PCI Status LEDs
[Illustration with side and rear views of PCI slots 1 through 10, grouped as No Hot Plug and Hot Plug, showing the green and amber LEDs]

LED      Status
Green    Power applied
Amber    Power fault

6.7.5 Power Supply Configurations

Figure 6–10 Power Supply Locations
[Illustration of power supplies 0, 1, and 2 in the pedestal/rack and tower enclosures]

The system can have the following power configurations:

Two Power Supply System (minimum configuration)
• Two CPUs
• One storage cage
• Four to sixteen DIMMs

Redundant Power Supply. If a power supply fails, the redundant supply provides power and the system continues to operate normally. A third power supply adds redundancy for an entry-level system.

Recommended Installation Order.
Generally, power supply 0 is installed first, power supply 1 second, and power supply 2 third, but the supplies can be installed in any order. See Figure 6–10. The power supply numbering corresponds to the numbering displayed by the SRM show power command.

6.8 Booting Linux

Obtain the Linux installation document and install Linux on the system. Then verify the firmware version, boot device, and boot parameters, and issue the boot command.

The procedure for installing Linux on an Alpha system is described in the Alpha Linux installation document for your Linux distribution. The installation document can be downloaded from the following Web site:

http://www.compaq.com/alphaserver/linux

You need V5.6-3 or higher of the SRM console to install Linux. If you have a lower version of the firmware, you will need to upgrade. For instructions and the latest firmware images, see the following URL.

http://ftp.digital.com/pub/DEC/Alpha/firmware/

Linux Boot Procedure

1. Power up the system to the SRM console and enter the show version command to verify the firmware version.

P00>>> show version
version                 V5.6-3 June 15 2001 08:36:11
P00>>>

2. Enter the show device command to determine the unit number of the drive for your boot device, in this case dka0.0.0.17.0.

P00>>> sh dev
dka0.0.0.17.0      DKA0      COMPAQ BD018122C9    B016
dka200.2.0.7.1     DKA200    COMPAQ BD018122C9    B016
dqa0.0.0.105.0     DQA0      CD-224E              9.5B
dva0.0.0.0.0       DVA0
ewa0.0.0.9.0       EWA0      00-00-F8-1B-9C-47
pka0.7.0.7.1       PKA0      SCSI Bus ID 7
pkb0.7.0.6.0       PKB0      SCSI Bus ID 7
pkc0.7.0.106.0     PKC0      SCSI Bus ID 7
P00>>>

3. After installing Linux, set boot environment variables appropriately for your installation. The typical values indicating booting from dka0 with the first aboot.conf entry are shown in this example.
P00>>> set bootdef_dev dka0
P00>>> set boot_file
P00>>> set boot_osflags 0
P00>>> show boot*
boot_dev                dka0.0.0.17.0
boot_file
boot_osflags            0
boot_reset              OFF
bootdef_dev
booted_dev
booted_file
booted_osflags

4. From SRM enter the boot command. The following example shows abbreviated boot output.

Example 6–5 Linux Boot Output

This example shows messages similar to what you will see when booting Linux. The example is from a RedHat Linux 7.0 boot.

>>> boot
(boot dka0.0.0.8.0 -flags 0)
block 0 of dka0.0.0.8.0 is a valid boot block
reading 163 blocks from dka0.0.0.8.0
bootstrap code read in
base = 2d4000, image_start = 0, image_bytes = 14600
initializing HWRPB at 2000
initializing page table at 7fff0000
initializing machine state
setting affinity to the primary CPU
jumping to bootstrap code
aboot: Linux/Alpha SRM bootloader version 0.7
aboot: switching to OSF/1 PALcode version 1.87
aboot: booting from device 'SCSI 0 8 0 0 0 0 0'
aboot: valid disklabel found: 3 partitions.
aboot: loading uncompressed vmlinuz-2.4.3-7privateer2smp...
aboot: loading compressed vmlinuz-2.4.3-7privateer2smp...
aboot: zero-filling 369720 bytes at 0xfffffc0000ce9400
aboot: starting kernel vmlinuz-2.4.3-7privateer2smp with arguments root=/dev/sda2 console=ttyS0
Linux version 2.4.3-7privateer2smp (root@privateer) (gcc version 2.96 20000731 (Red Hat Linux 7.1 2.96-85)) #1 SMP Thu May 24 11:01:14 EDT 2001
Booting GENERIC on Titan variation Privateer using machine vector PRIVATEER from SRM
Command line: root=/dev/sda2 console=ttyS0
memcluster 0, usage 1, start 0, end 362
memcluster 1, usage 0, start 362, end 262135
memcluster 2, usage 1, start 262135, end 262144
freeing pages 362:1024
freeing pages 1700:262135
SMP: 4 CPUs probed -- cpu_present_mask = f
On node 0 totalpages: 262144
zone(0): 262144 pages.
zone(1): 0 pages.
zone(2): 0 pages.
Kernel command line: root=/dev/sda2 console=ttyS0
Using epoch = 1900
Console: colour dummy device 80x25
Calibrating delay loop...
1993.00 BogoMIPS
Memory: 2044536k/2097080k available (2321k kernel code, 41456k reserved, 2133k data, 432k init)
Dentry-cache hash table entries: 262144 (order: 9, 4194304 bytes)
Buffer-cache hash table entries: 131072 (order: 7, 1048576 bytes)
Page-cache hash table entries: 262144 (order: 9, 4194304 bytes)
Inode-cache hash table entries: 131072 (order: 8, 2097152 bytes)
VFS: Diskquotas version dquot_6.5.0 initialized
POSIX conformance testing by UNIFIX
Using heuristic of 2147483647 cycles.
SMP starting up secondaries.
Calibrating delay loop... 1997.12 BogoMIPS
Calibrating delay loop... 1997.12 BogoMIPS
Calibrating delay loop... 1993.00 BogoMIPS
SMP: Total of 4 processors activated (7987.49 BogoMIPS).
got res[8000:80ff] for resource 0 of Symbios Logic Inc. (formerly NCR) 53c895
got res[8400:843f] for resource 1 of Intel Corporation 82557
.
.
.
autorun ...
... autorun DONE.
NET4: Linux TCP/IP 1.0 for NET4.0
IP Protocols: ICMP, UDP, TCP, IGMP
IP: routing cache hash table of 16384 buckets, 256Kbytes
TCP: Hash tables configured (established 524288 bind 65536)
Linux IP multicast router 0.06 plus PIM-SM
NET4: Unix domain sockets 1.0/SMP for Linux NET4.0.
VFS: Mounted root (ext2 filesystem) readonly.
Freeing unused kernel memory: 432k freed
.
.
.
login:

Chapter 7 Using the Remote Management Console

You can manage the system through the remote management console (RMC). The RMC is implemented through an independent microprocessor that resides on the system motherboard. The RMC also provides access to the repository for all error information in the system.

This chapter explains the operation and use of the RMC.
Sections are:

• RMC Overview
• Operating Modes
• Terminal Setup
• Connecting to the RMC CLI
• SRM Environment Variables for COM1
• RMC Command-Line Interface
• Resetting the RMC to Factory Defaults
• Troubleshooting Tips

7.1 RMC Overview

The remote management console provides a mechanism for monitoring the system (voltages, temperatures, and fans) and manipulating it on a low level (reset, power on/off, halt). It also provides functionality to read and write configuration and error log information to FRU error log devices.

The RMC performs monitoring and control functions to ensure the successful operation of the system.

• Monitors thermal sensors on the CPUs, the PCI backplane, and the power supplies
• Monitors voltages, power supplies, and fans
• Handles hot swap of power supplies and fans
• Controls the operator control panel (OCP) display and writes status messages on the display
• Detects alert conditions such as excessive temperature, fan failure, and power supply failure. On detection, RMC displays messages on the OCP, pages an operator, and sends an interrupt to SRM, which then passes the interrupt to the operating system or an application.
• Shuts down the system if any fatal conditions exist. For example:
    The temperature reaches the failure limit.
    The cover to the system card cage is removed.
    The main fan (Fan 6) and the redundant fan (Fan 5) fail.
• Retrieves and passes information about a system shutdown to SRM at the next power-up. SRM displays a message regarding the last shutdown.
• Provides a command-line interface (CLI) for the user to control the system. From the CLI you can power the system on and off, halt or reset the system, and monitor the system environment.
• Passes error log information to the DPR so that this information can be accessed by the system.
• Retrieves information from the DPR and stores it in FRU EEPROMs.
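The shutdown examples in the list above can be written down as a small decision function. This Python sketch is an illustration only: the EnvStatus fields and the fatal_conditions function are hypothetical names, the three checks are just the examples the list gives (the manual presents them as examples, not a complete set), and nothing here reflects the actual RMC firmware.

```python
from dataclasses import dataclass


@dataclass
class EnvStatus:
    """Hypothetical snapshot of the conditions named in the list above."""
    temperature_at_failure_limit: bool = False
    card_cage_cover_removed: bool = False
    fan6_failed: bool = False   # main fan
    fan5_failed: bool = False   # redundant fan


def fatal_conditions(status: EnvStatus) -> list:
    """Return the reasons, if any, that would call for a system shutdown."""
    reasons = []
    if status.temperature_at_failure_limit:
        reasons.append("temperature reached the failure limit")
    if status.card_cage_cover_removed:
        reasons.append("system card cage cover removed")
    # A single fan failure is an alert; both the main and redundant fan
    # failing is the fatal case given in the text.
    if status.fan6_failed and status.fan5_failed:
        reasons.append("main fan (Fan 6) and redundant fan (Fan 5) failed")
    return reasons


if __name__ == "__main__":
    print(fatal_conditions(EnvStatus(fan6_failed=True)))
    print(fatal_conditions(EnvStatus(fan6_failed=True, fan5_failed=True)))
```

The point of the fan check is the distinction the list draws between alert conditions, which are reported, and fatal conditions, which shut the system down.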
The RMC logic is implemented using an 8-bit microprocessor, PIC17C44, as the primary control device. The firmware code is resident within the microprocessor and in flash memory. If the RMC firmware should ever become corrupted or obsolete, you can update it manually using the Loadable Firmware Update Utility. See Chapter 3 for details. The microprocessor can also communicate with the system power control logic to turn on or turn off power to the rest of the system.

The RMC is powered by an auxiliary 5V supply. You can gain access to the RMC as long as AC power is available to the system (through an AC outlet). Thus, if the system fails, you can still access the RMC and gather error/fault information about the failure.

DPR Error Repository

The RMC manages an extensive network of FRU I2C EEPROMs. Information from these EEPROMs is stored in dual-port RAM (DPR), a shared RAM that facilitates interaction between the RMC and the system, and can be accessed to diagnose hardware failures.

At system power-up, the RMC reads 256 bytes of data from each FRU EEPROM and stores it in the DPR. The EEPROM data contains information on configuration and errors. The data is accessible through the TIG chip on the system motherboard. As one of its functions, the TIG provides interfaces for the firmware and the operating system to communicate with the server management logic.

The data accessed from DPR provides configuration information to the firmware during start-up. Remote or local applications can read the DPR system error and configuration repository. The error log information is written to the DPR by an error handling agent and then written back to the EEPROMs by the RMC. This arrangement ensures that the error log is available on a FRU after power has been lost.

The RMC console provides several commands for accessing error information in the DPR. See Section 7.6.
Compaq Analyze, described in Chapter 5, can access the FRU EEPROM error logs to provide diagnostic information for system FRUs. Using the Remote Management Console 7-3 7.2 Operating Modes The RMC can be configured to manage different data flow paths defined by the com1_mode environment variable. In Through mode (the default), all data and control signals flow from the system COM1 port through the RMC to the active external port. You can also set bypass modes so that the signals partially or completely bypass the RMC. The com1_mode environment variable can be set from either SRM or the RMC. See Section 7.6.1. Figure 7–1 Data Flow in Through Mode System SRM Console Operating System DUART COM1 COM1 Port UART RMC PIC Processor Modem Port UART RMC Modem Port (Remote) UART Modem Modem RMC> Remote Serial Terminal or Terminal Emulator RMC COM1 Port (Local) RMC> Local Serial Terminal (MMJ Port) PK0908C 7-4 ES45 Service Guide Through Mode Through mode is the default operating mode. The RMC routes every character of data between the internal system COM1 port and the active external port, either the local COM1 serial port (MMJ) or the 9-pin modem port. If a modem is connected, the data goes to the modem. The RMC filters the data for a specific escape sequence. If it detects the escape sequence, it connects to the RMC CLI. Figure 7–1 illustrates the data flow in Through mode. The internal system COM1 port is connected to one port of the DUART chip, and the other port is connected to a 9-pin external modem port, providing full modem controls. The DUART is controlled by the RMC microprocessor, which moves characters between the two UART ports. The local MMJ port is always connected to the internal UART of the microprocessor. The escape sequence signals the RMC to connect to the CLI. Data issued from the CLI is transmitted between the RMC microprocessor and the active port that connects to the RMC CLI. 
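The escape-sequence filtering described for Through mode can be sketched as a byte-at-a-time scanner. This Python sketch is an illustration under stated assumptions, not the PIC firmware: the `EscapeDetector` class is hypothetical, the default sequence is taken to be Esc, Esc, "rmc", and matching is assumed to be exact (the guide does not say how letter case is handled).

```python
from collections import deque

# Assumed default sequence: Esc, Esc, then the letters "rmc".
DEFAULT_ESCAPE = b"\x1b\x1brmc"


class EscapeDetector:
    """Watches a character stream for an escape sequence."""

    def __init__(self, sequence: bytes = DEFAULT_ESCAPE):
        if not sequence:
            raise ValueError("escape sequence must not be empty")
        self.sequence = sequence
        # Sliding window holding only the most recent len(sequence) bytes.
        self.window = deque(maxlen=len(sequence))

    def feed(self, byte: int) -> bool:
        """Record one byte; return True if it completes the sequence."""
        self.window.append(byte)
        return bytes(self.window) == self.sequence


def first_match(data: bytes, detector: EscapeDetector) -> int:
    """Return the index of the byte that completes the sequence, or -1."""
    for i, b in enumerate(data):
        if detector.feed(b):
            return i
    return -1
```

A sliding window is used instead of a simple match counter so that an extra leading Esc (for example Esc, Esc, Esc, "rmc") is still recognized. What the RMC does with the characters once the sequence is seen (forwarding or suppressing them) depends on the COM1 mode and is not modeled here.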
NOTE: The internal system COM1 port should not be confused with the external COM1 serial port on the back of the system. The internal COM1 port is used by the system software to send data either to the COM1 port on the system or to the RMC modem port if a modem is connected. Local Mode You can set a Local mode in which only the local channel can communicate with the system COM1 port. In Local mode the modem is prevented from sending characters to the system COM1 port, but you can still connect to the RMC CLI from the modem. Using the Remote Management Console 7-5 7.2.1 Bypass Modes For modem connection, you can set the operating mode so that data and control signals partially or completely bypass the RMC. The bypass modes are Snoop, Soft Bypass, and Firm Bypass. Figure 7–2 Data Flow in Bypass Mode System DUART SRM Console Operating System COM1 COM1 Port UART RMC PIC Processor Bypass Modem Port UART RMC Modem Port (Remote) UART Modem Modem RMC COM1 Port (Local) RMC> Remote Serial Terminal or Terminal Emulator RMC> Local Serial Terminal (MMJ Port) PK0908B 7-6 ES45 Service Guide Figure 7–2 shows the data flow in the bypass modes. Note that the internal system COM1 port is connected directly to the modem port. NOTE: You can connect a serial terminal to the modem port in any of the bypass modes. The local terminal is still connected to the RMC and can still connect to the RMC CLI to switch the COM1 mode if necessary. Snoop Mode In Snoop mode data partially bypasses the RMC. The data and control signals are routed directly between the system COM1 port and the external modem port, but the RMC taps into the data lines and listens passively for the RMC escape sequence. If it detects the escape sequence, it connects to the RMC CLI. The escape sequence is also passed to the system on the bypassed data lines. If you decide to change the default escape sequence, be sure to choose a unique sequence so that the system software does not interpret characters intended for the RMC. 
In Snoop mode the RMC is responsible for configuring the modem for dial-in as well as dial-out alerts and for monitoring the modem connectivity. Because data passes directly between the two UART ports, Snoop mode is useful when you want to monitor the system but also ensure optimum COM1 performance. Soft Bypass Mode In Soft Bypass mode all data and control signals are routed directly between the system COM1 port and the external modem port, and the RMC does not listen to the traffic on the COM1 data lines. The RMC is responsible for configuring the modem and monitoring the modem connectivity. If the RMC detects loss of carrier or the system loses power, it switches automatically into Snoop mode. If you have set up the dial-out alert feature, the RMC pages the operator if an alert is detected and the modem line is not in use. Soft Bypass mode is useful if management applications need the COM1 channel to perform a binary download, because it ensures that RMC does not accidentally interpret some binary data as the escape sequence. Using the Remote Management Console 7-7 After downloading binary files, you can set the com1_mode environment variable from the SRM console to switch back to Snoop mode or other modes for accessing the RMC, or you can hang up the current modem session and reconnect it. Firm Bypass Mode In Firm Bypass mode all data and control signals are routed directly between the system COM1 port and the external modem port. The RMC does not configure or monitor the modem. Firm Bypass mode is useful if you want the system, not the RMC, to fully control the modem port and you want to disable RMC remote management features such as remote dial-in and dial-out alert. You can switch to other modes by resetting the com1_mode environment variable from the SRM console, but you must then set up the RMC again from the local terminal. 7-8 ES45 Service Guide 7.3 Terminal Setup You can use the RMC from a modem hookup or the serial terminal connected to the system. 
As shown in Figure 7–3, a modem is connected to the dedicated 9-pin modem port and a terminal is connected to the COM1 serial port/terminal port (MMJ).

Figure 7–3 Terminal Setup for RMC (Tower View)

7.4 Connecting to the RMC CLI

You type an escape sequence to connect to the RMC CLI. You can connect to the CLI from any of the following: a modem, the local serial console terminal, the local VGA monitor, or the system. The “system” includes the operating system, SRM, or an application.

• You can connect to the RMC CLI from the local terminal regardless of the current operating mode.
• You can connect to the RMC CLI from the modem if the RMC is in Through mode, Snoop mode, or Local mode. In Snoop mode the escape sequence is passed to the system and displayed.

NOTE: Only one RMC CLI session can be active at a time.

Connecting from a Serial Terminal

Invoke the RMC CLI from a serial terminal by typing the following default escape sequence:

^[^[rmc

This sequence is equivalent to typing Ctrl/left bracket, Ctrl/left bracket, rmc. On some keyboards, the Esc key functions like the Ctrl/left bracket combination.

To exit, enter the quit command. This action returns you to whatever you were doing before you invoked the RMC CLI. In the following example, the quit command returns you to the system COM1 port.

RMC> quit
Returning to COM port

Connecting from the Local VGA Monitor

To connect to the RMC CLI from the local VGA monitor, the console environment variable must be set to graphics and the SRM console must be running. Invoke the SRM console and enter the rmc command.

P00>>> rmc
You are about to connect to the Remote Management Console.
Use the RMC reset command or press the front panel reset button to disconnect and to reload the SRM console.
Do you really want to continue? [y/(n)] y
Please enter the escape sequence to connect to the Remote Management Console.
After you enter the escape sequence, the system connects to the CLI and the RMC> prompt is displayed. When the RMC CLI session is completed, reset the system with the Reset button on the operator control panel or issue the RMC reset command. RMC> reset Returning to COM port Using the Remote Management Console 7-11 7.5 SRM Environment Variables for COM1 Several SRM environment variables allow you to set up the COM1 serial port (MMJ) for use with the RMC. You may need to set the following environment variables from the SRM console, depending on how you decide to set up the RMC. com1_baud Sets the baud rate of the COM1 serial port and the modem port. The default is 9600. com1_flow Specifies the flow control on the serial port. The default is software. com1_mode Specifies the COM1 data flow paths so that data either flows through the RMC or bypasses it. This environment variable can be set from either the SRM or the RMC. com1_modem Specifies to the operating system whether or not a modem is present. 7-12 ES45 Service Guide 7.6 RMC Command-Line Interface The remote management console supports setup commands and commands for managing the system. The RMC commands are listed below. clear {alert, port} dep disable {alert, remote} dump enable {alert, remote} env halt {in, out} hangup help or ? power {on, off} quit reset send alert set {alert, com1_mode, dial, escape, init, logout, password, user} status The commands for setting up and using the RMC are described in the following sections. The dep command is reserved. For an RMC commands reference, see the Owner’s Guide. Using the Remote Management Console 7-13 Command Conventions Observe the following conventions for entering RMC commands: • Enter enough characters to distinguish the command. NOTE: The reset and quit commands are exceptions. You must enter the entire string for these commands to work. • For commands consisting of two words, enter the entire first word and at least one letter of the second word. 
For example, you can enter disable a for disable alert.
• For commands that have parameters, you are prompted for the parameter.
• Use the Backspace key to erase input.
• If you enter a nonexistent command or a command that does not follow conventions, the following message is displayed:
*** ERROR - unknown command ***
• If you enter a string that exceeds 14 characters, the following message is displayed:
*** ERROR - overflow ***

7.6.1 Defining the COM1 Data Flow

Use the set com1_mode command from SRM or RMC to define the COM1 data flow paths. You can set com1_mode to one of the following values:

through        All data passes through RMC and is filtered for the escape sequence. This is the default.
snoop          Data partially bypasses RMC, but RMC taps into the data lines and listens passively for the escape sequence.
soft_bypass    Data bypasses RMC, but RMC switches automatically into Snoop mode if loss of carrier occurs.
firm_bypass    Data bypasses RMC. RMC remote management features are disabled.
local          Changes the focus of the COM1 traffic to the local MMJ port if RMC is currently in one of the bypass modes or is in Through mode with an active remote session.

Example 7–1 set com1_mode

RMC> set com1_mode
Com1_mode (THROUGH, SNOOP, SOFT_BYPASS, FIRM_BYPASS, LOCAL): local

7.6.2 Displaying the System Status

The RMC status command displays the current RMC settings. Table 7–1 explains the status fields.
Example 7–2 status

RMC> status
PLATFORM STATUS
On-Chip Firmware Revision: V1.0
Flash Firmware Revision: V1.2
Server Power: ON
System Halt: Deasserted
RMC Power Control: ON
Escape Sequence: ^[^[RMC
Remote Access: Enabled
RMC Password: set
Alert Enable: Disabled
Alert Pending: YES
Init String: AT&F0E0V0X0S0=2
Dial String: ATXDT9,15085553333
Alert String: ,,,,,,5085553332#;
Com1_mode: THROUGH
Last Alert: CPU door opened
Logout Timer: 20 minutes
User String:

Table 7–1 Status Command Fields

Field                        Meaning
On-Chip Firmware Revision:   Revision of RMC firmware on the microcontroller.
Flash Firmware Revision:     Revision of RMC firmware in flash ROM.
Server Power:                ON = System is on. OFF = System is off.
System Halt:                 Asserted = System has been halted. Deasserted = Halt has been released.
RMC Power Control:           ON = System has powered on from RMC. OFF = System has powered off from RMC.
Escape Sequence:             Current escape sequence for access to RMC console.
Remote Access:               Enabled = Modem for remote access is enabled. Disabled = Modem for remote access is disabled.
RMC Password:                Set = Password set for modem access. Not set = No password set for modem access.
Alert Enable:                Enabled = Dial-out enabled for sending alerts. Disabled = Dial-out disabled for sending alerts.
Alert Pending:               YES = Alert has been triggered. NO = No alert has been triggered.
Init String:                 Initialization string that was set for modem.
Dial String:                 Pager string to be dialed when an alert occurs.
Alert String:                Identifies the system that triggered the alert to the paging service. Usually the phone number of the monitored system.
Com1_mode:                   Identifies the current COM1 mode.
Last Alert:                  Type of alert (for example, power supply 1 failed).
Logout Timer:                The amount of time before the RMC terminates an inactive modem connection. The default is 20 minutes.
User String:                 Notes supplied by user.

7.6.3 Displaying the System Environment

The RMC env command provides a snapshot of the system environment.

Example 7–3 env

RMC> env
System Hardware Monitor
Temperature (warnings at 48.00C, power-off at 53.00C)
CPU0: 27.00C  CPU1: 28.00C  CPU2: 27.00C  CPU3: 28.00C
Zone0: 26.00C  Zone1: 28.00C  Zone2: 26.00C
Fan RPM
Fan1: 2149  Fan2: 2177  Fan3: 2136  Fan4: 2163  Fan5: OFF  Fan6: 2033
Power Supply(OK, FAIL, OFF, '----' means not present)
PS0 : OK  PS1 : OK  PS2 : OK
CPU0: OK  CPU1: OK  CPU2: OK  CPU3: OK
CPU CORE voltage
CPU0: +1.640V  CPU1: +1.640V  CPU2: +1.640V  CPU3: +1.630V
CPU IO voltage
CPU0: +1.640V  CPU1: +1.640V  CPU2: +1.640V  CPU3: +1.630V
CPU CACHE voltage
CPU0: +2.444V  CPU1: +2.405V  CPU2: +2.431V  CPU3: +2.418V
Bulk voltage
+3.3V Bulk: +3.213V  +5V Bulk: +4.888V  +12V Bulk: +11.907V
Vterm: +1.580V  Cterm: +1.580V  -12V Bulk: -11.466V
+2.5V Bulk: +2.457V  +1.5V Bulk: ----
RMC>

• CPU temperature. In this example four CPUs are present.
• Temperature of PCI backplane: Zone 0 includes PCI slots 1–3, Zone 1 includes PCI slots 7–10, and Zone 2 includes PCI slots 4–6.
• Fan RPM. With the exception of Fan 5, all fans are powered as long as the system is powered on. Fan 5 is OFF unless Fan 6 fails.
• The normal power supply status is either OK (system is powered on) or OFF (system is powered off or the power supply cord is not plugged in). FAIL indicates a problem with a supply.
• CPU CORE voltage and CPU I/O voltage. In a healthy system, the core voltage for all CPUs should be the same, and the I/O voltage for all CPUs should be the same.
• Bulk power supply voltage.
• The Vterm and Cterm voltage regulators are located on the system motherboard.

7.6.4 Dumping DPR Data

The dump command dumps unformatted data from DPR locations 0–3FFF hex. The information might be useful for system troubleshooting. Use the DPR address table in Appendix C to analyze the data.
Example 7–4 dump RMC> dump Address: 10 Count: ee 0010:03 31 07 28 01 09 00 00 00 00 00 00 00 00 00 00 0020:00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 0030:00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 0040:01 80 01 01 01 01 01 01 00 00 00 00 00 00 00 00 0050:00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 0060:00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 0070:00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 0080:00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 0090:00 00 00 00 00 00 00 00 00 00 1D 00 19 18 19 00 00A0:00 00 00 00 00 00 00 00 00 00 00 FF FF FA FA 3B 00B0:00 00 00 00 00 00 00 00 00 00 BA 00 00 00 00 00 00C0:00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00D0:00 00 00 00 00 00 00 00 00 00 22 00 00 00 00 00 00E0:00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00F0:00 00 00 00 00 00 00 00 00 10 00 00 00 0A 03 0A RMC> 7-20 ES45 Service Guide DPR address Number of bytes dumped (in hex). In the example the dump command dumps EF bytes from address 10. Bytes 10:15 are the time stamp. See Appendix C for the meaning of other locations. The dump command allows you to dump data from the DPR. You can use this command locally or remotely if you are not able to access the SRM console because of a system crash. The dump command accepts two arguments: Address: Prompts for the starting address Count: Prompts for the number of following consecutive bytes. If no count is specified, the count defaults to 0. Using the Remote Management Console 7-21 7.6.5 Power On and Off, Reset, and Halt The RMC power {on, off}, halt {in, out}, and reset commands perform the same functions as the buttons on the operator control panel. Power On and Power Off The RMC power on command powers the system on, and the power off command powers the system off. The Power button on the OCP, however, has precedence. • If the system has been powered off with the Power button, the RMC cannot power the system on. 
If you enter the power on command, the message “Power button is OFF” is displayed, indicating that the command will have no effect. • If the system has been powered on with the Power button, and the power off command is used to turn the system off, you can toggle the Power button to power the system back on. When you issue the power on command, the terminal exits RMC and reconnects to the server’s COM1 port. Example 7–5 power on/off RMC> power on Returning to COM port RMC> power off 7-22 ES45 Service Guide Halt In and Halt Out The halt in command halts the system. The halt out command releases the halt. When you issue either the halt in or halt out command, the terminal exits RMC and reconnects to the server's COM1 port. Example 7–6 halt in/out RMC> halt in Returning to COM port RMC> halt out Returning to COM port The halt out command cannot release the halt if the Halt button is latched in. If you enter the halt out command, the message “Halt button is IN” is displayed, indicating that the command will have no effect. Toggling the Power button on the operator control panel overrides the halt in condition. Reset The RMC reset command restarts the system. The terminal exits RMC and reconnects to the server’s COM1 port. Example 7–7 reset RMC> reset Returning to COM port Using the Remote Management Console 7-23 7.6.6 Configuring Remote Dial-In Before you can dial in through the RMC modem port or enable the system to call out in response to system alerts, you must configure RMC for remote dial-in. Connect your modem to the 9-pin modem port and turn it on. Connect to the RMC CLI from either the local serial terminal or the local VGA monitor to set up the parameters. Example 7–8 Dial-In Configuration RMC> set password RMC Password: **** Verification: **** RMC> set init Init String: AT&F0E0V0X0S0=2 RMC> enable remote RMC> status . . Remote Access: Enabled . . . 7-24 ES45 Service Guide Sets the password that is prompted for at the beginning of a modem session. 
The string cannot exceed 14 characters and is not case sensitive. For security, the password is not echoed on the screen. When prompted for verification, type the password again. Sets the initialization string. The string is limited to 31 characters and can be modified depending on the type of modem used. Because the modem commands disallow mixed cases, the RMC automatically converts all alphabetic characters entered in the init string to uppercase. The RMC automatically configures the modem's flow control according to the setting of the SRM com1_flow environment variable. The RMC also enables the modem carrier detect feature to monitor the modem connectivity. Enables remote access to the RMC modem port by configuring the modem with the setting stored in the initialization string. Verifies the settings. Check that the Remote Access field is set to Enabled. Dialing In The following example shows the screen output when a modem connection is established. ATDT915085553333 RINGING RINGING CONNECT 9600/ARQ/V32/LAPM RMC Password: ********* Welcome to RMC V1.2 P00>>> ^[^[rmc RMC> 1. At the RMC> prompt, enter commands to monitor and control the remote system. 2. When you have finished a modem session, enter the hangup command to cleanly terminate the session and disconnect from the server. Using the Remote Management Console 7-25 7.6.7 Configuring Dial-Out Alert When you are not monitoring the system from a modem connection, you can use the RMC dial-out alert feature to remain informed of system status. If dial-out alert is enabled, and the RMC detects alarm conditions within the managed system, it can call a preset pager number. You must configure remote dial-in for the dial-out feature to be enabled. See Section 7.6.6. To set up the dial-out alert feature, connect to the RMC CLI from the local serial terminal or local VGA monitor. 
Example 7–9 Dial-Out Alert Configuration RMC> set dial Dial String: ATXDT9,15085553333 RMC> set alert Alert String: ,,,,,,5085553332#; RMC> enable alert RMC> clear alert RMC> send alert Alert detected! RMC> clear alert RMC> status . . Alert Enable: Enabled . . A typical alert situation might be as follows: • The RMC detects an alarm condition, such as over temperature warning. • The RMC dials your pager and sends a message identifying the system. • You dial the system from a remote serial terminal. • You connect to the RMC CLI, check system status with the env command, and, if the situation requires, power down the managed system. • When the problem is resolved, you power up and reboot the system. 7-26 ES45 Service Guide The elements of the dial string and alert string are shown in Table 7–2. Paging services vary, so you need to become familiar with the options provided by the paging service you will be using. The RMC supports only numeric messages. Sets the string to be used by the RMC to dial out when an alert condition occurs. The dial string must include the appropriate modem commands to dial the number. Sets the alert string, typically the phone number of the modem connected to the remote system. The alert string is appended after the dial string, and the combined string is sent to the modem when an alert condition is detected. Enables the RMC to page a remote system operator. Clears any alert that may be pending. This ensures that the send alert command will generate an alert condition. Forces an alert condition. This command is used to test the setup of the dial-out alert function. It should be issued from the local serial terminal or local VGA monitor. As long as no one connects to the modem and there is no alert pending, the alert will be sent to the pager immediately. If the pager does not receive the alert, re-check your setup. Clears the current alert so that the RMC can capture a new alert. The last alert is stored until a new event overwrites it. 
The Alert Pending field of the status command becomes NO after the alert is cleared. Verifies the settings. Check that the Alert Enable field is set to Enabled. NOTE: If you do not want dial-out paging enabled at this time, enter the disable alert command after you have tested the dial-out alert function. Alerts continue to be logged, but no paging occurs. Using the Remote Management Console 7-27 Table 7–2 Elements of Dial String and Alert String Dial String The dial string is case sensitive. The RMC automatically converts all alphabetic characters to uppercase. ATXDT AT = Attention. X = Forces the modem to dial “blindly” (not seek the dial tone). Enter this character if the dial-out line modifies its dial tone when used for services such as voice mail. D = Dial T = Tone (for touch-tone) 9, The number for an outside line (in this example, 9). Enter the number for an outside line if your system requires it. , = Pause for 2 seconds. 15085553333 Phone number of the paging service. Alert String ,,,,,, Each comma (,) provides a 2-second delay. In this example, a delay of 12 seconds is set to allow the paging service to answer. 5085553332# A call-back number for the paging service. The alert string must be terminated by the pound (#) character. ; A semicolon (;) must be used to terminate the entire string. 7-28 ES45 Service Guide 7.6.8 Resetting the Escape Sequence The RMC set escape command sets a new escape sequence. The new escape sequence can be any character string, not to exceed 14 characters. A typical sequence consists of two or more control characters. It is recommended that control characters be used in preference to ASCII characters. Use the status command to verify the new escape sequence before exiting the RMC. The following example consists of two instances of the Esc key and the letters “FUN.” The “F” is not displayed when you set the sequence because it is preceded by the escape character. Enter the status command to see the new escape sequence. 
Example 7–10 set escape RMC> set escape Escape Sequence: un RMC> status . . . Escape Sequence: ^[^[FUN CAUTION: Be sure to record the new escape sequence. Restoring the default sequence requires moving a jumper on the system motherboard. Using the Remote Management Console 7-29 7.7 Resetting the RMC to Factory Defaults If the non-default RMC escape sequence has been lost or forgotten, RMC must be reset to factory settings to restore the default escape sequence. Figure 7–4 RMC Jumpers (Default Positions) 1 2 3 J7 J6 J5 J4 1 2 J3 J2 J1 PK0211A NOTE: J1, J2, and J3 are reserved. 7-30 ES45 Service Guide The following procedure restores the default settings: 1. Shut down the operating system and press the Power button on the operator control panel to the OFF position. 2. Unplug the power cord from each power supply. Wait until the +5V Aux LEDs on the power supplies go off before proceeding. 3. Remove enclosure panels as described in Chapter 8. 4. Remove the system card cage cover and fan cover from the system chassis, as described in Chapter 8. 5. Remove CPU 1 as described in Chapter 8. 6. On the system motherboard, install jumper J6 over pins 1 and 2. See Figure 7–4. (The default jumper positions are shown.) 7. Plug a power cord into one power supply and wait for the control panel to display the message “System is down.” 8. Unplug the power cord. Wait until the +5V Aux LED on the power supply goes off before proceeding. 9. Move J6 from pins 1 and 2 and install it over pins 2 and 3. 10. Reinstall CPU 1, the card cage cover and fan cover, and the enclosure panels. 11. Plug the power cord into each of the power supplies. NOTE: After the RMC has been reset to defaults, perform the setup procedures to enable remote dial-in and call-out alerts. See Section 7.6.6. Using the Remote Management Console 7-31 7.8 Troubleshooting Tips Table 7–3 lists possible causes and suggested solutions for symptoms you might see. 
Table 7–3 RMC Troubleshooting Symptom Possible Cause Suggested Solution You cannot connect to the RMC CLI from the modem. The RMC may be in Soft Bypass or Firm Bypass mode. Issue the show com1_mode command from SRM and change the setting if necessary. If in Soft Bypass mode, you can disconnect the modem session and reconnect it. The terminal cannot communicate with the RMC correctly. System and terminal baud rates do not match. Set the baud rate for the terminal to be the same as for the system. For first-time setup, suspect the console terminal, since the RMC and system default baud is 9600. RMC will not answer when the modem is called. Modem cables may be incorrectly installed. Check modem phone lines and connections. RMC remote access is disabled or the modem was power cycled since last being initialized. From the local serial terminal or VGA monitor, enter the set password and set init commands, and then enter the enable remote command. The modem is not configured correctly. Modify the modem initialization string according to your modem documentation. Continued on next page 7-32 ES45 Service Guide Table 7–3 RMC Troubleshooting (Continued) Symptom Possible Cause Suggested Solution RMC will not answer when modem is called. (continued from previous page) On AC power-up, RMC defers initializing the modem for 30 seconds to allow the modem to complete its internal diagnostics and initializations. Wait 30 seconds after powering up the system and RMC before attempting to dial in. After the system is powered up, the COM1 port seems to hang or you seem to be unable to execute RMC commands. There is a normal delay while the RMC completes the system power-on sequence. Wait about 40 seconds. New escape sequence is forgotten. RMC console must be reset to factory defaults. During a remote connection, you see a “+++” string on the screen. The modem is confirming whether the modem has really lost carrier. This is normal behavior. 
The message “unknown command” is displayed when you enter a carriage return by itself. The terminal or terminal emulator is including a line feed character with the carriage return. Change the terminal or terminal emulator setting so that “new line” is not selected.

Chapter 8 FRU Removal and Replacement

This chapter describes the procedures for removing and replacing FRUs on ES45 systems. Unless otherwise specified, install a FRU by reversing the steps shown in the removal procedures.

WARNING: To prevent injury, access is limited to persons who have appropriate technical training and experience. Such persons are expected to understand the hazards of working within this equipment and take measures to minimize danger to themselves or others. These measures include:
1. Remove any jewelry that may conduct electricity.
2. If accessing the system card cage, power down the system and wait 2 minutes to allow components to cool.
3. Wear an anti-static wrist strap when handling internal components.

NOTE: If you are installing or replacing CPU cards, memory DIMMs, or PCI cards, become familiar with the location of the card slots and configuration rules. See Chapter 6.

CAUTION: Static electricity can damage integrated circuits. Always use a grounded wrist strap (29-26246) and grounded work surface when working with internal parts of a computer system. Remove jewelry before working on internal parts of the system.

IMPORTANT! After you have replaced FRUs and determined that the system has been restored to its normal operating condition, you must clear the system error information repository (error information logged to the DPR). Use the clear_error all command to clear all errors logged in the FRU EEPROMs and to initialize the central error repository. See Section 4.4 for details on clear_error.

8.1 FRUs

Table 8–1 lists the FRUs by part number and description.
Figure 8–1 and Figure 8–2 show the location of FRUs in the pedestal/rack systems.

Table 8–1 FRU List

Part #         Description

Cables
17-04787-01    Power and signal harness assembly
17-04785-01    Fan harness assembly
17-04786-01    Sensor cable harness assembly
17-03971-07    OCP cable assembly
17-04867-01    68-conductor SCSI cable (six drive cage)
17-04009-02    68-pin to 50-pin adapter cable (SCSI removable media)
17-03970-04    Floppy cable assembly
17-04400-06    Junk I/O connector cable
17-04705-03    SCSI removable media device to PCI card SCSI controller
17-03971-11    10-pin storage subsystem management cable (30 inch)
17-05042-01    PCI hot swap module to PCI backplane
17-05021-01    IDE cable (CD ROM)

Fans
70-40074-01    Fan assembly, 172 mm, Fan 6
70-40073-01    Fan assembly, 120 mm, Fans 1 and 2
70-40073-02    Fan assembly, 120 mm, Fan 5
70-40072-01    Fan assembly, 120 mm, Fan 3
70-40071-01    Fan assembly, 120 mm, Fan 4

CPU Module
54-30466-03    EV68 CB LGA CPU 1 GHz with 8 MB L2 Cache

Memory DIMMs
20-01CBA-09    128 MB Mono 200 pin Sync DIMM 133 MHz
20-01DBA-09    256 MB Mono 200 pin Sync DIMM 133 MHz
20-01EBA-09    512 MB Mono 200 pin Sync DIMM 133 MHz
20-L0FBA-09    1 GB Stacked 200 pin Sync DIMM 133 MHz

Other Modules and Components
70-33894-02    OCP
54-30414-02    PCI Hot swap module
54-30348-02    8-slot MMB for 200-pin DIMMs
54-30348-03    4-slot MMB for 200-pin DIMMs
70-31349-01    Speaker assembly
30-50802-01    Hard drive cage assembly, 6 slot, 1-in. universal drives
54-30292-02    System motherboard
54-25575-02    I/O connector module
54-30418-01    PCI backplane, 10-slot
54-30414-02    Switch / LED HP PCI lever (PCI hot swap module)
3R-A1629-AA    SCSI environmental module (NILE)
30-49448-01    Power supply, 720 Watts
SN-LKQ46-Ax    Keyboard, OpenVMS
SN-LKQ47-Ax    Keyboard, Tru64 UNIX
SN-PBQWS-WA    Mouse, 3-button
12-37977-02    Key for doors
3R-A2503-AA    CD-ROM drive, 40x half-height
3R-A2753-AA    Floppy drive

8.1.1 Power Cords

Tower enclosures ordered in North America include a 220 V power cord. Non-North American orders require one country-specific power cord. Pedestal systems ordered in North America include two 220 V power cords. Non-North American orders require two country-specific power cords. Table 8–2 lists the country-specific power cords for tower and pedestal systems.

Table 8–2 Country-Specific Power Cords

Power Cord     Country                      Length
BN26J-1K       North American 220 V         75 in.
3X-BN46F-02    Japan                        2.5 m
BN19H-2E       Australia, New Zealand       2.5 m
BN19C-2E       Central Europe               2.5 m
BN19A-2E       UK, Ireland                  2.5 m
BN19E-2E       Switzerland                  2.5 m
BN19K-2E       Denmark                      2.5 m
BN19M-2E       Italy                        2.5 m
BN19S-2E       Egypt, India, South Africa   2.5 m

8.1.2 FRU Locations

Figure 8–1 and Figure 8–2 show the location of FRUs in the pedestal and rackmount configurations.

Figure 8–1 FRUs, Front/Top (Pedestal/Rack View). Labeled parts: memory DIMMs, CPU cards, fans, OCP, PCI backplane, primary drive cage, floppy drive, secondary drive cage, CD-ROM drive.

Figure 8–2 FRUs, Rear (Pedestal/Rack View). Labeled parts: I/O connector module (junk I/O), speaker, power harness access cover, power supplies, system motherboard.

8.1.3 Important Information Before Replacing FRUs

The system must be shut down before you replace most FRUs. The exceptions are power supplies, individual fans, universal hard drives, and PCI cards in slots 4–10 (when the operating system supports this function).
After replacing FRUs you must clear the system error information repository with the SRM clear_error all command.

Tools

You need the following tools to remove or replace FRUs.
• Phillips #1 (10-inch) and #2 screwdrivers (magnetic screwdrivers are recommended)
• Allen wrench (3 mm)
• Anti-static wrist strap

Hot-Plug FRUs

The following are hot-plug FRUs. You can replace them while the system is operating.
• Power supplies
• Individual fans

Before Replacing Non Hot-Plug FRUs

Follow the procedure below before replacing non hot-plug FRUs. For universal disk drives, you must shut down the operating system, but you do not need to turn off system power.
1. Shut down the operating system.
2. Shut down power to external options, where appropriate.
3. Turn off power to the system.
4. Unplug the power cord from each power supply.

WARNING: To prevent injury, unplug the power cord from each power supply before installing components.

After Replacing FRUs

IMPORTANT! After you have replaced FRUs and have determined that the system has been restored to its normal operating condition, you must clear the system error information repository (error information logged to the DPR). Use the clear_error all command to clear all errors and initialize the central error repository. See Section 4.4 for details.

8.2 Removing Enclosure Panels

Figure 8–3 Enclosure Panel Removal (Tower)

To Remove Enclosure Panels from a Tower

The enclosure panels are secured by captive screws.
1. From the open position, lift up and away to remove the front door.
2. To remove the top panel, loosen the top left and top right screws. Slide the top panel back and lift it off the system.
3. To remove the left panel, loosen the screw at the top and the screw at the bottom. Slide the panel back and then tip it outward. Lift it off the system.
Figure 8–4 Enclosure Panel Removal (Pedestal)

To Remove Enclosure Panels from a Pedestal

The enclosure panels are secured by captive screws.
1. From the open position, lift up and away to remove the front door (the bottom door is removed in the same way).
2. Remove the top enclosure panel by loosening the captive screws shown in Figure 8–4. Slide the top panel back and lift it off the system.
3. To remove the right enclosure panel, loosen the captive screw shown in Figure 8–4. Slide the panel back and then tip it outward. Lift the panel from the three tabs.

8.3 Accessing the System Chassis in a Cabinet

In a rackmount system, the system chassis is mounted to slides.

WARNING: Pull out the stabilizer bar and extend the leveler foot to the floor before you pull out the system. This precaution prevents the cabinet from tipping over.

Figure 8–5 Accessing the Chassis in a Cabinet

WARNING:
1. Make sure that all other hardware in the cabinet is pushed in and attached.
2. The system is very heavy. Do not attempt to lift it manually. Use a material lift or other mechanical device.
3. The inner race must be moved forward prior to installing the system. Failure to do so may cause bodily harm.

1. Move the inner race all the way forward so that it is touching the tabs on both rails as shown in Figure 8–6.

Figure 8–6 Moving the Inner Race Forward

8.4 Removing Covers from the System Chassis

The system chassis has three covers: the fan cover, the system card cage cover, and the PCI card cage cover. Remove a cover by loosening the quarter-turn captive screw, pulling up on the ring, and sliding the cover from the system chassis.

WARNING: High current area. Currents exceeding 240 VA can cause burns or eye injury. Avoid contact with parts or remove power prior to access.
WARNING: Contact with moving fan can cause severe injury to fingers. Avoid contact or remove power prior to access.

To Gain Access to the System Chassis

1. Open the front door of the cabinet.
2. Pull out the stabilizer bar at the bottom of the cabinet until it stops.
3. Extend the leveler foot at the end of the stabilizer bar to the floor.
4. Snap out the front bezel.
5. Remove and set aside the two screws (one per side), if present, that secure the system to the cabinet.
6. Pull the system out until it locks.

Figure 8–7 and Figure 8–8 show the location and removal of covers on the tower and pedestal/rackmount systems, respectively. The numbers in the illustrations correspond to the following:
1. 3 mm Allen captive quarter-turn screw that secures each cover.
2. Spring-loaded ring that releases cover. Each cover has a ring.
3. Fan area cover. This area contains the 6.75-in. main system fan and a redundant fan.
4. System card cage cover. This area contains CPUs, memory DIMMs, MMBs, and system motherboard. To remove the system card cage cover, you must first remove the fan area cover. An interlock switch shuts the system down when you remove the system card cage cover.
5. PCI card cage cover. This area contains PCI cards, the PCI backplane, I/O connector assembly, and four fans.

Figure 8–7 Covers on the System Chassis (Tower)

Figure 8–8 Covers on the System Chassis (Pedestal/Rack)

8.5 Power Supply

Figure 8–9 Replacing or Adding a Power Supply

WARNING: Hazardous voltages are contained within the power supply. Do not attempt to service. Return to factory for service.

The power supply is a hot-plug component. As long as the system has a redundant supply, you can replace a supply while the system is running.

Replacing a Power Supply

1. Unplug the AC power cord.
2. Loosen the three Phillips screws that secure the power supply bracket. (Do not remove the screws.) Remove the bracket.
3. Loosen the captive screw on the latch and swing the latch to unlock the power supply.
4. Pull the power supply out of the system.

NOTE: When installing an additional supply, remove the screw and blank cover on the slot into which you are installing the supply.

Verification

1. Plug the AC power cord into the supply. Wait a few seconds for the POK LED to light.
2. Check that both power supply LEDs are lit.

8.6 Fans

Figure 8–10 Replacing Fans

The fans are hot-plug components. You can replace individual fans while the system is running.

WARNING: Contact with moving fan can cause severe injury to fingers. Avoid contact or remove power prior to access.

WARNING: High current area. Currents exceeding 240 VA can cause burns or eye injury. Avoid contact with parts or remove power prior to access.

Replacing Fans

Remove the cover from the fan area (fans 5 and 6) or the PCI card cage (fans 1, 2, 3, and 4).
1. Pull the pop-up latch to unlock it, and lift the fan out of the system. One fan has no pop-up latch and is held in place by another fan (see Figure 8–10).
2. Install the new fan, taking care to align it as it slides in. Press the pop-up latch to lock the fan in place.
3. Replace the cover to the fan area or the PCI card cage.

Verification (RMC)

1. Invoke the remote management console.
2. Enter the env command to verify the fan status.

8.7 Universal Hard Disk Drives

The system uses hot-pluggable universal hard disk drives. Hot-pluggable drives can be replaced without removing power from the system or interrupting the transfer of data over the SCSI bus.

WARNING: To prevent injury, access is limited to persons who have appropriate technical training and experience.
Such persons are expected to understand the hazards of working within this equipment and take measures to minimize danger to themselves or others.

Figure 8–11 Replacing or Adding a Hard Drive

Installing a Drive

1. Access the storage drive area and remove the drive blank for the next available slot. (Drives are installed left to right, SCSI ID 0–5.)
2. Insert the new drive into the cage and push it in while pivoting the release lever in toward the drive.
3. Push the release lever in until it engages the ejector button.

Replacing a Drive

1. Press the ejector button in and pivot the release lever to the open position.
2. Pull out on the drive until it is disconnected from the backplane connector.

CAUTION: Do not remove the drive while the disk is spinning.

3. When you are sure that the disk is no longer spinning, remove the drive from the enclosure.
4. Insert the replacement drive in until it is against the backplane connector. Continue to push it while pivoting the release lever to the full upright position.
5. Push the release lever in until it engages the ejector button.

Observe the drive status LEDs to ensure that the new drive or replacement drive is functioning properly. The SRM console polls for SCSI devices every 30 seconds. If the device does not appear to be working, access the SRM console and enter the show device command to view a list of the bootable devices.

8.8 Removing the Shipping Bracket

The shipping bracket provides protection and stabilization for CPU modules during shipment.

Figure 8–12 Removing the Shipping Bracket

Complete the following procedure to remove the shipping bracket:
Unscrew and loosen the slide that holds the CPUs. Push the bracket toward the release point shown in Figure 8–12 to release it, and then pull up at the point indicated. Save the shipping bracket for possible future shipment of the server.
NOTE: The shipping bracket is only needed when shipping the server. You do not need to reinstall it, but save the bracket for possible future use. If you ship the server in the future, reinstall the bracket by reversing the removal procedure.

8.9 CPUs

Shut the system down before adding or replacing a CPU.

Figure 8–13 Adding or Replacing CPU Cards
(Labeled items: CPU slots 0, 1, 2, and 3.)

WARNING: To prevent injury, access is limited to persons who have appropriate technical training and experience. Such persons are expected to understand the hazards of working within this equipment and take measures to minimize danger to themselves or others.

WARNING: Do not remove CPUs or memory modules until the green LEDs are off (approximately 20 seconds after a power-down).

WARNING: Modules have parts that operate at high temperatures. Wait 2 minutes after power is removed before touching any module.

WARNING: To prevent injury, unplug the power cord from each power supply before installing components.

1. Shut down the operating system and turn off power to the system. Unplug the power cord from each power supply.
2. Access the system chassis by following the instructions in Section 8.2 or 8.3.
3. Remove the covers from the fan area and the system card cage as explained in Section 8.4.
4. When adding a CPU module, install it in the next lowest numbered slot available (see Figure 8–13).
5. When adding a CPU module, remove and discard the airflow deflector plate from the CPU slot.
6. Insert the CPU card into the connector and push down on both latches simultaneously.
7. Replace the system card cage cover, fan cover, and enclosure covers.
8. Reconnect the power cords.

Verification

1. Turn on power to the system.
2. During power-up, observe the screen display. The newly installed CPU should appear in the display.
3. Issue the show config command to display the status of the new CPU.
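
The memory rules given in Section 8.10, which follows (options installed as sets of four identical DIMMs, array N holding sets N and N + 4, and the recommended load order of Section 8.10.1), can be checked on paper before the chassis is opened. The Python sketch below is only an illustration of those rules. It is not part of any Compaq tool, it does not talk to the SRM console, and the names DimmSet, check_config, next_set, and total_mb are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

ARRAYS = 4                              # four arrays per system
FILL_ORDER = [0, 1, 2, 3, 4, 6, 5, 7]   # load order from Section 8.10.1


@dataclass(frozen=True)
class DimmSet:
    """One memory option: four identical DIMMs (hypothetical helper type)."""
    capacity_mb: int   # per-DIMM capacity: 128, 256, 512, or 1024
    stacked: bool      # True for stacked DIMMs, False for unstacked (mono)


def check_config(installed: Dict[int, DimmSet]) -> List[str]:
    """Return one message per array whose two sets differ in capacity or type."""
    problems = []
    for array in range(ARRAYS):
        low = installed.get(array)            # set N
        high = installed.get(array + ARRAYS)  # set N + 4, same array
        if low is not None and high is not None and low != high:
            problems.append(
                f"array {array}: sets {array} and {array + ARRAYS} "
                "differ in capacity or type"
            )
    return problems


def next_set(installed: Dict[int, DimmSet]) -> Optional[int]:
    """Return the next empty set in the recommended order, or None if full."""
    for number in FILL_ORDER:
        if number not in installed:
            return number
    return None


def total_mb(installed: Dict[int, DimmSet]) -> int:
    """Total memory in MB, counting four DIMMs per installed set."""
    return sum(4 * dimm_set.capacity_mb for dimm_set in installed.values())


if __name__ == "__main__":
    # Example: two 256 MB unstacked sets installed, one 1 GB stacked set planned.
    config = {0: DimmSet(256, False), 1: DimmSet(256, False)}
    print("next set to populate:", next_set(config))
    config[4] = DimmSet(1024, True)   # set 4 shares array 0 with set 0
    for message in check_config(config):
        print("problem:", message)
    print("total MB:", total_mb(config))
```

Keying the dictionary by set number keeps the array relationship a single modulo-style lookup (set N pairs with set N + 4), which mirrors the way the manual's figures group sets into arrays.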
8.10 Memory DIMMs

DIMMs are manufactured with two types of SDRAMs, stacked and unstacked. Stacked DIMMs provide twice the capacity of unstacked DIMMs and, at the time of shipment, are the highest capacity DIMMs offered by Compaq. Your system may have either stacked or unstacked DIMMs.

A memory option consists of a “set” of four DIMMs. The system supports two sets per “array” and four arrays per system. You can mix stacked and unstacked DIMMs within the system, but not within an array. The DIMMs within an array must be of the same capacity and type (stacked or unstacked) because of different memory addressing.

When installing sets 0, 1, 2, and 3, an incorrect mix will not occur. When installing sets 4, 5, 6, or 7, however, you must ensure that the four DIMMs being installed match the type of DIMMs in the existing array. If necessary, rearrange DIMMs for proper configuration.

Only the following DIMMs and DIMM options can be used in the ES45 system.

Density   DIMM          DIMM Option (4 DIMMs per option)
128 MB    20-01CBA-09   MS620-AA (512 MB)
256 MB    20-01DBA-09   MS620-BA (1 GB)
512 MB    20-01EBA-09   MS620-CA (2 GB)
1 GB      20-L0FBA-09   MS620-DA (4 GB)*

* Toshiba specific DIMM and option.

CAUTION: Using different DIMMs may result in loss of data.

Figure 8–14 Installing and Removing MMBs and DIMMs
(Panels: Pedestal/Rack, Tower.)

WARNING: To prevent injury, access is limited to persons who have appropriate technical training and experience. Such persons are expected to understand the hazards of working within this equipment and take measures to minimize danger to themselves or others.

WARNING: Modules have parts that operate at high temperatures. Wait 2 minutes after power is removed before touching any module.

WARNING: To prevent injury, unplug the power cord from each power supply before installing components.

1. Shut down the operating system and turn off power to the system.
Unplug the power cord from each power supply.
2. Access the system chassis by following the instructions in Section 8.2, Removing Enclosure Panels.
3. Remove the fan cover and the system card cage cover.
4. Use Figure 8–16 or Figure 8–17 to determine where sets of memory DIMMs should be installed. Begin with the next available lowest numbered set.
5. Release the clips securing the appropriate MMB and slide out the MMB.
6. Release the clips (Figure 8–15) on the MMB slot where you will install the DIMM.
7. Install the DIMM and align the notches on the gold fingers with the connector keys.
8. Secure the DIMM with the clips on the MMB slot.

Figure 8–15 Aligning DIMM in MMB

9. Reinstall the MMB.
10. Replace the system card cage cover and enclosure covers.
11. Reconnect the power cords.

Verification

1. Turn on power to the system.
2. During power-up, observe the screen display for memory. The display shows how much memory is in each array.
3. Issue the show memory command to display the total amount of memory in the system.

8.10.1 Determining Memory Configuration

For optimum memory utilization and performance, load sets of memory DIMMs in the following order: 0, 1, 2, 3, 4, 6, 5, and 7.

Figure 8–16 Pedestal/Rack Memory Configuration
Figure 8–17 Tower Memory Configuration
(Both figures show MMB 0 through MMB 3 with DIMM connectors J2 through J9. MMB 0 and MMB 2 carry sets 0, 2, 4, and 6; MMB 1 and MMB 3 carry sets 1, 3, 5, and 7.)

Array     Sets
Array 0   0 and 4
Array 1   1 and 5
Array 2   2 and 6
Array 3   3 and 7

8.11 PCI Cards

Some PCI options require drivers to be installed and configured. These options come with a floppy or a CD-ROM.
Refer to the installation document that came with the option and follow the manufacturer's instructions.

WARNING: To prevent injury, access is limited to persons who have appropriate technical training and experience. Such persons are expected to understand the hazards of working within this equipment and take measures to minimize danger to themselves or others.

WARNING: To prevent fire, use only modules with current limited outputs. See National Electrical Code NFPA 70 or Safety of Information Technology Equipment, Including Electrical Business Equipment EN 60 950.

WARNING: High current area. Currents exceeding 240 VA can cause burns or eye injury. Avoid contact with parts or remove power prior to access.

WARNING: The I/O area houses parts that operate at high temperatures. Avoid contact with components to prevent a possible burn.

Figure 8–18 Installing or Replacing a PCI Card

Adding or Replacing a PCI Card

CAUTION: Hot plug is not currently supported on the operating system. Do not press the switches on the hot swap board. Pressing switches can result in loss of data.

Complete the following procedure to add or remove a PCI option card.
1. Turn off the system power.
2. Press in the latch button and open the latch.
3. When replacing a PCI module, remove the bad module by pulling straight out.
4. When adding a PCI option card into an unused slot, remove the blank bulkhead.
5. Install the new PCI option card.
6. Close the latch.

NOTE: Some full-length PCI cards may have extender brackets for installing into ISA/EISA-style card cages. Remove the extender brackets before installing such a card.

Verification

1. Turn on power to the system.
2. During power-up, observe the screen display for PCI information. The new option should be listed in the display.
3. Issue the SRM show config command. Examine the PCI bus information in the display to make sure that the new option is listed.
4. Enter the SRM show device command to display the device name of the new option.

Figure 8–19 PCI Module Hot Swap Assembly
(Shown in the closed position.)

8.11.1 Replacing the PCI Hot Swap Module

Shut the system down before adding or replacing the PCI hot swap module.
1. Halt all applications and power down the system.
2. Unscrew and remove the three M3 x 6 mm screws and attached washers. (Note: make sure you remove the three lower screws that hold the switch board in place.)
3. Pull the module straight back, removing the six tabs in the module from the slots in the metal.
4. To assemble a new module, reverse the procedure.

8.12 OCP Assembly

Figure 8–20 Replacing the OCP Assembly

Replacing the OCP Assembly

Shut the system down before removing the OCP assembly.
1. Press the two tabs on the top of the OCP assembly to release it.
2. Rotate the assembly toward you and lift it out of the two bottom tabs.
3. Disconnect the OCP cable.
4. Reverse steps 1 through 3 to replace the OCP assembly.

8.13 Installing Disk Cages

Figure 8–21 Cabling and Preparation for Installing Disk Cages

WARNING: To prevent injury, access is limited to persons who have appropriate technical training and experience. Such persons are expected to understand the hazards of working within this equipment and take measures to minimize danger to themselves or others.

WARNING: To prevent injury, unplug the power cord from each power supply before installing components.

Installing the Right Cage (or Top Cage)

1. Shut down the operating system and turn off power to the system. Unplug the power cord from each power supply.
2. Remove enclosure panels and remove the cover from the PCI card cage.

NOTE: When installing a drive cage into the left cage area, remove fans 3 and 4.

3. Install a SCSI controller in the PCI backplane.
4. Unscrew the four screws securing the drive cage filler plate and set them aside. Remove and discard the filler plate.
5. Attach the 10-pin cable (17-03971-11) to the environmental card.
6. Snap the environmental card onto the rearmost set of four pop inserts in the top of the PCI card cage.
7. Snap open the cable management clip. Thread the 10-pin cable through the opening and the clip and route as shown.
8. Plug the shorter end of the 68-pin cable (17-04867-01) into the SCSI controller. Plug the middle connector into the environmental card, route as shown, and close the clip.

Figure 8–22 Disk Cage Installation

CAUTION: Always plug the cables into the cage that is on the same side as the cage that was removed.

9. Partially slide the drive cage into the system chassis.
10. Connect the power source (located inside enclosure) to the drive cage.
11. Attach the 10-pin cable and 68-pin cable to the drive cage.
12. Slide the cage in the rest of the way and attach it with the four screws set aside previously.
13. Replace fans 3 and 4 (if removed previously), PCI card cage cover, and enclosure panels.
14. Install disk drives.

CAUTION: Disk drives must be installed from left to right. Otherwise, the system will not find the system disk.

15. Plug in the power cords.

Verification

1. Turn on power to the system.
2. At the P00>>> prompt, enter the SRM show device command to display the devices and controllers in the system. The list should include the SCSI controller and disk drives that you installed.

8.13.1 Cabling a Second Disk Drive Cage

If you are installing a second drive cage, refer to the following illustration for cable routing.

Figure 8–23 Cabling a Second Disk Cage

8.14 Adding or Replacing Removable Media

Figure 8–24 Adding a 5.25-Inch Device
WARNING: To prevent injury, access is limited to persons who have appropriate technical training and experience. Such persons are expected to understand the hazards of working within this equipment and take measures to minimize danger to themselves or others.

WARNING: To prevent injury, unplug the power cord from each power supply before installing components.

WARNING: Modules have parts that operate at high temperatures. Wait 2 minutes after power is removed before touching any module.

1. Shut down the operating system and turn off power to the system. Unplug the power cord from each power supply.
2. Remove the cover to the PCI card cage area.
3. Unplug the signal and power cables to the CD.
4. Remove and set aside the four screws securing the removable media cage.

CAUTION: Be careful not to tangle the wires to the CD-ROM and floppy.

5. Slide the cage out far enough to gain access to the floppy cables. Unplug the cables and remove the cage.
6. Remove a blank storage panel for the desired storage slot by pushing from behind the panel. If you are installing a full-height device, remove two panels.
7. Remove the divider plate between the two slots by pressing the center of the plate and bending it sufficiently to free it from the slots.
8. Set the SCSI ID on the device as desired.
9. Slide the storage device into the desired storage slot and secure the device to the unit with four of the screws provided inside the removable media drive cage.
10. Pull the floppy cables back in.
11. Slide the removable media cage back in and replace the four screws set aside previously.
12. Plug in the signal cable, route it into the PCI cage, and attach it to the appropriate controller.
13. Plug the power cable (4-conductor) into the storage device.
14. Plug the signal and power cables back into the CD.
15. Replace the PCI card cage cover and enclosure covers.
16. Reconnect the power cords.

Verification

1. Turn on power to the system.
2. When the system powers up to the P00>>> prompt, enter the SRM show device command to determine the device name. For example, look for dq, dk, ew, and so on.

8.15 Floppy Drive

Figure 8–25 Replacing the Floppy Drive

WARNING: To prevent injury, unplug the power cord from each power supply before installing components.

WARNING: Modules have parts that operate at high temperatures. Wait 2 minutes after power is removed before touching any module.

Replacing the Floppy Drive

Shut the system down before removing the floppy drive.
1. Remove the cover to the PCI card cage.
2. Unplug the signal cable and power cable from all devices except the floppy.
3. Remove and set aside the four screws that secure the removable media cage.
4. Slide the cage out far enough to gain access to the floppy cables. Unplug the cables and remove the cage.
5. Remove the cage.
6. Remove the four screws that secure the floppy drive, and slide the floppy out.
7. Remove the mounting brackets (two screws in each bracket) from the floppy.

8.16 I/O Connector Assembly

Figure 8–26 Replacing the I/O Connector Assembly

WARNING: To prevent injury, unplug the power cord from each power supply before installing components.

WARNING: Modules have parts that operate at high temperatures. Wait 2 minutes after power is removed before touching any module.

Replacing the I/O Connector Assembly

Shut the system down before removing the I/O connector assembly.
1. Unplug all I/O connectors from the rear of the unit.
2. Remove the cover from the PCI card cage.
3. Remove PCI cards as needed for access.
4. Unplug the 68-pin signal cable.
5. Remove the two screws that secure the assembly to the back of the unit.
6. Pull the assembly out through the PCI area.
8.17 PCI Backplane

Figure 8–27 Cables Connected to PCI Backplane

Cable         Connects To:
17-05021-01   CD-ROM
17-03970-04   Floppy
17-04400-06   I/O controller module
70-31349-01   Speaker
17-04785-01   Fans
17-04786-01   Cover sensors
17-03971-07   OCP
17-05042-01   Hot swap module

WARNING: Modules have parts that operate at high temperatures. Wait 2 minutes after power is removed before touching any module.

WARNING: To prevent injury, unplug the power cord from each power supply before installing components.

Disconnecting the Cables

Shut the system down before accessing the PCI area.
1. Remove the cover to the PCI card cage.
2. Record the location of installed PCI cards.
3. Remove all external cables from the PCI bulkheads in the rear of the unit. Remove internal cables from PCI cards.
4. Unlatch and remove the cards from the card cage.
5. Remove the separators that are between the PCI slots by pressing on the tab (see Figure 8–28) and sliding the separator to the right so that the holes slide out from the connection tabs.
6. Disconnect cables connected to the PCI backplane. See Figure 8–27.
7. Remove the top fan (pedestal/rack orientation) or left fan (tower orientation). This permits access to an ejector lever needed for removing the PCI backplane.

Figure 8–28 Removing the Separators

Figure 8–29 Replacing the PCI Backplane

Replacing the PCI Backplane

CAUTION: When removing the PCI backplane, be careful not to flex the board. Flexing the board may damage the BGA component connections.

1. Remove the three screws that secure the base holding the PCI dividers.
2. Press up the two tabs on the base and remove the base.
3. Remove the other nine screws that secure the PCI backplane to the chassis.

CAUTION: Do not remove the four additional nonwashered screws.
Removing them disables the built-in mechanism for extracting the PCI backplane from the system.

4. Use the ejector lever in the fan area to separate the PCI backplane from the system motherboard; then lift the backplane out of the chassis.

NOTE: When installing a new PCI backplane, align the backplane on the guide pins, and press the board firmly until it is seated. Seating the PCI backplane requires considerable pressure. When seating the PCI backplane in a cabinet, a second person should brace the chassis to ensure that no excessive stress is placed on the rails.

Reinstalling the Separators

Refer to Figure 8–29 for this procedure.
1. Align the left hole with the corresponding connection tab in the divider. Note that the right corresponding hole and connection tab are aligned.
2. Push the separator in and slide it to the left.

8.18 System Motherboard

Figure 8–30 Replacing the System Motherboard

WARNING: To prevent injury, unplug the power cord from each power supply before installing components.

WARNING: Modules have parts that operate at high temperatures. Wait 2 minutes after power is removed before touching any module.

CAUTION: When removing the system motherboard, be careful not to flex the board. Flexing the board may damage the BGA component connections.

NOTE: Replacing the system motherboard requires the removal of other FRUs. Review the removal procedures for the fans, MMBs, CPUs, and drive cage before beginning the system motherboard removal procedure.

1. Remove the three covers from the system chassis.
2. Remove fans 3 and 4 in the PCI area (the inner fans).
3. Record the positions of the MMBs and CPUs, and remove the MMBs and CPUs.
4. Remove the CPU airflow deflectors, if present.
5. Loosen the three captive Phillips screws holding the middle support bracket. The screws pop up when sufficiently loosened. Pull the bracket straight out.
6. Remove the drive cage (left cage in pedestal/rack, bottom cage in tower), if installed, or the blank panel.
7. Remove the two Phillips flat-head screws that secure the small cover to the left side (pedestal/rack) or bottom (tower) of the system and remove the cover. Set aside the screws. (The cover provides access to the power harness.)
8. Remove the power harness bracket as follows: Push up on the spring latch to release the bracket, slide the bracket forward, and remove it.
9. Unplug the five connectors on the bottom of the system motherboard.
10. Using a nut driver, loosen the three nuts (7 mm) on the flange over the intermodule connector so that it can move freely. Move the flange up from the connector and tighten one of the flange nuts to keep the flange out of the way.

NOTE: After replacing the motherboard, loosen the flange nut and push the flange down to the intermodule connector. Retighten the nuts on the flange. While tightening the nuts, put pressure on the flange to compress it into the connector.

11. Remove the three Phillips screws that secure the system motherboard.
12. A white plastic ejector and two holes in the sheet metal under it are used to help disengage the motherboard. Insert a screwdriver through the hole in the ejector into the closest hole and pry the system motherboard away from the PCI backplane. Insert the screwdriver into the second hole that is now exposed and pry again to fully disengage the system motherboard connector from the PCI backplane.
13. Extract the system motherboard.

After installing a new motherboard:
1. Power up to the P00>>> prompt.
2. Enter the clear_error all command.
3. Enter the set sys_serial_num command to set the system serial number. (The serial number is on a label on the back of the system.) For example:

P00>>> set sys_serial_num NI900100022

IMPORTANT: The system serial number must be set correctly. Compaq Analyze will not work with an incorrect serial number.
The serial number propagates to all FRU devices that have EEPROMs.

8.19 Power Harness

Figure 8–31 Replacing the Power Harness (front and back views)

WARNING: To prevent injury, unplug the power cord from each power supply before installing components.

WARNING: Modules have parts that operate at high temperatures. Wait 2 minutes after power is removed before touching any module.

NOTE: Replacing the power harness requires the removal of other system FRUs. Review the removal procedures for the power supplies, fans, and drive cage before beginning the harness removal procedure.

1. Remove the power supplies and any blank power supply panels.
2. Remove the cover to the PCI card cage.
3. Remove fans 4 and 3 (the inner fans).
4. Unplug the connectors to each removable media device (except the floppy).
5. Remove the four screws that secure the removable media cage. Slide out the cage to access the floppy power connector. Disconnect the floppy power connector and slide the cage back in.
6. Unplug the power connector to the drive cage or cages.
7. Remove the harness from the cable clamps.
8. Remove the drive cage (left cage in pedestal/rack, bottom cage in tower), if installed, or the blank panel.
9. Remove the two Phillips flat-head screws that secure the small cover to the left side (pedestal/rack) or bottom (tower) of the system and remove the cover. Set aside the screws. (Removing the small cover provides better access to the power harness bracket.)
10. Remove the power harness bracket as follows: Push up on the spring latch to release the bracket, slide the bracket forward, and remove it.
11. Unplug the five connectors on the bottom of the system motherboard.
12. Remove the two screws and two plastic bushings on each of the three power supply connectors. The screws are located deep inside the power supply cavity. Set aside the screws and bushings for reinstallation.
13. Starting with the left connector (as viewed from the rear of the system), pull the connector to the right and angle it so that you can push the left end out through the opening.
14. Remove the power harness.

Appendix A SRM Console Commands

This appendix lists the SRM console commands that are most frequently used with the ES4x family of systems.

Table A–1 SRM Commands Used on ES45 Systems

Command             Function
boot                Loads and starts the operating system.
buildfru            Initializes I2C bus EEPROM data structures for the named FRU.
cat el              Displays the console event log. Same as more el, but scrolls rapidly. The most recent errors are at the end of the event log and are visible on the terminal screen.
clear_error         Clears errors logged in the FRU EEPROMs as reported by the show error command.
clear_error all     Clears all errors.
continue            Resumes program execution on the specified processor or on the primary processor if none is specified.
crash               Forces a crash dump at the operating system level.
deposit             Writes data to the specified address of a memory location, register, or device.
edit                Invokes the console line editor on a RAM file or on the user powerup script, “nvram,” which is always invoked during the power-up sequence.
examine             Displays the contents of a memory location, register, or device.
exer                Exercises one or more devices by performing specified read, write, and compare operations.
floppy_write        Runs a write test on the floppy drive to determine whether you can write on the diskette.
galaxy              Same as lpinit.
grep                Searches for “regular expressions” (specific strings of characters) and prints any lines containing occurrences of the strings.
hd                  Dumps the contents of a file (byte stream) in hexadecimal and ASCII.
help command        Displays information about the specified console command.
info                Displays registers and data structures.
init                Resets the SRM console and reinitializes the hardware.
kill                Terminates a specified process.
kill_diags          Terminates all executing diagnostics.
lpinit              Used in an OpenVMS Galaxy environment. Initializes the hardware resources into zero, one, or two partitions.
man                 Displays information about the specified console command.
memexer             Runs a requested number of memory tests in the background.
memtest             Tests a specified section of memory.
more el             Same as cat el, but displays the console event log one screen at a time.
net -ic             Initializes the MOP counters for the specified Ethernet port.
net -s              Displays the MOP counters for the specified Ethernet port.
nettest             Runs loopback tests for PCI-based Ethernet ports. Also used to test a port on a “live” network.
rmc                 Invokes the remote management console from the local VGA monitor.
set envar           Sets or modifies the value of an environment variable.
set sys_serial_num  Sets the system serial number.
show envar          Displays the state of the specified environment variable.
show config         Displays the logical configuration at the last system initialization.
show device         Displays a list of controllers and bootable devices in the system.
show error          Reports errors logged in the FRU EEPROMs.
show fru            Displays information about field replaceable units (FRUs).
show memory         Displays information about system memory.
show pal            Displays the versions of Tru64 UNIX and OpenVMS PALcode.
show power          Displays information about system environmental characteristics, including power supplies, system fans, CPU fans, and temperature.
show_status         Displays the progress of diagnostic tests. Reports one line of information for each executing diagnostic.
show version        Displays the version of the SRM console program installed on the system.
sys_exer            Exercises the devices displayed with the show config command.
sys_exer -lb        Runs console loopback tests for the COM2 serial port and the parallel port during the sys_exer test sequence.
test                Verifies the configuration of the devices in the system.
test -lb            Runs loopback tests for the COM2 serial port and the parallel port in addition to verifying the configuration of devices.

Appendix B Jumpers and Switches

This appendix lists and describes the configuration jumpers and switches on the system motherboard and PCI board. Sections are as follows:
• RMC and SPC Jumpers on System Motherboard
• TIG/SROM Jumpers on System Motherboard
• Clock Generator Switch Settings
• Jumper on PCI Board
• Setting Jumpers

B.1 RMC and SPC Jumpers on System Motherboard

The RMC jumpers can be used to override the RMC defaults. For example, if a high-speed modem is connected to COM1, you can disable J4 to prevent RMC from receiving characters that might cause interference. The SPC jumpers are reserved.

Figure B–1 RMC and SPC Jumpers (J1 through J7)

Table B–1 RMC/SPC Jumper Settings

Jumper  Description
J7      1–2: Disables RMC flash update
        2–3: Enables RMC flash update (default)
        Disabling RMC flash update prevents other operators from erasing or updating the RMC.
J6      1–2: Sets RMC back to defaults
        2–3: Normal RMC operating mode (default)
        If the RMC escape sequence is set to something other than the default, and you have forgotten the sequence, RMC must be reset to factory settings to restore the default escape sequence. See Chapter 8 for the reset procedure.
J5      1–2: Causes system to shut down if overtemperature limit is reached (default)
        2–3: Permits system to continue running at overtemperature.
J4      1–2: Disables COM1 bypass
        2–3: Allows RMC to control COM1 bypass (default)
        No jumper installed: Forces COM1 bypass
J1      Not installed (default).
        When installed, bypasses power-up checks of processors by system power controller.
J2      Reserved (not installed).
J3      Reserved (not installed).

B.2 TIG/FSL Jumpers on System Motherboard

TIG/SROM jumpers allow you to load the TIG if flash RAM is corrupted or load the fail-safe loader (FSL) if SRM firmware is corrupted.

Figure B–2 TIG/FSL Jumpers (J19 through J24 and SW2 configuration switch)

NOTE: See Chapter 3 for instructions on activating the FSL.

Table B–2 TIG/FSL Jumper Descriptions

Jumper  Description
J24     1–2: Load TIG from flash ROM (default)
        2–3: Load TIG from serial ROM. This setting allows you to load the TIG if the flash ROM is corrupted.
J21     Jumper for enabling fail-safe loader (FSL). FIR_FUNC1 (bit 1): 1–2 = 0, 2–3 = 1
J22     Must be in default position over pins 1 and 2 to enable FSL. 1–2 = 0, 2–3 = 1
J23     Must be in default position over pins 1 and 2 to enable FSL. FIR_FUNC2 (bit 2): 1–2 = 0, 2–3 = 1
J20     Allows writes to flash ROM (normal). 1–2 = 1, 2–3 = 0
J19     Allows writes to flash ROM (normal). 1–2 = 1, 2–3 = 0

Table B–3 Firmware Function Table (FIR_FUNC)

Bits 210  Meaning
000       Normal
001       Prevent flash loads. Load from SROM.
010       Load from floppy
111       Lock console. Prevents the writing of flash from CPUs.

The configuration switchpack (SW2) sets the clock speed for the system motherboard. The settings should not be changed.

Switch  Setting
SW1     SYS_EXT_DELAY1 (off)
SW2     SYS_EXT_DELAY0 (on)
SW3     SYS_FILL_DELAY (on)
SW4     CPU_CFWD_PSET (off)
SW5     Reserved
SW6     Reserved
SW7     Y_DIV3 (on)
SW8     Y_DIV2 (on)
SW9     Y_DIV1 (on)
SW10    Y_DIV0 (on)

B.3 Clock Generator Switch Settings

Switchpack E16 on the system motherboard sets the frequency of the main clock on the system motherboard. The settings should not be changed.
Figure B–3 CSB Switchpack E16

Table B–4 Clock Generator Settings

Switch  Setting
SW1     M0 (on)
SW2     M1 (off)
SW3     M2 (off)
SW4     M3 (off)
SW5     M4 (off)
SW6     M5 (on)
SW7     M6 (on)
SW8     N0 (off)
SW9     N1 (on)
SW10    XTAL_SEL (off)

B.4 Jumper on PCI Board

You can set J19 on the PCI board to force DTR so that a modem will not be disconnected if the system is power cycled.

Figure B–4 PCI Board Jumper (J19)

Table B–5 PCI Board Jumper Description

Jumper  Description
J19     1–2: Do not force COM1 DTR
        2–3: Force COM1 DTR (default)
        This jumper allows you to force DTR. The default position prevents disconnection of the modem on a power cycle.

B.5 Setting Jumpers

Review the material in the previous sections of this appendix before setting any system jumpers. First, shut down the system and remove the power cord from each power supply.

CAUTION: Static electricity can damage integrated circuits. Always use a grounded wrist strap (29-26246) and grounded work surface when working with internal parts of a computer system. Remove jewelry before working on internal parts of the system.

Setting Jumpers
1. Shut down the operating system.
2. Shut down power on all external options connected to the system.
3. Turn off power to the system.
4. Unplug the power cord from each power supply and wait for all LEDs to turn off.
5. Remove enclosure panels and chassis covers to gain access to the system motherboard or PCI board.
   • If you are setting RMC jumpers, remove CPU 1 to gain access to the jumpers.
   • If you are setting TIG/SROM jumpers, remove MMB 1 to gain access to the jumpers.
   • If you are setting PCI jumpers, you typically do not need to remove any PCI cards. However, if you have a full-length card in slot 10, remove it.
6. Locate the jumper you need to set. Refer to the illustrations in this appendix. Set the jumpers as needed.
7. Reinstall any modules you removed.
8. Reinstall the chassis covers and enclosure panels.
9. Plug the power cords into the supplies.

Appendix C DPR Address Layout

This appendix shows the address layout of the dual-port RAM (DPR). Use the SRM examine dpr:address command (where address is the offset from the base of the DPR) or use the RMC dump command to view locations in the DPR. See Appendix D for definitions of locations written when environmental error events occur.

Table C–1 DPR Address Layout

Location    Logical    Written
(Hex)       Indicator  By         Used For
0           0          SROM       EV6 BIST status: 1 = good, 0 = bad
1           1          SROM       Bit[7] = Master; Bits[0,1] = CPU_ID
2           2          SROM       Test STR status: 1 = good, 0 = bad
3           3          SROM       Test CSC status: 1 = good, 0 = bad
4           4          SROM       Test Pchip 0 PCTL status: 1 = good, 0 = bad
5           5          SROM       Test Pchip 1 PCTL status: 1 = good, 0 = bad
6           6          SROM       Test DIMx status: 1 = good, 0 = bad
7           7          SROM       Test TIG bus status
8           8          SROM       Dual-Port RAM test: DD = started
9           9          SROM       Status of DPR test: 1 = good, 0 = bad
A           A          SROM       Status of CPU speed function: FF = good, 0 = bad
B           B          SROM       Lower byte of CPU speed in MHz
C           C          SROM       Upper byte of CPU speed in MHz
D:F         -          -          Reserved
10:15       -          SROM       Power On Time Stamp for CPU 0, written as BCD:
                                  Byte 10 = Hours (0-23)
                                  Byte 11 = Minutes (0-59)
                                  Byte 12 = Seconds (0-59)
                                  Byte 13 = Day of Month (1-31)
                                  Byte 14 = Month (1-12)
                                  Byte 15 = Year (0-99)
16          -          SROM       SROM Power On Error Indication for CPU is “alive.” For example: 0 = no error, 2 = Secondary time-out Error, 3 = Bcache Error
17:1D       -          -          Unused
1E          -          SROM       Last “sync state” reached; 80 = Finished GOOD
1F          -          SROM       Size of Bcache in MB
20:3F       20         -          Repeat for CPU1 of CPU0 0-1F
40:5F       20         -          Repeat for CPU2 of CPU0 0-1F
60:7F       20         -          Repeat for CPU3 of CPU0 0-1F
80          80         SROM       Array 0 (AAR 0) Configuration
                                  Bits<3:0>:
                                    4 = non split, lower set only
                                    5 = split, lower set only
                                    9 = split, upper set only
                                    D = split, 8 DIMMs
                                    F = Twice split, 8 DIMMs
                                  Bits<7:4>:
                                    0 = Configured, Lowest array
                                    1 = Configured, Next lowest array
                                    2 = Configured, Second highest array
                                    3 = Configured, Highest array
                                    4 = Misconfigured, Missing DIMM(s)
                                    8 = Misconfigured, Illegal DIMM(s)
                                    C = Misconfigured, Incompatible DIMM(s)
81          81         SROM       Array 0 (AAR 0) Size (x64 Mbytes):
                                  0 = no good memory
                                  1 = 64 Mbyte
                                  2 = 128 Mbyte
                                  4 = 256 Mbyte
                                  8 = 512 Mbyte
                                  10 = 1 Gbyte
                                  20 = 2 Gbyte
                                  40 = 4 Gbyte
                                  80 = 8 Gbyte
82          82         SROM       Array 1 (AAR 1) Configuration
83          83         SROM       Array 1 (AAR 1) Size (x64 Mbytes)
84          84         SROM       Array 2 (AAR 2) Configuration
85          85         SROM       Array 2 (AAR 2) Size (x64 Mbytes)
86          86         SROM       Array 3 (AAR 3) Configuration
87          87         SROM       Array 3 (AAR 3) Size (x64 Mbytes)
88:8B       -          SROM       Byte to define failed DIMMs for MMBs:
                                  88 - MMB 0
                                  89 - MMB 1
                                  8A - MMB 2
                                  8B - MMB 3
                                  Bit set indicates failure. Bit definitions (bit 0 = DIMM 1, bit 1 = DIMM 2, bit 2 = DIMM 3, ... bit 7 = DIMM 8)
8C:8F       8C-8F      SROM       Byte to define misconfigured DIMMs for MMBs:
                                  8C - MMB 0
                                  8D - MMB 1
                                  8E - MMB 2
                                  8F - MMB 3
                                  Bit definitions (bit 0 = DIMM 1, bit 1 = DIMM 2, bit 2 = DIMM 3, ... bit 7 = DIMM 8)
90          90         RMC        Power Supply/VTERM present
91          91         RMC        Power Supply PS_POK bits
92          92         RMC        AC input value from Power Supply
93:96       93         RMC        Temperature from CPU(x) in BCD
97:99       97         RMC        Temperature Zone(x) from 3 PCI temp sensors
9A:9F       9A         RMC        Fan Status; Raw Fan speed value
A0:A9       A0         RMC        Failure registers used as part of the 680 machine check logout frame. See Appendix D.
AA          -          RMC        Fan status (bit 0 = fan 1, bit 1 = fan 2, ...): 1 indicates good, 0 indicates fan failure
AB          -          RMC        Status of RMC to read I2C bus of MMB0 DIMMs. Definition (0 = OK, 1 = Fail):
                                  Bit 7 - DIMM 8
                                  Bit 6 - DIMM 7
                                  Bit 5 - DIMM 6
                                  ...
                                  Bit 0 - DIMM 1
AC          -          RMC        Status of RMC to read I2C bus of MMB1 DIMMs
AD          -          RMC        Status of RMC to read I2C bus of MMB2 DIMMs
AE          -          RMC        Status of RMC to read I2C bus of MMB3 DIMMs
AF          -          RMC        Status of RMC to read MMB and CPU I2C buses. Definition (0 = OK, 1 = Fail):
                                  Bit 7 - MMB3
                                  Bit 6 - MMB2
                                  Bit 5 - MMB1
                                  Bit 4 - MMB0
                                  Bit 3 - CPU3
                                  Bit 2 - CPU2
                                  Bit 1 - CPU1
                                  Bit 0 - CPU0
B0          -          RMC        Status of RMC to read CPB (PCI backplane) I2C EEROM: 0 = OK, 1 = fail
B1          -          RMC        Status of RMC to read CSB (motherboard) I2C EEROM: 0 = OK, 1 = fail
B2          -          RMC        Status of RMC to read SCSI backplane. Definition:
                                  Bit 0 - SCSI backplane 0
                                  Bit 1 - SCSI backplane 1
                                  Bit 4 - Power supply 0
                                  Bit 5 - Power supply 1
                                  Bit 6 - Power supply 2
B3:B9       -          Unused     Unused
BA          -          RMC        I2C done, BA = finished
BB          -          RMC        RMC Power on Error; indicates error during power-up (1 = Flash Corrupted)
BC          -          RMC        RMC flash update error status
BD          -          RMC        Copy of PS input value. See Appendix D.
BE          -          RMC        Copy of the byte from the I/O expanders on the SPC loaded by the RMC on fatal errors. See Appendix D.
BF          -          RMC        Reason for system failure. See Appendix D.
C0          -          -          Reason for system failure.
C1:D8       -          -          Unused
D9          -          RMC        Baud rate
DA          -          TIG        Indicates TIG finished loading its code (0xAA indicates done)
DB:E3       -          RMC        Fan/Temp info from PS1
E4:EC       -          RMC        Fan/Temp info from PS2
ED:F5       -          RMC        Fan/Temp info from PS3
F6:F8       -          Unused     Unused
F9          -          Firmware   Buffer Size (0-0xFF) or 1 to 256 bytes
FA:FB       FA         Firmware   Command address qualifier: FA = lower byte, FB = upper byte
FC          FC         RMC        Command status associated with the RMC response to a request from the firmware:
                                  0 = successful completion
                                  80 = unsuccessful completion
                                  81 = invalid command code
                                  82 = invalid command qualifier
FD          FD         RMC        Command ID associated with the RMC response to a request from the firmware
FE          FE         Firmware   Command Code associated with a “command” sent to the RMC:
                                  1 = update I2C EEROM
                                  2 = update baud rate
                                  3 = display to OCP
                                  F0 = update RMC flash
FF          FF         Firmware   Command ID associated with a “command” sent to the RMC
100:1FF     100        RMC        Copy of EEROM on MMB0 J4 DIMM 1, initially read on I2C bus by RMC when 5 volts supply turned on. Written by Compaq Analyze after error diagnosed to particular FRU
200:2FF     200        RMC        Copy of EEROM on MMB0 J8
300:3FF     300        RMC        Copy of EEROM on MMB0 J5
400:4FF     400        RMC        Copy of EEROM on MMB0 J9
500:5FF     500        RMC        Copy of EEROM on MMB0 J2
600:6FF     600        RMC        Copy of EEROM on MMB0 J6
700:7FF     700        RMC        Copy of EEROM on MMB0 J3
800:8FF     800        RMC        Copy of EEROM on MMB0 J7
900:9FF     900        RMC        Copy of EEROM on MMB1 J4
A00:AFF     A00        RMC        Copy of EEROM on MMB1 J8
B00:BFF     B00        RMC        Copy of EEROM on MMB1 J5
C00:CFF     C00        RMC        Copy of EEROM on MMB1 J9
D00:DFF     D00        RMC        Copy of EEROM on MMB1 J2
E00:EFF     E00        RMC        Copy of EEROM on MMB1 J6
F00:FFF     F00        RMC        Copy of EEROM on MMB1 J3
1000:10FF   1000       RMC        Copy of EEROM on MMB1 J7
1100:11FF   1100       RMC        Copy of EEROM on MMB2 J4
1200:12FF   1200       RMC        Copy of EEROM on MMB2 J8
1300:13FF   1300       RMC        Copy of EEROM on MMB2 J5
1400:14FF   1400       RMC        Copy of EEROM on MMB2 J9
1500:15FF   1500       RMC        Copy of EEROM on MMB2 J2
1600:16FF   1600       RMC        Copy of EEROM on MMB2 J6
1700:17FF   1700       RMC        Copy of EEROM on MMB2 J3
1800:18FF   1800       RMC        Copy of EEROM on MMB2 J7
1900:19FF   1900       RMC        Copy of EEROM on MMB3 J4
1A00:1AFF   1A00       RMC        Copy of EEROM on MMB3 J8
1B00:1BFF   1B00       RMC        Copy of EEROM on MMB3 J5
1C00:1CFF   1C00       RMC        Copy of EEROM on MMB3 J9
1D00:1DFF   1D00       RMC        Copy of EEROM on MMB3 J2
1E00:1EFF   1E00       RMC        Copy of EEROM on MMB3 J6
1F00:1FFF   1F00       RMC        Copy of EEROM on MMB3 J3
2000:20FF   2000       RMC        Copy of EEROM on MMB3 J7
2100:21FF   2100       RMC        Copy of EEROM from CPU0
2200:22FF   2200       RMC        Copy of EEROM from CPU1
2300:23FF   2300       RMC        Copy of EEROM from CPU2
2400:24FF   2400       RMC        Copy of EEROM from CPU3
2500:25FF   2500       RMC        Copy of MMB 0 J5 FRU EEROM
2600:26FF   2600       RMC        Copy of MMB 1 J7 FRU EEROM
2700:27FF   2700       RMC        Copy of MMB 2 J6 FRU EEROM
2800:28FF   2800       RMC        Copy of MMB 3 J8 FRU EEROM
2900:29FF   2900       RMC        Copy of EEROM on CPB (PCI backplane)
2A00:2AFF   2A00       RMC        Copy of EEROM on CSB (motherboard)
2B00:2BFF   2B00       RMC        Last EV68 Correctable Error: ASCII character string that indicates correctable error occurred, type, FRU, and so on. Backed up in CSB (motherboard) EEROM. Written by Compaq Analyze.
2C00:2CFF   2C00       RMC        Last Redundant Failure: ASCII character string that indicates redundant failure occurred, type, FRU, and so on. Backed up in system CSB (motherboard) EEROM. Written by Compaq Analyze.
2D00:2DFF   2D00       RMC        Last System Failure: ASCII character string that indicates system failure occurred, type, FRU, and so on. Backed up in CSB (motherboard) EEROM. Written by Compaq Analyze.
2E00:2FFF   2E00       RMC        Uncorrectable machine logout frame (512 bytes)
3000:3008   -          SROM       SROM Version (ASCII string)
3009:300B   -          RMC        Rev level of RMC: first byte is letter Rev [x/t/v]; second 2 bytes are major/minor. This is the rev level of the RMC on-chip code.
300C:300E   -          RMC        Rev level of RMC: first byte is letter Rev [x/t/v]; second 2 bytes are major/minor. This is the rev level of the RMC flash code.
300F:3010   300F       RMC        Revision Field of the DPR Structure
3011:30FF   -          Unused     Unused
3100:31FF   -          RMC        Copy of PS0 EEROM (first 256 bytes)
3200:32FF   -          RMC        Copy of PS1 EEROM (first 256 bytes)
3300:33FF   -          RMC        Copy of PS2 EEROM (first 256 bytes)
3400        -          SROM       Size of Bcache in MB
3401        -          SROM       Flash SROM is valid flag; 8 = valid, 0 = invalid
3402        -          SROM       System’s errors determined by SROM
3403:340F   -          SROM/SRM   Reserved for future SROM/SRM communication
3410:3417   -          SROM/SRM   Jump to address for CPU0
3418        -          SROM/SRM   Waiting to jump to flag for CPU0
3419        -          SROM       Shadow of value written to EV6 DC_CTL register.
341A:341E   -          SROM       Shadow of most recent writes to EV6 CBOX “Write-many” chain.
341F        -          SROM/SRM   Reserved for future SROM/SRM communication
3420:342F   -          SROM/SRM   Repeat for CPU1 of CPU0 3410-341F
3430:343F   -          SROM/SRM   Repeat for CPU2 of CPU0 3410-341F
3440:344F   -          SROM/SRM   Repeat for CPU3 of CPU0 3410-341F
3450:349F   -          SROM/RMC   Reserved for SROM mini-console via RMC communication area. Future design.
34A0:34A7   -          SROM       Array 0 to DIMM ID translation
                                  Bits<7:5>:
                                    0 = Exists, No Error
                                    1 = Expected Missing
                                    2 = Error - Missing DIMM(s)
                                    4 = Error - Illegal DIMM(s)
                                    6 = Error - Incompatible DIMM(s)
                                  Bits<4:0>:
                                    Bits<2:0> = DIMM + 1 (1-8)
                                    Bits<4:3> = MMB (0-3)
34A8:34AF   -          SROM       Repeat for Array 1 of Array 0 34A0:34A7
34B0:34B7   -          SROM       Repeat for Array 2 of Array 0 34A0:34A7
34B8:34BF   -          SROM       Repeat for Array 3 of Array 0 34A0:34A7
34C0:34FF   34C0       SROM       Used as scratch area for SROM
3500:35FF   -          Firmware   Used as the dedicated buffer in which SRM writes OCP or FRU EEROM data. Firmware will write this data; RMC will only read this data.
3600:36FF   3600       SRM        Reserved
3700:37FF   -          SRM        Reserved
3800:3AFF   -          RMC        RMC scratch space
3B00:3BFF   -          RMC        First SCSI backplane EEROM
3C00:3CFF   -          RMC        Second SCSI backplane EEROM
3D00:3DFF   -          RMC        PS0 second 256 bytes
3E00:3EFF   -          RMC        PS1 second 256 bytes
3F00:3FFF   -          RMC        PS2 second 256 bytes

Appendix D Registers

This appendix describes 21264 (EV68) internal processor registers; 21274 (Titan) system support chipset registers; and dual-port RAM (DPR) registers that are related to general logout frame errors. It also provides CPU and system uncorrectable and correctable machine logout frames and error state bit definitions of all the platform logout frame registers.
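
The register tables in this appendix give each field as a bit range, such as <37:34> or <29>. As a minimal sketch (not part of any Compaq tool), the following Python shows one way to pull such fields out of a raw 64-bit register value copied from a logout frame. The helper names field and decode_i_stat and the sample value are hypothetical; the field positions are the ones listed in Table D–1.

```python
def field(value: int, hi: int, lo: int) -> int:
    """Return bits <hi:lo> of value, right-justified."""
    width = hi - lo + 1
    return (value >> lo) & ((1 << width) - 1)


# Field positions as listed in Table D-1 (Ibox Status Register).
I_STAT_FIELDS = {
    "MIS": (40, 40),
    "TRP": (39, 39),
    "LS0": (38, 38),
    "TRAP_TYPE": (37, 34),
    "ICM": (33, 33),
    "OVR": (32, 30),
    "PAR": (29, 29),
}


def decode_i_stat(raw: int) -> dict:
    """Split a raw I_STAT value into its named fields."""
    return {name: field(raw, hi, lo) for name, (hi, lo) in I_STAT_FIELDS.items()}


# Hypothetical raw value: TRP set, trap type 10 (Machine Check), PAR set.
sample = (1 << 39) | (10 << 34) | (1 << 29)
print(decode_i_stat(sample))
```

The same field helper applies to any register in this appendix; only the table of bit positions changes.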
21264 (EV68) Registers
• Ibox Status Register (I_STAT)
• Memory Management Status Register (MM_STAT)
• Dcache Status Register (DC_STAT)
• Cbox Read Register
• Exception Address Register (EXC_ADDR)
• Interrupt Enable and Current Processor Mode Register (IER_CM)
• Interrupt Summary Register (ISUM)
• PAL Base Register (PAL_BASE)
• Ibox Control Register (I_CTL)
• Process Context Register (PCTX)

21274 (Titan) System Registers
• 21274 Cchip Miscellaneous Register (MISC)
• 21274 Device Interrupt Request Register (DIRn, n=0,1,2,3)
• 21274 Pchip Error Register (PERROR)
• 21274 Pchip System Error Register (SERROR)
• 21274 Array Address Registers

DPR Registers
• DPR Registers (for 680 correctable error state capture)
• DPR Registers (for I2C bus)
• DPR Registers (power supply status from I2C bus)
• DPR 680 Fatal Registers (for 680 uncorrectable error state capture)

D.1 Ibox Status Register (I_STAT)

The Ibox Status Register (I_STAT) is a read/write-1-to-clear register that contains Ibox status information.

Register layout (high to low bits): reserved <63:41>, MIS <40>, TRP <39>, LS0 <38>, TRAP TYPE[3:0] <37:34>, ICM <33>, OVR[2:0] <32:30>, PAR <29>, reserved <28:0>

Table D–1 Ibox Status Register Fields

Name            Bits      Type  Description
Reserved        <63:41>   RO    Reserved for Compaq.
MIS             <40>      RO    ProfileMe Mispredict Trap. If the I_STAT<TRP> bit is set, this bit indicates that the profiled instruction caused a mispredict trap. JSR/JMP/RET/COR or HW_JSR/HW_JMP/HW_RET/HW_COR mispredicts do not set this bit but can be recognized by the presence of one of these instructions at the PMPC location with the I_STAT<TRP> bit set. This identification is exact in all cases except error condition traps. Hardware corrected Icache parity or Dcache ECC errors, and machine check traps can occur on any instruction in the pipeline.
TRP             <39>      RO    ProfileMe Trap. This bit indicates that the profiled instruction caused a trap. The trap type field, PMPC register, and instruction at the PMPC location are needed to distinguish all trap types.
LS0             <38>      RO    ProfileMe Load-Store Order Trap. If the profiled instruction caused a replay trap, this bit indicates that the precise trap cause was an Mbox load-store order replay trap. If clear, this bit indicates that the replay trap was any one of the following:
                                  Mbox load-load order
                                  Mbox load queue full
                                  Mbox store queue full
                                  Mbox wrong size trap (such as, STL → LDQ)
                                  Mbox Bcache alias (2 physical addresses map to same Bcache line)
                                  Mbox Dcache alias (2 physical addresses map to same Dcache line)
                                  Icache parity error
                                  Dcache ECC error
TRAP TYPE<3:0>  <37:34>   RO    ProfileMe Trap Types. If the profiled instruction caused a trap (indicated by I_STAT<TRP>), this field indicates the trap type as listed here:
                                  Value  Trap Type
                                  0      Replay
                                  1      Invalid (unused)
                                  2      DTB Double miss (3 level page tables)
                                  3      DTB Double miss (4 level page tables)
                                  4      Floating point disabled
                                  5      Unaligned Load/Store
                                  6      DTB Single miss
                                  7      Dstream Fault
                                  8      OPCDEC
                                  9      Invalid (use PMPC, described below)
                                  10     Machine Check
                                  11     Invalid (use PMPC, described below)
                                  12     Arithmetic
                                  13     Invalid (use PMPC, described below)
                                  14     MT_FPCR
                                  15     Reset
                                Traps due to ITB miss, Istream access violation, or interrupts are not reported in the trap type field because they do not cause pipeline aborts. Instead, these traps cause pipeline redirection and can be distinguished by examining the PMPC value for the presence of the corresponding PALcode entry offset addresses indicated below. In these cases, the ProfileMe interrupt will normally be delivered when exiting the trap PALcode flow and the EXC_ADDR register will contain the original PC that encountered the redirect trap.
                                  PMPC<14:0>  Trap
                                  0581        ITB miss
                                  0481        Istream Access Violation
                                  0681        Interrupt
ICM             <33>      RO    ProfileMe Icache Miss.
                                This bit indicates that the profiled instruction was contained in an aligned 4-instruction Icache fetch block that requested a new Icache fill stream.
OVR<2:0>        <32:30>   RO    ProfileMe Counter 0 Overcount. This field indicates a value (0-7) that must be subtracted from the counter 0 result to obtain an accurate count of the number of instructions retired in the interval beginning three cycles after the profiled instruction reaches pipeline stage 2 and ending four cycles after the profiled instruction is retired.
PAR             <29>      W1C   Icache Parity Error. This bit indicates that the Icache encountered a parity error on instruction fetch. When a parity error is detected, the Icache is flushed, a replay trap back to the address of the error instruction is generated, and a correctable read interrupt is requested. See also I_STAT<LAM>.
Reserved        <28:0>    RO    Reserved for Compaq.

D.2 Memory Management Status Register (MM_STAT)

The Memory Management Status Register (MM_STAT) is a read-only register. When a Dstream TB miss or fault occurs, information about the error is latched in MM_STAT. MM_STAT is not updated when a LD_VPTE instruction gets a DTB miss.

Register layout (high to low bits): reserved <63:11>, DC_TAG_PERR <10>, OPCODE[5:0] <9:4>, FOW <3>, FOR <2>, ACV <1>, WR <0>

Table D–2 Memory Management Status Register Fields

Name         Bits      Type  Description
Reserved     <63:11>         Reserved for Compaq.
DC_TAG_PERR  <10>      RO    This bit is set when a D-cache tag parity error occurs during the initial tag probe of a load or store instruction. The error creates a synchronous fault to the D_FAULT PALcode entry point and is correctable. The virtual address associated with the error is available in the VA register.
OPCODE       <9:4>     RO    Opcode of the instruction that caused the error. HW_LD is displayed as 3 and HW_ST is displayed as 7.
FOW          <3>       RO    Set when a fault-on-write error occurs during a write transaction and PTE<FOW> was set.
FOR          <2>       RO    Set when a fault-on-read error occurs during a read transaction and PTE<FOR> was set.
ACV          <1>       RO    Set when an access violation occurs during a transaction. Access violations include a bad virtual address.
WR           <0>       RO    Set when an error occurs during a write transaction.

D.3 Dcache Status Register (DC_STAT)

The Dcache Status Register (DC_STAT) is a read-write register. If a Dcache tag parity error or data ECC error occurs, information about the error is latched in this register.

Register layout (high to low bits): reserved <63:5>, SEO <4>, ECC_ERR_LD <3>, ECC_ERR_ST <2>, TPERR_P1 <1>, TPERR_P0 <0>

Table D–3 Dcache Status Register Fields

Name        Bits     Type  Description
Reserved    <63:5>         Reserved for Compaq.
SEO         <4>      W1C   Second error occurred. When set, indicates that a second D-cache store ECC error occurred within 6 cycles of the previous D-cache store ECC error.
ECC_ERR_LD  <3>      W1C   ECC error on load. When set, indicates that a single-bit ECC error occurred while processing a load from the D-cache or any fill.
ECC_ERR_ST  <2>      W1C   ECC error on store. When set, indicates that an ECC error occurred while processing a store.
TPERR_P1    <1>      W1C   Tag parity error, pipe 1. When set, indicates that a D-cache tag probe from pipe 1 resulted in a tag parity error. The error is uncorrectable and results in a machine check.
TPERR_P0    <0>      W1C   Tag parity error, pipe 0. When set, indicates that a D-cache tag probe from pipe 0 resulted in a tag parity error. The error is uncorrectable and results in a machine check.

D.4 Cbox Read Register

The Cbox Read Register is read only by PALcode and is an element in the CPU or system uncorrectable and correctable machine check error logout frame.

Table D–4 Cbox Read Register Fields

Name               Description
C_SYNDROME_1<7:0>  Syndrome for the upper QW in the OW of victim that was scrubbed. See Appendix E.
C_SYNDROME_0<7:0>  Syndrome for the lower QW in the OW of victim that was scrubbed. See Appendix E.
C_STAT<4:0>        Bits   Error Status
                   00000  Either no error, or error on a speculative load, or a B-cache victim read due to a D-cache/B-cache miss.
                   00001  BC_PERR (B-cache tag parity error)
                   00010  DC_PERR (duplicate tag parity error)
                   00011  DSTREAM_MEM_ERR
                   00100  DSTREAM_BC_ERR
                   00101  DSTREAM_DC_ERR
                   0011X  PROBE_BC_ERR
                   01000  Reserved
                   01001  Reserved
                   01010  Reserved
                   01011  ISTREAM_MEM_ERR
                   01100  ISTREAM_BC_ERR
                   01101  Reserved
                   0111X  Reserved
                   10011  DSTREAM_MEM_DBL
                   10100  DSTREAM_BC_DBL
                   11011  ISTREAM_MEM_DBL
                   11100  ISTREAM_BC_DBL
C_STS<3:0>         If C_STAT equals xxx_MEM_ERR or xxx_BC_ERR, then C_STS contains the status of the block as follows; otherwise, the value of C_STS is X.
                   Bit Value  Status of Block
                   7–4        Reserved
                   3          Parity
                   2          Valid
                   1          Dirty
                   0          Shared
C_ADDR<6:42>       Address of the last reported ECC or parity error. If C_STAT value is DSTREAM_DC_ERR, only bits <6:19> are valid.

D.5 Exception Address Register (EXC_ADDR)

The Exception Address Register (EXC_ADDR) is a read-only register that is updated by hardware when it encounters an exception or interrupt.

Register layout (high to low bits): PC[63:2] <63:2>, PAL <0>

EXC_ADDR<0> is set if the associated exception occurred in PAL mode. The exception actions are:
• If the exception was a fault or a synchronous trap, EXC_ADDR contains the PC of the instruction that triggered the fault or trap.
• If the exception was an interrupt, EXC_ADDR contains the PC of the next instruction that would have executed if the interrupt had not occurred.

D.6 Interrupt Enable and Current Processor Mode Register (IER_CM)

The Interrupt Enable and Current Processor Mode Register (IER_CM) contains the interrupt enable and current processor mode bit fields. These bit fields can be written either individually or together with a single HW_MTPR instruction. When bits <7:2> of the IPR index field of a HW_MTPR instruction contain the value 000010, this register is selected.
Bits <1:0> of the IPR index indicate which bit fields are to be written: bit<1> corresponds to the IER field and bit<0> corresponds to the processor mode field. A HW_MFPR instruction to this register returns the values in both fields.

Bit layout (figure LK99-0022A), high to low: Reserved <63:39>, EIEN[5:0] <38:33>, SLEN <32>, CREN <31>, PCEN[1:0] <30:29>, SIEN[15:1] <28:14>, ASTEN <13>, Reserved <12:5>, CM[1:0] <4:3>, Reserved <2:0>.

Table D–5 IER_CM Register Fields

Name  Extent  Type  Description
Reserved  <63:39>
EIEN<5:0>  <38:33>  RW  External Interrupt Enable
SLEN  <32>  RW  Serial Line Interrupt Enable
CREN  <31>  RW  Corrected Read Error Interrupt Enable
PCEN<1:0>  <30:29>  RW  Performance Counter Interrupt Enables
SIEN<15:1>  <28:14>  RW  Software Interrupt Enables
ASTEN  <13>  RW  AST Interrupt Enable. When set, enables those AST interrupt requests that are also enabled by the value in ASTER.
Reserved  <12:5>
CM<1:0>  <4:3>  RW  Current Mode: 00 Kernel, 01 Executive, 10 Supervisor, 11 User
Reserved  <2:0>

D.7 Interrupt Summary Register (ISUM)

The Interrupt Summary Register (ISUM) is a read-only register that records all pending hardware, software, and AST interrupt requests that have their corresponding enable bit set. If a new interrupt (hardware, serial line, CRD, or performance counters) occurs simultaneously with an ISUM read, the ISUM read returns zeros. That condition is normally assumed to be a passive release condition. The interrupt is signaled again when the PALcode returns to native mode. The effects of this condition can be minimized by reading ISUM twice and ORing the results.

Bit layout (figure LK99-0024A), high to low: Reserved <63:39>, EI[5:0] <38:33>, SL <32>, CR <31>, PC[1:0] <30:29>, SI[15:1] <28:14>, Reserved <13:11>, ASTU <10>, ASTS <9>, Reserved <8:5>, ASTE <4>, ASTK <3>, Reserved <2:0>.

Table D–6 ISUM Register Fields

Name  Extent  Type  Description
Reserved  <63:39>
EI<5:0>  <38:33>  RO  External Interrupts
SL  <32>  RO  Serial Line Interrupt
CR  <31>  RO  Corrected Read Error Interrupts
PC<1:0>  <30:29>  RO  Performance Counter Interrupts. PC0 when PC<0> is set. PC1 when PC<1> is set.
SI<15:1>  <28:14>  RO  Software Interrupts
Reserved  <13:11>
ASTU, ASTS  <10>,<9>  RO  AST Interrupts. For each processor mode, the bit is set if an associated AST interrupt is pending. This includes the mode’s ASTER and ASTRR bits and whether the processor mode value held in the IER_CM register is greater than or equal to the value for the mode.
Reserved  <8:5>
ASTE, ASTK  <4>,<3>  RO  AST Interrupts. For each processor mode, the bit is set if an associated AST interrupt is pending. This includes the mode’s ASTER and ASTRR bits and whether the processor mode value held in the IER_CM register is greater than or equal to the value for the mode.
Reserved  <2:0>

D.8 PAL Base Register (PAL_BASE)

The PAL Base Register (PAL_BASE) is a read-write register that contains the base physical address for PALcode. Its contents are cleared by a chip reset but are not cleared after waking up from sleep mode or from fault reset.

Bit layout (figure LK99-0027A), high to low: Reserved <63:44>, PAL_BASE[43:15] <43:15>, Reserved <14:0>.

Table D–7 PAL_BASE Register Fields

Name  Extent  Type  Description
Reserved  <63:44>  RO, 0  Reserved for Compaq.
PAL_BASE  <43:15>  RW  Base physical address for PALcode.
Reserved  <14:0>  RO, 0  Reserved for Compaq.

D.9 Ibox Control Register (I_CTL)

The Ibox Control Register (I_CTL) is a read-write register that controls various Ibox functions. Its contents are cleared by a chip reset.

Bit layout (figure LK99-0029A), field names as printed, high to low: SEXT(VPTB[47]), VPTB[47:30], CHIP_ID[5:0], BIST_FAIL, TB_MB_EN, MCHK_EN, ST_WAIT_64K, PCT1_EN, PCT0_EN, SINGLE_ISSUE_H, VA_FORM_32, VA_48, SL_RCV, SL_XMIT, HWE, BP_MODE[1:0], SBE[1:0], SDE[1:0], SPE[2:0], IC_EN[1:0], SPCE.

Table D–8 I_CTL Register Fields

Name  Extent  Type  Description
SEXT(VPTB<47>)  <63:48>  RW,0  Sign extended VPTB<47>.
VPTB<47:30>  <47:30>  RW,0  Virtual Page Table Base.
CHIP_ID<5:0>  <29:24>  RO  This is a read-only field that supplies the revision ID number for the EV68CB/EV68DC part. EV68CB/EV68DC pass 2.3 ID is 010111.
BIST_FAIL <23> RO,0 Indicates the status of BIST (clear = pass, set = fail). TB_MB_EN <22> RW,0 When set, the hardware ensures that the virtual-mode loads in DTB and ITB fill flows that access the page table and the subsequent virtual mode load or store that is being retried are “ordered” relative to another processor’s stores. This must be set for multiprocessor systems in which no MB instruction is present in the TB fill flow, unless there are other mechanisms present that ensure coherency. MCHK_EN <21> RW,0 Machine check enable — set to enable machine checks. CALL_PAL_R23 <20> RW,0 CALL_PAL linkage register. If this bit is one, the CALL_PAL linkage register is R23; when zero, it is R27. Coordinate setting this bit with SDE<1:0> to ensure that the shadow register is used as the linkage register. PCT1_EN <19> RW,0 Enable performance counter #1. If this bit is one, the performance counter will count if either the system (SPCE) or process (PPCE) performance counter enable is asserted. D-16 ES45 Service Guide Table D–8 I_CTL Register Fields (Continued) Name Extent Type Description PCT0_EN <18> RW,0 Enable performance counter #0. If this bit is one, the performance counter will count if EITHER the system (SPCE) or process (PPCE) performance counter enable is set. SINGLE_ISSUE_H <17> RW,0 When set, this bit forces instructions to issue only from the bottom-most entries of the IQ and FQ. VA_FORM_32 <16> RW,0 This bit controls address formatting on a read of the IVA_FORM register. VA_48 <15> RW,0 This bit controls the format applied to effective virtual addresses by the IVA_FORM register and the Ibox virtual address sign extension checkers. When VA_48 is clear, 43-bit virtual address format is used, and when VA_48 is set, 48-bit virtual address format is used. The effect of this bit on the IVA_FORM register is identical to the effect of VA_CTL<VA_48> on the VA_FORM register. When VA_48 is set, the sign extension checkers generate an ACV if va<63:0> ≠ SEXT(va<47:0>). 
When VA_48 is clear, the sign extension checkers generate an ACV if va<63:0> ≠ SEXT(va<42:0>). This bit also affects DTB_DOUBLE traps. If set, the DTB double miss traps vector to the DTB_DOUBLE_4 entry point. DTB_DOUBLE PALcode flow selection is not affected by VA_CTL<VA_48>.
SL_RCV  <14>  RO  When in native mode, any transition on SL_RCV, driven from the SromData_H pin, results in a trap to the PALcode interrupt handler. When in PALmode, all interrupts are blocked. The interrupt routine then begins sampling SL_RCV under a software timing loop to input as much data as needed, using the chosen serial line protocol.
SL_XMIT  <13>  WO  When set, drives a value on SromClk_H.
HWE  <12>  RW,0  If set, allows PALRES instructions to be executed in kernel mode. Note that modification of the ITB while in kernel mode/native mode may cause UNPREDICTABLE behavior.
BP_MODE<1:0>  <11:10>  RW,0  Branch Prediction Mode Selection. BP_MODE<1>, if set, forces all branches to be predicted to fall through. If clear, the dynamic branch predictor is chosen. BP_MODE<0>, if set, makes the dynamic branch predictor choose local history prediction. If clear, the dynamic branch predictor chooses local or global prediction based on the state of the chooser.
SBE<1:0>  <9:8>  RW,0  Stream Buffer Enable. The value in this bit field specifies the number of Istream buffer prefetches (besides the demand-fill) that are launched after an Icache miss. If the value is zero, only demand requests are launched.
SDE<1:0>  <7:6>  RW,0  PALshadow Register Enable. Enables access to the PALshadow registers. If SDE<1> is set, R4-R7 and R20-R23 are used as PALshadow registers. SDE<0> does not affect 21264 operation.
SPE<2:0>  <5:3>  RW,0  Super Page Mode Enable. Identical to the SPE bits in the Mbox M_CTL SPE<2:0>.
IC_EN<1:0>  <2:1>  RW,3  Icache Set Enable. At least one set must be enabled. The entire cache may be enabled by setting both bits. Zero, one, or two Icache sets can be enabled. This bit does not clear the Icache, but only disables fills to the affected set.
SPCE  <0>  RW,0  System Performance Counting Enable. Enables performance counting for the entire system if individual counters (PCTR0 or PCTR1) are enabled by setting PCT0_EN or PCT1_EN, respectively. Performance counting for individual processes can be enabled by setting PCTX<PPCE>.

D.10 Process Context Register (PCTX)

The process context register (PCTX) contains information associated with the context of a process. Any combination of the bit fields within this register may be written with a single HW_MTPR instruction. When bits <7:6> of the IPR index field of a HW_MTPR instruction contain the value 01 (binary), this register is selected. Bits <4:0> of the IPR index indicate which bit fields are to be written.

Bit layout (figure LK99-0032A), high to low: Reserved <63:47>, ASN[7:0] <46:39>, Reserved <38:13>, ASTRR[3:0] <12:9>, ASTER[3:0] <8:5>, Reserved <4:3>, FPE <2>, PPCE <1>, bit <0>.

The following table lists the correspondence between IPR index bits and register fields.

IPR Index Bit  Register Field
0  ASN
1  ASTER
2  ASTRR
3  PPCE
4  FPE

Table D–9 lists the PCTX register fields.

Table D–9 PCTX Register Fields

Name  Extent  Type  Description
Reserved  <63:47>
ASN<7:0>  <46:39>  RW  Address space number.
Reserved  <38:13>
ASTRR<3:0>  <12:9>  RW  AST request register, used to request AST interrupts in each of the four processor modes. To generate a particular AST interrupt, its corresponding bits in ASTRR and ASTER must be set, along with the ASTE bit in IER. Further, the value of the current mode bits in the PS register must be equal to or higher than the value of the mode associated with the AST request.
The bit order within this field is: User Mode, bit 12; Supervisor Mode, bit 11; Executive Mode, bit 10; Kernel Mode, bit 9.
ASTER<3:0>  <8:5>  RW  AST enable register, used to individually enable each of the four AST interrupt requests. The bit order within this field is: User Mode, bit 8; Supervisor Mode, bit 7; Executive Mode, bit 6; Kernel Mode, bit 5.
Reserved  <4:3>
FPE  <2>  RW,1  Floating-point enable. If clear, floating-point instructions generate FEN exceptions. This bit is set by hardware on reset.
PPCE  <1>  RW  Process performance counting enable. Enables performance counting for an individual process with counters PCTR0 or PCTR1, which are enabled by setting PCT0_EN or PCT1_EN, respectively. Performance counting for the entire system can be enabled by setting I_CTL<SPCE>.

D.11 21274 Cchip Miscellaneous Register (MISC)

This register is designed so that there are no read side effects, and that writing a 0 to any bit has no effect. Therefore, when software wants to write a 1 to any bit in the register, it need not be concerned with read-modify-write or the status of any other bits in the register. Once NXM is set, the NXS field is locked so that initial NXM error information is not overwritten by subsequent errors. It is unlocked when the software clears the NXM CPU; however, writing it locks out the other CPU. Writing a 1 to ACL (arbitration clear) clears both ABW bits and both ABT (arbitration try) bits and unlocks the ABW field.

Address  801 A000 0080
Access  RW

Table D–10 21274 Cchip Miscellaneous Register Fields

Name  Bits  Type  Initial State  Description
RES  <63:44>  MBZ, RAZ  0  Reserved.
DEVSUP  <43:40>  WO  0
REV  <39:32>  RO  1  Cchip revision reads as 16
NXS  <31:29>  RO  0  NXM source: device that caused the NXM. Unpredictable if NXM not set. 0 = CPU0, 1 = CPU1, 2 = CPU2, 3 = CPU3, 4 = P-chip 0, 5 = P-chip 1, 6 and 7 = Reserved.
NXM  <28>  R, W1C  0  Nonexistent memory address detected.
Sets DRIR<63> and locks the NXS field until it is cleared.
RES  <27:25>  MBZ, RW  0  Reserved.
ACL  <24>  WO  0  Arbitration clear: writing a 1 to this bit clears the ABT and ABW fields.
ABT  <23:20>  R, W1S  0  Arbitration try: writing a 1 to these bits sets them.
ABW  <19:16>  R, W1S  0  Arbitration won: writing a 1 to these bits sets them unless one is already set, in which case the write is ignored.
IPREQ  <15:12>  WO  0  Interprocessor interrupt request: write a 1 to the bit corresponding to the CPU you want to interrupt. Writing a 1 here sets the corresponding bit in the IPINTR.
IPINTR  <11:8>  R, W1C  0  Interprocessor interrupt pending, one bit per CPU.
ITINTR  <7:4>  R, W1C  0  Interval timer interrupt pending, one bit per CPU.
RES  <3:2>  MBZ, RW  0  Reserved.
CPUID  <1:0>  RO  -  ID of the CPU performing the read.

D.12 21274 Cchip CPU Device Interrupt Request Register (DIRn, n=0,1,2,3)

Register n applies to CPUn. These registers indicate which interrupts are pending to the CPUs. If a raw request bit is set and the corresponding mask bit is set, then the corresponding bit in this register will be set and the appropriate CPU will be interrupted.
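The relationship between raw requests, mask bits, and DIRn amounts to a bitwise AND. The following Python sketch illustrates it; the names `raw_request` and `mask` are placeholders for illustration, since this section does not name the underlying raw-request and mask registers.

```python
MASK64 = (1 << 64) - 1  # DIRn is a 64-bit register


def dir_value(raw_request: int, mask: int) -> int:
    """Bits pending to a CPU: raised in the raw request AND enabled in the mask."""
    return raw_request & mask & MASK64


def cpu_interrupted(raw_request: int, mask: int) -> bool:
    """A CPU is interrupted when at least one DIRn bit is set."""
    return dir_value(raw_request, mask) != 0
```

For example, a raw request on bit 63 (the Cchip NXM error) is only visible in DIRn if bit 63 of the mask is also set.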
Address
  801 A000 0280  CPU0
  801 A000 02C0  CPU1
  801 A000 0680  CPU2
  801 A000 06C0  CPU3
Access  RO

Table D–11 21274 Device Interrupt Request Register Fields

Name  Bits  Type  Initial State  Description
ERR  <63:58>  RO  0  IRQ0 error interrupts:
  <63>  Cchip detected MISC<NXM>
  <62>  Recommended hookup to Pchip0 error
  <61>  Recommended hookup to Pchip1 error
  <60>  Recommended hookup to Pchip0 soft error
  <59>  Recommended hookup to Pchip1 soft error
RES  <57:56>  RO  0  Reserved
DEV  <55:12>  RO  0  IRQ1 PCI interrupts pending to the CPU
Hot Plug  <11:9>  RO  0  Hot plug controller interrupt
DEV  <8:0>  RO  0  IRQ1 PCI controller interrupt

D.13 21274 Array Address Registers (AAR0–AAR3)

The Array Address Registers define the base address and size for each memory array.

Table D–12 21274 Array Address Register (AAR)

Field  Bits  Type  Init  Description
RES  <63:35>  MBZ,RAZ  0  Reserved.
ADDR  <34:24>  RW  0  Base address. Bits <34:24> of the physical byte address of the first byte in the array.
RES  <23:17>  MBZ,RAZ  0  Reserved.
DBG  <16>  RW  0  Enables this memory port to be used as a debug interface.
ASIZ  <15:12>  RW  0  Array size. This field must be non-zero for AAR0.
  Value  Size
  0000  0 (bank disabled)
  0001  16MB
  0010  32MB
  0011  64MB
  0100  128MB
  0101  256MB
  0110  512MB
  0111  1GB
  1000  2GB
  1001  4GB
  1010  8GB
  1011–1111  Reserved
RES  <11:10>  MBZ,RAZ  0  Reserved.
DSA  <9>  RW  0  Double (twice)-split array.
SA  <8>  RW  0  Split array.
RES  <7:4>  MBZ,RAZ  0  Reserved.
ROWS  <3:2>  RW  0  Number of row bits in the SDRAMs.
  Value  Number of Bits
  0  11
  1  12
  2  13
  3  Reserved
BNKS  <1:0>  RW  0  Number of bank bits in the SDRAMs.
  Value  Number of Bits
  0  1
  1  2
  2  3
  3  Reserved

D.14 Pchip System Error Register (SERROR)

This register is used for logging system errors. When system error bits <4, 2:0> are set, this entire register is frozen. Only the NXIO bit and the LOST bit can be set after that.
All other values will be held until bits <2:1> are clear. When an error is detected and one of bits <2:1> is set, the associated quadword address, CAP bus command, and syndrome are captured in bits <63:16> of this register. The NXIO bit does not log any address information, and does not lock the register.

Table D–13 Pchip System Error Register

Field  Bits  Type  Init  Description
SYN  <63:56>  RO  0  ECC syndrome of error
CMD  <55:54>  RO  0  Transaction type
  Value  Command
  00  DMA read
  01  DMA RMW
  10  SGTE read
  11  Reserved
SOURCE  <53:52>  RO  0  Source bus
  Value  Source
  00  GPCI
  01  APCI
  10  AGP HP
  11  AGP LP
RES  <51:47>  RAZ  0  Reserved
ADDR  <46:15>  RO  0  Address of the erroneous quadword
RES  <14:5>  RAZ  0  Reserved
LOST_CRE  <4>  R,W1C  0  Lost a correctable ECC error because it was detected after this register was locked. c_err is asserted as long as this bit is set.
NXIO  <3>  R,W1C  0  Nonexistent IO error. Indicates that a reserved IO space was addressed by the CPU. Logged if SERREN<NXIO> is set. No address is logged, and the setting of this bit does not affect the state of the Lost bit. h_err is asserted as long as this bit is set.
CRE  <2>  R,W1C  0  Correctable ECC error. Logged if SERREN<CRE> is set. c_err is asserted as long as this bit is set.
UECC  <1>  R,W1C  0  Uncorrectable ECC error. Logged if SERREN<UECC> is set. h_err is asserted as long as this bit is set.
LOST_UECC  <0>  R,W1C  0  Lost an uncorrectable ECC error because it was detected after this register was locked. h_err is asserted as long as this bit is set.

D.15 Pchip A/G PCI Error Register (GPERROR, APERROR)

This register is used for logging PCI errors on the GPCI or APCI buses respectively. The GPCI and APCI registers are identical. If any of bits <11:2> are set, then this entire register is frozen and the Pchip output signal h_err is asserted. Only bits <1:0> can be set after that.
All other values will be held until bits <11:2, 0> are clear. When an error is detected and one of bits <10:2> is set, the associated cache block address, PCI command, and syndrome are captured in bits <55:52, 48:14> of this register. A monster window address has PCI address bit <40> set, and this is reflected in A/G PERROR bit <48>. Likewise, a non-MWIN dual address cycle (DAC) has PCI address bit <39> set, which shows up in A/G PERROR bit <47>. Bits <46:14> of A/G PERROR contain the longword PCI address bits <34:02> (a DAC could have bits set in PCI address bits <34:32>). Bits <11:1> of this register are only set if the corresponding enable bits are set in the PERREN register.

NOTE: Software must not perform back-to-back writes to this register. A write to this register must be followed by a read from any PA-chip CSR, or a write to any other CSR (except for the corresponding A/G PERRSET register). Back-to-back writes will yield unpredictable results.

Table D–14 Pchip Error Register

Field  Bits  Type  Init  Description
RES  <63:56>  RAZ  0  Reserved
CMD  <55:52>  RO  0  PCI command
  Value  Command
  0000  Interrupt Acknowledge
  0001  Special Cycle
  0010  I/O Read
  0011  I/O Write
  0100  Reserved
  0101  Reserved
  0110  Memory Read
  0111  Memory Write
  1000  Reserved
  1001  Reserved
  1010  Configuration Read
  1011  Configuration Write
  1100  Memory Read Multiple
  1101  Dual Address Cycle
  1110  Memory Read Line
  1111  Memory Write and Invalidate
RES  <51:49>  RAZ  0  Reserved
MWIN  <48>  RO  0  Indicates that the erroneous access was to the Monster Window (PCI address bit <40>).
DAC  <47>  RO  0  Indicates that the erroneous access was a DAC (PCI address bit <39>).
ADDR  <46:14>  RO  0  Contains PCI address bits <34:02>
RES  <13:11>  RAZ  0  Reserved
IPTPW  <10>  R,W1C  0  Invalid peer-to-peer write.
IPTPR  <9>  R,W1C  0  Invalid peer-to-peer read.
NDS  <8>  R,W1C  0  No devsel received as a PCI master.
DPE <7> R,W1C 0 Pchip detected a parity error on data it received from another PCI device. If the command logged in bits <55:52> of this register is a read, the PCI transaction that encountered the error was a PIO Read or a PTP read on the destination bus, and the Pchip was the master. If the command logged is a write, the transaction that encountered the error was a DMA Write or a PTP write on the source bus (will only be logged in the GPERROR for Pchip Rev 0) and the Pchip was a target. TA <6> R,W1C 0 Target abort received as PCI master. APE <5> R,W1C 0 Address parity error detected as target. SGE <4> R,W1C 0 Scatter Gather Error, invalid page table entry. Continued on next page Registers D-33 Table D–14 Pchip Error Register (Continued) Field Bits Type Init Description DCRTO <3> R,W1C 0 Delayed completion Retry timeout as PCI target. PERR <2> R,W1C 0 Pchip received a p_err_l assertion on data it sent to the PCI. If the command logged in bits <55:52> of this register is a read, the PCI transaction that encountered the error was a DMA Read or a PTP read on the source bus and the Pchip was the target. If the command logged is a write, the transaction that encountered the error was a PIO Write or a PTP write on the destination bus. SERR <1> R,W1C 0 Set when serr_l assertion is detected on the PCI. Does not log any other information in this register. Does not affect the logging of the Lost bit. LOST <0> R,W1C 0 Lost an error because it was detected after this register was locked. D-34 ES45 Service Guide D.16 Pchip AGP Error Register (AGPERROR ) The register is used for logging AGP errors. If any of bits <6:4, 0> are set, then this entire register is frozen and the Pchip output signal h_err is asserted. Only bit <0> can be set after that. All other values will be held until bits <6:4> are clear. When an error is detected and one of bits <6:4> is set, the associated cache block address and AGP bus command are captured in bits <63:16> of this register. 
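As an illustration of how a captured AGPERROR value might be taken apart, the following Python sketch extracts the fields at the bit positions given in Table D–15. The function and dictionary names are hypothetical, and the sketch assumes the register value has already been read as a 64-bit integer.

```python
# Command encodings for CMD <52:50>; codes 100 and 101 are reserved.
AGP_COMMANDS = {
    0b000: "Read (low priority)",
    0b001: "Read (high priority)",
    0b010: "Write (low priority)",
    0b011: "Write (high priority)",
    0b110: "Flush",
    0b111: "Fence",
}

# Error status bits <6:0>, highest bit first.
ERROR_BITS = {6: "NOWINDOW", 5: "PTP", 4: "IPTE", 3: "RESCMD",
              2: "HPQFULL", 1: "LPQFULL", 0: "LOST"}


def field(value: int, hi: int, lo: int) -> int:
    """Extract bits <hi:lo> of value."""
    return (value >> lo) & ((1 << (hi - lo + 1)) - 1)


def decode_agperror(value: int) -> dict:
    addr_field = field(value, 46, 15)  # holds AGP address bits <34:3>
    return {
        "fence": field(value, 59, 59),
        "length_qw": field(value, 58, 53),
        "command": AGP_COMMANDS.get(field(value, 52, 50), "Reserved"),
        "mwin": field(value, 49, 49),
        "dac": field(value, 48, 48),
        "byte_address": addr_field << 3,
        "errors": [name for bit, name in ERROR_BITS.items() if (value >> bit) & 1],
        # The register is frozen when any of bits <6:4, 0> is set.
        "frozen": any((value >> bit) & 1 for bit in (6, 5, 4, 0)),
    }
```

The `frozen` flag mirrors the rule stated for this register; it is derived from the error bits rather than read from hardware.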
Table D–15 Pchip AGP Error Register

Field  Bits  Type  Init  Description
RES  <63:60>  MBZ  0  Reserved
FENCE  <59>  RO  0  Fence bit, used only if command code indicates LP transaction.
LENGTH  <58:53>  RO  0  AGP transaction length in quadwords
CMD  <52:50>  RO  0  AGP Command
  Value  Command
  000  Read (low priority)
  001  Read (high priority)
  010  Write (low priority)
  011  Write (high priority)
  100, 101  Reserved
  110  Flush
  111  Fence
MWIN  <49>  RO  0  Monster Window hit
DAC  <48>  RO  0  DAC
RES  <47>  RAZ  0  Reserved
ADDR  <46:15>  RO  0  AGP address <34:3> corresponding to the erroneous quadword.
RES  <14:7>  RAZ  0  Reserved
NOWINDOW  <6>  R,W1C  0  An incoming AGP address did not match the Window registers.
PTP  <5>  R,W1C  0  An incoming scatter-gather address had the PTP bit enabled.
IPTE  <4>  R,W1C  0  Invalid page table entry.
RESCMD  <3>  R,W1C  0  Reserved command received on the PCI. Logged if AGPERREN<RESCMD> is set. No other information is logged. Does not affect the setting of the Lost bit.
HPQFULL  <2>  R,W1C  0  Reserved command received on the PCI. Logged if AGPERREN<RESCMD> is set. No other information is logged. Does not affect the setting of the Lost bit.
LPQFULL  <1>  R,W1C  0  The AGP Request Queue is full and a subsequent LP transaction was received. Logged if AGPERREN<LPQFULL> is set. No other information is logged. Does not affect the setting of the Lost bit.
LOST  <0>  R,W1C  0  Lost an error because it was detected after this register was locked.

D.17 DPR Registers for 680 Correctable Machine Check Logout Frames

DPR locations A0:A9 represent the information that the console will read when a 680 machine check logout frame is loaded. They provide the interrupt information obtained by the RMC through the LM78 sensors. When an error occurs, the RMC writes the bits and delivers an IRQ to the SRM console. The SRM reads the bits and clears them.
On the next 680 error, the RMC writes the error into the A0:A9 locations.

Table D–16 DPR Locations A0:A9

DPR Location  Description
A0  If bit is set, the associated fault is active. Bits, listed in order starting at bit 0:
  +3.3 V out of tolerance
  +5 V out of tolerance
  +12 V out of tolerance
  Vterm out of tolerance
  PCI backplane Zone 0 temp sensor is overtemp
  BTI (overtemp signals from all CPU and LM78 sensors)
  Fan 1 fault (below the minimum RPM)
  Fan 2 fault (below the minimum RPM)
A1
  Bit 0  CTERM out of tolerance
  Bit 2  –12 V out of tolerance
A2  If bit is set, the associated fault is active.
  Bit 0  CPU0_VCORE out of tolerance
  Bit 1  CPU0_VIO out of tolerance
  Bit 2  CPU1_VCORE out of tolerance
  Bit 3  CPU1_VIO out of tolerance
  Bit 4  PCI backplane LM78 1 is overtemp
  Bit 5  Not used
  Bit 6  Fan 4 fault
  Bit 7  Fan 5 fault
A3
  Bit 0  +1.5 volt out of tolerance
  Bit 1  CPU0_VCACHE out of tolerance
  Bit 2  CPU1_VCACHE out of tolerance
  Bit 4  +2.5 volt out of tolerance
  Bit 5  CPU2_VCACHE out of tolerance
  Bit 6  CPU3_VCACHE out of tolerance
A4  If bit is set, the associated fault is active.
  Bit 0  CPU2_VCORE out of tolerance
  Bit 1  CPU2_VIO out of tolerance
  Bit 2  CPU3_VCORE out of tolerance
  Bit 3  CPU3_VIO out of tolerance
  Bit 4  PCI backplane LM78 2 is overtemp
  Bit 5  Not used
  Bit 6  Fan 3 fault
  Bit 7  Fan 6 fault
A5
  Bit 7  AC_input value high limit
  Bit 6  AC_input value low limit
  Bit 5  Monitor the temperature
  Bit 4  Current from +12 volt rail is out of tolerance
  Bit 3  Current from 5.5 volt rail is out of tolerance
  Bit 2  Current from 3.3 volt rail is out of tolerance
  Bit 1-0  Failing power supply number (0, 1, 2 are valid)
A6  These bits indicate a door has been opened.
  Bit 0  unused
  Bit 1  CPU door is open
  Bit 2  Fan door is open
  Bit 3  PCI door is open
  Bit 5  System CPU door is open
  Bit 6  System fan door is open
  Bit 7  System PCI door is open
A7  Temperature Warning Mask
  Bit 0  CPU0 temp warning
  Bit 1  CPU1 temp warning
  Bit 2  CPU2 temp warning
  Bit 3  CPU3 temp warning
  Bit 4  Temp Zone 0 (LM78 0 on PCI backplane)
  Bit 5  Temp Zone 1 (LM78 1 on PCI backplane)
  Bit 6  Temp Zone 2 (LM78 2 on PCI backplane)
A8  Fan Controller Fault. This indicates a fan is not responding to a different RPM range as set by the RMC. (It is used to indicate that the fan failed to reach its maximum RPM at power-up.)
  Bit 0  Fan 1
  Bit 1  Fan 2
  Bit 3  Fan 3
  Bit 4  Fan 4
  Bit 5  Fan 5
  Bit 6  Fan 6
A9  These bits indicate which temperature zone the rise or fall in temperature occurred in.
  Bit 0  CPU fans spin at the maximum speed
  Bit 1  CPU fans reduce the speed from the maximum speed
  Bit 2  PCI fans spin at the maximum speed
  Bit 3  PCI fans reduce the speed from the maximum speed

D.18 DPR Power Supply Status Registers

The RMC reads nine bytes of information from each of the three power supplies. The first byte is read from an I/O expander port, the second four bytes and the last four bytes are read from the A–D converter.

Table D–17 Nine Bytes Read from Power Supply

DPR Location  Definition
DB/E4/ED  Reads I/O expander on power supply 0, 1, 2
  Bit 0  PS_ID0_L
  Bit 1  PS_ID1_L
  Bit 2  Reserved (pulled up so bit is always enabled)
  Bit 3  Thermal_Shutdown_H
  Bit 4:7  Tied to High within PS
DC/E5/EE  3.3V_current. Each step equals 0.255 (0xFF x 0.33203 = 85A)
DD/E6/EF  5 V_current. Each step equals 0.255 (0xFF x 0.33203 = 85A)
DE/E7/F0  12 V_current. Each step equals 0.033 (0xFF x 0.07813 = 20A)
DF/E8/F1  Fan_Speed (0x8B = 7 V)
E0/E9/F2  AC_INPUT value in hex. Each step equals 1.07422VAC (0xFF x 1.07422 = 275VAC)
E1/EA/F3  Power_supply_internal_temperature (hot). Byte represents a temp value; 1 bit = 0.756° C
E2/EB/F4  Power_supply_inlet_temperature; 1 bit = 0.266° C
E3/EC/F5  Spare

NOTE: The DPR locations refer to power supplies. For example, DB/E4/ED = power supply 0/1/2. The same is true for all locations listed in the table.

D.19 DPR 680 Fatal Registers

The RMC is powered by an auxiliary 5V supply that is independent from the system power subsystem.
When any catastrophic failures (such as overtemperature failure) occur, this error state is captured as shown in Table D–18. The information is used to populate the console data log uncorrectable error frame in Environ_QW_8. Table D–18 DPR 680 Fatal Registers DPR Location Definition BD Copy of the power supply AC input value Bit 0 PS0 1 indicates AC input is valid; 0 indicates invalid Bit 1 PS1 Bit 2 PS2 BE Snapshot of the fault I/O expander, which indicates PS, VTERM, CPU regulator fault if bit is set. Bit 0 PS0 Bit 1 PS1 Bit 2 PS2 Bit 3 VTERM Bit 4 CPU0 Bit 5 CPU1 Bit 6 CPU2 Bit 7 CPU3 BF RMC shutdown code Bit 0 Unused Bit 1 No CPU in CPU slot 0 Bit 2 Invalid CPU SROM voltage setting or checksum Bit 3 TIG load initialization or sequence fail Bit 4 Overtemperature failure Bit 5 CPU door open Bit 6 CPU fans 5 and 6 failed Bit 7 CTERM failure Registers D-41 D.20 CPU and System Uncorrectable Machine Check Logout Frame The SRM console builds the uncorrectable machine check logout frames and passes them to the OS error handlers. The OS error handlers further process and subsequently log the formatted error event into the system binary error log. 
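The quadword layout of the uncorrectable logout frame (Table D–19) can be expressed as an offset map. The Python sketch below is illustrative only: the field names are shorthand chosen here, and it assumes the frame is available as a little-endian byte buffer, which this guide does not state.

```python
import struct

FRAME_SIZE = 0xC8  # frame size given in Table D-19

# (offset, shorthand name) for each quadword of the frame, in table order.
UNCORRECTABLE_FRAME_FIELDS = [
    (0x00, "flags_and_frame_size"), (0x08, "area_offsets"),
    (0x10, "mchk_revision_and_code"),
    (0x18, "I_STAT"), (0x20, "DC_STAT"), (0x28, "C_ADDR"),
    (0x30, "C_SYNDROME_1"), (0x38, "C_SYNDROME_0"), (0x40, "C_STAT"),
    (0x48, "C_STS"), (0x50, "MM_STAT"), (0x58, "EXC_ADDR"),
    (0x60, "IER_CM"), (0x68, "ISUM"), (0x70, "EV68_reserved_0"),
    (0x78, "PAL_BASE"), (0x80, "I_CTL"), (0x88, "PCTX"),
    (0x90, "EV68_reserved_1"), (0x98, "EV68_reserved_2"),
    (0xA0, "software_error_summary"), (0xA8, "DIRx"), (0xB0, "MISC"),
    (0xB8, "P0_PERROR"), (0xC0, "P1_PERROR"),
]


def parse_uncorrectable_frame(buf: bytes) -> dict:
    """Split a logout frame into named 64-bit quadwords."""
    if len(buf) < FRAME_SIZE:
        raise ValueError("buffer shorter than frame size 0xC8")
    return {name: struct.unpack_from("<Q", buf, offset)[0]
            for offset, name in UNCORRECTABLE_FRAME_FIELDS}
```

The EV68 area begins at offset 0x18 and the system area at 0xA0, matching the area offsets recorded in the frame header.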
Table D–19 CPU and System Uncorrectable Machine Check Logout Frame

Offset (Hex)  Contents (bit 63 on the left, bit 0 on the right)
00000000  Retryable/Second Error Flags | Frame Size (00C8)
00000008  System Area Offset (00A0) | EV68 Area Offset (0018)
00000010  Machine Check Frame Revision (1) | Machine Check Code
00000018  EV68 Ibox Status (I_STAT<31:29>)
00000020  EV68 Dcache Status (DC_STAT<4:0>)
00000028  EV68 Cbox (C_ADDR<43:6>)
00000030  EV68 Cbox (C_SYNDROME_1<7:0>)
00000038  EV68 Cbox (C_SYNDROME_0<7:0>)
00000040  EV68 Cbox (C_STAT<4:0>)
00000048  EV68 Cbox (C_STS<3:0>)
00000050  EV68 TB Miss or Fault Status (MM_STAT<10:0>)
00000058  EV68 Exception Address (EXC_ADDR)
00000060  EV68 Interrupt Enablement and Current Processor Mode (IER_CM)
00000068  EV68 Interrupt Summary Register (ISUM)
00000070  EV68 Reserved 0
00000078  EV68 PAL Base Address (PAL_BASE)
00000080  EV68 Ibox Control (I_CTL)
00000088  EV68 Ibox Process Context (PCTX)
00000090  EV68 Reserved 1
00000098  EV68 Reserved 2
000000A0  Software Error Summary Flags
000000A8  Cchip CPUx Device Interrupt Request Register (DIRx System Primary CPU Fault Watcher)
000000B0  Cchip Miscellaneous Register (MISC)
000000B8  Pchip 0 Error Register (P0_PERROR)
000000C0  Pchip 1 Error Register (P1_PERROR)

NOTE: For CPU uncorrectable frames, offsets B0–B8 are zeroed; for system uncorrectable frames, offsets 18–98 are zeroed.

D.21 Console Data Log Event Environmental Error Logout Frame (680 Uncorrectable)

Compaq Analyze uses the logout frame in Table D–20 for its decomposition of all 680 system environmental uncorrectable error frames.
Table D–20 Console Data Log Event Environmental Error Logout Frame (680 Uncorrectable)

Offset (Hex)  Contents (bit 63 on the left, bit 0 on the right)
00000000  Revision (1) | Type (3) | Class (12) | Length (80)
00000008  Processor WHAMI
00000010  Retryable/Second Error Flags | Frame Size (0070)
00000018  System Area Offset (0020) | EV68 Area Offset (0020) (see note 1)
00000020  Machine Check Frame Revision | Machine Check Code (206)
00000028  Software Error Summary Flags
00000030  Cchip CPUx Device Interrupt Request Register (DIRx System Primary CPU Fault Watcher)
00000038  Environ_QW_1 (TIG System Management Information Register (SMIR))
00000040  Environ_QW_2 (TIG CPU Information Register (CPUIR))
00000048  Environ_QW_3 (TIG Power Supply Information Register (PSIR))
00000050  Environ_QW_4 (System_PS/Temp/Fan_Fault - LM78_ISR)
00000058  Environ_QW_5 (System_Doors)
00000060  Environ_QW_6 (System_Temperature_Warning)
00000068  Environ_QW_7 (System_Fan_Control_Fault)
00000070  Environ_QW_8 (Fatal_Power_Down_Codes)
00000078  Environ_QW_9 (Environmental Reserved 1)

NOTE: Only Environ_QW_8 contains valid error state capture. Environ_QW_1–7 and Environ_QW_9 will be zeroed.

Note 1: Per Alpha SRM requirement.

D.22 CPU and System Correctable Machine Check Logout Frame

The SRM console builds the correctable machine check logout frames and passes them to the OS error handlers. The OS error handlers further process and subsequently log the formatted error event into the system binary error log. The operating systems contain built-in throttling mechanisms to handle high-volume bursting of these correctable error conditions.
Table D–21 CPU and System Correctable Machine Check Logout Frame

Offset (Hex)  Contents (bit 63 on the left, bit 0 on the right)
00000000  Retryable/Second Error Flags | Frame Size (0080)
00000008  System Area Offset (0058) | EV68 Area Offset (0018)
00000010  Machine Check Frame Revision (1) | Machine Check Code
00000018  EV68 Ibox Status (I_STAT<31:29>)
00000020  EV68 Dcache Status (DC_STAT<4:0>)
00000028  EV68 Cbox (C_ADDR<43:6>)
00000030  EV68 Cbox (C_SYNDROME_1<7:0>)
00000038  EV68 Cbox (C_SYNDROME_0<7:0>)
00000040  EV68 Cbox (C_STAT<4:0>)
00000048  EV68 Cbox (C_STS<3:0>)
00000050  EV68 TB Miss or Fault Status (MM_STAT<10:0>)
00000058  Software Error Summary Flags (See section 1.4.2)
00000060  Cchip CPUx Device Interrupt Request Register (DIRx System Primary CPU Fault Watcher)
00000068  Cchip Miscellaneous Register (MISC)
00000070  Pchip 0 Error Register (P0_PERROR)
00000078  Pchip 1 Error Register (P1_PERROR)

NOTE: For CPU correctable frames, offsets 68–78 will be zeroed; for system correctable frames, offsets 18–50 will be zeroed.

D.23 Environmental Error Logout Frame (680 Correctable)

Table D–22 shows Environ_QW_1:7 and Environ_QW_8 error state capture information from DPR locations A0:A9 and BD:BF, respectively.
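As an illustration of translating one of the DPR bytes that feed Environ_QW_8, the following Python sketch names the set bits of the RMC shutdown code byte (DPR location BF, Table D–18). How the BD:BF bytes are packed into the quadword is not specified in this section, so the hypothetical function below takes the single byte directly.

```python
# Bit meanings for DPR location BF (RMC shutdown code); bit 0 is unused.
SHUTDOWN_CODE_BITS = {
    1: "No CPU in CPU slot 0",
    2: "Invalid CPU SROM voltage setting or checksum",
    3: "TIG load initialization or sequence fail",
    4: "Overtemperature failure",
    5: "CPU door open",
    6: "CPU fans 5 and 6 failed",
    7: "CTERM failure",
}


def decode_shutdown_code(byte_value: int) -> list:
    """Return the descriptions of all shutdown-code bits set in the byte."""
    return [text for bit, text in SHUTDOWN_CODE_BITS.items()
            if byte_value & (1 << bit)]
```

A value of 0x30, for example, would report both an overtemperature failure and an open CPU door.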
Table D–22 Environmental Error Logout Frame

Bit positions: 63 56 | 55 48 | 47 40 | 39 32 | 31 24 | 23 16 | 15 8 | 7 0

Offset (Hex)   Contents
00000000       Retryable/Second Error Flags; Frame Size (0070)
00000008       System Area Offset (0018); EV68 Area Offset (0018) [1]
00000010       Machine Check Frame Revision (1); Machine Check Code (206)
00000018       Software Error Summary Flags
00000020       Cchip CPUx Device Interrupt Request Register (DIRx System Primary CPU Fault Watcher)
00000028       Environ_QW_1 (TIG System Management Information Register (SMIR))
00000030       Environ_QW_2 (TIG CPU Information Register (CPUIR))
00000038       Environ_QW_3 (TIG Power Supply Information Register (PSIR))
00000040       Environ_QW_4 (System_PS/Temp/Fan_Fault - LM78_ISR)
00000048       Environ_QW_5 (System_Doors)
00000050       Environ_QW_6 (System_Temperature_Warning)
00000058       Environ_QW_7 (System_Fan_Control_Fault)
00000060       Environ_QW_8 (Fatal_Power_Down_Codes)
00000068       Environ_QW_9 (Environmental Reserved 1)

NOTE: Only Environ_QW_1–7 contain valid error state capture. Environ_QW_8 and Environ_QW_9 will be zeroed.

[1] Per Alpha SRM requirement.

D.24 Platform Logout Frame Register Translation

Compaq Analyze uses information from all logout frames for its decomposition of all error events. The error state bit definitions of all platform logout frame registers are shown in Table D–23.
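The bit-field notation used in Table D–23, such as <9:4> or <13>, names an inclusive range of bits within a 64-bit register value. The following minimal sketch shows one way to pull such a field out of a logged quadword. The helper names `field` and `is_set` are hypothetical and are not part of any Compaq tool.

```python
def field(value: int, hi: int, lo: int) -> int:
    """Return bits <hi:lo> (inclusive) of a 64-bit register value."""
    if not 0 <= lo <= hi <= 63:
        raise ValueError("expected 0 <= lo <= hi <= 63")
    width = hi - lo + 1
    return (value >> lo) & ((1 << width) - 1)

def is_set(value: int, bit: int) -> bool:
    """Return True if the single bit <bit> is set."""
    return field(value, bit, bit) == 1

# Illustrative value only, not taken from a real error log.
sample = 0x2B1
opcode_like = field(sample, 9, 4)   # bits <9:4>
low_nibble = field(sample, 3, 0)    # bits <3:0>
```

A single-bit entry such as <13> is simply the case where both ends of the range are the same bit.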
Table D–23 Bit Definition of Logout Frame Registers

Each entry gives the register identification, then the bit field and its text translation description.

C_SYNDROME_0
  <7:0>      Syndrome for lower quadword in octaword of victim that was scrubbed, as follows:

             <7:0> (Hex)   Data Bit      <7:0> (Hex)   Data Bit
             CE            00            4F            32
             CB            01            4A            33
             D3            02            52            34
             D5            03            54            35
             D6            04            57            36
             D9            05            58            37
             DA            06            5B            38
             DC            07            5D            39
             23            08            A2            40
             25            09            A4            41
             26            10            A7            42
             29            11            A8            43
             2A            12            AB            44
             2C            13            AD            45
             31            14            B0            46
             34            15            B5            47
             0E            16            8F            48
             0B            17            8A            49
             13            18            92            50
             15            19            94            51
             16            20            97            52
             19            21            98            53
             1A            22            9B            54
             1C            23            9D            55
             E3            24            62            56
             E5            25            64            57
             E6            26            67            58
             E9            27            68            59
             EA            28            6B            60
             EC            29            6D            61
             F1            30            70            62
             F4            31            75            63
             01            CB0           10            CB4
             02            CB1           20            CB5
             04            CB2           40            CB6
             08            CB3           80            CB7

C_SYNDROME_1
  <7:0>      Syndrome for upper quadword in octaword of victim that was scrubbed (same as specified above)

C_STAT
  <4:0>      <4:0> (Hex)   Detected Error [1]
             00            No Error unless DC_STAT<3> = 1, indicating bcache/dcache victim read ECC error
             01            SNGL_BC_TAG_PERR
             02            SNGL_DC_DUPLICATE_TAG_PERR
             03            SNGL_DSTREAM_MEM_ECC_ERROR
             04            SNGL_DSTREAM_BC_ECC_ERR
             05            SNGL_DSTREAM_DC_ECC_ERR
             06 or 07      SNGL_BC_PROBE_HIT_ERR
             0B            SNGL_ISTREAM_MEM_ECC_ERR
             0C            SNGL_ISTREAM_BC_ECC_ERR
             13            DBL_DSTREAM_MEM_ECC_ERR
             14            DBL_DSTREAM_BC_ECC_ERR
             1B            DBL_ISTREAM_MEM_ECC_ERR
             1C            DBL_ISTREAM_BC_ECC_ERR

C_STS
  <7:4>      Reserved
  <3:0>      Captured status of the Bcache in INIT mode (<3> = Parity, <2> = Valid, <1> = Dirty, <0> = Shared)

C_ADDR
  <42:6>     Address of last reported ECC or parity error. If C_STAT<4:0> = 05(Hex), then only C_ADDR<19:6> are valid.

[1] SNGL: single-bit error leading to correctable error; DBL: double-bit error leading to uncorrectable error.
I_STAT
  <63:41>    Reserved
  <40>       ProfileMe Mispredict Trap
  <39>       ProfileMe Trap
  <38>       ProfileMe Load-Store Order Trap
  <37:34>    ProfileMe Trap Types
  <33>       ProfileMe Icache Miss
  <32:30>    ProfileMe Counter 0 Overcount
  <29>       Set = Icache encountered a parity error on instruction fetch and a replay trap is performed, which generates a correctable read interrupt
  <28:0>     Reserved

DC_STAT
  <4:0>      00001(Bin) = Dcache tag probe pipeline 0 error;
             00010(Bin) = Dcache tag probe pipeline 1 error;
             00100(Bin) = Dcache data ECC error during store;
             01000(Bin) = Dcache, Bcache or System fill data ECC error during load;
             10000(Bin) = Dcache data store ECC error occurred within 6 cycles of the previous Dcache store ECC error

MM_STAT
  <3:0>      0001(Bin) = Write reference triggered error;
             0010(Bin) = Reference caused an access violation;
             0100(Bin) = PTE<FOR> bit set during read reference error;
             1000(Bin) = PTE<FOW> bit set during write reference error
  <10>       Set = Dcache tag parity correctable error during initial tag probe of load/store instruction
  <9:4>      Opcode of instruction which triggered error

EXC_ADDR
  <0>        Set = exception or interrupt occurred in PAL mode
  <63:2>     Contains the PC address of the instruction that would have executed if the error interrupt did not occur
IER_CM
  <4:3>      00(Bin) = Kernel Mode, 01(Bin) = Executive Mode, 10(Bin) = Supervisor Mode, 11(Bin) = User Mode
  <13>       Set = enables those AST interrupt requests by ASTER
  <28:14>    Software interrupt enables
  <30:29>    Performance counter interrupt enables
  <31>       Set = Correctable read error interrupt enabled
  <32>       Set = Serial Line Interrupt Enabled
  <38:33>    External IRQ<5:0> enable

I_SUM
  <4:3>      AST Kernel and Executive Interrupts pending; <3> Set = Kernel Mode AST interrupt pending, <4> Set = Executive Mode AST interrupt pending
  <10:9>     AST Supervisor and User Interrupts pending; <9> Set = Supervisor Mode AST interrupt pending, <10> Set = User Mode AST interrupt pending
  <28:14>    Software interrupts pending
  <32>       Serial line interrupt pending
  <31>       Set = Corrected read interrupt pending
  <30:29>    Performance counter interrupts pending
  <38:33>    External interrupts pending

PAL_BASE
  <43:15>    Contains the physical base address for PALcode

I_CTL
  <2:1>      01(Bin) and 10(Bin) for Icache set 1 or 2 enabled, respectively
  <7:6>      01(Bin) = R8-R11 and R24-R27 are used for PAL shadow registers; 10(Bin) = R4-R7 and R20-R23 are used for PAL shadow registers
  <13>       Set = forces bad Icache tag parity
  <14>       Set = forces bad Icache data parity
  <15>       Clear and set for 43-bit or 48-bit virtual address format, respectively
  <20>       Clear or set for R23 or R27 used as CALL_PAL linkage register, respectively
  <21>       Set to enable machine check processing
  <29:24>    Revision ID number for EV68 chip as follows: 01(Hex) = Pass 1.0; 02(Hex) = Pass 2.2; 03(Hex) = Pass 2.3; 04(Hex) = Pass 3.0
  <47:30>    Virtual page table base address

PCTX (Ibox process context register)
  <0>        Reserved/RAZ
  <1>        If set, both performance counters are enabled
  <2>        If clear, floating-point instructions generate FEN exceptions
  <4:3>      Reserved/RAZ
  <8:5>      Enable AST U,S,E,K interrupt requests
  <12:9>     Request AST U,S,E,K interrupts
  <38:13>    Reserved/RAZ
  <46:39>    Address Space Number
  <63:47>    Reserved/RAZ

Software Error Summary Flags (PAL and OS error handler signaling software flags)
  <0>        Set = Pchip0 P_Error<9:0> error has occurred
  <1>        Set = Pchip1 P_Error<9:0> error has occurred
  <2>        Set = Pchip0 or Pchip1 P_Error<11/10> uncorrectable/correctable error, or CPU correctable error, or CPU uncorrectable error has occurred
  <63:3>     Unused

MISC (Cchip)
  <43:40>    Suppress IRQ1 interrupts to 1(Hex) for CPU0, 2(Hex) for CPU1, 4(Hex) for CPU2, and 8(Hex) for CPU3
  <39:32>    Cchip Revision Level: 00-07(Hex) for C2, 08-0F(Hex) for C4
  <31:29>    0(Hex) for CPU0, 1(Hex) for CPU1, 2(Hex) for CPU2, 3(Hex) for CPU3, 4(Hex) for Pchip0, 5(Hex) for Pchip1, as device (source) which caused the NXM
  <28>       Set = NXM address detected, <31:29> are locked, DRIR<63> is set
  <24>       Write 1 = Arbitration Clear
  <23:20>    = 1(Hex) for CPU0, 2(Hex) for CPU1, 4(Hex) for CPU2, and 8(Hex) for CPU3 Arbitration Trying
  <19:16>    = 1(Hex) for CPU0, 2(Hex) for CPU1, 4(Hex) for CPU2, and 8(Hex) for CPU3 Arbitration Won
  <15:12>    = 1(Hex) for CPU0, 2(Hex) for CPU1, 4(Hex) for CPU2, and 8(Hex) for CPU3 to set interprocessor interrupt request
  <11:8>     = 1(Hex) for CPU0, 2(Hex) for CPU1, 4(Hex) for CPU2, and 8(Hex) for CPU3 interprocessor interrupt (IRQ<3>) pending
  <7:4>      = 1(Hex) for CPU0, 2(Hex) for CPU1, 4(Hex) for CPU2, and 8(Hex) for CPU3 interval timer interrupt (IRQ<2>) pending
  <1:0>      = 00(Bin) for CPU0, 01(Bin) for CPU1, 10(Bin) for CPU2, 11(Bin) for CPU3 ID performing the read

DIRx
  <63>       Internal Cchip asynchronous error (i.e., NXM) (IRQ0)
  <62>       P0_Pchip error (IRQ0)
  <61>       P1_Pchip error (IRQ0)
  <60>       P2_Pchip error (future designs) (IRQ0)
  <59>       P3_Pchip error (future designs) (IRQ0)
  <58>       OCP or RMC Halt (IRQ0)
  <57:56>    Unused
  <55>       INTR - PCI_ISA Device Interrupt error (IRQ1)
  <54>       SMI - System Mgmt Interrupt error (IRQ1)
  <53>       NMI - Non-Maskable Interrupt, fatal error (IRQ1)
  <52>       Unused
  <51>       Unused
  <50>       Environmental Temp, Doors, Fans errors (IRQ1)
  <49>       Unused
  <48>       Unused
  <47:44>    Pchip1_SLOT5<3:0> - System PCI Slot 9 INTa,b,c,d (IRQ1)
  <43:40>    Pchip1_SLOT4<3:0> - System PCI Slot 8 INTa,b,c,d (IRQ1)
  <39:36>    Pchip1_SLOT3<3:0> - System PCI Slot 7 INTa,b,c,d (IRQ1)
  <35:32>    Pchip1_SLOT2<3:0> - System PCI Slot 6 INTa,b,c,d (IRQ1)
  <31:28>    Pchip1_SLOT1<3:0> - System PCI Slot 5 INTa,b,c,d (IRQ1)
  <27:24>    Pchip1_SLOT0<3:0> - System PCI Slot 4 INTa,b,c,d (IRQ1)
  <23:20>    Pchip0_SLOT4<3:0> - System PCI Slot 3 INTa,b,c,d (IRQ1)
  <19:16>    Pchip0_SLOT3<3:0> - System PCI Slot 2 INTa,b,c,d (IRQ1)
  <15:12>    Pchip0_SLOT2<3:0> - System PCI Slot 1 INTa,b,c,d (IRQ1)
  <11:8>     Pchip0_SLOT1<3:0> - System PCI Slot 0 INTa,b,c,d (IRQ1)
             Note: Pchip0_SLOT0 = PCI/ISA Cypress/Acer Bridge
  <7:0>      Unused

P0 & 1_ERROR
  <63:56>    ECC Syndrome of CRE or UECC error - Same as EV68
  <55:52>    When CRE or UECC, failing transaction: 0000(Bin) = DMA Read; 0001(Bin) = DMA RMW; 0011(Bin) = S/G Read.
             PCI command of transaction when error not CRE or UECC: 0000(Bin) = PCI IACK Cycle; 0001(Bin) = PCI Special Cycle; 0010(Bin) = PCI I/O Read; 0011(Bin) = PCI I/O Write; 0100(Bin) = Reserved; 0101(Bin) = PCI PTP Write; 0110(Bin) = PCI Memory Read; 0111(Bin) = PCI Memory Write from CPUx; 1000(Bin) = PCI CSR Read
  <51>       If clear = valid <63:56>, <55:52>, and <50:16> error information if any <11:0> bits are set, otherwise invalid
  <50:16>    If <11> or <10> = set and <51> = clear, <50:19> = System address <34:3> of erred quadword and <18:16> = 000(Bin); else if any one of <9:0> = set and <51> = clear, <50:48> = 000(Bin), <47:18> = starting PCI address <31:2> of erred transaction, <17:16> = 00(Bin) if not DAC; 01(Bin) if DAC SG Windows 3; 1x(Bin) if Monster Window
  <15:12>    MBZ, RAZ
  <11>       Set = Correctable ECC Error (M or T [2])
  <10>       Set = Uncorrectable ECC Error (M or T)
  <9>        Reserved, MBZ/RAZ
  <8>        Set = No device select as PCI (M) error
  <7>        Set = PCI read data parity error as PCI (M)
  <6>        Set = Target abort error detected as PCI (M)
  <5>        Set = Address parity error detected as potential PCI
  <4>        Set = Invalid S/G page table entry detected as PCI
  <3>        Set = Delayed completion retry time-out error as PCI
  <2>        Set = PERR# error as PCI (M)
  <1>        Set = SERR# error as PCI (M or T)
  <0>        Set = Error occurred / lost after this register locked

[2] M refers to PCI Master; T refers to PCI Target.

SMIR (Environ_QW_1)
  <7>        Inverted Sys_Rst = System is being reset
  <6>        Inverted PCI_Rst1 = PCI Bus #1 is in reset
  <5>        Inverted PCI_Rst0 = PCI Bus #0 is in reset
  <4>        Set = System temperature over 50 degrees C failure
  <3>        Unused
  <2>        Set = Sys_DC_Notok failure detected
  <1>        Inverted OCP_RMC_Halt = OCP or RMC halt detected
  <0>        Set = System Power Supply failure detected

CPUIR (Environ_QW_2)
  <7>        Set = CPU3 regulator or configuration sequence fail
  <6>        Set = CPU2 regulator or configuration sequence fail
  <5>        Set = CPU1 regulator or configuration sequence fail
  <4>        Set = CPU0 regulator or configuration sequence fail
  <3>        Set = CPU3 regulator is enabled
  <2>        Set = CPU2 regulator is enabled
  <1>        Set = CPU1 regulator is enabled
  <0>        Set = CPU0 regulator is enabled

PSIR (Environ_QW_3)
  <7>        Not Used
  <6>        Set = Power Supply 2 failed and was enabled
  <5>        Set = Power Supply 1 failed and was enabled
  <4>        Set = Power Supply 0 failed and was enabled
  <3>        Not Used
  <2>        Set = Power Supply 2 is enabled
  <1>        Set = Power Supply 1 is enabled
  <0>        Set = Power Supply 0 is enabled

System_PS/Temp/Fan_Fault_LM78_ISR (Environ_QW_4)
  <0>        Set = PS +3.3V out of tolerance
  <1>        Set = PS +5V out of tolerance
  <2>        Set = PS +12V out of tolerance
  <3>        Set = VTERM out of tolerance
  <4>        Set = Temperature zone 0 (PCI Backplane slots 1-3 area) over limit failure
  <5>        Set = LM75 CPU0-3 Temperature over limit failure (OLF)
  <6>        Set = System Fan 1 failure
  <7>        Set = System Fan 2 failure
  <8>        Set = CTERM out of tolerance
  <9>        Unused
  <10>       Set = -12V out of tolerance
  <15:11>    Unused
  <16>       Set = CPU0_VCORE +2V out of tolerance
  <17>       Set = CPU0_VIO +1.5V out of tolerance
  <18>       Set = CPU1_VCORE +2V out of tolerance
  <19>       Set = CPU1_VIO +1.5V out of tolerance
  <20>       Set = Temperature zone 1 (PCI Backplane slots 7-10) (OLF)
  <21>       Unused
  <22>       Set = System Fan 4 failure
  <23>       Set = System Fan 5 failure
  <31:24>    Unused
  <32>       Set = CPU2_VCORE +2V out of tolerance
  <33>       Set = CPU2_VIO +1.5V out of tolerance
  <34>       Set = CPU3_VCORE +2V out of tolerance
  <35>       Set = CPU3_VIO +1.5V out of tolerance
  <36>       Set = Temperature zone 2 (PCI Backplane slots 4-6) (OLF)
  <37>       Unused
  <38>       Set = System Fan 3 failure
  <39>       Set = System Fan 6 failure
  <41:40>    Power supply that has caused the <42:47> warning condition: 00(Bin) = Power supply 0; 01(Bin) = Power supply 1; 10(Bin) = Power supply 2; 11(Bin) = Reserved
  <42>       Set = Power supply 3.3V rail above high amperage warning
  <43>       Set = Power supply 5.0V rail above high amperage warning
  <44>       Set = Power supply 12V rail above high amperage warning
  <45>       Set = Power supply high temperature warning
  <46>       Set = Power supply AC input low limit warning
  <47>       Set = Power supply AC input high limit warning
  <63:48>    Unused

System_Doors (Environ_QW_5)
  <0>        Unused
  <1>        Set = System CPU door is open
  <2>        Set = System Fan door is open
  <3>        Set = System PCI door is open
  <4>        Unused
  <5>        Set = System CPU door is closed
  <6>        Set = System Fan door is closed
  <7>        Set = System PCI door is closed
  <63:8>     Unused

System_Temperature_Warning (Environ_QW_6)
  <0>        Set = CPU0 temperature warning fault has occurred
  <1>        Set = CPU1 temperature warning fault has occurred
  <2>        Set = CPU2 temperature warning fault has occurred
  <3>        Set = CPU3 temperature warning fault has occurred
  <4>        Set = System temperature zone 0 warning fault has occurred
  <5>        Set = System temperature zone 1 warning fault has occurred
  <6>        Set = System temperature zone 2 warning fault has occurred
  <63:7>     Unused

System_Fan_Control_Fault (Environ_QW_7)
  <0>        Set = System Fan 1 is not responding to RMC Commands
  <1>        Set = System Fan 2 is not responding to RMC Commands
  <2>        Set = System Fan 3 is not responding to RMC Commands
  <3>        Set = System Fan 4 is not responding to RMC Commands
  <4>        Set = System Fan 5 is not responding to RMC Commands
  <5>        Set = System Fan 6 is not responding to RMC Commands
  <7:6>      Unused
  <8>        Set = CPU fans 5/6 at maximum speed
  <9>        Set = CPU fans 5/6 reduced speed from maximum
  <10>       Set = PCI fans 1-4 at maximum speed
  <11>       Set = PCI fans 1-4 reduced speed from maximum

Fatal_Power_Down_Codes (Environ_QW_8)
  <0>        Set = Power Supply 0 AC input fail
  <1>        Set = Power Supply 1 AC input fail
  <2>        Set = Power Supply 2 AC input fail
  <3:7>      Unused
  <8>        Set = Power Supply 0 DC fail
  <9>        Set = Power Supply 1 DC fail
  <10>       Set = Power Supply 2 DC fail
  <11>       Set = Vterm fail
  <12>       Set = CPU0 Regulator fail
  <13>       Set = CPU1 Regulator fail
  <14>       Set = CPU2 Regulator fail
  <15>       Set = CPU3 Regulator fail
  <16>       Unused
  <17>       Set = No CPU in system motherboard CPU slot 0
  <18>       Set = Invalid CPU SROM voltage setting or checksum
  <19>       Set = TIG load initialization or sequence fail
  <20>       Set = Overtemperature fail
  <21>       Set = CPU door open fail
  <22>       Set = System fan 5 (CPU backup fan) fail
  <23>       Set = Cterm fail
  <63:24>    Unused

Appendix E Isolating Failing DIMMs

This appendix explains how to manually isolate a failing DIMM from the failing address and failing data bits. It also covers how to isolate single-bit errors. The following topics are covered:

• Information for Isolating Failures
• DIMM Isolation Procedure
• EV68 Single-Bit Errors

E.1 Information for Isolating Failures

Table E–1 lists the information needed to isolate the failure. See Appendix D for the register table for the Array Address Registers (AARs).

The failing address and failing data can come from several locations, such as the SROM serial line, SRM screen displays, the SRM event log, and errors detected by the 21264 (EV68) chip.

Convert the address to data bits if the address is not on a 256-bit alignment (for example, an address of the form xxxxxnn, where nn is 01 through 1F hex). For example, using failing address 0x1004 and failing data bit 8(dec), first multiply the low-order part of the failing address (4) by 8 to get 32. Then add 32 to the failing data bit to yield the actual failing data bit, 40. After this conversion, the failing information is failing address 0x1000 and failing data bit = 40(dec).

Table E–1 Information Needed to Isolate Failing DIMMs

Failing Address
Failing Data/Check bits

Array Address Registers   Memory Addresses
CSC                       801.A000.0000
AAR0                      801.A000.0100
AAR1                      801.A000.0140
AAR2                      801.A000.0180
AAR3                      801.A000.01C0

DPR Locations             Memory Addresses
DPR:80                    801.1000.2000
DPR:82                    801.1000.2080
DPR:84                    801.1000.2100
DPR:86                    801.1000.2180

E.2 DIMM Isolation Procedure

Use the procedure in this section to isolate the failing DIMM.

1. Find the failing array by using the failing address and the Array Address Registers (AARs; see Appendix D). Use the AAR base address and size to create an address range for comparing the failing address. For example, if the AAR1 base address was 40000000 (1 GB) and its size was 10000000 (256 MB), the address range would be 40000000–4FFFFFFF (1–1.25 GB). This range would be used to compare against the failing address.

2. Determine if the Address XORing is enabled.
   • If Address XORing is enabled, use Table E–2 to find the real array on which the failure occurred for 4-way interleaving, or Table E–3 for 2-way interleaving.
   • If Bit 51 of the CSC register is set to 1, XORing is disabled.
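The address conversion in E.1 and steps 1 and 2 above can be sketched as follows. This is an illustration only, not part of the guide's procedure: the function names and the dictionary of AAR base/size pairs are hypothetical, and the values must still be read from the actual registers. The 4-way helper relies on an observation about Table E–2, namely that each entry equals the original array number XORed with failing address bits <8:7>; the guide does not state this rule itself, so check results against the table.

```python
BLOCK_BYTES = 0x20  # 256 bits = 32 bytes

def normalize_failure(address: int, data_bit: int) -> tuple:
    """Fold the offset within a 256-bit block into the failing data bit."""
    offset = address % BLOCK_BYTES
    return address - offset, data_bit + 8 * offset

def find_array(address: int, aars: dict):
    """Return the array whose [base, base + size) range holds the address.

    aars maps an array number to a (base, size) pair taken from the AARs.
    """
    for array, (base, size) in aars.items():
        if base <= address < base + size:
            return array
    return None

def real_array_4way(original_array: int, address: int) -> int:
    """Apply the Table E-2 mapping (only meaningful when XORing is enabled)."""
    return original_array ^ ((address >> 7) & 0x3)

# Worked example from E.1: address 0x1004, data bit 8.
aligned_address, actual_bit = normalize_failure(0x1004, 8)

# Worked example from step 1: AAR1 base 40000000, size 10000000.
array = find_array(0x41234560, {1: (0x40000000, 0x10000000)})
```

With the E.1 example values, `normalize_failure` yields address 0x1000 and data bit 40, matching the text. For 2-way interleaving, consult Table E–3 directly.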
Table E–2 Determining the Real Failed Array for 4-Way Interleaving

Failing Address <8:7>   Original Array 0   Original Array 1   Original Array 2   Original Array 3
00                      Real Array 0       Real Array 1       Real Array 2       Real Array 3
01                      Real Array 1       Real Array 0       Real Array 3       Real Array 2
10                      Real Array 2       Real Array 3       Real Array 0       Real Array 1
11                      Real Array 3       Real Array 2       Real Array 1       Real Array 0

Table E–3 Determining the Real Failed Array for 2-Way Interleaving

Failing Address <8>     Original Array 0   Original Array 1   Original Array 2   Original Array 3
0                       Real Array 0       Real Array 1       Real Array 2       Real Array 3
1                       Real Array 2       Real Array 3       Real Array 0       Real Array 1

3. After finding the real array, determine whether it is the lower array set or the upper array set. Use DPR locations 80, 82, 84, and 86 listed in Table E–1. Table E–4 shows the description of these locations.

Table E–4 Description of DPR Locations 80, 82, 84, and 86

DPR Location   Description
80             Array 0 (AAR 0) configuration
82             Array 1 (AAR 1) configuration
84             Array 2 (AAR 2) configuration
86             Array 3 (AAR 3) configuration

Each location is encoded as follows:

Bits<7:4>
  4 = non split, lower set only
  5 = split, lower set only
  9 = split, upper set only
  D = split, 8 DIMMs
  F = twice split, 8 DIMMs

Bits<3:0>
  0 = Configured, lowest array
  1 = Configured, next lowest array
  2 = Configured, second highest array
  3 = Configured, highest array
  4 = Misconfigured, missing DIMM(s)
  8 = Misconfigured, illegal DIMM(s)
  C = Misconfigured, incompatible DIMM(s)

4. Use the following table to determine the proper set. Bits<27,28,29,30,31,32> are from the failing address.
             Configuration Type (Bits <7:4> from DPR)
Array Size   4 & 5        9            D & F
256MB        Lower Set    Upper Set    Bit <27>: 0 = Lower Set, 1 = Upper Set
512MB        Lower Set    Upper Set    Bit <28>: 0 = Lower Set, 1 = Upper Set
1GB          Lower Set    Upper Set    Bit <29>: 0 = Lower Set, 1 = Upper Set
2GB          Lower Set    Upper Set    Bit <30>: 0 = Lower Set, 1 = Upper Set
4GB          Lower Set    Upper Set    Bit <31>: 0 = Lower Set, 1 = Upper Set
8GB          Lower Set    Upper Set    Bit <32>: 0 = Lower Set, 1 = Upper Set

5. Now that you have the real array, the failing Data/Check bits, and the correct set, use Table E–5 to find the failing DIMM or DIMMs. The table shows data bits 0–255 and check bits 0–31. These data bits indicate a single-bit error. An SROM compare error would yield address and data bits from 0–63. After you convert the address to the correct range, the failing data bit will be somewhere between 0 and 255.

Table E–5 Failing DIMM Lookup Table

Data bits listed as a range share the same entries.

               Array 0         Array 1         Array 2         Array 3
             Lower  Upper    Lower  Upper    Lower  Upper    Lower  Upper
Data Bits    MMB J# MMB J#   MMB J# MMB J#   MMB J# MMB J#   MMB J# MMB J#
0–7           0  5   0  4     1  5   1  4     0  2   0  3     1  2   1  3
8–15          2  5   2  4     3  5   3  4     2  2   2  3     3  2   3  3
16–23         0  9   0  8     1  9   1  8     0  6   0  7     1  6   1  7
24–31         2  9   2  8     3  9   3  8     2  6   2  7     3  6   3  7
32–39         0  5   0  4     1  5   1  4     0  2   0  3     1  2   1  3
40–47         2  5   2  4     3  5   3  4     2  2   2  3     3  2   3  3
48–55         0  9   0  8     1  9   1  8     0  6   0  7     1  6   1  7
56–63         2  9   2  8     3  9   3  8     2  6   2  7     3  6   3  7
64–65         0  5   0  4     1  5   1  4     0  2   0  3     1  2   1  3
Table E–5 Failing DIMM Lookup Table (Continued)

Data bits listed as a range share the same entries.

               Array 0         Array 1         Array 2         Array 3
             Lower  Upper    Lower  Upper    Lower  Upper    Lower  Upper
Data Bits    MMB J# MMB J#   MMB J# MMB J#   MMB J# MMB J#   MMB J# MMB J#
66–71         0  5   0  4     1  5   1  4     0  2   0  3     1  2   1  3
72–79         2  5   2  4     3  5   3  4     2  2   2  3     3  2   3  3
80–87         0  9   0  8     1  9   1  8     0  6   0  7     1  6   1  7
88–95         2  9   2  8     3  9   3  8     2  6   2  7     3  6   3  7
96–103        0  5   0  4     1  5   1  4     0  2   0  3     1  2   1  3
104–111       2  5   2  4     3  5   3  4     2  2   2  3     3  2   3  3
112–119       0  9   0  8     1  9   1  8     0  6   0  7     1  6   1  7
120–127       2  9   2  8     3  9   3  8     2  6   2  7     3  6   3  7
128–135       0  5   0  4     1  5   1  4     0  2   0  3     1  2   1  3
136–143       2  5   2  4     3  5   3  4     2  2   2  3     3  2   3  3
144–145       0  9   0  8     1  9   1  8     0  6   0  7     1  6   1  7
Table E–5 Failing DIMM Lookup Table (Continued)

Data bits listed as a range share the same entries.

               Array 0         Array 1         Array 2         Array 3
             Lower  Upper    Lower  Upper    Lower  Upper    Lower  Upper
Data Bits    MMB J# MMB J#   MMB J# MMB J#   MMB J# MMB J#   MMB J# MMB J#
146–151       0  9   0  8     1  9   1  8     0  6   0  7     1  6   1  7
152–159       2  9   2  8     3  9   3  8     2  6   2  7     3  6   3  7
160–167       0  5   0  4     1  5   1  4     0  2   0  3     1  2   1  3
168–175       2  5   2  4     3  5   3  4     2  2   2  3     3  2   3  3
176–183       0  9   0  8     1  9   1  8     0  6   0  7     1  6   1  7
184–191       2  9   2  8     3  9   3  8     2  6   2  7     3  6   3  7
192–199       0  5   0  4     1  5   1  4     0  2   0  3     1  2   1  3
200–207       2  5   2  4     3  5   3  4     2  2   2  3     3  2   3  3
208–215       0  9   0  8     1  9   1  8     0  6   0  7     1  6   1  7
216–223       2  9   2  8     3  9   3  8     2  6   2  7     3  6   3  7
224–231       0  5   0  4     1  5   1  4     0  2   0  3     1  2   1  3
232–239       2  5   2  4     3  5   3  4     2  2   2  3     3  2   3  3
240–247       0  9   0  8     1  9   1  8     0  6   0  7     1  6   1  7
248–255       2  9   2  8     3  9   3  8     2  6   2  7     3  6   3  7

Table E–5 Failing DIMM Lookup Table (Continued)

               Array 0         Array 1         Array 2         Array 3
             Lower  Upper    Lower  Upper    Lower  Upper    Lower  Upper
Check Bits   MMB J# MMB J#   MMB J# MMB J#   MMB J# MMB J#   MMB J# MMB J#
0             0  5   0  4     1  5   1  4     0  2   0  3     1  2   1  3
1             2  5   2  4     3  5   3  4     2  2   2  3     3  2   3  3
2             0  9   0  8     1  9   1  8     0  6   0  7     1  6   1  7
3             2  9   2  8     3  9   3  8     2  6   2  7     3  6   3  7
4             0  5   0  4     1  5   1  4     0  2   0  3     1  2   1  3
5             2  5   2  4     3  5   3  4     2  2   2  3     3  2   3  3
6             0  9   0  8     1  9   1  8     0  6   0  7     1  6   1  7
7             2  9   2  8     3  9   3  8     2  6   2  7     3  6   3  7
8             0  5   0  4     1  5   1  4     0  2   0  3     1  2   1  3
9             2  5   2  4     3  5   3  4     2  2   2  3     3  2   3  3
10            0  9   0  8     1  9   1  8     0  6   0  7     1  6   1  7
11            2  9   2  8     3  9   3  8     2  6   2  7     3  6   3  7
12            0  5   0  4     1  5   1  4     0  2   0  3     1  2   1  3
13            2  5   2  4     3  5   3  4     2  2   2  3     3  2   3  3
14            0  9   0  8     1  9   1  8     0  6   0  7     1  6   1  7
15            2  9   2  8     3  9   3  8     2  6   2  7     3  6   3  7
16            0  5   0  4     1  5   1  4     0  2   0  3     1  2   1  3
17            2  5   2  4     3  5   3  4     2  2   2  3
3 2 3 3 18 0 9 0 8 1 9 1 8 0 6 0 7 1 6 1 7 19 2 9 2 8 3 9 3 8 2 6 2 7 3 6 3 7 20 0 5 0 4 1 5 1 4 0 2 0 3 1 2 1 3 21 2 5 2 4 3 5 3 4 2 2 2 3 3 2 3 3 22 0 9 0 8 1 9 1 8 0 6 0 7 1 6 1 7 23 2 9 2 8 3 9 3 8 2 6 2 7 3 6 3 7 Continued on next page Isolating Failing DIMMs E-17 Table E–5 Failing DIMM Lookup Table (Continued) Array 0 Check Bits Lower Set Array 1 Upper Set Lower Set Array 2 Upper Set Lower Set Array 3 Upper Set Lower Set Upper Set M M B J # M M B J # M M B J # M M B J # M M B J # M M B J # M M B J # M M B J # 24 0 5 0 4 1 5 1 4 0 2 0 3 1 2 1 3 25 2 5 2 4 3 5 3 4 2 2 2 3 3 2 3 3 26 0 9 0 8 1 9 1 8 0 6 0 7 1 6 1 7 27 2 9 2 8 3 9 3 8 2 6 2 7 3 6 3 7 28 0 5 0 4 1 5 1 4 0 2 0 3 1 2 1 3 29 2 5 2 4 3 5 3 4 2 2 2 3 3 2 3 3 30 0 9 0 8 1 9 1 8 0 6 0 7 1 6 1 7 31 2 9 2 8 3 9 3 8 2 6 2 7 3 6 3 7 E-18 ES45 Service Guide E.3 EV68 Single-Bit Errors The procedure for detection down to the set of DIMMs for a single-bit error is very similar to the procedure described in the previous sections. However, you cannot isolate down to a specific data or check bit. The 21264 (EV68) chip detects and reports a C_ADDR<42:6> failing address that is accurate to the cache block (64 bytes). The syndrome registers (Table E– 6) detect data syndrome information, providing isolation down to the low or high quadword of the target octaword that the fault has been detected within. Each of the syndrome registers is able to report 64 data bits (the quadword) and 8 check bits (memory data bus ECC bits). Table E–6 shows the syndrome hexadecimal to physical data or check bit decoding. For example, if you have an EV68 single-bit C_Syndrome_0 hexadecimal error value equal to 23, the second column indicates the decoded physical data or check bit for this encoding. Use these physical data bits in conjunction with the previously described isolation procedure to isolate the failing DIMMs. 
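The Table E–6 lookup can also be expressed as a short script. The following is only a sketch and is not part of the ES45 firmware or any Compaq tool: the names `DATA_SYNDROMES`, `CHECK_SYNDROMES`, and `decode_syndrome` are hypothetical, and the code assumes that the syndrome codes map to bit numbers in the order in which Table E–6 lists them.

```python
# Syndrome codes in the order Table E-6 lists them (assumed ordering).
# Position i in DATA_SYNDROMES is the i-th data-bit row (0-63);
# position i in CHECK_SYNDROMES is the i-th check-bit row (0-7).
DATA_SYNDROMES = [int(h, 16) for h in (
    "CE CB D3 D5 D6 D9 DA DC 23 25 26 29 2A 2C 31 34 "
    "0E 0B 13 15 16 19 1A 1C E3 E5 E6 E9 EA EC F1 F4 "
    "4F 4A 52 54 57 58 5B 5D A2 A4 A7 A8 AB AD B0 B5 "
    "8F 8A 92 94 97 98 9B 9D 62 64 67 68 6B 6D 70 75"
).split()]
CHECK_SYNDROMES = [0x01, 0x02, 0x04, 0x08, 0x10, 0x20, 0x40, 0x80]


def decode_syndrome(value, register):
    """Return (kind, bit_a, bit_b) for a single-bit syndrome, or None.

    register is 0 for C_Syndrome 0 and 1 for C_Syndrome 1 (hypothetical
    parameter naming). bit_a and bit_b are the two "X or Y" candidates
    that Table E-6 gives for the syndrome value.
    """
    if register not in (0, 1):
        raise ValueError("register must be 0 or 1")
    if value in DATA_SYNDROMES:
        # C_Syndrome 1 covers the data bits 64 above those of C_Syndrome 0.
        low = DATA_SYNDROMES.index(value) + 64 * register
        return ("data", low, low + 128)
    if value in CHECK_SYNDROMES:
        # C_Syndrome 1 covers the check bits 8 above those of C_Syndrome 0.
        low = CHECK_SYNDROMES.index(value) + 8 * register
        return ("check", low, low + 16)
    return None  # value is not a single-bit syndrome listed in Table E-6


print(decode_syndrome(0x23, 0))  # ('data', 8, 136)
```

Both candidate bit numbers then have to be checked against Table E–5, because the syndrome alone does not say which of the two is failing.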
Table E–6 Syndrome to Data Check Bits Table

Syndrome  C_Syndrome 0          C_Syndrome 1
CE        Data Bit 0 or 128     Data Bit 64 or 192
CB        Data Bit 1 or 129     Data Bit 65 or 193
D3        Data Bit 2 or 130     Data Bit 66 or 194
D5        Data Bit 3 or 131     Data Bit 67 or 195
D6        Data Bit 4 or 132     Data Bit 68 or 196
D9        Data Bit 5 or 133     Data Bit 69 or 197
DA        Data Bit 6 or 134     Data Bit 70 or 198
DC        Data Bit 7 or 135     Data Bit 71 or 199
23        Data Bit 8 or 136     Data Bit 72 or 200
25        Data Bit 9 or 137     Data Bit 73 or 201
26        Data Bit 10 or 138    Data Bit 74 or 202
29        Data Bit 11 or 139    Data Bit 75 or 203
2A        Data Bit 12 or 140    Data Bit 76 or 204
2C        Data Bit 13 or 141    Data Bit 77 or 205
31        Data Bit 14 or 142    Data Bit 78 or 206
34        Data Bit 15 or 143    Data Bit 79 or 207
0E        Data Bit 16 or 144    Data Bit 80 or 208
0B        Data Bit 17 or 145    Data Bit 81 or 209
13        Data Bit 18 or 146    Data Bit 82 or 210
15        Data Bit 19 or 147    Data Bit 83 or 211
16        Data Bit 20 or 148    Data Bit 84 or 212
19        Data Bit 21 or 149    Data Bit 85 or 213
1A        Data Bit 22 or 150    Data Bit 86 or 214
1C        Data Bit 23 or 151    Data Bit 87 or 215
E3        Data Bit 24 or 152    Data Bit 88 or 216
E5        Data Bit 25 or 153    Data Bit 89 or 217
E6        Data Bit 26 or 154    Data Bit 90 or 218
E9        Data Bit 27 or 155    Data Bit 91 or 219
EA        Data Bit 28 or 156    Data Bit 92 or 220
EC        Data Bit 29 or 157    Data Bit 93 or 221
F1        Data Bit 30 or 158    Data Bit 94 or 222
F4        Data Bit 31 or 159    Data Bit 95 or 223
4F        Data Bit 32 or 160    Data Bit 96 or 224
4A        Data Bit 33 or 161    Data Bit 97 or 225
52        Data Bit 34 or 162    Data Bit 98 or 226
54        Data Bit 35 or 163    Data Bit 99 or 227
57        Data Bit 36 or 164    Data Bit 100 or 228
58        Data Bit 37 or 165    Data Bit 101 or 229
5B        Data Bit 38 or 166    Data Bit 102 or 230
5D        Data Bit 39 or 167    Data Bit 103 or 231
A2        Data Bit 40 or 168    Data Bit 104 or 232
A4        Data Bit 41 or 169    Data Bit 105 or 233
A7        Data Bit 42 or 170    Data Bit 106 or 234
A8        Data Bit 43 or 171    Data Bit 107 or 235
AB        Data Bit 44 or 172    Data Bit 108 or 236
AD        Data Bit 45 or 173    Data Bit 109 or 237
B0        Data Bit 46 or 174    Data Bit 110 or 238
B5        Data Bit 47 or 175    Data Bit 111 or 239
8F        Data Bit 48 or 176    Data Bit 112 or 240
8A        Data Bit 49 or 177    Data Bit 113 or 241
92        Data Bit 50 or 178    Data Bit 114 or 242
94        Data Bit 51 or 179    Data Bit 115 or 243
97        Data Bit 52 or 180    Data Bit 116 or 244
98        Data Bit 53 or 181    Data Bit 117 or 245
9B        Data Bit 54 or 182    Data Bit 118 or 246
9D        Data Bit 55 or 183    Data Bit 119 or 247
62        Data Bit 56 or 184    Data Bit 120 or 248
64        Data Bit 57 or 185    Data Bit 121 or 249
67        Data Bit 58 or 186    Data Bit 122 or 250
68        Data Bit 59 or 187    Data Bit 123 or 251
6B        Data Bit 60 or 188    Data Bit 124 or 252
6D        Data Bit 61 or 189    Data Bit 125 or 253
70        Data Bit 62 or 190    Data Bit 126 or 254
75        Data Bit 63 or 191    Data Bit 127 or 255
01        Check Bit 0 or 16     Check Bit 8 or 24
02        Check Bit 1 or 17     Check Bit 9 or 25
04        Check Bit 2 or 18     Check Bit 10 or 26
08        Check Bit 3 or 19     Check Bit 11 or 27
10        Check Bit 4 or 20     Check Bit 12 or 28
20        Check Bit 5 or 21     Check Bit 13 or 29
40        Check Bit 6 or 22     Check Bit 14 or 30
80        Check Bit 7 or 23     Check Bit 15 or 31