EK-VXFTA-SI-A01
June 1993
236 pages
Document: VAXft Systems Model 810 Service Information
Order Number: EK-VXFTA-SI
Revision: A01
Pages: 236
Original Filename: vxftasia.pdf
OCR Text
VAXft Systems Model 810 Service Information

Order Number: EK-VXFTA-SI.A01

June 1993

This manual is intended for use by trained personnel responsible for maintaining VAXft Model 810 systems.

Digital Equipment Corporation

June 1993

The information in this document is subject to change without notice and should not be construed as a commitment by Digital Equipment Corporation. Digital Equipment Corporation assumes no responsibility for any errors that may appear in this document.

No responsibility is assumed for the use or reliability of software on equipment that is not supplied by Digital Equipment Corporation or its affiliated companies.

Restricted Rights: Use, duplication, or disclosure by the U.S. Government is subject to restrictions as set forth in subparagraph (c)(1)(ii) of the Rights in Technical Data and Computer Software clause at DFARS 252.227-7013.

© Digital Equipment Corporation June 1993. All Rights Reserved. Printed in Canada.

The following are trademarks of Digital Equipment Corporation: CompacTape, OpenVMS, ThinWire, TK, UETP, VAX, VAXft, VMS, VAXELN, and the DIGITAL logo.

FCC NOTICE: This equipment generates, uses, and may emit radio frequency energy. It has been tested and found to comply with the limits for a Class A computing device pursuant to Subpart J of Part 15 of FCC rules of operation in a commercial environment. This equipment, when operated in a residential area, may cause interference to radio/TV communications. In such event the user (owner), at his own expense, may be required to take corrective measures.

This document is available on CDROM.
[Figure: Documentation Map (MR-6230-RA). Diagram of the VAXft documentation set. Headings in the diagram include Hardware Information (VAXft Systems), Overview Information (VAXft Systems), Operating System (VMS), and Software Information (VAXft System Services). Legend symbols: Book, Tape, Bookreader, Online, Letter; * = Order Separately.]

Contents

1 Cabinet and Component Descriptions
1.1 In This Chapter . . . 1–1
1.2 CPU and Expansion Cabinets . . . 1–1
1.3 Zone Control Panel . . . 1–6
1.4 Power Modules . . . 1–8
1.5 Domestic and International Power Distribution Boxes . . . 1–9

2 Console Operations
2.1 In This Chapter . . . 2–1
2.2 Console Description . . . 2–1
2.3 Console Operating Modes . . . 2–3
2.3.1 Entering CIO Mode . . . 2–4
2.3.2 Exiting CIO Mode . . . 2–4
2.4 Console Control Characters . . . 2–5
2.5 Console Command Language Syntax . . . 2–6
2.6 Bootstrap Procedures . . . 2–7
2.7 Entering CIO Mode . . . 2–8
2.8 CIO Mode Console Commands . . . 2–9
2.8.1 BOOT . . . 2–9
2.8.2 CLEAR . . . 2–10
2.8.3 CONTINUE . . . 2–11
2.8.4 DEPOSIT . . . 2–11
2.8.5 DUP . . . 2–13
2.8.6 EXAMINE . . . 2–13
2.8.7 FIND . . . 2–15
2.8.8 HELP . . . 2–15
2.8.9 INITIALIZE . . . 2–16
2.8.10 MOVE . . . 2–16
2.8.11 MATCH_ZONES . . . 2–16
2.8.12 REPEAT . . . 2–17
2.8.13 SET . . . 2–17
2.8.13.1 SET BOOT . . . 2–18
2.8.14 SHOW . . . 2–18
2.8.15 START . . . 2–19
2.8.16 TEST . . . 2–20
2.8.17 X(transfer) . . . 2–21
2.8.18 Z . . . 2–22
2.8.19 !(comment) . . . 2–22

3 System Maintenance
3.1 In This Chapter . . . 3–1
3.2 Maintenance Strategy . . . 3–1
3.3 Operating Rules and Cautions . . . 3–2
3.4 General Troubleshooting Procedure . . . 3–4
3.5 Module Fault LEDs . . . 3–6
3.6 Power System Overview . . . 3–7
3.7 Power System Maintenance . . . 3–12
3.8 Device Status and Fault Indicators . . . 3–19
3.8.1 RF35 Disk Drawer . . . 3–19
3.8.2 SF35 Storage Array . . . 3–21
3.8.3 SF73 Storage Array . . . 3–24
3.8.4 TF85C Tape Drive . . . 3–26
3.8.5 TF857 Tape Loader . . . 3–27
3.8.5.1 Power-On Process . . . 3–27
3.8.5.2 Operator Control Panel Controls and Indicators . . . 3–27
3.9 ROM-Based Diagnostics . . . 3–29
3.9.1 TEST . . . 3–30
3.9.2 Z . . . 3–31
3.9.3 CPU ROM-Based Diagnostics . . . 3–31
3.9.4 I/O ROM-Based Diagnostics . . . 3–34

4 Error Handling and Analysis
4.1 In This Chapter . . . 4–1
4.2 Error Handling Services Overview . . . 4–1
4.2.1 Basic Error Isolation and Handling . . . 4–2
4.2.2 EHS Structure . . . 4–3
4.2.3 System Operating Modes . . . 4–4
4.2.4 Error Types . . . 4–5
4.2.5 VAXELN Error Handling . . . 4–10
4.3 Field Replaceable Units (FRUs) . . . 4–12
4.3.1 Isolation . . . 4–12
4.3.2 Deconfiguration . . . 4–13
4.3.2.1 I/O Attachment Module . . . 4–13
4.3.2.2 CPU Module and Memory . . . 4–14
4.3.2.3 I/O Expansion Module . . . 4–14
4.3.2.4 Interface Module . . . 4–15
4.3.2.5 Zone . . . 4–16
4.3.2.6 Cross-Link Cable . . . 4–16
4.3.3 Application of Thresholds . . . 4–17
4.4 OpenVMS Error Log . . . 4–19
4.4.1 Fault Summary . . . 4–20
4.4.2 FRU Information . . . 4–22
4.4.3 Deconfiguration Information . . . 4–24
4.4.4 Threshold Information . . . 4–26
4.4.5 Fault Data . . . 4–27
4.4.5.1 System Registers . . . 4–27
4.4.5.2 End Actions . . . 4–28
4.4.5.3 End Action Timeouts . . . 4–29
4.4.5.4 VAXELN Detected Errors . . . 4–30
4.4.5.5 Software Detected Errors . . . 4–34
4.4.5.6 Unsynchable Events . . . 4–36
4.5 Module NVRAM Status and LED Indicators . . . 4–38
4.6 FTSS Event Reporting Interface . . . 4–40
4.6.1 Event Reporting Interface Routines . . . 4–40
4.6.2 Error Event Messages . . . 4–40
4.6.2.1 Deconfiguration Messages . . . 4–49
4.7 Firmware Interfaces . . . 4–50
4.7.1 System Console and Diagnostics . . . 4–50
4.7.1.1 System Resets . . . 4–51
4.7.1.2 CCA Fields . . . 4–53
4.7.2 I/O Expansion Module Console and Diagnostics . . . 4–53
4.8 Firmware and OpenVMS Interface Data Structures . . . 4–54
4.8.1 Console Communications Area . . . 4–55
4.8.1.1 Duplex Compatibility Test . . . 4–57
4.8.1.2 Dispatch Block Description . . . 4–59
4.8.1.3 Boot Parameter Block Description . . . 4–60
4.8.2 Device Configuration Block . . . 4–61
4.8.2.1 Sub-Device Configuration Blocks . . . 4–63
4.8.2.2 CPU Module SubDCB . . . 4–64
4.8.3 Page Frame Number Bitmap . . . 4–65
4.9 Error Log Analysis . . . 4–66
4.9.1 CPU/MEM Fault Error Log Entry . . . 4–66
4.9.2 CPU/MEM Fault End Action Error Log Entry . . . 4–69
4.9.3 CPU or Zone Unsynchable Error Log Entry . . . 4–72

5 FRU Removal and Replacement Procedures
5.1 In This Chapter . . . 5–1
5.2 Field Replaceable Unit List . . . 5–1
5.3 Before You Begin . . . 5–3
5.3.1 Handling FRUs . . . 5–4
5.3.2 Shutting Down a Zone . . . 5–4
5.3.3 Verifying Zone Shutdown . . . 5–5
5.3.4 Starting Up a Zone . . . 5–5
5.3.5 Accessing the FRUs . . . 5–5
5.4 FRU Removal and Replacement . . . 5–6
5.4.1 CPU and ATM Modules . . . 5–7
5.4.2 SIMMs . . . 5–8
5.4.3 MMBs . . . 5–9
5.4.4 Fan and FCSB . . . 5–10
5.4.5 RF35 Disk Drive Removal and Replacement . . . 5–12
5.4.6 DSSI Disk Drawer . . . 5–14
5.4.7 Zone Control Panel . . . 5–14
5.4.8 FEU, 3.3V Regulator, 5V Regulator, PSC Modules . . . 5–16
5.4.9 Cross-Link Assembly . . . 5–18
5.4.10 Console Extender Module . . . 5–20
5.4.11 DSSI Extender Module . . . 5–22
5.4.12 CAMP Module . . . 5–24
5.4.13 DSSI Interface Module (DIM) . . . 5–26
5.4.14 Ethernet Interface Module (EIM) . . . 5–28
5.4.15 DSSI Cable Removal and Replacement . . . 5–29
5.4.16 TF85C-BA Tape Drive . . . 5–30
5.4.17 SF73 Disk Drive . . . 5–32
5.4.18 SF35 Storage Array . . . 5–36
5.4.19 TF857-CA Tape Drive . . . 5–39
5.4.20 Power Distribution Box . . . 5–42

6 Managing Integrated Storage Elements
6.1 In This Chapter . . . 6–1
6.2 Loading the DUP Driver . . . 6–1
6.3 Using VMS DUP . . . 6–1
6.4 Using the Server Setup Switch . . . 6–2
6.5 Assigning DSSI Unit Numbers . . . 6–2
6.6 Warm Swapping an ISE . . . 6–3
6.6.1 Setting ISE Parameters . . . 6–5
6.6.2 ISE Removal . . . 6–7
6.6.3 ISE Replacement . . . 6–8
6.6.4 Installing an ISE in a Running System . . . 6–11

A Miscellaneous System Information
A.1 In This Appendix . . . A–1
A.2 Processor Halt Codes . . . A–1
A.3 Console Halt Codes . . . A–3
A.4 Error Register Descriptions . . . A–4
A.4.1 System Fault (SYSFLT) Register . . . A–4
A.4.2 System Error Address (SYSADR) Register . . . A–7
A.4.3 DMA Error Address (DMAADR) Register . . . A–7
A.4.4 Reset Reason 0013 Fault Analysis . . . A–8
A.5 I/O Physical Address Space . . . A–8
A.6 System Control Block Description . . . A–10

B ISE Parameter Worksheets
B.1 In This Appendix . . . B–1
B.2 Individual ISE Parameter Worksheets . . . B–1
B.3 ISE Zone Parameter Worksheets . . . B–3

Index

Examples
2–1 Indirect Addressing . . . 2–12
5–1 How to Shut Down a Zone . . . 5–5
5–2 How to Verify Zone Shutdown . . . 5–5

Figures
1–1 Cabinet Layout, Front View . . . 1–2
1–2 Cabinet Layout, Rear View . . . 1–4
1–3 Zone Control Panel . . . 1–6
1–4 Power Module Controls and Indicators . . . 1–8
1–5 Domestic Power Distribution Box . . . 1–10
1–6 International Power Distribution Box . . . 1–11
2–1 System Components . . . 2–2
2–2 Console Operating Modes . . . 2–4
2–3 Boot Procedure . . . 2–7
3–1 Module Fault LEDs . . . 3–6
3–2 Power System Block Diagram (1 of 2) . . . 3–8
3–3 Power System Block Diagram (2 of 2) . . . 3–9
3–4 Power Module Controls and Indicators . . . 3–13
3–5 RF35 Disk Drawer Controls and Indicators . . . 3–20
3–6 SF35 Operator Control Panel . . . 3–21
3–7 SF35 Rear Panel Fault Indicator . . . 3–23
3–8 Location of SF73 Storage Array LEDs and Switchpacks . . . 3–24
3–9 Rear of the SF73 Storage Array . . . 3–25
3–10 TF85C Cartridge Tape Drive . . . 3–26
3–11 TF857 Operator Control Panel . . . 3–28
4–1 Hardware Error Handling Flowchart . . . 4–2
4–2 EHS Architectural Position . . . 4–4
4–3 OpenVMS Error Log Format . . . 4–19
4–4 Fault Summary Block . . . 4–20
4–5 FRU Information Block . . . 4–22
4–6 Deconfiguration Information Block . . . 4–24
4–7 Threshold Information Block . . . 4–26
4–8 Fault Data Block . . . 4–27
4–9 End Action Timeout Block . . . 4–30
4–10 VAXELN Detected Error Block . . . 4–30
4–11 Software Detected Error Block . . . 4–35
4–12 Unsynchable Event Block . . . 4–37
4–13 Firmware and OpenVMS Data Structure Memory Map . . . 4–54
4–14 Dispatch Block Structure . . . 4–59
4–15 SubDCB Links to DCB . . . 4–64
5–1 Latches . . . 5–6
5–2 CPU Module and ATM Module Locations . . . 5–7
5–3 SIMM Locations . . . 5–8
5–4 MMB Locations . . . 5–9
5–5 Fan Location . . . 5–10
5–6 FCSB Location . . . 5–11
5–7 RF35 Disk Drive Location . . . 5–12
5–8 Zone Control Panel . . . 5–15
5–9 FEU, 3.3V Regulator, 5V Regulator, and PSC Locations . . . 5–16
5–10 Cross-Link Assembly . . . 5–18
5–11 Module Extraction Tool . . . 5–19
5–12 Console Extender Module Location . . . 5–20
5–13 Console Extender Module Layout . . . 5–21
5–14 DSSI Extender Module Locations . . . 5–22
5–15 CAMP Module Locations . . . 5–24
5–16 DIM Location . . . 5–26
5–17 DIM Removal . . . 5–27
5–18 EIM Removal . . . 5–28
5–19 TF85C-BA Tape Drive, Rear View . . . 5–30
5–20 TF85C-BA Tape Drive Removal . . . 5–31
5–21 SF73 Disk Drive, Rear View . . . 5–32
5–22 SF73 Disk Drive, Front View . . . 5–33
5–23 SF73 Disk Drive Enclosure Removal . . . 5–34
5–24 SF73 Disk ISE Removal . . . 5–35
5–25 SF35 Storage Array, Rear View . . . 5–36
5–26 SF35 Storage Array, Front View . . . 5–37
5–27 SF35 Disk ISE Removal . . . 5–38
5–28 TF857-CA Tape Drive, Rear View . . . 5–39
5–29 Loosening the Shipping Restraint Screw . . . 5–40
5–30 Setting the TF857 Tape Loader Node ID . . . 5–41
5–31 Domestic Power Distribution Box . . . 5–42
5–32 International Power Distribution Box . . . 5–43
6–1 VAXft Model 810 Front View . . . 6–3
6–2 VAXft Model 810 Rear View . . . 6–4
A–1 System Fault Register . . . A–4
A–2 JXD System Error Address Register . . . A–7
A–3 JXD DMA Error Address Register . . . A–7
A–4 I/O Physical Address Space . . . A–9
A–5 System Control Block Base Register . . . A–10
A–6 System Control Block Vector Format . . . A–10

Tables
1–1 Key to Figure 1–1, Cabinet Layout, Front View . . . 1–3
1–2 Key to Figure 1–2, Cabinet Layout, Rear View . . . 1–5
1–3 Key to Figure 1–3, Zone Control Panel . . . 1–7
1–4 Key to Figure 1–4, Power Module Controls and Indicators . . . 1–9
1–5 Key to Figure 1–5, Domestic Power Distribution Box . . . 1–10
1–6 Key to Figure 1–6, International Power Distribution Box . . . 1–11
2–1 Key to Figure 2–1, System Components . . . 2–2
2–2 Function of the Console Components . . . 2–3
2–3 Console Control Characters and Function Keys . . . 2–5
2–4 Console Command Language Syntax . . . 2–6
2–5 Qualifiers for BOOT . . . 2–9
2–6 VMB Program /R5:<flag> Values . . . 2–10
2–7 Qualifier for CLEAR . . . 2–10
2–8 Qualifiers for DEPOSIT . . . 2–11
2–9 Address-Spec Symbolic Addresses . . . 2–12
2–10 Qualifiers for DUP . . . 2–13
2–11 Qualifiers for EXAMINE . . . 2–14
2–12 Address-Spec Symbolic Addresses . . . 2–14
2–13 Qualifiers for FIND . . . 2–15
2–14 INITIALIZE Steps . . . 2–16
2–15 SET Variables and Values . . . 2–17
2–16 SHOW Variables . . . 2–18
2–17 Qualifiers for TEST Selection . . . 2–20
2–18 Qualifiers for TEST Control . . . 2–21
2–19 Qualifier for Z . . . 2–22
3–1 Before Stopping a Zone . . . 3–2
3–2 After a Zone is Repaired . . . 3–2
3–3 Before Leaving the Site . . . 3–3
3–4 Cautions . . . 3–4
3–5 General Troubleshooting Procedure . . . 3–4
3–6 Key to Figure 3–1, Module Fault LEDs . . . 3–7
3–7 Power System Functional Summary . . . 3–10
3–8 System DC Voltage Summary . . . 3–12
3–9 Key to Figure 3–4, Power Module Controls and Indicators . . . 3–13
3–10 Fan, LDC, Temperature Error Codes . . . 3–15
3–11 FEU Error Codes . . . 3–16
3–12 PSC Error Codes . . . 3–16
3–13 12 V DC to DC Converter Error Codes . . . 3–18
3–14 2 V DC to DC Converter Error Codes . . . 3–18
3–15 3 V DC to DC Converter Error Codes . . . 3–19
3–16 5 V DC to DC Converter Error Codes . . . 3–19
3–17 12 V DC to DC Converter Error Codes . . . 3–19
3–18 RF35 Disk Drawer Controls and Indicators . . . 3–20
3–19 SF35 Operator Control Panel Description . . . 3–22
3–20 SF35 Rear Panel Controls and Indicator . . . 3–23
3–21 SF73 Front Panel Controls and Indicators . . . 3–24
3–22 TF85C Tape Drive Problems . . . 3–26
3–23 TF85C Cartridge Tape Drive Indicators . . . 3–27
3–24 TF857 OCP Controls and Indicators . . . 3–28
3–25 Qualifiers for TEST Selection . . . 3–30
3–26 Qualifiers for TEST Control . . . 3–30
3–27 Qualifier for Z . . . 3–31
3–28 CPU ROM-Based Diagnostic Descriptions . . . 3–31
3–29 I/O ROM-Based Diagnostic Descriptions . . . 3–34
4–1 EHS Error Notification . . . 4–2
4–2 Error Handling Flowchart Definitions . . . 4–3
4–3 System Operating Modes . . . 4–4
4–4 Error Types . . . 4–5
4–5 VAXELN Error Classes . . . 4–11
4–6 System FRUs . . . 4–12
4–7 ATM Deconfiguration Actions . . . 4–13
4–8 CPU Deconfiguration Actions . . . 4–14
4–9 I/O Expansion Module Deconfiguration Actions . . . 4–15
4–10 Interface Module Deconfiguration Actions . . . 4–15
4–11 Zone Deconfiguration Actions . . . 4–16
4–12 Cross-Link Cable Deconfiguration Actions . . . 4–16
4–13 FRU Thresholds . . . 4–17
4–14 OpenVMS Error Log Sizes . . . 4–19
4–15 Fault Summary Block Entry Descriptions . . . 4–20
4–16 FRU Information Block Entry Descriptions . . . 4–23
4–17 Deconfiguration Information Block Entry Descriptions . . . 4–24
4–18 Threshold Information Block Entry Descriptions . . . 4–26
4–19 4–20 4–21 4–22 4–23 4–24 4–25 4–26 4–27 4–28 4–29 4–30 4–31 4–32 4–33 4–34 4–35 4–36 4–37 4–38 4–39 5–1 5–2 5–3 5–4 5–5 5–6 5–7 5–8 5–9 5–10 5–11 5–12 5–13 5–14 5–15 5–16 5–17 5–18 5–19 5–20 5–21 5–22 6–1 6–2 6–3 6–4 System Register Entry Descriptions . . . End Actions Register Descriptions . . . End Action Timeout Block Entry Description . . . VAXELN Detected Error Block Entry Descriptions . . . Software Detected Error Block Entry Descriptions . . . Unsynchable Event Block Entry Descriptions . . . Module ID NVRAM/DCB Status Codes . . .
. . System Reset Action Codes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . System Reset Reason Codes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Error Handler Reset Reasons . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . I/O Reset Action Code Description . . . . . . . . . . . . . . . . . . . . . . . . . . . . I/O Reset Reason Code Descriptions . . . . . . . . . . . . . . . . . . . . . . . . . . CCA Component Descriptions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Duplex Compatibility Test Failure Codes . . . . . . . . . . . . . . . . . . . . . . . Dispatch Block Components . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . BPB Components . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . BPB Entry Components . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . DCB Components . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . DCB Entry Components . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . CPU SubDCB Components . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . CPU SubDCB Entry Components . . . . . . . . . . . . . . . . . . . . . . . . . . . . Model 810 FRUs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Handling FRUs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . CPU Module and ATM Module Removal Procedure . . . . . . . . . . . . . . . SIMM Removal Procedure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . MMB Removal Procedure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Fan and FCSB Removal Procedure . . . . . . . . . . . . . . . . . . . . . . . . . . . RF35 Disk Drive Removal Procedure . . . . . . . . . . . . . . . . . . . . . . . . . . DSSI Disk Drawer Removal Procedure . . . . . . . . . . . . . . . . . . . . . . . . 
Zone Control Panel Removal Procedure . . . . . . . . . . . . . . . . . . . . . . . . FEU, 3.3V Regulator, 5V Regulator, and PSC Removal Procedure . . . . Cross-Link Assembly Removal Procedure . . . . . . . . . . . . . . . . . . . . . . Console Extender Module Removal Procedure . . . . . . . . . . . . . . . . . . . DSSI Extender Module Removal Procedure . . . . . . . . . . . . . . . . . . . . . CAMP Module Removal Procedure . . . . . . . . . . . . . . . . . . . . . . . . . . . DIM Removal Procedure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . EIM Removal Procedure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . DSSI Cable Removal Procedure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . TF85C-BA Tape Drive Removal Procedure . . . . . . . . . . . . . . . . . . . . . . SF73 Disk Drive Enclosure Removal Procedure . . . . . . . . . . . . . . . . . . SF35 Storage Array Removal Procedure . . . . . . . . . . . . . . . . . . . . . . . TF857-CA Tape Drive Removal Procedure . . . . . . . . . . . . . . . . . . . . . . Power Distribution Box Removal Procedure . . . . . . . . . . . . . . . . . . . . . PARAMS Commands . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Switches For Disabling the MSCP . . . . . . . . . . . . . . . . . . . . . . . . . . . . ISE Parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Disabling the MSCP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4–28 4–29 4–30 4–30 4–35 4–37 4–39 4–51 4–52 4–52 4–54 4–54 4–55 4–58 4–60 4–60 4–61 4–61 4–61 4–65 4–65 5–1 5–4 5–7 5–8 5–9 5–11 5–13 5–14 5–15 5–17 5–19 5–21 5–23 5–25 5–27 5–29 5–29 5–31 5–33 5–38 5–40 5–43 6–2 6–2 6–5 6–9 6–5 A–1 A–2 A–3 A–4 A–5 A–6 Disabling the MSCP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Processor Halt Code Definitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
Processor Halt Reason Code Definitions . . . . . . . . . . . . . . . . . . . . . . . . Console Halt Reason Code Definitions . . . . . . . . . . . . . . . . . . . . . . . . . Xlink Mode Coding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Code Field Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . SCB Layout . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6–11 A–1 A–2 A–3 A–4 A–10 A–11 xiii 1 Cabinet and Component Descriptions 1.1 In This Chapter This chapter includes descriptions of the: • CPU and expansion cabinets • Zone control panel • Power modules • Domestic power distribution box • International power distribution box 1.2 CPU and Expansion Cabinets Figure 1–1 shows the front layout of an expanded system. Table 1–1 describes the components shown in Figure 1–1. Figure 1–2 shows the rear layout of an expanded system. Table 1–2 describes the components shown in Figure 1–2. Cabinet and Component Descriptions 1–1 Figure 1–1 Cabinet Layout, Front View 1 2 10 10 Front 3 6 10 10 4 5 8 1 7 15 11 12 10 10 16 9 2 13 14 Expansion Cabinet CPU Cabinet MR−0406−92RAGS 1–2 Cabinet and Component Descriptions Table 1–1 Key to Figure 1–1, Cabinet Layout, Front View Item Component Description 1 Zone A Complete computer with enough elements to run an operating system. 2 Zone B Complete computer with enough elements to run an operating system. 3 Fan assembly Cooling device. 4 Disk drawer Optional SF35 disk drive(s). System Module Card Cage 5 Slot 0 - CPU module Logic chips and memory. 6 Slot 1 - ATM module I/O logic supporting up to eight interface adapter cards. 7 Slot 2 - Not used For future expansion. 8 Zone control panel Zone controls and indicators. 9 Blank panel Not used. 10 Disk device Location for disk device. 11 Disk/tape device Location for disk or tape device. 12 Disk/tape/tape loader Location for disk, tape, or tape loader device. 
13    Power distribution box A    AC power source for Zone A.
14    Power distribution box B    AC power source for Zone B.
15    UPS A    Optional uninterruptible power supply for Zone A.
16    UPS B    Optional uninterruptible power supply for Zone B.

Figure 1–2 Cabinet Layout, Rear View
[Figure: rear view of the CPU cabinet, the expansion cabinet, and the expansion cabinet option, with callouts 1 to 25 keyed in Table 1–2.]

Table 1–2 Key to Figure 1–2, Cabinet Layout, Rear View
Item    Component    Description
1    Zone A    Complete computer with enough elements to run an operating system.
2    Zone B    Complete computer with enough elements to run an operating system.
3    Fan assembly    Cooling device.
4    Blank panel    Not used.
5    Front End Unit (FEU)    AC input circuit breaker.
6    FEU    Converts ac power to 48 Vdc.
7    FEU    AC input connector.
8    Regulator    Provides +3.3 Vdc at 30 A, +12 Vdc at 12.5 A, and bias.
9    Regulator    Provides +5 Vdc at 90 A.
10    Power system controller    Provides interface signals to the ATM module.
Miscellaneous Module Card Cage
11    Blank panel    Not used.
12    Slot 0 - Not used    For future expansion.
13    Slot 1 - Cross-link assembly    Connects Zone A and Zone B.
14    Slot 2 - Console module    Module with console port.
15    Slot 3 - Not used    Factory test module.
16    Slot 4 - Disk In/Disk Out module    Permits zone interconnections to access all configured disks.
17    Slot 5 - CAMP module    Provides custom power control circuits.
Interface Module Card Cage
18    Slots 10 to 17    DSSI and NI interface modules.
      Slots 20 to 27    For future expansion.
19    Disk device    Location for disk device.
20    Disk/tape/tape loader    Location for disk, tape, or tape loader device.
21    Disk/tape device    Location for disk or tape device.
22    Power distribution box A    AC power source for Zone A.
23    Power distribution box B    AC power source for Zone B.
24    UPS A    Optional uninterruptible power supply for Zone A.
25    UPS B    Optional uninterruptible power supply for Zone B.

1.3 Zone Control Panel

Figure 1–3 shows the layout of the zone control panel. Table 1–3 describes the functions of the zone control panel controls and indicators.

Figure 1–3 Zone Control Panel
[Figure: zone control panel with callouts 1 to 10 keyed in Table 1–3.]

Table 1–3 Key to Figure 1–3, Zone Control Panel
Item    Control/Indicator    Function
1    Logic Power - OFF    Two switches with amber indicators. Pressing the two switches removes 48 V power and disables the zone. Pressing one switch has no effect on the operation of the zone. (CPU cabinet disk power is not affected when logic power is removed by pressing these switches.)
2    Logic Power - ON    One switch with a green indicator. Pressing this switch applies 48 V power to the zone. (CPU cabinet disk power is not affected when logic power is applied by pressing this switch.)
3    Local Console    One switch with a green indicator. Pressing this switch connects the system to the console local port for communication.
4    Remote Console    One switch with a green indicator. Pressing this switch connects the system to the remote port for communication.
5    Secure    One switch with a green indicator. Pressing this switch disables the console Break key function. (You cannot use the console Break key to halt the zone or system.)
6    Zone Halt Enable    One switch with a green indicator. Pressing this switch enables the console Break key function. (You can use the console Break key to halt the zone.)
7    System Halt Enable    One switch with a green indicator. Pressing this switch enables the console Break key function. (You can use the console Break key to halt both zones.)
     Note: System Halt Enable is NOT supported in Simplex mode.
8    System OK    Green indicator. On when the system power is on and the system is operational.
9    System Fault    Amber indicator. On when the system is not operational.
10    OS Running    Green indicator. On when the system is operational and running a customer or diagnostic application.

1.4 Power Modules

Figure 1–4 shows the location of the power module controls and indicators. Table 1–4 describes their functions.

Figure 1–4 Power Module Controls and Indicators
[Figure: FEU, DC3, DC5, PSC, and CAMP modules with callouts 1 to 16 keyed in Table 1–4.]

Table 1–4 Key to Figure 1–4, Power Module Controls and Indicators
Item    Control/Indicator    Function
1    AC Circuit Breaker
2    FEU Failure    When on, indicates the dc output voltages for the FEU are below the specified minimum.
3    FEU OK    When on, indicates the dc output voltages for the FEU are above the specified minimum.
4    DC3 Failure    When on, indicates that one of the +3 Vdc output voltages is not within the specified tolerances.
5    DC3 OK    When on, indicates that the +3 Vdc output voltages are within the specified tolerances.
6    AC Present    When on, indicates ac power is present at the ac input connector, regardless of the position of the circuit breaker.
7    DC5 Failure    When on, indicates that one of the +5 Vdc output voltages is not within the specified tolerances.
8    DC5 OK    When on, indicates that the +5 Vdc output voltages are within the specified tolerances.
9    PSC Failure    When on, indicates a PSC fault.
10    PSC OK    When on, indicates the PSC is functioning. When blinking, indicates the PSC is performing power-on self-tests.
11    Over Temperature Shutdown    When on, indicates that the PSC shut down the system because of an internal overtemperature condition.
12    Fan Failure    When on, indicates a fan failure. Use the hexadecimal number in the Fault ID Display to isolate the fan.
13    Disk Power Failure    When on, indicates a disk power failure. Use the hexadecimal number in the Fault ID Display to isolate the storage compartment that houses the disk.
14    Fault ID Display    Displays power subsystem fault codes.
15    PSC Reset Button    When out, indicates a PSC fault condition. Press in to reset.
16    CAMP Fan Fault    When on, indicates that a fan fault caused all disk drives and tape drives to shut down.

1.5 Domestic and International Power Distribution Boxes

The domestic power distribution box (PN 30-24374-01) is shown in Figure 1–5. Table 1–5 describes the components shown in the figure. The international power distribution box (PN 30-35415-02) is shown in Figure 1–6. Table 1–6 describes the components shown in the figure.

Figure 1–5 Domestic Power Distribution Box
[Figure: domestic power distribution box with callouts 1 to 5 keyed in Table 1–5.]

Table 1–5 Key to Figure 1–5, Domestic Power Distribution Box
Item    Component    Description
1    Three-phase power cord    Connects the power distribution box to ac power. The power cord may be repositioned by moving the locking arm.
2    Circuit breaker    When set to on, ac power is applied to the distribution box.
3    Local/Remote switch    The switch has icons representing Remote, Off, and Local. When set to:
     • Local, the internal bus controls the operation of ac power.
     • Off, the distribution box is turned off.
     • Remote, the distribution box is turned on (if the power cord is connected to ac power and the circuit breaker is set to on).
4    For power cords    Used to dress the power cords.
5    Eight ac outlets    Reserved for the FEU and expansion cabinet.

Figure 1–6 International Power Distribution Box
[Figure: international power distribution box with callouts 1 to 5 keyed in Table 1–6.]

Table 1–6 Key to Figure 1–6, International Power Distribution Box
Item    Component    Description
1    Single-phase power cord    Connects the power distribution box to ac power.
2    Circuit breaker    When set to on, ac power is applied to the distribution box.
3    Local/Remote switch    The switch has icons representing Remote, Off, and Local. When set to:
     • Local, the internal bus controls the operation of ac power.
     • Off, the distribution box is turned off.
     • Remote, the distribution box is turned on (if the power cord is connected to ac power and the circuit breaker is set to on).
4    For power cords    Used to dress the power cords.
5    Six ac outlets    Reserved for the expansion cabinet.

2 Console Operations

2.1 In This Chapter

This chapter describes the console, console operating modes and commands, and booting information. This chapter includes:
• Console description
• Console operating modes
• Console control characters
• Console command language syntax
• Bootstrap procedures
• Entering CIO mode
• CIO mode console commands

2.2 Console Description

The system architecture (Figure 2–1 and Table 2–1) supports in each zone:
• A local console terminal
• The console firmware (programs located in ROM) residing on:
    The primary NCIO module
    The CPU module
• A remote console terminal

The remote console terminal and the local console terminal are connected to the zone through the primary NCIO module. The console operates a terminal that may be:
• Connected to the CPU serial port
• On the system console port

Figure 2–1 System Components
[Figure: CPU cabinet containing two zones, each with callouts 2 to 8; the zones are joined by the cross-link cable. Callouts are keyed in Table 2–1.]

Table 2–1 Key to Figure 2–1, System Components
Number    Component
1    CPU cabinet
2    Zone (A or B)
3    CPU module
4    To memory
5    Primary NCIO module
6    Cross-link cable
7    Local console terminal
8    Remote console terminal (optional)

Table 2–2 describes the function of each console component.

Table 2–2 Function of the Console Components
Part    Function
Local console terminal    Terminal located with the system that is used for console input and display output.
Remote console port    One remote port is available in each zone. The port may be connected to a remote console terminal through a modem. There is no built-in modem control. The remote console port provides the same functions as the local console port.
Console firmware    The console firmware resides on the primary NCIO module and on the CPU module.

You can use any one of the four console terminals (local or remote) for input commands, but use only one terminal at a time. All of the console terminals echo the response of the system to a console command. If the system is operating with a single zone running, you must use a console terminal (local or remote) that is connected to that zone for input commands.

2.3 Console Operating Modes

Operators communicate with the system in one of the following input/output modes:
• Program I/O (PIO) mode
• Console I/O (CIO) mode

Normal operation takes place in the PIO mode. From PIO mode, the operator uses the console to:
• Log in
• Use the mail facility
• Create and edit files

From CIO mode, the operator executes the console commands. These commands are described in Section 2.8.

2.3.1 Entering CIO Mode

The CIO mode is entered when you turn on system power if:
• The Zone Halt Enable switch is pressed
• A STOP/ZONE instruction is executed
• A severe processor condition occurs
• An external halt is detected

Once CIO mode is entered, the console displays the >>> prompt and is ready to execute commands entered at the prompt.

2.3.2 Exiting CIO Mode

The CIO mode is exited by issuing one of the following console commands:
• BOOT
• START
• CONTINUE

These commands are described in Section 2.8. Figure 2–2 shows how to move between PIO and CIO modes.

Figure 2–2 Console Operating Modes
[Figure: PIO mode and CIO mode. BOOT, CONTINUE, and START lead from CIO mode to PIO mode; STOP/ZONE leads from PIO mode to CIO mode.]

2.4 Console Control Characters

The ASCII control characters and function keys listed in Table 2–3 have special meanings when typed on a console terminal.

Table 2–3 Console Control Characters and Function Keys
Character/Key    Function
Break    In CIO mode, acts like Ctrl/C. In PIO mode, causes the processor to halt and begin running the console program. If the system is in a secure mode when you press the Break key, the halt is suppressed. If you press the Zone Halt Enable or System Halt Enable switch, the halt initiated by pressing the Break key earlier is enabled.
Ctrl/C    Echoes ^C and causes the console to abort processing of a command, if possible.
Ctrl/O    Alternately enables and disables output.
Ctrl/Q    Resumes output previously suspended by Ctrl/S.
Ctrl/R    Echoes ^R and retypes the command line.
Ctrl/S    Stops transmission until Ctrl/Q is typed.
Ctrl/U    Echoes ^U and ignores the current command line. The console prompt is displayed on the next line. This affects only the entry of the current line. Pressing Ctrl/U does not abort a command that is executing.
<x (delete)    Deletes the character to the left of the cursor. On video terminals, the deleted characters disappear. On hard-copy terminals, the deleted characters are typed within a pair of backslash delimiters as they are deleted.
Esc or Ctrl/[    Suppresses any special meaning associated with a given character.
Return    Terminates a command line and executes the command.

2.5 Console Command Language Syntax

The console commands accept qualifiers. Qualifiers specify a numerical value or select an option from a list of options. Command elements may be abbreviated and any extra tabs or spaces are ignored. Unless otherwise noted, numerical values must be given in hexadecimal notation. The command length may not exceed 80 characters. Table 2–4 lists the console command language syntax rules. The console commands available for the system are listed in Section 2.8.

Table 2–4 Console Command Language Syntax
Command Element    Rule
Abbreviations    A command verb or argument may be abbreviated to the extent that it remains unique.
Multiple adjacent spaces and tabs    Are treated as a single space.
Qualifiers    May appear after a command verb, option, or symbol. They must be preceded by a slash (/).
Numbers    Must be hexadecimal.
No characters    Are treated as a null command. No action is taken.

2.6 Bootstrap Procedures

The BOOT command initializes the system and then loads and starts the virtual memory bootstrap (VMB) program from read-only memory (ROM). The VMB program, in turn, loads and starts the operating system from the specified boot device. Figure 2–3 shows the steps in the boot procedure.

Figure 2–3 Boot Procedure
[Figure: flowchart with the following steps.]
1. Enter BOOT command at the >>> console prompt.
2. Boot procedure initializes the system.
3. Boot procedure loads VMB into main memory.
4. VMB loads the operating system.

The VMB program is the primary bootstrap program. VMB:
• Resides in ROM on the ATM module.
• Is loaded into memory and initiated by the system console firmware.
• Provides the necessary parameters for successful operation of the OpenVMS secondary bootstraps.
• Allows you to boot from DSSI-compatible disk and tape devices over the Ethernet.

2.7 Entering CIO Mode

To recognize and process CIO commands:
• The System Halt Enable switch on both zone control panels must be pressed
• The operating software must be halted
• The processor must be running the console firmware

The example below shows how to use the Break key to enter CIO mode from PIO mode and then return to PIO mode by using the CONTINUE command. The System Halt Enable switch on both zone control panels must be pressed.

Caution
Use CONTINUE to continue from a system halt. Use START/ZONE to continue from a zone halt.

A remote operator can use CIO mode only when full access privileges for the remote console have been set at the local console.

Example
$                       ! Press the System Halt Enable switch on
$                       ! both zone control panels.
$                       ! From PIO mode, press the Break key once.
$                       ! This puts the processor in HALT mode.
$ Break                 !
>>>                     !
?002 External halt      !
PC = 01E01473
>>> CONTINUE            ! This command resumes execution of the
$                       ! operating system software.
                        ! The console returns to PIO mode.

Notice that comments (characters following an exclamation point (!)) are allowed on a command line. Comments are ignored by the console when the Return key is pressed. This may be useful when you document a console session on a hardcopy terminal. Notice also that lowercase characters are accepted, but the console converts all characters to uppercase.

2.8 CIO Mode Console Commands

This section describes the CIO mode console commands. The console commands are listed below with command abbreviations shown in capital letters.

Boot
CLEAR
Continue
Deposit
DUP
Examine
Find
HElp
Initialize
Move
MATCH_ZONES
Repeat
SEt
SHow
Start
Test
X (transfer)
Z
! (comment)

2.8.1 BOOT

BOOT initializes the system, loads a program image from a specified boot device, and transfers control to that program image. When you do not supply a boot-spec, the default boot device is used. When you do not supply flag(s), a value of 0 is assumed. The console program accepts a terminating colon on the boot-spec, but ignores the colon when the name is processed.

The BOOT syntax is:

BOOT[/OVER][[/R5:]<flag(s)> boot-spec]

The boot-spec format may be dduuu/PATH=path-list . . . dduuu/PATH=path-list, where:
    dd is a device mnemonic.
    uuu is a unit number (0 to 999).
    /PATH=path-list is a qualifier. See Table 2–5.

Or, the boot-spec format may be a variable that specifies the boot devices and paths. See Section 2.8.13.1.

Table 2–5 describes the qualifiers. Table 2–6 lists the VMB program /R5:<flag> values.

Table 2–5 Qualifiers for BOOT
Qualifier    Function
/R5:<flag>    Passes parameters to the virtual memory bootstrap (VMB) program. See Table 2–6.
/PATH=path-list    Specifies a path to a boot device. The path-list specifies zones and slot numbers in the path. When the path-list has more than one slot, you separate the slots by commas.
    The path-list format is zss, where:
        z is a zone ID (A or B). [1]
        ss is a slot number (10 to 17, 20 to 27) of an adapter connecting to a boot device.
/OVER    Overrides the results of the bootability test to allow a Simplex mode boot.
[1] The console validates this field before invoking VMB.

Table 2–6 VMB Program /R5:<flag> Values
Bit    Hex Value    Function    Action
0    1    Conversational boot    Returns to the SYSBOOT> prompt.
1    2    Debug    Maps the XDELTA program into the system page table.
2    4    Initial breakpoint    Operating system issues a breakpoint after turning on memory management.
3    8    Secondary boot    Boots from boot block specified in /R4:n.
5    20    Bootstrap breakpoint    Transfers control to the XDELTA program.
8    100    Solicit file name    VMB issues a prompt for the secondary boot procedure.
9    200    Halt before transfer    VMB executes a halt before transferring control to the secondary bootstrap procedure.
31:28    x0000000    Top-level system boot    Specifies the top-level directory number for a system disk with multiple system roots, where x = a hex value from 0 to F.

2.8.2 CLEAR

CLEAR BOOT deletes a boot-spec. CLEAR ERRORS clears the error frame of the previously detected error. If you do not clear the error frame, the next error is not recorded in the error frame. CLEAR BROKE clears the broke bit in EEPROM.

The following CLEAR syntax deletes a boot-spec:

CLEAR BOOT <name>

The following CLEAR syntax clears the error frame:

CLEAR ERRORS

The following CLEAR syntax clears the broke bit ID in EEPROM:

CLEAR BROKE[/PATH=path-number]

Table 2–7 describes the /PATH=path-number qualifier.

Table 2–7 Qualifier for CLEAR
Qualifier    Function
/PATH=path-number    Specifies the zone and slot number of the module to clear. The path-number format is zss, where:
        z is the zone ID (A or B).
        ss is the slot number (0 to 2, 10 to 17, 20 to 27) of an adapter connecting to a DSSI device.

CLEAR BROKE clears the module ID EEPROM in the zone that is running.
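
As an illustration of the encodings in Table 2–6 and Table 2–7, the following Python sketch ORs /R5 flag bits into a single hexadecimal value and checks a path-number of the form zss. It is not console firmware code: the names R5_FLAGS, r5_value, and parse_path_number are hypothetical, and the sketch assumes that the slot is written as decimal digits.

```python
# Hypothetical helpers; the bit values mirror Table 2-6 and the slot
# ranges mirror Table 2-7. None of these names come from the manual.
R5_FLAGS = {
    "conversational_boot": 0x001,   # bit 0
    "debug": 0x002,                 # bit 1
    "initial_breakpoint": 0x004,    # bit 2
    "secondary_boot": 0x008,        # bit 3
    "bootstrap_breakpoint": 0x020,  # bit 5
    "solicit_file_name": 0x100,     # bit 8
    "halt_before_transfer": 0x200,  # bit 9
}

# Slots accepted by CLEAR BROKE/PATH: 0 to 2, 10 to 17, 20 to 27.
VALID_SLOTS = set(range(0, 3)) | set(range(10, 18)) | set(range(20, 28))


def r5_value(flags, root=0):
    """OR the named flags together; root (0 to F) occupies bits 31:28."""
    if not 0 <= root <= 0xF:
        raise ValueError("root must be in the range 0 to F")
    value = root << 28
    for name in flags:
        value |= R5_FLAGS[name]
    return value


def parse_path_number(text):
    """Split a 'zss' path-number into (zone, slot), validating both parts."""
    zone, slot_text = text[:1].upper(), text[1:]
    if zone not in ("A", "B"):
        raise ValueError("zone must be A or B")
    if not slot_text.isdigit():
        raise ValueError("slot must be numeric")
    slot = int(slot_text)
    if slot not in VALID_SLOTS:
        raise ValueError("slot out of range")
    return zone, slot


print(format(r5_value(["conversational_boot", "debug"]), "X"))  # prints 3
print(format(r5_value(["conversational_boot"], root=1), "X"))   # prints 10000001
print(parse_path_number("A10"))                                 # prints ('A', 10)
```

Under these assumptions, a conversational boot from system root 1 would correspond to a flag value of 10000001 (hex).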
2.8.3 CONTINUE

CONTINUE exits the CIO mode and returns operation to the PIO mode.

Caution
Use CONTINUE to continue from a system halt. Use START/ZONE to continue from a zone halt.

The CONTINUE syntax is:

CONTINUE

2.8.4 DEPOSIT

DEPOSIT stores the specified data in the specified address. When the system is initialized or when any transition from a running to a halted state occurs, the defaults are physical address space 0 and data size longword.

The DEPOSIT syntax is:

DEPOSIT[/{B,W,L,Q}][/{G,I,M,P,V,U}][/N:count]address-spec data-spec

The address-spec identifies a physical or virtual hexadecimal memory address. A qualifier may be placed before or after an address-spec or data-spec. The data-spec identifies a hexadecimal number to be stored, unless the default radix has been changed with a %D introducer. When you do not supply a data-spec, a value of 0 is assumed.

Table 2–8 describes the qualifiers. Table 2–9 lists the address-spec symbolic addresses.

Table 2–8 Qualifiers for DEPOSIT
Qualifier    Function
/B    Sets the data size to byte.
/W    Sets the data size to word.
/L    Sets the data size to longword.
/Q    Sets the data size to quadword.
/G    Sets general purpose register address space R0 through PC.
/I    Sets internal processor register (IPR) address space accessed by the MTPR and MFPR instructions.
/P    Sets physical address space.
/V    Sets virtual address space. An EXAMINE to virtual memory returns the translated physical address. A DEPOSIT to virtual memory sets the PTE <M> bit.
/U    Sets access to console private memory. This qualifier must be specified for each command.
/N:count    Specifies the number of consecutive locations to modify. The console deposits to the first address, then to the specified number of succeeding addresses. This qualifier must be specified for each command.
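
The effect of /N:count can be sketched as follows. This Python fragment lists the addresses that a DEPOSIT with a given data size and count would touch: the first address plus count succeeding addresses. It is an illustration only; the function name deposit_addresses is hypothetical, and the sketch assumes that each succeeding address advances by the data size (1 for byte, 2 for word, 4 for longword), with 8 assumed for quadword.

```python
# Hypothetical illustration of /N:count; not taken from the console firmware.
# Step sizes per data-size qualifier (the quadword step of 8 is an assumption).
SIZES = {"B": 1, "W": 2, "L": 4, "Q": 8}


def deposit_addresses(start, count=0, size="L"):
    """Return the first address plus `count` succeeding addresses."""
    step = SIZES[size]
    return [start + i * step for i in range(count + 1)]


# DEPOSIT/L/N:3 200 <data> would touch four longwords under these assumptions.
print([format(a, "X") for a in deposit_addresses(0x200, count=3, size="L")])
# prints ['200', '204', '208', '20C']
```

Note that a count of 3 yields four locations, because the count covers only the addresses that follow the first one.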

Table 2–9 Address-Spec Symbolic Addresses
Symbolic Address    Description
R<n>    General purpose register number n, where n is a decimal number 0 to 15.
FP    Frame pointer.
AP    Argument pointer.
SP    Stack pointer.
PC    Program counter.
PSL    Program status longword.
+    A location following the last location accessed by an EXAMINE or DEPOSIT. The location is the last address plus the size of the last reference (1 for byte, 2 for word, 4 for longword).
-    A location preceding the last location accessed by an EXAMINE or DEPOSIT. The location is the last address minus the size of the last reference (1 for byte, 2 for word, 4 for longword).
*    The last location referenced by an EXAMINE or DEPOSIT.
@    Indirect addressing. The address-spec is used as a pointer to the data. The format is @address-spec, where address-spec can be any valid address except another @. See Example 2–1.

Note
Remember that the symbolic addresses from the previous command are used for indirect addressing. See Example 2–1.

Example 2–1 Indirect Addressing
>>> DEPOSIT R0 200       ! The value 200 is stored directly in R0. The defaults
                         ! are set to longword, general purpose register.
>>> DEPOSIT/P @R0 200    ! The value 200 is stored directly in the address pointed
                         ! to by R0. The /P qualifier tells the parser that the
                         ! value in R0 should be treated as a physical address.
                         ! The defaults are set to longword, physical.
>>> DEPOSIT/V @R0 200    ! The value 200 is stored directly in the address pointed
                         ! to by R0. The /V qualifier tells the parser that the
                         ! value in R0 should be treated as a virtual address.
                         ! The defaults are set to longword, virtual.
>>> DEPOSIT @200         ! The value 200 is stored in the address specified in
                         ! the previous command. The defaults are set to longword,
                         ! virtual.

2.8.5 DUP

DUP connects to the DSSI DUP service on a selected node. DUP is used to examine and modify the parameters of a DSSI device.
The DUP syntax is:

DUP[/PATH:<path-number>] node-id [/TASK:task]

The node-id identifies the node number (0 to 7) of a DSSI device attached to the console. Table 2–10 describes the qualifiers.

Table 2–10 Qualifiers for DUP
Qualifier    Function
/PATH=path-number    Specifies the zone and slot number of an adapter connecting to a DSSI device. The path-number format is zss, where:
        z is the zone ID (A or B).
        ss is the slot number (10 to 17, 20 to 27) of an adapter connecting to a DSSI device.
node-id    Specifies the DSSI node connecting to a DSSI device. Valid node-ids are 0 to 5.
/TASK:task    Invokes a task from a DSSI device. Valid DUP tasks are:
        DRVEXR
        DRVTST
        HISTRY
        DIRECT
        ERASE
        VERIFY
        DKUTIL
        PARAMS

2.8.6 EXAMINE

EXAMINE displays the contents of the specified memory location or register. The display line consists of:
• A single-character address specifier
• The hexadecimal physical address to be examined
• The examined data in hexadecimal

When the system is initialized or when any transition from a running to a halted state occurs, the defaults are physical address space 0 and data size longword.

The EXAMINE syntax is:

EXAMINE[/{B,W,L,Q}][/{G,I,M,P,V,U}][/N:count][/A][address-spec]

The address-spec identifies a physical or virtual hexadecimal memory address. A qualifier may be placed before or after the address-spec.

Table 2–11 describes the qualifiers. Table 2–12 lists the address-spec symbolic addresses.

Table 2–11 Qualifiers for EXAMINE
Qualifier    Function
/B    Sets the data size to byte.
/W    Sets the data size to word.
/L    Sets the data size to longword.
/Q    Sets the data size to quadword.
/G    Sets general purpose register address space R0 through PC.
/I    Sets internal processor register (IPR) address space accessed by the MTPR and MFPR instructions.
/P    Sets physical address space.
/V    Sets virtual address space. An EXAMINE to virtual memory returns the translated physical address. A DEPOSIT to virtual memory sets the PTE <M> bit.
/U           Sets access to console private memory. This qualifier must be specified for each command.
/N:count     Specifies the number of consecutive locations to modify. The console deposits to the first address, then to the specified number of succeeding addresses. This qualifier must be specified for each command.
/A           Interprets and displays the data as ASCII characters. Nonprinting characters are displayed as periods.

Table 2–12 Address-Spec Symbolic Addresses

Symbolic Address    Description
R<n>                General purpose register number n, where n is a decimal number 0 to 15.
FP                  Frame pointer.
AP                  Argument pointer.
SP                  Stack pointer.
PC                  Program counter.
PSL                 Program status longword.
+                   A location following the last location accessed by an EXAMINE or DEPOSIT. The location is the last address plus the size of the last reference (1 for byte, 2 for word, 4 for longword).
-                   A location preceding the last location accessed by an EXAMINE or DEPOSIT. The location is the last address minus the size of the last reference (1 for byte, 2 for word, 4 for longword).
*                   The last location referenced by an EXAMINE or DEPOSIT.
@                   Indirect addressing. The address-spec is used as a pointer to the data. The format is @address-spec, where address-spec can be any valid address except another @. See Example 2–1.

Note
Remember that the symbolic addresses from the previous command are used for indirect addressing. See Example 2–1.

2.8.7 FIND

FIND searches the main memory beginning at physical address space 0 for either a page-aligned 512-Kbyte segment of memory, or a restart parameter block (RPB). When FIND is successful, it saves the address plus the segment of memory (or RPB) in the stack pointer. When FIND is unsuccessful, an error message is displayed and the contents of the stack pointer are unpredictable.

The FIND syntax is:

    FIND

Table 2–13 describes the qualifiers.
Table 2–13 Qualifiers for FIND

Qualifier    Function
/MEMORY      Searches main memory for a page-aligned 512-Kbyte segment of memory.
/RPB         Searches main memory for a restart parameter block. The search leaves memory unchanged.

2.8.8 HELP

HELP displays a summary of the commands, their arguments, and qualifiers. When you supply a command name, HELP displays the arguments and qualifiers for that command only. HELP does not provide complete descriptions of the commands.

The HELP syntax is:

    HELP [command]

Or:

    ? [command]

2.8.9 INITIALIZE

INITIALIZE performs the steps shown in Table 2–14.

Table 2–14 INITIALIZE Steps

Step    Action
1       Do hard reset of zone (the cross-link state is set to off).
2       Do hard reset of all available ATMs.
3       Initialize hardware.
4       Reconfigure the zone and update the device configuration block (DCB) to reflect the zone status.
5       Execute the Duplex Compatibility Test.
6       Load the firmware into the console main loop.

The INITIALIZE syntax is:

    INITIALIZE

2.8.10 MOVE

MOVE transfers the specified number of bytes (count) from the source-address to the destination-address.

The MOVE syntax is:

    MOVE source-address destination-address count

The source-address is the starting address of the data. The destination-address is the starting address of the destination. The count is the number of bytes to be moved.

2.8.11 MATCH_ZONES

MATCH_ZONES copies the system-wide module data EEPROM from the other zone. MATCH_ZONES does not copy the zone-specific module data EEPROM. Use MATCH_ZONES only when:
• The cross-link state is set to off, and
• The path to the other zone is available. (The cross-link cables and the other zone's power are on.)

The MATCH_ZONES syntax is:

    MATCH_ZONES

2.8.12 REPEAT

REPEAT continuously executes the specified command. REPEAT applies to the following commands only:
• DEPOSIT
• EXAMINE

REPEAT can be aborted by pressing Ctrl/C at the console keyboard.
The REPEAT syntax is:

    REPEAT command

2.8.13 SET

SET modifies the value of the specified variable.

The SET syntax is:

    SET variable value [value]

Note
SET does not allow abbreviations. You must enter the name of the variable completely.

Table 2–15 lists the variables with the acceptable values.

Table 2–15 SET Variables and Values

Variable        Description                    Acceptable Values
BOOT DEFAULT    Default boot specification.    Up to 80 characters of ASCII text
MODE            Boot mode.                     FAILSTOP = Simplex mode; FAILSAFE = Duplex mode
RESTART         Halt action switch.            HALT = Enter console mode; BOOT = Boot; RESTART = Restart
BAUD            Console port speed.            300, 600, 1200, 2400, 4800, 9600, 19200, 38400
ZONE            Zone identification.           A = Zone A; B = Zone B

2.8.13.1 SET BOOT

SET BOOT saves the values of boot-specs. Space for nine boot-specs is available on the CPU module EEPROM. The first space is reserved for the default boot-spec. The other eight spaces are available to the user.

The SET BOOT syntax is:

    SET BOOT DEFAULT value

Or:

    SET BOOT boot-spec value

The boot-spec may be up to 8 characters of ASCII text. The value is the ASCII text assigned to the boot-spec.

2.8.14 SHOW

SHOW displays information about the specified variable. When the cross-link state is off (Simplex mode), information about the current zone is displayed. When the cross-link state is on (Duplex mode), information about both zones is displayed.

The SHOW syntax is:

    SHOW variable

Table 2–16 lists the variables. You must supply a variable.

Table 2–16 SHOW Variables

Variable    Description             Acceptable Values
DEFAULT     Default specification.  Up to 80 characters of ASCII text
MODE        Boot mode.              FAILSTOP = Simplex mode; FAILSAFE = Duplex mode
RESTART     Halt action switch.     HALT = Enter console mode; BOOT = Boot; RESTART = Restart
BAUD        Console port speed.     300, 600, 1200, 2400, 4800, 9600, 19200, 38400
ZONE        Zone identification.    A = Zone A; B = Zone B
BOOT        Displays the saved boot specifications.
CONFIGURATION    Displays the current system configuration, including the identity and status of any modules in the system.
VERSION          Displays the firmware revision of all ROMs in the system.
DSSI/PATH=path-number    Specifies the zone and slot number of an adapter connecting to a DSSI device. The path-number format is zss, where z is the zone ID (A or B) and ss is the slot number (10 to 17, 20 to 27) of an adapter connecting to a DSSI device.
ETHERNET         Displays the physical Ethernet addresses.
MEMORY           Displays system memory information.
STATE            Displays the state of the cross-link and the system cables.
ERRORS           Displays the diagnostic error frames. Not allowed if the cross-link state is on.
ALL              Displays the contents of all variables.

2.8.15 START

START begins execution of the operating software from the specified address. START is equivalent to DEPOSIT PC followed by CONTINUE.

The START syntax is:

    START address-spec

You must supply an address-spec.

2.8.16 TEST

TEST enables the user to test:
• The system
• A zone
• The CPU and memory

Use TEST only when the cross-link state is set to off.

The TEST syntax is:

    TEST [qualifier(s)]

Tables 2–17 and 2–18 describe the TEST selection and control qualifiers.

Table 2–17 Qualifiers for TEST Selection

Qualifier      Function
/GROUP:n¹      Specifies a decimal number from 0 to 5 that identifies the group of tests to be run.
/TEST:n¹       Specifies a decimal number from 0 to 32 that identifies the tests to be run.
/SUBTEST:n¹    Specifies a decimal number from 0 to 32 that identifies the subtests to be run.
/VERBOSE       Enables a display of all individual tests during execution.
/NOTRACE       Disables test traces.
¹ n can be a:
• Single value
• Range separated by a colon (1:5)
• List separated by commas (1,5,9)
• Combination of range and list (1:6,8,10,11:29)

Table 2–18 Qualifiers for TEST Control

Qualifier       Function
/PASSCOUNT:n    n is a decimal number from 0 to MAXINT. When n is 0, the passcount is infinite.
/NOTRACE        Disables the test traces.
/COE            Continues on error.
/NOCONFIRM      Disables the test confirmation on destructive tests.
/EXTENDED       Enables extended error reports.
/NOSTATUS       Disables status messages and reports.
/LIST           Lists the available tests, but does not run them.

When you do not supply the qualifier(s), TEST runs all the nonextended tests (except those that require confirmation).

2.8.17 X(transfer)

X is used by automatic systems communicating with the console. X is not intended for use by operators.

X loads or unloads the count of bytes beginning at the specified address. When the high-order bit of the count longword is 1, the data is read from physical memory to the console terminal. When the high-order bit of the count longword is 0, the data is written from the console terminal to physical memory.

The X syntax is:

    X address-spec count Return data-stream checksum

The address-spec is a hexadecimal number that specifies a physical address. The count is an 8-bit hexadecimal number that specifies a number of bytes. The data-stream contains the bytes to be transferred by X. The checksum is a 2-digit hexadecimal number that specifies the 2's complement checksum of the data-stream. The checksum verifies the data-stream.

2.8.18 Z

Z connects to the firmware of another module in the system.

The Z syntax is:

    Z[/PATH=path-number]

Table 2–19 describes the qualifier.

Table 2–19 Qualifier for Z

Qualifier            Function
/PATH=path-number    Specifies the zone and slot number of a module. The path-number format is zss, where z is the zone ID (A or B) and ss is the slot number of the module.
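The zss path-number format used by the /PATH qualifier can be checked mechanically before a command is typed at the console. The following is a minimal Python sketch, not part of the console firmware. The names parse_path_number, PathNumber, and DSSI_ADAPTER_SLOTS are hypothetical, and the sketch assumes that ss is always written as exactly two decimal digits. The optional slot check uses the DSSI adapter ranges (10 to 17, 20 to 27) given for DUP and SHOW DSSI.

```python
from typing import NamedTuple

# Slot ranges the manual gives for adapters connecting to a DSSI device.
DSSI_ADAPTER_SLOTS = frozenset(range(10, 18)) | frozenset(range(20, 28))


class PathNumber(NamedTuple):
    zone: str  # zone ID, "A" or "B"
    slot: int  # slot number taken from the ss field


def parse_path_number(text: str, require_dssi_slot: bool = False) -> PathNumber:
    """Split a zss path-number into zone and slot (hypothetical helper)."""
    value = text.strip().upper()
    # Assumption: ss is exactly two decimal digits, so the whole field is 3 characters.
    well_formed = (
        len(value) == 3
        and value[0] in ("A", "B")
        and all(ch in "0123456789" for ch in value[1:])
    )
    if not well_formed:
        raise ValueError(f"expected zss (zone A or B plus two-digit slot), got {text!r}")
    slot = int(value[1:])
    if require_dssi_slot and slot not in DSSI_ADAPTER_SLOTS:
        raise ValueError(f"slot {slot} is outside the DSSI adapter ranges 10-17 and 20-27")
    return PathNumber(value[0], slot)


if __name__ == "__main__":
    print(parse_path_number("A12"))
    print(parse_path_number("b25", require_dssi_slot=True))
```

The slot range is enforced only on request because the manual states a range for DSSI adapters but not for the modules that Z can connect to.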
When you do not supply a path, Z tries to connect to the module in slot 1 of the zone that is running.

Note
Z performs a hard reset on the ATMs, but you need to issue a programmed reset to load and start the functional firmware. After Z, you must issue a BOOT from the same zone, or a START/ZONE from the other zone (if that zone is running the operating system).

2.8.19 !(comment)

The ! (exclamation point) prefixes a comment. The text following the ! is ignored.

The ! syntax is:

    !(comment)

Or:

    command!(comment)

3 System Maintenance

3.1 In This Chapter

This chapter includes:
• Maintenance strategy
• Operating rules and cautions
• General troubleshooting procedure
• Module fault LEDs
• Power system overview
• Power system maintenance
• Device status and fault indicators
• ROM-based diagnostics

3.2 Maintenance Strategy

When a hardware component fails, the Model 810 system uses self-diagnosis through ROM-based diagnostics (RBDs) to isolate the faulty FRU. Once the FRU is isolated, the system automatically:
• Places the faulty FRU off line
• Reports the error in the error log
• Identifies the faulty FRU on the console terminal
• Turns on the faulty FRU fault LED

3.3 Operating Rules and Cautions

Table 3–1, Table 3–2, and Table 3–3 contain operating rules for use during a service call. Table 3–4 provides cautions.

Table 3–1 Before Stopping a Zone

Step    Action
1.      Do not depend on the accuracy of a zone ID label. Issue SHOW ZONE before STOP/ZONE to check the states of both zones.
2.      Issue SHOW SYSTEM to make sure that the FTSS$SERVER process is running before turning off zone power, or pressing the Break key.
3.      Check both zone control panels. The System Fault indicator in the failing zone should be on.
4.      Check console messages and error log for related problem information.
5.      Always issue SHOW DEV D before STOP/ZONE to make sure that shadow set copying is not in progress.
6.      Issue STOP/ZONE.
        Wait for the zone to initialize, and then turn off zone power.
7.      Remove the cross-link assembly.

Table 3–2 After a Zone is Repaired

Step    Action
1.      Replace the cross-link assembly.
2.      Turn on zone power.
3.      Issue SHOW MODE to make sure that the zone is set to: MODE = FAILSAFE.
4.      Issue START/ZONE.
5.      Check the running zone console for the following message: % FTSS-S-ZONEAVAIL.
6.      If the message in step 5 does not appear on the console, consider replacing the cross-link assembly.
7.      Monitor the console for the following environmental information messages:
            "OPERATING ON EXTERNAL POWER"
            "OPERATING BATTERY POWER" (Life approx 1 hr.)
            "NORMAL ZONE TEMPERATURE"
            "YELLOW ZONE TEMPERATURE"
            "BATTERY TEST PASSED IN CABINET....."
            "BATTERY TEST FAILED" (Battery not present)
            FTSS messages....

Table 3–3 Before Leaving the Site

Step    Action
1.      Issue SHOW DEVICE D to make sure that all disks are either shadow set members or in the process of being copied.
2.      Issue SHOW DEVICE E to make sure that all EP/EF drivers are on line.
3.      Use FTSS$FSM to show the failover set status:
            MCR FTSS$FSM Return
            FSM> SHOW ADAPTER Return
4.      Issue SHOW DEV PW to make sure the PW driver is on line.
5.
        Issue SHOW CLUSTER/CONTINUE (ADD CIRCUITS, CONNECTIONS,LPORT,RPORT) to check for correct DSSI configuration:
            $ SHOW CLUSTER/CONTINUE Return
            COMMAND> ADD CIRCUITS, CONNECTIONS,LPORT,RPORT Return

            SYSTEMS  MEMBERS  CIRCUITS  CONNECTIONS
            NODE  SOFTWARE  STATUS  LPORT  RPORT  RP_TYP  CIR_STA  LOC_PROC_NAME  CON_STA
            FTSYS VMS V5.4 PWA0 PWB0 PWF0 PWG0 PWA0 PWB0 PWF0 PWG0 6 7 6 7 7 6 7 6 SWIFT SWIFT SWIFT SWIFT SWIFT SWIFT SWIFT SWIFT OPEN OPEN OPEN OPEN OPEN OPEN OPEN OPEN SCS$DIRECTORY LISTEN
            RFX V200 PWA0 0 RF35 MSCP$TAPE MSCP$DISK OPEN VMS$DISK_CL_DRVR OPEN OPEN USERS
            RFX V200 PWG0 PWA0 PWG0 0 1 1 RF35 RF35 RF35 OPEN OPEN OPEN VMS$DISK_CL_DRVR OPEN OPEN FTTA
            RFX V246 PWA0 2 RF35 OPEN VMS$DISK_CL_DRVR OPEN OPEN SYSB
            RFX V200 PWG0 PWB0 2 0 RF35 RF35 OPEN OPEN VMS$DISK_CL_DRVR OPEN OPEN DISK1
            RFX V200 PWF0 PWB0 PWF0 0 1 1 RF35 RF35 RF35 OPEN OPEN OPEN VMS$DISK_CL_DRVR OPEN OPEN FTTB
            RFX V246 PWB0 2 RF35 OPEN VMS$DISK_CL_DRVR OPEN OPEN PWF0 2 RF35 OPEN SYSA
6.      Make sure that the Break keys on both zones are disabled (zone control panel SECURE LED is on).

Table 3–4 Cautions

1.      Do not press ZONE HALT ENABLE and the Break key to stop a running zone. Use STOP/ZONE. If ZONE HALT ENABLE is used, CONTINUE will not resume zone operation.
2.      Do not press the Break key or cycle power during the power on or RBD tests. This action may corrupt the EEPROM.
3.      Do not perform a Simplex boot (MODE = FAILSTOP) from a disk used by the running zone. This action may corrupt the disk.
4.      Do not turn off zone power or halt a zone if the FTSS$SERVER is not loaded and running.

3.4 General Troubleshooting Procedure

Table 3–5 provides a general procedure for isolating and replacing a faulty FRU. While the repair is being performed, the user application continues to run.

Table 3–5 General Troubleshooting Procedure

Step    Action
1.      Check both zone control panels. The System Fault indicator in the failing zone should be on.
2.
        If the zone is not already stopped, ask the system manager or other responsible system person to perform a SHOW ZONE and STOP/ZONE. After the system manager stops the zone, remove the cross-link assembly. If you are given permission to stop the zone, use the procedure specified in Table 3–1.
3.      Check all fault LEDs and the console messages. To verify that the correct FRU has been isolated, check the error log. If a fault LED is on or a console message indicates that an FRU has been removed from service, replace the FRU. (See Chapter 5, FRU Removal and Replacement Procedures.)
        Note
        Before removing and replacing any module, check the Power Module indicators (Table 3–9) to rule out any potential power problems.
4.      If the replaced FRU corrected the problem, turn on zone power.
5.      If the repaired zone passes the power on diagnostics, turn off zone power and reconnect the cross-link assembly.
6.      Turn on zone power. If the power on diagnostics and the duplex compatibility test pass with the cross-link assembly connected, turn the system over to the system manager. The system manager is responsible for synchronizing the system and returning it to duplex operation.
7.      If the replaced FRU did not correct the problem, open the system cabinet front door. Check all module and disk drawer fault LEDs. If any fault LED is on, replace the associated module or device. (See Chapter 5, FRU Removal and Replacement Procedures.)
8.      If no module or disk fault LED is on, open the system cabinet rear door. Check all module LEDs in the miscellaneous and interface module card cages. If a fault LED is on, replace the associated module. (See Chapter 5, FRU Removal and Replacement Procedures.)
9.      If no module fault LED is on, open the expansion cabinet rear door. Check the disk power fault indicators to eliminate any potential power problems. (See Figure 3–7 and Figure 3–9.)
        If a power fault indicator is on, replace the device. (See Chapter 5, FRU Removal and Replacement Procedures.)
10.     If no power fault indicator is on, open the expansion cabinet front door and check all disk and tape unit fault LEDs and indicators. (See Figure 3–6, Figure 3–8, and Table 3–23.) If any LED or fault indicator is on, replace or repair the failing device. (See Chapter 5, FRU Removal and Replacement Procedures.)
11.     If no fault LEDs or indicators are on, run the error log utility. (See Chapter 4, Error Handling and Analysis.) Use the OpenVMS HELP facility to help you run the utility as shown in the following example. Qualifier examples can be displayed at the ANALYZE Subtopic? prompt as shown at the end of the example.

            $ HELP ANALYZE/ERROR_LOG

            ANALYZE
              /ERROR_LOG
                Invokes the Errorlog Report Formatter (ERF) to selectively report the contents of an error log file. The /ERROR_LOG qualifier is required.
                For a complete description of the OpenVMS Analyze Error Log Utility, including more information about the ANALYZE/ERROR_LOG command and its qualifiers, see the OpenVMS Error Log Utility Reference Manual.
                Format:
                  ANALYZE/ERROR_LOG [file-spec[,...]]
                Additional information available:
                Parameters  Command_Qualifiers
                /BEFORE  /BINARY  /BRIEF  /ENTRY  /EXCLUDE  /FULL  /INCLUDE  /LOG  /OUTPUT  /REGISTER_DUMP  /REJECTED  /SID_REGISTER  /SINCE  /STATISTICS  /SUMMARY
                Examples

            ANALYZE /ERROR_LOG Subtopic? Return
            ANALYZE Subtopic? Examples Return

12.     If the problem cannot be isolated and repaired, escalate the service call to the Customer Service Center for further action.

3.5 Module Fault LEDs

Figure 3–1 shows all module fault LED locations. Table 3–6 identifies each module.

Figure 3–1 Module Fault LEDs
[Figure: rear and front views of the CPU cabinet with callouts 1 to 10.]
Table 3–6 Key to Figure 3–1, Module Fault LEDs

Key    Module
1      CPU module
2      ATM module
3      System Fault (zone control panel)
4      Front end unit
5      DC3 converter
6      DC5 converter
7      Power system controller
8      Console module
9      CAMP module
10     DSSI and Ethernet interface modules

3.6 Power System Overview

The following sections describe the power distribution and power components. Figure 3–2 and Figure 3–3 are basic block diagrams of the system power and power distribution. Table 3–7 provides a functional summary of the power components. Table 3–8 is a DC voltage summary.

Figure 3–2 Power System Block Diagram (1 of 2)
[Block diagram: utility power input (120 Vac, 60 Hz or 240 Vac, 50 Hz) with optional uninterruptible power system; ac power output and distribution through the power distribution boxes; dc power output and distribution from the front end unit (48V_DRCT and 48V_SWD) to DC5, DC3, the CAMP module, the zone A and B LDCs, and the console extender module.]

Figure 3–3 Power System Block Diagram (2 of 2)
[Block diagram: dc power distribution to the CPU and I/O ATM modules, and the inputs and outputs of the power system control (PSC).]

Table 3–7 Power System Functional Summary

FRU / Functional Summary

Local Disk Converter (LDC)
    An LDC is located in each in-zone disk drawer. It provides +12 Vdc with fast transient response and tolerance to short-term loading during disk spinup. Also provides +5 Vdc for power logic, and EMI filtering for the 48 V bus. It provides VTERM, which is a 5 V diode isolated output, and current limited for powering the I/O bus terminators. Fusing is included to prevent a fault on one LDC from loading the 48 V bus and crashing the entire power system.

Front End Unit (FEU) H7884-AA
    Provides the main ac circuit breaker, and generates two +48 V outputs:
    • Unswitched (DRCT) which supports the CAMP and Disk Extender modules, LDCs, DC3, and DC5
    • Switched (SWD) which supports the interface modules, and Console and Disk extender modules
    Also provides programmable fan power output from +11 to +27 Vdc which allows the system to adjust the fan speed based on system temperature. The PSC monitors the system temperature through a thermal emulator in DC3, and sends fan speed commands through the CAMP module to the FEU to adjust the fan power output.

Power System Controller (PSC) H7851-AA
    An I2C bus allows the PSC to write power status information to the system, and provides a power fail signal (POK_H) to the mass storage devices and I/O options.
    Receives commands from the CAMP module to initiate the logic power on sequence by commanding the FEU to turn on the +48 V switched output and enable the DC3 and DC5 outputs. The PSC also drives the power system visual status indicators. It monitors system temperature through the thermal emulator in DC3 and sends fan speed commands through the CAMP module to the FEU for fan power and fan speed control.
    Provides a warning when system temperatures are beyond the normal operating range:
        Green Zone = 5°C (41°F) to 52°C (126°F)
        Yellow Zone = 5°C (41°F) to 62°C (144°F)
        Red Zone = 5°C (41°F) to 75°C (167°F)
    Initiates the power off sequence when system temperature reaches the red zone.
    The PSC monitors the centerplane voltages and initiates a power off on an undervoltage fault; fires the crowbar and initiates a power off on an overvoltage fault. Also initiates a power off if the FEU indicates a 48 V output is out of tolerance, or there is less than 4 milliseconds of reserve power, and on a fan failure. The PSC monitors the LDC status and reports failures to the system.

DC5 H7179-AA
    DC to dc converter which provides +5 Vdc to the CPU, MMB, SIMMs, I/O ATM, interface and console extender modules, as well as +5 Vdc to the I/O ATM internal +5 Vdc to +3.3 Vdc converter for the SOC. Provides EMI filtering on the 48 V bus, and fusing to prevent the power system from crashing due to a short circuit on a converter input. Supports the crowbar SCR on a 5 V overvoltage or undervoltage fault.

DC3 H7178-AA
    DC to dc converter which provides +3 Vdc to the CPU, I/O ATM, interface and console extender modules. Provides +12 Vdc to the console extender module +12 V to -12 V converter for the CPU and I/O ATM modules, and the +21 V converter for the CPU and I/O ATM clock logic.
    Provides EMI filtering on the 48 V bus, and fusing to prevent the power system from crashing due to a short circuit on a converter input. Supports the crowbar SCR on a 3 V or 12 V undervoltage or overvoltage fault. Provides system temperature sensing through the thermal emulator. The emulator provides system temperature information to the PSC for system cooling fan speed control and for power off in the event of an overtemperature condition.

CAMP module
    Control and Miscellaneous Power module. Provides miscellaneous custom power control circuits.

Console extender module
    Provides local and remote console terminal ports, modem port, and zone control panel interface.

Fan current sense board (FCSB)
    Monitors the fan current and rotation, and generates a rotation signal to the CAMP module. The CAMP module in turn generates a tachometer signal to the PSC for fan speed monitoring and control.

Zone A and B power controllers
    Provide ac utility power to the peripheral devices. Power controllers are located in the expansion cabinet.

Power I2C bus
    Provides serial communication between the PSC, console extender, and I/O ATM modules. The PSC uses the bus to write power status information. The I/O ATM uses the bus to control the zone control panel LEDs through the console extender module. It also writes the Ethernet hardware addresses.

Table 3–8 System DC Voltage Summary

Component    Supplies . . .    To . . .
DC5 (H7179-AA)    +5 Vdc    CPU, I/O ATM, console extender, and interface modules
DC3 (H7178-AA)    +3.3 Vdc    CPU, I/O ATM, console extender, and interface modules
DC3 (H7178-AA)    +12 Vdc    CPU, I/O ATM, console extender, and interface modules
FEU (H7884-AA)    +48V_DRCT (direct)    CAMP and disk extender modules, LDCs, DC3, and DC5
FEU (H7884-AA)    +48V_SWD (switched)    Console extender, disk extender, and interface modules
CAMP 48V_DRCT to 12 V converter    VBIAS12    I2C bus power to drive module fault LEDs
CAMP 48V_DRCT to 12 V converter    VBIAS5    CAMP module internal bias voltage
Console extender module +48_SWD to -12 V converter    -12 Vdc    CPU and I/O ATM modules
CAMP +12 V to +21 V converter    +21 Vdc    CPU and I/O ATM module clock logic
FEU (H7884-AA)    11 Vdc to 27 Vdc    Programmable fan control power
Local disk converter (LDC)    +5 Vdc    In-zone disk control panel
LDC    +12 Vdc    LDC control card
LDC    +5 VTERM    Terminal dc power

3.7 Power System Maintenance

Figure 3–4 shows the location of the power module controls and indicators. Table 3–9 describes module functions and repair action. Table 3–10, Table 3–11, Table 3–12, Table 3–13, Table 3–14, Table 3–15, Table 3–16, and Table 3–17 describe the Fault ID Display codes of the PSC.

Figure 3–4 Power Module Controls and Indicators
[Figure: FEU, DC3, DC5, PSC, and CAMP module front panels with callouts 1 to 16.]

Table 3–9 Key to Figure 3–4, Power Module Controls and Indicators

Item  Control/Indicator
1     AC Circuit Breaker
2     FEU Failure
      Function: When on, indicates the dc output voltages for the FEU are below the specified minimum.
      Repair action: Replace the FEU. See Chapter 5.
3     FEU OK
      Function: When on, indicates the dc output voltages for the FEU are above the specified minimum.
4     DC3 Failure
      Function: When on, indicates that one of the output voltages is not within the specified tolerances.
      Repair action: Replace the dc converter. See Chapter 5.
5     DC3 OK
      Function: When on, indicates that the output voltages are within the specified tolerances.
6     AC Present
      Function: When on, indicates ac power is present at the ac input connector, regardless of the position of the circuit breaker.
      Repair action: If the system will not power on, and the ac LED is the only LED on, check the circuit breaker. If ac power is present, check the power source and power cord.
7     DC5 Failure
      Function: When on, indicates that one of the output voltages is not within the specified tolerances.
      Repair action: Replace the dc converter. See Chapter 5.
8     DC5 OK
      Function: When on, indicates that the output voltages are within the specified tolerances.
9     PSC Failure
      Function: When on, indicates a PSC fault.
      Repair action: Replace the PSC. See Chapter 5.
10    PSC OK
      Function: When blinking, indicates the PSC is performing power-on self-tests. When on, indicates the PSC is functioning.
11    Over Temperature Shutdown
      Function: When on, indicates that the PSC shut down the system because of an internal overtemperature condition.
      Repair action: Set the circuit breaker to off and wait 1 minute before turning system power on. Make sure the air intake is unobstructed and that the room temperature does not exceed the maximum requirement.
12    Fan Failure
      Function: When on, indicates a fan failure.
      Repair action: Use the hexadecimal number in the Fault ID Display to isolate the fan. Replace the fan. See Chapter 5.
13    Disk Drive Power Failure
      Function: When on, indicates a disk drive power failure.
      Repair action: Use the hexadecimal number in the Fault ID Display to isolate the storage compartment that houses the disk drive. The faulty unit is probably the local disk converter (LDC). To isolate the LDC, disconnect the drives on the specified bus, and turn on system power. If the indicator stays on with the drives disconnected, replace the failing LDC. See Chapter 5. A cable or drive may also be at fault.
14    Fault ID Display
      Function: Displays the power subsystem fault codes.
15    PSC Reset Button
      Function: When out, indicates a PSC fault condition.
      Repair action: Press in to reset.
16    CAMP Fan Fault
      Function: When on, indicates that a fan fault caused all disk drives and tape drives to shut down.
      Repair action: Replace the fan. See Chapter 5.

Table 3–10 Fan, LDC, Temperature Error Codes

Error  PSC  PSC      LDC    FAN
Code   OK   Failure  Fault  Failure  Error Description
0      On   Off      -¹     -        Normal operation, displayed after PSC passes self-test
1      -    -        -      On       Fan 1 failed
2      -    -        -      On       Fan 2 failed
3      -    -        -      On       Fan 3 failed
4      -    -        -      On       Fan 4 failed
9      -    -        -      On       Access door opened, or two or more fans failed
A      -    -        On     -        LDCA (LDC0) failed
B      -    -        On     -        LDCB (LDC1) failed
C      -    -        On     -        LDCC (LDC2) failed
D      -    -        On     -        LDCD (LDC3) failed
A      -    -        On     -        LDCE (LDC4) failed
-      -    -        On     -        LDCF (LDC5) failed
-      -    -        On     -        LDCG (LDC6) failed
-      -    -        On     -        LDCH (LDC7) failed
7      Off  On       -      -        Temperature sensor failed, low reading
8      Off  On       -      -        Temperature sensor failed, high reading
-      -    -        On     -        Temperature in red zone

¹ Dash entries = LED state NOT changed by error

The PSC Fault ID Display provides a continuous, 1-character rotating display of the 4-character error codes listed in Tables 3–11 to 3–17. Character display time is approximately 1/2 second.
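The 4-character codes in Tables 3–11 to 3–17 are grouped by their leading characters, and the four single-converter tables (Tables 3–14 to 3–17) share the same meaning for the last digit. The following Python sketch shows one way to route a code read from the Fault ID Display to the right table. It is an illustration only, not PSC firmware. The function name classify_fault_code and the returned strings are hypothetical, and the prefix groupings are inferred from the table contents.

```python
# Prefixes used by the single-converter tables (Tables 3-14 to 3-17).
CONVERTER_BY_PREFIX = {"E11": "2 V", "E12": "3 V", "E13": "5 V", "E14": "12 V"}

# Last-digit meanings shared by Tables 3-14 to 3-17.
CONVERTER_FAULT_BY_DIGIT = {
    "0": "Out of regulation low",
    "1": "Out of regulation high",
    "2": "Undervoltage",
    "3": "Overvoltage",
    "4": "Voltage present when disabled",
    "5": "Did not turn off",
}


def classify_fault_code(code: str) -> str:
    """Map a 4-character Fault ID code to a table or converter fault (hypothetical helper)."""
    value = code.strip().upper()
    if len(value) != 4 or not value.startswith("E"):
        raise ValueError(f"expected a 4-character code starting with E, got {code!r}")
    prefix, last = value[:3], value[3]
    if prefix in CONVERTER_BY_PREFIX and last in CONVERTER_FAULT_BY_DIGIT:
        return f"{CONVERTER_BY_PREFIX[prefix]} converter: {CONVERTER_FAULT_BY_DIGIT[last]}"
    if value.startswith("E2"):
        return "FEU error (see Table 3-11)"
    if value.startswith("E10"):
        return "Indeterminate converter fault (see Table 3-13)"
    # E010 is listed in both Table 3-12 and Table 3-13; this sketch treats it as a PSC code.
    if value.startswith("E0") or value == "EFFF":
        return "PSC error (see Table 3-12)"
    return "Code not listed in Tables 3-11 to 3-17"


if __name__ == "__main__":
    for sample in ("E132", "E205", "E005", "E102"):
        print(sample, "->", classify_fault_code(sample))
```

The single-character codes in Table 3–10 are left out on purpose, because the source shows one of them twice and leaves several blank.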
Table 3–11 FEU Error Codes

Error Code  FEU OK  FEU Failure  Error Description
E200        Off     On           48V_SWITCHED OK before enabling
E201        Off     On           Fan converter operating before enabling
E202        Off     On           HVDC is OK, but POWER is not OK (contradictory status)
E203        Off     On           The ac current is not OK (in idle state/loop)
E204        Off     On           48V_DIRECT is not OK and POWER is OK (IRQ18)
E205        Off     On           48V_SWITCHED is not OK and switched bus requested (IRQ19)
E206        Off     On           HVDC is OK, but POWER is not OK (IRQ20)
E210        Off     On           SWITCHED BUS did not turn on at startup
E211        Off     On           SWITCHED BUS did not turn off at shutdown
E212        Off     On           The ac current is high for the second time (in startup or run loop)
E220        Off     On           Fan converter voltage is low

Table 3–12 PSC Error Codes

Error Code  PSC OK  PSC Failure  Error Description
EFFF        Off     On           Invalid error number (in display_error procedure)
E000        Off     On           Unused error condition
E001        Off     On           PSC bias supply not OK
E002        Off     On           80C196 internal register test failed
E003        Off     On           80C196 operational test failed
E004        Off     On           80C196 on-chip RAM test failed
E005        Off     On           ROM checksum test failed
E006        Off     On           External RAM test failed
E007        Off     On           Port FF20 (PSC/FEU LEDs) not initially zero
E008        Off     On           Port FF22 (Module enable) not initially zero
E009        Off     On           Port FF23 (DC-DC LEDs) not initially zero
E010        Off     On           Port FF24 (LDC enable) not initially zero
E011        Off     On           External interrupt test failed (8259 did not clear test bit)
E012        Off     On           Masked interrupt occurred (A/D conversion complete)
E013        Off     On           Masked interrupt occurred (HSI data available)
E014        Off     On           Masked interrupt occurred (HSO)
E015        Off     On           Masked interrupt occurred (HSI pin 0)
E016        Off     On           Masked interrupt occurred (Serial I/O)
E017        Off     On           Software trap interrupt occurred (F7 instruction executed)
E018        Off     On           Unimplemented opcode interrupt occurred (invalid instruction)
E019        Off     On           Masked interrupt occurred (HSI FIFO 4th entry)
E020        Off     On           Masked interrupt occurred (Timer 2 capture)
E021        Off     On           Masked interrupt occurred (Timer 2 overflow)
E022        Off     On           PSC bias supply failed (NMI occurred)
E023        Off     On           Invalid interrupt number (>31) received from 8259
E024        Off     On           IRQ4 occurred (slave 0 to master 8259)
E025        Off     On           IRQ5 occurred (slave 1 to master 8259)
E026        Off     On           IRQ6 occurred (slave 2 to master 8259)
E027        Off     On           Masked IRQ13 occurred (FEU DIRECT 48 became OK)
E028        Off     On           Masked IRQ14 occurred (FEU SWITCHED 48 became OK)
E029        Off     On           Masked IRQ16 occurred (FEU POWER became OK)
E030        Off     On           External interrupt test, not enabled (IRQ22)
E031        Off     On           External interrupt test, bit not set (IRQ22)
E032        Off     On           Masked IRQ25 occurred (OCP DC ON, turned on)
E033        Off     On           Masked IRQ26 occurred (PSC DC ON, turned on)
E034        Off     On           Invalid converter number (start of enable_converter procedure)
E035        Off     On           Invalid converter number (end of enable_converter procedure)
E036        Off     On           Invalid converter number (start of disable_converter procedure)
E037        Off     On           Invalid converter number (end of disable_converter procedure)
E047        Off     On           Unused error condition
E078        Off     On           Unused error condition
E079        Off     On           Unused error condition
E086        Off     On           Unused error condition
E087        Off     On           Unused error condition
E088        Off     On           Unused error condition
E091        Off     On           Unused error condition
E092        Off     On           Unused error condition
E093        Off     On           Unused error condition
E094        Off     On           Unused error condition
E095        Off     On           Unused error condition
E096        Off     On           Unused error condition
E097        Off     On           Unused error condition
E098        Off     On           Unused error condition
E099        Off     On           Unused error condition

Table 3–13 12 V DC to DC Converter Error Codes

Error Code  12V OK  12V Fault  5V OK  5V Fault  3V OK  3V Fault  2V OK  2V Fault  Error Description
E010        - (1)   -          Off    On        Off    On        -      -         Delta 0 V
E101        -       -          Off    -         Off    -         Off    -         Indeterminate converter overvoltage (IRQ7)
E102        Off     -          Off    -         Off    -         Off    -         Indeterminate converter overvoltage/undervoltage (IRQ15)
E103        Off     On         Off    On        Off    On        Off    On        Unknown converter overvoltage/undervoltage condition

(1) Dash entries = LED state NOT changed by error

Table 3–14 2 V DC to DC Converter Error Codes

Error Code  2V OK  2V Fault  Error Description
E110        Off    On        Out of regulation low
E111        Off    On        Out of regulation high
E112        Off    On        Undervoltage
E113        Off    On        Overvoltage
E114        Off    On        Voltage present when disabled
E115        Off    On        Did not turn off

Note
The 2 V converter output is not used on the Model 810.

Table 3–15 3 V DC to DC Converter Error Codes

Error Code  3V OK  3V Fault  Error Description
E120        Off    On        Out of regulation low
E121        Off    On        Out of regulation high
E122        Off    On        Undervoltage
E123        Off    On        Overvoltage
E124        Off    On        Voltage present when disabled
E125        Off    On        Did not turn off

Table 3–16 5 V DC to DC Converter Error Codes

Error Code  5V OK  5V Fault  Error Description
E130        Off    On        Out of regulation low
E131        Off    On        Out of regulation high
E132        Off    On        Undervoltage
E133        Off    On        Overvoltage
E134        Off    On        Voltage present when disabled
E135        Off    On        Did not turn off

Table 3–17 12 V DC to DC Converter Error Codes

Error Code  12V OK  12V Fault  Error Description
E140        Off     On         Out of regulation low
E141        Off     On         Out of regulation high
E142        Off     On         Undervoltage
E143        Off     On         Overvoltage
E144        Off     On         Voltage present when disabled
E145        Off     On         Did not turn off

3.8 Device Status and Fault Indicators

The following sections describe the device status and fault indicators.
3.8.1 RF35 Disk Drawer

Figure 3–5 shows the RF35 disk drawer controls and indicators. Table 3–18 describes their functions.

Figure 3–5 RF35 Disk Drawer Controls and Indicators
[Figure: drive positions D0 through D5 with the FAULT, WRITE PROT, ON LINE, PWR ON/OFF, and SET UP (SU) controls. MR-0436-92RAGS]

Table 3–18 RF35 Disk Drawer Controls and Indicators

Control/Indicator  Color  State     Operating Condition
Fault              Red    On        Drive is faulty.
                          Off       Drive is functioning correctly.
Write Protect      Amber  Out, off  System can read from the disk and write to the disk.
                          In, on    System cannot write to the disk, but can read from the disk.
On Line            Green  Out, off  Drive is disabled.
                          In, on    Drive is enabled.
Power On/Off       Green  In, on    Power is on.
                          Out, off  Power is off.
Set Up Switch             In        Prevents the drive from joining the DSSI cluster. Also allows you to set the DSSI parameters for a new drive or a drive you replace in the system after repair. (To set the DSSI parameters, press the Set Up switch and the Power On/Off switch at the same time.)
                          Out       Has no effect on the drive.

3.8.2 SF35 Storage Array

Figure 3–6 shows the operator control panel. Table 3–19 describes its controls and indicators. Figure 3–7 shows the rear of the storage array. Table 3–20 describes the functions of the controls and indicator located at the rear of the storage array.

Figure 3–6 SF35 Operator Control Panel
[Figure: operator control panel (OCP) with Ready, Write Protect, and Fault switches and fault indicators for front and rear positions A through F. MR-0017-93DG]

Table 3–19 SF35 Operator Control Panel Description

Control/Indicator  Function
Ready  Push-to-set switch with green indicator. Brings the integrated storage element (ISE) on-line in about 10 seconds. The indicator remains on while the ISE is on-line.
Write Protect  Push-to-set switch with amber indicator. Write protects the data on the ISE.
The data cannot be overwritten, nor can new data be written to the ISE.
Fault  Recessed switch with multi-color indicator. Controls the MSCP. This switch is equivalent to the SU switch. The colors indicate the following conditions:
  Green (in) = MSCP is disabled.
  Green (out) = MSCP is enabled.
  Amber = Fault is detected while the MSCP is disabled.
  Red = ISE fault.
  Off = Normal MSCP operation.
Drive DC Power Switches  One switch/indicator for each ISE. Apply power to the ISEs. Each ISE spins up and runs a self-test. The indicator shows that nominal power is being applied to the ISE. (To bring the ISE on-line, press the Ready switch next.)

Figure 3–7 SF35 Rear Panel
[Figure: rear panel with the fault indicator, DSSI connectors A through F, AC power switch, power supply fault indicator (behind panel), and 230/115 line voltage selector switch (behind panel). MR-0421-92DG]

Table 3–20 SF35 Rear Panel Controls and Indicator

Control/Indicator  Function
AC Power Switch  Applies power to the ac power supply.
Line Voltage Selector Switch  Selects 120 Vac (60 Hz) or 240 Vac (50 Hz) line voltage.
Power Supply Fault Indicator  When on, indicates an overtemperature condition.

3.8.3 SF73 Storage Array

Figure 3–8 shows the SF73 storage array status and fault indicators. Table 3–21 describes their functions. Figure 3–9 shows the controls and indicator located at the rear of the storage array.

Figure 3–8 Location of SF73 Storage Array LEDs and Switchpacks
[Figure: front panel with Ready, Write Protect, and Fault controls and DSSI ID switchpacks. MR-0423-92DG]

Table 3–21 SF73 Front Panel Controls and Indicators

Control/Indicator  Function
Ready  Push-to-set switch with green indicator. Brings the integrated storage element (ISE) on-line in about 10 seconds. The indicator remains on while the ISE is on-line.
Write Protect  Push-to-set switch with amber indicator. Write protects the data on the ISE.
The data cannot be overwritten, nor can new data be written to the ISE.
Fault  Switch with red indicator. When the indicator is on, the ISE failed. Press the switch to display the fault codes and clear the ISE fault. The indicator is off during normal operation.
TERM PWR LED  When on, indicates that the correct termination power is being supplied.
SPLIT LEDs (2)  When on, indicates that the storage array is operating in split-bus mode.
Switchpacks (4)  One for each of the drives in the storage array. Each switchpack is used to set the DSSI ID number. The icon on the front of the door indicates the location of the drive. The three rightmost switches of each switchpack are the DSSI ID switches. The leftmost switch is the SU switch.
Drive DC Power Switches  One switch/indicator for each ISE. Each switch applies power to an ISE. Each ISE spins up and runs a self-test. The indicator shows that nominal power is being applied to the ISE. (To bring the ISE on-line, press the Ready switch next.)

Figure 3–9 Rear of the SF73 Storage Array
[Figure: rear panel with the DSSI connectors, AC power switch, power supply fault indicator (behind panel), and 230/115 line voltage selector switch (behind panel). MR-0422-92DG]

3.8.4 TF85C Tape Drive

Table 3–22 may help you define and correct TF85C tape drive problems.

Table 3–22 TF85C Tape Drive Problems

Problem  Possible Solution
Correctable failure during operation  If the TF85C drive fails during operation, reset the drive, then rewind, unload, and remove the cartridge. If all four indicators are blinking, press the Unload button. If the failure is correctable, the tape begins to rewind and the yellow indicator blinks. When the tape is unloaded, the green indicator turns on and the beeper sounds. Then pull the Insert/Remove handle to open the drive and remove the cartridge.
Noncorrectable failure during tape motion  If the tape does not rewind when the Unload button is pushed, and all indicators continue to blink, the failure is not correctable. The drive must be serviced or replaced.
Failure during cartridge insertion  A cartridge failure occurs if a cartridge is damaged or if internal portions of the drive that handle the cartridge are not working. Suspect a cartridge failure if the green indicator blinks, but the tape does not move (the yellow indicator does not blink). Remove the cartridge and try another one, or inspect the tape leader and drive takeup leader.

Figure 3–10 shows the front of the TF85C tape drive. Table 3–23 describes the indicators shown in Figure 3–10.

Figure 3–10 TF85C Cartridge Tape Drive
[Figure: front of the TF85C tape drive, showing the indicators, the Unload button, and the Insert/Remove handle. MR-0471-92DG]

Table 3–23 TF85C Cartridge Tape Drive Indicators

Indicator            Color   State     Operating Condition
Write Protected      Orange  On        Tape is write-protected.
                             Off       Tape is write-enabled.
Tape in Use          Yellow  Blinking  Tape is moving.
                             On        Tape is loaded; ready for use.
Use Cleaning Tape    Orange  On        Drive head needs cleaning or tape is bad.
                                       If it remains on after you unload the cleaning tape, then the cleaning was not completed because the tape ended.
                                       If, after cleaning, it turns on again when the data cartridge is reloaded, then a data cartridge problem occurred. Try another cartridge.
Operate Handle       Green   On        Okay to operate the Insert/Remove handle.
                             Off       Do not operate the Insert/Remove handle.
All four indicators          On        Power-on self-test is in progress.
                             Blinking  A fault is occurring. Press the Unload button to unload the cartridge. If the fault is cleared, the yellow indicator blinks while the tape rewinds. When the green indicator turns on, you can move the Insert/Remove handle to remove the cartridge. If the fault is not cleared, all four indicators continue to blink. Do not attempt to remove the cartridge. Refer to the TF85C service guide.

3.8.5 TF857 Tape Loader

This section describes the power-on process and the operator control panel (OCP) indicators.

3.8.5.1 Power-On Process

When the TF857 tape loader powers on, all of the indicators on the operator control panel (OCP) turn on within 15 seconds while the power-on self-test (POST) initializes the subsystem. When POST completes successfully, all OCP indicators, including the Magazine Fault and Loader Fault indicators, turn off, except for Power On. Then the elevator scans the magazine to find slots that contain cartridges.

3.8.5.2 Operator Control Panel Controls and Indicators

Figure 3–11 shows the OCP controls and indicators. Table 3–24 describes their functions.

Figure 3–11 TF857 Operator Control Panel
[Figure: operator control panel with the Eject, Load/Unload, and Slot Select buttons; the Mode Select key (OCP Disabled, Automatic Mode, Manual Mode, Service Mode); the Power On, Write Protected, Tape In Use, Use Cleaning Tape, Magazine Fault, and Loader Fault indicators; the current slot indicators; and the DSSI node ID label. MR-0472-92]

Table 3–24 TF857 OCP Controls and Indicators

Control/Indicator  Color  Function
Eject button  -  Opens the receiver, allowing access to the magazine for removal and insertion of cartridges. Also can be used to unload the tape from the drive to the magazine.
Eject indicator  Green  Indicates that pressing the Eject button opens the receiver. If a cartridge is in the drive, the cartridge unloads to the magazine and the receiver opens.
If no cartridge is in the drive, the receiver opens.
Load/Unload button  -  Loads the currently selected cartridge into the drive, or unloads the cartridge from the drive to the magazine. If the Loader Fault or Magazine Fault indicators are on, can also be used to reset the subsystem.
Load/Unload indicator  Green  Indicates you can press the Load/Unload button.
Slot Select button  -  When pressed, increments the current slot indicator to the next slot.
Slot Select indicator  Green  Indicates the Slot Select button can be used. Pressing the button increments the current slot indicator to the next slot.
Power On indicator  Green  When on, indicates the TF857-AA tape loader power is on (ac and dc voltages are within tolerance). When off, indicates the tape loader power is off.
Write Protected indicator  Orange  When on, indicates the cartridge in the drive is write protected. When off, indicates the cartridge in the drive is write enabled.
Tape in Use indicator  Yellow  Indicates tape drive activity as follows:
  • Slow blinking indicates tape is rewinding; rapid blinking indicates tape is reading or writing.
  • When on steadily, indicates a cartridge is in the drive and the tape is not moving.
  • When off, indicates no cartridge is in the drive.
Magazine Fault indicator  Red  Indicates a magazine failure.
Use Cleaning Tape indicator  Orange  Indicates the read/write head needs cleaning.
Loader Fault indicator  Red  Indicates a TF857-AA tape loader transfer assembly error or drive error.
Current slot indicators 0–6  Green  Identify the current slot (see Slot Select button). Each current slot indicator blinks when its corresponding cartridge moves to or from the drive. Also used with the Magazine Fault or Loader Fault indicator to indicate the type of fault.
3.9 ROM-Based Diagnostics

The following sections describe how to use the TEST and Z commands and how to run the ROM-based diagnostics (RBDs).

3.9.1 TEST

TEST enables the user to test:
• The system
• A zone
• The CPU and memory

Use TEST only when the cross-link state is set to off. The TEST syntax is:

    TEST [qualifier(s)]

Tables 3–25 and 3–26 describe the TEST selection and control qualifiers.

Table 3–25 Qualifiers for TEST Selection

Qualifier       Description
/GROUP:n (1)    Specifies a decimal number from 0 to 5 that identifies the group of tests to be run.
/TEST:n (1)     Specifies a decimal number from 0 to 32 that identifies the tests to be run.
/SUBTEST:n (1)  Specifies a decimal number from 0 to 32 that identifies the subtests to be run.
/VERBOSE        Enables a display of all individual tests during execution.
/NOTRACE        Disables test traces.

(1) n can be a:
• Single value
• Range separated by a colon (1:5)
• List separated by commas (1,5,9)
• Combination of range and list (1:6,8,10,11:29)

Table 3–26 Qualifiers for TEST Control

Qualifier     Description
/PASSCOUNT:n  n is a decimal number from 0 to MAXINT. When n is 0, the passcount is infinite.
/NOTRACE      Disables the test traces.
/COE          Continues on error.
/NOCONFIRM    Disables the test confirmation on destructive tests.
/EXTENDED     Enables extended error reports.
/NOSTATUS     Disables status messages and reports.
/LIST         Lists the available tests, but does not run them.

When you do not supply the qualifier(s), TEST runs all the nonextended tests (except those that require confirmation).

3.9.2 Z

Z connects to the firmware of another module in the system. It is also used to initiate I/O ROM-based diagnostics. The Z syntax is:

    Z[/PATH=path-number]

Table 3–27 describes the qualifier.

Table 3–27 Qualifier for Z

Qualifier          Function
/PATH=path-number  Specifies the zone and slot number of a module. The path-number format is zss, where:
                   z is the zone ID (A or B).
                   ss is the slot number of the module.
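The value syntax in the footnote to Table 3–25 and the zss path-number format in Table 3–27 can be illustrated with a short Python sketch. It is not console firmware code. The function names are hypothetical, and two points are assumptions the manual does not state: that a range such as 1:5 includes both end values, and that ss is exactly two decimal digits.

```python
def parse_selection(spec: str) -> list[int]:
    """Expand a value such as '1:6,8,10,11:29' into a sorted list of numbers.

    Assumption: a range 'low:high' includes both end values.
    """
    values = set()
    for part in spec.split(","):
        part = part.strip()
        if ":" in part:
            low_text, high_text = part.split(":")
            low, high = int(low_text), int(high_text)
            if low > high:
                raise ValueError(f"range out of order: {part}")
            values.update(range(low, high + 1))
        else:
            values.add(int(part))
    return sorted(values)


def parse_path(path: str) -> tuple[str, int]:
    """Split a path-number of the form zss into (zone, slot).

    Assumption: ss is exactly two decimal digits.
    """
    text = path.strip().upper()
    if len(text) != 3 or text[0] not in ("A", "B") or not text[1:].isdigit():
        raise ValueError(f"not a zss path-number: {path!r}")
    return text[0], int(text[1:])


if __name__ == "__main__":
    print(parse_selection("1,5,9"))
    print(parse_path("A01"))
```

Keeping the two parsers separate mirrors the manual, which defines the value syntax for TEST and the path-number format for Z independently.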
When you do not supply a path, Z tries to connect to the module in slot 1 of the zone that is running.

Note
Z performs a hard reset on the ATMs, but you need to issue a programmed reset to load and start the functional firmware. After Z, you must issue a BOOT from the same zone, or a START/ZONE from the other zone (if that zone is running the operating system).

3.9.3 CPU ROM-Based Diagnostics

Table 3–28 provides a brief description of the CPU ROM-based diagnostics (RBDs).

Table 3–28 CPU ROM-Based Diagnostic Descriptions

Group  Test  Subtest  Description
G: 0                  Self-Test
G: 0   T: 0           NVRAM Test
G: 0   T: 0  S: 0     NVRAM CPU EEPROM Data Integrity Test
G: 0   T: 0  S: 1     NVRAM CPU EEPROM Checksum Test
G: 0   T: 0  S: 2     NVRAM I2C Bus Register Access Test
G: 0   T: 0  S: 3     NVRAM Module-ID PROM Access and Data Integrity R/W Test
G: 0   T: 0  S: 4     NVRAM Module-ID PROM Checksum Test
G: 0   T: 0  S: 5     NVRAM System Ethernet Access Test
G: 0   T: 0  S: 6     NVRAM System Ethernet PROM Checksum Test
G: 0   T: 1           P-CACHE Test
G: 0   T: 1  S: 0     P-CACHE Register Bit Test
G: 0   T: 1  S: 1     P-CACHE Tag Integrity Test
G: 0   T: 1  S: 2     P-CACHE Data Integrity Test
G: 0   T: 1  S: 3     P-CACHE Data/Tag Parity Test
G: 0   T: 2           VIC Test
G: 0   T: 2  S: 0     VIC Register Bit Test
G: 0   T: 2  S: 1     VIC Cache Tag Test
G: 0   T: 2  S: 2     VIC Cache Data Test
G: 0   T: 2  S: 3     VIC Cache Data Parity Error Test
G: 0   T: 2  S: 4     VIC Cache Tag Parity Error Test
G: 0   T: 2  S: 5     VIC Branch Prediction Test
G: 0   T: 3           JXD Test
G: 0   T: 4           Memory Test
G: 0   T: 4  S: 0     MEMORY Data Bus & Catastrophic Failure Test
G: 0   T: 4  S: 1     MEMORY Address Uniqueness Test
G: 0   T: 4  S: 2     MEMORY Bank Addressing Test
G: 0   T: 4  S: 3     MEMORY Chip Addressing Test
G: 0   T: 4  S: 4     MEMORY Chip Open Address Lines Test
G: 0   T: 4  S: 5     MEMORY Single-Bit ECC Error Logic Test
G: 0   T: 4  S: 6     MEMORY Double-Bit ECC Error Logic Test
G: 0   T: 4  S: 7     MEMORY ECC Error Logic Test
G: 0   T: 4  S: 8     MEMORY ECC Test
G: 0   T: 4  S: 9     MEMORY ECC Lines Test
G: 0   T: 5           BITMAP Test
G: 0   T: 5  S: 0     BITMAP March Test
G: 0   T: 6           B-CACHE Test
G: 0   T: 6  S: 0     B-CACHE Data RAM Test
G: 0   T: 6  S: 1     B-CACHE Tag RAM Test
G: 0   T: 6  S: 2     B-CACHE ECC RAM Test
G: 0   T: 6  S: 3     B-CACHE Write Test
G: 0   T: 6  S: 4     B-CACHE Data Integrity Test
G: 0   T: 6  S: 5     B-CACHE Data Test (error enabled)
G: 0   T: 7           DMA Test
G: 0   T: 7  S: 0     DMA Powerup State Test
G: 0   T: 7  S: 1     DMA Register Access Test
G: 0   T: 7  S: 2     DMA Address Decode Test
G: 0   T: 7  S: 3     DMA Interlock Access Test
G: 0   T: 7  S: 4     DMA Queue Processing Test
G: 0   T: 7  S: 5     DMA Sub-Transfer Length Test
G: 0   T: 7  S: 6     DMA I/O Byte Alignment Test
G: 0   T: 7  S: 7     DMA Memory Byte Alignment Test
G: 0   T: 7  S: 8     DMA Maximum Transfer Length Test
G: 0   T: 8           XLINK Test
G: 0   T: 8  S: 0     XLINK Serial Cross-link Internal Loopback Test - Part 1
G: 0   T: 8  S: 1     XLINK Serial Cross-link Internal Loopback Request Test
G: 0   T: 8  S: 2     XLINK Serial Cross-link Internal Loopback Reply Test
G: 0   T: 8  S: 3     XLINK Serial Cross-link Internal Loopback Query Test
G: 0   T: 8  S: 4     XLINK Serial Cross-link External Loopback Test
G: 0   T: 8  S: 5     XLINK Serial Cross-link Communication Register Test
G: 0   T: 9           RESET Test
G: 0   T: 9           RESET CPU Module Hard Reset Test
G: 1                  Zone Test
G: 1   T: 0           ACCESS Test
G: 1   T: 0  S: 0     ACCESS Parallel Xlink Loopback Test
G: 1   T: 0  S: 1     ACCESS I/O Module PATH ACCESS Test
G: 1   T: 0  S: 2     ACCESS I/O Module SSC Console Uart Test
G: 1   T: 1           DMA Test
G: 1   T: 2           INTERRUPT Test
G: 1   T: 3           ERROR Test
G: 1   T: 3  S: 0     ERROR I/O Crosscheck Test
G: 1   T: 4           RESET Test
G: 1   T: 4  S: 0     RESET CPU Module Zone Reset Test
G: 1   T: 4  S: 1     RESET I/O Module Reset Test
G: 2                  System Test
G: 2   T: 0           Cross-link Mode Test
G: 2   T: 0  S: 0     Zone A (MASTER -> RESYNC MASTER -> DUPLEX) Mode Test
G: 2   T: 0  S: 1     Zone B (MASTER -> RESYNC MASTER -> DUPLEX) Mode Test
G: 2   T: 1           Zone A MASTER - Zone B SLAVE Mode Test
G: 2   T: 1  S: 0     ACCESS I/O Module Path Access Test
G: 2   T: 1  S: 1     ACCESS I/O Module SSC Console Uart Test
G: 2   T: 1  S: 2     ERROR I/O Crosscheck Test
G: 2   T: 2           Zone A RESYNC_MASTER - Zone B RESYNC_SLAVE Mode Test
G: 2   T: 2  S: 0     ACCESS I/O Module Path Access Test
G: 2   T: 2  S: 1     ACCESS I/O Module SSC Console Uart Test
G: 2   T: 2  S: 2     ERROR I/O Crosscheck Test
G: 2   T: 3           Zone B MASTER - Zone A SLAVE Mode Test
G: 2   T: 3  S: 0     ACCESS I/O Module Path Access Test
G: 2   T: 3  S: 1     ACCESS I/O Module SSC Console Uart Test
G: 2   T: 3  S: 2     ERROR I/O Crosscheck Test
G: 2   T: 4           Zone B RESYNC_MASTER - Zone A RESYNC_SLAVE Mode Test
G: 2   T: 4  S: 0     ACCESS I/O Module Path Access Test
G: 2   T: 4  S: 1     ACCESS I/O Module SSC Console Uart Test
G: 2   T: 4  S: 2     ERROR I/O Crosscheck Test
G: 2   T: 5           DUPLEX Mode Test
G: 2   T: 5  S: 0     ACCESS I/O Module Path Access Test
G: 2   T: 5  S: 1     ACCESS I/O Module SSC Console Uart Test
G: 2   T: 5  S: 2     ERROR I/O Crosscheck Test

The following example shows a CPU RBD error frame.

    >>> group: 0 test: 1 subtest:2
    ======================================================================
    ----------------------- DIAGNOSTIC TEST ERROR ----------------------
    GROUP: 00 Test: 01 Sub: 02 Error: 01 Pass: 00000001
    Addr: 00000000 Exp: 00000000 Rec: 000000ff Xor: 000000ff
    Data Miscompare
    =======================================================================

The example shows that the P-CACHE Data Integrity Test (group 0, test 1, subtest 2) was executed and failed. The Xor field (000000ff) shows which bits of the received data (Rec) differ from the expected data (Exp), that is, a data miscompare.

3.9.4 I/O ROM-Based Diagnostics

Table 3–29 provides a brief description of the I/O ROM-based diagnostics (RBDs).

Table 3–29 I/O ROM-Based Diagnostic Descriptions

Group  Test  Subtest  Description
G: 0                  I/O Self-Test
G: 0   T: 0           I/O SSC Test
G: 0   T: 0  S: 0     SSC Toy Clock Test
G: 0   T: 0  S: 1     SSC Storage Uart Test
G: 0   T: 0  S: 2     SSC Bus Timeout Test
G: 0   T: 0  S: 3     SSC Interval Timer Test
G: 0   T: 1           I/O VIC Test
G: 0   T: 1  S: 0     VIC Register Test
G: 0   T: 1  S: 1     VIC Interrupt Test
G: 0   T: 2           I/O Firewall Test
G: 0   T: 2  S: 0     Firewall Register Test
G: 0   T: 2  S: 1     Firewall Rail Master Test
G: 0   T: 2  S: 2     Firewall Cross Check Error Test
G: 0   T: 3           I/O Cache Test
G: 0   T: 3  S: 0     CACHE Control Register Bit Test
G: 0   T: 3  S: 1     CACHE Minimum Bank Test
G: 0   T: 3  S: 2     CACHE Data Integrity Test
G: 0   T: 3  S: 3     CACHE Tag Integrity Test
G: 0   T: 3  S: 4     CACHE Tag Parity Detection Test
G: 0   T: 3  S: 5     CACHE Tag Parity Generation Test
G: 0   T: 3  S: 6     CACHE Data Parity Checking Test
G: 0   T: 4           I/O NVRAM Test
G: 0   T: 4  S: 0     Module Data EEPROM Integrity Test
G: 0   T: 4  S: 1     Module I2C EEPROM Integrity Test
G: 0   T: 5           I/O RAM Test
G: 0   T: 5  S: 0     SOC RAM Test
G: 1                  I/O Eself Pcard Test
G: 1   T: 0           I/O SLIM Test
G: 1   T: 0  S: 0     SLIM Register Test
G: 1   T: 0  S: 1     SLIM RAM Test
G: 1   T: 1           I/O SWIFT Test
G: 1   T: 1  S: 0     SWIFT Reset Test
G: 1   T: 1  S: 1     SWIFT Register Test
G: 1   T: 1  S: 2     SWIFT Interrupt Test
G: 1   T: 1  S: 3     SWIFT Internal Loopback Test
G: 1   T: 2           I/O LANCE Test
G: 1   T: 2  S: 0     LANCE Register Test
G: 1   T: 2  S: 1     LANCE Internal Loopback Test
G: 1   T: 2  S: 2     LANCE Interrupt Test

The following example shows an I/O RBD error frame.

    >>> z
    Connecting to target...Press Ctrl/P to end connection
    I IO1> group: 0 test: 4 subtest:1
    ======================================================================
    ----------------------- DIAGNOSTIC TEST ERROR ----------------------
    GROUP: 00 Test: 04 Sub: 01 Error: 03 Pass: 00000001
    Addr: 00000000 Exp: 00000000 Rec: 000000ff Xor: 000000ff
    Data Miscompare
    =======================================================================

The example shows that the Module I2C EEPROM Integrity Test was executed and failed. The XOR data specifies a data miscompare.
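The relationship between the Exp, Rec, and Xor fields in an error frame can be checked offline. The Python sketch below is illustrative only: it assumes the Xor field is the bitwise exclusive OR of Exp and Rec (which is consistent with both example frames), that these fields are hexadecimal, and that bits are numbered from 0 at the least significant end. The function names are hypothetical.

```python
import re

# Field text copied from the I/O RBD example frame.
FRAME = (
    "GROUP: 00 Test: 04 Sub: 01 Error: 03 Pass: 00000001 "
    "Addr: 00000000 Exp: 00000000 Rec: 000000ff Xor: 000000ff"
)


def parse_frame(text: str) -> dict[str, int]:
    """Return the Exp, Rec, and Xor fields of an error frame as integers."""
    fields = {name.lower(): value
              for name, value in re.findall(r"(\w+):\s*([0-9A-Fa-f]+)", text)}
    # Assumption: these three fields are printed in hexadecimal.
    return {key: int(fields[key], 16) for key in ("exp", "rec", "xor")}


def differing_bits(expected: int, received: int) -> list[int]:
    """List the bit positions (0 = least significant) that miscompare."""
    difference = expected ^ received
    return [bit for bit in range(32) if (difference >> bit) & 1]


if __name__ == "__main__":
    frame = parse_frame(FRAME)
    print(frame["exp"] ^ frame["rec"] == frame["xor"])
    print(differing_bits(frame["exp"], frame["rec"]))
```

For the example frame, the low-order byte miscompares, so the eight lowest bit positions are reported.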
4 Error Handling and Analysis

4.1 In This Chapter

This chapter includes:
• Error handling services overview
• Field replaceable units
• OpenVMS error log
• Module NVRAM status and LED indicators
• FTSS error reporting interface
• Firmware interfaces
• Firmware and OpenVMS interface data structures
• Error log analysis

4.2 Error Handling Services Overview

The primary function of the error handling services (EHS) is to handle and recover from high-level system interrupts generated by the hardware when an error is detected.

When an error occurs, the EHS is invoked by hardware as an interrupt service routine. The interrupt service routine isolates the failure by examining various system registers. The isolation process occurs at a high system priority level; the OpenVMS operating system is paused until isolation is complete.

After isolating the faulty FRU, the EHS determines the appropriate actions to take. For solid errors, system deconfiguration is performed and the FRU is removed from service. This usually involves performing module resets to invoke diagnostics.

EHS error notification is described in Table 4–1.

Table 4–1 EHS Error Notification

Step  Action
1.    Entries are made into the system error log.
2.    Status information is written to the module ID NVRAM and the DCB, where applicable.
3.    The LED indicator associated with a failed module is set.
4.    A call is issued to the error reporting interface (ERI), which reports the event to the FTSS$SERVER. The server process generates OPCOM messages and reports the events to a mailbox.

4.2.1 Basic Error Isolation and Handling

Figure 4–1 and Table 4–2 describe the error isolation and handling procedure.
Figure 4–1 Hardware Error Handling Flowchart
[Flowchart: Hardware Error; 1 IPL29 Interrupt; 2 Fault Detection; 3 FRU Isolation; 4 Solid Failure (YES/NO); 5 Deconfigure FRU; 6 Fork to IPL8; 7 Transient Error; 8 Threshold Error; 9 Over Threshold (YES/NO); 10 Deconfigure FRU; 11 Make Error Log Entry; 12 Notify FTSS$SERVER through ERI; 13 Done. MR-0495-92RAGS]

Table 4–2 Error Handling Flowchart Definitions

Event    Definition
1        Hardware reports error through a high-level interrupt and control is transferred to the EHS.
2        The EHS examines system registers to determine the type of failure which has occurred.
3        The EHS identifies the FRU that is the source of the error. FRU isolation is generally accomplished at the module level. In some cases, FRU isolation is to a set of modules. In all cases, the EHS isolates the error to an FRU or set of FRUs in one zone.
4        The EHS determines if the error is solid.
5        If the error is solid, the FRU is deconfigured from the system.
6        The EHS has successfully recovered from the error (either solid or transient) and execution is continued at IPL8.
7 and 8  If the error is transient, it is compared to its error rate threshold.
9        If the error is below the error rate threshold, an entry is made in the error log.
10       If the error is above the error rate threshold, the FRU is deconfigured from the system.
11       An entry is made in the error log.
12       The FTSS$SERVER is notified of the error through the ERI.
13       Error handling is complete.

4.2.2 EHS Structure

The EHS is packaged as part of the Fault Tolerant System Services (FTSS) execlet (loadable image file). The FTSS execlet is loaded and initialized when FTSS is started after the OpenVMS operating system is booted.

System errors are reported to software through an IPL 29 interrupt. When an interrupt occurs, the hardware fetches the dispatch vector from the System Control Block (SCB) and dispatches to the EHS interrupt service routine.
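Once the interrupt service routine has isolated an FRU, the decisions in events 4 through 12 of Table 4–2 can be sketched as follows. This Python model is a simplification for illustration, not the FTSS execlet code. The Fru class, its fields, and the threshold values are hypothetical, and a simple error count stands in for the error rate threshold that the EHS applies.

```python
from dataclasses import dataclass


@dataclass
class Fru:
    """Hypothetical record for one field replaceable unit."""
    name: str
    threshold: int             # assumed count limit; the EHS uses an error rate
    transient_errors: int = 0
    configured: bool = True


def handle_error(fru: Fru, solid: bool) -> list[str]:
    """Return the actions taken for one isolated error (events 4 to 12)."""
    actions = []
    if solid:
        # Events 4 and 5: a solid error removes the FRU from service.
        fru.configured = False
        actions.append("deconfigure")
    else:
        # Events 7 and 8: a transient error is compared to its threshold.
        fru.transient_errors += 1
        if fru.transient_errors > fru.threshold:
            # Event 10: over threshold, so the FRU is deconfigured.
            fru.configured = False
            actions.append("deconfigure")
    actions.append("log")                  # event 11
    actions.append("notify FTSS$SERVER")   # event 12
    return actions


if __name__ == "__main__":
    cpu = Fru("CPU module", threshold=2)
    for _ in range(3):
        print(handle_error(cpu, solid=False))
```

In this model the FRU stays in service for the first two transient errors and is deconfigured on the third; every error is logged and reported regardless of the outcome.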
VAXELN errors are reported to the OpenVMS operating system through an IPL 22 interrupt. The interrupts are vectored by a combination of hardware and software to the EHS interrupt service routine.

Figure 4–2 illustrates the position of the EHS relative to the major hardware, system firmware, and other software components.

Figure 4–2 EHS Architectural Position
[Figure: block diagram placing the Error Handling Services functions among the System Utilities, Remote Zone Interface, Firmware Interface, Hardware Interface, and VMS Interface. MR-0004-93RAGS]

4.2.3 System Operating Modes

The error handler recognizes four modes of system operation. Each mode directly relates to the supported hardware modes of the cross-link state as summarized in Table 4–3.

Table 4–3 System Operating Modes

Mode  Definition

Simplex
The cross-link state in one zone is off and the CPU, memory, and I/O subsystem of the other zone are not available for use. However, those components in the other zone may be available and can run the OpenVMS operating system. The system can be booted in this mode if one zone is not physically present or is out of service. The system can also be degraded into this mode after the failure of one zone.

Degraded Duplex
The cross-link state in one zone (the master zone) is set to master and the cross-link state in the other zone is set to slave. The CPU and memory in the master zone are running the OpenVMS operating system and the I/O from the slave zone is configured and in use. However, the slave zone CPU and memory are not in use. This mode can only be achieved as a result of the deconfiguration of a CPU and memory set of one zone due to an error.
Resynch This mode is similar to Degraded Duplex except that all memory writes in the master zone are duplicated in the slave zone. That is, when a write to memory is performed in the master zone, the same data is written to the same memory location in the slave zone. The cross-link state in one zone is Resynch master and in the other zone, Resynch slave. This mode is used during the synchronization process to copy the master zone memory to the slave zone before entering Duplex mode. (continued on next page) 4–4 Error Handling and Analysis Table 4–3 (Cont.) System Operating Modes Mode Definition Duplex The memories in both zones are identical and both CPUs are running in lockstep. The I/O subsystems of both zones are available and in use. The cross-link state in both zones is Duplex. The system can be booted in this mode, or can transition to this mode as the result of the synchronization process from either Simplex or Degraded Duplex modes. 4.2.4 Error Types EHS recognizes 11 error types. All errors are classified as one of those described in Table 4–4. Table 4–4 Error Types Error Type Definition CPU/MEM Faults All data, ECC codes, and control signals flow over the primary rail. The mirror rail exists primarily for the purpose of performing verification checks against the primary rail. Some checks are performed by hardware between these two rails to detect failures within the boundaries of the CPU module. When such a condition is detected, a CPU/MEM fault is generated by the hardware, and results in the following set of hardware actions: 1. A high-level system interrupt occurs to report the error, causing an entry into the error handler. In some cases, the failure may be severe enough to prevent instructions from executing. 2. If the operating mode at the time of the failure is Duplex, it will be changed to Degraded Duplex mode. In this case, the other zone is interrupted as well by a report that a CPU/MEM fault occurred in the failing zone. 3. 
Approximately 145 microseconds after the interrupt, the failing CPU module will be reset by hardware, resulting in an entry into the system console. The purpose of this brief delay is to allow the error handler to store the contents of the CPU, JXD, and cross-link registers in the Console Communications Area (CCA).

In non-Duplex modes, only one CPU is in use. This failure results in the termination of the OpenVMS operating system.

CPU/MEM faults can be caused by solid or transient errors. Since software cannot distinguish between the two, they are all treated as transient. The CPU module requires service only when the faults exceed the operating system’s threshold, when an end action timeout occurs, or when diagnostics fail. In all cases, the FRU identified by software is the CPU module which experienced the failure.

Double-Bit memory errors: Hardware reports a double-bit error (DBE) when the ECC checkers detect this condition on a read from a main memory location. This read can occur during a DMA or CPU cycle, with two possible error causes: a memory failure or a programming error.

If system software attempts to access a location beyond the bounds of physical memory, hardware will report a double-bit ECC error. This is a programming error in the OpenVMS operating system and the EHS will initiate a system crash. This will be seen as a FATMEMERR bugcheck.

If system software attempts to access a valid physical memory location which does not respond, a DBE will be reported by the hardware. In this case, the cause of the problem is failed memory. The CPU with this memory failure is removed from the configuration. If the system is operating in a non-Duplex mode, the OpenVMS operating system is terminated by forcing an entry into the system console. In Duplex, the failed CPU is removed and the system continues to operate in Degraded Duplex mode.
DBEs due to memory failures are always treated as solid. The failed CPU will not be reconfigured until the zone with the failure is removed and the memory is repaired. The FRU in most cases will be a pair of SIMMs on a memory mother board (MMB). In all cases, FRU isolation is done at the time of the end action when system registers are recovered from the failed CPU. In the case of an end action timeout, the CPU module will be identified as the FRU.

Single-Bit memory errors: Single-bit errors (SBEs) can be detected either by the JXD during a DMA read cycle which reads from main memory or by the CPU during a memory read. Software action varies depending upon the system operating mode and where the error detection occurs.

If the SBE is detected by the JXD during a DMA cycle in any system mode or by the CPU during a CPU cycle in any non-Duplex mode, the actions of the EHS are the same. The error is always transient, and no deconfiguration is performed. A pair of memory SIMM rows on an MMB is isolated and compared to its error rate threshold. In Duplex mode (JXD detected) when the threshold is exceeded, the CPU module on which the memory resides will be removed from service. In non-Duplex mode, since there is only one CPU active and since SBEs are always transient, the CPU is not removed from service when the threshold is exceeded. The SBE is repaired in memory by hardware if detected by the JXD, and by the EHS if detected by the CPU.

If the SBE is detected during a CPU cycle while the system is in Duplex mode, the action differs due to hardware constraints. The CPU which experiences the SBE will be removed from service by hardware at the time of the error. An error log will be generated reporting the error, but FRU isolation is done at the time of the end action. The error is then compared to its error rate threshold by the OpenVMS operating system.
If the threshold is not exceeded, the CPU will be resynchronized immediately by system software (FTSS$SERVER) at the time of the end action. The process of resynchronization will repair the SBE in physical memory since each location is rewritten during the memory copy. If the failed CPU does not return for resynchronization after being removed in the CPU-detected Duplex mode case, an end action timeout event will be logged which identifies the failed CPU module as the FRU.

In most cases, a pair of SIMM rows and a memory mother board (MMB) are identified as the FRU in the error log. However, in some cases, end action data may not contain all the information needed to isolate to a pair of memory SIMM rows. In this case the CPU module will be identified as the FRU and will be subjected to the same threshold as a memory SIMM.

Cable failures: All traffic between the two zones of the system is carried across the cross-link cable. If this cable is detached or broken, the hardware will report a cable loss event to the EHS. This error can only happen in a non-Simplex system, and when it occurs, communication between the zones is lost. In all cases, the system operating mode must be changed to Simplex. If the mode before the error was not Duplex, then the slave zone is removed from service. If the mode was Duplex, then Zone B is removed from service. The EHS indicates in the error log that this error is solid and service is required, and the error is compared to its error rate threshold. If the threshold is not exceeded, the zone will be resynchronized automatically. If the threshold is exceeded, no automatic resynchronization will occur until the cross-link cable is repaired. In all cases, the FRU is the cross-link cable.

Power failures: If a zone loses power in a non-Simplex configuration, hardware generates an interrupt to report the event to the EHS.
In a non-Duplex mode, software will detect this error only when the slave zone loses power. In this case, the slave zone is removed from the configuration and the system continues to run in Simplex mode. In Duplex mode, the error is detected by software when either zone loses power. Again, the failed zone is removed from the configuration and the system continues in Simplex mode. EHS indicates in the error log that this error is solid and service is required, and the error is compared to its error rate threshold. If the threshold is not exceeded, the zone will be resynchronized automatically. If the threshold is exceeded, no automatic resynchronization will occur until the zone is repaired and resynchronized manually. The failed zone is identified as the FRU for all power failures.

Clock phase errors: If the clocks between zones begin to run out of phase, hardware generates an interrupt to report the event to the EHS. This event can occur only in non-Simplex modes. The cause of this type of failure can be either the oscillator or the clock locking logic.

An oscillator failure will prevent the CPU and I/O module clocks in the two zones from running in synchronization and will result in the termination of the OpenVMS operating system on that zone. Failure in the clock lock logic will result in two zones running diverged if the system operating mode had been Duplex. In this case, EHS will select one zone to remove, and the other zone will continue to run the OpenVMS operating system in Simplex mode. (Zone selection is based on timings within the system and could be either zone.) In Degraded Duplex mode, the slave zone is removed from the configuration and the OpenVMS operating system continues in Simplex mode.

In all cases of oscillator failure, the ATM in the zone which is removed is identified as the FRU. If the error is caused by clock lock logic failure, software cannot accurately determine in which zone the failure exists.
The EHS compares the error to its error rate threshold. An error log is generated at the time of the error which identifies the ATM as the FRU. If the threshold is exceeded, the error log indicates that service is required for the ATM and the zone will not be resynchronized automatically. If the threshold is not exceeded and the diagnostic tests complete successfully, the zone will be resynchronized when it becomes available. If the threshold is not exceeded and the diagnostics report a failure, the end action error log will indicate that the ATM module requires service and the zone will not be resynchronized automatically. If the zone fails to return for service and the threshold had not been exceeded, an end action timeout error log is generated which indicates the ATM requires service.

Halt errors: A halt error occurs when the system is operating in Duplex mode, the Zone Halt Enable switch on the zone control panel is pressed, and the Break key is pressed on one of the system consoles, or one zone experiences errors on its halt lines. The zone attached to the console terminal or with the error will be halted and enter the system console. In the other zone, hardware generates an interrupt to the EHS. The system operating mode will be degraded to Simplex and the OpenVMS operating system will be continued after deconfiguring the halted zone. The failed zone is identified as the FRU in the error log. This error is not subjected to thresholding. The halted zone must be resynchronized manually to be returned to service.

Resynch abort errors: During memory resynchronization, all memory writes are mimicked to both zones. The data is driven from the master zone across the resynch bus (also referred to as the cross-link cables) to the slave zone. The incoming data on the slave side is protected by ECC.
An ECC failure on the slave side results in a CPU/MEM fault on the slave and is handled as that type of error. The data is protected on the master side by an ECC, a cross-rail ECC comparison, and a data cross-check. The failure of any of these checks results in hardware generating an interrupt to the EHS reporting a resynch abort error. Resynch mode is terminated by the hardware and system operation continues in Degraded Duplex mode.

Since all resynch abort errors indicate failures on the master side, the master CPU module is isolated as the FRU. This error can occur only when the system is in Resynch mode, so removal of the CPU would result in termination of the OpenVMS operating system. The error log message will indicate the master CPU as the FRU. The EHS compares the error to its error rate threshold. If the threshold is exceeded, the EHS will disable automatic resynchronization of the remote zone. Manual intervention will be required to repair this situation. Since Duplex mode cannot be achieved and the master CPU is the source of this failure, the OpenVMS operating system must be manually terminated to repair the CPU module.

Nonexistent I/O errors: Nonexistent I/O (NXIO) errors occur when a reference to an I/O module times out. Such a timeout can occur during a DMA or CPU cycle. In a CPU cycle, an automatic operation retry is attempted. If the retry succeeds, hardware reports the failure as transient. Otherwise, it is reported as a solid failure. All timeouts during DMA cycles are transient errors. The error log indicates if the error was solid or transient, and if it occurred on a DMA or CPU cycle.

In all NXIO error cases, either an I/O or interface module will be identified as the FRU. If the error is solid, the I/O or interface module will be removed from system service by the EHS. If the error is transient, it will be compared to its error rate threshold by the EHS.
If the threshold is exceeded and the system operating mode is not Simplex, the I/O or interface module will be removed from system service. No I/O module will be removed due to transient errors from a Simplex system (where alternate I/O paths are not normally available). Additional transient errors on the I/O module will generate further error logs.

I/O errors: The ATM module contains a series of checkers that verify consistency between the dual rails of the system during I/O accesses. When discrepancies are detected, the hardware generates an interrupt, invoking the EHS. System registers which reflect the state of the checkers are read and analyzed to determine the source of the error. These miscompare errors can be detected during a DMA operation or a direct CPU I/O access.

When miscompares occur on CPU cycles, the hardware automatically retries the operation. If the retry succeeds, hardware reports the error as transient. Otherwise, the error is solid and the EHS deconfigures the system to remove the FRU. The error log will indicate the FRU, describe the error as solid or transient, and list any modules that were deconfigured as a result. If the FRU is a zone or an ATM, the entire zone is removed. These errors result in a CPU, ATM, interface module, or cross-link FRU. Transient errors are compared to their error rate threshold by the EHS. Errors that exceed the threshold may result in the removal of the FRU from service.

Zone divergence: This error type occurs when the two zones begin executing separate code paths while operating in Duplex mode. This situation is detected by hardware when an access to I/O space is performed. At that time, miscompares in the control and data signals will be detected in the cross-link chips on the ATM.
This error is reported by hardware as an I/O error or an NXIO error, but software recognizes the special case and identifies it as zone divergence in the error log. When this error is detected, software will remove one zone from service. (Zone selection depends on how zone divergence manifested itself.) Either zone may be removed. This error is usually due to a programming error or divergence between the NVRAMs of the two zones. The error is treated as transient and the threshold error count for that error is incremented.

If the threshold is not exceeded and if the diagnostics on the removed zone complete successfully, the zone will be resynchronized back into the system at end action time. If the threshold is exceeded or if the diagnostics on the removed zone report a failure, the zone will not be resynchronized at end action time. The end action error log will indicate that service is required. If the removed zone fails to return from running diagnostics, an end action timeout error log will be generated which identifies the zone as the FRU and requests service. If the threshold is exceeded, the zone will not be automatically resynchronized. Manual intervention will be required to repair the zone and return it to service.

4.2.5 VAXELN Error Handling

Failures detected by VAXELN software running on the I/O expansion module are reported to the EHS through one of two mechanisms:
• An IPL 22 interrupt from the module error which is dispatched into the EHS.
• The EHS detects the expiration of a watchdog timer maintained by VAXELN signaling a termination of VAXELN execution.

Table 4–5 describes the VAXELN error classes and the actions taken by the EHS.

Table 4–5 VAXELN Error Classes (columns: Error Class, Description, EHS Actions)

VAXELN Kernel Fatal: This error is reported when the VAXELN kernel detects a fatal error which prevents it from continuing operation. The FRU is the I/O expansion module.
This is a solid error and is not subjected to a threshold.

VAXELN Kernel Recoverable: A recoverable error was detected and handled by VAXELN software. Currently, this error is reported only when VAXELN software detects a repairable single-bit memory error. The FRU is the I/O expansion module. The error is compared to its error rate threshold. If the threshold is exceeded, the I/O expansion module and all attached interface modules are deconfigured from the system.

I/O Expansion Module Master Fatal: A fatal error detected by the VAXELN I/O expansion module master job which results in the shutdown of all VAXELN processes. The FRU is the I/O expansion module. This is considered a solid error; no threshold is applied. The I/O expansion module is deconfigured from the system.

I/O Expansion Module Master Recoverable: An error detected by the VAXELN I/O expansion module master job which resulted from the failure of a VAXELN job to initialize successfully. The Job ID field of the error message indicates which VAXELN job failed. The FRU is an interface module. The EHS isolates the interface module by checking the Job ID field of the error message. The error is considered solid; no threshold is applied. The module is deconfigured from the system.

I/O Expansion Module Job Fatal: Similar to I/O Expansion Module Master Recoverable, this error indicates that a VAXELN job has experienced a fatal error and has been terminated. The Job ID field of the error message indicates which VAXELN job failed. The FRU is an interface module. The EHS isolates the interface module by checking the Job ID field of the error message. The error is considered solid; no threshold is applied. The interface module is deconfigured from the system.

VAXELN software implements a watchdog timer which is a cell in the I/O Expansion Module Communication Area (NCA). It is incremented periodically by VAXELN and monitored by the EHS. If the value in the NCA cell stops incrementing, VAXELN has crashed.
This is referred to as a VAXELN kernel fatal error. The EHS examines the VAXELN NCA error log buffer area for a VAXELN error message. When it finds the error message, the EHS identifies the I/O expansion module as the FRU. The error is considered solid; no threshold is applied, and the I/O expansion module is deconfigured from the system.

4.3 Field Replaceable Units (FRUs)

After analyzing error information and determining the error type, the EHS isolates the source of the error to a FRU. If the error is solid, the system is deconfigured to remove the FRU from service. If the error is transient, it is compared against a threshold for the error type and FRU. If the threshold is exceeded, or if the error is solid, the system is deconfigured to remove the FRU from service.

4.3.1 Isolation

Table 4–6 describes the FRUs and lists the error types which could result in a FRU being isolated.

Table 4–6 System FRUs (columns: FRU, Description, Source Error Types)

ATM module: I/O attachment module. Performs exchange and verification of I/O control and data signals between zones. The module includes an embedded I/O expansion module. Source error types: I/O errors; Clock phase errors.

CPU module: The CPU module is identified as the FRU when the failure is attributable to a CPU problem or to a problem that cannot be isolated between the CPU and memory. Source error types: Resynch abort errors; CPU/MEM faults; Double-Bit memory errors; Single-Bit memory errors.

Memory board: A pair of rows of memory SIMMs on a memory mother board (MMB) will be identified as the FRU when the error can be isolated beyond the CPU board to a specific piece of memory. Source error types: Double-Bit memory errors; Single-Bit memory errors.

I/O expansion module: An I/O expansion module can be identified as the FRU as a result of a firewall miscompare during an I/O operation or as a result of a nonexistent I/O error during a reference to the I/O expansion module or an attached interface module.
Source error types: Nonexistent I/O errors; I/O errors; VAXELN errors.

Interface module: An interface module can be identified as a FRU only as a result of a nonexistent I/O error which occurs during a reference to the interface module. It is also possible that the I/O expansion module will be identified as the FRU. Source error types: Nonexistent I/O errors; VAXELN errors.

Zone: Some error cases involve failures not directly attributable to a single module. The zone FRU is only identified in the case of solid or reproducible errors, so diagnostics should be able to isolate the failure within the zone. Source error types: Power failures; Halt errors; Zone divergence.

Cross-link cable: The cross-link cable is the identified FRU for any error which isolates the connections between zones. This includes the resynch and interzone buses, which are packaged into the single physical cable. Source error types: Cable failures; I/O errors.

4.3.2 Deconfiguration

This section describes the actions taken by the EHS when a FRU is identified as the source of a solid error or transient errors which exceed the FRU threshold. A table is provided for each FRU that describes the actions taken by the EHS when the FRU is deconfigured. In non-Duplex modes, the EHS may respond to excessive transient failures by calling out the FRU but not removing it from service. This action prevents loss of system service due only to transient errors.

4.3.2.1 I/O Attachment Module

Table 4–7 describes the OpenVMS operating system actions taken when the ATM is identified as the FRU and deconfigured by the EHS. Some actions are dependent on the system operating mode.

Table 4–7 ATM Deconfiguration Actions (columns: Action Taken, Description, Comments)

Cross-link mode = off: The cross-link mode is set to off. The system will continue in Simplex mode. The action may be taken by the hardware when the error occurs or by software while handling the error. Done in non-Simplex mode only. Extraneous when the error occurs in Simplex mode.
CPU/MEM fault: A CPU/MEM fault is forced on the zone with the failed ATM module. This results in an entry into the system console. Done when the error occurs in Duplex, Simplex, or in the master zone of a Degraded Duplex configuration.

Zone hard reset: A zone hard reset is issued to the zone with the failed ATM to force diagnostics to run. Done only when the error occurs in the slave zone of a Degraded Duplex configuration.

Set ATM LED indicator: Use the module I2C bus to turn on the LED indicator for the failed ATM module.

Set module status in ATM NVRAM and DCB: Update the status_os and status_sum fields in the module ID NVRAM and the DCB to indicate the module has experienced a failure. The code written depends on the failure type.

The entries in Table 4–7 apply when the module is being removed because of a solid error or excessive transient errors. There is one exception. When an ATM module in a Simplex system experiences excessive transient errors, the module is not fully deconfigured since that would result in the termination of the OpenVMS operating system. In this case, the ATM LED indicators turn on, and the module status is written to the ATM NVRAM and DCB. The OpenVMS operating system continues to run. The module will not be configured when the system is booted, or when the failed zone is synchronized, until the module is repaired.

4.3.2.2 CPU Module and Memory

When memory is deconfigured from the system, it is done by removing the CPU module on which the memory resides. Table 4–8 describes the OpenVMS operating system actions taken when a CPU module or memory is identified as the FRU and is deconfigured by the EHS. These actions are identical for CPU and memory failures. Some actions are dependent on the system operating mode.
Table 4–8 CPU Deconfiguration Actions (columns: Action Taken, Description, Comments)

Cross-link mode = Degraded Duplex: The cross-link mode is set to master on the zone with the surviving CPU and slave on the zone with the failed CPU. The action may be taken by the hardware when the error occurs or by software while handling the error. Done in Duplex mode only.

CPU/MEM fault: A CPU/MEM fault is forced on the failed CPU module. This results in an entry into the system console.

Set CPU LED indicator: The module I2C bus is used to turn on the LED indicator for the failed CPU module.

Set module status in CPU NVRAM and DCB: The status_os and status_sum fields in the module ID NVRAM and DCB are updated to indicate the module has experienced a failure. The code written depends on the failure type.

When one CPU is in use (Degraded Duplex, Simplex, or Resynch mode), excessive transient failures will result in the EHS calling out the failed module, but not removing it from service. Removing it from service would cause termination of the OpenVMS operating system. In this case, the CPU module LED is turned on, and the module status is written to the CPU module NVRAM and DCB. The OpenVMS operating system continues to run. The CPU will not be configured when the system is booted or when the failed zone is synchronized unless the CPU is repaired.

4.3.2.3 I/O Expansion Module

Table 4–9 describes the actions taken by the OpenVMS operating system when an I/O expansion module is identified as the FRU and is deconfigured by the OpenVMS operating system.

Table 4–9 I/O Expansion Module Deconfiguration Actions (columns: Action Taken, Description)

I/O hard reset: The I/O expansion module which is being deconfigured is reset through the cross-link I/O hard reset register.

Set I/O expansion module LED indicator: The module I2C bus is used to turn on the LED for the failed module.
Set module status in I/O expansion module NVRAM and DCB: The status_os and status_sum fields in the module ID NVRAM and DCB are updated to indicate the module has experienced a failure. The actual code written depends on the failure type.

The entries in Table 4–9 apply when the module is being removed due to a solid error or excessive transient errors. There is one exception. When an I/O expansion module in a Simplex system experiences excessive transient errors, the module is not fully deconfigured since that would likely result in the loss of the only I/O path to a device. In this case, the I/O expansion module LED is turned on and the module status is written to the I/O expansion module NVRAM and the DCB. The I/O expansion module will remain in service. The module will not be configured when the system is booted or when the failed zone is synchronized until the module is repaired.

4.3.2.4 Interface Module

Table 4–10 describes the OpenVMS operating system actions taken when an interface module is identified as the FRU and is deconfigured by the OpenVMS operating system. Some actions are dependent on the system operating mode.

Table 4–10 Interface Module Deconfiguration Actions (columns: Action Taken, Description)

Reset interface module: The interface module being deconfigured is reset through the module I2C bus.

Set interface module LED indicator: Use the module I2C bus to turn on the LED indicator for the failed interface module.

Set module status in interface module NVRAM: Update the status_os and status_sum fields in the module ID NVRAM and the DCB to indicate the module has failed. The code written depends on the failure type.

The entries in Table 4–10 apply when the module is being removed because of a solid error or excessive transient errors. There is one exception. When an interface module in a Simplex system experiences excessive transient errors, the module is not fully deconfigured since that would likely result in the loss of the only I/O path to a device.
In this case, the interface module LED indicator is turned on, and the module status is written to the interface module NVRAM and the DCB (see Section 4.8.2). The interface module will remain in service. The module will not be configured when the system is booted or when the failed zone is synchronized until the module is repaired.

4.3.2.5 Zone

Table 4–11 describes the OpenVMS operating system actions taken when an entire zone is identified as the FRU and is deconfigured by the EHS. Note that some actions are dependent on the system operating mode.

Table 4–11 Zone Deconfiguration Actions (columns: Action Taken, Description, Comments)

Cross-link mode = off: The cross-link mode is set to off. The system will continue in Simplex mode. The action may be taken by the hardware when the error occurs or by software while handling the error. Done only in non-Simplex mode.

CPU/MEM fault: A CPU/MEM fault is forced on the failed zone. This results in an entry into the system console. Done when the error occurs in Duplex, Simplex, or in the master zone of a Degraded Duplex system.

Zone hard reset: A zone hard reset is issued to the failed zone. Done only in the slave zone of a Degraded Duplex or Resynch mode system.

4.3.2.6 Cross-Link Cable

Table 4–12 describes the OpenVMS operating system actions taken when the cross-link cable is identified as the FRU and is deconfigured by the EHS. The cross-link cable is active only during non-Simplex modes.

Table 4–12 Cross-Link Cable Deconfiguration Actions (columns: Action Taken, Description, Comments)

Cross-link mode = off: The cross-link mode is set to off. The system will continue in Simplex mode. The action may be taken by the hardware when the error occurs or by software while handling the error. Done only in non-Simplex modes.

CPU/MEM fault: A CPU/MEM fault is forced on Zone B. This results in an entry into the system console. Done only when the error occurs in Duplex mode.

Zone hard reset: A zone hard reset is issued to the slave zone.
Done in the slave zone when the error occurs in Degraded Duplex or Resynch mode.

4.3.3 Application of Thresholds

Application of thresholds by the EHS is rate based. A FRU exceeds its threshold when it accumulates a certain number of a given error type in a specified time period. Table 4–13 lists the thresholds associated with each FRU and error type.

In most cases, more than one type of error can result in the isolation of a FRU. For each FRU and error type, a separate threshold is applied. The threshold for an error type of a specific FRU must be exceeded before the module is deconfigured. For example, both NXIO and I/O errors may isolate an ATM module. EHS maintains separate thresholds for NXIO and I/O errors for each ATM module. When one of the errors occurs and is isolated to an ATM, the threshold for that error type on that ATM is applied. If the threshold is exceeded, the ATM is deconfigured.

Table 4–13 FRU Thresholds (columns: Error Type, Error Limit, Time Period in hours, Comments)

CPU Module

CPU/MEM faults (error limit 3, time period 12 hours): A CPU/MEM fault results in the temporary removal of the CPU module from service. The CPU will be reconfigured into the system if this threshold is not exceeded.

Resynch abort errors (error limit 3, time period 1 hour): Resynch abort errors result in the termination of the Resynch operation. When the threshold for this error is exceeded, the CPU module is marked as broken. System downtime must be scheduled to repair the problem since the only CPU module has failed.

Memory SIMMs

Single-bit memory errors (error limit 3, time period 12 hours): Each single-bit memory error is attributed to a row of memory SIMMs on a single MMB. Each SIMM row has an individual threshold. When the threshold for the SIMM row is exceeded, the CPU module on which the SIMM resides will be removed from service if the system operating mode is Duplex.

I/O ATM Module

Clock phase errors (error limit 3, time period 12 hours): Each clock phase error results in the temporary removal from service of a zone.
When the zone returns to service, it will be resynchronized automatically if the threshold is not exceeded.

Transient I/O errors (error limit 3, time period 12 hours): When this threshold is exceeded, the zone in which the ATM resides is removed from service, except in a Simplex system.

I/O Expansion Module

Transient NXIO errors (error limit 3, time period 12 hours): When the threshold is exceeded, the module is deconfigured, except in a Simplex system.

Transient I/O errors (error limit 3, time period 12 hours): When the threshold is exceeded, the module is deconfigured, except in a Simplex system.

VAXELN kernel recoverable errors (error limit 3, time period 24 hours): When the threshold is exceeded, the module is deconfigured, except in a Simplex system.

Interface Module

Transient NXIO errors (error limit 3, time period 12 hours): When the threshold is exceeded, the interface module is deconfigured, except in a Simplex system.

Zone

Power failures (error limit 3, time period 24 hours): When power is lost, the zone is temporarily removed from service and the error is compared to its error rate threshold. When power is restored, the zone will be resynchronized automatically if the threshold has not been exceeded.

Zone divergence (error limit 3, time period 24 hours): When the zones diverge, one zone is temporarily removed from the configuration and the error is compared to its error rate threshold. When the zone returns to service, it will be reconfigured if the threshold is not exceeded. This threshold is not applied directly to any FRU. The selection of which zone to remove is made based on how the error manifests itself within the system.

Cross-Link

Cable failures (error limit 3, time period 24 hours): When the cable between the zones is lost, the zone is temporarily removed from service and the error is compared to its error rate threshold. When the zone returns, it will be resynchronized automatically if the threshold has not been exceeded.
Transient I/O errors 3 12 When the threshold is exceeded, the cross-link is deconfigured, which results in the removal of one of the zones from service. 1 In hours 4–18 Error Handling and Analysis 4.4 OpenVMS Error Log The EHS makes entries in the system error log for all system error interrupts. Figure 4–3 shows the format of the error log. With the exception of the Fault Data block, all blocks have fixed length. Figure 4–3 OpenVMS Error Log Format Number of Longwords Fault Summary FRU Information Deconfiguration Information Threshold Information Fault Data MR−0006−93RAGS The first longword in the error log contains the count of longwords which follow. This number is based on the fault class of the error log (see Section 4.4.1). Table 4–14 lists the different values which will appear for each of the six different fault classes. Table 4–14 OpenVMS Error Log Sizes Class Value Fault Class Decimal Size Hexidecimal Size 1 System Error 40 28 2 End Action 41 29 3 End Action Timeout 13 D 4 VAXELN Error 28 1C 5 Software Detected Error 15 F 6 CPU or Zone Unsynchable 14 E Error Handling and Analysis 4–19 4.4.1 Fault Summary The Fault Summary block contains the fault ID, fault flags describing the nature of the fault, the cross-link mode at the time the fault occurred, and the cross-link mode after the error handling was completed. All fields in this block are valid for all error entries. Figure 4–4 identifies each entry in the block and the offset from the start of the block. Table 4–15 describes the content of each field. Note The 1-byte FAULT_ID field is composed of two 4-bit subfields. Bits [07:04] indicate the class of the fault. Bits [03:00] identify the error type within the fault class. There are six fault classes. Each class has a different fault data block at the end of the error log. See Section 4.4.5 for a description of each fault class and the fault data provided in the error log. 
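The split of FAULT_ID into a class nibble and a type nibble, together with the sizes in Table 4–14, can be illustrated with a short sketch. The function names `decode_fault_id` and `describe_fault_id` are hypothetical and are not part of any Digital software; only the bit positions and the table values come from this section.

```python
# Fault classes and error log sizes (in longwords), transcribed from Table 4-14.
FAULT_CLASSES = {
    1: ("System Error", 40),
    2: ("End Action", 41),
    3: ("End Action Timeout", 13),
    4: ("VAXELN Error", 28),
    5: ("Software Detected Error", 15),
    6: ("CPU or Zone Unsynchable", 14),
}


def decode_fault_id(fault_id):
    """Split a 1-byte FAULT_ID into (fault class, error type within the class)."""
    if not 0 <= fault_id <= 0xFF:
        raise ValueError("FAULT_ID is a 1-byte field")
    fault_class = (fault_id >> 4) & 0xF   # bits [07:04]
    error_type = fault_id & 0xF           # bits [03:00]
    return fault_class, error_type


def describe_fault_id(fault_id):
    """Hypothetical helper: attach the class name and expected log size."""
    fault_class, error_type = decode_fault_id(fault_id)
    name, longwords = FAULT_CLASSES.get(fault_class, ("Unknown class", None))
    return {
        "class": fault_class,
        "class_name": name,
        "type": error_type,
        "expected_longwords": longwords,
    }
```

For example, a FAULT_ID of 1A hexadecimal splits into class 1 (System Error) and type A, so the first longword of that error log would be expected to hold 40.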
Figure 4–4 Fault Summary Block

Fields shown in the figure: XLINK_MODE_AFTER (Crosslink Mode After), XLINK_MODE_ERROR (Crosslink Mode Error), FAULT_FLAGS (Fault Flags), FAULT_ID (Fault Identification).

Table 4–15 Fault Summary Block Entry Descriptions

FAULT_ID
Fault Identification type. The hexadecimal ID values are defined as:
  10 - CPU-detected double-bit error
  11 - JXD-detected double-bit error
  12 - Cable gone between zones
  13 - Power gone in other zone
  14 - Clock error
  15 - Other zone halted
  16 - Resynch abort error
  17 - CPU-detected single-bit error
  18 - JXD-detected single-bit error
  19 - CPU/MEM fault
  1A - Nonexistent I/O
  1B - I/O miscompare error
  1C - Zone divergence
  20 - CPU-detected DBE end action
  21 - JXD-detected double-bit error end action
  22 - Cable gone end action (reserved for future use)
  23 - Power gone end action (reserved for future use)
  24 - Clock error end action
  25 - Other zone halted end action (reserved for future use)
  26 - Resynch abort error end action (reserved for future use)
  27 - CPU-detected single-bit error end action
  28 - JXD-detected single-bit error end action (reserved for future use)
  29 - CPU/MEM fault end action
  2C - Zone divergence end action
  30 - CPU-detected DBE end action timeout
  31 - JXD-detected DBE end action timeout
  32 - Cable gone end action timeout (reserved for future use)
  33 - Power gone end action timeout (reserved for future use)
  34 - Clock error end action timeout
  35 - Other zone halted end action timeout (reserved for future use)
  36 - Resynch abort error end action timeout (reserved for future use)
  37 - CPU-detected SBE end action timeout
  38 - JXD-detected single-bit error end action timeout (reserved for future use)
  39 - CPU/MEM fault end action timeout
  3C - Zone divergence end action timeout (reserved for future use)
  40 - VAXELN kernel fatal error
  41 - VAXELN kernel recoverable error
  42 - VAXELN master job fatal error
  43 - VAXELN master job recoverable error
  44 - VAXELN job fatal error
  45 - VAXELN job recoverable error (reserved for future use)
  50 - Software-detected error
  60 - CPU is unsynchable

FAULT_FLAGS
The following bits are defined within FAULT_FLAGS:
  00 - Transient error
  01 - Solid error
  02 - Error threshold exceeded
  03 - Service is required
  [07:04] - Not used

XLINK_MODE_ERROR
Cross-link mode at the time of error. The following values are defined:
  0 - Off (Simplex)
  1 - Slave
  2 - Master
  3 - Duplex
  4 - Not used
  5 - RESYNCH_SLAVE
  6 - RESYNCH_MASTER
  7 - Not used

XLINK_MODE_AFTER
Cross-link mode after error handling. The modes are as defined for XLINK_MODE_ERROR.

4.4.2 FRU Information

This block contains information on the isolated FRU and is valid for all error events. Figure 4–5 identifies each entry in the block and the offset from the start of the block. Table 4–16 describes the content of each entry.

Note
In some cases, an FRU is not identified in the error log for a system error event. All fields in this block will then be -1 (FFFFFFFF hexadecimal). In these cases, the FRU will be identified in a subsequent end action or end action timeout error log.

Figure 4–5 FRU Information Block

  Offset 0:  FRU_TYPE (FRU Type)
  Offset +4: FRU_DATA (FRU Data)

Table 4–16 FRU Information Block Entry Descriptions

FRU_TYPE
The following bits are defined:
  01 - The FRU is a module in Zone A (FRU_DATA has slot ID)
  02 - The FRU is a module in Zone B (FRU_DATA has slot ID)
  03 - Zone A is the FRU
  04 - Zone B is the FRU
  05 - The cross-link cable is the FRU
  06 - The FRU is a Zone A SIMM (FRU_DATA has SIMM ID)
  07 - The FRU is a Zone B SIMM (FRU_DATA has SIMM ID)

FRU_DATA
FRU-specific data.
The following bits are defined for FRU_TYPE values 01 and 02:
  00 - CPU module in slot 0 is the FRU
  01 - ATM module in slot 1 is the FRU
  02 - I/O expansion module in slot 2 is the FRU
  [09:03] - Not used
  10 - Interface module in slot 10 is the FRU
  11 - Interface module in slot 11 is the FRU
  12 - Interface module in slot 12 is the FRU
  13 - Interface module in slot 13 is the FRU
  14 - Interface module in slot 14 is the FRU
  15 - Interface module in slot 15 is the FRU
  16 - Interface module in slot 16 is the FRU
  17 - Interface module in slot 17 is the FRU
  [19:18] - Not used
  20 - Interface module in slot 20 is the FRU
  21 - Interface module in slot 21 is the FRU
  22 - Interface module in slot 22 is the FRU
  23 - Interface module in slot 23 is the FRU
  24 - Interface module in slot 24 is the FRU
  25 - Interface module in slot 25 is the FRU
  26 - Interface module in slot 26 is the FRU
  27 - Interface module in slot 27 is the FRU
  [31:28] - Not used

Note
The following fields define the SIMM ID for FRU_TYPEs 06 and 07:
  [15:00] = MMB ID from 0 to 3.
  [31:16] = SIMM row ID. Values 1 to 4 represent SIMM rows A to D, respectively.
This field = -1 for all other FRU_TYPE values.

4.4.3 Deconfiguration Information

This error log block contains information about any system deconfiguration performed by the EHS. Figure 4–6 identifies each entry in the block and the offset from the start of the block. Table 4–17 describes the content of each entry.

Note
For errors which require no system deconfiguration, only the FT_FLAGS fields will be filled in. The last two longwords will contain 0.

Figure 4–6 Deconfiguration Information Block

  Offset 0:   FT_FLAGS_BEFORE (Fault Flags Before)
  Offset +4:  FT_FLAGS_AFTER (Fault Flags After)
  Offset +8:  DECONFIG_INFO (Entity Deconfigured)
  Offset +12: DECONFIG_MODULES (Modules Deconfigured)

Table 4–17 Deconfiguration Information Block Entry Descriptions

FT_FLAGS_BEFORE
The contents of EXE$GL_FT_FLAGS at the time the system error occurred. The field is valid for all errors.

FT_FLAGS_AFTER
The contents of EXE$GL_FT_FLAGS after error handling is complete. If the EHS performs any system deconfiguration that includes degraded system mode in the cross-link, this field will differ from FT_FLAGS_BEFORE. Otherwise, they are the same. The field is valid for all errors.

DECONFIG_INFO
This field shows the entity which was deconfigured as a result of the error. This is either a module in a given zone or an entire zone. The following bits are defined:
  00 - Zone A deconfigured.
  01 - Zone B deconfigured.
  02 - CPU module in Zone A deconfigured.
  03 - CPU module in Zone B deconfigured.
  04 - ATM module in Zone A deconfigured.
  05 - ATM module in Zone B deconfigured.
  06 - I/O expansion module in Zone A deconfigured.
  07 - I/O expansion module in Zone B deconfigured.
  08 - Interface module in Zone A deconfigured.
  09 - Interface module in Zone B deconfigured.

DECONFIG_MODULES
This field shows the Zone A modules removed from service as a result of error handling. For example, if the source of a solid or excessive transient error were an I/O expansion module, all attached interface modules have been removed from service. The following bits are defined:
  00 - CPU module in slot 0 has been removed from service.
  01 - I/O expansion module in slot 1 has been removed from service. Set when the expansion module portion of the ATM module in slot 1 is removed from service. Removal of this portion of the ATM module does not require deconfiguring the entire zone.
  02 - I/O expansion module in slot 2 has been removed from service.
  03 - ATM module in slot 1 has been removed from service. Set when the entire ATM module is removed from service. The bits for all other modules present in the zone will also be set. The entire zone is deconfigured.
  [09:04] - Not used.
  10 - Interface module in slot 10 has been removed from service.
  11 - Interface module in slot 11 has been removed from service.
  12 - Interface module in slot 12 has been removed from service.
  13 - Interface module in slot 13 has been removed from service.
  14 - Interface module in slot 14 has been removed from service.
  15 - Interface module in slot 15 has been removed from service.
  16 - Interface module in slot 16 has been removed from service.
  17 - Interface module in slot 17 has been removed from service.
  [19:18] - Not used.
  20 - Interface module in slot 20 has been removed from service.
  21 - Interface module in slot 21 has been removed from service.
  22 - Interface module in slot 22 has been removed from service.
  23 - Interface module in slot 23 has been removed from service.
  24 - Interface module in slot 24 has been removed from service.
  25 - Interface module in slot 25 has been removed from service.
  26 - Interface module in slot 26 has been removed from service.
  27 - Interface module in slot 27 has been removed from service.
  [31:28] - Not used.

4.4.4 Threshold Information

When the Transient Error flag is set in the FAULT_FLAGS field of the Fault Summary block, the isolated FRU error is compared to its error rate threshold. When the threshold is exceeded, the FRU will be removed from the system. In addition, the Excessive Transient Errors flag is set in the FAULT_FLAGS field.

When the threshold comparison is completed, the threshold information is written to the error log. Figure 4–7 identifies each entry in the block and the offset from the start of the block. Table 4–18 describes the content of each entry.

Note
For errors which do not require a threshold comparison, all entries in this block will be -1 (FFFFFFFF hexadecimal).
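The rate-based comparison described in Section 4.3.3 and in this section can be sketched as a sliding-window counter kept separately for each FRU and error type. This is only an illustration: the manual does not describe how the EHS implements the comparison, so the sliding window, the class name `RateThreshold`, and the choice to trip when the count reaches the limit are all assumptions.

```python
from collections import deque


class RateThreshold:
    """Hypothetical sliding-window counter for one FRU and one error type."""

    def __init__(self, limit, period_hours):
        self.limit = limit                          # error limit from Table 4-13
        self.period_seconds = period_hours * 3600   # time period from Table 4-13
        self.events = deque()                       # timestamps (seconds) inside the window

    def record(self, timestamp):
        """Record one error; return True when the limit is reached in the period."""
        self.events.append(timestamp)
        # Discard errors that fall outside the time period (assumed behavior).
        while self.events and timestamp - self.events[0] > self.period_seconds:
            self.events.popleft()
        return len(self.events) >= self.limit


# One counter per (FRU, error type) pair, as the section describes.
thresholds = {
    ("ATM, Zone A", "Transient I/O errors"): RateThreshold(limit=3, period_hours=12),
    ("Zone A", "Power failures"): RateThreshold(limit=3, period_hours=24),
}
```

With a limit of 3 errors in 12 hours, errors at hours 0, 1, and 2 trip the counter on the third error, while errors at hours 0, 10, and 20 do not, because the first error has left the window by the time the third arrives.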
Figure 4–7 Threshold Information Block

  Offset 0:   THRESH_INT (Threshold Interval)
  Offset +4:  THRESH_COUNT (Threshold Count)
  Offset +8:  THRESH_LMT (Threshold Limit)
  Offset +12: THRESH_ZERO (Time Since Zeroed)
  Offset +16: THRESH_TOTAL (Total Error Types)

Table 4–18 Threshold Information Block Entry Descriptions

THRESH_INT
The event threshold interval, expressed in seconds.

THRESH_COUNT
The number of events detected within the threshold interval, expressed in decimal.

THRESH_LMT
The number of events which, if detected within the threshold interval, will cause the event to be treated as a solid error by the EHS. Expressed in decimal.

THRESH_ZERO
Time since the threshold count was last zeroed, expressed in seconds.

THRESH_TOTAL
Total number of errors of this type since the threshold was zeroed, expressed in decimal.

4.4.5 Fault Data

The Fault Data block has a variable length specific to the class of the fault which occurred. The error class can be determined by the high-order four bits of the FAULT_ID field in the Fault Summary block (see Table 4–15). The six Fault Data types based on these fault classes are shown in Figure 4–8 and described in the following subsections.

Figure 4–8 Fault Data Block

Fault Data types shown in the figure: System Registers, End Actions (End Action Registers), End Action Timeouts, VAXELN Detected Errors, Software Detected Errors, Unsynchable Events.

4.4.5.1 System Registers

The EHS gathers system error information in the course of error handling. The content of these registers is written to the error log. Table 4–19 lists each register entry and its offset from the start of the block.

Note
For different system errors, different sets of system registers are collected. A value of -1 (FFFFFFFF hexadecimal) in a system register location in the error log indicates that the register was not recorded.
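The "-1 means not recorded" convention can be handled when the register block is unpacked. The sketch below is a minimal illustration, assuming the Fault Data block is available as raw bytes with little-endian longwords; the function name `read_registers` is hypothetical, and only the first four entries of Table 4–19 are listed to keep the example short.

```python
import struct

NOT_RECORDED = 0xFFFFFFFF  # -1 viewed as an unsigned longword

# First four entries of Table 4-19 as (name, byte offset).
SYSTEM_REGISTERS = [
    ("SYSFLT", 0),
    ("SYSADR", 4),
    ("DMAADR", 8),
    ("DMA_IO_ADDR", 12),
]


def read_registers(fault_data, registers=SYSTEM_REGISTERS):
    """Map register names to values; None marks a register that was not recorded."""
    values = {}
    for name, offset in registers:
        # "<L" is an unsigned little-endian 32-bit value (byte order is assumed here).
        (raw,) = struct.unpack_from("<L", fault_data, offset)
        values[name] = None if raw == NOT_RECORDED else raw
    return values
```

A caller would extend `SYSTEM_REGISTERS` with the remaining entries of Table 4–19, or pass a different list for the End Action registers.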
Table 4–19 System Register Entry Descriptions

Entry         Offset  Content
SYSFLT        0       JXD System Fault Register
SYSADR        4       JXD System Error Address Register
DMAADR        8       DMA Error Address Register
DMA_IO_ADDR   12      DMA Engine I/O Error Address Register
JCSR_A        16      JXD Control and Status Register - Zone A
JCSR_B        20      JXD Control and Status Register - Zone B
JDIAG_P_A     24      JXD Diagnostic Error Register - Zone A, primary rail
JDIAG_M_A     28      JXD Diagnostic Error Register - Zone A, mirror rail
JDIAG_P_B     32      JXD Diagnostic Error Register - Zone B, primary rail
JDIAG_M_B     36      JXD Diagnostic Error Register - Zone B, mirror rail
ATMERR0_A     40      JXD ROM BUS ATM Error Register - Zone A
ATMERR0_B     44      JXD ROM BUS ATM Error Register - Zone B
DMASTS_A      48      DMA Status Register - Zone A
DMASTS_B      52      DMA Status Register - Zone B
MMBERR0_A     56      JXD ROM BUS MMB Error Register 0 - Zone A
MMBERR0_B     60      JXD ROM BUS MMB Error Register 0 - Zone B
MMBERR1_A     64      JXD ROM BUS MMB Error Register 1 - Zone A
MMBERR1_B     68      JXD ROM BUS MMB Error Register 1 - Zone B
SERCRS_A      72      Serial Cross-Link Control and Status Register - Zone A
SERCRS_B      76      Serial Cross-Link Control and Status Register - Zone B
SERMODE_A     80      Serial Cross-Link Mode Register - Zone A
SERMODE_B     84      Serial Cross-Link Mode Register - Zone B
BIU_ADDR_A    88      CPU BIU Address Register - Zone A
BIU_ADDR_B    92      CPU BIU Address Register - Zone B
BIU_STAT_A    96      CPU Fill Syndrome - Zone A
BIU_STAT_B    100     CPU Fill Syndrome - Zone B
BIU_CTL_A     104     CPU Fill Address - Zone A
BIU_CTL_B     108     CPU Fill Address - Zone B

4.4.5.2 End Actions

End action data is provided after diagnostics have completed running on a zone or CPU which was removed from service as a result of a system error. It is composed of console and diagnostic status and the contents of registers from the failed zone/CPU at the time the original system error occurred. Table 4–20 lists each register entry and its offset from the start of the data block.

Table 4–20 End Actions Register Descriptions

Entry         Offset  Content
SYSFLT        0       JXD System Fault Register
SYSADR        4       JXD System Error Address Register
JCSR          8       JXD Control and Status Register
JDIAG_P       12      JXD Diagnostic Error Register - primary rail
JDIAG_M       16      JXD Diagnostic Error Register - mirror rail
MMBERR0       20      JXD ROM BUS MMB Error Register 0
MMBERR1       24      JXD ROM BUS MMB Error Register 1
ATMERR0       28      JXD ROM BUS ATM Error Register
DMASTS        32      DMA Status Register
DMAADR        36      DMA Error Address Register
SERCRS        40      Serial Cross-Link Control and Status Register
SERMODE       44      Serial Cross-Link Mode Register
SAVPC         48      CPU Saved PC - Zone A
SAVPSL        52      CPU Saved PSL
ECR           56      CPU EBox Control Register
BIU_CTL       60      CPU BIU Control Register
BC_TAG        64      CPU B-cache Error Tag
BIU_STS       68      CPU BIU Status Register
BIU_ADDR      72      CPU BIU Address Register
FIL_SYN       76      CPU Fill Syndrome
FIL_ADDR      80      CPU Fill Address
VMAR          84      CPU VIC Memory Address Register
ICSR          88      CPU IBox Control and Status Register
TBADR         92      CPU MBox TB Parity Address
TBSTS         96      CPU MBox TB Parity Status
PCSTS         100     CPU P-cache Status Register
PCCTL         104     CPU P-cache Control Register
CONSOLE_STS   108     System Console Duplex Compatibility Status
DIAG_STS      112     System Diagnostics Status Longword

4.4.5.3 End Action Timeouts

This data is provided when a zone or CPU which was temporarily removed from service due to a fault fails to communicate through the interzone communication service (IZC) to the remaining zone after running diagnostics. In many cases, such a situation results in the EHS declaring a solid error for the CPU or zone in this error log.

Figure 4–9 shows the format of this Fault Data block entry and its offset. Table 4–21 contains a brief description of the entry.
Figure 4–9 End Action Timeout Block

  Offset 0: TIMEOUT_INT (Timeout Interval)

Table 4–21 End Action Timeout Block Entry Description

Entry         Offset  Content
TIMEOUT       0       End action timeout interval in seconds

4.4.5.4 VAXELN Detected Errors

This data is provided for errors detected by VAXELN software running on the I/O expansion module. It is composed of data provided by VAXELN software when the error was detected on the I/O expansion module. Figure 4–10 shows the format of this Fault Data block and the offset of each entry from the start of the block. Table 4–22 contains a brief description of each entry.

Figure 4–10 VAXELN Detected Error Block

  Offset 0:   ERROR_CLASS (VAXELN Error Class)
  Offset +4:  ERROR_TYPE (VAXELN Error Type)
  Offset +8:  JOB_ID (ELN Component Job with Error)
  Offset +12: ERROR_CODE (Unique Error Designation Code)
  Offset +16: ERROR_DATA (Error Condition Specific Data)

Table 4–22 VAXELN Detected Error Block Entry Descriptions

ERROR_CLASS
VAXELN error class:
  1 - VAXELN kernel fatal error
  2 - VAXELN kernel recoverable error
  3 - VAXELN master job fatal error
  4 - VAXELN master job recoverable error
  5 - VAXELN job fatal error
  6 - VAXELN job recoverable error (reserved for future use)

ERROR_TYPE
VAXELN error type:
  1 - Hardware error
  2 - Software error
  3 - Unknown error

JOB_ID
VAXELN component job with error:
  0 - Interface module 0 driver job
  1 - Interface module 1 driver job
  2 - Interface module 2 driver job
  3 - Interface module 3 driver job
  4 - Interface module 4 driver job
  5 - Interface module 5 driver job
  6 - Interface module 6 driver job
  7 - Interface module 7 driver job
  8 - UART 0 driver job
  9 - UART 1 driver job
  10 - VAXELN master job
  13 - VAXELN FIST job
  14 - VAXELN background job
  15 - VAXELN I/O expansion module error
  17 - VAXELN kernel error

ERROR_CODE
Unique error designation code (in hexadecimal):
  9000   Watchdog timer expired
  FA03   Job initialization failed
  FA04   Job initialization timeout
  CA01   Unexpected command interrupt
  CA02   Unexpected interface module interrupt
  0      Machine check handler entered with unknown type code
  11     Floating point accelerator error
  15     Memory management - PTE in P0 space
  16     Memory management - PTE in P1 space
  17     Memory management - PTE in P0 space on M bit
  18     Memory management - PTE in P1 space on M bit
  19     Unused interrupt priority level
  1A     Microcode detected error
  80     Unknown hardware error
  10080  Bus timeout error. Read error - normal read
  20080  DAL parity error. Read error - normal read
  30080  Cache parity error. Read error - normal read
  40080  Uncorrectable read data error. Read error - normal read
  50080  DMA error. Read error - normal read
  60080  Firewall SOC miscompare. Read error - normal read
  81     Unknown hardware error. Read error - SPTE/PCB/SCB
  10081  Bus timeout error. Read error - SPTE/PCB/SCB
  20081  DAL parity error. Read error - SPTE/PCB/SCB
  30081  Cache parity error. Read error - SPTE/PCB/SCB
  40081  Uncorrectable read data error. Read error - SPTE/PCB/SCB
  50081  DMA error. Read error - SPTE/PCB/SCB
  60081  Firewall SOC miscompare. Read error - SPTE/PCB/SCB
  82     Unknown hardware error. Write error - normal write
  10082  Bus timeout error. Write error - normal write
  20082  DAL parity error. Write error - normal write
  30082  Cache parity error. Write error - normal write
  40082  Uncorrectable read data error. Write error - normal write
  50082  DMA error. Write error - normal write
  60082  Firewall SOC miscompare. Write error - normal write
  83     Unknown hardware error. Write error - SPTE/PCB
  10083  Bus timeout error. Write error - SPTE/PCB
  20083  DAL parity error. Write error - SPTE/PCB
  30083  Cache parity error. Write error - SPTE/PCB
  40083  Uncorrectable read data error. Write error - SPTE/PCB
  50083  DMA error. Write error - SPTE/PCB
  60083  Firewall SOC miscompare. Write error - SPTE/PCB
  100    Correctable read data error
  200    Polled machine bus timeout error
  201    Polled machine DAL parity error
  202    Polled machine cache parity error
  203    Polled machine uncorrectable read data error
  204    Polled machine DMA error
  205    Polled machine Firewall SOC miscompare
  206    Polled machine battery low
  400    Fatal system bugcheck
  401    Nonfatal system bugcheck
  402    Bugcheck from process
  800    Bugcheck during boot
  1      Normal successful completion
  7C04   Bad parameter count
  7C0C   Bad job or process creation
  7C14   Bad string parameter length
  7C1C   Bad access mode
  7C24   Bad stack
  7C2C   Bad object state
  7C34   Bad object type
  7C3C   Bad parameter value
  7C44   Connect circuit completed
  7C4C   Connect circuit pending
  7C54   Connect circuit timeout
  7C5C   Count overflow
  7C64   Count underflow
  7C6C   Debug signal
  7C74   Device already connected
  7C7C   Circuit disconnected by partner
  7C84   Duplicate name
  7C8C   Kernel stack not valid
  7C94   Machine check
  7C9C   No access to parameter
  7CA4   No destination port
  7CAC   No job initialization specified
  7CB4   No physical memory available
  7CBC   No I/O mapping register available
  7CC4   No message available
  7CCC   No object table entry available
  7CD4   No process page table available
  7CDC   No data path register available
  7CE4   No pool available
  7CEC   No port available
  7CF4   No exit status value specified
  7CFC   No such device
  7D04   No such name
  7D0C   No such port
  7D14   No such program
  7D1C   No such service
  7D24   No system page table entries available
  7D2C   No virtual address space available
  7D34   Power recovery signal
  7D3C   Quit signal
  7D44   Remote port value
  7D4C   Process exit signal
  7D54   Remote system currently unreachable
  7D5C   Interprocess signal
  7D64   Remote system rejected username or password
  7D6C   Bad message size
  7D74   Referenced shareable image not present
  7D7C   Unsupported program image format
  7D84   Internal consistency failure
  7D8C   Port on another BI node
  7D94   Third party disconnected circuit
  7D9C   Network is in the off state
  7DA4   No such job
  7F01   Time has not been previously set
  7F09   Expedited message
  7F11   Previous job created area
  7F19   Device already exists

ERROR_DATA
Error condition specific data. This entry is reserved for future expansion.
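The read and write error codes in Table 4–22 (80 through 60083) follow a regular pattern: the low-order part of the code selects the access type and the part above it selects the cause. The sketch below decodes that pattern. The decomposition is inferred from the table entries and is not stated by the manual, and `describe_access_error` is a hypothetical name; codes outside this family (for example 9000 or 7C04) are not handled and would still need a direct table lookup.

```python
# Cause values taken from the leading hexadecimal digit of the table entries.
CAUSES = {
    0: "Unknown hardware error",
    1: "Bus timeout error",
    2: "DAL parity error",
    3: "Cache parity error",
    4: "Uncorrectable read data error",
    5: "DMA error",
    6: "Firewall SOC miscompare",
}

# Access types taken from the trailing two hexadecimal digits of the table entries.
ACCESS_TYPES = {
    0x80: "Read error - normal read",
    0x81: "Read error - SPTE/PCB/SCB",
    0x82: "Write error - normal write",
    0x83: "Write error - SPTE/PCB",
}


def describe_access_error(error_code):
    """Return a description for codes 80..60083 (hex), or None for other codes."""
    cause = error_code >> 16        # inferred: cause sits above the low 16 bits
    access = error_code & 0xFFFF    # inferred: access type is the low 16 bits
    if cause not in CAUSES or access not in ACCESS_TYPES:
        return None
    return f"{CAUSES[cause]}. {ACCESS_TYPES[access]}"
```

For example, code 30082 hexadecimal decodes to a cache parity error on a normal write, which matches the corresponding table entry.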
4.4.5.5 Software Detected Errors

This data is provided for errors detected by the OpenVMS operating system components. Such errors are not usually detected by hardware mechanisms. The data is composed of information passed by the operating system component to the EHS. Figure 4–11 shows the format of this fault data block and the offset of each entry from the start of the block. Table 4–23 contains a brief description of each entry.

Note
If the software component which detects the module failure does not request the setting of the module ID NVRAM status code or does not request a reset of the module, then these fields will contain -1 (FFFFFFFF hexadecimal).

Figure 4–11 Software Detected Error Block

  Offset 0:  MODULE_STATUS
  Offset +4: RESET_REASON
  Offset +8: RESET_ACTION

Table 4–23 Software Detected Error Block Entry Descriptions

MODULE_STATUS
Hexadecimal module ID NVRAM status code. The following values are defined:
  0F  Excessive CPU/MEM faults
  1E  Excessive resynchronization abort errors
  2D  Double-bit error
  3C  Excessive single-bit errors
  4B  Excessive clock phase errors
  5A  Excessive CPU I/O errors
  69  Solid CPU I/O errors
  78  Excessive transient NXIO errors
  87  Solid NXIO error
  96  VAXELN kernel fatal error
  A5  The module is good
  B4  Excessive VAXELN kernel recoverable errors
  C3  VAXELN master fatal error
  D2  VAXELN master recoverable error
  E1  VAXELN job fatal error
  F0  System software detected module failure
  F1  System software detected I/O expansion module primary UART failure
  F2  System software detected I/O expansion module auxiliary UART failure
  F3  Unexpected VAXELN error detected

RESET_REASON
Hexadecimal OpenVMS reset reason code. The following values are defined:
  1   Duplex zones have diverged
  2   Fatal cross-link error has occurred
  3   Fatal zone error has occurred
  4   Fatal ATM module error has occurred
  5   Fatal CPU module error has occurred
  6   Fatal memory error has occurred
  7   Single-bit error has occurred
  8   User command issued to stop a zone
  9   Unexpected machine check has occurred
  A   Software detected failure has occurred
  B   Solid NXIO error has occurred
  C   Excessive transient I/O expansion module errors have occurred
  D   A solid I/O error has occurred
  E   Excessive transient I/O errors have occurred
  F   Excessive VAXELN kernel recoverable errors have occurred
  10  A VAXELN master fatal error has occurred
  11  A VAXELN job fatal error has occurred
  12  Not enough SPTEs could be allocated to boot the OpenVMS operating system
  13  Unexpected system error occurred
  14  Interface module has occurred
  15  Unexpected VAXELN error occurred
  16  A VAXELN kernel fatal error has occurred

RESET_ACTION
Hexadecimal console reset action code. The following values are defined:
  0      Unexpected CPU reset
  1      No diagnostic CPU reset
  2      Dispatch request CPU reset
  3      Resynchronization reset CPU reset
  4      Run diagnostic CPU reset
  5      Reconfigure console CPU reset
  6      STOP/ZONE CPU reset
  10000  Unexpected I/O reset
  10001  No diagnostic I/O reset
  10002  Dispatch request I/O reset
  10003  Z command I/O reset
  10004  Load and run (VAXELN) I/O reset
  10005  Upgrade flash ROM I/O reset
  10006  Run diagnostic I/O reset
  10007  Reconfigure console I/O reset

4.4.5.6 Unsynchable Events

This data is provided if the console reports that a zone or CPU is unsynchable when no previous error had been associated with it. The error can occur when diagnostics run on a zone which was not present in the system configuration, or after a zone has been manually removed. The data is composed of console and diagnostic status from the failed zone.

Figure 4–12 shows the format of this Fault Data block and the offset of each field from the start of the block. Table 4–24 contains a brief description of each entry.
Figure 4–12 Unsynchable Event Block

  Offset 0:  COMPAT_STS (Test Status)
  Offset +4: DIAG_STS (Diagnostic Status)

Table 4–24 Unsynchable Event Block Entry Descriptions

COMPAT_STS
System console duplex compatibility test status. This field indicates the results of the compatibility test performed by the console after diagnostics have completed. The following bits are defined:
  00  Self test failed
  01  Zone test failed
  02  System test failed
  03  ATM module self test failed
  04  Both zones have same zone ID
  05  CPU ID EEPROM is bad
  06  CPU ID EEPROM has bad OpenVMS status
  07  CPU ID EEPROM has bad firmware status
  08  CPU ID EEPROM module ID mismatches with other zone
  09  CPU ID EEPROM module name mismatches with other zone
  10  CPU ID EEPROM hardware revision not compatible with other zone
  11  CPU ID EEPROM firmware revision not compatible with other zone
  12  CPU ID EEPROM software revision not compatible with other zone
  13  ATM module ID EEPROM is bad
  14  ATM module ID EEPROM has bad OpenVMS status
  15  ATM module ID EEPROM has bad firmware status
  16  ATM module ID EEPROM module ID mismatches with other zone
  17  ATM module ID EEPROM module name mismatches with other zone
  18  ATM module ID EEPROM hardware revision not compatible with other zone
  19  ATM module ID EEPROM firmware revision not compatible with other zone
  20  ATM module ID EEPROM software revision not compatible with other zone
  21  CPU data EEPROM is bad
  22  CPU data EEPROM system wide data area mismatches with other zone
  23  CPU memory configuration mismatches with other zone
  24  Cables (cross-link/resynchronization)
  25  CPU is in burn-in mode
  26  Ethernet EEPROM mismatches with other zone
  27  CPU console firmware cannot be run in Duplex
  [31:28]  Not used

DIAG_STS
System diagnostic status longword. This field is valid when any of bits [03:00] are set in COMPAT_STS. This longword gives additional detail on the diagnostic failure indicated by those bits. The following bits are defined:
  [07:00]  Subtest number, expressed in decimal
  [15:08]  Test number, expressed in decimal
  [23:16]  Group number, expressed in decimal
  [27:24]  Diagnostic flags, expressed in hexadecimal
  [30:28]  Not used
  31       Diagnostic status is valid

4.5 Module NVRAM Status and LED Indicators

There are multiple I2C buses in a Model 810 zone which are used to provide access to NVRAMs and LEDs on each module. The system I2C bus connects all the modules in the primary backplane slots in a zone and has master controllers on the I/O ATM module. This I2C bus is used to access the NVRAMs and the LEDs on the CPU and I/O ATM modules, and the embedded primary I/O expansion module. The primary I/O expansion module has an I2C bus with a master controller and connections to each interface module to access their NVRAMs and LEDs.

When the EHS identifies a module as the source of solid or excessive transient errors, it removes the module from service. At the same time, it flags the module as failed, turns on the module LED, and writes the error code to the module NVRAM through its I2C bus. When the zone is removed for service, the LED remains on.

When repair is complete and system power is turned on, diagnostics on the CPU or I/O expansion module will examine the error code. If the OpenVMS operating system flagged the module as failed, or diagnostics fail, the diagnostics will not turn off the LED. The LED remains on until the module is replaced or the NVRAM is cleared.

Table 4–25 lists the status codes that the EHS may write into the operating system status field of the module ID NVRAM, as well as symbol names, descriptions, and affected modules. The EHS sets the module LED every time it writes one of these status codes.

Note
In the case of some catastrophic ATM failures, it may not be possible to access the I2C bus for that zone to write the code and set the LED. In such cases, diagnostics on the remote zone are relied on to report the failure.

Table 4–25 Module ID NVRAM/DCB Status Codes

Each entry gives the status code, its description, and the affected modules.

  0F - The threshold for CPU/MEM faults for this module has been exceeded. Affected modules: CPU module.
  1E - The threshold for resynch abort errors for this module has been exceeded. Affected modules: CPU module.
  2D - The module experienced a double-bit memory error. Affected modules: CPU module.
  3C - The threshold for single-bit errors for a memory SIMM has been exceeded. Affected modules: CPU module.
  4B - The zone in which this module resides has experienced excessive clock phase errors. Affected modules: ATM module.
  5A - The module has experienced excessive transient CPU I/O errors. Affected modules: ATM and I/O expansion modules.
  69 - The module has experienced a solid CPU I/O error. Affected modules: ATM and I/O expansion modules.
  78 - The module has experienced excessive transient NXIO errors. Affected modules: ATM, I/O expansion, and Interface modules.
  87 - The module has experienced a solid NXIO error. Affected modules: ATM, I/O expansion, and Interface modules.
  96 - The module has experienced a VAXELN kernel fatal error. Affected modules: I/O expansion module.
  A5 - The module is good. Affected modules: CPU, ATM, I/O expansion, and Interface modules.
  B4 - The module has experienced excessive VAXELN kernel recoverable errors. Affected modules: I/O expansion module.
  C3 - The module has experienced a VAXELN master fatal error. Affected modules: I/O expansion module.
  D2 - The module has experienced a VAXELN master recoverable error. Affected modules: Interface module.
  E1 - The module has experienced a VAXELN job fatal error. Affected modules: Interface module.
  F0 - A failure of this module has been detected by a system software component. Affected modules: ATM, I/O expansion, and Interface modules.
  F1 - A failure of the system console UART port in the SSC on the I/O expansion module has been detected by a system software component. Affected modules: ATM and I/O expansion module.
  F2 - A failure of the auxiliary UART port in the SSC on the I/O expansion module has been detected by a system software component. Affected modules: ATM and I/O expansion module.

4.6 FTSS Event Reporting Interface

The EHS externalizes events by reporting them to the event reporting interface (ERI). The ERI, in turn, passes notification of the event to the FTSS$SERVER process. The server reports the event in one of three ways:

1. Generating messages that are sent to the operator console.
2. Entering additional information into the system error log.
3. Reporting the event to an external mailbox which can be read by a user application.

4.6.1 Event Reporting Interface Routines

The EHS reports events by calling the following ERI routines located in the FTSS$CORE image.

FTSS$ZONE_AVAILABLE is called to report the availability of the other zone or CPU. This occurs when the IZC notifies the EHS that the zone has completed diagnostics and is available for use. A message code is added by the EHS and results in an OPCOM message and an error log being generated by the server.

FTSS$ERROR_REPORT is called by the EHS when an FRU is identified as the error source. This can occur as a result of a hardware or software detected failure. In this call the EHS passes error information through the ERI to the server process. The server generates the appropriate messages to the operator console and user applications, and makes entries in the error log.

4.6.2 Error Event Messages

The following messages are passed to OPCOM and the system error log by the server. Each message corresponds to an EHS error event and contains information that identifies the FRU.

FTSS$_CABLEGONE, cross-link cable fault detected
Facility: FTSS
Explanation: The cross-link cable has been isolated as the cause of a system failure. One zone will be removed from service by the operating system. For transient failures, the error will be compared to its error rate threshold. If the threshold is not exceeded, the zone will be resynchronized when it completes diagnostics.
User Action: If the zone is automatically resynchronized, no action is required on the part of the user. If the zone is not automatically resynchronized, the system error log should be examined for entries which correspond to the cross-link cable failure. These entries will identify an FRU.

FTSS$_CLOCK_END, Clock fault end action complete
Facility: FTSS
Explanation: Error processing for a clock fault has been completed and the zone is available to be resynchronized.
User Action: If the zone is automatically resynchronized by FTSS, then no action is needed on the part of the user. If the zone is not resynchronized, the system error log should be examined for entries which correspond to the clock fault. These error logs will identify an FRU.

FTSS$_CLOCK_ENDTMO, Clock fault end action timeout on zone [zone_id]
Facility: FTSS
Explanation: When a clock fault occurs in a non-Simplex system, diagnostics normally run on the failed zone and, upon completion, report status back to the zone running the operating system. If this end action does not occur within a reasonable timeout period, the failure will be treated as solid and the zone will not be automatically resynchronized by FTSS.
User Action: The system error log should be examined for entries which correspond to the clock fault and the end action timeout. These entries will indicate an FRU.

FTSS$_CLOCKFLT, Clock fault detected on [module_id] in slot [slot_id], zone [zone_id]
Facility: FTSS
Explanation: The clocks in each of the two zones operate in phase lock. When this synchronization is lost, lockstep operation of the zones is lost. The error is compared to its error rate threshold. If the threshold is exceeded, the zone is not automatically resynchronized by FTSS.
User Action: If the removed zone is automatically resynchronized after running diagnostics, no action is needed on the part of the user.
If the zone is not automatically resynchronized, the system error log should be examined for entries which correspond to the clock fault. These entries will identify an FRU which must be replaced.

FTSS$_CPMF_END, CPU/MEM fault end action complete
Facility: FTSS
Explanation: Error processing for a CPU/MEM fault has been completed and the CPU is available to be resynchronized.
User Action: If the CPU is automatically resynchronized by FTSS, then no action is needed on the part of the user. If the CPU is not resynchronized, the system error log should be examined for entries which correspond to the CPU/MEM fault. These error logs will identify an FRU.

FTSS$_CPMF_ENDTMO, CPU/MEM fault end action timed out on zone [zone_id]
Facility: FTSS
Explanation: When a CPU/MEM fault occurs in a Duplex system, diagnostics normally run on the failed CPU and, upon completion, report status back to the zone running the operating system. If this end action does not occur within a reasonable timeout period, the failure will be treated as solid and the CPU will not be automatically resynchronized by FTSS.
User Action: The system error log should be examined for entries which correspond to the CPU/MEM fault and the end action timeout. These entries will indicate an FRU.

FTSS$_CPUDBE, Double-bit memory fault detected on [module_id] in slot [slot_id], zone [zone_id]
Facility: FTSS
Explanation: A double-bit memory error has occurred. This indicates a solid memory failure. This error will only be reported in a Duplex system and a CPU module will be removed from service when it occurs.
User Action: The system error log should be examined for entries which correspond to the double-bit error. These logs will indicate the SIMM memory row which must be replaced.
FTSS$_CPUSBE, A single-bit memory fault detected on [module_id] in slot [slot_id], zone [zone_id]
Facility: FTSS
Explanation: A recoverable single-bit memory error has been detected and handled by the operating system. These transient errors are repaired in memory and compared to their error rate threshold. In a Duplex system, a CPU module will be removed from service if the threshold is exceeded.
User Action: In most cases, no action by the user is necessary. If the rate of single-bit errors becomes excessive, replacement of a SIMM memory row or CPU module will be required. The system error log should be examined for the entries which correspond to the single-bit errors.

FTSS$_CPUMEMFLT, CPU/MEM fault detected on [module_id] in slot [slot_id], zone [zone_id]
Facility: FTSS
Explanation: A CPU/MEM fault in a Duplex system has been detected. This results in the temporary removal of that CPU from service. This error is compared to its error rate threshold. If the threshold is not exceeded and the CPU completes diagnostics successfully, the CPU will be automatically resynchronized. If the threshold is exceeded or diagnostics fail, the CPU will not be automatically resynchronized.
User Action: If the CPU is automatically resynchronized after the completion of diagnostics, no action is required on the part of the user. If the CPU is not automatically resynchronized, the system error log should be examined for entries which correspond to the CPU/MEM fault. These entries will indicate an FRU.

FTSS$_CPUUNSYNC, [module_id] in slot [slot_id], zone [zone_id] is unsynchable
Facility: FTSS
Explanation: When a CPU completes diagnostics with failure and reports this status to the zone running the operating system, this message is generated. The CPU with the failure will not be automatically resynchronized by FTSS.
User Action: The system error log should be examined for the entry which corresponds to the unsynchable event. This entry will indicate an FRU.
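The resynchronization rule described in the FTSS$_CPUMEMFLT explanation above (resynchronize only if the error rate threshold is not exceeded and diagnostics pass) can be illustrated with a short Python sketch. This is not FTSS code. The class name, the function name, the sliding-window model, and the limit and window values are all assumptions made for illustration, because the manual does not state how the threshold is computed.

```python
from collections import deque


class FaultRateTracker:
    """Counts faults inside a sliding time window.

    Hypothetical model: the limit and window length are made-up values,
    not the thresholds used by the EHS.
    """

    def __init__(self, limit: int = 3, window_s: float = 3600.0):
        self.limit = limit
        self.window_s = window_s
        self._times = deque()

    def record(self, now_s: float) -> bool:
        """Record one fault; return True if the threshold is now exceeded."""
        self._times.append(now_s)
        # Drop faults that have aged out of the window.
        while self._times and now_s - self._times[0] > self.window_s:
            self._times.popleft()
        return len(self._times) > self.limit


def should_auto_resync(threshold_exceeded: bool, diagnostics_passed: bool) -> bool:
    """Resynchronize only when the threshold holds AND diagnostics pass."""
    return (not threshold_exceeded) and diagnostics_passed
```

In this sketch, a CPU whose fault pushes the tracker over its limit is left out of service even if its diagnostics pass, which matches the behavior the explanation describes.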
FTSS$_DBE_END, DBE end action complete
Facility: FTSS
Explanation: Error processing for a double-bit memory error has been completed and the CPU is available to be resynchronized.
User Action: The system error log should be examined for entries which correspond to the double-bit error. These error logs will identify an FRU.

FTSS$_DBE_ENDTMO, DBE end action timed out on zone [zone_id]
Facility: FTSS
Explanation: When double-bit memory errors occur in a Duplex system, diagnostics run on the failed CPU and, upon completion, report status back to the zone running the operating system. If this end action does not occur within a reasonable timeout period, the failure will be treated as solid and the CPU will not be automatically resynchronized by FTSS.
User Action: The system error log should be examined for entries which correspond to the double-bit error and the end action timeout. These entries will indicate an FRU.

FTSS$_DIV_END, zone divergence end action complete
Facility: FTSS
Explanation: Error processing for a zone divergence error has been completed and the zone is available to be resynchronized.
User Action: If the zone is automatically resynchronized by FTSS, then no action is needed on the part of the user. If the zone is not resynchronized, the system error log should be examined for entries which correspond to the zone divergence error. These error logs will identify an FRU.

FTSS$_DIV_ENDTMO, zone divergence end action timed out on zone [zone_id]
Facility: FTSS
Explanation: When zones diverge in a Duplex system, diagnostics run on the removed zone and, on completion, report status to the zone running the OpenVMS operating system. If this end action does not occur within a reasonable timeout period, the failure will be treated as solid and the zone will not be automatically resynchronized by FTSS.
User Action: The system error log should be examined for entries which correspond to the zone divergence and the end action timeout.
These entries will indicate an FRU.

FTSS$_DIVERGED, A synchronized, dual zone configuration has diverged
Facility: FTSS
Explanation: Lockstep operation between the two zones of a Duplex system has been lost. One of the zones is temporarily removed from service. The error is compared to its error rate threshold. If the threshold is not exceeded, the zone will be automatically resynchronized by FTSS after successfully completing diagnostics. If the threshold is exceeded or diagnostics fail, the zone is not automatically resynchronized.
User Action: If the zone is automatically resynchronized, no action is necessary on the part of the user. If the zone is not automatically resynchronized, the system error log should be examined for entries which correspond to the zone divergence error. These entries will indicate an FRU.

FTSS$_ELNJOBFATAL, VAXELN job fatal error detected on [module_id] in slot [slot_id], zone [zone_id]
Facility: FTSS
Explanation: A VAXELN job running on an I/O Expansion module has detected a fatal error and has terminated. This error results in the removal of the associated Interface module from the system.
User Action: The system error log should be examined for entries which correspond to the VAXELN job fatal error. These entries will indicate an FRU.

FTSS$_ELNJOBRECOV, VAXELN job recoverable error detected on [module_id] in slot [slot_id], zone [zone_id]
Facility: FTSS
Explanation: A VAXELN job running on an I/O Expansion module has detected a recoverable error. These errors are compared to their error rate threshold by the operating system. If the threshold is exceeded in a non-Simplex system, the associated Interface module is removed from the system.
User Action: If the threshold is not exceeded, no action is required on the part of the user. If the threshold is exceeded, the system error log should be examined for entries which correspond to the VAXELN job recoverable error.
These entries will indicate an FRU.

FTSS$_ELNKERFATAL, VAXELN kernel fatal error detected on [module_id] in slot [slot_id], zone [zone_id]
Facility: FTSS
Explanation: The VAXELN kernel running on an I/O Expansion module has detected a fatal error and has terminated. This error results in the removal of the indicated I/O Expansion module and associated Interface modules from the system configuration.
User Action: The system error log should be examined for entries which correspond to the VAXELN kernel fatal error. These entries will indicate an FRU.

FTSS$_ELNKERRECOV, VAXELN kernel recoverable error detected on [module_id] in slot [slot_id], zone [zone_id]
Facility: FTSS
Explanation: The VAXELN kernel running on an I/O Expansion module has detected a recoverable error. These errors are compared to their error rate threshold by the operating system. If the threshold is exceeded in a non-Simplex system, the indicated I/O Expansion module and associated Interface modules are removed from service.
User Action: If the threshold is not exceeded, no action is required on the part of the user. If the threshold is exceeded, the system error log should be examined for entries which correspond to the VAXELN kernel recoverable errors. These entries will indicate an FRU.

FTSS$_ELNMASFATAL, VAXELN master job fatal error detected on [module_id] in slot [slot_id], zone [zone_id]
Facility: FTSS
Explanation: The VAXELN master job running on an I/O Expansion module has detected a fatal error and has terminated. This error results in the removal of the indicated I/O Expansion module and associated Interface modules from the system configuration.
User Action: The system error log should be examined for entries which correspond to the VAXELN master job fatal error. These entries will indicate an FRU.
FTSS$_ELNMASRECOV, VAXELN master job recoverable error detected on [module_id] in slot [slot_id], zone [zone_id]
Facility: FTSS
Explanation: The VAXELN master job running on an I/O Expansion module has detected a recoverable error. These errors are compared to their error rate threshold by the operating system. If the threshold is exceeded in a non-Simplex system, the indicated I/O Expansion module and associated Interface modules are removed from service.
User Action: If the threshold is not exceeded, no action is required on the part of the user. If the threshold is exceeded, the system error log should be examined for entries which correspond to the VAXELN master job recoverable errors. These entries will indicate an FRU.

FTSS$_JXDDBE, Double-bit memory fault detected on [module_id] in slot [slot_id], zone [zone_id]
Facility: FTSS
Explanation: A double-bit memory error has occurred. This indicates a solid memory failure. In a Duplex system, a CPU module will be removed from service when this error occurs.
User Action: The system error log should be examined for entries which correspond to the double-bit error. These logs will indicate the SIMM memory row which must be replaced.

FTSS$_JXDSBE, Single-bit memory fault detected on [module_id] in slot [slot_id], zone [zone_id]
Facility: FTSS
Explanation: A recoverable single-bit memory error has been detected and handled by the operating system. These transient errors are repaired in memory, and the errors are compared to their error rate threshold. In a Duplex system, a CPU module will be removed from service if the threshold is exceeded.
User Action: In most cases, no action by the user is necessary. If the rate of single-bit errors becomes excessive, replacement of a SIMM memory row will be required. The system error log should be examined for the entries which correspond to the single-bit errors.
FTSS$_POWERGONE, Power gone fault detected on zone [zone_id]
Facility: FTSS
Explanation: Power has been lost in one of the zones. This error is compared to its error rate threshold. If the threshold is not exceeded, the zone will be automatically resynchronized when power returns.
User Action: If power is restored and the zone is automatically resynchronized, no action is required on the part of the user. If power is restored and the zone is not automatically resynchronized, the user should examine the external system power source.

FTSS$_RESYNCHFLT, Resynch abort fault detected on [module_type] in slot [slot_id], zone [zone_id]
Facility: FTSS
Explanation: During an attempt to resynchronize a CPU/Memory module, an error occurred on the master CPU module. This error is compared to its error rate threshold by the operating system. If the threshold is not exceeded, FTSS will retry the resynchronization process. When the threshold is exceeded, attempts to resynchronize will be terminated.
User Action: If the resynchronization retry is successful, no action is required on the part of the user. If the threshold for retries is exceeded, the system error log should be examined for entries which correspond to the resynch abort failure. These entries will indicate an FRU.

FTSS$_SBE_END, SBE end action complete
Facility: FTSS
Explanation: Error processing for a single-bit memory error has been completed and the CPU is available to be resynchronized.
User Action: If the CPU is automatically resynchronized by FTSS, then no action is needed on the part of the user. If the CPU is not resynchronized, the system error log should be examined for entries which correspond to the single-bit error. These error logs will identify an FRU.
FTSS$_SBE_ENDTMO, SBE end action timed out on zone [zone_id]
Facility: FTSS
Explanation: When single-bit memory errors occur in a Duplex system, diagnostics run on the failed CPU and, on completion, report status back to the zone running the operating system. If this end action does not occur within a reasonable timeout period, the failure will be treated as solid and the CPU will not be automatically resynchronized by FTSS.
User Action: The system error log should be examined for entries which correspond to the single-bit error and the end action timeout. These entries will indicate an FRU.

FTSS$_SOLIDIOMOD, Solid I/O fault detected on [module_type] in slot [slot_id], zone [zone_id]
Facility: FTSS
Explanation: A fatal I/O miscompare error was detected and attributed to the indicated module. The module is removed from service by the operating system.
User Action: The system error log should be examined for entries which correspond to the I/O miscompare errors. These entries will indicate an FRU.

FTSS$_SOLIDNXIO, Solid NXIO fault detected on [module_type] in slot [slot_id], zone [zone_id]
Facility: FTSS
Explanation: A fatal nonexistent I/O error has occurred when accessing the indicated I/O module. The module is removed from service by the operating system.
User Action: The system error log should be examined for entries which correspond to the nonexistent I/O error. These entries will indicate an FRU.

FTSS$_SOLIDIOXLNK, Solid I/O fault detected on the cross-link
Facility: FTSS
Explanation: A fatal I/O miscompare error was detected and attributed to the cross-link. One zone is selected and is removed from service by the operating system.
User Action: The system error log should be examined for entries which correspond to the I/O miscompare errors. These entries will indicate an FRU.
FTSS$_SOLIDIOZONE, Solid I/O fault detected on zone [zone_id]
Facility: FTSS
Explanation: A fatal I/O miscompare error was detected and attributed to the indicated zone. The zone is removed from service by the operating system.
User Action: The system error log should be examined for entries which correspond to the I/O miscompare errors. These entries will indicate an FRU.

FTSS$_SWMODERR, Software detected failure on [module_type] in slot [slot_id], zone [zone_id]
Facility: FTSS
Explanation: A system software component has detected the failure of a system module. In most cases, these errors indicate the failure of an I/O module which was detected by a device driver and not reported by a system error interrupt. These errors indicate a fatal failure of the indicated module and it is removed from service.
User Action: The system error log should be examined for entries which correspond to the software detected module failure. These entries will indicate an FRU.

FTSS$_SWZONERR, Software detected failure on zone [zone_id]
Facility: FTSS
Explanation: A system software component has detected the failure of a zone. This error indicates a fatal failure of the indicated zone and it is removed from service.
User Action: The system error log should be examined for entries which correspond to the software detected zone failure. These entries will indicate an FRU.

FTSS$_TRNSIOMOD, Transient I/O fault detected on [module_type] in slot [slot_id], zone [zone_id]
Facility: FTSS
Explanation: A transient I/O miscompare error was detected and attributed to the indicated module. These errors are compared to their error rate threshold. If the threshold is exceeded and the system mode is not Simplex, the module is removed from service.
User Action: If the threshold is not exceeded and the module is not removed from service, no action is needed on the part of the user.
If the module is removed from service, the system error log should be examined for entries which correspond to the I/O miscompare errors. These entries will indicate an FRU.

FTSS$_TRNSNXIO, Transient NXIO fault detected on [module_type] in slot [slot_id], zone [zone_id]
Facility: FTSS
Explanation: A transient nonexistent I/O error was detected when accessing the indicated module. These errors are compared to their error rate threshold. If the threshold is exceeded and the system mode is not Simplex, the module is removed from service.
User Action: If the threshold is not exceeded and the module is not removed from service, no action is needed on the part of the user. If the module is removed from service, the system error log should be examined for entries which correspond to the nonexistent I/O errors. These entries will indicate an FRU.

FTSS$_TRNSIOXLNK, Transient I/O fault detected on the cross-link
Facility: FTSS
Explanation: A transient I/O miscompare error was detected and attributed to the cross-link. These errors are compared to their error rate threshold. If the threshold is exceeded and the system mode is not Simplex, then one zone is removed from service.
User Action: If the threshold is not exceeded and a zone is not removed from service, no action is needed on the part of the user. If a zone is removed from service, the system error log should be examined for entries which correspond to the I/O miscompare errors. These entries will indicate an FRU.

FTSS$_TRNSIOZONE, Transient I/O fault detected on zone [zone_id]
Facility: FTSS
Explanation: A transient I/O miscompare error was detected and attributed to the indicated zone. These errors are compared to their error rate threshold. If the threshold is exceeded and the system mode is not Simplex, the zone is removed from service.
User Action: If the threshold is not exceeded and the zone is not removed from service, no action is needed on the part of the user.
If the zone is removed from service, the system error log should be examined for entries which correspond to the I/O miscompare errors. These entries will indicate an FRU.

FTSS$_ZONEHALT, Zone Halt fault detected on zone [zone_id]
Facility: FTSS
Explanation: A single zone of a Duplex system has been halted. This can be caused by a user command on the system console or by a system error.
User Action: If the Halt was caused by a user command on the system console, a START/ZONE command must be executed to restore the zone to service. If the Halt was not caused by a user command, the system error log should be examined for entries which correspond to the zone halt error. These entries will identify an FRU.

FTSS$_ZONEUNSYNC, Zone [zone_id] is unsynchable
Facility: FTSS
Explanation: When a zone completes diagnostics with failure and reports this status to the zone running the operating system, this message is generated. The zone with the failure will not be automatically resynchronized by FTSS.
User Action: The system error log should be examined for the entry which corresponds to the unsynchable event. This entry will indicate an FRU.

4.6.2.1 Deconfiguration Messages

The following messages can be passed to OPCOM and the system error log file by the FTSS$SERVER at the request of EHS. Each message corresponds to a deconfiguration activity performed by EHS. Each message contains information (through FAO arguments) that identifies the entity deconfigured by EHS.

FTSS$_DECONFIG_ATMIO, I/O expansion subsystem on I/O attachment module in slot [slot_id], zone [zone_id] has been removed from service
Facility: FTSS
Explanation: Due to one or more system errors, the I/O expansion subsystem on the indicated I/O ATM and its associated Interface modules have been removed from service.
User Action: The system error log should be examined for entries which correspond to the removal of the I/O expansion subsystem. These entries will indicate an FRU.
FTSS$_DECONFIG_CPUMOD, CPU module in slot [slot_id], zone [zone_id] has been removed from service
Facility: FTSS
Explanation: Due to one or more system errors, the indicated CPU module has been removed from service. In some cases, the CPU may be automatically resynchronized by FTSS when it successfully completes the execution of diagnostics.
User Action: If the CPU is automatically resynchronized by FTSS after completing diagnostics, no action is required on the part of the user. If the CPU is not automatically resynchronized, the system error log should be examined for entries which relate to the removal of the CPU. These entries will indicate an FRU.

FTSS$_DECONFIG_EXMOD, I/O expansion module in slot [slot_id], zone [zone_id] has been removed from service
Facility: FTSS
Explanation: Due to one or more system errors, the indicated I/O Expansion module and its associated Interface modules have been removed from service.
User Action: The system error log should be examined for entries which correspond to the removal of the I/O expansion module. These entries will indicate an FRU.

FTSS$_DECONFIG_INTMOD, Interface module in slot [slot_id], zone [zone_id] has been removed from service
Facility: FTSS
Explanation: Due to one or more system errors, the indicated Interface module has been removed from service.
User Action: The system error log should be examined for entries which correspond to the removal of the Interface module. These entries will indicate an FRU.

FTSS$_DECONFIG_ZONE, Zone [zone_id] has been removed from service
Facility: FTSS
Explanation: Due to one or more system errors, the indicated zone has been removed from service. In some cases, the zone may be automatically resynchronized by FTSS when it successfully completes the execution of diagnostics.
User Action: If the zone is automatically resynchronized by FTSS after completing diagnostics, no action is required on the part of the user.
If the zone is not automatically resynchronized, the system error log should be examined for entries which relate to the removal of the zone. These entries will indicate an FRU.

4.7 Firmware Interfaces

The EHS interacts with three firmware-based software entities: system console and diagnostics, I/O expansion module console and diagnostics, and the I/O expansion module VAXELN software. The system console and diagnostics and I/O expansion module console and diagnostics interfaces are discussed in the following sections.

4.7.1 System Console and Diagnostics

The EHS communicates with the system console through:

• System hardware resets combined with flags in the console communications area (CCA)
• CCA fields referenced using the IZC service

4.7.1.1 System Resets

When the EHS determines that a zone or CPU should be removed from the configuration, it forces a reset on the CPU. The reset results in the system console being invoked from serial ROM by the hardware. When the system console runs, it attempts to determine the reason for the reset, which in turn may determine the actions performed by the console. The EHS uses the fields in the CCA reset dispatch block (at offset CCA560$R_RESET_BLOCK) to pass reset reason codes to the console. The fields are:

RDB$L_RESET_CODE - The reset reason code. This longword field is actually composed of two one-word fields:

• RDB$W_ACTION - The reset action. This word instructs the console on the action that needs to be taken. The reset action codes used by the EHS are described in Table 4–26.
• RDB$W_REASON - The reset reason. This field is additional data supplied by the OpenVMS operating system which indicates the reason for the reset. The code is printed in hex on the operator console after the reset action is completed. The reset reason codes used by the EHS are described in Table 4–27.

RDB$L_REASON_VALID - The 1’s complement of the reset reason code longword.
RDB$L_DISPATCH - This field is used only if the system console is to continue the OpenVMS operating system after completing reset actions. In all reset cases by the EHS, it will be 0.

Table 4–26 System Reset Action Codes
Decimal Value | Description
1 | This code will cause the system console to enter its halt loop, which will establish IZC to the other zone, without invoking any diagnostics. Currently, this reset action is requested only when the EHS is handling a single-bit error.
4 | This code will cause the system console to invoke diagnostics. The diagnostics which run depend on the cross-link mode at the time. Following diagnostics, the system console will enter its halt loop, and establish IZC to the other zone. The code is used when a zone or CPU is being removed due to a system error.
6 | This code will cause the same actions as CPURESET$K_DIAGS. This code is used when a zone is being removed by operator action (that is, a user command).

Table 4–27 System Reset Reason Codes
Decimal Value | Description
1 | When the EHS detects zone divergence, it selects one zone to continue the OpenVMS operating system and one zone to stop. Note that the OpenVMS operating system is not indicating an error in this zone; it must stop one of the two.
2 | When the EHS isolates a failure to the cross-link cable (for example, a cable gone error), it will reset one zone using this reason type.
3 | When the EHS detects a fault in a zone that cannot be isolated to a single module, it will reset the zone with this reason type. Usually, such errors are the result of backplane failures.
4 | The OpenVMS operating system will use this reset with an IO ATM module failure. Before this reset, the operating system will write an error code to the module ID EEPROM through the I2C bus.
5 | The OpenVMS operating system will use this to reset a CPU module after determining that it has failed. Before the reset, the OpenVMS operating system will write an error code to the module ID EEPROM through the I2C bus.
6 | The OpenVMS operating system will use this to reset a CPU module after determining that its memory has failed.
7 | An SBE was detected by the CPU in Duplex mode. CPU lockstep between zones is lost on this event and it should be reestablished as soon as possible. This code is used in conjunction with the CPURESET$K_NO_DIAGS reset action code.
8 | This code is used as a result of a user-issued command to remove a zone from service.
9 | A fatal system machine check error has occurred.
10 | A system software component detected a failure of this module.

Table 4–28 lists the events which might cause the EHS to issue the reset, and the cross-link modes under which the reset might be issued.

Table 4–28 Error Handler Reset Reasons
Event | Possible Cross-Link Modes
Double-Bit Error | OFF, MASTER
Single-Bit Error | SLAVE
Cross-Link Cable Failure | OFF
Clock Phase Errors | OFF
I/O Errors | OFF, MASTER, SLAVE
Zone Divergence | OFF
Single-Bit Error | SLAVE
User Command | OFF, SLAVE

4.7.1.2 CCA Fields

When a CPU or zone completes diagnostics, it enters its halt loop, which reports its status to the OpenVMS operating system in the other zone through the IZC service. The IZC service will in turn call the OpenVMS operating system to report the availability of the other zone. The operating system requires the following information to be available from the console in the other zone:

• The IZC message to the operating system will contain a synchability status. If the status is unsynchable, the OpenVMS operating system will examine the CCA in the console zone. The field CCA560$L_COMPAT_STATUS will contain a reason mask which describes the reasons that the zone is not synchable. This information will be entered into the system error log.
If the reason mask indicates a diagnostic failure, the CCA560$Q_DIAG_STATUS field will contain additional information on the failure. The EHS will use the IZC service to read this information for entry into the system error log.
• The EHS uses the IZC service to read system register information from the CCA of the other zone starting at offset CCA560$R_REG_BLOCK. The registers in this block were written by the EHS when the original error occurred. The console must preserve this area through all resets and during diagnostic execution whenever possible (some catastrophic failures will prevent this).

4.7.2 I/O Expansion Module Console and Diagnostics

When the EHS determines that an I/O expansion module should be removed from the configuration, it forces an I/O hard reset on the module. This results in the I/O expansion module console being invoked by hardware. When the console runs, it attempts to determine the reason for the reset, which in turn may determine the actions performed by the diagnostics.

The EHS uses fields in the NCA reset dispatch block (at offset NCA560$L_RESET_BLOCK) to pass reset reason codes to the diagnostics. The fields are:

RDB$L_RESET_CODE - The reset reason code. This longword field is composed of two 1-word fields:
• RDB$W_ACTION - The system reset action. This word instructs the console on the action that needs to be taken. The only reset action code used by the EHS is shown in Table 4–29.
• RDB$W_REASON - The reset reason. This field is additional data supplied by the operating system which indicates the reason for the reset. The reset reason codes used by the EHS are shown in Table 4–30.

RDB$L_REASON_VALID - The 1's complement of the reset reason code longword.

RDB$L_DISPATCH - This field is used only if the console is to continue the operating system after completing reset actions. In all cases of I/O resets by the EHS, it will be 0.
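The relationship between the reset code longword and its 1's complement can be illustrated with a short Python sketch. This is not EHS or console code. It assumes that RDB$W_ACTION occupies the low-order word and RDB$W_REASON the high-order word of RDB$L_RESET_CODE (the layout Table 4–33 gives for the CCA dispatch block); the function names and the dictionary representation are hypothetical.

```python
# Illustrative sketch only; names and word order are assumptions, not firmware code.
ACTION_IO_DIAGS = 6          # Table 4-29: invoke diagnostics, then enter halt loop
REASON_SOLID_NXIO = 11       # Table 4-30: solid NXIO error


def pack_reset_code(action: int, reason: int) -> int:
    """Combine two 16-bit words into one longword (action assumed in the low word)."""
    if not (0 <= action <= 0xFFFF and 0 <= reason <= 0xFFFF):
        raise ValueError("action and reason must each fit in one word")
    return (reason << 16) | action


def ones_complement32(value: int) -> int:
    """Return the 32-bit 1's complement of value."""
    return ~value & 0xFFFFFFFF


def build_reset_block(action: int, reason: int) -> dict:
    """Build the three fields described in the text; dispatch is 0 for EHS resets."""
    code = pack_reset_code(action, reason)
    return {
        "reset_code": code,
        "reason_valid": ones_complement32(code),
        "dispatch": 0,
    }


def reset_block_is_valid(block: dict) -> bool:
    """A block is treated as valid when the complement field matches the code."""
    return block["reason_valid"] == ones_complement32(block["reset_code"])


def unpack_reset_code(code: int) -> tuple:
    """Split the longword back into (action, reason)."""
    return code & 0xFFFF, (code >> 16) & 0xFFFF


block = build_reset_block(ACTION_IO_DIAGS, REASON_SOLID_NXIO)
print(hex(block["reset_code"]), hex(block["reason_valid"]))
```

Storing the complement alongside the code lets the reader of the block distinguish a deliberately written reset request from leftover or corrupted memory contents.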
Error Handling and Analysis 4–53 Table 4–29 I/O Reset Action Code Description Decimal Value Description 6 This reset code will cause the I/O expansion module console to invoke diagnostics. The diagnostics which run depend upon the mode of the cross-link at the time. After diagnostics, console will enter its halt loop. Table 4–30 I/O Reset Reason Code Descriptions Decimal Value Description 11 The module has experienced a solid NXIO error. 12 The module has experienced excessive transient NXIO errors. 13 The module has experienced a solid I/O miscompare error. 14 The module has experienced excessive transient I/O miscompare errors. 15 The module has experienced excessive VAXELN kernel recoverable errors. 16 The module has experienced a VAXELN master fatal error. 4.8 Firmware and OpenVMS Interface Data Structures Figure 4–13 shows the OpenVMS operating system and firmware data structure memory map. The following sections describe the data structures used by the console: • Console Communication Area (CCA) • Device Configuration Block (DCB) • Page Frame Number Bitmap (PFN) The firmware constructs, initializes, and shares the data structures with the OpenVMS operating system. Figure 4–13 Firmware and OpenVMS Data Structure Memory Map Page Frame Number (PFN) Bitmap Zone A Sub−Device Configuration Block (SubDCB) Zone A Device Configuration Block (DCB) Zone B Sub−Device Configuration Block (SubDCB) Zone A Device Configuration Block (DCB) Console Communications Area (CCA) Remainder of Main Memory MR−0019−93RAGS 4–54 Error Handling and Analysis 4.8.1 Console Communications Area The console communications area (CCA) is the main data structure used by the console to interface with the OpenVMS operating system. Table 4–31 describes the CCA components. Table 4–31 CCA Component Descriptions Parameter Size Description CCA size 2 bytes Size of the CCA in bytes. Initialized by firmware. CCA revision 1 byte Revision of the CCA. Initialized by firmware. 
CCA base 4 bytes Physical address of the CCA. Initialized by firmware.
Header flags 4 bytes CCA flags. Field breakdown by bit:
  • 00 = Bootstrap in progress. Set by firmware when bootstrap operation is started. Cleared by the OpenVMS operating system. Used to control the bootstrap operation.
  • 01 = Restart in progress. Set by firmware when restart operation is started. Cleared by the OpenVMS operating system. Used to control the restart operation.
  • 02 = Automatic bootstrap. Set by firmware when a manual bootstrap occurred.
  • 03 = Reboot in progress. Set by the OpenVMS operating system when a bootstrap operation is requested by the operating system using the default boot specification.
  • 04 = Failsafe mode. Set by firmware to indicate that the zone is in Failsafe mode. (Failsafe mode refers to the method used for bootstrapping.)
  • 05 = Synchable status. Set by firmware to indicate that the zone is synchable (Duplex compatibility test passed). If the bit is clear, the test failed. Use the Duplex compatibility test results component to obtain the reason for failure.
  • 06 = Halted from bootstrap. Set by VMB to indicate to the firmware that it is not to report a bootstrap error. This bit overrides the state of the bootstrap in progress bit (bit 00) with respect to handling errors during the bootstrap operation.
  • [31:07] = Reserved for firmware use.
Bootability test results 4 bytes Results of the bootstrap test. Written by the firmware. Field breakdown by bit:
  • 00 = CPU/ATM check. Set when the CPU and ATM are good.
  • 01 = Cable state. Set when cables are present and good.
  • 02 = Other zone power state. Set when the power is on in the other zone.
  • 03 = Other zone OpenVMS operating system state. Set when the other zone is running the OpenVMS operating system.
  • 04 = Other zone CPU/ATM check. Set when the CPU and ATM in the other zone are good.
  • [31:07] = Reserved for firmware use.
PFN bitmap address 4 bytes Physical address of the PFN bitmap. Initialized by firmware.
PFN bitmap size 4 bytes Size of the PFN bitmap in bytes. Initialized by firmware.
PFN bitmap checksum 4 bytes Checksum of the PFN bitmap. Checksum = integer sum of all bytes in the PFN bitmap.
System serial number 12 bytes System serial number. 12 ASCII characters. Initialized by firmware. Copied from the CPU module data EEPROM.
Zone A DCB offset 4 bytes Offset to the Zone A DCB. Offset is the byte offset (signed) from the CCA base. Initialized by firmware.
Zone A DCB size 4 bytes Size in bytes of the DCB for Zone A. The size includes the DCB and any SubDCBs for Zone A. Initialized by firmware.
Zone B DCB offset 4 bytes Offset to the Zone B DCB. Offset is the byte offset (signed) from the CCA base. Initialized by firmware.
Zone B DCB size 4 bytes Size in bytes of the DCB for Zone B. The size includes the DCB and any SubDCBs for Zone B. Initialized by firmware.
Diagnostic status 8 bytes Results of the diagnostic tests. Initialized by firmware. Breakdown of the status fields:
  • [07:00] = Error number
  • [15:08] = Subtest number
  • [23:16] = Test number
  • [27:24] = Group number
  • [30:28] = Diagnostic flags. For firmware use only.
  • 31 = Set when bits 27:00 indicate a valid failure code.
  The high-order four bytes are reserved for firmware.
Duplex compatibility test results 4 bytes Results of the compatibility test. Written by firmware. See Section 4.8.1.1 for the test descriptions and fault codes.
Reset dispatch block 16 bytes Used by firmware and the OpenVMS operating system to notify the firmware how to handle a reset entry to firmware. See Section 4.8.1.2 for dispatch block description.
Boot parameter table 164 bytes Boot parameter table. Initialized by firmware. See Section 4.8.1.3 for the description.
Saved register block 132 bytes Register block saved by the OpenVMS operating system on a CPU/MEM fault. Initialized and used by the operating system. Reserved 64 bytes Reserved for future expansion. 4.8.1.1 Duplex Compatibility Test On firmware entry, the console program verifies a number of conditions that are required for system operation in Duplex mode. These conditions determine if the zone is synchable, that is, able to join a partner zone in Duplex operation. The IZC protocol is used by the console program to execute the Duplex compatibility test. Once the console establishes the IZC service, it executes the test and notifies the other zone of the results. A zone is considered synchable if it passes the test. The compatibility test is responsible for storing the results in the CCA. The following items are test parameters. • Diagnostic status: CPU self-test passes CPU zone test passes Primary I/O expansion module self-test passes CPU system test does not fail (not run assumes a passed condition) • Zone identification: One Zone A, one Zone B. 
Error Handling and Analysis 4–57 • CPU module ID EEPROM: Valid checksum OpenVMS and firmware status byte is good Module ID and module name compatible with other zone Module hardware revision compatible with other zone (major) Firmware and software revisions compatible with other zone (major) • I/O ATM module ID EEPROM: Valid checksum OpenVMS and firmware status byte is good Module ID and module name compatible with other zone Module hardware revision compatible with other zone (major) Firmware and software revisions compatible with other zone (major) • CPU module data EEPROM: Valid checksum System data area must be the same in both zones • Memory restrictions for synchronization: Same memory configuration on both zones • Cross-link and resynch cables functional • Operational modes must be compatible (that is, burnin state) • Ability of the CPU console firmware to run in cross-link in Duplex mode Table 4–32 lists the test failure codes. Each bit represents the results of checking the given condition. The test will attempt to check all conditions, and updates the bits as it performs the test (set bit indicates failure). Table 4–32 Duplex Compatibility Test Failure Codes Failure Code Bit Number Code Description 00 CPU self-test failed 01 CPU zone test failed 02 CPU system test failed 03 ATM self-test failed 04 Both zones have the same zone ID 05 CPU ID EEPROM is bad 06 CPU ID EEPROM OpenVMS status field shows module is bad 07 CPU ID EEPROM firmware status field shows module is bad 08 CPU ID EEPROM module type field mismatches between zones 09 CPU ID EEPROM module name field mismatches between zones 10 CPU ID EEPROM hardware revision (major) mismatches between zones 11 CPU ID EEPROM firmware revision (major) mismatches between zones (continued on next page) 4–58 Error Handling and Analysis Table 4–32 (Cont.) 
Duplex Compatibility Test Failure Codes Failure Code Bit Number Code Description 12 CPU ID EEPROM software revision (major) mismatches between zones 13 ATM ID EEPROM is bad 14 ATM ID EEPROM OpenVMS status field shows module is bad 15 ATM ID EEPROM firmware status field shows module is bad 16 ATM ID EEPROM module type field mismatches between zones 17 ATM ID EEPROM module name field mismatches between zones 18 ATM ID EEPROM hardware revision (major) mismatches between zones 19 ATM ID EEPROM firmware revision (major) mismatches between zones 20 ATM ID EEPROM software revision (major) mismatches between zones 21 CPU data EEPROM is bad 22 CPU data EEPROM system wide area mismatches between zones 23 CPU/memory configuration mismatches between zones 24 Cables (cross-link and/or resynch) are not functional 25 CPU is in burnin state 26 Ethernet EEPROM address mismatches between zones 27 CPU console firmware cannot be synchable (cannot run in Duplex mode) [31:28] Reserved for future use 4.8.1.2 Dispatch Block Description The firmware validates a reset entry using a dispatch block, located in memory, to determine the next operation. Figure 4–14 shows the dispatch block structure. Table 4–33 describes the block components. Figure 4–14 Dispatch Block Structure Base + 00 Dispatch Reason Code Base + 04 Dispatch Address Base + 0C Dispatch Reason Complement MR−0018−93RAGS Error Handling and Analysis 4–59 Table 4–33 Dispatch Block Components Block Content Offset Description Dispatch reason code Base + 00h 4 bytes Code identifying reset reason. Bytes 03:02 identify the reason for the reset. Bytes 01:00 identify the end action to be taken by the console as specified below: • 00 = POWERUP. Default or unexpected reset. Run diagnostics and halt (enter the console). • 01 = NO_DIAGS. Halt (enter the console). • 02 = DISPATCH. Dispatch requested. Jump to the dispatch address. • 03 = RESYNCH. Resynch reset. Jump to the dispatch address. • 04 = DIAGS. 
Run diagnostics and halt (enter the console).
  • 05 = STOP_ZONE. OpenVMS issued a STOP_ZONE. Run diagnostics and halt (enter the console).
  • 06 = RECONFIG. Reconfigure firmware (for firmware use only).
Dispatch address Base + 04h 8 bytes Physical address where the console will jump. In the Model 810, only the first 4 bytes are used. The upper 4 bytes must be 0.
Dispatch reason complement Base + 0Ch 4 bytes The 1's complement of the dispatch reason code. Used for checking the dispatch block validity.

4.8.1.3 Boot Parameter Block Description

The boot parameter block (BPB) is a structure built by firmware to reflect the primary bootstrap code (VMB) of the boot device that is used during the bootstrap sequence. Table 4–34 describes the BPB components. Table 4–35 describes the entry components in the BPB structure.

Table 4–34 BPB Components
Component Length Description
Number of entries 4 bytes Number of entries in the BPB. Written by firmware. Is 0 if no entries are present.
BPB entries 5 bytes per entry An entry describes a boot path. Written by firmware. Maximum number of entries is 32. (See Table 4–35 for entry description.)

Table 4–35 BPB Entry Components
Component Length Description
Unit number 2 bytes Device unit number. Valid numbers are in the 0 to 999 (decimal) range.
Device 2 bytes Device name in ASCII (for example, EP and DI).
Path identifier 1 byte Path to device. Field breakdown is:
  • [06:00] = Slot number of the adapter module in the 10 to 17 (hex) and 20 to 27 (hex) range.
  • 07 = Zone identification of the adapter module: 0 = Zone A, 1 = Zone B.

4.8.2 Device Configuration Block

The device configuration block (DCB) reflects the configuration of the available modules in the system. There is a DCB in each zone. The DCB is built by firmware during the power up sequence and updated each time INIT and BOOT are executed. The OpenVMS operating system uses the DCB to configure the system. Table 4–36 describes the DCB components.
Table 4–37 describes the DCB entry components. Table 4–36 DCB Components Component Length Description Number of entries 4 bytes Number of entries in the DCB. Initialized by firmware. Is 0 if no entries are present. DCB entries 168 bytes per entry An entry describes a module found by the firmware. Initialized by firmware. Maximum number of entries is eight. (See Table 4–37 for entry description.) Table 4–37 DCB Entry Components Component Length Description Slot number 1 byte Physical slot number of the module. Valid slot numbers are: 0 to 2 for CPU and I/O ATM modules 0 to 7 for interface modules attached to the I/O ATM Module type 1 byte Code identifying the module. Module types are copied from the module ID EEPROM. Valid module types are: 1 = Not used 2 = SWIFT adapter card 3 = I/O ATM module 4 = DSF module 5 = CPU module 6 = LANCE adapter card 7 = Not used 8 = FDDI adapter card F = Unknown module (continued on next page) Error Handling and Analysis 4–61 Table 4–37 (Cont.) DCB Entry Components Component Length Description Status summary 1 byte Module status summary. This field is a summary of the OpenVMS and firmware status fields. The field should be updated whenever OpenVMS or firmware status fields are updated. Codes are initially copied from the module ID EEPROM. Valid codes (in hex) are: A5 = Module is good. B4 = Module is bad, marked by OpenVMS. See OpenVMS status field. C3 = Module is bad, marked by firmware. See firmware status field. FF = Module is bad, marked by OpenVMS and firmware. OpenVMS status 1 byte Module status as marked by OpenVMS (and maintained by OpenVMS). Codes are initially copied from the module ID EEPROM. Valid codes (in hex) are: A5 = module is good. non A5 = module is bad. Firmware status 1 byte Module status as marked by firmware (and maintained by firmware). Codes are initially copied from the module ID EEPROM. Valid codes (in hex) are: A5 = Module is good. non A5 = Module is bad. Module name 4 bytes ASCII module name. 
Copied from the module ID EEPROM. Module serial number 12 bytes Module serial in ASCII. Copied from the module ID EEPROM. Hardware revision 6 bytes Identifies the module hardware revision. Copied from the module ID EEPROM. Divided in: Minor revision (bytes 02:00) Major revision (bytes 05:03) Firmware revision 2 bytes Console/diagnostic firmware revision of the module. Copied from the module ID EEPROM. Divided in: Minor revision (byte 00) Major revision (byte 01) Software revision 2 bytes Functional firmware revision of the module. Copied from the module ID EEPROM. Divided in: Minor revision (byte 00) Major revision (byte 01) (continued on next page) 4–62 Error Handling and Analysis Table 4–37 (Cont.) DCB Entry Components Component Length Description Ethernet address 32 bytes Module Ethernet address. Follows the DEC STD format. Valid only for CPU module and LANCE adapter card. Copied from the Ethernet EEPROM by firmware for the CPU. Copied from the LANCE ROM for the LANCE adapter card. Extended data 32 bytes Module-specific data. The field is copied by firmware from the functional firmware ROM. Memory size 4 bytes Size of the module’s memory in 512 byte segments. For CPU refers to the size of main memory. For I/O ATM refers to the size of local (SOC) memory. For interface modules refers to the size of buffer RAM. SubDCB 4 bytes Offset to the module SubDCB (Sub-Device Configuration Block). Offset is the byte offset (signed) from the base of the DCB. Is 0 if no SubDCB available. Reserved 64 bytes Reserved for future use. 4.8.2.1 Sub-Device Configuration Blocks The SubDCBs reflect the configuration of the interface or memory modules attached to a module. SubDCBs may be available for the CPU and I/O ATM modules. The SubDCB is built by firmware during the power up sequence and updated each time INIT and BOOT are executed. A SubDCB is present when there are interface modules attached to a given module and its existence is represented in that module’s DCB entry. 
When the SubDCB offset field on a DCB entry is nonzero, the value is used to calculate the location of its SubDCB block. If the SubDCB offset field on a DCB entry is zero, there is no SubDCB block present (that is, no interface modules are attached to that module). The format of a SubDCB is the same as for the DCB block. The field containing the number of entries follows the same format as a DCB entry (except the CPU module SubDCB). Figure 4–15 shows how the SubDCBs are linked to the DCB.

Figure 4–15 SubDCB Links to DCB (diagram: the Zone A and Zone B DCB offsets in the CCA, added to the CCA base, locate each zone's DCB; the SubDCB offset in a DCB entry, added to the DCB base, locates the SubDCB for that entry)

4.8.2.2 CPU Module SubDCB

The CPU SubDCB is used to represent the memory modules (MMBs) available on the CPU module. Table 4–38 describes the CPU SubDCB components. Table 4–39 describes the CPU SubDCB entry components.

Table 4–38 CPU SubDCB Components
Component Length Description
Number of entries 4 bytes Number of entries in the SubDCB. Initialized by firmware. Is 0 if no entries are present.
SubDCB entries 16 bytes per entry An entry describes an MMB found by the firmware. Initialized by firmware. Maximum number of entries is four.

Table 4–39 CPU SubDCB Entry Components
Component Length Description
SIMM block 16 bytes MMB SIMM description. This field is an array of eight elements (SIMM0 to SIMM7). Each element is 2 bytes in size and contains:
  Byte 00 - SIMM size in Mbytes.
  Byte 01 - SIMM status.
  Values for SIMM status (in hex) are:
  A5 = SIMM is good.
  B4 = SIMM is broken.
  C3 = SIMM is absent.
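As an illustration of the SIMM block layout in Table 4–39, the following Python sketch decodes one 16-byte CPU SubDCB entry. It is a minimal sketch, not firmware or OpenVMS code: the function names are hypothetical, and it assumes the entry is available as a raw byte string with SIMM0 first.

```python
# Illustrative sketch only; function names are hypothetical.
SIMM_STATUS = {0xA5: "good", 0xB4: "broken", 0xC3: "absent"}


def decode_simm_block(raw: bytes) -> list:
    """Decode a 16-byte SIMM block into (simm_index, size_mbytes, status) tuples."""
    if len(raw) != 16:
        raise ValueError("a SIMM block is exactly 16 bytes (8 elements of 2 bytes)")
    simms = []
    for index in range(8):
        size_mbytes = raw[2 * index]        # byte 00 of the element
        status_code = raw[2 * index + 1]    # byte 01 of the element
        # Codes outside Table 4-39 are reported as "unknown" rather than guessed.
        simms.append((index, size_mbytes, SIMM_STATUS.get(status_code, "unknown")))
    return simms


def usable_mbytes(raw: bytes) -> int:
    """Sum the sizes of the SIMMs marked good."""
    return sum(size for _, size, status in decode_simm_block(raw) if status == "good")


# Example entry: six good 4-Mbyte SIMMs, one broken SIMM, one empty position.
example = bytes([4, 0xA5] * 6 + [4, 0xB4] + [0, 0xC3])
for simm in decode_simm_block(example):
    print(simm)
print("usable Mbytes:", usable_mbytes(example))
```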
4.8.3 Page Frame Number Bitmap The page frame number (PFN) bitmap is a data structure that indicates which pages in memory are considered usable by the OpenVMS operating system. The bitmap is built by diagnostics as a side effect of the memory tests run during the power up sequence. The bitmap starts on a page boundary and resides at the top of memory. The bitmap requires 1 Kbyte for each 4 Mbytes of main memory, that is: • A 32-Mbyte system requires an 8-Kbyte bitmap • A 512-Mbyte system requires a 128-Kbyte bitmap The bitmap does not map itself or anything above it. There may be memory above the bitmap which has good and bad pages. Each bit in the PFN bitmap corresponds to a page in main memory. There is a one-to-one correspondence between a page frame number (origin 0) and a bit index in the bitmap. A 1 in the bitmap indicates that the page is good and can be used. A 0 indicates that the page is bad and should not be used. By default, a page is flagged bad if a multiple bit error occurs when referencing the page. Single-bit errors, regardless of frequency, will not cause a page to be flagged bad. Error Handling and Analysis 4–65 4.9 Error Log Analysis 4.9.1 CPU/MEM Fault Error Log Entry V A X / V M S SYSTEM ERROR REPORT ******************************* ENTRY ERROR SEQUENCE 1033. DATE/TIME 2-FEB-1993 18:15:45.55 SYSTEM UPTIME: 0 DAYS 01:47:45 SCS NODE: SIXSHL COMPILED 3-FEB-1993 09:33:44 PAGE 40. 686. ******************************* LOGGED ON: SID 17000002 SYS_TYPE 02010101 VAX/VMS T5.5-D34 INT60 ERROR KA560 CPU FW REV# 2. CONSOLE FW REV# 0.1 REGISTER COUNT 00000028 Fault Summary Block ! 
FAULT ID 19 FAULT FLAG 02 XLNK MODE ERROR 03 XLNK MODE AFTER 02 " CPU/mem fault Solid error Duplex Master # FRU Information Block FRU TYPE 00000004 FRU DATA 00000001 Module in zone B $ CPU in slot 0 Deconfiguration Information FLT FLGS BEFORE 33003301 Full configuration active Zone A CPU present Zone B CPU present Zone A I/O present Zone B I/O present Zone A CPU in use Zone B CPU in use Zone A I/O in use Zone A I/O in use FLT FLGS AFTER 33003301 Full configuration active Zone A CPU present Zone B CPU present Zone A I/O present Zone B I/O present Zone A CPU in use Zone B CPU in use Zone A I/O in use Zone A I/O in use DECONFIG INFO 00000008 Zone B cpu removed from service DECONFIG MODULE 00000001 % CPU in slot 0 removed from service Threshold Information Block 4–66 Error Handling and Analysis V A X / V M S SYSTEM ERROR REPORT COMPILED 3-FEB-1993 09:33:44 PAGE 41. THRESHOLD INTER.0000A8C0 THRESHOLD INTER. SECONDS = 43200. THRESHOLD COUNT 00000001 THRESHOLD COUNT = 1. THRESHOLD LIMIT 00000003 THRESHOLD LIMIT = 3. THRESHOLD ZEROED0000190E THRESHOLD ZEROED SECONDS = 6414. THRESHOLD TOTAL 00000001 Fault Data Block THRESHOLD TOTAL = 1. & SYSTEM ERROR SYSFLT 19 30020010 I/O error, zone A CPU/memory fault, zone B XLINK MODE = Duplex SYSADR 61200034 DMAADR 0269BC00 SYSADR = 61200034(X) DMAADR = 0269BC00(X) DMA Address Register Invalid JCSR_A CTL/STAT 00000088 System errors enabled Bcache on JCSR_B Register Invalid DIAG_P_A REG CAC00000 DMA most error (non-crc) Burn-in mode I/O divide = 6 CPU divide = A DIAG_M_A REG CAC00000 DMA most error (non-crc) Burn-in mode I/O divide = 6 CPU divide = A DIAG_P_B Register Invalid DIAG_M_B Register Invalid ATMERR_A REG 00000000 Zone ID = A ATMERR_B Register Invalid DMA STAT REG A 00000040 CPU I/O error DMASTS_B Register Invalid MMBERR0_A REG 00000000 MMBERR0_B Register Invalid MMBERR1_A REG 00000000 Error Handling and Analysis 4–67 V A X / V M S SYSTEM ERROR REPORT COMPILED 3-FEB-1993 09:33:44 PAGE 42. 
MMBERR1_B Register Invalid SERCSR_A REG 00000080 Loopback request Enable query interrupt SERCSR_B Register Invalid SERMODE_A REG 00200912 ' Master Operating System is running Clock fault enable Clock select 0 = Master, 1 = Slave Halt source 0 = A, 1 = B SERMODE_B Register Invalid BIU_ADDR_A Register Invalid BIU_ADDR_B Register Invalid BIU_STAT_A Register Invalid BIU_STAT_B Register Invalid BIU_CTL_A Register Invalid BIU_CTL_B Register Invalid ! This block reflects the content of the four fields of the Fault Summary Block. " The FAULT ID, FAULT FLAG, FRU TYPE, and FRU DATA fields should always be reviewed. They will generally provide the most immediate FRU information. # The system operating mode has been changed from Duplex to Degraded Duplex, with Zone A as the master. $ A solid error has been identified and the FRU removed from service. However, if the CPU has not exceeded its threshold and diagnostics pass, the CPU will be reconfigured into the system. % At this point, the Zone B CPU has not been removed from service. & The Zone B CPU is being removed from service due to the solid error and change in operating mode. ' OpenVMS is running in Zone A. 4–68 Error Handling and Analysis 4.9.2 CPU/MEM Fault End Action Error Log Entry V A X / V M S SYSTEM ERROR REPORT ******************************* ENTRY ERROR SEQUENCE 1048. DATE/TIME 2-FEB-1993 18:16:21.40 SYSTEM UPTIME: 0 DAYS 01:48:21 SCS NODE: SIXSHL COMPILED 3-FEB-1993 09:33:46 PAGE 56. 701. ******************************* LOGGED ON: SID 17000002 SYS_TYPE 02010101 VAX/VMS T5.5-D34 INT60 ERROR KA560 CPU FW REV# 2. CONSOLE FW REV# 0.1 REGISTER COUNT 00000029 Fault Summary Block ! 
FAULT ID 29 FAULT FLAG 0A CPU/mem fault end action Solid error Service is required XLNK MODE ERROR 03 XLNK MODE AFTER 02 " # Duplex Master $ FRU Information Block FRU TYPE 00000004 FRU DATA 00000001 Module in zone B CPU in slot 0 % Deconfiguration Information FLT FLGS BEFORE 33003301 Full configuration active Zone A CPU present Zone B CPU present Zone A I/O present Zone B I/O present Zone A CPU in use Zone B CPU in use Zone A I/O in use Zone A I/O in use FLT FLGS AFTER 31003300 Zone A CPU present Zone B CPU present Zone A I/O present Zone B I/O present Zone A CPU in use Zone A I/O in use Zone A I/O in use & DECONFIG INFO 00000008 Zone B cpu removed from service DECONFIG MODULE 00000001 CPU in slot 0 removed from service Threshold Information Not Valid Error Handling and Analysis 4–69 V A X / V M S SYSTEM ERROR REPORT COMPILED 3-FEB-1993 09:33:46 PAGE 57. Fault Data Block END ACTION SYSFLT 29 30020020 I/O error, zone B CPU/memory fault, zone B XLINK MODE = Duplex SYSADR 61200034 SYSADR = 61200034(X) CNTRL/STAT REG 00000008 System errors enabled DIAG_P REG CAC08000 Memory double bit error DMA most error (non-crc) Burn-in mode I/O divide = 6 CPU divide = A DIAG_M REG CAC08000 Memory double bit error DMA most error (non-crc) Burn-in mode I/O divide = 6 CPU divide = A MMBERR0 REG 01010101 MMBERR1 REG ATMERR REG 00000000 40404040 DMA STAT REG 00000040 DMAADR 0269BC00 SERCSR REG 00000080 MMB #3 double bit error ' Zone ID = B CPU I/O error DMAADR = 0269BC00(X) Loopback request Enable query interrupt SERMODE REG 00002101 Slave Clock fault enable Zone ID 0 = A, 1 = B PCADR SAVPSL REG 00000000 0000B039 C-BIT N-BIT T-BIT INTEGER OVERFLOW TRAP ENABLE INTERRUPT PRIORITY LEVEL = 00. PREVIOUS MODE = KERNEL CURRENT MODE = KERNEL FIRST PART DONE CLEAR ECR 0000004A fbox enable fbox st4 bypass enable timeout clock pmf pmux = 00 pmf emux = 00 4–70 Error Handling and Analysis V A X / V M S BIU CTL SYSTEM ERROR REPORT COMPILED 3-FEB-1993 09:33:46 PAGE 58. 
DFE0DEF9 Generate/Expect ECC on check_h pins output enable of cache rams direct mapped 2X CPU Cycle IO Map = 1(X) 512 Kbytes BC TAG 07913800 tag_match tag control V tag control D tag P BC TAG = 03C8(X) BIU STAT 500E3070 Bits 33,32 BIU Addr Reg = 1(X) Bits 33,32 Fill Addr Reg = 1(X) FILL SYN 00000000 L0 ECC Syn bits Low Longword = 0(X) Hi ECC Syn bits High Longword = 0(X) FILL ADDR 000002A8 VMAR 000007E0 FILL ADDR = 000002A8(X) Sub Block Select = 0(X) Row Index = 3F(X) Error Address Field = 00000000(X) ICSR 00000001 TBADR TBSTS 00000000 00000000 enable VIC s5 cmd corresp to tb perr = 00 source of ref causing tb perr = 00 PCSTS 00000000 PCCTL 00000000 PCSTS.LOCK(0) NOT SET Performance Monitor Mode = 0(X) COMPAT/STAT REG 00006008 ATM self test failed ATM ID EEPROM is bad ATM ID EEPROM has bad os status DIAG STATUS REG 00000000 Register is not "VALID" ! This block reflects the content of the four fields of the Fault Summary Block. " This entry type (end action) is provided after diagnostics have completed running on a zone or CPU which has been removed from service as a result of a system error. This is the end action for the previous example (CPU/MEM Fault Error Log Entry). # This message specifies that a physical FRU replacement is required. $ The system operating mode has been changed from Duplex to Degraded Duplex with Zone A as the master. % The FRU may be one of five items: CPU module, or one of the four MMBs. & The Zone B CPU has been removed from service. ' Double-bit errors are always treated as solid faults. The failed CPU will not be reconfigured until Zone B memory is repaired. MMB 3 is the most likely FRU. Error Handling and Analysis 4–71 4.9.3 CPU or Zone Unsynchable Error Log Entry V A X / V M S SYSTEM ERROR REPORT ******************************* ENTRY ERROR SEQUENCE 1099. DATE/TIME 2-FEB-1993 18:16:21.40 SYSTEM UPTIME: 0 DAYS 01:48:21 SCS NODE: SIXSHL COMPILED 3-FEB-1993 09:33:46 PAGE 56. 743. 
*******************************
LOGGED ON: SID 17000002 SYS_TYPE 02010101
VAX/VMS T5.5-D34
INT60 ERROR KA560 CPU FW REV# 2. CONSOLE FW REV# 0.1
REGISTER COUNT 0000000E

Fault Summary Block !
FAULT ID 60 FAULT FLAG 0A CPU or zone unsynchable Solid error Service is required
XLNK MODE ERROR 02 XLNK MODE AFTER 02 Master Master "

FRU Information Block
FRU TYPE 00000004 FRU DATA 00000001 Module in zone B CPU in slot 0

Deconfiguration Information
FLT FLGS BEFORE 31003300 Zone A CPU present Zone B CPU present Zone A I/O present Zone B I/O present Zone A CPU in use Zone A I/O in use Zone A I/O in use #
FLT FLGS AFTER 31003301 Zone A CPU present Zone B CPU present Zone A I/O present Zone A CPU in use Zone A I/O in use Zone A I/O in use
DECONFIG INFO 00000008 Zone B cpu removed from service
DECONFIG MODULE 00000001 $ CPU in slot 0 removed from service
Threshold Information Not Valid

Fault Data Block
V A X / V M S SYSTEM ERROR REPORT COMPILED 3-FEB-1993 09:33:46 PAGE 57.
CPU or ZONE UNSYNCHABLE EVENTS
COMPAT/STAT REG 02000000 CPU is in burnin mode
DIAG STATUS REG FFFFFFFF Diagnostic status is valid
DIAG ERR NUM FF DIAG SUBTEST NUM FF DIAG TEST NUM FF DIAG GROUP NUM 0F
DIAG ERR NUM = 255 DIAG SUBTEST NUM = 255 DIAG TEST NUM = 255 DIAG GROUP NUM = 15. Diag Flag = 7(X)

! This block reflects the content of the four fields of the Fault Summary Block.
" The system was unable to synchronize and reach Duplex mode. Consequently, the before and after XLINK_MODE fields (Fault Summary Block) reflect Degraded Duplex mode.
# Since the Zone B CPU was unsynchable, it is not in use.
$ The Zone B CPU was removed from service, and will remain out of service until it is repaired.
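The compatibility status and diagnostic status values that appear in these error log entries can be decoded by hand from Table 4–32 and Table 4–31. The Python sketch below shows one way to do so. It is illustrative only and is not part of the error log formatter: the function names are hypothetical, and it assumes the test number occupies bits 23:16 of the diagnostic status, which agrees with the 8-bit DIAG TEST NUM value shown in the log.

```python
# Illustrative sketch only; bit descriptions are taken from Table 4-32.
COMPAT_FAILURES = [
    "CPU self-test failed",                                                  # bit 00
    "CPU zone test failed",
    "CPU system test failed",
    "ATM self-test failed",
    "Both zones have the same zone ID",
    "CPU ID EEPROM is bad",
    "CPU ID EEPROM OpenVMS status field shows module is bad",
    "CPU ID EEPROM firmware status field shows module is bad",
    "CPU ID EEPROM module type field mismatches between zones",
    "CPU ID EEPROM module name field mismatches between zones",
    "CPU ID EEPROM hardware revision (major) mismatches between zones",
    "CPU ID EEPROM firmware revision (major) mismatches between zones",
    "CPU ID EEPROM software revision (major) mismatches between zones",
    "ATM ID EEPROM is bad",
    "ATM ID EEPROM OpenVMS status field shows module is bad",
    "ATM ID EEPROM firmware status field shows module is bad",
    "ATM ID EEPROM module type field mismatches between zones",
    "ATM ID EEPROM module name field mismatches between zones",
    "ATM ID EEPROM hardware revision (major) mismatches between zones",
    "ATM ID EEPROM firmware revision (major) mismatches between zones",
    "ATM ID EEPROM software revision (major) mismatches between zones",
    "CPU data EEPROM is bad",
    "CPU data EEPROM system wide area mismatches between zones",
    "CPU/memory configuration mismatches between zones",
    "Cables (cross-link and/or resynch) are not functional",
    "CPU is in burnin state",
    "Ethernet EEPROM address mismatches between zones",
    "CPU console firmware cannot be synchable (cannot run in Duplex mode)",  # bit 27
]


def decode_compat_status(mask: int) -> list:
    """Return the failure descriptions for every set bit in bits 27:00."""
    return [text for bit, text in enumerate(COMPAT_FAILURES) if mask & (1 << bit)]


def decode_diag_status(value: int) -> dict:
    """Split the low-order longword of the diagnostic status into its fields."""
    low = value & 0xFFFFFFFF            # high-order four bytes are reserved
    return {
        "error": low & 0xFF,            # [07:00]
        "subtest": (low >> 8) & 0xFF,   # [15:08]
        "test": (low >> 16) & 0xFF,     # [23:16] (assumed, see lead-in)
        "group": (low >> 24) & 0xF,     # [27:24]
        "flags": (low >> 28) & 0x7,     # [30:28]
        "valid": bool((low >> 31) & 1), # bit 31
    }


print(decode_compat_status(0x00006008))
print(decode_compat_status(0x02000000))
print(decode_diag_status(0xFFFFFFFF))
```

The three sample values are the COMPAT/STAT and DIAG STATUS register contents shown in Sections 4.9.2 and 4.9.3.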
Error Handling and Analysis 4–73 5 FRU Removal and Replacement Procedures 5.1 In This Chapter This chapter includes: • Field replaceable unit list • Before you begin • FRU removal and replacement 5.2 Field Replaceable Unit List A complete list of field replaceable units (FRUs) is given in Table 5–1. Table 5–1 Model 810 FRUs FRU Part Number Modules: CPU 54-21075-01 Memory mother board (MMB) 54-21085-01 Single-sided SIMMs (4 Mbytes per SIMM) 54-21139-CA Double-sided SIMMs (8 Mbytes per SIMM) 54-21139-DA I/O attachment module (ATM) 54-21083-01 Zone control panel 54-22130-01 Fan current sense board (FCSB) 54-22126-01 Console extender module 54-21067-01 Cross-link assembly 70-03710-01 Fan 12-27848-01 Power: AC front end unit (FEU) H7884-AA 5V regulator (DC5) H7179-AA 3.3V regulator (DC3) H7178-AA Power system controller (PSC) H7851-AA Domestic power distribution box BA22J-AE International power distribution box BA22J-AJ (continued on next page) Error Handling and Analysis 5–1 Table 5–1 (Cont.) 
Control and miscellaneous power module (CAMP)     54-21073-01

Options:
Ethernet interface module (EIM)                   54-21081-01
DSSI extender module                              54-21063-01
DSSI interface module (DIM)                       54-21065-01
DSSI disk drawer assembly                         70-30569-01

Storage:
18.2 Gbyte magazine tape subsystem                TF857-AA/AB
2.6 Gbyte cartridge tape drive                    TF85C-BA
2 Gbyte disk drive                                RF73-EA
852 Mbyte disk drive                              RF35-EA
2.6 Gbyte cartridge tabletop tape drive           TF85-TA
Cable kit for the TF85-TA drive                   CK-KDXDA-BA
4 Gbyte half-rack storage array with two RF73 drives and one SF73-HK assembly
1.7 Gbyte half-rack storage array with two SF35 drives and one SF35-HK assembly

Cables:
DIM to storage device with terminator (84 inches) 17-03537-03
DIM to storage device with terminator (62 inches) 17-03537-02
DIM to storage device with terminator (24 inches) 17-03537-01
Fan to fan tray                                   17-03514-01
Fan tray to FCSB                                  17-03513-01
FCSB to centerplane                               17-03512-01
VT420 to UPS (power cable)                        17-00442-17
Zone control panel to centerplane                 17-01148-03
DSSI disk drawer to centerplane                   17-03805-01
DSSI disk drawer power/signal to centerplane      17-03806-01

5.3 Before You Begin

Warning
Hazardous voltages exist within the system. Bodily injury or equipment damage can result when service procedures are performed incorrectly.

Note
FRUs should be handled only by qualified maintenance personnel.

You do not need to shut down the entire system to remove and replace a FRU. You can shut down the zone that houses the faulty FRU while the other zone continues to operate. Section 5.3.2 explains how to shut down a zone.

There are two types of FRU removal and replacement procedures:
• Cold swaps
• Warm swaps

During a cold swap, you shut down the zone that houses the faulty FRU while the operating system continues to run in the other zone.
FRUs that require cold swaps include:
• Logic modules
• Fan modules
• Power supplies
• DIM modules
• EIM modules
• Zone control panel

During a warm swap, the power remains on in both zones. The operating system continues to run in both zones while the faulty FRU is replaced. FRUs that allow a warm swap include:
• RF35 disk drives
• RF73 disk drives
• SF35 disk drives
• SF73 disk drives
• TF85 tape drives
• TF857 tape subsystems
• DSSI disk drawer assemblies

Chapter 6 explains how to perform a warm swap procedure.

5.3.1 Handling FRUs

Static electricity can damage FRUs. When you handle FRUs, follow the rules in Table 5–2.

Table 5–2 Handling FRUs

Rule  Action
1     Wear an electrostatic discharge (ESD) wrist strap.
2     When possible, use a grounded ESD workmat.
3     Attach both the wrist strap and the workmat to the system chassis.
4     Before you remove the FRU from the antistatic box, be sure you ground the box to the system chassis.
5     Wear an ESD wrist strap when you remove the FRU from the antistatic box.
6     Ask the operator or system manager to shut down the zone you will be working in.

5.3.2 Shutting Down a Zone

Typically, the shutdown is performed by the operator or the system manager.

1. Enter the SHOW ZONE command to see the status of each zone.
   • Active: The zone is running.
   • Stopped: The zone is not running the operating system. It may be running diagnostics or is available for synchronizing.
   • Absent: The zone is not available.
   • Synchronizing: The zone is synchronizing with the other zone.
   • Providing I/O only: The zone has detected a CPU/MEM fault, and has placed the CPU and memory off line.
2. Enter the STOP/ZONE zone-id command.
3. At the zone control panel (A or B), simultaneously press both Logic Power OFF switches to remove logic power from the zone.

Note
Pressing the Logic Power - OFF switches does not affect the fan or the expansion cabinet power unless the drives (disk or tape) are turned off.
If the drives are turned off, the fan will run for about 30 seconds after you press the switches.

Example 5–1 How to Shut Down a Zone

    $ SHOW ZONE                      ! Displays the status of each zone.
    Zone A is ACTIVE                 ! Zone A is running.
    Zone B is PROVIDING I/O ONLY     ! Zone B has a faulty component.
    $ STOP/ZONE B                    ! Stops zone B.

At the console terminal of the zone that continues to run (in this case, zone A), the OPCOM messages show that zone synchronization has been lost and virtual circuits are closed.

5.3.3 Verifying Zone Shutdown

The SHOW ZONE command may be used to verify that the STOP/ZONE zone-id command was successful.

Example 5–2 How to Verify Zone Shutdown

    $ SHOW ZONE                      ! Displays the status of each zone.
    Zone A is ACTIVE                 ! Zone A is running.
    Zone B is ABSENT                 ! Zone B has been shut down.

5.3.4 Starting Up a Zone

Typically, the startup is performed by the operator or the system manager.

1. At the zone control panel (A or B), press the Logic Power - ON switch.
2. Enter the SHOW ZONE command to verify that the zone is shut down.
3. Enter the START/ZONE command to start up the zone.

5.3.5 Accessing the FRUs

Figure 5–1 shows the latches at the front and rear of the system. To open a door, pull the latch. The electrostatic discharge (ESD) kit and module extraction tool are located inside the rear door of the CPU cabinet.

Figure 5–1 Latches
[Illustration not reproduced. Front and rear views of the expander cabinet and CPU cabinet, with the latch locations marked. MR-0457-92DG]

5.4 FRU Removal and Replacement

The following sections contain FRU removal and replacement procedures.

Caution
Service procedures may be performed only by qualified personnel. They must be familiar with ESD procedures and power procedures for the Model 810 system. Excessive shock or incorrect handling can damage the logic modules.

Note
When specific replacement procedures are not given, replace the FRU by reversing the steps in the removal procedure.
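
Most of the removal procedures that follow begin with the zone shutdown described in Section 5.3.2 and verified in Section 5.3.3. The following is a minimal Python sketch of that verification step, for illustration only. It assumes that captured SHOW ZONE output has the "Zone x is STATE" form shown in Examples 5–1 and 5–2. The function names and the rule that a zone counts as shut down when it reports STOPPED or ABSENT while the other zone is ACTIVE are assumptions made for this sketch and are not part of any VAXft software.

```python
import re

# Matches status lines such as "Zone B is PROVIDING I/O ONLY".
# The line format is assumed from Examples 5-1 and 5-2.
ZONE_LINE = re.compile(r"^Zone\s+([AB])\s+is\s+(.+?)\s*$", re.IGNORECASE)

# Assumption: these states mean the zone is no longer running the operating system.
NOT_RUNNING = {"STOPPED", "ABSENT"}


def parse_show_zone(text):
    """Return {zone_id: state} from captured SHOW ZONE output."""
    zones = {}
    for line in text.splitlines():
        # Drop a trailing "! comment", as printed in the manual's examples.
        line = line.split("!", 1)[0].strip()
        match = ZONE_LINE.match(line)
        if match:
            zones[match.group(1).upper()] = match.group(2).upper()
    return zones


def zone_is_shut_down(text, zone_id):
    """True if zone_id reports a not-running state and the other zone is ACTIVE."""
    zones = parse_show_zone(text)
    target = zones.get(zone_id)
    others = [state for zone, state in zones.items() if zone != zone_id]
    return target in NOT_RUNNING and "ACTIVE" in others
```

A check like this only summarizes what the operator already reads from the display. It does not replace the steps in Section 5.3.2, including removing logic power at the zone control panel.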
5.4.1 CPU and ATM Modules

You use the same steps to remove the CPU and ATM modules. Figure 5–2 shows the locations of the modules. Table 5–3 describes the removal procedure.

Figure 5–2 CPU Module and ATM Module Locations
[Illustration not reproduced. Callouts: captive screws, module release levers, ATM module, CPU module, CPU cabinet. MR−0435−92RAGS]

Table 5–3 CPU Module and ATM Module Removal Procedure

Step  Action
1     Ask the operator or system manager to shut down the zone using the procedure in Section 5.3.2.
2     Open the front door of the cabinet.
3     Loosen the captive screws on the module. The CPU module has four captive screws; the ATM module has two captive screws.
4     Open the module release levers and slide the module out.

5.4.2 SIMMs

Figure 5–3 shows the locations of the SIMMs. Table 5–4 describes the removal procedure.

Note
SIMMs are configured on the MMBs in rows, with a pair of SIMMs (two) in each row. You always replace a pair of SIMMs (a two-SIMM row).

Figure 5–3 SIMM Locations
[Illustration not reproduced. Callouts: retaining clip; SIMM rows A, B, C, and D; MMB0, MMB1, MMB2, and MMB3; CPU module. MR-0453-92DG]

Table 5–4 SIMM Removal Procedure

Step  Action
1     Ask the operator or system manager to shut down the zone using the procedure in Section 5.3.2.
2     Open the front door of the cabinet.
3     Remove the CPU module using the procedure in Table 5–3.
4     Press the two retaining clips until the SIMM pops up at a 45-degree angle.
5     Remove the pair of SIMMs (a two-SIMM row) from the MMB.

5.4.3 MMBs

Figure 5–4 shows the locations of the MMBs. Table 5–5 describes the removal procedure.

Figure 5–4 MMB Locations
[Illustration not reproduced. Callouts: mounting bracket screws; mounting brackets; MMB0, MMB1, MMB2, and MMB3; CPU module. MR-0414-92DG]

Table 5–5 MMB Removal Procedure

Step  Action
1     Ask the operator or system manager to shut down the zone using the procedure in Section 5.3.2.
2     Open the front door of the cabinet.
3     Remove the CPU module using the procedure in Table 5–3.
4     The MMBs are tension mounted on the CPU module with two screws. These screws are located on the MMB mounting brackets. Loosen one screw by turning it two or three times. Then loosen the other screw the same way. Alternate between the two screws until the MMB is free from the CPU module.
5     Remove the three screws that secure each of the mounting brackets on the MMB.
6     Note the configuration of the SIMMs on the MMB. They must be removed from the faulty MMB and installed in the same locations on the replacement MMB.
7     Remove the SIMMs from the MMB using the procedure in Table 5–4.

5.4.4 Fan and FCSB

Figure 5–5 shows the location of the fan. Figure 5–6 shows the location of the FCSB. Table 5–6 describes the removal procedure.

Figure 5–5 Fan Location
[Illustration not reproduced. Front view of the CPU cabinet. Callouts: captive screws, fan, handle. MR−0439−92RAGS]

Table 5–6 Fan and FCSB Removal Procedure

Step  Action
1     Ask the operator or system manager to shut down the zone using the procedure in Section 5.3.2.
2     Open the rear door of the cabinet.
3     Set the FEU circuit breaker to the off position.
4     Open the front door of the cabinet.
5     Loosen the three captive screws that secure the fan in the CPU cabinet.
6     Grasp the handle and pull the fan out of the cabinet.
7     Locate the FCSB inside the fan assembly.
8     Disconnect the FCSB from the fan tray to FCSB cable. See Figure 5–6.
9     Disconnect the FCSB from the FCSB to centerplane cable. See Figure 5–6.
10    Remove the FCSB from the four mounting standoffs. See Figure 5–6.

Figure 5–6 FCSB Location
[Illustration not reproduced. Callouts: fan tray to FCSB cable, FCSB to centerplane cable, mounting standoffs, FCSB. MR−0437−92RAGS]

5.4.5 RF35 Disk Drive Removal and Replacement

Figure 5–7 shows an RF35 disk drive in the DSSI disk drawer. Table 5–7 describes the RF35 disk drive removal procedure.
Figure 5–7 RF35 Disk Drive Location
[Illustration not reproduced. Callouts: release lever, bracket, Phillips screws (6), captive screws (4), release pin, captive screws, RF35 disk drive, LDC bracket. MR-0025-93DG]

Table 5–7 RF35 Disk Drive Removal Procedure

Step  Action
1     Ask the operator or system manager to shut down the zone using the procedure in Section 5.3.2.
2     Open the front door of the cabinet.
3     Turn off the RF35 disk drive.
4     Loosen the four screws that secure the DSSI disk drive rack in the CPU cabinet.
5     Pull the DSSI disk drive rack out until it locks in place.
6     Swing the LDC bracket out until you can see the disk drives. See Figure 5–7.
7     Label the DSSI, power, and disk signal cables, and disconnect them from the RF35 drive you are removing.
8     Loosen the captive screws at the bottom of the drive.
9     Remove the drive and bracket.
10    Remove the six Phillips screws that secure the bracket on the drive.

5.4.6 DSSI Disk Drawer

Figure 5–7 shows the components in the DSSI disk drawer. Table 5–8 describes the DSSI disk drawer removal procedure.

Table 5–8 DSSI Disk Drawer Removal Procedure

Step  Action
1     Ask the operator or system manager to dismount the drive.
2     Open the rear door of the cabinet.
3     Set the FEU circuit breaker to the off position.
4     Open the front door of the cabinet.
5     Turn off all the RF35 disk drives.
6     Loosen the four screws that secure the DSSI disk drive rack in the CPU cabinet.
7     Pull the DSSI disk drive rack out until it locks in place.
8     Swing the LDC bracket out until you can see the disk drives. See Figure 5–7.
9     Label each of the RF35 disk drives.¹
10    Label the DSSI, power, and disk signal cables, and disconnect them from each of the RF35 drives.
11    Loosen the captive screws at the bottom of each of the drives.
12    Remove all the drives from the DSSI disk drawer.
13    At the rear of the DSSI disk drawer, label the two DSSI cables and the power cable. Then disconnect them.
14    Press the release lever on the left side of the DSSI disk drawer and slide the drawer out of the cabinet.

¹ Label each drive before you remove it. The RF35 disk drives must be removed from the DSSI disk drawer and installed in the same locations in the replacement DSSI disk drawer.

5.4.7 Zone Control Panel

Figure 5–8 shows the zone control panel. Table 5–9 describes the removal procedure.

Figure 5–8 Zone Control Panel
[Illustration not reproduced. Callouts: captive screws, zone control panel bracket, signal cable, controller module, handle, Phillips screws (6). MR−0023−93RAGS]

Table 5–9 Zone Control Panel Removal Procedure

Step  Action
1     Ask the operator or system manager to shut down the zone using the procedure in Section 5.3.2.
2     Open the front door of the cabinet.
3     Loosen the four captive screws that secure the zone control panel on the cabinet.
4     Grasp the handle and pull the zone control panel out until you can access the controller module signal cable.
5     Disconnect the signal cable from the controller module.
6     Remove the six Phillips screws that secure the controller module on the zone control panel bracket.

5.4.8 FEU, 3.3V Regulator, 5V Regulator, PSC Modules

You use the same steps to remove these four FRUs. Figure 5–9 shows the locations of the modules. Table 5–10 describes the removal procedure.

Figure 5–9 FEU, 3.3V Regulator, 5V Regulator, and PSC Locations
[Illustration not reproduced. Rear view of the CPU cabinet. Callouts: +3.3V regulator, +5V regulator, PSC, circuit breaker, release handle, FEU. MR−0443−92RAGS]

Caution
Removing or replacing these four modules without shutting down 48V_DRCT may cause damage to the power components.

Table 5–10 FEU, 3.3V Regulator, 5V Regulator, and PSC Removal Procedure

Step  Action
1     Ask the operator or system manager to shut down the zone using the procedure in Section 5.3.2.
2     Open the rear door of the cabinet.
3     Set the FEU circuit breaker to the off position.
4     If you are removing the FEU, disconnect the ac power cable from the FEU.
5     Loosen the screws that secure the module in the cabinet. The FEU is secured with four screws. The 3.3V regulator, 5V regulator, and PSC are secured with two screws.
6     Grasp the module release handles and pull the power module out of the cabinet.

5.4.9 Cross-Link Assembly

Figure 5–10 shows the location of the cross-link assembly. Table 5–11 describes the removal procedure. Figure 5–11 shows you how to use the module extraction tool.

Figure 5–10 Cross-Link Assembly
[Illustration not reproduced. Rear view of the CPU cabinet. Callouts: upper retaining bar, cross-link module, middle retaining bar, cross-link cable. MR−0447−92RAGS]

Note
The cross-link assembly consists of two cross-link modules (one per zone) and one cross-link cable. These three parts are considered to be one FRU.

Table 5–11 Cross-Link Assembly Removal Procedure

Step  Action
1     Ask the operator or system manager to shut down the zone using the procedure in Section 5.3.2.
2     Open the rear door of the cabinet.
3     Remove the four screws from the upper retaining bar.
4     Remove the four screws from the middle retaining bar.
5     Insert the module extraction tool into the hole in the cross-link module. Turn the module extraction tool to the right until it is fastened to the module. See Figure 5–11.
6     Pull the cross-link module out of the cabinet.
7     Repeat steps 3 through 6 for the other zone.

Figure 5–11 Module Extraction Tool
[Illustration not reproduced. Callouts: module extraction tool, tighten, loosen, pull to remove. MR−0024−93RAGS]

5.4.10 Console Extender Module

Figure 5–12 shows the location of the console extender module. Figure 5–13 shows the layout of the console extender module. Table 5–12 describes the removal procedure.
Figure 5–12 Console Extender Module Location
[Illustration not reproduced. Rear view of the CPU cabinet. Callouts: upper retaining bar, console extender module, middle retaining bar. MR−0036−93RAGS]

Table 5–12 Console Extender Module Removal Procedure

Step  Action
1     Ask the operator or system manager to shut down the zone using the procedure in Section 5.3.2.
2     Open the rear door of the cabinet.
3     Remove the four screws from the upper retaining bar.
4     Remove the four screws from the middle retaining bar.
5     Turn off any devices connected to the console extender module.
6     Label any cables connected to the console extender module. Then disconnect them. See Figure 5–13.
7     Insert the module extraction tool into the hole in the console extender module. Turn the tool to the right until it is fastened to the module. See Figure 5–11.
8     Pull the console extender module out of the cabinet.

Figure 5–13 Console Extender Module Layout
[Illustration not reproduced. Connector labels: Local, Remote, UPS, Modem, Alarm. MR−0456−92RAGS]

5.4.11 DSSI Extender Module

Figure 5–14 shows the locations of the DSSI extender modules. Table 5–13 describes the removal procedure.

Figure 5–14 DSSI Extender Module Locations
[Illustration not reproduced. Rear view of the CPU cabinet. Callouts: upper retaining bar, DSSI extender modules, DIMs, middle retaining bar, DSSI cables. MR−0032−93RAGS]

Table 5–13 DSSI Extender Module Removal Procedure

Step  Action
1     Ask the operator or system manager to shut down the zone using the procedure in Section 5.3.2.
2     Open the rear door of the cabinet.
3     Remove the four screws from the upper retaining bar.
4     Remove the four screws from the middle retaining bar.
5     Turn off all the devices connected to the DSSI extender module.
6     Label the two DSSI cables and disconnect them from the module. See Figure 5–14.
7     Insert the module extraction tool into the hole in the DSSI extender module. Turn the tool to the right until it is fastened to the module. See Figure 5–11.
8     Pull the DSSI extender module out of the cabinet.

5.4.12 CAMP Module

Figure 5–15 shows the locations of the CAMP modules. Table 5–14 describes the removal procedure.

Caution
Removing or replacing the CAMP module without shutting down 48V_DRCT may cause damage to the CAMP module.

Figure 5–15 CAMP Module Locations
[Illustration not reproduced. Rear view of the CPU cabinet. Callout: CAMP module. MR−0475−92RAGS]

Table 5–14 CAMP Module Removal Procedure

Step  Action
1     Ask the operator or system manager to shut down the zone using the procedure in Section 5.3.2.
2     Open the rear door of the cabinet.
3     Set the FEU circuit breaker to the off position.
4     Remove the four screws from the upper retaining bar.
5     Remove the four screws from the middle retaining bar.
6     Turn off all the devices connected to the CAMP module.
7     Insert the module extraction tool into the hole in the CAMP module. Turn the tool to the right until it is fastened to the module. See Figure 5–11.
8     Pull the CAMP module out of the cabinet.

5.4.13 DSSI Interface Module (DIM)

Figure 5–16 shows the location of the interface logic modules. Figure 5–17 shows how to remove the DIMs. Table 5–15 describes the removal procedure.

Figure 5–16 DIM Location
[Illustration not reproduced. Rear view of the CPU cabinet. Callouts: middle retaining bar, interface logic modules (DIMs and EIMs), lower retaining bar. MR−0433−92RAGS]

Figure 5–17 DIM Removal
[Illustration not reproduced. Rear view of the CPU cabinet and expansion cabinet. Callouts: connector, DSSI cable. MR−0046−93RAGS]

Table 5–15 DIM Removal Procedure

Step  Action
1     Ask the operator or system manager to shut down the zone using the procedure in Section 5.3.2.
2     Open the rear door of the cabinet.
3     Remove the four screws from the middle retaining bar.
4     Remove the four screws from the lower retaining bar.
5     Turn off all the devices connected to the DIM you are removing.
6     Disconnect the DSSI cable from the DIM by loosening the two thumb screws. See Figure 5–17.
7     Insert the module extraction tool into the hole in the DIM. Turn the tool to the right until it is fastened to the module. See Figure 5–11.
8     Pull the DIM out of the cabinet.

5.4.14 Ethernet Interface Module (EIM)

Figure 5–16 shows the location of the interface logic modules. Figure 5–18 shows how to remove the EIMs. Table 5–16 describes the removal procedure.

Figure 5–18 EIM Removal
[Illustration not reproduced. Rear view of the CPU cabinet and expansion cabinet. Callouts: Ethernet switch, Ethernet cable connector, Ethernet cable, terminator. MR−0455−92RAGS]

Table 5–16 EIM Removal Procedure

Step  Action
1     Ask the operator or system manager to shut down the zone using the procedure in Section 5.3.2.
2     Open the rear door of the cabinet.
3     Remove the four screws from the middle retaining bar.
4     Remove the four screws from the lower retaining bar.
5     Turn off all the devices connected to the EIM you are removing.
6     Disconnect the Ethernet cable from the EIM. See Figure 5–18.
7     Disconnect the terminator from the EIM, if one is present. See Figure 5–18.
8     Insert the module extraction tool into the hole in the EIM. Turn the tool to the right until it is fastened to the module. See Figure 5–11.
9     Pull the EIM out of the cabinet.

5.4.15 DSSI Cable Removal and Replacement

Table 5–17 describes the removal procedure.

Table 5–17 DSSI Cable Removal Procedure

Step  Action
1     Ask the operator or system manager to shut down the zone using the procedure in Section 5.3.2.
2     Open the rear door of the cabinet.
3     Turn off all the devices connected to the DSSI cable you are removing.
4     Disconnect one end of the DSSI cable from the device by loosening the two screws on the DSSI connector.
5     Route the DSSI cable through the access hole between the system cabinets.
6     Disconnect the other end of the DSSI cable from the DIM by loosening the two screws on the DSSI connector.
5.4.16 TF85C-BA Tape Drive

Figure 5–19 and Figure 5–20 show how to remove a TF85C-BA tape drive from the system. Table 5–18 describes the removal procedure.

Warning
Two people are required to lift and carry the TF85C-BA tape drive enclosure.

Figure 5–19 TF85C-BA Tape Drive, Rear View
[Illustration not reproduced. Callouts: DSSI connectors, power supply fault indicator (behind panel), line voltage selector switch (behind panel, 115/230). MR-0454-92DG]

Figure 5–20 TF85C-BA Tape Drive Removal
[Illustration not reproduced. Callouts: tape drive enclosure, release tab, front plate screws (4), screws (3), TF85 tape drive, front plate. MR-0038-93DG]

Table 5–18 TF85C-BA Tape Drive Removal Procedure

Step  Action
1     Ask the operator or system manager to dismount the tape.
2     Ask the operator or system manager to dismount the tape drive.
3     Unload the tape magazine, if one is present.
4     At the front of the drive, set the power switch to off (0). All the indicators should be off.
5     Disconnect the power cable from the rear of the drive. See Figure 5–19.
6     Disconnect the two DSSI cables from the rear of the drive. See Figure 5–19.
7     At the front of the drive, remove the three screws that secure the tape drive enclosure in the cabinet. See Figure 5–20.
8     Slide the tape drive enclosure out of the expansion cabinet.
9     Remove the four screws that secure the front plate on the tape drive enclosure.
10    Push the release tab down and pull the drive straight out of the slot.

5.4.17 SF73 Disk Drive

Figure 5–21 and Figure 5–22 show how to remove the SF73 disk drives from the system. Figure 5–23 shows how to remove an SF73 disk drive enclosure from the system. Figure 5–24 shows how to remove an SF73 disk ISE from a drive. Table 5–19 describes the removal procedure.

Warning
Two people are required to lift and carry the SF73 disk drive enclosure.
Figure 5–21 SF73 Disk Drive, Rear View
[Illustration not reproduced. Callouts: DSSI connectors, ac power switch (1/0), power supply fault indicator (behind panel), line voltage selector switch (behind panel, 115/230). MR-0422-92DG]

Figure 5–22 SF73 Disk Drive, Front View
[Illustration not reproduced. Callouts: Ready, Write Protect, Fault, and DSSI ID controls for each drive position; captive screws; front cover door. MR-0035-93DG]

Table 5–19 SF73 Disk Drive Enclosure Removal Procedure

Step  Action
1     Ask the operator or system manager to dismount the drive.
2     Turn off the disk drive enclosure.
3     Disconnect the power cable from the rear of the drive. See Figure 5–21.
4     Disconnect the two DSSI cables from the rear of the drive. See Figure 5–21.
5     Remove the mounting screws from the retainers that secure the drive enclosure in the cabinet. See Figure 5–23.
6     Slide the disk drive enclosure out of the expansion cabinet.
7     Remove the retainer screws that secure the retainers on the disk drive enclosure. See Figure 5–23.
8     Loosen the captive screws that secure the front cover on the disk drive enclosure. See Figure 5–22.
9     Disconnect all cables from the disk ISE. Slide the disk ISE out of the disk drive enclosure. See Figure 5–24.

Figure 5–23 SF73 Disk Drive Enclosure Removal
[Illustration not reproduced. Callouts: retainer screws, chassis retainer, mounting screws, retainer. MR-0484-92DG]

Figure 5–24 SF73 Disk ISE Removal
[Illustration not reproduced. Callouts: DSSI cable, 10-pin OCP cable, 6-pin power cable, skid plate, guide, disk ISE. MR-0034-93DG]

5.4.18 SF35 Storage Array

Figure 5–23 shows how to remove an SF35 storage array from the system.
Figure 5–25 and Figure 5–26 show the rear and front views of the SF35 storage array. Figure 5–27 shows how to remove an SF35 disk ISE from the storage array. Table 5–20 describes the removal procedure.

Warning
Two people are required to lift and carry the SF35 storage array.

Figure 5–25 SF35 Storage Array, Rear View
[Illustration not reproduced. Callouts: DSSI connectors, drive positions A through F, ac power switch (1/0), power supply fault indicator (behind panel), line voltage selector switch (behind panel, 115/230). MR-0421-92DG]

Figure 5–26 SF35 Storage Array, Front View
[Illustration not reproduced. Callouts: operator control panel (OCP) with Ready, Write Protect, and Fault indicators for front and rear drive positions A through F; drive dc power switches. MR-0470-92DG]

Figure 5–27 SF35 Disk ISE Removal
[Illustration not reproduced. Callouts: carrier lever, screw, front and rear drive positions A through F. MR-0033-93DG]

Table 5–20 SF35 Storage Array Removal Procedure

Step  Action
1     Ask the operator or system manager to dismount the disk.
2     Turn off the storage array.
3     Disconnect the power cable from the rear of the storage array. See Figure 5–25.
4     Disconnect the two DSSI cables from the rear of the storage array. See Figure 5–25.
5     Remove the mounting screws from the retainers that secure the storage array in the cabinet. See Figure 5–23.
6     Slide the disk drive enclosure out of the expansion cabinet.
7     Remove the retainer screws that secure the retainers on the storage array. See Figure 5–23.
8     Remove the screw from the carrier lever. See Figure 5–27.
9     Pull the carrier lever forward and slide the disk ISE out of the slot. See Figure 5–27.

5.4.19 TF857-CA Tape Drive

Figure 5–28 shows how to remove the TF857-CA tape drive from the system.
Table 5–21 describes the removal procedure.

Warning
Two people are required to lift and carry the TF857-CA tape drive enclosure.

Figure 5–28 TF857-CA Tape Drive, Rear View
[Illustration not reproduced. Callouts: DSSI cable, cable clip, tiewraps, power cable, push cable tie. MR-0420-92DG]

Table 5–21 TF857-CA Tape Drive Removal Procedure

Step  Action
1     Ask the operator or system manager to shut down the zone using the procedure in Section 5.3.2.
2     Ask the operator or system manager to dismount the tape drive.
3     Unload the tape magazine, if one is present.
4     At the front of the drive, set the power switch to off (0). All the indicators should be off.
5     Disconnect the power cable from the rear of the drive. See Figure 5–28.
6     Disconnect the two DSSI cables from the rear of the drive. See Figure 5–28.
7     Remove the mounting screws from the retainers that secure the drive enclosure in the cabinet. See Figure 5–23.
8     Slide the tape drive enclosure out of the expansion cabinet.
9     Loosen the shipping restraint screw until the shipping bracket drops. See Figure 5–29. If the shipping bracket does not drop when you loosen the shipping restraint screw, push the shipping bracket down with a screwdriver.
10    Slide the tape drive enclosure out of the expansion cabinet.

Figure 5–29 Loosening the Shipping Restraint Screw
[Illustration not reproduced. Callouts: shipping bracket, shipping restraint screw. MR-0466-92DG]

Note
If you are replacing the TF857 tape loader, you must set the node ID. Refer to Figure 5–30 for the node ID DIP switch location.

Figure 5–30 Setting the TF857 Tape Loader Node ID
[Illustration not reproduced. Callouts: node ID DIP switch (positions 1 through 4), drive enclosure, controller module, TF857 tape drive assembly. MR-0467-92DG]

5.4.20 Power Distribution Box

Figure 5–31 shows a domestic power distribution box. Figure 5–32 shows an international power distribution box. Table 5–22 describes the removal procedure.
Figure 5–31 Domestic Power Distribution Box
[Illustration not reproduced. Callouts: ac power outlets (8), hex screws, circuit breaker, DEC power bus switch, ac power cable, access hole. MR-0044-93DG]

Figure 5–32 International Power Distribution Box
[Illustration not reproduced. Callouts: ac power outlets (6), hex screws, ac power connector, circuit breaker, DEC power bus switch, access hole. MR-0045-93DG]

Table 5–22 Power Distribution Box Removal Procedure

Step  Action
1     Turn off any devices connected to the power distribution box.
2     Set the circuit breaker to the off position. See Figure 5–31 or Figure 5–32.
3     Set the DEC power bus switch to the local position. See Figure 5–31 or Figure 5–32.
4     If you are removing a domestic power distribution box, disconnect the ac power cable from facility power. See Figure 5–31. If you are removing an international power distribution box, disconnect the ac power cable from the ac power connector and from facility power. See Figure 5–32.
5     Disconnect any ac power cables connected to the ac power outlets and route the cables through the access hole. See Figure 5–31 or Figure 5–32.
6     Remove the four hex screws that secure the power distribution box in the cabinet. See Figure 5–31 or Figure 5–32.
7     Remove the power distribution box from the cabinet.

6 Managing Integrated Storage Elements

6.1 In This Chapter

This chapter includes:
• Loading the DUP driver
• Using VMS DUP
• Using the server setup switch
• Assigning DSSI unit numbers
• Warm swapping an ISE

6.2 Loading the DUP Driver

If the VMS diagnostic utility protocol (DUP) class driver is not loaded, load it as follows:

    $ MCR SYSGEN [Return]
    SYSGEN> CONNECT FYA0/NOADAPTER [Return]
    SYSGEN> EXIT [Return]

6.3 Using VMS DUP

Use the VMS DUP to change configuration data on mass storage devices.
With DUP, you can connect the terminal to a storage controller with the following DCL command:

    SET HOST/DUP/SERVER=MSCP$DUP/TASK=taskname nodename

where:
taskname – is the name of the utility or diagnostic program to be executed on the target storage system
nodename – is the node name of the ISE

You can use SET HOST/DUP to create a virtual terminal connection to the MSCP$DUP server and to execute a utility or diagnostic program on the MSCP storage controller that uses the DUP standard dialogue. Once the connection is established, operations are under the control of the utility or diagnostic program. When the utility or program ends, control returns to the local system.

PARAMS is the DUP management utility used to examine and change ISE parameters such as node name, allocation class, and unit number. PARAMS is also used to display the state of the ISE and the performance statistics maintained by the ISE.

PARAMS prompts for a command with the PARAMS> prompt. Once you enter a command, PARAMS executes it and prompts you for another command. To stop the PARAMS utility, press Ctrl/C, Ctrl/Y, or Ctrl/Z, or type EXIT at the PARAMS prompt. Table 6–1 lists PARAMS commands.

Table 6–1 PARAMS Commands

Command  Description
EXIT     Stops the PARAMS utility
HELP     Displays information on how to use PARAMS commands
SET      Changes internal ISE parameters
SHOW     Displays the setting of a parameter or a class of parameters
WRITE    Records in nonvolatile RAM the device parameter changes you made with SET

Additional information on ISE tasks and commands is available in the RF/TF-series installation guides.

6.4 Using the Server Setup Switch

The server setup (SU) switch facilitates the installation of a new or incorrectly initialized ISE on a running system. Use SET HOST to connect to the ISE and configure its parameters with DUP before VMS recognizes the ISE as an available resource. Table 6–2 explains how to disable RF-series, SF35, SF72, and SF73 disks.
Table 6–2 Switches For Disabling the MSCP Disks

To Disable    Action                                                                   More Information In
RF-series     Press the SU switch to disable the MSCP/TMSCP server within the ISE.     VAXft Systems Owner’s Manual
SF72 or SF73  Set the drive position's DSSI ID number and the left-most MSCP switch to disable the ISE. The icon on the front of the door indicates the location of the drive.     VAXft Systems Operating Information
SF35          Press the MSCP switch to disable the ISE. The MSCP switch is located on the Operator Control Panel.     VAXft Systems Operating Information

6.5 Assigning DSSI Unit Numbers

By default, the disk drive forces the unit number to the same value as the DSSI node address for the drive. Because the drives in zone A and zone B initially have the same DSSI unit number, reassign unit numbers to remove configuration conflicts and improve system management. All unit numbers must be unique within an allocation class.

Change the UNITNUM and FORCEUNI ISE parameters (see Table 6–3) to override the default values that assign the unit the same value as its node address. Reassign unit numbers so that they have values greater than 99. For example, Figure 6–1 and Figure 6–2 use a 100-, 200-, 300-, 400-, 500-, and 600- numbering scheme for SF35s and SF73s.

Figure 6–1 VAXft Model 810 Front View
[Figure labels: Front; Expansion Cabinet; CPU Cabinet; SF73; SF35; unit numbers 100 through 800, 701; positions A 100, B 101, C 102, D 103, E 104, F 105]

6.6 Warm Swapping an ISE

Warm swapping is the procedure by which an ISE can be replaced or added to a running system without interrupting system operations.

Caution
The procedure must be followed carefully. If a parameter is not entered correctly, then a system reboot is necessary or the ISE (and possibly the system) is rendered unusable.

The VMS operating system recognizes an ISE by its unique values for the NODENAME and SYSTEMID parameters.
If only one of these parameters is changed, VMS inhibits connections to the old and new parameters for the ISE.

Variations of this procedure depend on the purpose for the warm swap. An ISE can be warm swapped for the following reasons:
• Removal and replacement for storage
• Replacement in a system that is running
• Installation in a system that is running

Figure 6–2 VAXft Model 810 Rear View
[Figure labels: Rear; CPU Cabinet; Expansion Cabinet; SF73; SF35; unit numbers 100 through 800, 702, 703; positions A 106, B 107, C 108, D 109, E 110, F 111]

When replacing an ISE or installing a new ISE, determine the parameter values for the ISE before performing the warm swap procedure. Assign values for each of the ISE parameters described in Table 6–3.

Table 6–3 ISE Parameters

Parameter  Description¹
ALLCLASS   Allocation class. The default value is 0. Set the ALLCLASS value to the allocation class chosen for the system. Note that shadowed disk devices must be set to a nonzero allocation class.
FORCENAM   Force name parameter. Determines if the ISE is to use the NODENAME parameter value instead of the manufacturing name given to the ISE. The value must be 0. If the value is 1, the ISE uses a generic device name such as RF31x.
FORCEUNI   Force unit parameter. To use UNITNUM as the device unit number, set the FORCEUNI parameter to 0. The factory default value of 1 uses the DSSI node address (hardwired on the backplane) as the unit number.
NODENAME   Node name for an ISE. Each ISE has a node name that is stored in EEPROM. The node name is determined in the manufacturing process and is unique to each ISE. The node name can be changed depending on the needs of the site.
SYSTEMID   System identification number. All SYSTEMIDs must be unique within the system. Do not change this parameter when introducing a new ISE to the system.
UNITNUM    Unit number. Specifies a numeric value for the device name. Use a unit number that is unique within the allocation class to which you are configuring the unit. Follow the unit numbering scheme described in Section 6.5 or use another scheme that keeps unit numbers unique within the allocation class.

¹ RF-series devices only

More information is available on ISE parameters in the RF/TF-series installation guides.

6.6.1 Setting ISE Parameters

Digital Equipment Corporation recommends maintaining a worksheet of the parameters for all ISEs, as well as the serial number of each ISE. This is especially important at sites that maintain a set of spare drives that may be stored for some time before they are used. The worksheet aids in:
• Preventing duplicate parameters, which render an ISE unusable until the duplication is isolated and corrected
• Finding the parameter settings of a non-operational ISE to create a replacement unit with identical parameters

Use the ISE parameter worksheets in Appendix B to identify and record critical parameter names and values.

When installing a new ISE, select parameter values that meet the site ISE configuration or guidelines. Then continue with Section 6.6.4.

When replacing an ISE, make sure the parameters selected are not being used for another ISE in the configuration. If the parameter values were not recorded, perform the following steps to extract the information required from your system:

1. Enter SHOW DEVICE DI to display the following information:
   • Device name
     The device names in the sample output below are $1$DIA22 and $1$DIA21.
   • NODENAME
     The node name is shown in parentheses. In the following sample output, the node names are RIRRBA and RICYAA.
   • ALLCLASS
     The allocation class is found in the device name between the dollar signs ($). In $1$DIA21, the ISE has an allocation class of 1. If the allocation class were 0, the device name would display as RICYAA$DIA21.
   • UNITNUM
     The unit number is the number following the DIA. In $1$DIA21, the UNITNUM is 21. It is the MSCP unit number.
   • FORCENAM
     The force name parameter is set to 0 if NODENAME is anything other than RF31x. The x corresponds to a DSSI node ID (A = 0, B = 1, and so on).
   • FORCEUNI
     The force unit parameter is not shown in the sample, but it should be 0 if the configuration rules given in the VAXft Systems Configuration Guide were followed.

2. Determine whether the VMS DUP class driver is loaded by entering the following DCL command:

    $ SHOW DEVICE FYA0 Return

   If the driver is not loaded, load it as follows:

    $ MCR SYSGEN Return
    SYSGEN> CONNECT FYA0/NOADAPTER Return
    SYSGEN> EXIT Return

3. Enter SET HOST/DUP to establish a DUP connection with the ISE as follows:

    $ SET HOST/DUP/SERVER=MSCP$DUP/TASK=PARAMS nodename

   This invokes DUP on the ISE and runs the PARAMS utility.

If a connection cannot be established with the ISE DUP, use ANALYZE/SYSTEM to find information on some of the parameters. In the following sample output, the SYSTEMID is 94100302 and the ALLCLASS is 1.

    $ ANALYZE/SYSTEM Return
    VMS System Analyzer
    SDA> SHOW DEVICE $1$DIA21
    I/O data structures
    -------------------
    $1$DIA21                    RF31            UCB address: 802D65D0

    Device status:   00021810 online,valid,unload,lcl_valid
    Characteristics: 1C4D4108 dir,rct,fod,shr,avl,mnt,elg,idv,odv,rnd
                     000022A1 clu,mscp,srv,nnm,loc

    Owner UIC [000010,000001]   Operation count   1116   ORB address   802D6700
    PID            00000000     Error count          0   DDB address   804DA680
    Alloc. lock ID 00B000E5     Reference count      1   DDT address   80308BD8
    Alloc. class          1     Online count         2   VCB address   802E2750
    Class/Type        01/38     BOFF              0000   CRB address   8048C250
    Def buf. size       512     Byte count        0000   PDT address   802A5F80
    DEVDEPEND      00000000     SVAPTE        00000000   CDDB address  802D6410
    DEVDEPND2      00000000     DEVSTS            0004   I/O wait queue   empty
    FLCK index           34     RWAITCNT          0000
    DLCK address   00000000

    Press RETURN for more.
    SDA> Return
    I/O data structures
    -------------------
    --- Primary Class Driver Data Block (CDDB) 802D6410 ---

    Status:           0040 alcls_set
    Controller Flags: 80D4 icf_mlths,cf_this,cf_misc,cf_attn,cf_replc

    Allocation class        1   CDRP Queue      empty   DDB address   804DA860
    System ID        94100302   Restart Queue   empty   CRB address   8048C250
                         4041   DAP count           3   CDDB link     80344C30
    Contrl. ID       94100302   Contr. timeout     60   PDT address   802A5F80
                     01644041   Reinit Count        0   Original OCB  00000000
    Response ID      00000000   Wait UCB Count      0   UCB chain     802D65D0
    MSCP Cmd status  FFFFFFFF

    *** I/O request queue is empty ***
    Press RETURN for more.
    SDA> EXIT Return
    $
    $ SHOW DEVICE DI Return
    Device                  Device       Error   Volume    Free    Trans  Mnt
     Name                   Status       Count    Label    Blocks  Count  Cnt
    $1$DIA22   (RIRRBA)     Mounted          0   DISK22    744282      1    1
    $1$DIA21   (RICYAA)     Online           5

6.6.2 ISE Removal

When you replace an ISE, initialize the new ISE with the same parameters as the ISE being replaced. Refer to the worksheet maintained for that ISE. (See Section 6.6.1.)

You can turn off power and replace an ISE in a running system without interrupting system services or users. When the ISE is replaced, the new ISE must be correctly initialized to:
• Supersede pre-set manufacturing values
• Store the modified values in EEPROM

To replace an ISE in a system that is running, perform the following steps:

Caution
You must use an ESD wrist strap, ground clip, and grounded ESD workmat whenever you handle ISEs. Use the static protective service kit (PN 29-262446). Use great care when you handle an ISE; excessive shock can damage the head-disk-assembly (HDA).

1. If the ISE is mounted, logically dismount it from the system.

2. Make the device unavailable to the system by entering the following DCL command:

    $ SET DEVICE/NOAVAILABLE devicename Return

3. Verify that the device has been marked as unavailable by entering the following DCL command:

    $ SHOW DEVICE $1$DIA21 Return
    Device                  Device       Error   Volume    Free    Trans  Mnt
     Name                   Status       Count    Label    Blocks  Count  Cnt
    $1$DIA21   (RICYAA)     Unavailable      5

4. Set the ISE power switch to off (0). Wait 45 seconds for the drive to stop spinning (and, on RF-series disks, for the interlock solenoid to release).

5. Remove the ISE from the slot. Follow the steps in the device owner’s manual, and observe all FRU handling procedures.

6.6.3 ISE Replacement

When you replace an ISE in a system that is running, use the following steps to restore the parameters from the ISE being replaced. When you install a new ISE in a system that is running, use the steps described in Section 6.6.4.

Caution
You must use an ESD wrist strap, ground clip, and grounded ESD workmat whenever you handle ISEs. Use the static protective service kit (PN 29-262446). Use great care when handling an ISE. Excessive shock can damage the HDA.

1. Disable the MSCP server as described in Table 6–4.

   Table 6–4 Disabling the MSCP Disks

   To Disable           Action
   RF-series            Press and hold the SU switch/button
   SF72 or SF73 series  Set the MSCP enable switch
   SF35                 Press the MSCP/Fault switch (LED is green when enabled)

2. Set the ISE power switch to on (1). Wait for the drive to start spinning (and, on RF-series disks, for the interlock solenoid to lock).

3. If you have an RF-series disk, release the server setup switch. If you have an SF-series disk, continue with Step 4.

4. Verify that the device has been marked as available by entering the following DCL command:

    $ SHOW DEVICE devicename Return

5. Find the NODENAME parameter for the replacement ISE by entering SHOW CLUSTER. (SHOW DEVICE will not work at this time.) In the sample output below, R1QSAA is the replacement ISE.
    $ SHOW CLUSTER Return
    View of Cluster from system ID 63973 node CLOUDS
    +-----------------------------+
    |      SYSTEMS      | MEMBERS |
    +-----------------------------+
    |  NODE  | SOFTWARE | STATUS  |
    +-----------------------------+
    | CLOUDS | VMS V5.4 | MEMBER  |
    | RICYAA | RFX V200¹|         |
    | RIRRBA | RFX V200 |         |
    | R1QSAA | RFX V200 |         |
    +-----------------------------+

    ¹ Firmware version number

6. Determine whether the VMS DUP class driver is loaded by entering the following DCL command:

    $ SHOW DEVICE FYA0 Return

   If the driver is not loaded, load it by entering the following:

    $ MCR SYSGEN Return
    SYSGEN> CONNECT FYA0/NOADAPTER Return
    SYSGEN> EXIT Return

7. Enter SET HOST/DUP to establish a DUP connection with the ISE as follows:

    $ SET HOST/DUP/SERVER=MSCP$DUP/TASK=PARAMS nodename

   This invokes DUP on the ISE and runs the PARAMS utility.

8. Refer to the parameters listed in Table 6–3, and enter the SET command to set appropriate values for the parameters. Be sure to record the new parameters on the worksheet for the ISE. For example:

    $ SET HOST/DUP/SERVER=MSCP$DUP/TASK=PARAMS R1QSAA Return
    %HSCPAD-I-LOCPROGEXE, Local program executing - type ^\ to exit
    Copyright (C) 1993 Digital Equipment Corporation

    PARAMS> SHOW NODENAME Return
    Parameter  Current       Default        Type       Radix
    ---------- ------------- -------------- ---------- ---------
    NODENAME   R1QSAA        RF31           String     Ascii
    PARAMS> SET NODENAME RICYAA Return
    PARAMS> SHOW SYSTEMID Return
    Parameter  Current       Default        Type       Radix
    ---------- ------------- -------------- ---------- ---------
    SYSTEMID   593200495860  0000000000000  Quadword   Hex       B
    PARAMS> SET SYSTEMID 0404194100302 Return
    PARAMS> SHOW ALLCLASS Return
    Parameter  Current       Default        Type       Radix
    ---------- ------------- -------------- ---------- ---------
    ALLCLASS   0             0              Byte       Dec       B
    PARAMS> SET ALLCLASS 1 Return
    PARAMS> SHOW FORCENAM Return
    Parameter  Current       Default        Type       Radix
    ---------- ------------- -------------- ---------- ---------
    FORCENAM   0             0              Boolean    0/1       B
    PARAMS> SHOW UNITNUM Return
    Parameter  Current       Default        Type       Radix
    ---------- ------------- -------------- ---------- ---------
    UNITNUM    0             0              Word       Dec       U
    PARAMS> SET UNITNUM 21 Return
    PARAMS> SHOW FORCEUNI Return
    Parameter  Current       Default        Type       Radix
    ---------- ------------- -------------- ---------- ---------
    FORCEUNI   1             1              Boolean    0/1       U
    PARAMS> SET FORCEUNI 0 Return
    PARAMS> WRITE Return
    Changes require controller initialization, ok? [Y/(N)] Y
    Initializing...
    HSCPAD-S-REMPGMEND, Remote program terminated - message number 3
    %HSCPAD-S-END, Control returned to CLOUDS
    $

9. Make the device available to the system by entering the following DCL command:

    $ SET DEVICE/AVAILABLE devicename Return

10. Mount the ISE in the system and restore the shadow sets.

11. On SF-series drives, enable the MSCP switch.

When initialization is complete, the replacement ISE and its parameters are made available to the VMS operating system.

Note
The SHOW CLUSTER command continues to show the name of the ISE replaced. This does not harm the system. After the next reboot, the replacement ISE name appears.

Note also that the following message is displayed if another node is already assigned the same SYSTEMID and NODENAME:

    %PWA0-REMOTE SYSTEM CONFLICTS WITH KNOWN SYSTEM

In this case, shut down the new node and issue a unique SYSTEMID and NODENAME for the new node.

6.6.4 Installing an ISE in a Running System

When you install a new ISE in a system that is running, perform the following steps to initialize the new ISE parameters:

1. Disable the MSCP server as described in Table 6–5.

   Table 6–5 Disabling the MSCP Disks

   To Disable    Action
   RF-series     Press and hold the SU switch/button
   SF72 or SF73  Set the MSCP enable switch
   SF35          Press the MSCP/Fault switch (LED is green when enabled)

2. Set the ISE power switch to on (1). Wait for the drive to start spinning (and, on RF-series disks, for the interlock solenoid to lock).

3. If you have an RF-series disk, release the server setup switch.
   If you have an SF disk, continue with Step 4.

4. Refer to Table 6–3 and Section 6.6.1, and select values for the following parameters:
   • ALLCLASS
   • FORCENAM
   • FORCEUNI
   • NODENAME
   • UNITNUM

5. Determine whether the VMS DUP class driver is loaded by entering the following DCL command:

    $ SHOW DEVICE FYA0 Return

   If the driver is not loaded, load it by entering the following:

    $ MCR SYSGEN Return
    SYSGEN> CONNECT FYA0/NOADAPTER Return
    SYSGEN> EXIT Return

6. Enter SET HOST/DUP to establish a DUP connection with the ISE as follows:

    $ SET HOST/DUP/SERVER=MSCP$DUP/TASK=PARAMS nodename

   This invokes DUP on the ISE and runs the PARAMS utility.

7. Use SET to assign appropriate values for the parameters. Be sure to record the new parameters on the worksheet for the ISE.

   In the following sample output, the new ISE is configured to be device $1$DIA22. The device is initialized with these parameters:
   • ALLCLASS: 1
   • FORCENAM: 0
   • FORCEUNI: 0
   • NODENAME: DISK22
   • SYSTEMID: no change
   • UNITNUM: 22

    $ SET HOST/DUP/SERVER=MSCP$DUP/TASK=PARAMS R1QSAA Return
    %HSCPAD-I-LOCPROGEXE, Local program executing - type ^\ to exit
    Copyright (C) 1990 Digital Equipment Corporation

    PARAMS> SHOW NODENAME Return
    Parameter  Current       Default        Type       Radix
    ---------- ------------- -------------- ---------- ---------
    NODENAME   R1QSAA        RF31           String     Ascii
    PARAMS> SET NODENAME DISK22 Return
    PARAMS> SHOW ALLCLASS Return
    Parameter  Current       Default        Type       Radix
    ---------- ------------- -------------- ---------- ---------
    ALLCLASS   0             0              Byte       Dec       B
    PARAMS> SET ALLCLASS 1 Return
    PARAMS> SHOW FORCENAM Return
    Parameter  Current       Default        Type       Radix
    ---------- ------------- -------------- ---------- ---------
    FORCENAM   0             0              Boolean    0/1       B
    PARAMS> SHOW UNITNUM Return
    Parameter  Current       Default        Type       Radix
    ---------- ------------- -------------- ---------- ---------
    UNITNUM    0             0              Word       Dec       U
    PARAMS> SET UNITNUM 22 Return
    PARAMS> SHOW FORCEUNI Return
    Parameter  Current       Default        Type       Radix
    ---------- ------------- -------------- ---------- ---------
    FORCEUNI   1             1              Boolean    0/1       U
    PARAMS> SET FORCEUNI 0 Return
    PARAMS> WRITE Return
    Changes require controller initialization, ok? [Y/(N)] Y
    Initializing...
    HSCPAD-S-REMPGMEND, Remote program terminated - message number 3
    %HSCPAD-S-END, Control returned to CLOUDS
    $

   When initialization is complete, the new ISE and its parameters are made available to the VMS operating system.

8. On SF-series drives, enable the MSCP switch.

Note
The SHOW CLUSTER command continues to show the name of the ISE you replaced. This does not harm the system. After the next reboot, the new ISE name appears.

A Miscellaneous System Information

A.1 In This Appendix

This appendix includes:
• Processor Halt codes
• Console Halt codes
• Error register descriptions
• I/O physical address space
• System control block description

A.2 Processor Halt Codes

Table A–1 provides the processor Halt code definitions.

Table A–1 Processor Halt Code Definitions

Halt Code           Number  Definition
CPM$K_EXT_HALT      ?02     External halt
CPM$K_RESET         ?03     Reset
CPM$K_BAD_ISP       ?04     Interrupt stack not valid
CPM$K_DBL_ERR1      ?05     Machine check during execution
CPM$K_HALT          ?06     Halt instruction executed
CPM$K_SCB_ERR3      ?07     SCB vector bits [01:00] = 11
CPM$K_SCB_ERR2      ?08     SCB vector bits [01:00] = 10
CPM$K_CHM_FRM_ISTK  ?0A     CHMx executed while on interrupt stack
CPM$K_CHM_TO_ISTK   ?0B     CHMx to interrupt stack
CPM$K_SCB_READ_ERR  ?0C     SCB read error
CPM$K_MERR_V        ?10     ACV or TNV during machine check
CPM$K_KSP_V         ?11     ACV or TNV during KSP exception
CPM$K_DBL_ERR2      ?12     Machine check during machine check
CPM$K_DBL_ERR3      ?13     Machine check during KSP not valid
CPM$K_PSL_EXC5      ?19     PSL [26:24] = 101 during interrupt or exception
CPM$K_PSL_EXC6      ?1A     PSL [26:24] = 110 during interrupt or exception
CPM$K_PSL_EXC7      ?1B     PSL [26:24] = 111 during interrupt or exception
CPM$K_PSL_REI5      ?1D     PSL [26:24] = 101 during REI
CPM$K_PSL_REI6      ?1E     PSL [26:24] = 110 during REI
CPM$K_PSL_REI7      ?1F     PSL [26:24] = 111 during REI

The following example shows a processor Halt code output. Table A–2 defines the Halt Reason fields.

    >>>
    ?03 Reset (Reason = 0017)
        PC= 01E00000 PSL= 041F0300

Table A–2 Processor Halt Reason Code Definitions

Reason Code (Hex)  Definition
0001               Duplex zones have diverged
0002               Fatal cross-link error has occurred
0003               Fatal zone error has occurred
0004               Fatal ATM error has occurred
0005               Fatal CPU module error has occurred
0006               Fatal memory error has occurred
0007               Single bit memory error has occurred
0008               User command issued to stop a zone
0009               Unexpected machine check has occurred
000A               Software detected failure has occurred
000B               Solid NXIO error has occurred
000C               Excessive transient NXIO errors have occurred
000D               A solid IO error has occurred
000E               Excessive transient IO errors have occurred
000F               Excessive VAXELN kernel recoverable errors have occurred
0010               A VAXELN master fatal error has occurred
0011               A VAXELN job fatal error has occurred
0012               Not enough SPTEs could be allocated to boot OpenVMS
0013               Unexpected system error occurred¹
0014               Interface module failure has occurred
0015               Unexpected VAXELN error occurred
0016               A VAXELN kernel fatal error has occurred
0017               Initializing VAXELN before starting reconfiguration

¹ Reset reason 0013 indicates that an unexpected system error occurred. The contents of the SYSFLT, SYSADR, and DMAADR registers will be saved in the CCA area. See Figure A–4 for the CCA offsets of these registers. Use the register bitmaps and description in Section A.4 to determine the cause of the error.

A.3 Console Halt Codes

The following example shows a console Halt code output. Table A–3 defines the Halt Reason fields.

    >>>
    ?03 Reset (Reason = 0013)
        PC= 01E00000 PSL= 041F0300

Table A–3 Console Halt Reason Code Definitions

Reason Code (Hex)  Definition
0000               Power-up reset
0001               Duplex zones have diverged
0002               Fatal cross-link error has occurred
0003               Fatal zone error has occurred
0004               Fatal ATM error has occurred
0005               Fatal CPU module error has occurred
0006               Fatal memory error has occurred
0007               Single bit memory error has occurred
0008               User command issued to stop a zone
0009               Unexpected machine check has occurred
000A               Software detected failure has occurred
000B               Solid NXIO error has occurred
000C               Excessive transient NXIO errors have occurred
000D               A solid IO error has occurred
000E               Excessive transient IO errors have occurred
000F               Excessive VAXELN kernel recoverable errors have occurred
0010               A VAXELN master fatal error has occurred
0011               A VAXELN job fatal error has occurred
0012               Not enough SPTEs could be allocated to boot OpenVMS
0013               Unexpected system error occurred¹
0014               Interface module failure has occurred
0015               Unexpected VAXELN error occurred
0016               A VAXELN kernel fatal error has occurred
0017               Initializing VAXELN before starting reconfiguration

¹ Reset reason 0013 indicates that an unexpected system error occurred. The contents of the SYSFLT, SYSADR, and DMAADR registers will be saved in the CCA area. See Figure A–4 for the CCA offsets of these registers. Use the register bitmaps and description in Section A.4 to determine the cause of the error.

A.4 Error Register Descriptions

A.4.1 System Fault (SYSFLT) Register

This register is not rail or zone unique (Figure A–1). Software does not take special precautions when reading this register. In addition, the register is continuously updated. The setting of one error bit does not prevent other bits from being set. The register contains bits which cause IPL29 interrupts.

All bits in this register have the following characteristics: default = 0, type = ro, reset = hr.

Figure A–1 System Fault Register
[Figure: 32-bit register layout. Bit 31 SFB; bits 30:28 XLM; bits 27:26 not used; bits 25:16 LCK, RSA, CBG, PWG, CPB, CPA, HTB, HTA, MFB, MFA; bits 15:00 MDB, MDA, MSB, MSA, JDB, JDA, JSB, JSA, NXB, NXA, IOB, IOA, DNB, DNA, DMB, DMA]

Register Address: CPU = E110 1100 (CCA offset = 15C)

[31]: SFB - Solid Fault Bit. Latched when an automatic retry on an I/O operation fails to complete properly.
[30:28]: XLM - Xlink Mode [2:0]. This field, sourced by the Xlink, is read-only and indicates the Xlink mode specified in Table A–4.

Table A–4 Xlink Mode Coding

Code  Mode
000   Xlink Off
001   Xlink Slave
010   Xlink Master
011   Xlink Duplex
100   Not Used
101   Resync Slave
110   Resync Master
111   Not Used

[27:26]: - Not used.
[25]: LCK - Lock. Latched when an error occurs during an interlock I/O access. (Interlock access refers to the special I/O access mode.)
[24]: RSA - Resync Abort. Latched when an error occurs during resync mode. Resync mode is automatically canceled.
[23]: CBG - Cable Gone. Latched when a cable gone signal is detected. CBG set will force the Xlink to the off mode.
[22]: PWG - Power Gone. Set when the other zone power gone signal is detected.
PWG set will force the Xlink to the off mode.
[21]: CPB - Clock Phase Error (Zone B). Latches a high level assertion on the Clock Phase Error line coming from the Xlink. The high level will remain until a 1 is written to the bit. If the Clock Phase Error signal line is still high after the write 1 to clear, the bit is again set to 1.
[20]: CPA - Clock Phase Error (Zone A). Latches a high level assertion on the Clock Phase Error line coming from the Xlink. The high level will remain until a 1 is written to the bit. If the Clock Phase Error signal line is still high after the write 1 to clear, the bit is again set to 1.
[19]: HTB - Halt Error (Zone B). Latches a high level assertion on the Halt Request line coming from the Xlink. The high level will remain until a 1 is written to the bit. If the Halt Error signal line is still high after the write 1 to clear, the bit is again set to 1.
[18]: HTA - Halt Error (Zone A). Latches a high level assertion on the Halt Request line coming from the Xlink. The high level will remain until a 1 is written to the bit. If the Halt Error signal line is still high after the write 1 to clear, the bit is again set to 1.
[17]: MFB - CPMF (Zone B). Set when the error logic determines that a CPMF is required.
[16]: MFA - CPMF (Zone A). Set when the error logic determines that a CPMF is required.
[15]: MDB - Memory Double-Bit Error (Zone B). Set when a double-bit ECC error or single-bit ECC error is detected during memory writes on the internal Jet Bus ECC checker. This causes a CPMF.
[14]: MDA - Memory Double-Bit Error (Zone A). Set when a double-bit ECC error or single-bit ECC error is detected during memory writes on the internal Jet Bus ECC checker. This causes a CPMF.
[13]: MSB - Memory Single-Bit Error (Zone B). Set when a single-bit ECC error is detected in memory during a read and the JXD was not the requester of the data. The bit is set regardless of the state of the Error Enable bit.
The error is automatically corrected at the CPU. An IPL26 interrupt is generated causing a two-zone system to diverge. Hardware generates an IPL29 interrupt to both zones within three clock cycles.
[12]: MSA - Memory Single-Bit Error (Zone A). Set when a single-bit ECC error is detected in memory during a read and the JXD was not the requester of the data. The bit is set regardless of the state of the Error Enable bit. The error is automatically corrected at the CPU. An IPL26 interrupt is generated causing a two-zone system to diverge. Hardware generates an IPL29 interrupt to both zones within three clock cycles.
[11]: JDB - JXD Double-Bit Error (Zone B). Set when a double-bit ECC error is detected on the internal Jet Bus ECC checker.
[10]: JDA - JXD Double-Bit Error (Zone A). Set when a double-bit ECC error is detected on the internal Jet Bus ECC checker.
[09]: JSB - JXD Single-Bit Error (Zone B). Set when a single-bit ECC error is detected on the internal Jet Bus ECC checker and is detected in memory. The check operation is triggered during Jet Bus transactions. The bit is set regardless of the state of the Error Enable bit. The error is automatically corrected on JXD reads from memory. Detection of this error causes the current DMA address to be latched. The DMA operation is allowed to complete. When finished, the DMA driver will check this bit, and if set will force a mini resync by reading the location pointed to by the DMA Error Address register.
[08]: JSA - JXD Single-Bit Error (Zone A). Set when a single-bit ECC error is detected on the internal Jet Bus ECC checker and is detected in memory. The check operation is only triggered during Jet Bus transactions. The bit is set regardless of the state of the Error Enable bit. The error is automatically corrected on JXD reads from memory. Detection of this error causes the current DMA address to be latched. The DMA operation is allowed to complete.
When finished, the DMA driver will check this bit, and if set will force a mini resync by reading the location pointed to by the DMA Error Address register.
[07]: NXB - Nonexistent I/O (Zone B). Set after any bus timeout. If the retry passes, the Solid Fault bit will not be set.
[06]: NXA - Nonexistent I/O (Zone A). Set after any bus timeout. If the retry passes, the Solid Fault bit will not be set.
[05]: IOB - I/O Error (Zone B). Set by errors that occur from nonfatal or recoverable CPU initiated transactions. Errors resulting from CPU to I/O transactions are retried.
[04]: IOA - I/O Error (Zone A). Set by errors that occur from nonfatal or recoverable CPU initiated transactions. Errors resulting from CPU to I/O transactions are retried.
[03]: DNB - DMA NXIO (Zone B). Set when a bus timeout occurs and the CROME bus is performing a DMA operation.
[02]: DNA - DMA NXIO (Zone A). Set when a bus timeout occurs and the CROME bus is performing a DMA operation.
[01]: DMB - DMA Error (Zone B). Set by DMA errors. If the bit is set, the DMA is aborted. A DMA error may generate a CPMF.
[00]: DMA - DMA Error (Zone A). Set by DMA errors. If the bit is set, the DMA is aborted. A DMA error may generate a CPMF.

A.4.2 System Error Address (SYSADR) Register

This register latches when any error is detected at the JXD Jet Bus and below (Figure A–2). It contains the address the CPU was accessing at the time the error occurred. The register is read only and cleared by clearing errors.

All bits in this register have the following characteristics: default = 0, type = ro, reset = hr.

Figure A–2 JXD System Error Address Register
[Figure: 32-bit register layout. Bits 31:30 DL; bits 29:00 ADR]

Register Address: CPU = E110 1030 (CCA_BASE+160)

[31:30]: DL - Data length:
    00 - Hexword
    01 - Longword
    10 - Quadword
    11 - Octaword
[29:00]: ADR - 30-bit error address latched on CPU operations to the JXD.
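
The SYSFLT and SYSADR field layouts described in Sections A.4.1 and A.4.2 can be decoded mechanically. The following Python sketch is offered only as an illustration and is not part of any Digital-supplied tool; the function and table names (decode_sysflt, decode_sysadr, SYSFLT_BITS, and so on) are hypothetical, while the bit positions, Xlink mode codes, and data length codes are taken from the register descriptions above. The sample values are the register contents used in the Section A.4.4 example.

```python
# Illustrative decoder for the SYSFLT and SYSADR registers (hypothetical helper names).
# Bit assignments follow Sections A.4.1 and A.4.2; bits 27:26 of SYSFLT are not used.

SYSFLT_BITS = {
    31: "SFB", 25: "LCK", 24: "RSA", 23: "CBG", 22: "PWG",
    21: "CPB", 20: "CPA", 19: "HTB", 18: "HTA", 17: "MFB", 16: "MFA",
    15: "MDB", 14: "MDA", 13: "MSB", 12: "MSA", 11: "JDB", 10: "JDA",
    9: "JSB", 8: "JSA", 7: "NXB", 6: "NXA", 5: "IOB", 4: "IOA",
    3: "DNB", 2: "DNA", 1: "DMB", 0: "DMA",
}

# Table A-4 Xlink mode coding, indexed by the 3-bit XLM field.
XLINK_MODES = {
    0b000: "Xlink Off", 0b001: "Xlink Slave", 0b010: "Xlink Master",
    0b011: "Xlink Duplex", 0b100: "Not Used", 0b101: "Resync Slave",
    0b110: "Resync Master", 0b111: "Not Used",
}

# DL field coding for SYSADR bits 31:30.
DATA_LENGTHS = {0b00: "Hexword", 0b01: "Longword", 0b10: "Quadword", 0b11: "Octaword"}


def decode_sysflt(value):
    """Return (Xlink mode, list of set error bit names, highest bit first)."""
    mode = XLINK_MODES[(value >> 28) & 0x7]          # XLM is bits 30:28
    flags = [name for bit, name in sorted(SYSFLT_BITS.items(), reverse=True)
             if value & (1 << bit)]
    return mode, flags


def decode_sysadr(value):
    """Return (data length name, 30-bit error address)."""
    return DATA_LENGTHS[(value >> 30) & 0x3], value & 0x3FFFFFFF


if __name__ == "__main__":
    mode, flags = decode_sysflt(0x300000C0)
    print(mode, flags)
    length, adr = decode_sysadr(0x799F0000)
    print(length, format(adr, "08X"))
```

With the Section A.4.4 values, the SYSFLT word decodes to duplex mode with the NXB and NXA bits set, and the SYSADR word splits into a longword data length and the 30-bit address 399F0000.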
A.4.3 DMA Error Address (DMAADR) Register

When a single-bit ECC error is detected at the JXD, the current DMA subtransfer address into main memory is latched in this register and an IPL29 interrupt is generated. Software allows the DMA to complete and later uses this information to fix the bad location in memory (Figure A–3).

All bits in this register have the following characteristics: default = 0, type = ro, reset = hr.

Figure A–3 JXD DMA Error Address Register
[Figure: 32-bit register layout. Bits 31:30 DL; bits 29:00 DEA]

Register Address: CPU = E110 1040 (CCA_BASE+180)

[31:30]: DL - DMA data length:
    00 - Hexword
    01 - Longword
    10 - Quadword
    11 - Octaword
[29:00]: DEA - DMA 30-bit address latched during error.

A.4.4 Reset Reason 0013 Fault Analysis

The following example shows the content of the SYSFLT and SYSADR registers after a Reset Halt. The paragraph after the example analyzes the register content and identifies the faulty FRU.

    ?03 Reset (Reason = 0013)
        PC= 01E00000 PSL= 041F0300
    >>> E/P 1E9AD5C             ! examine saved SYSFLT register contents
                                ! from CCA_BASE+15C
      P 01E9AD5C 300000C0       ! NXIO, Zone A (bus timeout)
                                ! NXIO, Zone B (bus timeout)
                                ! XLINK MODE = Duplex
    >>> E/P 1E9AD60             ! examine saved SYSADR register contents
                                ! from CCA_BASE+160
      P 01E9AD60 799F0000       ! Zone B, slot 17 P-card address

    CCA Base Address

    MEMORY SIZE   CCA_BASE
    -----------   --------
     32-Mbyte     1E9AC00
     64-Mbyte     3E9AC00
     96-Mbyte     5E9AC00
    128-Mbyte     7E9AC00
    160-Mbyte     9E9AC00
    192-Mbyte     BE9AC00
    224-Mbyte     DE9AC00
    256-Mbyte     FE9AC00

The SYSFLT register indicates an NXIO (nonexistent I/O) error. The SYSADR register contains 799F0000; with the data length field in bits [31:30] removed, the 30-bit address is 399F0000. When this address is sign-extended to 32 bits, it becomes F99F0000. Figure A–4 shows that F99F0000 is the address of an interface module in Zone B, slot 17. The module failed to respond to its address, causing a bus timeout. Replace the module.
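
The arithmetic in the analysis above can be sketched in a few lines of Python. This is an illustration only, under stated assumptions: the helper names are hypothetical, and the constant 165400 (hex) is not given in this manual but is inferred from the CCA Base Address table, in which every CCA_BASE value equals the memory size minus that amount. The P-card slot mapping follows the address ranges listed in Figure A–4.

```python
# Illustrative helpers for the Reset Reason 0013 analysis (hypothetical names).

# Assumption: inferred from the CCA Base Address table, where
# memory size - CCA_BASE is 0x165400 for every listed configuration.
CCA_TOP_OFFSET = 0x165400

SYSFLT_OFFSET = 0x15C   # saved SYSFLT, CCA_BASE+15C
SYSADR_OFFSET = 0x160   # saved SYSADR, CCA_BASE+160
DMAADR_OFFSET = 0x180   # saved DMAADR, CCA_BASE+180


def cca_base(mem_mbytes):
    """Return CCA_BASE for a memory size given in Mbytes (32 through 256 in the table)."""
    return mem_mbytes * 0x100000 - CCA_TOP_OFFSET


def sign_extend_30(adr):
    """Sign-extend a 30-bit address (bit 29 is the sign bit) to 32 bits."""
    adr &= 0x3FFFFFFF
    return (adr | 0xC0000000) if adr & 0x20000000 else adr


def pcard_slot(address):
    """Map a 32-bit I/O address to (zone, slot) for ATM P-card slots 10 to 17.

    Per Figure A-4, Zone A P-cards start at F198 0000 and Zone B P-cards
    at F998 0000, one slot per 0001 0000 of address space.
    Returns None for addresses outside those ranges.
    """
    zone = {0xF1: "A", 0xF9: "B"}.get(address >> 24)
    p = (address >> 16) & 0xFF
    if zone is None or not 0x98 <= p <= 0x9F:
        return None
    return zone, 10 + (p - 0x98)


if __name__ == "__main__":
    base = cca_base(32)
    print("E/P", format(base + SYSFLT_OFFSET, "X"))   # address of saved SYSFLT
    print("E/P", format(base + SYSADR_OFFSET, "X"))   # address of saved SYSADR
    full = sign_extend_30(0x799F0000)
    print(format(full, "08X"), pcard_slot(full))
```

For a 32-Mbyte system this reproduces the two console addresses examined in the example (1E9AD5C and 1E9AD60) and maps the saved SYSADR contents to Zone B, slot 17.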
A.5 I/O Physical Address Space

Figure A–4 shows the I/O physical address space.

Figure A–4 I/O Physical Address Space

Physical address space:

    0000 0000 to 1FFF FFFF    Main Memory (512 Mbytes, 30-bit) (current VMS addressable limit)
    2000 0000 to 3FFF FFFF    Main Memory (512 Mbytes, 32-bit) (support by later VMS release)
    4000 0000                 Unsupported Memory (1 Gbyte)
    8000 0000                 B Cache Tags (1 Gbyte)
    C000 0000                 Unsupported Memory (512 Mbytes)
    E000 0000 to FFFF FFFF    I/O Space (512 Mbytes)

I/O space:

    E000 0000 to EFFF FFFF    CPU Private Space
        E110 1030                 SYSADR Register (CCA offset = 15C)
        E110 1040                 DMAADR Register (CCA offset = 160)
        E110 1100                 SYSFLT Register (CCA offset = 180)
    F000 0000                 Reserved for Zone A (M=0)
    F100 0000                 Zone A I/O ATM, Slot 1 (M=1)
        F198 0000                 Zone A ATM Pcard, Slot 10 (*P=8)
        F199 0000                 Zone A ATM Pcard, Slot 11 (*P=9)
        F19A 0000                 Zone A ATM Pcard, Slot 12 (*P=A)
        F19B 0000                 Zone A ATM Pcard, Slot 13 (*P=B)
        F19C 0000                 Zone A ATM Pcard, Slot 14 (*P=C)
        F19D 0000                 Zone A ATM Pcard, Slot 15 (*P=D)
        F19E 0000                 Zone A ATM Pcard, Slot 16 (*P=E)
        F19F 0000                 Zone A ATM Pcard, Slot 17 (*P=F)
        F1A0 0000 to F1AF FFFF    Zone A I/O ATM Firewall Space
        F1B0 0000 to F1FF FFFF    Zone A I/O RAM/Flash ROM
    F200 0000 to F2FF FFFF    Reserved for Zone A future I/O, Slot 2 (M=2)
    FM00 0000 to FMAF FFFF    Unsupported Zone A I/O (M=3 to 7)
    F800 0000                 Reserved for Zone B (M=8)
    F900 0000                 Zone B I/O ATM, Slot 1 (M=9)
        F998 0000                 Zone B ATM Pcard, Slot 10 (*P=8)
        F999 0000                 Zone B ATM Pcard, Slot 11 (*P=9)
        F99A 0000                 Zone B ATM Pcard, Slot 12 (*P=A)
        F99B 0000                 Zone B ATM Pcard, Slot 13 (*P=B)
        F99C 0000                 Zone B ATM Pcard, Slot 14 (*P=C)
        F99D 0000                 Zone B ATM Pcard, Slot 15 (*P=D)
        F99E 0000                 Zone B ATM Pcard, Slot 16 (*P=E)
        F99F 0000                 Zone B ATM Pcard, Slot 17 (*P=F)
        F9A0 0000 to F9AF FFFF    Zone B I/O ATM Firewall Space
        F9B0 0000 to F9FF FFFF    Zone B I/O RAM/Flash ROM
    FA00 0000 to FAFF FFFF    Reserved for Zone B future I/O, Slot 2 (M=A)
    FM00 0000 to FMFF FFFF    Unsupported Zone B I/O (M=B to F)

A.6 System Control
Block Description

The System Control Block (SCB) contains vectors for servicing interrupts and exceptions. The SCB address should be aligned on a page boundary. The SCB address is contained in the System Control Block Base register (SCBB) (Figure A–5). Microcode forces a longword-aligned SCBB by clearing bits [01:00] of the new value before loading the register.

Figure A–5 System Control Block Base Register (fields, high to low: 0 0, Physical Page Address of SCB, SBZ)

An SCB vector is an aligned longword in the SCB through which the CPU microcode dispatches interrupts and exceptions. Each SCB vector has the format shown in Figure A–6.

Figure A–6 System Control Block Vector Format (bit layout: Longword Address of Service Routine in bits [31:02], Code in bits [01:00])

[31:02]: Longword Address - Virtual address of the service routine for the interrupt or exception. The routine must be longword aligned since the microcode forces the two low-order bits to 0.
[01:00]: Code - The code field is defined in Table A–5.

Table A–5 Code Field Definition

    Code    Definition
    00      The event is to be serviced on the kernel stack unless the CPU is already on the interrupt stack, in which case the event is serviced on the interrupt stack.
    01      The event is to be serviced on the interrupt stack. If the event is an exception, the IPL is raised to 1F (hex).
    10      Unimplemented, results in a console error halt.
    11      Unimplemented, results in a console error halt.

The SCB content is specified in Table A–6.
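The vector format of Figure A–6 and the code field of Table A–5 can be illustrated with a small decoder. This is a sketch under the assumption that a vector has been read as a 32-bit integer; the function name and the sample vector value are hypothetical.

```python
# Code field meanings, paraphrased from Table A-5.
SCB_CODE_MEANING = {
    0b00: "service on kernel stack (interrupt stack if already on it)",
    0b01: "service on interrupt stack",
    0b10: "unimplemented, console error halt",
    0b11: "unimplemented, console error halt",
}


def decode_scb_vector(vector):
    """Split an SCB vector into (service routine address, code field).

    Bits [31:02] hold the longword address of the routine; the two low-order
    bits are treated as 0 in the address and carry the code field instead.
    """
    routine_address = vector & 0xFFFFFFFC  # bits [31:02], low bits forced to 0
    code = vector & 0b11                   # bits [01:00]
    return routine_address, code


if __name__ == "__main__":
    # 0x80001235 is a made-up vector value used only for illustration.
    address, code = decode_scb_vector(0x80001235)
    print(f"{address:08X}", SCB_CODE_MEANING[code])
```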
Table A–6 SCB Layout

    Vector       Name                                              Type         Parameter  Notes
    00           Unused                                            —            —          —
    04           Unused                                            —            —          —
    08           Machine check                                     Abort        6          Parameters reflect machine state; must be serviced on the interrupt stack
    0C           Unused                                            —            —          —
    10           Reserved privileged instruction                   Fault        0          —
    14           Customer reserved instruction                     Fault        0          XFC instruction
    18           Reserved operand                                  Fault/abort  0          Not always recoverable
    1C           Reserved addressing mode                          Fault        0          —
    20           Access control violation/vector alignment fault   Fault        2          Parameters are virtual address and status code
    24           Translation not valid                             Fault        2          Parameters are virtual address and status code
    28           Trace pending                                     Fault        0          —
    2C           Breakpoint instruction                            Fault        0          —
    30           Unused                                            —            —          Compatibility mode in other VAX systems
    34           Arithmetic trap                                   Fault        1          Parameter is type code
    38 to 3C     Unused                                            —            —          —
    40           CHMK                                              Trap         1          Parameter is sign-extended operand word
    44           CHME                                              Trap         1          Parameter is sign-extended operand word
    48           CHMS                                              Trap         1          Parameter is sign-extended operand word
    4C           CHMU                                              Trap         1          Parameter is sign-extended operand word
    50           Unused                                            —            —          —
    54           Soft error notification                           Interrupt    0          IPL is 1A (hex)
    58 to 5C     Unused                                            —            —          —
    60           Hard error notification                           Interrupt    0          IPL is 1D (hex)
    64           Unused                                            —            —          —
    68           Vector unit disabled                              Fault        0          Vector instructions
    6C to 80     Unused                                            —            —          —
    84           Software level 1                                  Interrupt    0
    88           Software level 2                                  Interrupt    0          Ordinarily used for AST delivery
    8C           Software level 3                                  Interrupt    0          Ordinarily used for process scheduling
    90 to BC     Software levels 4 to 15                           Interrupt    0          —
    C0           Interval timer                                    Interrupt    0          IPL is 16 (hex)
    C4           Unused                                            —            —          —
    C8           Emulation start                                   Fault        10         Same mode exception, FPD=0; parameters are opcode, PC, specifiers
    CC           Emulation continue                                Fault        0          Same mode exception, FPD=1; parameters are opcode, PC, specifiers
    D0           Device vector                                     Interrupt    0          IPL is 14 (hex)
    D4           Device vector                                     Interrupt    0          IPL is 15 (hex), includes console interrupts
    D8           Device vector                                     Interrupt    0          IPL is 16 (hex), includes interprocessor interrupts
    DC           Device vector                                     Interrupt    0          IPL is 17 (hex)
    E0 to F4     Unused                                            —            —          —
    F8 to FC     Unused                                            —            —          —
    100 to FFCC  Unused                                            —            —          —

B ISE Parameter Worksheets

B.1 In This Appendix

This appendix includes:
• Individual ISE parameter worksheets
• ISE zone parameter worksheets

B.2 Individual ISE Parameter Worksheets

Use the following worksheets to record parameters for each ISE.

    Serial Number:
    NODENAME:
    SYSTEMID:
    ALLCLASS:
    UNITNUM:
    FORCEUNI:
    FORCENUM:

B.3 ISE Zone Parameter Worksheets

Use the following worksheets to record parameters for each ISE.

    Serial No:
    NODENAME:
    UNITNUM: