AA-J833B-TK
May 1985
166 pages
Document:
SPEAR Manual Sep85
Order Number:
AA-J833B-TK
Revision:
0
Pages:
166
Original Filename:
AA-J833B-TK_SPEAR_Manual_Sep85.pdf
TOPS-10/TOPS-20 SPEAR Manual

AA-J833B-TK

September 1985

This manual describes the SPEAR product (Standard Package for Error Analysis and Reporting). SPEAR is a library of functions that reports on the errors and events that are recorded by the operating system.

This manual supersedes the TOPS-10/TOPS-20 SPEAR Manual, order number AA-J833A-TK.

OPERATING SYSTEM:   TOPS-10 V7.02
                    TOPS-20 (KS/KL Model A) V4.1
                    TOPS-20 (KL Model B) V6.1

SOFTWARE:           SPEAR V2.0

Software and manuals should be ordered by title and order number. In the United States, send orders to the nearest distribution center. Outside the United States, orders should be directed to the nearest DIGITAL Field Sales Office or representative.

Northeast/Mid-Atlantic Region
Digital Equipment Corporation
PO Box CS2008
Nashua, New Hampshire 03061
Telephone: (603)884-6660

Central Region
Digital Equipment Corporation
Accessories and Supplies Center
1050 East Remington Road
Schaumburg, Illinois 60195
Telephone: (312)640-5612

Western Region
Digital Equipment Corporation
Accessories and Supplies Center
632 Caribbean Drive
Sunnyvale, California 94086
Telephone: (408)734-4915

digital equipment corporation. marlboro, massachusetts

First Printing, April 1982
Revised, September 1985

© Digital Equipment Corporation 1982, 1985. All Rights Reserved.

The information in this document is subject to change without notice and should not be construed as a commitment by Digital Equipment Corporation. Digital Equipment Corporation assumes no responsibility for any errors that may appear in this document.

The software described in this document is furnished under a license and may only be used or copied in accordance with the terms of such license.

No responsibility is assumed for the use or reliability of software on equipment that is not supplied by DIGITAL or its affiliated companies.
The following are trademarks of Digital Equipment Corporation: DEC, DECmate, DECsystem-10, DECSYSTEM-20, DECUS, DECwriter, DIBOL, MASSBUS, PDP, P/OS, Professional, Q-BUS, Rainbow, RSTS, RSX, RT, UNIBUS, VAX, VMS, VT, Work Processor.

The postage-prepaid READER'S COMMENTS form on the last page of this document requests the user's critical evaluation to assist us in preparing future documentation.

CONTENTS

PREFACE

CHAPTER 1   SPEAR OVERVIEW
    1.1       INTRODUCTION
    1.2       USER PROFILES AND INTERACTION

CHAPTER 2   THE SYSTEM EVENT FILE
    2.1       INTRODUCTION
    2.2       ENTRY CATEGORIES
    2.2.1     Software Entries
    2.2.2     Hardware Entries
    2.2.2.1   CPU and Memory Failures
    2.2.2.2   Channel and Controller Failures
    2.2.2.3   I/O Device Failures
    2.2.3     Performance Entries
    2.3       RECORDING EVENTS
    2.3.1     Record Format
    2.3.2     Record Conventions for Numbers and Dates

CHAPTER 3   ANALYZING FAILURES
    3.1       INTRODUCTION
    3.2       TYPES OF FAILURES
    3.2.1     Characteristics of Solid Failures
    3.2.2     Characteristics of Intermittent Failures
    3.3       ERROR DETECTING AND ERROR CHECKING
    3.3.1     Hardware Error Detectors
    3.3.2     Software Error Checking
    3.4       ISOLATION TECHNIQUES
    3.4.1     Verification

CHAPTER 4   THE SPEAR LIBRARY
    4.1       INTRODUCTION
    4.2       RUNNING SPEAR
    4.2.1     Prompts, Responses, and Arguments
    4.2.2     Separators and Terminators
    4.2.3     Help Features
    4.2.4     File Specifications
    4.2.5     SPEAR Switches
    4.2.6     Exiting from SPEAR
    4.3       INSTRUCT
    4.3.1     Setting Up a Student ID
    4.3.2     Using INSTRUCT as a Reference Tool
    4.4       RETRIEVE
    4.4.1     RETRIEVE Input
    4.4.2     RETRIEVE Output
    4.4.3     RETRIEVE Procedure
    4.4.3.1   Retrieving Selected Events
    4.4.3.2   Sample RETRIEVE Session
    4.4.3.3   Short Format
    4.4.3.4   Octal Format
    4.4.3.5   Full Format
    4.5       KLERR
    4.5.1     KLERR Input
    4.5.2     KLERR Procedure
    4.5.3     Sample KLERR Session
    4.5.4     KLERR Output
    4.6       SUMMARIZE
    4.6.1     The SUMMARIZE Report
    4.6.2     Error Register Codes
    4.6.3     SUMMARIZE Procedure
    4.6.4     Sample SUMMARIZE Session
    4.7       TOPS-20 KLSTAT MODE
    4.7.1     KLSTAT Procedure
    4.8       COMPUTE
    4.8.1     COMPUTE Reports
    4.8.2     COMPUTE Formulas
    4.8.3     COMPUTE Procedures
    4.8.4     COMPUTE Summary Report
    4.8.5     COMPUTE Full Report

CHAPTER 5   ENTRY DESCRIPTIONS
    5.1       INTRODUCTION
    5.2       TOPS-10 ENTRIES
    5.2.1     System Reload
    5.2.2     Non-Reload Monitor Error
    5.2.3     Crash Extract
    5.2.4     Data Channel Error
    5.2.5     DAEMON Started
    5.2.6     MASSBUS Disk Error
    5.2.7     DX20 Device Error
    5.2.8     Software Event
    5.2.9     Configuration Status Change
    5.2.10    System Log Entry
    5.2.11    Software Requested Data
    5.2.12    Magtape System Error
    5.2.13    Front End Device Report
    5.2.14    Front End Reload
    5.2.15    KS10 Halt Status Block
    5.2.16    Magtape Statistics
    5.2.17    Disk Statistics
    5.2.18    DL10 Communications Error
    5.2.19    KL10 Parity or NXM Interrupt
    5.2.20    KS10 NXM Trap
    5.2.21    KL10 or KS10 Parity Trap
    5.2.22    Memory Sweep for NXM
    5.2.23    Memory Sweep for Parity
    5.2.24    CPU Status Block
    5.2.25    Device Status Block
    5.2.26    Line Printer Error
    5.2.27    Unit Record Error
    5.3       TOPS-20 ENTRIES
    5.3.1     TOPS-20 System Reloaded
    5.3.2     TOPS-20 BUGCHKs and BUGHLTs
    5.3.3     MASSBUS Device Error
    5.3.4     DX20 Device Error
    5.3.5     Drive Statistics Entries
    5.3.6     Configuration Status Change
    5.3.7     System Log Entry
    5.3.8     Front-End Device Report
    5.3.9     Front End Reloaded
    5.3.10    Processor Parity Trap
    5.3.11    Processor Parity Interrupt
    5.3.12    KL CPU Status Block
    5.3.13    MF20 Device Report
    5.3.14    KLERR Front End Device Report
    5.3.14.1  The HSC50 Error Log
    5.4       DECNET ENTRIES (V2.1)
    5.4.1     Network Control Started
    5.4.2     Network Up-Line Dump
    5.4.3     Network Down-Line Load
    5.4.4     Network Hardware Error
    5.4.5     Network CHECK11 Report
    5.4.6     Network Line Statistics
    5.5       DECNET ENTRIES (V3.0 AND V4.0)

APPENDIX A  SPEAR MESSAGES
APPENDIX B  COMMAND AND CONTROL FILES
APPENDIX C  EVENT CODES
APPENDIX D  DISK SUBSYSTEM ERROR BITS
APPENDIX E  NETWORK EVENT PARAMETERS
APPENDIX F  GLOSSARY

INDEX

FIGURES

    2-1       Components of a Computer System

TABLES

    4-1       Device Types
    4-2       Network Event Classes
    4-3       Subprompts for Device Types
    4-4       Error Types
    4-5       MASSBUS Disk Registers
    4-6       Tape Registers
    4-7       Subprompts for Device Types
    4-8       Supported Devices
    5-1       Network Event Classes
    5-2       Network Events
    A-1       User Validation Messages
    A-2       Dialogue Usage Messages
    A-3       Warning Messages
    A-4       Event File Messages
    C-1       TOPS-10 and TOPS-20 Event Codes

PREFACE

This manual describes Version 2.0 of SPEAR on TOPS-10 and TOPS-20. The primary audience for this manual is a person with experience in the following areas:

1. Fault isolation techniques
2. KL10 instruction set
3. All hardware connected to the various configurations of TOPS-10 or TOPS-20

If you do not have the above experience, refer to:

    TOPS-10 Operators Guide
    TOPS-20 Operators Guide
    DECsystem-10/DECSYSTEM-20 Processor Reference Manual
    DECsystem-10 Hardware Reference Manual

READING PATH

This manual has three functions: it serves as a learning aid, a user's guide, and a reference tool for those who already have learned to use the SPEAR library.

As a learning aid: Chapters 1, 2, and 3 provide an overview of the SPEAR library. They also provide background information necessary to understand and use the SPEAR library.
As a user's guide: Chapter 4 provides step-by-step procedures for using the SPEAR functions: INSTRUCT, RETRIEVE, KLERR, SUMMARIZE, and COMPUTE. This chapter explains the command syntax and the response parameters associated with each function.

As a reference tool: Chapter 5 and the appendixes provide reference material such as system event file formats, error messages, and a glossary. This material is not meant to be read from beginning to end. Use Chapter 5 and the appendixes as a reference when you need them.

CONVENTIONS USED IN THIS MANUAL

The following conventions are used throughout this manual:

Contrasting colors    Red - where examples contain both user input and computer output, the characters you type are in red; the characters SPEAR prints are in black.

Lowercase letters     Lowercase letters in a command string indicate variable information you must supply.

UPPERCASE LETTERS     Uppercase letters in a command string indicate fixed (literal) information that you must enter as shown.

[ ]                   Square brackets indicate optional information that you can omit from a command string. Do not type the square brackets.

Examples              All examples were produced on either the TOPS-10 or the TOPS-20 operating system.

This symbol represents where you press the Escape key.

This symbol represents where you press the RETURN key.

CHAPTER 1

SPEAR OVERVIEW

1.1 INTRODUCTION

This chapter introduces you to the SPEAR product and gives an overview of its use. The name SPEAR is an acronym for Standard Package for Error Analysis and Reporting.

The main function of SPEAR is to help isolate the cause of a failure through information contained in the system event file. Most failures are intermittent; that is, they are active at one instant, causing system malfunction, and inactive at another instant, allowing system operation. The task at hand is to find the cause of the failure and correct the problem in the least amount of time. SPEAR helps to accomplish this task.
SPEAR is a library of functions that reports on the errors and events that are recorded by the operating system, TOPS-10 or TOPS-20. In the past, the field service engineer was forced to analyze intermittent failures by sorting through error reports generated by SYSERR, looking for common failure patterns. For example, the engineer examined several disk reports looking for common media failures, common disk head failures, or common failures of the read/write circuitry. Now, SPEAR can do the tedious work.

SPEAR uses the system event file for analysis. The system event file contains entries made by the operating system and the communications subsystems (if any). Each time certain events occur, the operating system records and stores pertinent data in the system event file. The operating system continually monitors and records information about every disk, tape, and memory parity error as it occurs, along with errors from other subsystems. At your discretion, you can call on SPEAR to generate a report of selected events. For more information on the system event file, refer to Chapter 2. For samples of events your operating system can record, refer to Chapter 5.

The SPEAR program consists of a library of five functions:

• INSTRUCT
• RETRIEVE
• KLERR
• SUMMARIZE
• COMPUTE

These function names are also the primary commands you type to run the particular function of SPEAR in which you are interested.

INSTRUCT is a computer-aided instruction program designed to ensure that you have the background knowledge and experience necessary to use the other functions in the SPEAR library. To run INSTRUCT, refer to Section 4.3.

RETRIEVE reads the binary data in the system event file and produces an ASCII report for each entry selected. RETRIEVE also allows you to save specific entries either for later analysis and translation or for record-keeping purposes. RETRIEVE is described in Section 4.4.
KLERR provides signal name translation and summaries, CRAM word translation, and other useful features to help you analyze log files resulting from a KL10 crash. KLERR is described in Section 4.5.

SUMMARIZE reads the binary data in the system event file and produces an ASCII report. Refer to Section 4.6 for a description of SUMMARIZE.

COMPUTE calculates and reports overall system availability, effectiveness, and reliability. COMPUTE is described in Section 4.8.

Chapter 4 describes these functions in detail, along with an additional feature available only on TOPS-20, KLSTAT mode.

1.2 USER PROFILES AND INTERACTION

There are three main groups of SPEAR users:

1. Field Service and Software Support personnel who have specific maintenance responsibilities.
2. System operators who must recognize failures and initiate recovery procedures.
3. System managers who have a need to monitor overall system performance and schedule system use.

These groups each have varying degrees of expertise in software and hardware areas. SPEAR can handle the needs of each group and can guide the new user as well as the experienced user. The system operator and Field Service engineer can cooperate by using SPEAR as a tool for both preventive and corrective maintenance. SPEAR also has the COMPUTE function, which allows the system manager a closer look at system performance. Refer to Chapter 4 for information on COMPUTE.

CHAPTER 2

THE SYSTEM EVENT FILE

2.1 INTRODUCTION

This chapter discusses the file that SPEAR uses for input, the system event file. Specifically, this chapter discusses what events are recorded, how they are recorded, and what form they take within their respective files.

Each operating system and communications subsystem has its own error logging facility to gather and maintain information on system errors and events as they occur. The error logging facility detects a variety of hardware and software errors, providing a detailed record of system activity.
When an error occurs, the facility gathers significant data about the current state of the system; the type of data it gathers depends on the type of error detected. In addition to detecting actual errors, the facility monitors events that reflect other aspects of system performance. The recording of such events helps to define the system context in which actual errors occur.

The events are recorded in a system event file, ERROR.SYS. The logical name for the location of this file (structure and directory) depends on which operating system you are using. The following list gives you the names to use to locate your system event file:

• TOPS-10 V7.02    SYS:ERROR.SYS
• TOPS-20 V4.1     SYSTEM:ERROR.SYS
• TOPS-20 V6.1     SERR:ERROR.SYS

Events that occur during the operation of the system are logged into the system event file for use in preventive maintenance as well as corrective maintenance. These events occur within the various hardware and software components of the system, such as:

    Hardware      Software

    CPU           Operating system
    Memory        Memory management
    I/O           I/O
    Console       File system

Some of the events that can occur include parity errors, address failures, operator log entries, system reloads, and device mounts and dismounts. Each time one of these events occurs, an entry is appended to the system event file in binary format.

2.2 ENTRY CATEGORIES

There are two general categories of entries in the system event file, error and nonerror. Both categories can be broken down further into the following:

1. Software entries
2. Hardware entries
3. Performance entries

The following three sections describe the entry types that can be found in the system event file.

2.2.1 Software Entries

The software error entries that SPEAR is concerned with are internal software errors. On TOPS-10, these errors result in a STOPCD; on TOPS-20, these errors result in a BUGHLT, BUGCHK, or BUGINF.
A STOPCD is represented by a 3-letter message that is printed at the operator's terminal (CTY) when the operating system detects a serious error. Sometimes the operating system crashes immediately following this message; at other times the operating system continues to run but halts the current job. The action the operating system takes depends on the severity of the problem. There are five types of STOPCDs:

1. HALT - The system halts and you must manually dump and reload the operating system.
2. STOP - All jobs are aborted, and the system automatically dumps and reloads itself.
3. CPU - This is the same as STOP except this message occurs on dual processors. Jobs are aborted only on the processor where the error occurs.
4. JOB - The current job is aborted and processing continues.
5. DEBUG - A message prints and processing continues.

The list of all stopcode messages is documented in the STOPCD specification in the TOPS-10 Software Notebooks.

The TOPS-20 operating system errors also range in severity. A BUGHLT is the most serious. It is a non-recoverable error detected by the operating system. A BUGCHK is a recoverable error detected by the operating system, while a BUGINF is a message informing you that a certain event related to the operating system has occurred. BUGHLTs, BUGCHKs, and BUGINFs are listed in the TOPS-20 Operators Guide.

2.2.2 Hardware Entries

The hardware entries come from a variety of subsystems: CPU, memory, I/O, console, and networks. The number and type of components depend on the system configuration. In general, Figure 2-1 represents the major components or subsystems that can contribute entries to the system event file.

Figure 2-1: Components of a Computer System

Hardware error entries are the most frequent type of error. These errors are caused by a failure in the hardware itself. Each time an event of this type occurs, an entry is made into the system event file.
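For quick reference, the software entry types described in Section 2.2.1 can be collected into a lookup table. The following Python sketch is illustrative only: the dictionary names and the describe_software_entry function are hypothetical and are not part of SPEAR, TOPS-10, or TOPS-20. The descriptions paraphrase the text of Section 2.2.1.

```python
# Illustrative summary of Section 2.2.1; every name here is hypothetical.
STOPCD_TYPES = {
    "HALT": "system halts; manual dump and reload required",
    "STOP": "all jobs aborted; system dumps and reloads itself",
    "CPU": "as STOP, but jobs are aborted only on the failing processor",
    "JOB": "current job aborted; processing continues",
    "DEBUG": "message printed; processing continues",
}

BUG_TYPES = {
    "BUGHLT": "non-recoverable error",
    "BUGCHK": "recoverable error",
    "BUGINF": "informational message about an operating system event",
}

# Which table applies depends on the operating system.
TABLES = {"TOPS-10": STOPCD_TYPES, "TOPS-20": BUG_TYPES}


def describe_software_entry(operating_system, entry_type):
    """Return the one-line description for a software entry type."""
    if operating_system not in TABLES:
        raise ValueError(f"unknown operating system: {operating_system}")
    table = TABLES[operating_system]
    if entry_type not in table:
        raise ValueError(f"{entry_type} is not a {operating_system} entry type")
    return table[entry_type]
```

Calling describe_software_entry("TOPS-20", "BUGCHK") returns the description of a recoverable error, while asking for a STOPCD type on TOPS-20 raises an error, which mirrors the split between the two operating systems.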
Hardware error entries can be divided into three general categories:

1. CPU-instruction and CPU-addressing failures
2. Controller and channel failures
3. I/O errors

Because the system hardware cannot be expected to operate continuously without failure, the design of the hardware includes facilities to monitor the hardware operation. (One such facility is the parity check.) Once the system has detected an error, it can either signal the CPU and system software that an error has occurred or attempt to recover from the error and notify the software if it cannot recover successfully. This activity is recorded in the form of one or more entries in the system event file.

2.2.2.1 CPU and Memory Failures - The first category is a failure occurring in the CPU and main storage section of the system. This type of failure is perhaps the most difficult to handle correctly. These failures can easily modify either the operating system software or a user program or cause instructions to be incorrectly executed. A failure in an addressing section can cause the system to operate with incorrect data or unknowingly modify some other job's program or data. For these reasons, CPU errors ordinarily cause the crash of a job or the entire system, depending on whether a user or the operating system is in control.

2.2.2.2 Channel and Controller Failures - The second category of hardware error entry is a channel or controller failure. The system controllers monitor and control several I/O devices of the same type, and the channels of various types connect the CPU and/or main storage units with the I/O controllers and devices. These errors are likely to affect several jobs or users because each controller or channel can handle several I/O devices being used by many jobs or processes. Detected errors are signalled to the CPU, and the operating system may stop the current operation if the error is serious.
An example is a controller's parity check of a command issued by the CPU. If this parity check fails, the command will not be performed, and the error will be signalled to the CPU. Such an event is recorded in the system event file for subsequent retrieval by SPEAR.

2.2.2.3 I/O Device Failures - The third category of hardware error entry is a failure of an I/O device. Errors detected by a single I/O device are recovered in the same manner as channel and controller failures, but usually the error affects only one job or task. Some I/O failures are caused by faulty media. The most frequently used form of error recovery in this case is to retry the failing operation. If the failure continues for a specified number of consecutive retries, the job or task is crashed. Each failure is recorded in the system event file.

2.2.3 Performance Entries

The system event file contains more than just error entries. It also contains entries concerning day-to-day events of the system. These events vary depending on the operating system. In general, you might find entries of the following nature:

1. System reloads
2. Tape and disk mounts/dismounts
3. Operator messages

These entries add another dimension to your view of the system environment. Keeping track of system performance can be a useful tool in preventive maintenance. The COMPUTE function, described in Section 4.8, also uses this type of entry to help derive system availability and effectiveness.

2.3 RECORDING EVENTS

The operating system continually detects and records events concerning every disk, tape, and memory parity error as it occurs. The operating system:

1. Detects the event
2. Identifies the type of event
3. Associates it with a device
4. Gathers information about it
5. Records the date and time
6. Stores the information as an entry by appending it to the system event file
7. In some cases, tries to recover or find a way around the error

The system event file is a sequential file; therefore, each new entry is written to the end of the file. SPEAR can format these entries into an ASCII report with its RETRIEVE facility. Refer to Section 4.4 for information on RETRIEVE. The following section describes the template that each entry fills.

2.3.1 Record Format

Each entry in a TOPS-10 or TOPS-20 system event file is composed of two sections: a header section and a body section. The top section (contained in asterisks) of each entry report is the header section. It contains the following information:

1. The entry type
2. The time the entry was recorded
3. The operating system uptime at the time of the entry
4. The serial number of the CPU where the entry occurred
5. The record sequence number

The record sequence number is a number indicating the position of the entry in the file. SPEAR assigns the record sequence number to the entry when you decide to RETRIEVE it.

For each operating system, the format of the header is the same. The following is a sample of an entry header on TOPS-20 after it has been translated by SPEAR:

************************************************************
MASSBUS DEVICE ERROR
LOGGED ON FRI 13 JUN 80 03:23:15    MONITOR UPTIME WAS 2:34:08
DETECTED ON SYSTEM #2137.
RECORD SEQUENCE NUMBER: 344.
************************************************************

On TOPS-10, if the system crashed and the entry has been copied from the CRASH.EXE file, the header states this fact at the top of the section.
For example:

***********************************************************
**THIS ENTRY COPIED FROM A SAVED CRASH**
***********************************************************

Because the information was extracted from a saved crash instead of a running operating system, the date and time of the entry and the uptime listed in the header are the last values recorded by the operating system before it crashed. (Note that multiple entries extracted from a crash will have identical DATE, TIME, and UPTIME.)

The body section of the entry contains the various data items that make up the entry. The format of the header is constant regardless of the entry type, but the body varies according to the type of entry. The amount of information that is reported in the body also varies depending on the format you specify to RETRIEVE. You can receive a SHORT version of an entry with only summary information or a FULL entry with all the information that is in the system event file. Refer to Section 4.4 for more information on the RETRIEVE function.

2.3.2 Record Conventions for Numbers and Dates

In the entries on TOPS-10 and TOPS-20, most numbers output by SPEAR are either decimal or octal. If SPEAR uses another numbering system, it is so noted on any report you request. Decimal values always contain a decimal point; all other values are octal. Values printed in half-word format have leading zeroes suppressed in each half of the word, and the halves are separated with a comma.

All register values that are translated to text, such as the CONI value, have text translations only for bits or bytes of interest, and the whole value is dumped. For example, the CONI value might include a DONE bit and a PI assignment, but these bits are not translated to text.

All dates and times printed by SPEAR are from your local time zone, for example EST, unless otherwise stated.

Refer to Chapter 5 for samples of entries that can appear in the system event file of your operating system.

CHAPTER 3

ANALYZING FAILURES

3.1 INTRODUCTION

The main reason for using SPEAR is to isolate the faults that are causing intermittent failure of the system. In case you are unaware of the various problems you can run into trying to find the cause of these failures, this chapter discusses:

1. The types of failures that can occur and what causes them.
2. The various error-checking schemes built into the system.
3. Some techniques to follow in isolating these failures.

3.2 TYPES OF FAILURES

A fault is a condition that causes a system component to fail to perform as expected. For example, such a condition could be a broken wire, a power supply fluctuation, or an unexpected interaction between two or more software routines. As a matter of course, the operating system records the symptoms of these occurrences in the system event file for later reference.

A fault is not necessarily noticeable until a failure occurs. A failure occurs only when a fault causes an adverse effect on system performance. The fault probably does not become apparent until a failure occurs. This is one reason for a system manager or system operator to use the COMPUTE function (Section 4.8) of SPEAR to check system performance.

You are likely to find several faults before you find the one that is causing the failure. Therefore, always confirm that the fault you corrected is indeed the one that is causing the failure. Refer to Section 3.4.1 for verification techniques.

You should also be on the lookout for changes in performance that may indicate an impending failure. By running SPEAR daily and keeping a record of its output, you could prevent a problem with the system.

There are two general categories of failures caused by faults. They are:

• Solid failures
• Intermittent failures

3.2.1 Characteristics of Solid Failures

A fault that affects the system in a permanent manner results in a solid failure.
A solid failure is easier to solve than an intermittent failure. Because the failure is solid, that is, reproducible, you have a basis by which to research, identify, and eliminate the cause of the failure.

3.2.2 Characteristics of Intermittent Failures

A fault that affects the system in a temporary manner can result in an intermittent failure. An intermittent failure is more difficult to solve than a solid failure. Something must be causing the failure to occur and something must be making it go away. The secret behind finding the cause of an intermittent failure is knowing that somehow, somewhere, something is changing the conditions under which the system is running. The changing conditions, in turn, make the problem intermittent.

For field service engineers: the next time you are working on a really tough intermittent problem (after checking the power supplies and ground system and running the appropriate diagnostics), try stepping back and thinking about the problem. Think about what the system is doing. Watch it for a while. See if you can identify the exact conditions at the time of the failure. Use SPEAR to watch the conditions of the system and check the events before and after they occur by checking the system event file. If you can identify the conditions, then maybe you can reproduce them. If you can reproduce the conditions, then you have changed the intermittent failure into a solid failure.

Although the approach to solving a solid failure is the same as the approach to solving an intermittent failure, in many cases, you will find that solving a solid failure is easier.

3.3 ERROR DETECTING AND ERROR CHECKING

The system has several means by which to check for errors in both the hardware and software. The hardware contains error-detection circuits, and the software contains error-checking routines.
Both the detection circuits and checking routines serve a dual purpose: (1) to minimize the effects of a failure on overall system performance, and (2) to help isolate the cause of a failure.

3.3.1 Hardware Error Detectors

There are three basic types of hardware error detectors in common use:

1. Threshold error detectors
2. Timing error detectors
3. Parity error detectors

Threshold error detectors monitor critical analog circuits, such as power supplies, servomechanisms, write current circuits, and temperature probes.

Timing error detectors monitor asynchronous events within the system, such as data requests to main memory or cache. The memory or cache must respond to the request within a certain amount of time. If it does not, the nonexistent-memory timing-error detector sets an error condition. Other asynchronous events that must be monitored for proper timing are: index and sector pulses, disk and tape up-to-speed operations, and internal and external clocks.

Parity error detectors monitor the transfer of information. The parity generator adds one or more extra bits to the information being transferred to satisfy a particular parity algorithm. For example, in the case of single-bit odd parity, the information is in the form of ones and zeros, and the extra parity bit assures that the total number of one bits in the transfer is odd. The parity error detector monitors each transfer. Should a transfer ever contain an even number of one bits, the parity error detector raises a parity error condition. Note that in some cases, two bits can be dropped, leaving odd parity. However, this is an undetectable error condition.

Once any one of these detectors detects an error condition, the operating system records the information as an entry in the system event file. These are the kinds of events you will be looking for when using the SPEAR library.

3.3.2 Software Error Checking

There are four types of software error checking routines in common use:

1. Range checking
2. Validity checking
3. Sum checking (checksum)
4. Loop checking

A range checking routine verifies that the arguments supplied to a routine fall between two known values.

A validity checking routine verifies that a routine written to accept only certain arguments indeed accepts only those arguments. Any other response causes an error condition.

A sum checking routine (checksum) checks file storage. When the monitor assembles a group of blocks to write contiguously on the disk, it checksums the first word of that group and saves that checksum in the retrieval information block (RIB). If, when read back, that checksum does not match the first word, the monitor assumes it read the wrong block. If there are no hardware errors, this is the best assumption. These errors probably indicate a disk addressing failure. If the monitor crashes before it is able to write the new RIB of an old file, the checksum may change in core but not on disk. An obscure software problem may also be responsible. Reproducing the error is one way for you to narrow the problem down. Also check the crash log and look for other error types.

Note that a checksum error is not a substitute for parity. Its purpose is to make sure that a data set was written in the right place. If it was not, either the software failed to keep track of the data, or the hardware failed to address the correct place.

A loop checking routine keeps count of the number of times a program entered a loop and reports an error when a maximum count is reached, indicating that the loop is unable to reach a decision.

Any time one of these error conditions is set, the operating system records the event in the system event file. You can check on these events by using the SPEAR library.

3.4 ISOLATION TECHNIQUES

When you are faced with the problem of finding the cause of an intermittent failure, you should take the time to define the problem.
First check the symptoms:

1. What is not happening that should?
2. What is happening that should not?
3. What are the conditions and circumstances?

As you probably know, here are some possible causes of intermittent failures:

1. An environmental violation (power, grounding, temperature, humidity, contamination)
2. A damaged, defective, or worn component
3. A faulty mechanical or electrical connection
4. A mechanical misalignment
5. An electrical misadjustment
6. A software design oversight
7. A hardware design oversight

What you have to work with are the symptoms of the failure and the SPEAR library of functions. Ideally, the system operator has been running SPEAR analysis on a daily basis so that you can get a picture of the conditions leading up to the problem. If not, you can run SPEAR and receive a report within a short period of time.

With SPEAR analysis and reported symptoms, you should be able to venture a guess as to the cause of the problem. You might even be able to pinpoint the failure right away. If you are not that fortunate, your next plan of action is to do the following:

1. Devise an experiment
2. Predict the results
3. Conduct the experiment
4. Evaluate the results
5. Refine the experiment
6. Repeat the process

For example, if you suspect that a disk pack is bad, move the pack to another disk drive. If the media is bad, the error pattern will move to the other drive.

Once you believe you have isolated the failure, you should confirm your findings. After moving the disk pack, run the system for a couple of days. Then run SPEAR analysis. Check to see if the same error patterns occur on the second drive.

3.4.1 Verification

There are two general methods of verifying your findings. The first method is to reinsert the problem. If the symptoms recur, you can be relatively sure that you have identified the cause of the problem, thereby verifying your findings. If the symptoms do not recur, you should proceed with the second method.
The second method is called the time window. You should use the time window for intermittent problems or when reinserting the probable cause is not feasible; that is, when reinserting would be too time consuming or potentially damaging to the system. The time window is simply a period of time during which you closely monitor the performance of the system. If the problem does not recur during that period, then you assume the problem is solved, and your findings are verified.

The duration of the time window depends on whether the problem was solid or intermittent. If the problem was solid, then monitor the system for 24 hours. If the problem was intermittent, wait at least three times as long as the frequency of the error.

Experience will dictate the method that works best for you. Your site may have its own specific isolation and verification techniques that are tried and true. If so, stay with the most successful method.

CHAPTER 4
THE SPEAR LIBRARY

4.1 INTRODUCTION

The previous chapters introduced you to SPEAR, described where SPEAR gets its information, and listed techniques for intermittent fault isolation. This chapter explains how to use the SPEAR dialogue with its help facilities and describes the following six functions in the SPEAR library:

• INSTRUCT
• RETRIEVE
• KLERR
• SUMMARIZE
• KLSTAT (TOPS-20 only)
• COMPUTE

SPEAR is set up in such a way that after you use it a number of times you can run through it without any problems. The reason for its ease of use is the way you interact with SPEAR. SPEAR has a dialogue that prompts and helps you along as much as you want.
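As a brief aside before the SPEAR dialogue is described, the time-window rule from Section 3.4.1 reduces to a small calculation. The sketch below is illustrative only and is not part of SPEAR: the function name and the choice of hours as the unit are assumptions, and "frequency of the error" is read here as the typical interval between occurrences.

```python
def time_window_hours(was_intermittent, hours_between_errors=None):
    """Return the minimum monitoring period, in hours, per Section 3.4.1.

    Hypothetical helper: a solid problem gets a fixed 24-hour window; an
    intermittent problem gets at least three times the interval at which
    the error had been occurring.
    """
    if not was_intermittent:
        return 24.0
    if hours_between_errors is None or hours_between_errors <= 0:
        raise ValueError("an intermittent problem needs a positive error interval")
    return 3.0 * hours_between_errors


# An error that used to appear about once every two days calls for
# roughly six days of clean operation before the fix is considered verified.
print(time_window_hours(True, hours_between_errors=48))
```

The result is a lower bound; a site with its own tried verification practice may reasonably monitor for longer.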
4.2 RUNNING SPEAR

To run SPEAR, first log in to your operating system, then type one of the following:

.R SPEAR     On TOPS-10 based systems
@SPEAR       On TOPS-20 based systems

SPEAR indicates that it is waiting for instructions by displaying the following prompt:

SPEAR>

After you see the SPEAR prompt, you can type any one of the function names (you can type KLSTAT on TOPS-20 only), type HELP or a question mark, or type EXIT to return to operating system command level. If you type a function name, you need only specify enough characters to make it unique to SPEAR. In this case, you need type only the first character of the name for SPEAR to recognize it.

If you type a question mark (?) at this point, SPEAR prints a list of the features available to you in your version of the SPEAR Library.

CAUTION

The SPEAR library is not transportable across operating systems. You cannot run SPEAR for TOPS-10 on TOPS-20 and so on. Consequently, you cannot use the system event file from one operating system with a SPEAR library from another system.

SPEAR has several features to guide you in its use. The following subsections describe these features.

4.2.1 Prompts, Responses, and Arguments

Each function of SPEAR has several levels of questions for you to answer. SPEAR prompts you and gives you a selection of acceptable responses. The default is listed in parentheses with each prompt. If you have been through this before, you can speed up the process by responding to all the prompts on the first line, using legal separators, or by specifying an indirect file containing your responses.

SPEAR can process commands from a disk file as well as from your terminal. This disk file, known as an indirect file, is useful if you have a set of responses you often use. To use this function, create a disk file while at operating system command level with a text editor. The file should contain responses that you would normally type to SPEAR on the terminal.
NOTE

Be sure to delete any line-sequence numbers from your indirect file. SPEAR will not accept them.

Once you have created the file and saved it in your disk area, all you need to do is to run SPEAR and type the file name preceded by an at sign (@). The at sign (@) signifies an indirect file. The default file name for an indirect file is SPRCMD.CMD. Note that you can specify an indirect file at any prompt level of SPEAR, as long as the file contains only the remaining information necessary to complete the SPEAR requests.

You can choose to be prompted at every step or decide to supply all required information without prompting. In fact, at SPEAR command level, you can input an entire SPEAR session on one line, separating each field with a space. For example:

SPEAR>RETRIEVE A0916.PAK 5,6,10 ASCII FULL /G <RET>

By using special characters as separators, you can also speed up the process within the SPEAR dialogue. Section 4.2.2 describes these characters.

4.2.2 Separators and Terminators

The following characters and terminal keys have special meaning to SPEAR:

1. The RETURN key - indicates that you have completed input to a SPEAR prompt in one way or another. You have either input your own arguments or taken the default.
2. A comma (,) - indicates that you are inputting a list of items within one request for input, for example a list of sequence numbers or packet identifiers.
3. A colon (:) - indicates that you have either input a device name within a file specification or you have specified devices within an error type specification.
4. A plus sign (+) - separates more than one major error type on one line.
5. A semicolon (;) - indicates that the next argument is a version number in a file specification.
6. An exclamation point (!) - allows you to insert comments. SPEAR ignores anything it sees on the current line after an exclamation point.
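The comment, list, and error-type separators above can be pictured with a short sketch. This is not SPEAR's own parser, only a minimal Python illustration of the rules as stated; the function names and the sample input strings are hypothetical.

```python
def strip_comment(line):
    """Drop everything from the first exclamation point to the end of the line."""
    return line.split("!", 1)[0].rstrip()


def split_list(field):
    """Split a comma-separated list, such as sequence numbers or packet identifiers."""
    return [item.strip() for item in field.split(",") if item.strip()]


def split_major_types(field):
    """Split several major error types that were joined with plus signs."""
    return [item.strip() for item in field.split("+") if item.strip()]


# Hypothetical responses, shaped like what a user might type at a prompt.
print(split_list(strip_comment("5,6,10 ! packets of interest")))
print(split_major_types("DISK+TAPE"))
```

Handling the comment first matters: anything after the exclamation point is discarded before the remaining text is split into items.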
4.2.3 Help Features

There are five major help features in SPEAR: the question mark (?), the HELP command, the @HELP command, the question mark switch (/?), and the /HELP switch.

1. The question mark (?) provides enough information to refresh your memory about the acceptable responses.
2. The HELP command provides detailed information both concerning the prompt and on acceptable commands.
3. The @HELP command displays information on indirect files.
4. The question mark switch (/?) provides a list of switches you can type as response to a particular prompt.
5. The /HELP switch provides an explanation of the acceptable switches that you can type as response to a particular prompt.

You can type any of these help features after any prompt in the SPEAR dialogue and also after you have typed a response to the prompt. For example, if you type a question mark in response to a prompt, SPEAR does the following:

1. Lists all acceptable responses.
2. Gives a brief description of the desired response if it is general (for example, file specification).

If you type a question mark after supplying characters to a prompt, SPEAR lists all acceptable responses matching the characters typed.

You can also type the HELP command after any prompt. SPEAR prints up to 22 lines of information about the use of the prompt.

The Escape key is another help feature in the SPEAR library. The Escape key fills in a response if you type enough characters for SPEAR to know what you want. For example:

Output mode (ASCII): B<ESC>INARY

If you do not supply enough information before typing <ESC>, SPEAR prompts you for more input by sending a bell to the terminal.

If you press <ESC> without typing any characters in response to a prompt, SPEAR fills in the default response. For example:

Event file (SERR:ERROR.SYS): <ESC>SERR:ERROR.SYS

The following keys can also help you through the SPEAR dialogue:

1. CTRL/U - deletes the current input line
2.
CTRL/W - deletes back to the last punctuation character
3. CTRL/F - completes the next field of a file specification with the default

4.2.4 File Specifications

The following are the formats of the file specifications that can be given in a SPEAR command string. These formats are listed according to operating system:

TOPS-10     dev:filename.file extension[directory]
TOPS-20     dev:<directory>filename.file type.file version

4.2.5 SPEAR Switches

The following is a list of the switches available in SPEAR. Note that the square brackets indicate optional information that you can omit. You do not type the square brackets.

/?          lists the available switches.
/B[REAK]    returns you to the SPEAR> prompt.
/G[O]       executes the current SPEAR command with the parameters you have given so far. It takes the defaults for the rest of the parameters. This is the default switch.
/H[ELP]     lists the available switches and gives a brief explanation of their uses.
/R[EVERSE]  returns you one level back to the previous prompt, where you can change any parameters.
/S[HOW]     shows all the parameters you have specified so far and fills in the defaults for the ones you have not specified.

The following is an example (from TOPS-10) using the /SHOW switch with the RETRIEVE and SUMMARIZE commands. Note that all the defaults are shown because no other parameters have been specified.

SPEAR> SUMMARIZE/SHOW
Event file: SYS:ERROR.SYS
Report to: DSK:SUMMAR.RPT
Time from: 8-Mar-85
Time to: LATEST
Show Error Distribution: YES
SPEAR> RETRIEVE/SHOW
Event or packet file: SYS:ERROR.SYS
Output to: DSK:RETRIE.RPT
Merge with: NONE
Time from: EARLIEST
Time to: LATEST
Selection to be: INCLUDED
Output mode: ASCII
Report format: SHORT
Selection type: ALL
SPEAR> RETRIEVE/REVERSE
SPEAR> EXIT

4.2.6 Exiting from SPEAR

To exit from SPEAR, first return to the SPEAR> prompt by typing /BREAK. Then type the EXIT command. You can also exit from SPEAR by typing CONTROL/C at any prompt.

4.3 INSTRUCT

INSTRUCT is a computer-aided instruction program that explains how to use the SPEAR library. You can use INSTRUCT as a course on how to use SPEAR, or as a reference to a particular piece of information on the SPEAR library.

The SPEAR (CAI) course consists of four main modules:

1. Fault Isolation Techniques - This module describes the nature of intermittent faults and discusses some of the most common methods used to isolate intermittent system and subsystem failures.
2. System Event File Organization and Content - This module describes the overall organization and content of TOPS-10 and TOPS-20 system event files.
3. SPEAR Library Functions - This module explains how to use each of the SPEAR maintenance functions: RETRIEVE, KLERR, COMPUTE, SUMMARIZE, and KLSTAT.
4. Guaranteed Uptime Program - This module explains how to use the NOTIFY program to measure system uptime.

Each module consists of an introduction and a menu of subordinate topics. When appropriate, the subordinate topics are also broken down into introductions and menus. Thus, you can use INSTRUCT as either a tutorial or a reference.

INSTRUCT is frame-oriented; that is, it displays one frame of information at a time. Thus, you can study each frame for as long as you like. Then, when you are ready, you can proceed to the next frame by pressing the RETURN key.

To use INSTRUCT as a tutorial, refer to Section 4.3.1. To use INSTRUCT as a reference, refer to Section 4.3.2.

4.3.1 Setting Up a Student ID

To access INSTRUCT right now, do the following:

Log in to your operating system.

Run SPEAR -

.R SPEAR     (TOPS-10)
@SPEAR       (TOPS-20)

To begin the teaching session, type:

SPEAR>INSTRUCT <RET>

This response places you at the beginning of the course. First INSTRUCT displays an overview of the SPEAR library. You must press the RETURN key to see the next frame of information.
INSTRUCT then gives you an introduction to the course. If there is no instruction or question to answer at the bottom of the screen, press the RETURN key to see the next frame of information.

After the explanation of common responses, you will be asked if you want to establish a student identification number:

Badge number (REFERENCE):

If you want to establish an ID, enter an alphanumeric string; something you are not likely to forget. Then press the RETURN key. From this point on, INSTRUCT will keep track of where you are in the course.

After you have established your Student ID, you can leave INSTRUCT any time you want by typing /B. When you return, type your ID in response to the SPEAR prompt:

SPEAR>INSTRUCT ID n <RET>

where n is your Student ID.

INSTRUCT will return you to the exact location where you typed the break switch, /B.

4.3.2 Using INSTRUCT as a Reference Tool

The quickest way to access the INSTRUCT menus is by typing the following:

SPEAR>i i r/g <RET>

where

The first i represents INSTRUCT.
The second i represents ID.
The r is for REFERENCE.
The /g is for /GO.

INSTRUCT responds with the following menu:

Spear Course Menu

1. Course Administrator/Student Guide
2. Troubleshooting
3. System Event Files
4. Using The Spear Library
5. Guaranteed Uptime Program
6. Feedback
7. Random Questions
8. Dialog Changes

Your selection please (1):

At this point enter one of the numbers or letters in the menu and press the RETURN key.

The Course Administrator's Guide gives a brief description of how to administer the course along with a sample answer sheet.

The Troubleshooting section gives some tips on how to approach the problem of isolating intermittent system faults.

The System Event File section is a question and answer session concerning that topic.

Using the SPEAR Library is a combination of information and questions and answers.
The Guaranteed Uptime Program explains how to use the NOTIFY program with the COMPUTE function of SPEAR to measure system uptime.

The Feedback section is a request for your opinion of the SPEAR Library.

The Random Questions section gives you another opportunity to test your knowledge of SPEAR.

The section, Using the SPEAR Manual, describes the use of the SPEAR manual with the SPEAR program.

Remember to press the RETURN key after each frame of information, unless instructed otherwise.

4.4 RETRIEVE

RETRIEVE provides a means by which to convert the entries in the system event file from internal binary format to a readable ASCII format. It also allows you to select specific entries from the system event file and save them in a separate file.

4.4.1 RETRIEVE Input

RETRIEVE accepts the following types of input:

1. The system event file
2. A file created by the RETRIEVE process
3. Any file containing entries from the system event file

With RETRIEVE, you have the option of translating the entire system event file or specific entries in the file by sequence number. In order to have more control over the selection of specific types of entries, you can use RETRIEVE to extract the entry types in which you are interested and then translate them. You can select entries on the basis of the following:

1. Date/time limits
2. Sequence numbers
3. Event codes
4. Error
5. Statistics
6. Configuration
7. Diagnostics

Error, Statistics, Configuration, and Diagnostics can be further subdivided into the following categories:

1. Mainframe (CPU, memory, front-end)
2. Disk
3. Tape
4. CI
5. NI
6. Unit record
7. Network
8. Operating system
9. Disk pack identifier
10. Tape reel identifier

Once you have defined a category, you can specify physical names or device types within a class, such as LPT for a unit record device. Table 4-1 lists the available device types that you can specify.
Table 4-1: Device Types

Category       Device Types

Mainframe      ALL, MEM, FE, CPU
Disk           ALL, RM03, RM05, RP04, RP05, RP06, RP07, RS04, RP20, RA60, RA80, RA81
Tape           ALL, TU16, TU45, TU70, TU71, TU72, TU73, TU77, TU78, TA78
CI             CI20, HSC50
NI             NIA20, ALL
Unit Record    ALL, LPT, CDR
Network        ALL, Decimal number in range 0-511 (see Table 4-2)

Table 4-2 lists the classes available for selection of DECnet events.

Table 4-2: Network Event Classes

Class      Description

0          Management layer
1          Application layer
2          Session Control layer
3          Network services layer
4          Transport layer
5          Data link layer
6          Physical link layer
007-031    Reserved for other common event classes
032-063    Reserved for RSTS specific event classes
064-095    Reserved for RSX specific event classes
096-127    Reserved for TOPS-20 specific event classes
128-159    Reserved for VMS specific event classes
160-191    Reserved for RT specific event classes
192-479    Reserved for future use
480-511    Reserved for Customer specific event classes

For more information concerning network entries from DECnet, refer to the DECnet documentation for system managers and operators.

If you specify Error as an entry selection, you can also specify an error type. See Table 4-4 for a list of error types.

4.4.2 RETRIEVE Output

RETRIEVE output can be in the following forms:

1. One or two lines containing the most pertinent data in ASCII format.
2. All data about each event, in ASCII format.
3. All data about each event in octal dump format. This format is useful only for debugging the error-reporting system.
4. Specific events saved in binary format, for future reference.

Your default output can be an ASCII file, RETRIE.RPT, or a binary file, RETRIE.SYS.

You should be aware that user-defined entries that are unknown to SPEAR cannot be translated into ASCII. You can, however, get an octal dump of these entries by specifying OCTAL to the Output Mode prompt when running RETRIEVE.
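As an aside on the Network category, the class ranges in Table 4-2 lend themselves to a simple range lookup. The sketch below is only an illustration of the table, not anything SPEAR provides; the names NETWORK_EVENT_CLASSES and describe_event_class are hypothetical.

```python
# (low, high, description) rows transcribed from Table 4-2.
NETWORK_EVENT_CLASSES = [
    (0, 0, "Management layer"),
    (1, 1, "Application layer"),
    (2, 2, "Session Control layer"),
    (3, 3, "Network services layer"),
    (4, 4, "Transport layer"),
    (5, 5, "Data link layer"),
    (6, 6, "Physical link layer"),
    (7, 31, "Reserved for other common event classes"),
    (32, 63, "Reserved for RSTS specific event classes"),
    (64, 95, "Reserved for RSX specific event classes"),
    (96, 127, "Reserved for TOPS-20 specific event classes"),
    (128, 159, "Reserved for VMS specific event classes"),
    (160, 191, "Reserved for RT specific event classes"),
    (192, 479, "Reserved for future use"),
    (480, 511, "Reserved for Customer specific event classes"),
]


def describe_event_class(number):
    """Return the Table 4-2 description for a decimal class number 0-511."""
    if not 0 <= number <= 511:
        raise ValueError("network event class must be in the range 0-511")
    for low, high, description in NETWORK_EVENT_CLASSES:
        if low <= number <= high:
            return description
    raise AssertionError("Table 4-2 ranges should cover 0-511")


print(describe_event_class(4))
print(describe_event_class(100))
```

Because the ranges are contiguous from 0 through 511, every valid class number matches exactly one row.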
An unusual event you may find in the system event file is a KLERR entry. The KLERR entries are different from most entries in that it takes several event file records to make up one complete entry. This is because the front-end must send information in pieces through the DTE interface along with all communications, console, and hard-copy data. Because of this, there is a chance that not all records will actually get through to the event file. When SPEAR sees that a KLERR entry is incomplete, it will type an error message (non-fatal) and will translate all available data anyway.

Each KLERR entry uses one sequence number. When looking at a RETRIEVE report, you may notice gaps between sequence numbers even if you have selected ALL entries. A KLERR entry is listed using the sequence number of the first record in the entry, but it is not listed until all records of the entry have been received. Because other entries may enter the event file before the front-end has sent all records of one KLERR entry, the KLERR entry will appear to be out of sequence. For example, you may find entries with the following sequence numbers:

1. Configuration status change
3. Disk error
6. Tape error
2. KLERR
8. Reload

You can translate the KLERR entry into its components by using the KLERR function. See Section 4.5 for details.

For step-by-step procedures for using RETRIEVE, refer to Section 4.4.3.

4.4.3 RETRIEVE Procedure

RETRIEVE allows you the option of converting events in the system event file into an ASCII format for listing on the terminal or lineprinter.
To begin with, RETRIEVE prompts with one or more of the following guidewords:

RETRIEVE Mode
Event or packet file (SERR:ERROR.SYS):
Packet numbers:
Selection to be (INCLUDED):
Selection type (ALL):
Sequence numbers:
Event codes:
Category (ALL):
Next category (FINISHED):
Mainframe devices (ALL):
Disk drives (ALL):
Tape drives (ALL):
CI controller (ALL):
Unit record devices (ALL):
Disk (structure IDs):
Tape (reel IDs):
Time from (EARLIEST):
Time to (LATEST):
Output mode (ASCII):
Merge with (NONE):
Report format (SHORT):
Output to (DSK:RETRIE.RPT):

4.4.3.1 Retrieving Selected Events - If you want to take all the defaults, type R/G to the SPEAR> prompt; otherwise, read the following procedure.

STEP 1

After typing RETRIEVE to the SPEAR> prompt, you are asked for the name of the input file:

TOPS-20     Event or packet file (SERR:ERROR.SYS):

or

TOPS-10     Event or packet file (SYS:ERROR.SYS):

Type one of the following:

1. The RETURN key - to select the default, the system event file.
2. Any file name, in the proper format, containing events stored in binary.
3. The name of a previous file that you RETRIEVEd in BINARY mode.

STEP 2

RETRIEVE then prompts for the method of selection:

Selection to be (INCLUDED):

Type one of the following:

1. The RETURN key - to select the default I[NCLUDED]. INCLUDED moves a few selected entries of various types into a separate file.
2. E[XCLUDED] - to select all but a few entry types.

STEP 3

After selecting INCLUDED or EXCLUDED, you receive the following prompt:

Selection type (ALL):

At this prompt, you have two separate lists from which to choose. Type one or more of the following from the first group:

1. E[RROR] - to select entries that contain actual failure data.
2. ST[ATISTICS] - to select statistic entries.
3. D[IAGNOSTICS] - to select entries created by a diagnostic.
4. CON[FIGURATION] - to select configuration entries.
5. O[THER] - to select entries that do not fit into the other types.

If you choose more than one of these types, separate each with a comma.

Or type one of the following from the second group:

1. The RETURN key or A[LL] - to select the default that extracts all entries. You will be asked for date and time limits next.
2. SE[QUENCE] - to select entries by sequence number. If you choose SEQUENCE, RETRIEVE prompts further with:

   Sequence numbers:

   Here you can specify one number, several numbers separated by commas, or a range of numbers separated by a hyphen.
3. COD[E] - to select entries on the basis of their octal code number. These numbers are listed in Table D-1 and in the SPEAR Reference card. If you choose CODE, RETRIEVE prompts you further with:

   Event codes:

   Here you can specify one number, several numbers separated by commas, or a range of numbers separated by a hyphen.

If you chose ERROR, STATISTICS, CONFIGURATION, OTHER, or a combination of these, proceed with Step 3A. If you chose ALL or CODE, proceed to Step 4. If you chose SEQUENCE, proceed to Step 6.

STEP 3A

If you choose ERROR, STATISTICS, CONFIGURATION, OTHER, or a combination of these types, you receive the following prompt:

Category (ALL):

Type one of the following:

1. The RETURN key or A[LL] - to select all the categories. This is the default.
2. M[AINFRAME] - to select errors occurring in specific mainframe components.
3. D[ISK] - to select entries occurring on disk subsystems or individual drives.
4. T[APE] - to select entries occurring on tape subsystems or individual drives.
5. CI - to select entries occurring on the CI interconnect or the HSC50 disk controller.
6. NI - to select entries occurring on the NI.
7. U[NITRECORD] - to select entries occurring on unit-record devices such as card readers and line printers.
8. NE[TWORK] - to select entries occurring on the network nodes.
9. O[PERATING-SYSTEM] - to select entries that are software related.
10. CO[MM] - to select entries occurring on communications devices.
11. P[ACKID] - to select entries occurring on specific disk packs.
12. R[EELID] - to select entries occurring on specific tape reels.

All categories except COMM and NI prompt further for specific device types. Table 4-3 lists the subprompts you can expect.

Table 4-3: Subprompts for Device Types

Device Type         Subprompt

MAINFRAME           Mainframe devices (ALL):
DISK                Disk drives (ALL):
TAPE                Tape drives (ALL):
CI                  CI controllers (ALL):
UNITRECORD          Unit record devices (ALL):
NETWORK             Event class and type (ALL):
OPERATING-SYSTEM    Operating System codes (ALL):
PACKID              Disk (structure IDs):
REELID              Tape (reel IDs):

Type ? at the subprompt level to get a list of acceptable responses, or refer to Table 4-1 in this manual.

If you chose ERROR as one of the selection types in STEP 3, you can also specify the particular error types for which you are looking in relation to the specific device. Table 4-4 lists the error types for the devices.

Table 4-4: Error Types

Prompts                     Error Types

Disk error type (ALL):      OFFLINE WRITE-LOCK UNSAFE MICROPROCESSOR SOFTWARE BUS CHANNEL-CONTROLLER READ-WRITE SEEK-SEARCH TIMING OTHER
Tape error type (ALL):      READ WRITE DEVICE-FORMATTER BUS CHANNEL-CONTROLLER SOFTWARE OFFLINE OPERATOR OTHER
CI error type (ALL): for CI20
CI error type (ALL): for HSC50
NI error type (ALL):
                            EBUS MBUS CRAM-PARITY CHANNEL-ERROR SERDES-OVERRUN EDS INCONSISTENT-DATA SERDES-OVERRUN EDC INCONSISTENT-DATA EBUS MBUS CRAM-PARITY CHANNEL-ERROR

STEP 3B

RETRIEVE keeps prompting you for categories until you either type FINISHED or press the RETURN key:

Next category (FINISHED):

Type one of the following:

1. The RETURN key or F[INISHED] - to take the default.
2. Another category.

Note that you can select disk entries by either DISK or PACKID and tape entries by either TAPE or REELID. If you are interested in media, use PACKID or REELID; otherwise, use DISK or TAPE.
If you specify both DISK and PACKID (or TAPE and REELID), you select all disk entries (or tape entries), not just those that match the selected media. If you want to select entries with a specific device and media, you must run RETRIEVE twice.

You can specify more than one device name by separating them with commas. For example:

Disk drives (ALL):DISK:RP06,RM03,RP05

You can always come back to error category selection (by using /REVERSE) to add parameters. Everything typed here remains until you type CTRL/U or CTRL/W.

Note that supplying a device type (RP06, RM03) causes SPEAR to search a different field than if you had supplied a physical name (DP130, MTA1, and so forth). If the name you supply does not match one of the known device types, SPEAR assumes that it is a physical name.

STEP 4

RETRIEVE then prompts you for the date and time limits of the entries you want to select:

Time from (EARLIEST):

Type one of the following:

1. The RETURN key or E[ARLIEST] - to select the beginning of the file. This is the default.
2. A date and time in the format dd-mmm-yy hh:mm:ss - to signify where to begin extracting entries. A date by itself defaults to one second after midnight.
3. A date and time in the format -nn - to indicate a reference point prior to the current date. For example, -7 causes RETRIEVE to begin extracting entries from seven days prior to the current day.

STEP 5

RETRIEVE then prompts for the end of the time period:

Time to (LATEST):

Type one of the following:

1. The RETURN key or L[ATEST] - to select the end of the file. This is the default.
2. A date and time in the format dd-mmm-yy hh:mm:ss - to indicate the last date for extracted entries. A date by itself defaults to one second after midnight.
3. A date and time in the format -nn - to indicate a reference point prior to the current date. For example, -13 causes RETRIEVE to stop extracting entries recorded thirteen days before the current date.
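The date and time responses accepted by the Time from and Time to prompts can be modeled in a few lines. The following Python sketch is an approximation under stated assumptions, not SPEAR's actual parser: the function name is hypothetical, two-digit years follow Python's own pivot (84 is read as 1984), and a relative -nn value is assumed to resolve to one second after midnight on the resulting day, which the manual does not state explicitly.

```python
from datetime import datetime, timedelta


def parse_time_spec(text, now=None):
    """Turn a 'dd-mmm-yy hh:mm:ss', 'dd-mmm-yy', or '-nn' response into a datetime."""
    text = text.strip()
    if text.startswith("-"):
        # Relative form: nn days before the current date.
        base = now if now is not None else datetime.now()
        day = base - timedelta(days=int(text[1:]))
        # Assumption: relative dates also start one second after midnight.
        return day.replace(hour=0, minute=0, second=1, microsecond=0)
    if " " in text:
        # Full date and time.
        return datetime.strptime(text, "%d-%b-%y %H:%M:%S")
    # A date by itself defaults to one second after midnight.
    return datetime.strptime(text, "%d-%b-%y").replace(second=1)


print(parse_time_spec("23-Feb-84"))
print(parse_time_spec("6-Mar-84 15:57:46"))
print(parse_time_spec("-7", now=datetime(1985, 3, 8, 12, 0, 0)))
```

The EARLIEST and LATEST keywords are left out on purpose, since they refer to the bounds of the event file rather than to a calendar date.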
STEP 6

RETRIEVE next prompts for style of output:

Output mode (ASCII):

Type one of the following:

1. The RETURN key or A[SCII] - to convert entries into ASCII format. This is the default.
2. B[INARY] - to retain the entries in their internal format.

If you choose ASCII, proceed to STEP 7. If you choose BINARY, skip to STEP 8.

STEP 7

After choosing ASCII, RETRIEVE prompts you for the form of your output:

Report format (SHORT):

Type one of the following:

1. The RETURN key or S[HORT] - to select the default. This selection produces a report with only the most essential information. No entry will be longer than three lines of 72 columns.
2. F[ULL] - to display all the information that the operating system recorded for that entry.
3. O[CTAL] - to produce a ones and zeros ASCII report. The ones and zeros represent the actual binary contents of the entry. Unless you are familiar with the internal format of the individual entries, this format has very little value. Its primary purpose is to aid in debugging the SPEAR program library.

STEP 8

If you specified BINARY as output style, RETRIEVE then prompts for another file name to give you an opportunity to combine two files into one for record-keeping purposes. The merged output file will be in the proper chronological order. Both files must be in binary format. The prompt is:

Merge with (NONE):

Type one of the following:

1. The RETURN key - to select the default of NONE.
2. A file name of another file containing entries from the system event file.

STEP 9

The last thing RETRIEVE asks for is the destination of the output. If you chose ASCII, the prompt is:

Output to (DSK:RETRIE.RPT):

If you chose BINARY, the prompt is:

Output to (DSK:RETRIE.SYS):

Type one of the following:

1. The RETURN key - to select the default RETRIE.RPT or RETRIE.SYS.
2. TTY: - to direct ASCII formatted output to the terminal. You should not request BINARY formatted output to be printed on the terminal.
3.
Any file name in the proper format for your system.

After you select the output destination and press RETURN, SPEAR asks you to confirm your decision:

    Type <cr> to confirm (/GO):

At this point, you can:

1. Press RETURN or type /GO to execute the RETRIEVE process.

2. Type /SHOW to list the parameters you have chosen.

3. Type /REVERSE to return to the previous prompt.

4. Type /BREAK to return to SPEAR> level.

5. Type question mark (?), HELP, the question mark switch (/?), or /HELP to find out what your options are.

If your output is formatted in ASCII and you decide to output the file to your disk area, you can list the file on the lineprinter by doing the following:

    Return to operating system command level by typing EXIT to the SPEAR> prompt.

    Use the PRINT command with any options available on your operating system.

4.4.3.2 Sample RETRIEVE Session - The following is a sample RETRIEVE session using the TOPS-20 system event file for input:

    @spear
    Welcome to SPEAR for TOPS-20. Version 2(605)
    Type "?" for help.
    SPEAR> retrieve
    RETRIEVE mode
    Event or packet file (SERR:ERROR.SYS):
    Selection to be (INCLUDED):
    Selection type (ALL): error,diagnostic
    Category (ALL): disk
    Disk drives (ALL): RP07
    Disk error type (ALL): ?
    One or more of the following:
    ALL OFFLINE WRITE-LOCK UNSAFE MICROPROCESSOR SOFTWARE BUS
    CHANNEL-CONTROLLER READ-WRITE SEEK-SEARCH TIMING OTHER HELP
    Disk error type (ALL): read-write
    Next Category (FINISHED):
    Time from (EARLIEST):
    Time to (LATEST):
    Output mode (ASCII):
    Report format (SHORT): full
    Output to (DSK:RETRIE.RPT):
    Type <cr> to confirm (/GO):

4.4.3.3 Short Format - The following is a sample of a RETRIEVE report in short format:

    @ty retrie.RPT
    SPEAR Version 2(565).
    Retrieval from SERR:ERROR.SYS
    Report generated 6-Mar-84 15:57:46-EST
    As directed by user
    Selected window: 23-Feb-84 00:00:01-EST to 26-Feb-84 00:00:01-EST.
    Selected records are included
    Selection type is ERRORS,
    Report sent to DSK:RETRIE.RPT

    SEQ    TIME      Thu 23 Feb 84
    1249.  03:12:43  DP100 WORK: RP07 SERIAL #2861.
           CONI RH= 0,222715 CHN STS= 540100,174632 SR= 0,51700 ER= 0,100000
           CYL/SURF/SEC= 212./27./3.
    1713.  08:15:49  DP040 RP06 SERIAL #0125.
           CONI RH= 0,202615 CHN STS= 500000,305600 SR= 0,51700 ER= 0,100000
           CYL/SURF/SEC= 0./0./1.
    1875.  11:26:39  DP000 SERR: RP06 SERIAL #0941.
           CONI RH= 0,222615 CHN STS= 540100,174024 SR= 0,51700 ER= 0,100000
           CYL/SURF/SEC= 603./10./16.

    SEQ    TIME      Fri 24 Feb 84
    328.   13:14:20  DP010 PUBLIC: RP06 SERIAL #0484.
           CONI RH= 0,222615 CHN STS= 540100,174066 SR= 0,51700 ER= 0,100000
           CYL/SURF/SEC= 93./12./0.
    372.   17:04:09  DP000 SERR: RP06 SERIAL #0941.
           CONI RH= 0,222615 CHN STS= 540100,174024 SR= 0,51700 ER= 0,100000
           CYL/SURF/SEC= 361./15./16.

    SEQ    TIME      Sat 25 Feb 84
    85.    10:43:36  DP110 GALAXY: RP07 SERIAL #2510.
           CONI RH= 0,322615 CHN STS= 540100,174632 SR= 0,51700 ER= 0,400
           CYL/SURF/SEC= 623./15./35.

4.4.3.4 Octal Format - The following is a sample of a RETRIEVE report in octal format.

    SPEAR Version 2(565).
    Retrieval from SERR:ERROR.SYS
    Report generated 6-Mar-84 16:08:12-EST
    As directed by user
    Selected window: 23-Feb-84 00:00:01-EST to 26-Feb-84 00:00:01-EST.
Selected records are included Selection type is ERRORS, Report sent to DSK:RETRIE.OCTAL Sequence # 1249 -- Record HEADER: 0/ 111001,,125124 1/ 131271,,257140 2/ 0,,116617 3/ 0,,5467 4/ 0,,2341 Record BODY: 0,,0 0/ 675762,,530000 1/ 1242,,440147 2/ 1,,74014 3/ 100000,,1 4/ 0,,222715 5/ 0,,2415 6/ 0,,35624 7/ 1,,234156 10/ 0,,172464 11/ 0,,0 12/ 0,,0 13/ 14/ 0,,0 732200,,177471 15/ 732200,,177471 16/ 720000,,15403 17/ 720000,,15403 20/ 0,,715652 21/ 600001,,0 22/ 0,,1 23/ 0,,0 24/ O,,0 25/ 0,,0 26/ 0,,324 27/ 0,,2214 30/ 4-21 THE SPEAR LIBRARY Sequence f 1713 -- Record HEADER: 0/ 111001,,125124 1/ 131271,,432751 2/ 0,,272430 3/ 0,,5467 4/ O,,3261 Record BODY: 0,,0 0/ 1/ 0, ,0 2/ 1242,,440146 0,,1 3/ 4/ 100000,,1 0,,202615 5/ 6/ 0,,2415 7/ 0, ,0 10/ 0,,466 11/ 0,,0 12/ 0,,0 0, ,0 13/ 0,,0 14/ 15/ 732204,,177771 732204,,177771 16/ 17/ 720004,,1 20/ 720004,,1 21/ 0, ,715436 200001,,0 22/ 0,,1 23/ 0,,0 24/ 0,,0 25/ 0,,0 26/ 27/ 0,,0 0,,1 30/ 4-22 THE SPEAR LIBRARY 4.4.3.5 Full Format - The following is an example of a full format: RETRIEVE SESSION SPEAR Version 2(565). Retrieval from SERR:ERROR.SYS Report generated 6-Mar-84 16:02:31-EST As directed by user Selected window: 23-Feb-84 00:00:01-EST to 26-Feb-84 00:00:01-EST. Selected records are included Selection type is ERRORS, Report sent to DSK:RETRIE.FULL *********************************************** MASSBUS DEVICE ERROR LOGGED ON Thu 23 Feb 84 03:12:43 DETECTED ON SYSTEM i 2871. RECORD SEQUENCE NUMBER: 1249. MONITOR UPTIME WAS *********************************************** 3:41:34 UNIT NAME: DP100 UNIT TYPE: RP07 UNIT SERIAL i: 2861. VOLUME ID: WORK LBN AT START OF XFER: 1074014 CYL: 212. SURF: 27. SECT: 3. OPERATION AT ERROR: DEV.AVAIL., GO + READ DATA(70) FINAL ERROR STATUS: 100000,1 RETRIES PERFORMED: 2. 
    ERROR: RECOVERABLE
        DRIVE EXCEPTION,CHN ERROR, IN CONTROLLER CONI
        DCK, IN DEVICE ERROR REGISTER

    CONTROLLER INFORMATION:
    CONTROLLER: RH20 # 1
    CONI AT ERROR: 0,222715 = DRIVE EXCEPTION,CHN ERROR,
    CONI AT END: 0,2415 = NO ERROR BITS DETECTED
    DATAI PTCR AT ERROR: 732200,177471
    DATAI PTCR AT END: 732200,177471
    DATAI PBAR AT ERROR: 720000,15403
    DATAI PBAR AT END: 720000,15403

    CHANNEL INFORMATION:
    CHAN STATUS WD 0: 200000,174567
    CW 1: 0, 0  CW 2: 0, 0
    CHN STATUS WD 1: 540100,174632
        NOT SBUS ERR,NOT WC = 0,LONG WC ERR,
    CHN STATUS WD 2: 614005,377200

    DEVICE REGISTER INFORMATION:
    AT ERROR AT END CR(00): 40713 4070 DEV. AVAIL., READ DATA(70) 11700
    SR(01): 51700 ERR,MOL,PGM,DPR,DRY,VV, ER(02): 100000 0 DCK,
    MR(03): 13 0 0 AS(04): 0 DA(05): 154134 15407 D. TRK = 33, D. SECT. = 4
    DT(06): 241342 24042 700 LA(07): 1700 SN(10): 24141 24141 0 OF(11): 0
    DC(12): 324 324 212. CC(13): 324 324 212. E2(14): 13 0 NO ERROR BITS
    DETECTED 0 E3(15): 0 NO ERROR BITS DETECTED EP(16): 1454 0 PL(17): 2400 0
    DIFF. 13 40000 100000 13 0 3 0 10013 121 121 0 0 0 0 1454 2400

    DEVICE STATISTICS AT TIME OF ERROR:
    # OF READS: 342126.  # OF WRITES: 62772.  # OF SEEKS: 15252.
    # SOFT READ ERRORS: 1.  # SOFT WRITE ERRORS: 121.
    # HARD READ ERRORS: 0.  # HARD WRITE ERRORS: 0.
    # SOFT POSITIONING ERRORS: 0.  # HARD POSITIONING ERRORS: 0.
    # OF MPE: 121.  # OF NXM: 0.  # OF OVERRUNS: 121.

4.5 KLERR

The KLERR function translates the front-end log. This log is summarized in the system event file as the FRONT END DEVICE REPORT "KLERR" entry. This entry is written into the system event file when the KL clock stops for any of several errors (FAST MEMORY, PARITY ERRORS, CRAM PARITY, DRAM PARITY ERROR, or FIELD SERVICE STOP). Any significant error signal will be listed just after the header.

You can use KLERR to generate detailed reports and/or summaries of KLERR data blocks.
You always get a summary, but you must select one of three formats if you want a detailed report of each event.

KLERR helps KL10 maintainers by automating some of the time-consuming tasks associated with interpreting front-end snapshots logged in the TOPS-10 and TOPS-20 system event files. RSX-20F stores a list of function reads (FREADs) and their results in octal. Determining the cause of a crash by reading these octal function-read words is difficult because:

• The KL10 registers are split between function-read words and must be reconstructed manually.

• It takes time to find the signal names associated with each bit.

• Some registers are difficult to reconstruct.

• It is difficult to see patterns across multiple events.

To use KLERR effectively, check the daily ANALYZE report. If KLERR records are being written, the ANALYZE report will include a message to that effect. The report will also show whether any error bits were set. You can use the ANALYZE packet number as input to RETRIEVE short format to find what error bits were set, or use full format to get all the function reads in octal. If this does not successfully localize the fault, use the KLERR function.

4.5.1 KLERR Input

KLERR accepts the following types of input:

• The system event file

• A binary file created by the RETRIEVE process

• Any binary file containing entries from the system event file

4.5.2 KLERR Procedure

KLERR prompts you with one or more of the following guidewords:

    KLERR mode
    Event file (SERR:ERROR.SYS):
    Selection (ALL):
    Sequence numbers:
    Time from (EARLIEST):
    Time to (LATEST):
    Report style (SUMMARY-ONLY):
    Summary type (ERRORS-ONLY):
    Output to (DSK:KLERR.RPT):

If you want to take all the defaults, type KLE/G to the SPEAR> prompt.
Otherwise, read the following procedure:

STEP 1

After you type KLERR to the SPEAR> prompt, KLERR requests the name of the input file:

    Event file (SERR:ERROR.SYS):    TOPS-20
    or
    Event file (SYS:ERROR.SYS):     TOPS-10

Type one of the following:

1. The RETURN key - to take the default, the system event file.

2. Any file in binary format containing KLERR events.

STEP 2

Next KLERR prompts you to select all KLERR events or specific ones by sequence number:

    Selection (ALL):

Type one of the following:

1. The RETURN key or A[LL] - to take the default of all KLERR events in the file. You will be prompted for date and time constraints.

2. S[EQUENCE] - to select specific KLERR events by sequence number.

If you choose SEQUENCE, KLERR prompts you further with:

    Sequence numbers:

Here you can specify one number, several numbers separated by commas, or a range of numbers separated by hyphens.

If you chose ALL, continue with STEP 3. If you chose SEQUENCE, continue with STEP 5.

STEP 3

KLERR then prompts you for the date and time limits of the entries you want to select:

    Time from (EARLIEST):

Type one of the following:

1. The RETURN key or E[ARLIEST] - to select the beginning of the file. This is the default.

2. A date and time in the format dd-mmm-yy hh:mm:ss - to signify where to begin extracting entries. A date by itself defaults to one second after midnight.

3. A date and time in the format -nn to indicate a reference point prior to the current date. For example, -7 causes KLERR to begin extracting entries seven days prior to the current day.

STEP 4

KLERR then prompts for the end of the time period:

    Time to (LATEST):

Type one of the following:

1. The RETURN key or L[ATEST] - to select the end of the file. This is the default.

2. A date and time in the format dd-mmm-yy hh:mm:ss to indicate the last date for extracting entries. A date by itself defaults to one second after midnight.

3.
A date and time in the format -nn to indicate a reference point prior to the current date. For example, -13 causes KLERR to stop extracting entries recorded thirteen days before the current date.

STEP 5

KLERR then prompts for the type of report in which you are interested:

    Report style (SUMMARY-ONLY):

Type one of the following:

1. The RETURN key or S[UMMARY-ONLY] - to take the default. This report will contain only the final summary of signals. It will not have the entry-by-entry output.

2. F[ULL] - to select a set of detailed reports that list all the registers and signals (true or false) as well as their fields.

3. T[RUE-SIGNALS] - to select a set of detailed reports that list all of the registers, but only the true signals and not the fields.

4. C[RAM-BAD-WORD] - to select a set of reports consisting of one line for each record that includes a CRAM parity error. This line contains the CRAM location and contents.

If you chose CRAM-BAD-WORD, continue with STEP 5A; otherwise continue with STEP 6.

STEP 5A

If you choose CRAM-BAD-WORD, you are then prompted with a choice of formats:

    Cram word format (MICROCODE):

Type one of the following:

1. The RETURN key or M[ICROCODE] - to select the default. This format is a comparison of the bad cram word with the microcode listing.

2. O[CTAL] - to select a format that matches the one shown in the KL10 Maintenance Handbook and can help isolate the failing cram module.

3. T[RACON] - to select a format that compares TRACON snapshots.

STEP 6

The next information KLERR prompts for is the type of summary in which you are interested:

    Summary type (ERRORS-ONLY):

Type one of the following:

1. The RETURN key or E[RRORS-ONLY] - to select the default. This summary is in the form of a single page list containing the number of times an error signal was true and the number of times it was false.

2. A[LL] - to select a summary with a complete listing of the number of times each signal was true or false.

3.
N[ONE] - to select the option of receiving no summary.

The last thing KLERR asks for is the destination of the output file:

    Output to (DSK:KLERR.RPT):

Type one of the following:

1. The RETURN key - to select the default of KLERR.RPT.

2. TTY: - to direct the ASCII formatted output to your terminal.

3. Any file name in the proper format for your system.

After you select the output destination and press RETURN, SPEAR asks you to confirm your decision:

    Type <cr> to confirm (/GO):

At this point, you can:

1. Press the RETURN key or type /GO to execute the KLERR process.

2. Type /SHOW to list the parameters you have chosen.

3. Type /REVERSE to return to the previous prompt.

4. Type /BREAK to return to the SPEAR prompt.

5. Type question mark (?), HELP, the question mark switch (/?), or /HELP to find out what your options are.

4.5.3 Sample KLERR Session

The following is a sample session of the KLERR dialogue:

    @spear
    Welcome to SPEAR for TOPS-20. Version 2(605)
    Type "?" for help.
    SPEAR> klerr
    KLERR mode
    Event file (SERR:ERROR.SYS):
    Selection (ALL): sequence
    Sequence numbers: 846
    Report style (SUMMARY-ONLY): ?
    One of the following:
    SUMMARY-ONLY TRUE-SIGNALS FULL CRAM-BAD-WORD HELP
    Report style (SUMMARY-ONLY): cram
    Cram word format (MICROCODE): ?
One of the following: MICROCODE OCTAL TRACON ALL HELP Cram word format (MICROCODE): tracon Summary type (ERRORS-ONLY): Output to (DSK:KLERR.RPT): Type <cr> to confirm (/GO): 4-29 THE SPEAR LIBRARY 4.5.4 KLERR Output The following is a sample of KLERR output: ******************~**************************** FRONT END DEVICE REPORT "KLERR" TYPE 205 LOGGED ON 15-Nov-83 04:52:57 MONITOR UPTIME WAS 0 DAY{S) 0:0:14 DETECTED ON SYSTEM f 2241 RECORD SEQUENCE NUMBER: 316 *********************************************** Registers: AR: 000000,,0~0000 BR: 000000,,000000 MQ: 001100,,002000 ARX: 000000,,000000 BRX: 002000,,020000 ADX: 000000,,000000 PC: 00,,005636 VMA: 00,,005636 VMA HELD: 00,,005636 PI ON: 177 PI HOLD: 000 PI GEN: 000 FM: 000000,,273041 AD: 000000,,000000 SC: 0000 FE: 0000 FM BLOCK: 00 FM ADDR: 04 CRAM word in octal: LOC 0-15 16-31 32-47 48-63 64-79 80-85 1044/ 001044 070000 104041 000020 000002 10 CRAM word by field (microcode listing format): LaC ABC D E F G 1044, 1044,0001,0400,0020,1020,7110,0000 CRAM word by field (TRACON format): LaC / J T AR AD BR MQ FM SCAD SC FE SH # VMA MEM COND SPEC M 1044/1044 1 40 1000 0 0 0 200 0 0 1 000 0 00 71 10 0 DRAM word by field: ADR: A B P J 254/ 2 0 0 144 4-30 THE SPEAR LIBRARY Signal name breakdown follows (Error signals first) - Signals in alphabetical order STATE NAME F F F F F F F F F APR2-M8539-APR C DIR P ERR IN H APR1-M8539-APR I/O PF ERR IN H APR1-M8539-APR MB PAR ERR IN H APR1-M8539-APR NXM ERR IN H APR2-M8539-APR S ADR P ERR IN H APR1-M8539-APR SBUS ERR IN H APR2-M8539-APR ANY EBOX ERR FLG H APR2-M8539-APR PWR FAIL IN H CHC1-M8533-CBUS ERROR E H - Fields from function reads VALUE FIELD o CCW2-M8534-CCW CHA 18-23 H o CCW2-M8534-CCW CHA 14-17 H o CCW2-M8534-CCW CHA 24-29 H o CCW2-M8534-CCW CHA 30-35 H o PIC4-M8532-EBUS CS00-03 E H o MBZ1-M8537-EBUS REG 00-08 H o MBZ1-M8537-EBUS REG 14-26 H 33 2 10 77400 o 600 11070 2 11000 MBC1-M8531-EBUS REG 27-33 H MBZ1-M8537-EBUS REG 34,35 H IRDI-M8522-IR AC 09-12 H 
    MTR1-M8538-MTR CACHE COUNT 02-17 H
    MTR1-M8538-MTR EBOX COUNT 02-17 H
    MTR1-M8538-MTR INTERVAL 06-17 H
    MTR1-M8538-MTR PERF COUNT 02-17 H
    MTR3-M8538-MTR PERIOD 06-17 H
    MTR1-M8538-MTR TIME 02-17 H

    ** End of KLERR report. 1. entries were processed.

4.6 SUMMARIZE

SUMMARIZE reads the system event file and summarizes its contents according to the following categories:

1. Event code

2. STOPCODE (TOPS-10)

3. BUGCHK, BUGHLT, BUGINF (TOPS-20)

4. Front-end reloads

5. Channel errors

6. Disk errors

7. Magnetic tape errors

The SUMMARIZE report also contains Error Distribution tables. These tables show a 24 hour distribution of events listed according to subsystem. With these tables, you can determine when the largest number of events is occurring. Once you know the subsystem (Mainframe, Disk, Tape, and so forth) and the timeframe, you can use RETRIEVE or ANALYZE to pinpoint the specific device that is causing the problem.

After reading the file, SUMMARIZE produces an ASCII report file containing the summaries and Error Distribution tables and stores it in your disk area (or wherever you specify). You can then print the report on the lineprinter for inspection. You can also print the report on the terminal by specifying TTY: to SPEAR's request for the output destination.

SUMMARIZE allows you to pinpoint the timeframe of the summaries by requesting a beginning date and an ending date to search for in the system event file. In addition, you can also specify a binary file created with the RETRIEVE process (RETRIE.SYS) for input. See Section 4.4 for information on RETRIEVE.
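The idea behind the Error Distribution tables, a count of events per hour of the day and per subsystem, can be sketched in Python. This is an illustration only and not how SPEAR itself builds the tables. It assumes each event has already been reduced to a (timestamp, subsystem) pair; the function names and the sample events are hypothetical.

```python
from collections import Counter
from datetime import datetime

def error_distribution(events):
    """Count events per (hour of day, subsystem) pair."""
    table = Counter()
    for when, subsystem in events:
        table[(when.hour, subsystem)] += 1
    return table

def busiest_hour(table, subsystem):
    """Return the hour (0-23) with the most events for one subsystem, or None."""
    hours = {hour: n for (hour, name), n in table.items() if name == subsystem}
    return max(hours, key=hours.get) if hours else None

# Hypothetical sample events, invented for illustration only.
sample = [
    (datetime(1984, 3, 14, 9, 5), "Disk"),
    (datetime(1984, 3, 14, 9, 40), "Disk"),
    (datetime(1984, 3, 14, 13, 2), "Tape"),
    (datetime(1984, 3, 14, 21, 30), "Disk"),
]
table = error_distribution(sample)
```

Once the busiest hour for a subsystem is known, the same window can be given to RETRIEVE or ANALYZE as the Time from and Time to limits.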
4.6.1 The SUMMARIZE Report

The following example is representative of a SUMMARIZE report in that it contains:

• File environment information

• Entry occurrence counts

• System event codes, shown in parentheses under the entry occurrence counts

• Summaries of bugchecks and subsystems

• Error distribution tables

Note that if the media name cannot be identified in reports that include media identification, SUMMARIZE uses three specific formats:

1. <unknown> - if SUMMARIZE does not find a mount record in the error file prior to the time of the error.

2. <none> - if a series of mount and dismount records indicate no medium was mounted at the time of the error, such as an error occurring during the mount process.

3. <blank> - if SUMMARIZE finds a mount record but the medium-name field of the mount record is empty.

Note the error register codes listed in the report are described in Section 4.6.2.

    File Environment

    SPEAR Version 2(613)
    Input file: SERR:ERROR.SYS Created: 12-Mar-84 08:49:00-EST
    Output file: DSK:SUMMAR.RPT
    Selection Criteria: ALL
    Date of first entry processed: 14-Mar 01:22:13
    Date of last entry processed: 14-Mar 23:53:38
    Number of entries processed: 1128.
    Number of inconsistencies detected in error file: 0.

    Entry Occurrence Counts:
        9.  SYSTEM RELOAD ... (101)
      496.  MONITOR BUG ... (102)
       36.  MASSBUS ERROR ... (111)
      120.  STATISTICS ... (114)
        8.  CONFIGURATION CHANGE ... (115)
      102.  FRONT END DEVICE ERROR ... (130)
        1.  CPU PARITY INTERRUPT ... (162)
      294.  PHASE III DECNET ENTRY ... (240)
       62.  HSC50 ERROR LOG ... (243)

    Monitor Detected Errors and Reloads:
       43.  BUGCHK
        4.  BUGHLT
      449.  BUGINF

    Monitor Error and Reload Breakdown:

    BUGCHK Breakdown
        8.  FLKTIM
        2.  KLPERR
       17.  MSCORO
        3.  NODDMP
        5.  PI2ERR
        4.  SCACVC
        4.  SCATMO

    BUGHLT Breakdown
        1.  ILPSEC
        1.  NOTOFN
        1.  SKDPFI
        1.  UNPGF2

    BUGINF Breakdown
        8.  CFCONN
        4.  KLPCVC
       29.  KLPNUP
        1.  KLPRRQ
        1.  KLPSTR
       28.  MSCAVA
        2.  MSCDSR
        7.  MSCPTG
      324.  NSPBAD
       29.  NSPLAT
        2.  NTOHNG
        1.  SPRZRO
        1.  TMBAEI
       12.
TTYSTP Front-end Summary: 10. CD20 10. DHll 10. DLIIC 10.DMll 1. DMll-3 6. KLCPU 45. KLERR records forming 5. full entries 10. LP20 DECnet Summary: Class.Type Count 0.0 0.3 2 .. 0 1 0. 4 .. 0 29. 233. 1. 6. 5. 4 .. 1 4 .. 4 4.7 4 .. 10 B. 2. Description Event records lost Automatic line service Local node state change Aged packet loss Node unreachable packet loss Packet format error Circuit down, circuit fault Circuit up RH20 Channel/Controller Summary: f 1 f 2 Hard 0. 5. Soft 1. 30. Hard Soft 0. 1. RP07 Summary: SIN 2861 DP100 4-34 THE SPEAR LIBRARY TM78 Summary: Hard Soft 2. 4. 3. 26. SIN 4404 MT200 SIN 5242 MT210 RH20 Breakdown (CONI) PAR ERR EXC LWC ERR SWC ERR CHN ERR RES ERR RAE OVR RUN DP100 SOFT 1. HARD 2. SOFT 4. HARD 3. SOFT 26. 1. MT200 MT200 MT210 MT210 *-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-* * * Disk Subsystem Error Summary * * * * *-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-* Disk Subsystem Error Entries Summarized by Device, then Error Type. Where the Error Types are the following: OTHER TIMIN SK-SR READ CH-CO BUS SOFT MICRO UNSAF WRTLK OFFLI = = OTHER TIMING SEEK-SEARCH READ-WRITE CHANNEL-CONTROLLER BUS HARDWARE DETECTED SOFTWARE ERROR MICROPROCESSOR DETECTED ERROR UNSAFE WRITE LOCK OFFLINE OTHER TIMIN SK-SR READ CH-CO BUS SOFT MICRO UNSAF WRTLK OFFLI DP100 1. DU-7-14-17 36. 3. 19. 3. DU-7-3-17 Read Data Errors further summarized by Drive and Media ID. Drive Media' Error Totals DP100 WORK 1. 4-35 1. THE SPEAR LIBRARY *-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-* ** This report summarizes all Read Data Errors by Drive and Media IO ** * * *-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-* DRIVE MEDIA CYL TRK SECT DP100 WORK 565. 5. SOFT RETRIES HARD 15. LBN ------- 1. 0. 2,,756704 2. RP07 BREAKDOWN: Error Register 1 D C U N S K SIN 2861 DP100 S 0 P I D T E W I L A A E E E 0 H C R C H C E E C H W C F F E R P A R R M R I L R I L F 1. 
*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-* * * * Tape Subsystem Error Summary * * * *-*-*-*-*-*-~-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-* Tape Subsystem Error Entries Summarized by Device, then Error Type. Where the Error Types are the following: OTHER READ WRITE FORMT CH-CO BUS SOFT OPER OFFLI OTHER READ OTHER READ WRITE DEVICE FORMAT CHANNEL-CONTROLLER BUS HARDWARE DETECTED SOFTWARE ERROR OPERATOR OFFLINE WRITE FORMT CH-CO BUS MT200 6. MT2l0 29. 4-36 SOFT OPER OFFLI THE SPEAR LIBRARY *-*-*-*-*-*-*-*-*~*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-* ** SUMMARY of all Errors sorted by Media and Drive by * Operation. * ** * * *-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-* Operation : WRITE Related MEDIA UNIT ID 10 MT200 unknown TOTAL MT210 TOTAL ------ ------ ------ 6. 29. 35. 6. 29. 35. TM78 Breakdown: (Interrupt and Failure Codes are OCTAL) Interrupt Failure Hard Code Code Soft SIN 4404 MT200 MT200 MT200 SIN 5242 MT210 MT210 MT210 MT210 MT210 22 (WRITE) 22 (WRITE) 22 (WRITE) 7 10 14 0. 0. 2. 3. 1. 0. 22 (WRITE) 22 (WRITE) 22 (WRITE) 22 (WRITE) 22 (WRITE) 1 4 7 10 14 0. 0. 0. 0. 3. 7. 10. 1. 8. 0. Error distribution 14-Mar-84 1:00 6:00 8:00 9:00 10:00 11:00 12:00 13:00 14:00 15:00 16:00 17:00 18:00 19:00 20:00 21:00 22:00 23:00 - 2:00 7:00 9:00 10:00 11:00 12:00 13:00 14:00 15:00 16:00 17:00 18:00 19:00 20:00 21:00 22:00 23:00 0:00 Totals Main-IDisk ITape IUnit IComm INet- ISoft-ICrashlTotals frame I I I rec I I work I ware I I -----+-----+-----+-----+-----+-----+-----+-----+----11. 6. 5. 19. 7. 35. 20. 6. 13. 5. 10. 6. 9. 9. 9. 4. 1. 2. 11. 8. 4. 2. 4. 1. 1. 3. 27. 91. 19. 22. 17. 21. 19. 12. 16. 12. 64. 31. 7. 6. 3. 9. 7. 45. 76. 6. 38. 43. 39. 38. 38. 38. 25. 133. 56. 28. 22. 3. 10. 10. 86. 167. 25. 62. 72. 69. 61. 52. 58. -----+-----+-----+-----+-----+-----+-----+-----+----46. I 63. I 35. I 1. I 294. I I 505. I 950. I 4-37 THE SPEAR LIBRARY Due to the addition of the CI and HSC50, you will find another for listing the names of disks in the SUMMARIZE report. 
In the previous report, you will find the following format:

    DU-7-14-17
    DU-7-3-17

Starting from left to right, these four fields represent the following:

    Field one      Device type
                       DU = RA80, RA81
                       DJ = RA60
                       ?? = unknown

    Field two      RH slot number for the CI20. This is always number 7.

    Field three    HSC50 node number on the CI.

    Field four     Drive number on the push button. If the HSC50 cannot get this number, the number 4095 appears in this field.

Note you will find a description of the Disk Subsystem Error Bits in Appendix D.

4.6.2 Error Register Codes

The following tables contain brief explanations of the abbreviations of the error register codes (MASSBUS disk registers for RP04s and RP06s and tape registers for TU45s, TU77s, and TE16s).

Table 4-5: MASSBUS Disk Registers

    Error Register 1

    Code    Meaning

    DCK     Data Check
    UNS     Unsafe
    OPI     Operation Incomplete
    DTE     Drive Timing Error
    WLE     Write Lock Error
    IAE     Invalid Address Error
    AOE     Address Overflow Error
    HCRC    Header CRC Error
    HCE     Header Compare Error
    ECH     ECC Hard Error
    WCF     Write Clock Fail
    FER     Format Error
    PAR     Parity Error
    RMR     Register Modification Refused
    ILR     Illegal Register
    ILF     Illegal Function
    Error Register 2

    Code    Meaning

    ACU     RP04 - AC Unsafe
            RP06 - Unused
    PLU     Phase Locked Oscillator Unsafe
    30VU    RP04 - 30 Volts Unsafe
            RP06 - Unused
    IXE     Index Error
    NHS     No Head Select
    MHS     Multiple Head Select
    WRU     Write Ready Unsafe
    FEN     RP04 - Failsafe Enabled
    ABS     RP06 - Abnormal Stop
    TUF     Transition Unsafe
    TDF     Transition Detector Failure
    MSE     RP04 - Motor Sequence Error
    R&W     RP06 - Read and Write
    CSU     Current Switch Unsafe
    WSU     Write Select Unsafe
    CSF     Current Sink Failure
    WCU     Write Current Unsafe

    Error Register 3

    Code    Meaning

    OCYL    Off Cylinder
    SKI     Seek Incomplete
    OPE     RP04 - Unused
            RP06 - Operator Plug Error
    ACL     AC Voltage Unsafe
    DCL     DC Voltage Unsafe
    DIS     RP04 - Unused
    35V     35 Volts Unsafe
    UWR     RP04 - Any Unsafe Except Read/Write
            RP06 - Unused
    VUF     RP04 - Velocity Unsafe
    WOF     RP06 - Write and Unsafe
    PSU     RP04 - Pack Speed Unsafe
    DCU     RP06 - DC Voltage Unsafe

Table 4-6: Tape Registers

    Meaning

    PE - Correctable Data Error
    NRZI - CRC Does Not Match Computed CRCC
    Unsafe
    Operation Incomplete
    Drive Timing Error
    Nonexecutable Function
    PE - Correctable Skew
    NRZI - Illegal Tape Mark
    Frame Count Error
    Nonstandard Gap Tape Character
    PE - Format Error
    NRZI - Longitudinal Redundancy Check
    PE - Noncorrectable Data Error
    NRZI - Vertical Parity Error
    Data Bus Parity Error
    Format Error
    Control Bus Parity
    Register Modification Refused
    Illegal Register
    Illegal Function

4.6.3 SUMMARIZE Procedure

SUMMARIZE prompts with one or more of the following guidewords:

    SUMMARIZE mode
    Event file (SERR:ERROR.SYS):
    Category (ALL):
    Time from (EARLIEST):
    Time to (LATEST):
    Show Error Distribution (YES):
    Report to (DSK:SUMMAR.RPT):

If you want to take all the defaults, type S/G to the SPEAR> prompt; otherwise, read the following procedure:

STEP 1

After you type SUMMARIZE to the SPEAR> prompt, SUMMARIZE requests the name of the input file:

    Event file (SERR:ERROR.SYS):    TOPS-20
    or
    Event file (SYS:ERROR.SYS):     TOPS-10
Type one of the following:

1. The RETURN key - to take the default, the system event file.

2. The name of a file you have previously RETRIEVEd, in binary format, for example RETRIE.SYS.

3. Any file in binary format containing events from the system event file.

STEP 2

SUMMARIZE asks for the category of the summary in which you are interested:

    Category (ALL):

Type one of the following:

1. The RETURN key or A[LL] - to take the default of all categories.

2. M[AINFRAME] - to select a summary for mainframe events.

3. D[ISK] - to select a summary for disk devices.

4. T[APE] - to select a summary of tape devices.

5. CI - to select a summary of CI-related events.

6. NI - to select a summary of NI-related events.

7. U[NITRECORD] - to select a summary of hard-copy devices.

8. NE[TWORK] - to select a summary of network-related events.

9. O[PERATING-SYSTEM] - to select a summary of software-related events.

10. CO[MM] - to select a summary of communication devices.

11. P[ACKID] - to select a summary of specific disk packs.

12. R[EELID] - to select a summary of specific tape reels.

All categories except for COMM and NI prompt for specific device types. Table 4-7 lists the subprompts you can expect.

Table 4-7: Subprompts for Device Types

    Device Type         Subprompt

    MAINFRAME           Mainframe devices (ALL):
    DISK                Disk drives (ALL):
    TAPE                Tape drives (ALL):
    CI                  CI controllers (ALL):
    UNITRECORD          Unit record devices (ALL):
    NETWORK             Event class and type (ALL):
    OPERATING-SYSTEM    Operating System codes (ALL):
    PACKID              Disk (structure IDs):
    REELID              Tape (reel IDs):

STEP 3

SUMMARIZE keeps prompting you for categories until you either type FINISHED or press the RETURN key:

    Next Category (FINISHED):

Type one of the following:

1. The RETURN key or F[INISHED] - to take the default.

2. Another category.
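The bracket notation used in the category list (for example M[AINFRAME] or NE[TWORK]) marks the shortest abbreviation shown for each keyword. The following Python sketch shows one way such abbreviations could be matched. It is an illustration only, not SPEAR's own parser; the table and the function name match_category are hypothetical, and an empty reply is left to the caller, which would apply the prompt's default.

```python
# (shortest abbreviation, full keyword), taken from the Category list above.
CATEGORIES = [
    ("A", "ALL"), ("M", "MAINFRAME"), ("D", "DISK"), ("T", "TAPE"),
    ("CI", "CI"), ("NI", "NI"), ("U", "UNITRECORD"), ("NE", "NETWORK"),
    ("O", "OPERATING-SYSTEM"), ("CO", "COMM"), ("P", "PACKID"), ("R", "REELID"),
]

def match_category(reply):
    """Return the full keyword for a typed reply, or None if it does not match.

    A reply matches when it is at least as long as the shortest abbreviation
    and is a prefix of the full keyword. Case is ignored.
    """
    typed = reply.strip().upper()
    for shortest, full in CATEGORIES:
        if typed.startswith(shortest) and full.startswith(typed):
            return full
    return None

first = match_category("main")   # as typed in the sample SUMMARIZE session
second = match_category("disk")
```

Under this rule a reply of "n" or "c" alone matches nothing, because NE, NI, CI, and CO all require two letters.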
STEP 4

After you have specified the source of input, SUMMARIZE prompts you for the date and time at which you want the summary to begin:

    Time from (EARLIEST):

Type one of the following:

1. The RETURN key - to take the default EARLIEST, the first event in the file.

2. A date and time in the format dd-mmm-yy hh:mm:ss - to signify where to begin extracting entries. A date by itself defaults to one second after midnight.

3. A date and time in the format -nn to indicate a reference point prior to the current date. For example, -7 causes SUMMARIZE to begin extracting entries seven days prior to the current day.

STEP 5

SUMMARIZE then prompts for the end of the time period:

    Time to (LATEST):

Type one of the following:

1. The RETURN key - to take the default LATEST, the last entry in the system event file.

2. A date and time in the format dd-mmm-yy hh:mm:ss to indicate the last date for extracted entries. A date by itself defaults to one second after midnight.

3. A date and time in the format -nn to indicate a reference point prior to the current date. For example, -13 causes SUMMARIZE to stop extracting entries recorded thirteen days before the current date.

STEP 6

After specifying a timeframe, you can choose whether or not to receive the error distribution tables:

    Show Error Distribution (YES):

Type one of the following:

1. The RETURN key or Y[ES] - to take the default. This will give you all the error distribution charts relevant to the time constraints you specify.

2. N[O] - to suppress the error distribution charts from the report.

STEP 7

The last thing SUMMARIZE asks for is the destination of the output:

    Report to (DSK:SUMMAR.RPT):

Type one of the following:

1. The RETURN key - to take the default DSK:SUMMAR.RPT.

2. Any file name in the proper format.

3. TTY: - to have the report printed on your terminal. Note that if you specify TTY:, SUMMARIZE does not save the file in your disk area.
After you select the output destination and press RETURN, SPEAR asks you to confirm your decision:

    Type <cr> to confirm (/GO):

At this point you can:

1. Press RETURN or type /GO to execute the SUMMARIZE process.

2. Type /SHOW to list the parameters you have chosen.

3. Type /REVERSE to return to the previous prompt.

4. Type /BREAK to return to SPEAR level.

5. Type question mark (?), HELP, the question mark switch (/?), or /HELP to find out what your options are.

To read the SUMMARIZE report, you can list the file on the lineprinter by doing the following:

    Return to operating system command level by typing EXIT to the SPEAR> prompt.

    Use the PRINT command with any options available on your operating system.

Note that if you specified TTY: to the Report to: prompt, you will not have a file saved in your area to print.

4.6.4 Sample SUMMARIZE Session

The following is a sample of a SUMMARIZE session using the system event file for input:

    @spear
    Welcome to SPEAR for TOPS-20. Version 2(605)
    Type "?" for help.
    SPEAR> summarize
    SUMMARIZE mode
    Event file (SERR:ERROR.SYS):
    Category (ALL): main
    Mainframe devices (ALL): cpu
    Next Category (FINISHED): disk
    Disk drives (ALL): rp07
    Next Category (FINISHED):
    Time from (EARLIEST):
    Time to (LATEST):
    Show Error Distribution (YES): no
    Report to (DSK:SUMMAR.RPT):
    Type <cr> to confirm (/GO):
    INFO - Summarizing ST:GIDNEY.02-27
    INFO - Now sending summary to DSK:SUMMAR.RPT
    INFO - Summary output finished
    SPEAR> ex

Table 4-8 lists the supported devices, according to subsystem, from which you can expect summaries.

Table 4-8: Supported Devices

    SUBSYSTEM      DEVICE                   DETAILED SUMMARIES?
    -----------    ---------------------    -------------------
    MAINFRAME      KL10                     YES
                   KS10                     NO
                   FRONT-END                YES

    CI             CI20                     YES
                   HSC                      YES

    DISK           RP03                     YES
                   RM03                     YES
                   RP04                     YES
                   RP05                     YES
                   RP06                     YES
                   RP07                     YES
                   RP20                     YES (DX20)
                   RS04                     YES
                   RA60                     YES
                   RA80                     YES
                   RA81                     YES

    TAPE           TU16                     YES
                   TU45                     YES
                   TU70                     YES
                   TU71                     YES
                   TU72                     YES
                   TU73                     YES
                   TU77                     YES
                   TU78                     YES

    UNIT RECORD    LPT                      YES
                   CDR                      YES

    COMM           DH11                     YES
                   DQ11                     YES

    NET            DECNET PHASE 2, 3, 4     YES
                   ANF 10                   YES
                   SNA 20                   YES
                   NIA20                    YES

4.7 TOPS-20 KLSTAT MODE

On TOPS-20, there is an additional troubleshooting aid that can be helpful if severe intermittent faults do not leave enough information in the system event file. This feature is the KLSTAT mode. When you turn KLSTAT on, you are actually turning on a monitor flag that tells the monitor to record additional information into the system event file when any CPU, memory, or MASSBUS errors occur.

Note that turning on this flag causes severe system degradation (the system goes down while KLSTAT is collecting data); you should turn it on only when absolutely necessary. In fact, you must have special privileges to turn it on or off.

When the KLSTAT mode is in operation, the system event file will contain KL CPU STATUS BLOCK entries. For a sample of such an entry, turn to Section 5.3.12. For the KLSTAT procedure, read the following section, Section 4.7.1.

4.7.1 KLSTAT Procedure

The KLSTAT mode has three functions: ON, OFF, and CHECK. The following procedure describes their use:

STEP 1

First, enable your special privileges at monitor level, either OPERATOR or WHEEL privileges. Then access SPEAR. (Note, you do not need privileges to CHECK the status of KLSTAT.)

STEP 2

Once at the SPEAR prompt, type K[LSTAT]:

    SPEAR>KLSTAT

SPEAR responds with:

    SPEAR>KLSTAT
    KLSTAT mode
    Extra reporting (CHECK):

STEP 3

At this point, type one of the three options. Pressing the Escape key gets you the default, CHECK. If you type ON, you will get this message:

    The following should be noted before proceeding!
    This function can cause SEVERE system degradation!
If you decide not to risk it, type /R to return to the SPEAR prompt.

If you respond with one of the three choices, SPEAR prompts with:

     Type <cr> to confirm (/GO):

If you chose ON or OFF, SPEAR returns you to the SPEAR prompt. If you chose CHECK, the default, SPEAR prints one of the following:

     (KLSTAT) Extra error reporting is currently enabled.

or

     (KLSTAT) Extra error reporting is currently disabled.

You can check the information gathered by turning on the KLSTAT mode by looking for the KL CPU STATUS BLOCK entry in the system event file. See Section 5.3.12.

4.8 COMPUTE

COMPUTE allows you to generate an ASCII report on the availability of system resources. When compiling its report, COMPUTE considers system statistics and monitor failures in its calculations.

The data base that COMPUTE uses differs slightly between the operating systems. On TOPS-10, the report data base is a file written by the monitor in the same format as the system event file. This TOPS-10 file contains reload information, device status-change data, date and time changes, and other pertinent information. The entries are written into this file when they occur, in the same manner as the entries are written into the system event file.

COMPUTE files on TOPS-10 are grouped starting with the first monitor load and ending with the last reload in the selected directory. The files are named AVAIL.Ann, beginning with AVAIL.A01 for the first week (the oldest file in the group), AVAIL.A02 for the second week, and so forth, up to the current (incomplete) file AVAIL.SYS. To find out the file names of the available weeks, do a directory search of SYS:AVAIL.* by typing DIR SYS:AVAIL.* at operating system command level.

On TOPS-20, the report data base is the system event file, ERROR.SYS. For COMPUTE purposes, TOPS-20 also has a buffer file called COMPUTE.STATISTICS. Approximately every 20 seconds, any available runtime information is written into this buffer file.
Then every hour the information in this buffer file is dumped into the system event file as a special entry called LOGGER ENTRY (octal code 500). Also, during a system reload, the last entry in COMPUTE.STATISTICS is written into the system event file. When you run COMPUTE on TOPS-20, it looks for these LOGGER entries to compile its report.

Although TOPS-20 does not have separate weekly files, COMPUTE can break down the system event file into calendar weeks, each running from Sunday at 00:00:01 to Saturday at 23:59:59 (approximately), to come up with single weekly reports. COMPUTE uses hourly dumps from COMPUTE.STATISTICS on TOPS-20 to approximate the calendar week. In this way, you can specify date and time limits when running COMPUTE.

4.8.1 COMPUTE Reports

With COMPUTE, you can output your report in one of three ways:

     1.  A single report containing statistics from a single week.

     2.  A single report containing statistics from several weeks, merged into one report.

     3.  Several reports containing statistics from individual weeks.

In addition to the COMPUTE report, you also receive a report containing information concerning reloads. This report is called RELOAD.RPT. You will receive the same number of reload reports as you do COMPUTE reports.

If you decide you want individual weekly reports, COMPUTE prompts you for the beginning and ending dates of the weeks of interest. The default is the first week's file, the oldest file, to the file containing the last full week. If you use the default, you will receive one report for every week from the last monitor load to the last reload of the COMPUTE file in your selected directory.
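
The calendar-week boundaries that Section 4.8 describes for TOPS-20 (Sunday at 00:00:01 through Saturday at 23:59:59) can be sketched as follows. This is a minimal illustration in Python, not SPEAR's own code. The function name calendar_week is hypothetical, and the sketch ignores the hourly granularity of the LOGGER entries, which is why the manual calls the boundaries approximate.

```python
from datetime import datetime, timedelta

def calendar_week(moment):
    """Return (start, end) of the Sunday-to-Saturday week containing moment.

    Hypothetical helper for illustration only.
    """
    # datetime.weekday() numbers Monday as 0 and Sunday as 6,
    # so (weekday + 1) % 7 is the number of days since Sunday.
    days_since_sunday = (moment.weekday() + 1) % 7
    sunday_midnight = (moment - timedelta(days=days_since_sunday)).replace(
        hour=0, minute=0, second=0, microsecond=0)
    # The manual's week starts one second after midnight on Sunday ...
    start = sunday_midnight + timedelta(seconds=1)
    # ... and ends at 23:59:59 on the following Saturday.
    end = sunday_midnight + timedelta(days=6, hours=23, minutes=59, seconds=59)
    return start, end

# The end time of the sample reports in Sections 4.8.4 and 4.8.5.
start, end = calendar_week(datetime(1981, 10, 1, 14, 5))
print(start, end)
```

A MULTIPLE-REPORTS run could, under the same assumptions, step through a time range one such week at a time.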
4.8.2 COMPUTE Formulas

The following formulas are used by COMPUTE to derive the values reported in the full report:

FORMULA 1     System Availability (SA)

     SA = (1.0) - Chargeable Downtime/Usage Cycle

where

Chargeable Downtime is any nonscheduled period of time that the system is not running, as determined by the answer the operator gives to the WHY RELOAD? question. The answers that constitute a charge to downtime are:

     1.  STOPCD or BUGHALT
     2.  Halt
     3.  Parity
     4.  Hardware
     5.  NXM (nonexistent memory)
     6.  Hung
     7.  Loop
     8.  CM (corrective maintenance)

Time is not charged when the answer to WHY RELOAD? is:

     1.  Power
     2.  Static
     3.  OPR (operator)
     4.  PM (preventive maintenance)
     5.  New
     6.  Sched (scheduled)
     7.  SA (standalone)
     8.  Other

Total Downtime is the sum of Chargeable Downtime and Nonchargeable Downtime.

Usage Cycle is Total Downtime plus Total Run Time.

Total Run Time is the sum of all monitor Run Times within the period you specify for the report.

FORMULA 2     User Availability (UA)

     UA = (1.0) - Chargeable Downtime/(Chargeable Downtime + Total Run Time)

FORMULA 3     System Effectiveness (SE)

     SE = System Availability * (e**(-t/MTBF))

where

e is the base of natural logarithms (2.71828+), also known as Napierian logarithms.

** represents the words "raised to the power of".

t can be one of four different values: 0.1 hrs., 0.5 hrs., 1.0 hrs., or 4.0 hrs.

MTBF is the abbreviation for Mean Time Between Failures. This is the usage cycle divided by the number of crashes. Usage cycle is Total Run Time plus Total Downtime.

System Effectiveness considers both the probability of the system being up at time zero (System Availability) and the probability of the system staying up (System Reliability) for some time period "t".

You should be aware of the following facts about the COMPUTE function:

     1.  The accuracy of this function depends heavily on correct operator response to the WHY RELOAD question and accurate insertion of the time of day. If "Other" is selected as the reason for reloading, the preceding Downtime is not counted against availability.

     2.  An incorrect reload time should be corrected by the operator before another reload occurs to avoid negative Downtimes or Runtimes. Because date/time changes are logged in the COMPUTE files, COMPUTE can adjust times as necessary.

     3.  Total Runtime and Downtime figures are not precise. On TOPS-10, the monitor keeps track of time by updating the availability file every six minutes. On TOPS-20, the buffer file COMPUTE.STATISTICS is updated every 20 seconds, and the system event file is updated every hour. If one crash/reload sequence is immediately followed by another, these times may not be correctly updated. COMPUTE compensates for this by assuming the system did not resume service after the previous reload.

4.8.3 COMPUTE Procedures

If you want to take all the defaults, type C/G to the SPEAR> prompt; otherwise, read the following procedures:

COMPUTE uses the following guideword prompts:

     COMPUTE Mode
       Event file (SERR:ERROR.SYS):
       Report period (LAST-WEEK):
       Time from (EARLIEST):
       Time to (LATEST):
       Report type (SINGLE-REPORT):
       Availability report to (DSK:COMPUT.RPT):
       Reload report to (DSK:RELOAD.RPT):

STEP 1

COMPUTE begins by asking for the file containing the records you want to use in the COMPUTE calculations:

     Event file (SERR:ERROR.SYS):          TOPS-20

or

     Event file (SYS:AVAIL.LWK):           TOPS-10

Type one of the following:

     1.  The RETURN key - to take the default COMPUTE file for your system; SERR:ERROR.SYS on TOPS-20, SYS:AVAIL.LWK on TOPS-10.

     2.  If you are on TOPS-20, and you know of another file containing COMPUTE statistics, specify that file name here. If you are on TOPS-10, and you know of a specific AVAIL file (for example, AVAIL.A14), specify the file name here.

STEP 2

The next prompt asks for the period of time for which you want system performance calculated:

     Report period (LAST-WEEK):

Type one of the following:

     1.  The RETURN key or L[AST-WEEK] - to take the default. This report covers the last 7 days (168 hours) prior to last Sunday at 00:00:01.

     2.  T[HIS-WEEK] - if you want the report to cover the current week. This report will begin with last Sunday at 00:00:01 and continue through the present. This will be an incomplete week.

     3.  O[THER] - if you want the report to cover a period of time other than last week or this week. If you choose OTHER, you will be prompted for the date and time parameters.

If you specify OTHER, continue with STEP 3. If you specified THIS-WEEK or LAST-WEEK, skip to STEP 6.

STEP 3

After you type OTHER, COMPUTE prompts you for the beginning date of the time period in which you are interested:

     Time from (EARLIEST):

Type one of the following:

     1.  The RETURN key or E[ARLIEST] - to take the default. This is the first entry in the file.

     2.  The date and time (real time) in the form dd-mmm-yy hh:mm:ss where dd is the numerical day, mmm is the first three letters of the month, yy is the year, hh is the hour, mm is the minute, and ss is the second. If you specify only the date, the default time is one second after midnight.

     3.  The date (relative time) in the form -nn where -nn indicates a date prior to the current date. For example, -6 causes COMPUTE to begin processing from 6 days prior to the current day.

STEP 4

COMPUTE prompts next for the time at which you want to end the calculations:

     Time to (LATEST):

Type one of the following:

     1.  The RETURN key or L[ATEST] - to take the default. This is the last entry in the file.

     2.  The date and time (real time) in the form dd-mmm-yy hh:mm:ss where dd is the numerical day, mmm is the first three letters of the month, yy is the year, hh is the hour, mm is the minute, and ss is the second. If you do not specify the time, the default time is one second after midnight.

     3.  The date (relative time) in the form -nn where -nn indicates a date prior to the current day.
         For example, -2 causes COMPUTE to end the calculations 2 days prior to the current day.

STEP 5

COMPUTE asks for the type of report you want:

     Report type (SINGLE-REPORT):

Type one of the following:

     1.  The RETURN key or S[INGLE-REPORT] - to take the default. This choice will give you one report containing the information for as many weeks as you specified.

     2.  M[ULTIPLE-REPORTS] - to receive a report for each week within the timeframe you specified. Each report will reflect system performance for a 7-day period beginning on Sunday at 00:00:01 and ending on the following Sunday at 00:00:00.

STEP 6

COMPUTE prompts for the destination of the availability report:

     Availability report to (DSK:COMPUT.RPT):

Type one of the following:

     1.  The RETURN key - to take the default file specification DSK:COMPUT.RPT.

     2.  A file specification in the proper format for your system.

STEP 7

The last thing COMPUTE asks for is the destination of the reload report:

     Reload report to (DSK:RELOAD.RPT):

Type one of the following:

     1.  The RETURN key - to take the default file specification DSK:RELOAD.RPT.

     2.  A file specification in the proper format for your system.

After you select the output destination and press RETURN, SPEAR asks you to confirm your decision:

     Type <cr> to confirm (/GO):

At this point, you can:

     1.  Press RETURN or type /GO to execute the COMPUTE process.

     2.  Type /SHOW to list the parameters you have chosen.

     3.  Type /REVERSE to return to the previous prompt.

     4.  Type /BREAK to return to SPEAR> level.

     5.  Type question mark (?), HELP, the question mark switch (/?), or /HELP to find out what your options are.

After you execute COMPUTE, if you specified MULTIPLE-REPORTS, you will receive several individual reports with the file names Cmmdd.RPT and RLmmdd.RPT, where mm is the month of the start of the usage cycle and dd is the day of the week of the usage cycle. You will also receive a COMPUT.RPT and a RELOAD.RPT, combining all the information in the individual reports.
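
As a check on the formulas in Section 4.8.2, the following Python sketch evaluates them against the figures printed in the sample summary report in Section 4.8.4. It is an illustration only, not SPEAR's code, and the function names are hypothetical. One assumption deserves emphasis: the sample shows zero crashes, so the MTBF of Formula 3 cannot be computed as defined (usage cycle divided by the number of crashes). The sketch instead uses the mean run time (27.571 hours over 4 reloads), a value that happens to reproduce the printed effectiveness factors to about three decimal places. The manual does not state that COMPUTE does this.

```python
import math

def system_availability(chargeable_down, usage_cycle):
    # FORMULA 1: SA = 1.0 - Chargeable Downtime / Usage Cycle
    return 1.0 - chargeable_down / usage_cycle

def user_availability(chargeable_down, total_run):
    # FORMULA 2: UA = 1.0 - Chargeable Downtime / (Chargeable Downtime + Total Run Time)
    return 1.0 - chargeable_down / (chargeable_down + total_run)

def system_effectiveness(sa, t, mtbf):
    # FORMULA 3: SE = SA * e**(-t / MTBF)
    return sa * math.exp(-t / mtbf)

# Figures from the sample summary report (all times in hours).
total_run = 27.571
total_down = 54.779
chargeable_down = 0.0                  # "Crash times" total in the sample
usage_cycle = total_run + total_down   # Usage Cycle = Total Downtime + Total Run Time

sa = system_availability(chargeable_down, usage_cycle)
ua = user_availability(chargeable_down, total_run)

# ASSUMPTION (not from the manual): mean run time stands in for MTBF,
# because the sample has no crashes to divide by.
mtbf = total_run / 4                   # 4 reloads in the sample

print(f"SA = {100 * sa:.3f}%  UA = {100 * ua:.3f}%")
for t in (0.1, 0.5, 1.0, 4.0):
    se = system_effectiveness(sa, t, mtbf)
    print(f"t = {t} hrs: SE = {100 * se:.3f}%")
```

With a nonzero crash count, the MTBF line would follow the manual's definition instead: the usage cycle divided by the number of crashes.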

When COMPUTE has finished its calculations, it prints a summary report on your terminal and outputs the full report(s) to your disk area (or wherever you specify). The COMPUTE Summary report is a condensed version of the information you will find in the full report.

4.8.4 COMPUTE Summary Report

The following is a sample COMPUTE Summary report:

     COMPUTE Summary Report
     From: 28-Sep-81 03:44   To: 1-Oct-81 14:05
     period length (HRS): 82.351, usage cycle = 82.350

     SYSTEM Availability %     100.000
     USER Availability %       100.000
     Total Reloads               4.          MTB Reloads     20.587
     Total Crashes               0.          MTB Crashes     82.350

     Effectiveness factor
       Six minutes        98.559
       Thirty minutes     93.002
       One Hour           86.495
       Four Hours         55.972

                      Totals     Means    Maxima    Minima   Std. Dev.
     Run times        27.571     6.892    12.270     1.401       4.234
     Down times       54.779    18.259    52.153     0.375      23.978
     Crash times       0.000     0.000     0.000     0.000       0.000

     Bug/Stopcode count
       DIRPG1      7.
       DN20ST      7.
       DTEIPR     17.
       DX2HLT      1.
       ITRLGO      5.
       NSPLAT      5.
       OVRDTA      1.

     Report file name: DSK:COMPUT.RPT

4.8.5 COMPUTE Full Report

The following is a sample of a COMPUTE Full report. This type of report is saved on your disk area for printing on a line printer. You can print it on your terminal, but it will be unreadable because of the 132-column width of the report.

     SYSTEM AVAILABILITY REPORT FOR THE PERIOD: 28-Sep-81 03:44 TO 1-Oct-81 14:05

     CUSTOMER SATISFIED(Y OR N)? ____     CUSTOMER SIGNATURE ------------------------

     ***** SYSTEM STATISTICS *****(ALL TIMES IN HOURS)

     AVAILABILITY                           SYSTEM EFFECTIVENESS
     OPERATIONAL CYCLE          82.351      T= 0.1HRS:     98.559
     SYSTEM AVAILABILITY:      100.000      T= 0.5HRS:     93.002
     USER AVAILABILITY         100.000      T= 1.0HRS:     86.495
     NUMBER OF RELOADS           4.         T= 4.0HRS:     55.972

     RUNTIME                                DOWNTIME
     TOTAL RUN TIME             27.571      SYSTEM NOT RUNNING     54.779
     MAXIMUM RUN TIME           12.270      MAXIMUM DOWNTIME       52.153
     MINIMUM RUN TIME            1.401      MINIMUM DOWNTIME        0.375
     MEAN RUN TIME               6.892      MEAN DOWN TIME         18.259

     ***** RELOADS NOT AFFECTING MEASURED AVAILABILITY *****

     MONITOR NAME & VERSION   POWER FAIL   STATIC   OPERATOR   PM   NEW   SCHEDULED   STANDALONE   OTHER
     JUNK 1. 52.153 0. 0.000 2. 0.375
     TOTALS
     SYSTEM 2116 THE BIG ORANGE, TOPS-20 MONITOR 4(3556) 500,,4363 0. 0. 0.000
     Bug/Stopcode DN20ST DTEIPR DX2HLT ITRLGO NSPLAT OVRDTA OVRDTA 0.000
     Count 7. 7. 17. 1. 5. 5. 1. 1. 2.251 0. 0.000 0. 0.000 4. 54.779
     Count Time (HRS)

CHAPTER 5

ENTRY DESCRIPTIONS

5.1 INTRODUCTION

This chapter provides a sample of most of the events that can be recorded in the system event file. These samples appear just as you see them when you use RETRIEVE to translate entries from binary to ASCII.

Although the entries may differ in format, they each have sections in common, some more than others depending on the operating system involved. Each entry may contain from one to six sections of information:

     Section 1     Entry Description
     Section 2     Unit Identification
     Section 3     Software Status
     Section 4     Controller Status
     Section 5     Device or Unit Status
     Section 6     Statistical Information

Every entry has at least a Section 1, Entry Description. This section contains:

     1.  Type of entry and/or type of error
     2.  Error-entry date and time that it was logged
     3.  Monitor uptime
     4.  System serial number

Entries may contain Sections 2 through 6. Section 2 contains the following information:

     1.  Unit logical name
     2.  Unit physical name
     3.  Unit type
     4.  Media identification

Section 3 contains the following:

     1.  Highest process requesting service (user)
     2.  Lowest process requesting service (author)
     3.  User/process identification (user identification, program name, file name, program location in memory, and so forth)
     4.  Pertinent system registers (processor flags, program counter, and so forth) before and/or after error as applicable
     5.  Disposition of event (retry count, recovered or not, the point in the retry algorithm where recovery was effected, and so forth)
     6.  Other I/O activity at error time

Section 4 contains the following:

     1.  Controller name and/or address
     2.  Controller type
     3.  Name and value of all information available from the controller

Section 5 contains the following:

     1.  Name and value of all status information available from the unit
     2.  Function that was active at error time
     3.  Logical and physical address of the unit before error
     4.  Logical and physical address of the unit at error
     5.  Transfer size and starting memory location if applicable

Section 6 contains I/O unit activity since start-up.

The default radix in these entries is decimal; however, some entries may have numbers displayed in octal or binary.

5.2 TOPS-10 ENTRIES

The following sections list both the FULL and SHORT versions of the entries that TOPS-10 can record in its system event file.

5.2.1 System Reload

The monitor generates a System Reload entry into the system event file whenever it is loaded. Note that HALT, STOP, and CPU stopcode information is also recorded in this entry, if applicable.

FULL

     ***********************************************
     SYSTEM RELOAD
     LOGGED ON 5-Aug-80 AT 0:16:39     MONITOR UPTIME WAS 0:00:38
     DETECTED ON SYSTEM # 1026.
     RECORD SEQUENCE NUMBER: 190.
     ***********************************************
     CONFIGURATION INFORMATION
          SYSTEM NAME:          RZ064A KL #1026/1042
          MONITOR BUILT ON:     07-23-80
          CPU SERIAL #:         1026.
          STATES WORD:          771165,0
          MONITOR VERSION       %701(0)
     RELOAD BREAKDOWN
          CAUSE:                SCHED
          COMMENTS              iPUT 1
          MEMORY ON-LINE AT RELOAD:
               FROM: 0 P     TO: 2048 P

SHORT

     SEQ     TIME
             5-Aug-80
     190.    0:16:39     RELOAD OF RZ064A KL #1026/1042 VERSION (70100) BUILT ON 07-23-80 REASON SCHED

5.2.2 Non-Reload Monitor Error

Each time a JOB or DEBUG stopcode occurs, the monitor records the information as a Non-Reload Monitor Error in the system event file.
The JOB stopcode endangers the integrity of the job currently running; therefore, the monitor aborts the current job, then continues. A DEBUG stopcode is not immediately harmful to any job or the system; therefore, the monitor prints the stopcode message on the operator's terminal (CTY) and then continues processing.

FULL

     ***********************************************
     NON-RELOAD MONITOR ERROR
     LOGGED ON 5-Aug-80 AT 10:51:49     MONITOR UPTIME WAS 2:26:26
     DETECTED ON SYSTEM # 1042.
     RECORD SEQUENCE NUMBER: 863.
     ***********************************************
     SYSTEM NAME:          RZ64C KL #1026/1042
     SYSTEM SERIAL #:      1026.
     MONITOR DATE:         07-23-80
     MONITOR VERSION       %701(0)
     STOPCD NAME:          BAZ
     RESULT:
     JOB #:                6.
     USER'S ID:            [1,2]
     TTY NAME:             470
     PROGRAM NAME:         ACTDAE
     CONTENTS OF AC'S AT STOPCD:
          0: 1: 2: 3: 4: 5: 6: 7: 10: 11: 12: 13: 14: 15: 16: 17:
          20,0  777642,377507  0,100  5777,371000  526200,340000  664145,663167  440004,0  0,50
          0,0  0,505273  0,250255  47040,1  0,1  0,1  0,4  0,146
     PI STATUS:            440004,0

SHORT

     SEQ     TIME
             5-Aug-80
     863.    10:51:49     STOPCD BAZ ON CPU SERIAL # 1026 FOR JOB # 6 ON 470 USER WAS [1,2] RUNNING ACTDAE

5.2.3 Crash Extract

A Crash Extract becomes a part of the system event file whenever the program DAEMON starts. When DAEMON starts, it checks the system search list for a CRASH.EXE file. If it finds one, it extracts the information and appends it to the system event file.

NOTE

It is strongly recommended that, each time the monitor is started, you save a dump as a CRASH.EXE file so that DAEMON/SPEAR can provide a complete picture of system activity. You can do this by saving each monitor core image (dumping the crash) after each run; that is, before PM or CM periods, before scheduled reloads, after stand-alone periods, and so forth. To save the core image, use the /D command to MONBTS.

Because DAEMON extracted the information from a saved crash, the date and time and the monitor uptime in the header are the last values recorded by the monitor before the crash.

FULL

     ***********************************************
     ** THIS ENTRY COPIED FROM A SAVED CRASH **
     CRASH EXTRACT
     LOGGED ON 5-Aug-80 AT 0:11:25     MONITOR UPTIME WAS 11:50:09
     DETECTED ON SYSTEM # 1026.
     RECORD SEQUENCE NUMBER: 187.
     ***********************************************
     CRASH.EXE READ FROM: DSKB
     SYSTEM WIDE ERROR COUNT: 162.

     CONTENTS OF GETTAB'D ITEMS:
          TIME OF DAY: 0:11:24
          SYSTEM MEMORY SIZE: 336000
          # JOBS LOGGED IN: 26.
          DEBUG STATUS WORD: 0,0
          START OF MONITOR HIGH SEG: 2501000
          UPTIME IN TICKS: 2556574.
          # UNREC EXEC PDL OV: 0.
          SWAP ERROR COUNT: 0.
          DISABLED HARDWARE ERROR COUNT: 20.
          # DEBUG STOPCDS: 0.
          # JOB STOPCDS: 0.
          LAST STOPCD-PROGRAM NAME:
          LAST STOPCD-UUO: 0,0
          LAST ADDR POKED: 13415
          # OF WORDS OF CORE: 4000000
          # RECOVERED EXEC PDL OV: 0.
          LAST STOPCD:
          LAST STOPCD-JOB NUMBER: 0
          LAST STOPCD-P.PN: [0,0]

     PARITY ERROR INFORMATION:
          TOTAL MEM PAR ERRORS: 0.
          LAST PARITY ADDR: 0
          HIGHEST ADDR OF PARITY ERROR: 0
          # SWEEPS: 0.
          LOGICAL AND OF DATA: 0,0
          COUNT OF SPUR CHANNEL ERRORS: 0.
          TOTAL SPURIOUS PARITY ERRORS: 0.
          LAST PARITY WORD: 0,0
          ADDRESS IN SEGMENT OF PAR ERR: 0
          USER ENABLED ERRORS: 0.
          LOGICAL OR OF ADDR: 0.0
          MULTIPLE PARITY ERRORS: 0.
          LAST PARITY PC: 0
          # PAR ERRORS THIS SWEEP: 0.
          LOGICAL AND OF ADDR: 0,0
          LOGICAL OR OF DATA: 0,0

     SYSTEM RESPONSE INFORMATION:
                                MEAN//ST.DEV.     RESP/MIN     # of RESP
          'til TTY output:         2.1//0.3          15.7        11135.
          'til TTY input:          2.0//0.4          15.4        10944.
          'til requeued:          11.3//1.2           4.0         2853.
          'til 1st of above:       1.0//0.3          17.2        12210.
          'TIL JOB STARTED:        0.6//0.3          18.1        12817.

          TOTAL UUO COUNT: 5474654.                     AVG. = 7710.8 PER MIN.
          TOTAL JOB CONTEXT SWITCH COUNT: 1736842.      AVG. = 2446.3 PER MIN.

          SUM TTY OUT UUO RES: 1400074.
          NUM TTY OUT UUO: 11135.
          LO-SUM SQ TTY OUT UUO: 10710992548.
          HI-SUM SQ TTY OUT UUO: 0.
          SUM TTY INP UUO: 1283336.
          NUMBER TTY INP UUO: 10944.
          LO-SUM SQ TTY INP UUO: 22975521908.
          HI-SUM SQ TTY INP UUO: 9.
          SUM QUANTUM REQ RES: 1936068.
          NUMBER QUANTUM REQ RES: 2853.
          LO-SUM SQ QUANTUM REQ RES: 22256557290.
          HI-SUM SQ QUANTUM RES: 6.
          SUM ONE OF ABOVE: 696571.
          NUMBER ONE OF ABOVE: 12210.
          LO-SUM SQ ONE OF ABOVE: 3363170363.
          HI-SUM SQ ONE OF ABOVE: 0.
          SUM CPU RES: 473960.
          NUMBER CPU RES: 12817.
          LO-SUM CPU RES: 3255471104.
          HI-SUM CPU RES: 0.

          UPTIME: 11:50:09
          OVERHEAD TIME: 2:12:05
          LOST TIME: 0:11:48
          NULL TIME: 5:12:47
          TOTAL UUO COUNT: 5474654.
          TOTAL JOB CONTEXT SWITCH COUNT: 1736842.
          TOTAL SPUR NXM: 0.
          # JOBS AFFECTED LAST NXM: 0.
          FIRST ADDR LAST NXM: 0
          TOTAL NXM: 0

SHORT

     SEQ     TIME
             5-Aug-80
     187.    0:11:25     CRASH EXTRACT-STOPCD WAS FOR JOB 0 UUO WAS 0,0 SYSTEM WIDE ERROR COUNT WAS 162

5.2.4 Data Channel Error

When a channel detects an error or a device connected to a channel detects an error during a data transfer, the monitor logs a Data Channel Error into the system event file. The entry is made at the time of first error; thus, the entry can be a soft or a hard error. Because the monitor programs the channel to stop when it encounters an error (except on the last retry), this entry gives valuable information about the word in error and its address, whether or not the error was detected by the channel.

The Data Channel Error is generated only for DF10 data channels and is not generated for devices using the KL10 internal channels (RH20).

FULL

     ***********************************************
     DATA CHANNEL ERROR
     LOGGED ON 1-Oct-80 AT 9:03:12     MONITOR UPTIME WAS 1:02:10
     DETECTED ON SYSTEM # 1026.
     RECORD SEQUENCE NUMBER: 3122.
     ***********************************************
     DATA CHANNEL ERROR TOTALS
          NXM'S AND OVERRUNS: 1.
          MEM PE SEEN BY CHANNEL: 0.
          CONTROLLER DATA PE OR CCW TERM CHK FAILS: 0.
     CHANNEL COMMAND LIST BREAKDOWN
          DEVICE USING CHANNEL:       RPA5
          INITIAL CONTROL WORD:       0,454
          TERMINATION WD WRITTEN:     11323,313216
          EXPECTED TERM. WORD:        11323,313413
          CHANNEL COMMAND LIST:       0,454
                                      774003,313213
                                      0,0
          3RD FROM LAST DATA WORD:    0,0
          2ND FROM LAST DATA WORD:    0,0
          LAST DATA WORD XFERRED:     0,0

SHORT

     SEQ      TIME
              1-Oct-80
     3122.    9:03:12     RPA5 CHANNEL ERROR COUNTS: NXM/MPE/DPE 1/0/0
                          WRITTEN TERM WD = 11323,313216 EXPECTED TERM WD = 11323,313413

5.2.5 DAEMON Started

The monitor logs this entry into the system event file each time DAEMON is started, either after a system reload or a restart of DAEMON. If DAEMON is modified at the site, the customer version number should be edited to track the modifications.

FULL

     ***********************************************
     DAEMON STARTED
     LOGGED ON 5-Aug-80 AT 0:16:30     MONITOR UPTIME WAS 0:00:28
     DETECTED ON SYSTEM # 1026.
     RECORD SEQUENCE NUMBER: 184.
     ***********************************************
     DAEMON VERSION 20(757)

SHORT

     SEQ     TIME
             5-Aug-80
     184.    0:16:30     DAEMON STARTED--VERSION 20(757)

5.2.6 MASSBUS Disk Error

Any time the monitor detects an error in any portion of the MASSBUS system (either hardware or software), DAEMON is called to collect and record all pertinent hardware and software information in the error file.

In this entry, the MEDIA ID is the value given to the disk when structured with ONCE or TWICE. The STR ID is the logical name of the media such as DSKB0. Both are recorded in the HOME block.

The LBN (logical block number) is the location of the first block in the transfer. If LBN n, n+1, n+2, and n+3 were transferred, it is possible that LBN n, n+1, and n+2 are all right, but LBN n+3 is bad. This value is broken into either the cylinder #, surface #, and sector # (for disks) or the track # and sector # (for RS04s) to determine the physical location of the failure.

The OPERATION AT ERROR is the text translation of the last command issued to the device before the error was detected (presumably the command that caused the error).
The text translation should match the translation of the bits in DATAI RHCR AT ERROR for the RH10 and DATAI PTCR AT ERROR for an RH20. If the information does not match, look for an error in the control bus.

NOTE

Because of dual-port capabilities for disk drives, the physical device number can change according to the port assignment. For example, on dual-ported drives, one drive may be RPA3 on PORT A and RPC3 on PORT B.

MASSBUS devices store and make available significant amounts of device-dependent information. The contents of all registers are listed in the entry both at error time and after the last retry, along with the difference between the two values. Text translations are always from the AT ERROR value with the exception of the OFFSET Register; offsets are not normally used.

Note that software errors are checked only after the hardware has completed the transfer without a detected error.

FULL

     ***********************************************
     MASSBUS DISK ERROR
     LOGGED ON 4-Aug-80 AT 13:36:27     MONITOR UPTIME WAS 1:15:13
     DETECTED ON SYSTEM # 1026.
     RECORD SEQUENCE NUMBER: 2.
     ***********************************************
     UNIT ID: RPB5
     UNIT TYPE: RP06
     UNIT SERIAL #: 0058.
     MEDIA ID:
     STR ID:
     USER'S ID: [1,2]
     USER'S PGM: PULSAR
     USER'S FILE:
     LBN AT START OF XFER: 1.
          CYL: 0.  SURF: 0.  SECT: 1.
     OPERATION AT ERROR: DEV.AVAIL., GO + READ DATA(70)
     ERROR: RECOVERABLE DRIVE EXCEPTION, IN CONTROLLER CONI
            DCK, IN DEVICE ERROR REGISTER
     REMAINING ENTRIES IN UNIT'S BAT BLOCK: UNKNOWN
     RETRY COUNT: 16.

     CONTROLLER INFORMATION:
          CONTROLLER: RH20 #540
          CONI AT ERROR: 0,202415 = DRIVE EXCEPTION,
          CONI AT END: 0,2415 = NO ERROR BITS DETECTED
          CHN STATUS AT ERROR: 500000,0 = NOT SBUS ERR,
          CHN STATUS AT END: 400000,0 = NO ERROR BITS DETECTED
          DATAI PTCR AT ERROR: 732605,177771
          DATAI PTCR AT END: 732605,177771
          DATAI PBAR AT ERROR: 723617,605735
          DATAI PBAR AT END: 723617,605735

     DEVICE REGISTER INFORMATION:
                     AT ERROR     AT END       DIFF.     TEXT
          CR(00):        4070       4070           0     DEV.AVAIL., READ DATA(70)
          SR(01):       51700      11700       40000     ERR,MOL,PGM,DPR,DRY,VV,
          ER(02):      100000          0      100000     DCK,
          MR(03):         400        400           0     ZERO DET,
          AS(04):           0          0           0
          DA(05):           2          2           0     D. TRK = 0, D.SECT. 2
          DT(06):       24022      24022           0
          LA(07):         240        240           0
          SN(10):         130        130           0
          OF(11):      116000     100000       16000     AT END:SIGN CHANGE, OFFSET
          DC(12):           0          0           0     0.
          CC(13):           0          0           0     0.
          E2(14):           0          0           0     NO ERROR BITS DETECTED
          E3(15):           0          0           0     NO ERROR BITS DETECTED
          EP(16):           0          0           0
          PL(17):           0     177771      177771     NONE

SHORT

     SEQ     TIME
             4-Aug-80
     2.      13:36:27     RPB5 RP06 SERIAL # 0058. CONI RH = 0,202415 CHNSTS1 = 500000,0
                          SR = 51700 ER = 100000 CYL/SURF/SEC= 0./0./1. RETRIES: 16

5.2.7 DX20 Device Error

The monitor records a DX20 Device Error in the system event file when it detects an error in any portion of the MASSBUS system connected to the DX20 channel interface.

In this entry, the MASSBUS REGISTER INFORMATION contains the nonzero contents of all registers both at error time and after the last retry. Also, the SB (sense bytes) describe the device type and status of the device (in octal) attached to the DX20.

FULL

     ***********************************************
     DX20 ERROR
     LOGGED ON 8-Sep-80 AT 22:41:10     MONITOR UPTIME WAS 3:23:01
     DETECTED ON SYSTEM # 1026.
     RECORD SEQUENCE NUMBER: 1471.
     ***********************************************
     UNIT NAME: RNBO
     UNIT TYPE: RP20
     VOLUME ID: SCROO
     LOCATION: LBN = 463454.
     OPERATION AT ERROR: GO+ READ DATA(70)
     USER'S P,PN: [10,664]
     USER'S PGM: FILCHK
     USER'S FILE:
     RETRIES PERFORMED: 1.
ERROR: RECOVERABLE DRIVE EXCEPTION,CHN ERROR, IN CONTROLLER CONI MPER, IN DEVICE ERROR REGISTER CONTROLLER INFORMATION: CONTROLLER: RH20 t 554 DX20 *:0 DX20 U-CODE VERSION: 0(4) CONI AT ERROR: 540100,222615 = DRIVE EXCEPTION,CHN ERROR, CONI AT END: 540100,222615 = DRIVE EXCEPTION,CHN ERROR, DATAl PTCR AT ERROR: 732600,171771 DATAl PTCR AT END: 732600,171771 DATAl PBAR AT ERROR: 723617,777417 DATAl PBAR AT END: 723617,777417 CHANNEL INFORMATION: CHAN STATUS WD 0: 200000,464 CW1: 414721,475143 CW2: 420000,721000 CHN STATUS WD 1: 540100,466 = NOT SBUS ERR,NOT WC = O,LONG WC ERR, CHN STATUS WD 2: 414720,721143 MASSBUS REGISTER INFORMATION: AT ERROR TEXT AT END DIFF. CR 00: 70 70 o READ DATA(70) SR 01: 170000 170000 o ATA,ERR,LINK PRESENT,MP RUN, ER 02: 10600 o 10600 ERROR CLASS = 1,SUBCLASS = 1 iMPER, = UNUSUAL STATUS FROM INITIAL SELECTION SEQUENCE MR 03: 4 4 o MICRO P START, AS 04: 1 1 o HR 05: 16005 16005 o HEAD*: 28. RECORD*:5. DT 06: 10061 10061 o ESSI20: 1 1 STATUS INDEX FOR ESRO&l=l o DEV STATUS: NO ERROR BITS DETECTED ASYN21: 0 o CTRL: 0 DRIVE: 0 0 DEV STATUS: NO ERROR BITS DETECTED . FA 22 o 0 ARGUMENT:O FLAGS: NO ERROR BITS DETECTED 0 DN 23 30 CTRL: 1 DRIVE: 10 30 o CL 24 1151 o 1151 CYL: 617. HR 25 16005 16005 o HEAD*: 28. RECORD#:5. ESR026 100151 100151 o ESR127 56123 56123 o DIAG30: 161231 DIAG31: 133025 161231 133025 o o RP20 SENSE BYTES LISTED IN HEXIDECIMAL DATA CHK, BYTE 00: 08 NO ERROR BITS DETECTED BYTE 01: 00 BYTE 02: 40 CORRECTABLE, RESTART COMMAND BYTE 03: 06 BYTE 04: 80 PHYSICAL DRIVE ID BYTE 05: 69 BYTE 06: 5C LOGICAL CYL. ADDR. = 617. LOGICAL HEAD = 28. BYTE 07: 53 = FORMAT 5 , MESSAGE 3 DATA FIELD CORRECTABLE DATA AREA CYL OF LAST SEEK ADDRESS: 617. SURF. OF LAST SEEK ADDRESS: 28. RECORD i IN ERROR: 4. SECTOR i IN ERROR: 20. i OF BYTES XFERRED: 576. BYTES ERROR DISPLACEMENT: 553. BYTES ERROR PATTERN: 100000 SHORT SEQ TIME 8-Sep-80 1471. 
             22:41:10     RNBO SCROO: RP20 CONI=540100,222615 CHNSTS1=540100,466 SR=0,170000 ER=0,10600
                          SENSE BYTE 7: 53 LBN: 463454. RETRIES: 1

5.2.8 Software Event

This entry is logged into the system event file when a user with special privileges, for example the system operator, issues one of the following monitor calls: POKE, RTTRP, SNOOP, or TRPSET. These monitor calls have the following effect:

     1.  POKE changes the value of a word in monitor core.

     2.  RTTRP connects a device to or releases it from the realtime interrupt facility.

     3.  SNOOP allows privileged programs to insert breakpoints in the monitor that trap to a user program. The user program must be locked in core when the trap occurs. This feature is used for fault insertion, performance analysis, and trace functions.

     4.  TRPSET prevents jobs other than the calling job from running. You can use this call to guarantee fast response to realtime interrupts.

For more information on monitor calls, refer to the TOPS-10 Monitor Calls Manual.

FULL

     ***********************************************
     SOFTWARE EVENT
     LOGGED ON 14-Jul-80 AT 8:56:45     MONITOR UPTIME WAS 0:42:42
     DETECTED ON SYSTEM # 1026.
     RECORD SEQUENCE NUMBER: 1.
     ***********************************************
     EVENT TYPE: POKE
     JOB #: 46.
     USER PPN: [10,5324]
     LOCATION OF USER: NODE:26 LINE:154 TTY154
     PROGRAM: SPICE
     STORED DATA VALUES: 0,34030

SHORT

     SEQ     TIME
             14-Jul-80
     1.      8:56:45     SOFTWARE EVENT TYPE: POKE BY JOB 46 USER WAS [10,5324] RUNNING SPICE
                         AT NODE: 26 LINE: 154 TTY154

5.2.9 Configuration Status Change

The monitor records a Configuration Status Change whenever the system operator marks disk units and sections of core memory on-line or off-line. The system operator uses either the CONFIG program or the SET command to change the system configuration. These tools are useful because they can prevent further errors to users until a unit can be repaired, or they can be used to split and later join dual CPU systems.
For more information on the CONFIG program, refer to the file CONFIG.DOC.

With the SET command, the system operator can also give a 2-character reason for the change in configuration. Any two characters can be used, but the following codes are suggested:

PM   preventive maintenance
CM   corrective maintenance
DN   unit is down
OT   other

CAUTION

When the system operator adds memory to the system, the monitor checks to verify the availability of the specified addresses. Mistakes are reported at the operator's terminal (CTY), but the error logging system treats these as valid NXMs and generates the appropriate NXM reports. You can identify an NXM report of this type because no physical memory is placed off-line and the user's directory is [1,2].

FULL
***********************************************
CONFIGURATION STATUS CHANGE
LOGGED ON 4-Aug-80 AT 14:06:05
DETECTED ON SYSTEM # 1026.
RECORD SEQUENCE NUMBER: 15.
MONITOR UPTIME WAS 1:44:50
***********************************************
COMMAND: DETACH
DEVICE: RNA0

SHORT
SEQ 15. TIME 4-Aug-80 14:06:05
CONFIGURATION CHANGE DETACHED RNA0

5.2.10 System Log Entry

The monitor records a System Log Entry when the system operator enters a log entry into the system event file with the OPR program. A system operator, or anyone with operator privileges, can make an entry into the system event file by doing the following:

1. Run the OPR program:

   .OPR<RET>
   OPR>

2. When you see the prompt, specify the REPORT command:

   OPR>REPORT

3. Use the following syntax:

   OPR>REPORT user text <RET>

   where user can be a directory name and/or device name, and text can be a single-line or multiple-line response.

For more information on OPR, refer to the TOPS-10 Operator's Command Language Reference Manual.

FULL
***********************************************
SYSTEM LOG ENTRY
LOGGED ON 15-Sep-80 AT 10:40:12
DETECTED ON SYSTEM # 1026.
RECORD SEQUENCE NUMBER: 37.
MONITOR UPTIME WAS 5:30:10
***********************************************
ENTRY CREATED BY:
JOB #, TTY #: 77,502
P,PN: [27,2617]
WHO: MASELL
DEV: TTY
MESSAGE: : THIS IS A TEST.

SHORT
SEQ 37. TIME 15-Sep-80 10:40:12
SYSTEM LOG ENTRY BY MASELL FOR DEVICE TTY ON TTY # 502 MESSAGE: : THIS IS A TEST.

5.2.11 Software Requested Data

At certain times during system operation, problems can arise that are not easily understood. Most frequently, the source is a hardware failure that is detected by the software. To troubleshoot this type of failure, you may require additional data from the monitor. You can obtain this information by patching the monitor to collect the information at the proper point and pass it to the system event file for listing.

CAUTION

Patching a monitor can easily produce drastic, undesired results such as loss of customer data, system crashes, and so forth. Be EXTREMELY CAREFUL and enlist the help of someone who is familiar with the monitor structure and internal workings.

SPEAR lists the information in this entry in octal and sixbit.

***********************************************
SOFTWARE REQUESTED DATA
LOGGED ON 4-Jan-81 AT 6:50:34
DETECTED ON SYSTEM # 2263.
RECORD SEQUENCE NUMBER: 1.
MONITOR UPTIME WAS 3:13:34
***********************************************
OCTAL VALUE       SIXBIT VALUE
504554,545700     HELLO
675762,544400     WORLD
123456,654321     *<NUC1
654321,123456     UC1*<N
555762,450063     MORE S
517042,516400     IXBIT

5.2.12 Magtape System Error

The monitor records any magtape errors it detects as a Magtape System Error. Errors that are non-recoverable are classified as HARD; recoverable errors are classified as SOFT. If the monitor detects a data channel error, it records the appropriate information under error code 6 or Data Channel Error.
After a user issues an UNLOAD command or UUO, the monitor records the performance statistics for the tape, including the total number of characters transferred and the number of errors (soft read, soft write, hard read, hard write) encountered. Note that if someone mounts unlabelled tapes without specifying any kind of ID, there will be no MEDIA identified in the error file.

FULL
***********************************************
MAGTAPE SYSTEM ERROR
LOGGED ON 8-Sep-80 AT 9:05:11
MONITOR UPTIME WAS 0:57:06
DETECTED ON SYSTEM # 1026.
RECORD SEQUENCE NUMBER: 11.
***********************************************
UNIT NAME: MTB261
UNIT TYPE: TU70
USER'S ID: [1,2]
USER'S PROGRAM: BACKUP
MEDIA ID:
LOCATION OF FAILURE: RECORD: 0. OF FILE: 5.
POSITION BEFORE ERROR: RECORD: 262143. OF FILE: 5.
CHAR. INTO RECORD: 5458276711.
OPERATION: S.I., IMM, BYTE, DEV.CMD.: READ STATUS: CU IS: TX01,7 & 9 TRK NRZI DEVICE IS: WRITE ENB
THIS ENTRY CREATED AS A RESULT OF A 'HUNG DEVICE'
ERROR: NON-RECOVERABLE RUNNING,CSR, IN DX10 CONI, UNIT EXC, IN ICPC+1
***AS OF DX10 MICROCODE VERSION 4(0), RECOVERABLE ERRORS ARE NOT REPORTED TO MONITOR IF DX10 MICROCODE ERROR RETRY IS ENABLED.***
RETRY COUNT: 0.
CONTROLLER INFORMATION: CONTROLLER: DXI0 #0 CONI AT ERROR: 1,422034 = RUNNING,CSR, CONI AT END: 1,422034 = RUNNING,CSR, ICPC+l AT ERROR: 32201,1 = UNIT EXC, ICPC+l AT END: 32201,1 = UNIT EXC, 710040,457 ICPC+2 AT ERROR: ICPC+2 AT END: 710040,457 TEXT REGISTER AT ERROR AT END DIFF B CNT: 0,0 0,0 0,0 OPL OUT, TAGBUS: 10,2 10,2 0,0 DAC: 1,226233 1,226233 0,0 0,0 REV: 150000,2660 150000,2660 CPMA&MD: 0,0 DR: 0,0 DEVICE INFORMATION: *IN OCTAL BYTES* TEXT AT END DIFF SENSE BYTE AT ERROR o 0 0 0 FILE PROT, TIE = 00000011 o 102 3 0 0-3: 0 102 3 0 o 0 0 0 NO ERROR BITS DETECTED o 100 5 0 4-7: 0 100 5 0 o0 0 0 000 0 NO ERROR BITS DETECTED 8-11: 0 0 0 0 o 305 213 0 000 0 12-15: 0 305 213 0 o 232 0 35 000 0 16-19: 0 232 0 35 o0 0 0 o0 0 0 20-23: 0 0 0 0 CHAN CMD LIST: CPC: CMDS: 0,0 262020,20001 140000,454 SHORT SEQ 11. TIME 8-Sep-80 9:05:11 MTB261 TU7x DX10 CONI = 1,422034 ICPC+l = 32201,1 SB(0-3) = 0/102/3/0 FILE/REC = 4/0 RETRIES: 0 HARD 5-17 ENTRY DESCRIPTIONS 5.2.13 Front End Device Report You will find a Front End Device Report in the system event file when the front end passes a packet of error information to the monitor. This information contains errors detected by the front end and KLCPU hardware and software. If the device being reported on is unknown to SPEAR, the entry is reported in octal. FULL *********************************************** FRONT END DEVICE REPORT LOGGED ON 3-Nov-80 AT 9:44:10 DETECTED ON SYSTEM # 1026. RECORD SEQUENCE NUMBER: 67. MONITOR UPTIME WAS 2 DAYS 14:37:29 *********************************************** CPU #,DTE #: 0,0 FE SOFTWARE VER: 0. DEVICE: KLCPU STD. STATUS: 100 ERROR LOG REQUEST, KL RELOAD STATUS FROM FRONT END: 0 NO ERROR BITS DETECTED SHORT SEQ 67 .. TIME 3-Nov-80 9:44:10 KLCPU STD STAT=100 RELOAD STAT=0 5.2.14 Front End Reload The monitor logs a Front End Reload entry into the system event file when it determines that one of its front ends (attached to a DTE on a KL10 only) has crashed and has attempted to reload. 
Before rebooting the front end, the monitor dumps the crashed front end's core image to a disk file for later analysis.

FULL
***********************************************
FRONT END RELOAD
LOGGED ON 9-Sep-80 AT 0:01:05
MONITOR UPTIME WAS 0:01:57
DETECTED ON SYSTEM # 1026.
RECORD SEQUENCE NUMBER: 1494.
***********************************************
CPU #, FRONT END #: 1,1
STATUS AT RELOAD: DUMP FAILED,RELOAD FAILED,ROM DIDN'T ACK THE -10,
RETRIES: 0

SHORT
SEQ 1494. TIME 9-Sep-80 0:01:05
FRONT END RELOAD ON PDP11 #1 RELOAD STATUS: 104400 RETRIES: 0

5.2.15 KS10 Halt Status Block

The monitor records a KS10 Halt Status Block entry into the system event file when the KS10 microcode executes a HALT stopcode. A snapshot of the condition of the system is taken just prior to the HALT, and this information is written as the entry.

FULL
***********************************************
KS10 HALT STATUS BLOCK
LOGGED ON 9-Feb-81 AT 14:21:55
MONITOR UPTIME WAS 0:01:12
DETECTED ON SYSTEM # 4145.
RECORD SEQUENCE NUMBER: 1.
***********************************************
HALT STATUS CODE: 2
PROGRAM COUNTER: 1000
HALT STATUS BLOCK
MAG: 0,2 PC: 0,1000 HR: 777756,4 AR: 0,0 ARX: 377777,777777 BR: 0,1000 BRX: 254000,1000 ONE: 241200,200000 EBR: 0,1 UBR: 0,31463 MASK: 774777,470177 FLAGS,,PAGE FAIL WORD: 0,1 PI STATUS: 400060,120000 XWD1: 500101,553000 T0: 777777,777777 T1: 4000,0 VMA: 0,177

SHORT
SEQ 1. TIME 9-Feb-81 14:21:55
HALT STATUS CODE PC = 0,1000 HR = 254000,1000 PAGE FAIL = 4000,0 PI = 0,177 FLAGS,,VMA 0,0

5.2.16 Magtape Statistics

Each time an UNLOAD UUO or monitor command is given to a tape drive, the monitor creates a Magtape Statistics entry. The same information is printed in summary form on both the user's terminal and the operator's terminal (CTY).

In this entry, the REEL IDENTIFICATION is the name supplied to the monitor at the time the tape was mounted. It has nothing to do with any label information found on the tape.
The CHARS READ is the number of characters or frames of tape read on this unit since the last UNLOAD command was issued to this unit. The CHARS WRITTEN is the number of characters or frames of tape written on this unit since the last UNLOAD command was issued.

FULL
***********************************************
MAGTAPE STATISTICS
LOGGED ON 4-Aug-80 AT 13:40:05
DETECTED ON SYSTEM # 1026.
RECORD SEQUENCE NUMBER: 5.
MONITOR UPTIME WAS 1:18:50
***********************************************
MAGTAPE STATISTICS
UNIT NAME: MTB261
REEL IDENTIFICATION:
USER'S P,PN: 1,2
CHARS READ: 2720.
CHARS WRITTEN: 0.
SOFT READ ERRORS: 0.
HARD READ ERRORS: 1.
SOFT WRITE ERRORS: 0.
HARD WRITE ERRORS: 0.

SHORT
SEQ 5. TIME 4-Aug-80 13:40:05
MTB261 STATISTICS READ CH/H/S: 2720/1/0 WRITE CH/H/S: 0/0/0

5.2.17 Disk Statistics

This entry reports the performance of each disk unit since the monitor was loaded. It is useful for computing the disk error rate and disk throughput. DAEMON usually does not record this information in the system event file because it takes up a great deal of space. Installations that want this entry should reassemble DAEMON with the conditional assembly switch FTUSN set. The monitor then records this entry type for each disk unit on the system each hour. You can find the same type of information for each monitor run in the Crash Extract entry (Section 5.2.3).

FULL
***********************************************
** THIS ENTRY COPIED FROM A SAVED CRASH **
DISK STATISTICS
LOGGED ON 5-Aug-80 AT 0:11:25
DETECTED ON SYSTEM # 1026.
RECORD SEQUENCE NUMBER: 188.
MONITOR UPTIME WAS 11:50:09
***********************************************
UNIT RPAO RPA1 RPA2 RPA3 RPA4 RPA5 RPA6 RPA7 RPBO RPB1 RPB2 RPB3 RPB4 RPB5 RPB6 RPB7 RPDO RPD1 RPD2 RPD3 RPD4 RPD5 RPD6 RPD7 RNAO RNA1 RNA2 RNA3 PACK BLKXO DSKCO FTNO GALOO DSKC1 DSKBO DSKB1 DSKRO DSKPO BLKKO SEEKS 0 0 2145 3758 0 4614 0 0 549 2166 92665 0 3577 2032 592 0 131943 0 0 12525 56962 0 0 25648 17196 3029 550 2 BLOCKS BLOCKS WRITTEN READ 0 0 0 0 26986 2 2 9800 0 0 13740 1355 0 0 0 0 44 4037 26 35998 2149512 1306027 0 0 58 39762 4940 14114 61 4550 0 0 1576709 1162512 0 0 0 0 331776 56 1102911 863534 0 0 0 0 100568 18486 11667 28596 422 3884 4 843 0 2 SHORT 5-Aug-80 SEQ TIME 188. 0:11:25 DISK STATISTICS DATA 0 0 17 5 0 1 0 0 0 0 2 0 1 1 0 0 0 0 0 1 5 0 0 1 0 0 0 0 *** ERROR COUNTS *** DEVICE 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 SEEK 0 0 0 0 0 0 0 0 0 0 0 0 4 0 0 0 0 0 HUNG 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 SAT 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 RIB 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 CHECKSUM 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 t:rJ Z 1-3 ~ ~ t1 t:rJ til (') ~ H ttl 1-3 H 0 Z til ENTRY DESCRIPTIONS 5.2.18 DL19 Communications Error The monitor records a DL10 Communications Error into the system file when the DL10 detects an error on the communications link. event FULL *********************************************** DL10 COMMUNICATIONS ERROR LOGGED ON 4-Aug-80 AT 16:45:09 DETECTED ON SYSTEM # 1026. RECORD SEQUENCE NUMBER: 86. MONITOR UPTIME WAS 4:23:54 *********************************************** UNIT: DC76 DL10 PORT: 0 ERROR: NO ERROR BITS DETECTED 11 PROGRAM NAME: DC 7 6 CONTROLLER INFORMATION: CONI DLC: 60,200204 = PI ENB, DATAl DLC: 0,750 = NO ERROR BITS DETECTED CONIDLB (R=0): 0,5037 CONI DLB (R=l): 40000,6005 CONI DLB (R=2): 2000,46401 CONI DLB (R=3): 577777,46400 DATAl DLB (R=l) (MB): 0,0 SHORT SEQ TIME 4-Aug-80 86. 
16:45:09 DL10 ERROR ON PDP11 # 0 CONI DLC = 60,200204 DATAI DLC = 0,750

5.2.19 KL10 Parity or NXM Interrupt

The monitor records a KL10 Parity or NXM Interrupt in the system event file when the KL10 detects a parity error or an attempt to access a nonexistent memory location.

The PC AT INTERRUPT is the status of the program counter at the time of the parity or nonexistent memory interrupt. The CONI PI AT INTERRUPT is the status of the Priority Interrupt system at the time of the parity or nonexistent memory interrupt.

FULL
***********************************************
** THIS ENTRY COPIED FROM A SAVED CRASH **
KL10 PARITY OR NXM INTERRUPT
LOGGED ON 2-Dec-80 AT 0:05:28
DETECTED ON SYSTEM # 1026.
RECORD SEQUENCE NUMBER: 584.
MONITOR UPTIME WAS 16:20:11
***********************************************
ERROR DETECTED ON CPL0
PC AT INTERRUPT: 4000,566602
CONI PI AT INTERRUPT: 0,10377
CONI APR AT INTERRUPT: 7760,2030 = NXM,SWEEP DONE,
ERA: 200003,554255 = WD # 1 MEMORY READ
BASE PHY. MEM ADDR. AT FAILURE: 3554255
SYSTEM MEMORY CONFIGURATION:
CONTROLLER: #4 DMA20
INTERLEAVE MODE: 4 WAY
DMA: LAST ADDR HELD: 45220
ERRORS DETECTED: NONE

SHORT
SEQ 584. TIME 2-Dec-80 0:05:28
PARITY OR NXM INTERRUPT ON CPL0 CONI APR = 7760,2030 CONI PI = 0,10377 RDERA = 200003,554255 PC AT INTERRUPT = 4000,566602
DUMPING UNKNOWN ERROR IN OCTAL ERROR CODE 0

5.2.20 KS10 NXM Trap

When the KS10 detects a read on a nonexistent memory location, the monitor records a KS10 NXM Trap into the system event file. A trap stops execution during the current instruction.

FULL
***********************************************
KS10 NXM TRAP
LOGGED ON 22-Mar-81 AT 0:11:50
DETECTED ON SYSTEM # 4608.
RECORD SEQUENCE NUMBER: 1.
MONITOR UPTIME WAS 0:23:18
***********************************************
ERROR DETECTED ON CPS0
PC AT TRAP: 1,145267
CONI PI AT TRAP: 0,2377
PAGE FAIL WORD: 200013,770000
PAGE FAIL CODE: 20 = I-O NXM
PHYSICAL MEMORY ADDRESS AT TRAP: 0,0
USER'S ID AT TRAP: [307,5515]
USER'S PROGRAM: TSTUBA
# OF RECOVERABLE TRAPS: 0.
# OF NON-RECOVERABLE TRAPS: 0.

SHORT
SEQ 1. TIME 22-Mar-81 0:11:50
NXM TRAP PFW = 200013,770000 PMA = 0,0 NON RECOVERABLE FAILURE RETRYS: 31 USER AT TRAP [307,5515] RUNNING TSTUBA

5.2.21 KL10 or KS10 Parity Trap

The monitor records a KL10 or KS10 Parity Trap when either the KL10 or KS10 detects an internal parity error, not necessarily in memory.

In this entry, the PHYSICAL MEMORY ADDRESS AT TRAP gives the location of the parity error where the trap occurred.

FULL
***********************************************
KL10 OR KS10 PARITY TRAP
LOGGED ON 4-Feb-81 AT 17:37:14
DETECTED ON SYSTEM # 2136.
RECORD SEQUENCE NUMBER: 1.
MONITOR UPTIME WAS 0:03:13
***********************************************
ERROR DETECTED ON CPL0
PC AT TRAP: 316000,230
CONI PI AT TRAP: 0,377
PHYSICAL MEMORY ADDRESS AT TRAP: 547001,436241
USER'S ID AT TRAP: [1,2]
USER'S PROGRAM: KLPAR4
PAGE FAIL WORD: 767000,241
PAGE FAIL CODE: 36 = AR
BAD DATA WORD: 252525,252525
GOOD DATA WORD: 0,0
DIFFERENCE: 252525,252525
RECOVERY: CRASH USER
RETRY COUNT: W CACHE: 4. W-O CACHE: 0.
ERROR DURING CACHE SWEEP TO CORE
# OF RECOVERABLE TRAPS: 0.
# OF NON-RECOVERABLE TRAPS: 3.

SHORT
SEQ 1. TIME 4-Feb-81 17:37:14
PARITY TRAP PFW = 767000,241 PMA = 547001,436241 NON RECOVERABLE FAILURE USER AT TRAP [1,2] RUNNING KLPAR4 RETRIES: 4

5.2.22 Memory Sweep for NXM

When the monitor detects an attempt to access a nonexistent memory location in user core, it scans core by doing a memory sweep, looking for more NXMs. The monitor then records the results of this scan as a Memory Sweep for NXM in the system event file.
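The sweep report for this entry condenses the bad physical addresses it finds into a logical AND and a logical OR. The following Python sketch shows how such a summary can be built and read. It is an illustration under stated assumptions, not SPEAR or monitor code: the function names are hypothetical, words are assumed to be 36 bits wide and printed as two octal halfwords, and the starting values (all ones for the AND, zero for the OR) are inferred from the sample reports in which no errors were detected.

```python
WORD_MASK = (1 << 36) - 1  # assumed 36-bit word, all ones


def halfwords(word):
    """Format a 36-bit word as left,right octal halfwords, e.g. 0,177."""
    return f"{word >> 18:o},{word & 0o777777:o}"


def summarize(bad_addresses):
    """Return (AND, OR) over the addresses, as a sweep summary might."""
    and_acc, or_acc = WORD_MASK, 0
    for address in bad_addresses:
        and_acc &= address
        or_acc |= address
    return and_acc, or_acc


# No errors: the accumulators keep their starting values.
and_acc, or_acc = summarize([])
print(halfwords(and_acc), halfwords(or_acc))  # 777777,777777 0,0

# Hypothetical bad addresses, chosen for illustration only.
and_acc, or_acc = summarize([0o3554255, 0o3554257, 0o3554355])
print(halfwords(and_acc), halfwords(or_acc))

# Bits set in the AND are common to every bad address. Bits that differ
# between the OR and the AND vary from one bad address to another.
print(halfwords(and_acc ^ or_acc))
```

Read this way, an AND that equals the OR points to a single failing address, while a small difference between the two suggests failures clustered in one region of memory.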
The ADDRESSES DETECTED BY SWEEP gives you the locations, if any, of more attempts to access nonexistent memory locations.

FULL
***********************************************
MEMORY SWEEP FOR NXM
LOGGED ON 1-Oct-80 AT 9:03:14
MONITOR UPTIME WAS 1:02:21
DETECTED ON SYSTEM # 1026.
RECORD SEQUENCE NUMBER: 3124.
***********************************************
NXM CORE SWEEP TOTALS FOR CPL0
REPRODUCIBLE: 0.
NON-REPRODUCIBLE: 0.
DETECTED BY DATA CHANNEL BUT NOT BY CPU: 20.
SWEEP INFORMATION:
ERRORS DETECTED: 0.
LOGICAL "AND" OF BAD PHYSICAL ADDRESSES: 777777,777777
LOGICAL "OR" OF BAD PHYSICAL ADDRESSES: 0,0
MEMORY PLACED OFF-LINE:

SHORT
SEQ 3124. TIME 1-Oct-80 9:03:14
NXM SWEEP ON CPL0 # OF ERRORS SEEN 0

5.2.23 Memory Sweep for Parity

When the monitor detects a parity error on a read attempt, it sweeps memory looking for more parity errors. The results of the sweep are recorded in the system event file as a Memory Sweep for Parity.

The SWEEP INFORMATION contains the number of words found with bad parity. It also contains the logical AND and logical OR of the bad addresses and bad contents.

FULL
***********************************************
MEMORY SWEEP FOR PARITY
LOGGED ON 4-Nov-80 AT 8:39:53
MONITOR UPTIME WAS 0:35:34
DETECTED ON SYSTEM # 1026.
RECORD SEQUENCE NUMBER: 2026.
***********************************************
DATA PARITY CORE SWEEP TOTALS FOR CPL0
REPRODUCIBLE: 0.
NON-REPRODUCIBLE: 0.
USER ENABLED: 0.
CORE SWEEPS: 1.
DETECTED BY DATA CHANNEL BUT NOT BY CPU: 1.
SWEEP INFORMATION:
ERRORS DETECTED: 0.
LOGICAL "AND" OF BAD PHYSICAL ADDRESSES: 777777,777777
LOGICAL "OR" OF BAD PHYSICAL ADDRESSES: 0,0
LOGICAL "AND" OF BAD DATA: 777777,777777
LOGICAL "OR" OF BAD DATA: 0,0

SHORT
SEQ 2026. TIME 4-Nov-80 8:39:53
DATA PARITY CORE SWEEP FOR CPL0 # OF ERRORS SEEN 0

5.2.24 CPU Status Block

The monitor records this entry into the system event file after recovering from a system crash.
At the time of the crash, a snapshot is taken of the condition of all the components of the CPU (such as controllers, channels, RH20s, the pager, and so forth). When the system recovers, the monitor extracts this information from the CRASH.EXE file and places it in the system event file as a CPU Status Block. This entry contains the condition of the registers and channels just prior to the crash. Also, the SBDIAG FUNCTIONS column contains the SBUS diagnostic functions. 5-26 ENTRY DESCRIPTIONS FULL *********************************************** ** THIS ENTRY COPIED FROM A SAVED CRASH ** CPU STATUS BLOCK LOGGED ON 5-Aug-80 AT 0:11:25 DETECTED ON SYSTEM # 1026. RECORD SEQUENCE NUMBER: 185. MONITOR UPTIME WAS 11:50:09 *********************************************** APRID = 231,342002 CONI APR = 7760,3 RDERA = 604000,7427 CONI PI = 0,10377 DATAl PAG = 701100,3 CONI PAG = 0,620001 CONI RH0 THRU RH7 000000,,002445 000000,,006400 000000,,002445 000000,,000000 000000,,000000 000000,,000000 CONI DTE0 THRU DTE3 000000,,020014 000000,,100000 000000,,100014 EPT LOCATIONS 0 THRU 37 (CHANNEL LOGOUT AREA) 200000,,000454 500000,,000456 600000,,000000 000000,,000000 000000,,000000 000000,,000000 200000,,000454 500000,,000455 600001,,457000 200000,,000454 500000,,000455 600001,,014660 000000,,000000 000000,,000000 000000,,000000 000000,,000000 000000,,000000 000000,,000000 000000,,000000 000000,,000000 000000,,000000 000000,,000000 000000,,000000 000000,,000000 EPT LOCATIONS 140 THRU 177 (DTE CONTROL BLOCKS) 141000,,413160 241000,,223676 264000,,057516 000000,,000442 000000,,057054 000000,,000030 000000,,000000 000000,,000000 264000,,057556 000000,,000443 000000,,057053 000000,,000030 241000,,224302 341000,,224563 264000,,057616 000000,,000444 000000,,057052 000000,,000030 341000,,232743 141000,,224000 264000,,057656 000000,,000445 000000,,057051 000000,,000030 UPT LOCATIONS 424 THRU 427 (UUO AREA) 000000,,000000 000000,,000000 000000,,000000 UPT LOCATIONS 500 THRU 503 (PAGE FAIL 
AREA) 000000,,000000 304000,,112667 004000,,566102 AC BLOCK 6 LOCATIONS 0 THRU 3 AND 12 000000,,000000 000000,,000000 000000,,000000 000000,,000000 AC BLOCK 7 LOCATIONS 0 THRU 2 255000,,000000 000000,,640010 000000,,000000 SBDIAG FUNCTIONS CTRLR FUNCTION 0 005740,,041736 4 000000,,002445 000000,,000000 000000,,100014 000000,,000000 000000,,000000 000000,,000000 000000,,000000 000000,,000000 000000,,000000 000000,,000000 000000,,000000 000000,,000000 000000.,,057136 000000,,000000 000000,,057166 000000,,000000 000000,,057216 000000,,000000 000000,,057246 000000,,000000 000000,,000000 000000,,000000 FUNCTION 1 000200,,000000 SHORT SEQ TIME 185. 0:11:25 CPU STATUS BLOCK APRID = 231,342002 CONI APR CONI PI = 0,10377 CONI PAG = O,620001 DATAl PAG = 701100,3 5-Aug-80 5-27 7760,3 ENTRY DESCRIPTIONS 5.2.25 Device Status Block The monitor records this entry into the system event file after ~ecovering from a system crash. At the time of the crash, a snapshot IS taken of the condition of all the I/O devices (such as lineprinters, cardreaders, disk drives, and so forth). When the system recovers, the monitor extracts this information from the CRASH.EXE file and places it in the system event file as a Device Status Block. FULL *********************************************** ** THIS ENTRY COPIED FROM A SAVED CRASH ** DEVICE STATUS BLOCK LOGGED ON 5-Aug-80 AT O:11:25 DETECTED ON SYSTEM # 1026. RECORD SEQUENCE NUMBER: 186. 
MONITOR UPTIME WAS 11:50:09 *********************************************** CONI 20 : CONI 24 : CONI 120 CONI 104 CONI 100 CONI 240 CONI 320 CONI 324 CONI 150 CONI 124 CONI 140 CONI 344 CONI 340 CONI 220 CONI 170 CONI 174 CONI 270 CONI 274 CONI 360 CONI 250 CONI 254 CONI 260 CONI 264 CONI 334 CONI 330 CONI 64 : CONI 60 : CONI 164 CONI 160 CONI 110 CONI 154 CONI 234 CONI 230 CONI 144 Dl\TAI Dl\TAI 170 DATAl 174 Dl\TAI 270 D.l\TAI 274 Dl\TAI 360 DATAl 250 DATAl 254 DATAl 260 °: 117,63202 O,32003 O,0 0,0 0,0 O,0 O,410000 770010,4100 3,0 O,2400 O,40 0,0 0,0 1,420004 0,0 0,0 O,0 4000,5 O,0 0,0 0,0 0,0 O,0 0,0 0,0 60,200224 0,5037 0,0 0,0 O,400000 2,0 O,0 307620,32400 0,0 0,0 O,0 0,0 0,0 4003,3 O,0 0,0 O,0 0,0 5-28 ENTRY DESCRIPTIONS DATAl DATAl DATAl DATAl DATAl 264 : 0,0 64 : 0,770 60 : 0,162 164 0,0 160 : 0,0 SHORT SEQ TIME 186. 0:11:25 DEVICE STATUS BLOCK 5.2.26 5-Aug-80 Line Printer Error The monitor records any errors detected by the LP100 controller as a Line Printer Error in the system event file. Note that if the line printer is taken off-line to add paper or change forms, the monitor does not record this event. The LAST DATA WORD SENT can help to determine the location of a data parity error, if one exists. Also, the CONI AT ERROR text translation contains significant error bits to describe the mode of operation when the failure occurred. FULL *********************************************** LINE PRINTER ERROR LOGGED ON 22-Mar-81 AT 0:11:50 DETECTED ON SYSTEM # 1536. RECORD SEQUENCE NUMBER: 1. MONITOR UPTIME WAS 0:23:18 *********************************************** UNIT NAME: LPT0 CONTROLLER TYPE: LP100 LAST DATA WORD SENT: 0,123 CONI AT ERROR: 200045,226465 VFU TYPE: DIRECT ACCESS CHARACTER SET: VARIABLE PAGE COUNTER: 37. NOT READY,VFU ERROR,OFF LINE, SHORT SEQ 1. 
TIME 22-Mar-81 0:11:50
LPT0 LP100 ERROR CONI LP 200045,226465

5.2.27 Unit Record Error

The monitor logs a Unit Record Error into the system event file when it detects an error on a unit-record device such as a line printer, a card reader, a card punch, or a plotter.

FULL
***********************************************
UNIT RECORD ERROR
LOGGED ON 8-Sep-80 AT 12:06:44
DETECTED ON SYSTEM # 1026.
RECORD SEQUENCE NUMBER: 314.
MONITOR UPTIME WAS 3:58:38
***********************************************
UNIT NAME: LPT262
CONTROLLER TYPE: LP100
DEVICE TYPE: LPT
USER ID: [1,2]
PROGRAM NAME: LPTSPL
VFU TYPE: DAVFU
CHARACTER SET: 96 CHARACTER
CONI AT ERROR: 307216,632444
LAST DATA WD: 0,0
NOT READY,VFU ERROR, OFF LINE,

SHORT
SEQ 314. TIME 8-Sep-80 12:06:44
LPT262 ERROR FOR USER [1,2] RUNNING LPTSPL CONI LP100 = 307216,632444

5.3 TOPS-20 ENTRIES

The following sections list both the FULL and SHORT versions of the entries that TOPS-20 can record in its system event file. Note that the network entries for DECnet-20 version 2.1 are listed separately in Section 5.4. Network entries for DECnet-20 versions 3.0 and 4.0 are listed in Section 5.5.

5.3.1 TOPS-20 System Reloaded

Every time the monitor is loaded, a TOPS-20 System Reloaded entry is written into the system event file, explaining why the system was reloaded. If the system is on auto-reload and a BUGHLT occurs, the BUGHLT address is listed and the TOPS-20 BUGHLT-BUGCHK entry, Section 5.3.2, is also written into the system event file.

FULL
***********************************************
TOPS-20 SYSTEM RELOADED
LOGGED ON Mon 23 Jun 80 08:46:31
DETECTED ON SYSTEM # 2116.
RECORD SEQUENCE NUMBER: 22.
MONITOR UPTIME WAS 0:00:22
***********************************************
CONFIGURATION INFORMATION
SYSTEM NAME: System 2116 TOPS-20 Monitor 4(3230)
MONITOR BUILT ON: Wed 28 Nov 79 11:00:01
CPU SERIAL #: 2116.
MONITOR VERSION: 4(3230)
U-CODE VERSION: 0
RELOAD BREAKDOWN:

SHORT
SEQ 22. TIME Mon 23 Jun 80 08:46:31
RELOAD OF System 2116 The Big Orange Welcomes You, TOPS-20 Monitor 4(3230) VERSION 4(3230) BUILT ON Wed 28 Nov 79 11:00:01 REASON

5.3.2 TOPS-20 BUGCHKs and BUGHLTs

When the monitor detects a BUGHLT, BUGCHK, or BUGINF monitor software error, it records a TOPS-20 BUGHLT-BUGCHK entry into the system event file. The most serious of the three errors is a BUGHLT, which crashes the system. At this point, something is seriously wrong, and the monitor does not have enough integrity to attempt any further error recovery. The monitor does, however, collect pertinent information for error recording. When the system is reloaded, the information is extracted from a crash dump and recorded in the system event file.

BUGCHK and BUGINF are less serious, perhaps correctable, monitor-detected errors that can affect only particular users instead of the entire system. These errors may or may not crash the system, depending on the error that occurs.

The number of errors since reload is included in this entry because only five occurrences of this entry type are allowed in the monitor's error recording buffer at any one time. If an error occurs in a tight loop, more than five entries could overflow the buffer, and the information for the first occurrence might be lost. These numbers should increment by one for each entry; if the sequence is broken, it indicates that more than five entries occurred before the error-logger module of the monitor could empty the buffer.

The FORK # and JOB # in the entry are the numbers associated with the current user at the time of the error. A value of -1 or 777777 indicates that the monitor was performing an overhead function (such as scheduling) and that there was no current user. Note that the FORK # and JOB # indicate the current user, and not necessarily the user being serviced by the monitor interrupt-level routines.
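The sequence check and the overhead test described above can be sketched in a few lines of Python. This is an illustration only and is not part of SPEAR: the function names and the input list are hypothetical, and the counts are assumed to have been copied from the # OF ERRORS SINCE RELOAD field of successive entries logged after the same reload.

```python
def find_lost_entries(error_counts):
    """Return (previous, current, missing) for each break in the sequence.

    error_counts: values of "# OF ERRORS SINCE RELOAD" taken from
    successive BUGHLT-BUGCHK entries since one reload, in logged order.
    """
    gaps = []
    for previous, current in zip(error_counts, error_counts[1:]):
        missing = current - previous - 1
        if missing > 0:
            # The count jumped: entries were lost before the buffer
            # could be emptied.
            gaps.append((previous, current, missing))
    return gaps


def is_overhead(fork_or_job):
    """True when a FORK # or JOB # means there was no current user."""
    return fork_or_job in (-1, 0o777777)


# Hypothetical counts: the entries numbered 4 through 8 are absent.
print(find_lost_entries([1, 2, 3, 9, 10]))
print(is_overhead(0o777777))
```

A non-empty result only says that entries are missing between two logged entries; it cannot recover what the lost entries contained.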
All BUGHLTs now reside in a monitor module, BUGS.MAC. This module includes a description of what might have caused the BUGHLT and also some corrective action that you can take. For a complete listing and explanation of BUGINFs, BUGCHKs, and BUGHLTs, refer to the TOPS-20 BUGINF, BUGCHK, BUGHLT Document.

FULL
***********************************************
TOPS-20 BUGHLT-BUGCHK
LOGGED ON Mon 16 Jun 80 11:10:19
DETECTED ON SYSTEM # 2137.
RECORD SEQUENCE NUMBER: 25.
MONITOR UPTIME WAS 3:10:48
***********************************************
ERROR INFORMATION:
DATE-TIME OF ERROR: Mon 16 Jun 80 11:10:09
# OF ERRORS SINCE RELOAD: 1.
FORK # & JOB #: 72,0
USER'S LOGGED IN DIR: OPERATOR
PROGRAM NAME: SYSJOB
ERROR: BUGINF
ADDRESS OF ERROR: 644111
NAME: DN20ST
DESCRIPTION: DTESRV- DN20 STOPPED
CONI APR: 7740,3 NO ERROR BITS DETECTED
CONI PAG: 0,660132
DATAI PAG: 700100,1246
CONTENTS OF AC'S:
0: 0,0 1: 777775,1 2: 0,1 3: 0,0
4: 0,0 5: 0,0 6: 0,0 7: 0,0
10: 0,0 11: 0,0 12: 0,0 13: 0,0
14: 0,0 15: 0,0 16: 60000,0 17: 777505,335504
PI STATUS: 0,177
ADDITIONAL DATA ITEMS: 1 0,1
ERA: 602000,5504 = WD #3 MEMORY READ
BASE PHY. MEM ADDR. AT FAILURE: 5504

SHORT
SEQ 25. TIME Mon 16 Jun 80 11:10:19
BUGINF DN20ST AT Mon 16 Jun 80 11:10:09 USER OPERATOR RUNNING SYSJOB CONI APR= 7740,3 CONI PAG= 0,660132 ERA= 602000,5504

5.3.3 MASSBUS Device Error

Every time the monitor detects an error in the MASSBUS system, a MASSBUS Device Error is recorded in the system event file. The MASSBUS system includes the MASSBUS devices RP04, RP05, RP06, TU45, and RM03; the RH20 controller (RH11 and UBA for 2020); and certain errors occurring in the channel logic.

The unit name in this entry refers to the physical MASSBUS unit active at the time of the error. This is a 5-character name in the format:

xxabc

where

xx   is the device type DP (disk pack) or MT (magtape). For example, DP220 refers to disk pack 220.

a    is the logical address of the RH20 controller for this device (0-7), or of the RH11 and UBA in a 2020 configuration.

b    is the logical MASSBUS address for this device (0-7). For magtape units, this is the TM02 address on the MASSBUS.

c    is the slave number of a magnetic tape unit. For RP04s, RP05s, and RP06s, this number is always 0.

The following is a MASSBUS Device Error from an RP07 disk drive:

FULL
***********************************************
MASSBUS DEVICE ERROR
LOGGED ON Mon 31 Aug 81 15:28:29
DETECTED ON SYSTEM # 2137.
RECORD SEQUENCE NUMBER: 131.
MONITOR UPTIME WAS 0:36:03
***********************************************
UNIT NAME: DP50C
UNIT TYPE: RP07
UNIT SERIAL #: 0395.
VOLUME ID: PS
LBN AT START OF XFER: 1636360 CYL: 344. SURF: 23. SECT: 19.
OPERATION AT ERROR: DEV.AVAIL., GO + WRITE DATA(60)
FINAL ERROR STATUS: 20000,3
RETRIES PERFORMED: 2.
ERROR: RECOVERABLE DATA BUS PAR ERR,DRIVE EXCEPTION,LONG WD CNT ERR,CHN ERROR, IN CONTROLLER CONI PAR, IN DEVICE ERROR REGISTER
CONTROLLER INFORMATION:
CONTROLLER: RH20 # 5
CONI AT ERROR: 0,722615 = DATA BUS PAR ERR,DRIVE EXCEPTION,LONG WD CNT ERR,CHN ERROR,
CONI AT END: 0,2415 = NO ERROR BITS DETECTED
DATAI PTCR AT ERROR: 732203,177461
DATAI PTCR AT END: 732203,177461
DATAI PBAR AT ERROR: 720003,13423
DATAI PBAR AT END: 720003,13423
CHANNEL INFORMATION:
CHAN STATUS WD 0: 200000,133237 CW1: 0,0 CW2: 0,0
CHN STATUS WD 1: 540100,133240 = NOT SBUS ERR,NOT WC = 0,LONG WC ERR,
CHN STATUS WD 2: 603403,510620
DEVICE REGISTER INFORMATION:
          AT ERROR   AT END   DIFF.
CR(00):   4060       4060     0       DEV.AVAIL., WRITE DATA(60)
SR(01):   50700      10700    40000   ERR,MOL,DPR,DRY,VV,
ER(02):   10         0        10      PAR,
HR(03):   0          0        0
AS(04):   0          0        0
DA(05):   13426      13427    1       D. TRK 27, D.SECT. = 26
DT(06):   20042      20042    0
LA(07):   2700       3000     1700
SN(10) OF (11) DC (12) CC (13) : E2 (14) : E3 (15) : EP (16) : PL (17) : 1625 1625 0 0 530 530 344. 530 530 344.
0 0 NO ERROR BITS DETECTED 210 0 DVC,DPE, 0 0 0 0 0 0 0 0 0 210 0 0 DEVICE STATISTICS AT TIME OF ERROR: # OF READS: 79686. # OF WRITES: 59808. # OF SEEKS: # SOFT READ ERRORS: O. # SOFT WRITE ERRORS: 2. # HARD READ ERRORS: O. # HARD WRITE ERRORS: O. # SOFT POSITIONING ERRORS: O. # HARD POSITIONING ERRORS: O. # OF MPE: O. # OF NXM: O. # OF OVERRUNS: O. 14 597. SHORT SEQ TIME Mon 31 Aug 81 131. 15:28:29 DP50C PS: RP07 SERIAL #0395. CONI RH= 0,722615 CHN STS= 540100,133240 SR= 0,50700 ER= 0,10 CYL/SURF/SEC= 344./23./19. The following MASSBUS Device Error is from a TU78 magnetic tape drive: FULL *********************************************** MASSBUS DEVICE ERROR LOGGED ON Mon 31 Aug 81 15:42:02 MONITOR UPTIME WAS 0:08:46 DETECTED ON SYSTEM # 2137. RECORD SEQUENCE NUMBER: 161. *********************************************** UNIT NAME: MT000 UNIT TYPE: TU78 UNIT SERIAL #: 0175. VOLUME ID: LOCATION: RECORD # 1. OF FILE # 0. USER'S LOGGED IN DIR NUMBER: 5 USER'S PGM: SYSJOB OPERATION AT ERROR: DEV.AVAIL. GO + READ FWD(70) FINAL ERROR STATUS: 0,0 RETRIES PERFORMED: 0. ERROR: NON-RECOVERABLE DRIVE EXCEPTION,CHN ERROR, IN CONTROLLER CONI 5-35 ENTRY DESCRIPTIONS M8960 u-CODE REVISION LEVELS: o ( 0- 3777) 1 (4000- 7777) 2 (10000-13777) 3 (14000-17777) 4 (20000-23777) 5 (24000-27777) 6 (30000-33777) 7 (34000-37777) 005 005 005 003 002 003 007 003 CONTROLLER INFORMATION: CONTROLLER: RH20 # 0 CONI AT ERROR: 0,222415 DRIVE EXCEPTION,CHN ERROR, 0,222415 = CONI AT END: DRIVE EXCEPTION,CHN ERROR, DATAl PTCR AT ERROR: 732200,177771 DATAl PTCR AT END: 732200,177771 DATAl PBAR AT ERROR: 720000,113000 DATAl PBAR AT END: 720000,113000 CHANNEL INFORMATION: CHAN STATUS WD 0: 200000,272774 CWl: 0,0 CW2: 0,0 ClfIN STATUS WD 1: 540100,272775 NOT saus ERR,NOT WC = 0,LONG WC ERR, ClfIN STATUS WD 2: 420003,170000 DEVICE REGISTER INFORMATION: AT ERROR AT END DIFF. CMD 00: 4070 4070 0 DEV.AVAIL. 
READ FWD (70) DST 01: 4415 4415 0 Interrupt code: TM UNEXPECTED combination -- interrupt code: 15 failure code: 2 CNT 02: 30004 30004 0 SKIP COUNT RECORD COUNT = 1. DRIVE # 0 0. DGl 03: 0 0 0 A~rN 04: 0 0 0 BCT 05: 113000 113000 0 38400. BYTES DTR 06: 142101 142101 0 s~rA 07: 166200 166200 0 RDY, PRES, ONL, PE, BOT, AVAIL, SgR 10 : 565 565 0 DG2 11: 0 0 0 DG3 12: 0 0 0 NST 13 : 1 1 0 Interrupt code: DONE Extended sense data not updated NCI 14: 406 406 CMD COUNT = 1. Rewind(06) NC2 15: 10 10 CMD COUNT 0. Sense(10) Ne3 16 : 10 10 CMD COUNT 0. Sense(10) NC4 17: 10 10 CMD COUNT 0. Sense(10) MPA 20: 2034 2034 MPD 21: 100000 100000 0 0 0 0 0 0 5-36 ENTRY DESCRIPTIONS EXTENDED SENSE BYTE DATA NOT SUPPLIED FOR THIS ENTRY DEVICE STATISTICS AT TIME OF ERROR: # OF READS: 0. # OF SEEKS: # OF WRITES: 0. # SOFT READ ERRORS: 0. # SOFT WRITE ERRORS: 0. # HARD READ ERRORS: 1. # HARD WRITE ERRORS: 0. # SOFT POSITIONING ERRORS: 0. # HARD POSITIONING ERRORS: 0. # OF MPE: # OF NXM: 0. 0. # OF OVERRUNS: 0. 0. SHORT 161. 15:42:02 MT000 TU78 SERIAL #0175. OPERATOR RUNNING SYSJOB CONI RH= 0,222415 CHN STS= 540100,272775 SR= 0,4415 ER= 0,30004 FILE/RECORD 0./1. 5.3.4 DX20 Device Error When the monitor detects an error in any portion of the MASSBUS system connected to the DX20 tape controller, the DX20 Device Error is recorded in the system event file. This entry contains the octal values of the CONI and DATAl from the controller both when the error was first detected and after the last retry. FULL *********************************************** DX20 DEVICE ERROR LOGGED ON Mon 9 Feb 81 10:33:16 DETECTED ON SYSTEM # 2116. RECORD SEQUENCE NUMBER: 4. MONITOR UPTIME WAS 4 DAYS 14:31:48 *********************************************** UNIT NAME: MT301 UNIT TYPE: TU70 VOLUME ID: 6631 LOCATION: RECORD # 1282. OF FILE # OPERATION AT ERROR: GO + WRITE DATA(60) FINAL ERROR STATUS: 0,3 RETRIES PERFORMED: O. 
ERROR: RECOVERABLE DRIVE EXCEPTION, IN CONTROLLER CONI MPERR, IN DEVICE ERROR REGISTER CONTROLLER INFORMATION: CONTROLLER: RH20 # 3 DX20 #:0 TX02 DX20 U-CODE VERSION: 1(13) CONI AT ERROR: 0,202615 DRIVE EXCEPTION, CONI AT END: 0,202615 = DRIVE EXCEPTION, DATAl PTCR AT ERROR: 732200,177761 DATAl PTCR AT END: 732200,177761 DATAl PBAR AT ERROR: 720000,172742 DATAl PBAR AT END: 720000,172742 #: 5-37 0 O. ENTRY DESCRIPTIONS CHANNEL INFORMATION: CHAN STATUS WD 0: 200000,260532 CWI : 0,0 CW2 : 0,0 500000,260534 CHN STATUS WD 1: NOT SBUS ERR, 600001,200006 CHN STATUS WD 2: MASSBUS REGISTER INFORMATION: AT ERROR AT END CR (00) : 60 60 WRITE DATA(60) 70000 BR(Ol) : 70000 CERR,LNKPRS,MPRUN, 600 ER(02) : 600 DIFF. o o o MPERR,MPERR CLASS: 1 ,SUB-CLASS: 0 UNUSUAL DEVICE STATUS FROM FINAL STATUS SEQUENCE ~1R(03): 4 4 0 MPSTR, 0 0 J\S (04) : 0 172742 0 SB (05) : 172742 0 50060 DT(06) : 50060 DRIVE TYPE: 60, HDWR VER: 50 7000 0 SI (20) : 7000 0 10001 DN(21) : 10001 0 120 ES (22) : 120 100 0 ~~E (23) : 100 0 !\Y (24) : 0 0 EO (26) : 4304 4304 0 4214 0 HI (27) : 4214 114751 0 IR (30) : 114751 133662 0 PC (31) : 133662 }\L (32) : 15466 0 15466 104030 0 SD (33) : 104030 117360 E'P (34) : 117360 0 122377 0 BW(35) : 122377 1B (36) : 160000 160000 0 0 l'IA(37) : 0 0 DEVICE INFORMATION RECORDED AT TIME OF ERROR REGISTER CONTENTS TEXT SB 0-3: 10 304 10 214 DATA CHK,NOISE,SEL WR STATUS,R/W VRC,ENV CHK/SKEW REG VRC,1600 BPI, TIE 4-7: 0 100 5 0 NO ERROR BITS DETECTED 8-11: 0 10 0 0 NO ERROR BITS DETECTED 12-15 o 16 374 0 16-19 o 2 0 74 20-23 o 0 201 200 MCV: 10 0 320 33 MRA: 343 30 0 60 MRB: 120 0 0 4 MRC: 200 14 1 20 MRD: 120 0 100 0 MRE: o 0 342 365 MRF: 102 0 4 0 CBO: o 3 152 200 205 17 0 16 CBl: DPO: 14 2 6 0 DP1: o 0 0 0 DP2: 30 14 0 0 DP3: o 14 111 70 1 141 LAS: 5-38 00001000 ENTRY DESCRIPTIONS DEVICE STATISTICS AT TIME OF ERROR: # OF READS: 674226290. # OF WRITES: 881585460. # SOFT READ ERRORS: O. # SOFT WRITE ERRORS: 39. # HARD READ ERRORS: O. # HARD WRITE ERRORS: O. 
# SOFT POSITIONING ERRORS: 0.
# HARD POSITIONING ERRORS: 0.
# OF MPE: 0.
# OF NXM: 0.
# OF OVERRUNS: 0.
# OF SEEKS: 0.

SHORT
SEQ   TIME      Mon 9 Feb 81
4.    10:33:16  MT301 6631: TU70 OPERATOR RUNNING TAPE
                CONI=0,202615 CHN STS 1= 500000,260534 CR=0,60 SR=0,70000 ER=0,600
                SENSE BYTES 0-3: 10 304 10 214 FILE/RECORD 0./1282.

5.3.5 Drive Statistics Entries

Drive Statistics Entries are written into the system event file to record activity on a drive. Events such as mounts and dismounts, reloads, and drive shutdowns are recorded as drive statistics.

FULL
***********************************************
DRIVE STATISTICS ENTRIES
LOGGED ON 5-Oct 10:52:28
MONITOR UPTIME WAS 367.
DETECTED ON SYSTEM # 2137.
RECORD SEQUENCE NUMBER: 361.
***********************************************
Volume ID: SPARE
Reason recorded: Disk pack mount
Channel info(CDB): RH20 # 4 on PI level 5
Device info(UDB): RP20, DP401 PIA: 0
READS TOTAL : WRITES 8. SEEKS 1.

***********************************************
DRIVE STATISTICS ENTRIES
LOGGED ON 5-Oct 11:20:24
MONITOR UPTIME WAS 5454.
DETECTED ON SYSTEM # 2137.
RECORD SEQUENCE NUMBER: 374.
***********************************************
Volume ID: COM
Reason recorded: Magtape unload
Channel info(CDB): RH20 # 3 on PI level 5
Device info(UDB): TU70, MTA1, MT301 PIA: 0
READS WRITES TOTAL 353600. 7610560. NRZI PE GCR 353600. 7610560.

SHORT
361.  10:52:28  STATS DRIVE: DP401 VOLID: SPARE  REASON: Disk pack mount.
374.  11:20:24  STATS DRIVE: MT301 VOLID: CDM    REASON: Magtape unload.

5.3.6 Configuration Status Change

The monitor records a Configuration Status Change when the system operator takes disk units and/or sections of core memory on-line or off-line, thus changing the configuration of the system. The system operator can give a 2-character reason for the change in configuration.
The following codes are suggested:

PM - preventive maintenance
CM - corrective maintenance
DN - unit is down
OT - other

This entry lists what device was affected, what action was taken, and where the action was performed (channel number, controller number, unit number).

CAUTION

When the system operator adds memory to the system, the monitor checks to verify the availability of the specified addresses. Mistakes are reported to the operator at the operator's terminal, CTY; however, the error-logging system treats these as valid NXMs and records them as NXM entries. You can identify an NXM entry of this type by the fact that no physical memory is off-line and the user's directory is [1,2].

FULL
***********************************************
CONFIGURATION STATUS CHANGE
LOGGED ON Mon 23 Jun 80 08:50:21
DETECTED ON SYSTEM # 2137.
RECORD SEQUENCE NUMBER: 1.
MONITOR UPTIME WAS 2 DAYS 8:34:54
***********************************************
DETACH TU72 S/N:28410 AS MTA2
AT CHANNEL #0 CONTROLLER #0 UNIT #2
REASON:

SHORT
SEQ   TIME      Mon 23 Jun 80
1.    08:50:21  DETACH TU72 S/N:28410 AS MTA2
                AT CHANNEL #0 CONTROLLER #0 UNIT #2 REASON:

5.3.7 System Log Entry

The monitor records a System Log Entry when the system operator enters a log entry into the system event file with the OPR program. A system operator, or anyone with operator privileges, can make an entry into the system event file by doing the following:

1. Run the OPR program:

   @OPR<RET>
   OPR>

2. When you see the prompt, specify the REPORT command:

   OPR>REPORT<RET>

3. Use the following syntax:

   OPR>REPORT user text<RET>

   where user can be a directory name and/or device name, and text can be a single-line or multiple-line response.

For more information on OPR, refer to the TOPS-20 Operator's Command Language Reference Manual.

FULL
***********************************************
SYSTEM LOG ENTRY
LOGGED ON Tue 1 Jul 80 11:37:37
DETECTED ON SYSTEM # 2116.
RECORD SEQUENCE NUMBER: 32.
MONITOR UPTIME WAS 0:09:48
***********************************************
ENTRY CREATED BY:
JOB #, TTY #:   11,17
DIRECTORY:      SCHMITT
WHO:            SCHMIT
DEV:            NUL
MESSAGE:        testing

SHORT
SEQ   TIME      Tue 1 Jul 80
32.   11:37:37  SYSTEM LOG ENTRY BY SCHMIT FOR DEVICE NUL ON TTY # 17
                MESSAGE: : testing

5.3.8 Front-End Device Report

A Front-End Device Report appears in the system event file when the front end passes a packet of error information to the monitor across the DTE-20. This information contains errors detected by the front end and by the KLCPU hardware and software. Currently, entries are created for the following devices: LP20, CD20, DH11, KLCPU, KLERROR, and KLINIK.

If the FORK # and JOB # associated with the error are 777777,777777, the TOPS-20 monitor knows of this device, but the device is not currently assigned to any fork or job. If the FORK # and JOB # are 777776,777776, the monitor does not know anything about this device.

The front end generates a standard-status word for each transfer across the DTE-20. The ERROR LOG REQUEST bit in this word causes the packet to be recorded into the system event file. The information in the entry varies depending on the type of device being reported on. If SPEAR does not know how to list a device, this fact is stated in the entry, which is then listed in octal.

FULL
***********************************************
FRONT END DEVICE REPORT
LOGGED ON Mon 16 Jun 80 11:48:30
DETECTED ON SYSTEM # 2137.
RECORD SEQUENCE NUMBER: 35.
MONITOR UPTIME WAS 3:48:59
***********************************************
DTE20 #:               0.
FE SOFTWARE VER:       0.
DEVICE:                DH11
STD. STATUS:           300
                       NON RECOVERABLE ERROR,ERROR LOG REQUEST,
DH11 UNIBUS ADDRESS:   160060 = DH11 #2
SYSTEM CONTROL REG:    30106 = TRANS & NXM INT ENA,STORAGE INT ENA,LINE #6
RECEIVED CHAR REG:     123000 = VALID DATA PRESENT,FRAMING ERROR,LINE #6,CHAR=0

SHORT
SEQ   TIME      Mon 16 Jun 80
35.
11:48:30 DH11 STD STAT=300 UNIBUS ADDR=160060 SYS CONTROL=30106 REC CHAR=123000

5.3.9 Front End Reloaded

Each time the KLCPU detects that the front end has halted or is in a loop, a Front End Reloaded entry is recorded in the system event file. The KL attempts to copy a crash dump file onto disk from the front end's memory and then reboots the front end. The front-end number is the logical address of the front end and indicates whether this front end is privileged. The status at reload describes, in text, any errors that occurred during the reboot process. The file name of the core dump is listed if the crash dump was successful.

FULL
***********************************************
FRONT END RELOADED
LOGGED ON Tue 1 Jul 80 00:18:51
DETECTED ON SYSTEM # 2102.
RECORD SEQUENCE NUMBER: 126.
MONITOR UPTIME WAS 0:02:24
***********************************************
FRONT END #:          0
STATUS AT RELOAD:     NO ERROR BITS DETECTED
RETRIES:              0
REASON FOR RELOAD:    B03
FILENAME FOR DUMP:    <SYSTEM>0DUMP11.BIN.17, 1-Jul-80 00:18:45

SHORT
SEQ   TIME      Tue 1 Jul 80
126.  00:18:51  FRONT END RELOAD ON PDP11 #0
                RELOAD STATUS,RETRIES 0,0 PDP11 HALT CODE B03

5.3.10 Processor Parity Trap

The monitor records a Processor Parity Trap each time a page-fail trap occurs in the CPU as a result of an AR, ARX, or PAGE TABLE parity error. The information contained in the GOOD DATA WORD is valid only if the error is recoverable; otherwise, the data is 0,0 and the DIFFERENCE DATA is a copy of the BAD DATA WORD. The DIFFERENCE is the result of an XOR between the bad data and the good data words. For example, a bad data word of 252525,252525 and a good data word of 525252,525252 differ in every bit, so the DIFFERENCE is 777777,777777. Note that if the user is unknown, 777777,777777 will be the FORK and JOB numbers.

FULL
***********************************************
PROCESSOR PARITY TRAP
LOGGED ON Tue 8 Jul 80 11:14:04
DETECTED ON SYSTEM # 2102.
RECORD SEQUENCE NUMBER: 320.
MONITOR UPTIME WAS 8:51:58
***********************************************
STATUS AT ERROR:
BAD DATA DETECTED BY:            AR
PAGE FAIL WD AT TRAP:            763000,313
BAD DATA WORD:                   252525,252525
GOOD DATA WORD:                  525252,525252
DIFFERENCE:                      777777,777777
PHYSICAL MEM ADDR. AT FAILURE:   563003,277313
RECOVERY:                        CONT. USER
RETRY COUNT:                     1.
CACHE IN USE
FORK # & JOB #:                  53,17
USER'S LOGGED IN DIR:            EIBEN
PROGRAM NAME:                    KLPAR1

SHORT
SEQ   TIME      Tue 8 Jul 80
320.  11:14:04  PARITY TRAP PAGE FAIL WORD;763000,313
                PHYSICAL MEMORY ADDRESS;563003,277313 FAILURE TYPE,RETRIES;40000,1

5.3.11 Processor Parity Interrupt

When the monitor detects an APR interrupt because of a parity error, it records a Processor Parity Interrupt in the system event file. It records the entry after it has scanned all physical memory looking for more errors. If the original error also generates a page-fail trap, the monitor also creates a Processor Parity Trap entry.

The CONI APR and ERA values are the contents of these registers at the time of the first error. The PC AT INTERRUPT value includes the flags in the left half. The BASE PHYsical MEMory ADDRess AT FAILURE is from the right half of the contents of the ERA.

The # OF ERRORS on this sweep refers to the number of parity errors during this sweep of physical memory. If the value is zero, the monitor did not detect any errors, and 777777,777777 is the logical AND function for both bad addresses and bad data. The logical OR function, in this case, is 0,0.

The SYSTEM MEMORY CONFIGURATION lists the physical memory configuration and any detected errors at the time of the first error. These are the results of S-BUS DIAGNOSTIC FUNCTIONS for all memory controllers on this CPU.

FULL
***********************************************
PROCESSOR PARITY INTERRUPT
LOGGED ON Tue 8 Jul 80 11:21:35
DETECTED ON SYSTEM # 2102.
RECORD SEQUENCE NUMBER: 323.
MONITOR UPTIME WAS 8:59:29
***********************************************
CONI APR:                         7740,413 = MB PAR ERR,
ERA:                              36001,520314 = WD #0 CACHE WRITE
BASE PHY. MEM ADDR. AT FAILURE:   1520314
PC FLAGS AT INTERRUPT:            300000,0
PC AT INTERRUPT:                  67320
# ERRORS ON THIS SWEEP            2.
LOGICAL AND OF BAD ADDRESSES:     1,520304
LOGICAL OR OF BAD ADDRESSES:      1,520314
LOGICAL AND OF BAD DATA:          252525,252525
LOGICAL OR OF BAD DATA:           252525,252525

SYSTEM MEMORY CONFIGURATION:

CONTROLLER: #0 MB20 128 K
F0: 6000,0  F1: 36300,36012
INTERLEAVE MODE: 4-WAY
REQ ENABLED: 0 2
LOWER ADDRESS BOUNDARY: 0
UPPER ADDRESS BOUNDARY: 777777
ERRORS DETECTED: NONE

CONTROLLER: #1 MB20 128 K
F0: 6000,0  F1: 36300,36005
INTERLEAVE MODE: 4-WAY
REQ ENABLED: 1 3
LOWER ADDRESS BOUNDARY: 0
UPPER ADDRESS BOUNDARY: 777777
ERRORS DETECTED: NONE

CONTROLLER: #2 MB20 128 K
F0: 6000,0  F1: 36301,36012
INTERLEAVE MODE: 4-WAY
REQ ENABLED: 0 2
LOWER ADDRESS BOUNDARY: 1000000
UPPER ADDRESS BOUNDARY: 1777777
ERRORS DETECTED: NONE

CONTROLLER: #3 MB20 128 K
F0: 6000,0  F1: 36301,36005
INTERLEAVE MODE: 4-WAY
REQ ENABLED: 1 3
LOWER ADDRESS BOUNDARY: 1000000
UPPER ADDRESS BOUNDARY: 1777777
ERRORS DETECTED: NONE

CONTROLLER: #10 MF20
F0: 26123,277313  F1: 500,1000
LAST WORD REQUEST: RQ3 WRITE
LAST ADDRESS HELD: 3277313
CONTROLLER STATUS: SF2 & SF1= 2
ERRORS DETECTED: WRITE PARITY

CONTROLLER: #11 MF20
F0: 7747,631734  F1: 500,1000
LAST WORD REQUEST: RQ0RQ1RQ2RQ3- READ
LAST ADDRESS HELD: 7631734
CONTROLLER STATUS: SF2 & SF1= 2
ERRORS DETECTED: NONE

ERRORS DETECTED DURING SWEEP:
ADDRESS    BAD DATA         GOOD DATA             DIFFERENCE
1520304    252525,252525    GOOD DATA NOT FOUND
1520314    252525,252525    GOOD DATA NOT FOUND

SHORT
SEQ   TIME      Tue 8 Jul 80
323.  11:21:35  PARITY INTERRUPT-CONI APR;7740,413 ERA;36001,520314
                PC AT INTERRUPT;0,67320 # OF ERRORS;2.

5.3.12 KL CPU Status Block

This entry is written into ERROR.SYS on TOPS-20 if KLSTAT is turned on at the time of a system crash. (See Section 4.5.1 for this procedure.)
At the time of a crash, a snapshot of the condition of all the components of the CPU (such as controllers, channels, RH20s, the pager, and so forth) is taken. When the system recovers, this information is extracted from the CRASH.EXE file and written as an entry in ERROR.SYS. This entry displays the condition of the registers and channels at the time of the crash.

FULL
***********************************************
KL CPU STATUS BLOCK
LOGGED ON Mon 15 Sep 80 15:03:19
DETECTED ON SYSTEM # 2137.
RECORD SEQUENCE NUMBER: 26.
MONITOR UPTIME WAS 17:49:02
***********************************************
APRID = 600236,364131      CONI APR = 7740,3
RDERA = 202000,132276      CONI PI = 0,2377
DATAI PAG = 701000,3201    CONI PAG = 0,660124

CONI RH0 THRU RH7
000000,,002445 000000,,002445 000000,,002445 000000,,002445
000000,,002000 000000,,002000 000000,,002000 000000,,002000

CONI DTE0 THRU DTE3
000000,,001016 000000,,101016 000000,,002000 000000,,002000

EPT LOCATIONS 0 THRU 37 (CHANNEL LOGOUT AREA)
200000,,225566 540100,,225567 620003,,477000 254340,,726001
200000,,074442 500000,,074443 600000,,460000 254340,,726421
200000,,075064 500000,,075065 600001,,053000 254340,,727011
200000,,075522 500000,,075523 600001,,573000 254340,,727501
000000,,000000 000000,,000000 000000,,000000 000000,,000000
000000,,000000 000000,,000000 000000,,000000 000000,,000000
000000,,000000 000000,,000000 000000,,000000 000000,,000000
000000,,000000 000000,,000000 000000,,000000 000000,,000000

EPT LOCATIONS 140 THRU 177 (DTE CONTROL BLOCKS)
241000,,223711 241000,,730250 254340,,002135 000000,,000000
000000,,000000 000000,,223434 000000,,000030 000000,,223516
000000,,000000 041000,,731556 254340,,002147 000000,,000000
000000,,000226 000000,,223433 000000,,000030 000000,,223546
000000,,000000 000000,,000000 000000,,000000 000000,,000000
000000,,000000 000000,,000000 000000,,000000 000000,,000000
000000,,000000 000000,,000000 000000,,000000 000000,,000000
000000,,000000 000000,,000000 000000,,000000 000000,,000000

UPT LOCATIONS 424 THRU 427 (UUO AREA)
310100,,057200 000000,,700000 000000,,000000 601000,,003201

UPT LOCATIONS 500 THRU 503 (PAGE FAIL AREA)
411000,,742000 000000,,000162 000006,,611327 000000,,027543

AC BLOCK 6 LOCATIONS 0 THRU 3 AND 12
000770,,000007 301000,,002520 000000,,127000 000000,,153764
011003,,276223

AC BLOCK 7 LOCATIONS 0 THRU 2
000000,,000000 000000,,000000 000000,,000000

SBDIAG FUNCTIONS
CTRLR   FUNCTION 0        FUNCTION 1
0       006000,,000000    036300,,036012
1       006000,,000000    036300,,036005
10      007743,,201500    000500,,001000

SHORT
SEQ   TIME      Mon 15 Sep 80
26.   15:03:19  KL CPU STATUS BLOCK
                APRID = 600236,364131 CONI APR 7740,3 RDERA = 202000,132276
                CONI PAG = 0,660124 DATAI PAG = 701000,3201

5.3.13 MF20 Device Report

This entry is written to ERROR.SYS when a MOS memory error occurs. The monitor calls a program named TGHA every time a MOS memory error occurs. TGHA is responsible for recovering from the error. If TGHA places memory off-line or substitutes a spare bit, these events are recorded as an entry in ERROR.SYS. The TGHA entry is actually an ASCII text report describing the attempt to recover from an error in MOS memory.

FULL
***********************************************
MF20 DEVICE REPORT
LOGGED ON Mon 30 Jun 80 10:02:41
DETECTED ON SYSTEM # 2102.
RECORD SEQUENCE NUMBER: 21.
MONITOR UPTIME WAS 1 DAY 11:39:06
***********************************************
TEXT FROM TGHA:
A NEW MF20 KNOWN ERROR HAS BEEN DECLARED. DATA:
STORAGE MODULE SERIAL NUMBER: 8320021
BLOCK: 3, SUBBLOCK: 1, BIT IN FIELD (10): 5, ROW: 174, COLUMN: 52,
E NUMBER: 109, ERROR TYPE: CELL

SHORT
SEQ   TIME      Mon 30 Jun 80
21.   10:02:41  MF20 REPORT

5.3.14 KLERR Front End Device Report

The following entry is written into the system event file when the KL clock stops for any of several errors (FAST MEMORY, PARITY ERRORS, CRAM PARITY ERROR, DRAM PARITY ERROR, or FIELD SERVICE STOP).
Any significant error signal will be listed just after the header. 5-47 ENTRY DESCRIPTIONS FULL *********************************************** FRONT END DEVICE REPORT "KLERR" TYPE 205 LOGGED ON 23-Mar-81 09:14:54 MONITOR UPTIME WAS 0 DAYS DETECTED ON SYSTEM # 2102 RECORD SEQUENCE NUMBER: 7 *********************************************** No error bits are active *,t***** LOGGING STARTED 23-MARCH-81 09:12 ,RSX-20F YB14-41A OUTPUT DEVICES: TTY,LOG KLE>EXAMINE KL PCI 5337 Vl~AI 5337 PI ACTIVE: OFF, PION: 177, PI HOLD: 000, PI GEN: 000 OVF CYO CYI FOV BIS USR UIO LIP AFI ATI ATO X X KLE>CLEAR OUTPUT TTY OUTPUT DEVICES: LOG KLE)SET CONSOLE MAINTENANCE CONSOLE MODE: MAINTENANCE KLE)SHOW HARDWARE KLI0 SIN: 2102., MODEL B, 60. HERTZ MOS MASTER OSCILLATOR EXTENDED ADDRESSING INTERNAL CHANNELS CACHE KLE>EXAMINE DTE DLYCNT: 000000 DEXWD3: 160000 DEXWD2: 060323 DEXWDl: 000000 KLI0 DATA=014064 760000 TENADl: 000000 TENAD2: 000024 ADDRESS SPACE=EPT OPERATION=EXAMINE PROTECTION-RELOCATION IS ON KLI0 ADDRESS=24 TOI0BC: 010000 TOIIBC: 130000 TOIOAD: 067540 TOIIAD: 070572 TOI0DT: 000036 TOIIDT: 142400 DIAGI : 001100 KL IN HALT LOOP MAJOR STATE IS TO-II TRANSFER DIAG2 : 040000 STATUS: 012504 RAM IS ZEROS DEX WORD 1 11 REQUESTED 10 INTERRUPT E BUFFER SELECT DEPOSIT-EXAMINE DONE DIAG3 : 026000 5-48 FUF NDV 19:00:43 ENTRY DESCRIPTIONS KLE>FREAD 100:177 FR 100/ 000177 602664 FR 101/ 000000 002600 FR 102/ 000013 410202 FR 103/ 000020 212024 FR 104/ 000000 032434 FR 105/ 000000 003421 FR 106/ 000000 642000 FR 107/ 000000 715642 FR 110/ 000003 015225 FR 111/ 000104 000000 FR 112/ 007740 037411 FR 113/ 000000 044524 FR 114/ 000101 000012 FR 115/ 001107 060144 FR 116/ 001400 012003 FR 117/ 001100 002000 FR 120/ 000000 000000 FR 121/ 000000 000000 FR 122/ 001100 002000 FR 123/ 000000 270173 FR 124/ 002000 020000 FR 125/ 000000 000000 FR 126/ 000000 000001 FR 127/ 000000 000001 FR 130/ 000072 000000 FR 131/ 070054 060000 FR 132/ 014064 760000 FR 133/ 000020 414000 FR 134/ 130066 404003 FR 135/ 
120024 224003 FR 136/ 104052 604003 FR 137/ 002004 244003 FR 140/ 760505 050707 FR 141/ 100201 000001 FR 142/ 110000 001010 FR 143/ 600202 061407 FR 144/ 540001 050707 FR 145/ 510000 000001 FR 146/ 650000 001010 FR 147/ 111212 071407 FR 150/ 000000 000104 FR 151/ 000000 002004 FR 152/ 000000 000104 FR 153/ 000024 002104 FR 154/ 000000 000125 FR 155/ 000000 002405 FR 156/ 000000 000125 FR 157/ 000024 002525 FR 160/ 001003 017027 FR 161/ 001006 276703 FR 162/ 001006 206017 FR 163/ 001000 000523 FR 164/ 001003 017323 FR 165/ 001006 276767 FR 166/ 001006 206017 FR 167/ 001000 000103 FR 170/ 360040 126722 FR 171/ 000000 735722 FR 172/ 011600 137230 FR 173/ 200102 377322 FR 174/ 176010 177664 FR 175/ 163000 127375 FR 176/ 000200 337375 FR 177/ 760000 533305 KLE>WHAT AC AC-BLOCK: 0 KLE>SWEEP KLE>XCT CONI 0,15! KLE>EXAMINE TEN 15 CONI APR,15 5-49 ENTRY DESCRIPTIONS 15/ 007740 000003 KLE>XCT BLKI 4,15! RDERA KLE>EXAMINE TEN 15 15/ 602000 005337 KLE>XCT CONI 4,15! CONI PI,15 KLE>EXAMINE TEN 15 15/ 000000 000177 KLE>XCT DATAl 10,15! DATAl PAG,15 KLE>EXAMINE TEN 15 15/ 700100 001270 KLE>XCT CONI 10,15! CONI PAG,15 KLE>EXAMINE TEN 15 15/ 000000 060137 KLE>SET OUTPUT TTY OUTPUT DEVICES: TTY,LOG KLE>CLEAR OUTPUT LOG ******* LOGGING FINISHED 23-MARCH-81 09:13 SHORT SEQ. TIME 23-Mar-81 7. 09:14:54 KLERR FRONT END DEVICE TYPE 205 No error bits are active 5.3.14.1 The HSC50 Error Log - When a CPU initiates a request for data transfer from the HSC50, the HSC50 Error Log entry is written into that particular CPU's system event file. The following are examples of the full and short versions of the HSC50 Error Log event. 5-50 ENTRY DESCRIPTIONS FULL *********************************************** HSC50 ERROR LOG LOGGED ON 14-Jul-85 16:50:06-EDT MONITOR UPTIME WAS 0 DAY(S) DETECTED ON SYSTEM # 2137. RECORD SEQUENCE NUMBER: 7503. 14:20:42 *********************************************** COMMON DATA COMMAND REF #: 0'HHHH'l00 CI20 PORT #: 7. NODE #: 15. SEQUENCE #: I. 
FORMAT:               03             SDI Error
FLAGS:                41             Operation Continuing, Sequence Number Reset
EVENT:                002B           Drive Error, SDI command timed out
CNTLR DEVICE #:       00000230F00F
CNTLR CLASS:          01             Mass Storage Controller
CNTLR MODEL:          01             HSC50
CNTLR SOFTWARE VER:   02
CNTLR HARDWARE VER:   00
HOST COMMAND #:       0

UNIT IDENTIFICATION DATA
UNIT NUMBER:          11.
MULTI-UNIT CODE:      0020
UNIT DEVICE #:        000000000FA5
UNIT CLASS:           02             DEC Std 166 Disk
UNIT MODEL:           05             RA81
UNIT SOFTWARE VER:    06
UNIT HARDWARE VER:    01
VOLUME S/N:           00000FA5

SDI DATA
HEADER:               00000000       Logical Block
BLOCK AT ERROR WAS 0.

CONTROLLER DATA
REQUEST BYTE:                  13
MODE BYTE:                     00
ERROR BYTE:                    00
CONTROLLER BYTE:               00
RETRY COUNT / FAILURE CODE:    00
Drive-online or available, Port switch in, Run/Stop switch in,
Formatting disabled, Diag Cyl access disabled, 512 byte

RA80/81 DEVICE DATA
LAST POSITION COMMAND:         87
SDI ERROR STATUS:              00
LAST SEEK CYLINDER:            0000
HEAD NUMBER:                   0.
MICROPROCESSOR LEDS:           00
FRONT PANEL FAULT CODE:        00

EXTRANEOUS DATA IN 8 BIT OCTAL BYTES (UNUSED RIGHT 4 BITS IN 36-BIT WORD)
BYTES 63.-60.   002 000 000 000   (00)
BYTES 67.-64.   000 000 000 000   (00)
BYTES 71.-68.   000 000 000 000   (00)
BYTES 75.-72.   000 000 000 000   (00)
BYTES 79.-76.   000 000 000 000   (00)

SHORT
SEQ    TIME      14-Jul-85
7503.  16:50:06  HSC50 Error Message Node 15.
                 Drive Error, SDI command timed out on RA81 #11. S/N FA5 SDI Error -

5.4 DECNET ENTRIES (V2.1)

The following sections list both the FULL and SHORT versions of the network entries (Version 2.1) that TOPS-10 or TOPS-20 can record in the system event file.

5.4.1 Network Control Started

Whenever NETCON is loaded and started, the monitor records a Network Control Started entry into the system event file. This entry includes the version number and the node on which NETCON is running.

FULL
***********************************************
NETWORK CONTROL STARTED
LOGGED ON Mon 23 Jun 80 11:37:08
DETECTED ON SYSTEM # 2137.
RECORD SEQUENCE NUMBER: 15.
MONITOR UPTIME WAS 2 DAYS 11:21:41
***********************************************
PROGRAM NAME:       NETCON
PROGRAM VERSION:    4(22)
NODE NAME:          KL2137

SHORT
SEQ   TIME      Mon 23 Jun 80
15.   11:37:08  NCU STARTED PROGRAM: NETCON VER:4(22) STARTED ON NODE KL2137

5.4.2 Network Up-Line Dump

Whenever NETCON dumps a node, the monitor records the name of the node involved, the line used, the dump-file specification, and any return code as a Network Up-Line Dump entry in the system event file.

FULL
***********************************************
NETWORK UP-LINE DUMP
LOGGED ON Mon 23 Jun 80 11:07:53
DETECTED ON SYSTEM # 2137.
RECORD SEQUENCE NUMBER: 11.
MONITOR UPTIME WAS 2 DAYS 10:52:26
***********************************************
TARGET NODE NAME:      DN20L
SERVER NODE NAME:      KL2137
SERVER LINE DESIG.:    DTE20 1 0
FILE NAME DUMPED:      PS:<SROBINSON>DN20L-R4-26.DMP

SHORT
SEQ   TIME      Mon 23 Jun 80
11.   11:07:53  UP-LINE DUMP OF NODE DN20L BY NODE KL2137 LINE DESIGNATION DTE20 1 0
                FILE DUMPED TO PS:<SROBINSON>DN20L-R4-26.DMP

5.4.3 Network Down-Line Load

Whenever NETCON loads a node, the monitor records the name of the node involved, the line used, the load-file specification, and any return code as a Network Down-Line Load entry in the system event file.

FULL
***********************************************
NETWORK DOWN-LINE LOAD
LOGGED ON Mon 23 Jun 80 11:10:33
DETECTED ON SYSTEM # 2137.
RECORD SEQUENCE NUMBER: 13.
MONITOR UPTIME WAS 2 DAYS 10:55:06
***********************************************
TARGET NODE NAME:      DN20L
SERVER NODE NAME:      KL2137
SERVER LINE DESIG.:    DTE20 1 0
FILE NAME LOADED:      PS:<NEXT-RELEASE>DN20L-R4-26.SYS.1

SHORT
SEQ   TIME      Mon 23 Jun 80
13.   11:10:33  DOWN-LINE LOAD OF NODE DN20L BY NODE KL2137 LINE DESIGNATION DTE20 1 0
                FILE LOADED PS:<NEXT-RELEASE>DN20L-R4-26.SYS.1

5.4.4 Network Hardware Error

Whenever NETCON detects an error in any hardware device connected to a node, the monitor records this information as a Network Hardware Error in the system event file.
FULL
***********************************************
NETWORK HARDWARE ERROR
LOGGED ON Mon 23 Jun 80 08:52:48
DETECTED ON SYSTEM # 2137.
RECORD SEQUENCE NUMBER: 3.
MONITOR UPTIME WAS 2 DAYS 8:37:21
***********************************************
MSG SENT FROM: DN20L
MSG REC'D AT: KL2137
HDWR TYPE: KMC-DUP11
SOFTWARE TYPE: ILLEGAL
PARENT SYSTEM TYPE: UNKN
HARDWARE ERROR MSG SEQUENCE # FROM XMIT NODE: 14.
LINE ID: KDP 0 1 0
REASON FOR ENTRY: DDCMP START REC'D DURING NORMAL OPERATION
RECOVERY STATE: NOT SUPPLIED,
ERROR: NO ERROR BITS DETECTED IN RxDBUF,
       NO ERROR BITS DETECTED IN TxCSR

HARDWARE REGISTER INFORMATION:
MICROCODE: NOT SUPPLIED
CONTROLLER REGISTERS:
SEL 0: 100220
SEL 2: 0
SEL 4: 177777
SEL 6: 177777
DEVICE REGISTERS:
RXCSR:  0
RXDBUF: 0   NO ERROR BITS DETECTED
TXCSR:  0   NO ERROR BITS DETECTED
TXDBUF: 0

SHORT
SEQ   TIME      Mon 23 Jun 80
3.    08:52:48  NETWORK HARDWARE ERROR FROM DN20L FOR LINE KDP 0 1 0
                ERROR IS DDCMP START REC'D DURING NORMAL OPERATION

5.4.5 Network CHECK11 Report

Whenever the DN20 or DN200 is loaded, CHECK11 (a hardware test module) is started. All messages from CHECK11 at that time become one entry in the system event file. Note that the log data in this entry is an ASCIZ CHECK11 message of arbitrary length.

FULL
***********************************************
NETWORK CHECK11 REPORT
LOGGED ON Mon 23 Jun 80 11:09:56
DETECTED ON SYSTEM # 2137.
RECORD SEQUENCE NUMBER: 12.
MONITOR UPTIME WAS 2 DAYS 10:54:28
***********************************************
MSG SENT FROM: KL2137
MSG REC'D AT: KL2137
HDWR TYPE: UNKN
SOFTWARE TYPE: UNKN
PARENT SYSTEM TYPE: UNKN
MSG SEQUENCE # FROM XMIT NODE: 2.
TEXT FROM CHK11 REPORT:
CHK11 HARDWARE TEST version 2A(21) of 10-AUG-79 by LDW
Testing begins ...
THE PROCESSOR SEEMS TO BE A KD11-E
CHK11 EXPECTED AN 11/34 (11/34)
KT11 memory management test
PHYSICAL MEMORY HAS ABSOLUTE LIMITS OF 0 - 757777
FOR A TOTAL OF 124KW (DECIMAL)
MAPPED PHYSICAL MEMORY TEST ... ... COMPLETE
KW11-L checked
device scan report assumes DN20 DN21 DN25 fixed assignments (no floating)
1 Fixed DTE20 at 174440, vector at 774
1 Fixed KMC11 at 160540, vector at 540
2 Fixed DUP11s from 160300, vector at 570
2 Fixed DMC11s from 160740, vector at 670
CHK11 complete

SHORT
SEQ   TIME      Mon 23 Jun 80
12.   11:09:56  NETWORK CHECK11 REPORT

5.4.6 Network Line Statistics

Periodically, NETCON records the status of each communications line, and this information becomes an entry in the system event file.

FULL
***********************************************
NETWORK LINE STATISTICS
LOGGED ON Mon 16 Jun 80 08:34:19
DETECTED ON SYSTEM # 2137.
RECORD SEQUENCE NUMBER: 1.
MONITOR UPTIME WAS 0:34:48
***********************************************
MSG SENT FROM: DN20L
MSG REC'D AT: KL2137
HDWR TYPE: DTE-20
SOFTWARE TYPE: UNKN
PARENT SYSTEM TYPE: UNKN
LINE ID: DTE 1 0 0
REASON FOR ENTRY: PERIODIC ENTRY
SECONDS SINCE LAST ZEROED             1802.
BLOCKS RECEIVED                       808.
BLOCKS SENT                           814.
NON - LINE ERROR RETRANSMISSIONS      0.

SHORT
SEQ   TIME      Mon 16 Jun 80
1.    08:34:19  NETWORK LINE COUNTERS FROM NODE DN20L FOR LINE DTE -1-0-0
                LINE ERROR RETRANS RECV LINE ERRORS

5.5 DECNET ENTRIES (V3.0 AND V4.0)

The DECnet V3.0 and V4.0 Event Logger module records any significant network events into the system event file. The headers for DECnet V3.0 and V4.0 entries have the title:

DECNET ENTRY

The body of each entry contains numbers that correspond to specific event classes and event types. Tables 5-1 and 5-2 list the meaning of the numbers in the entry. Refer to Section 4.4.3 for information on how to RETRIEVE network entries by event class.
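
To illustrate how the numbers in an entry body relate to Table 5-1, the following Python sketch splits an event number such as 4.14 into its event class and event type and looks up the class description. The function names (split_event_type, describe_class) and the idea of post-processing the text of a SPEAR listing are assumptions made for this example only; SPEAR itself provides no such interface. The class ranges are copied from Table 5-1.

```python
# Hypothetical helper, not part of SPEAR: decode the "class.type" number
# that follows "Event type" in a DECNET ENTRY listing.

# (low, high, description) ranges copied from Table 5-1.
EVENT_CLASSES = [
    (0, 0, "Network Management Layer"),
    (1, 1, "Applications Layer"),
    (2, 2, "Session Control Layer"),
    (3, 3, "Network Services Layer"),
    (4, 4, "Transport Layer"),
    (5, 5, "Data Link Layer"),
    (6, 6, "Physical Link Layer"),
    (7, 31, "Reserved for other common event classes"),
    (32, 63, "Reserved for RSTS specific event classes"),
    (64, 95, "Reserved for RSX specific event classes"),
    (96, 127, "Reserved for TOPS-20 specific event classes"),
    (128, 159, "Reserved for VMS specific event classes"),
    (160, 191, "Reserved for RT specific event classes"),
    (192, 479, "Reserved for future use"),
    (480, 511, "Reserved for Customer specific event classes"),
]


def split_event_type(text):
    """Split a string such as '4.14' into the integers (4, 14)."""
    event_class, event_type = text.strip().split(".")
    return int(event_class), int(event_type)


def describe_class(event_class):
    """Return the Table 5-1 description for an event class number."""
    for low, high, description in EVENT_CLASSES:
        if low <= event_class <= high:
            return description
    raise ValueError("event class out of range: %d" % event_class)


if __name__ == "__main__":
    cls, typ = split_event_type("4.14")
    print(cls, typ, describe_class(cls))
```

The number is handled as two separate integers rather than as one decimal fraction because 4.1 and 4.10 are different events (type 1 and type 10 of class 4).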
Table 5-1: Network Event Classes

Event Class   Description
0             Network Management Layer
1             Applications Layer
2             Session Control Layer
3             Network Services Layer
4             Transport Layer
5             Data Link Layer
6             Physical Link Layer
7-31          Reserved for other common event classes
32-63         Reserved for RSTS specific event classes
64-95         Reserved for RSX specific event classes
96-127        Reserved for TOPS-20 specific event classes
128-159       Reserved for VMS specific event classes
160-191       Reserved for RT specific event classes
192-479       Reserved for future use
480-511       Reserved for Customer specific event classes

Table 5-2: Network Events

Class  Type  Entity        Event Text
0      0     none          Event records lost
0      1     node          Automatic node counters
0      2     line,circuit  Automatic data link counters
0      3     line,circuit  Automatic data link service
0      4     line,circuit  Data link counters zeroed
0      5     node          Node counters zeroed
0      6     line,circuit  Passive loopback
0      7     line,circuit  Aborted service request
2      0     none          Local node state change
2      1     none          Access control reject
3      0     none          Invalid message
3      1     none          Invalid flow control
3      2     node          Data base reused
4      0     none          Aged packet loss
4      1     circuit       Node unreachable packet loss
4      2     circuit       Node out-of-range packet loss
4      3     circuit       Oversized packet loss
4      4     circuit       Packet format error
4      5     circuit       Partial routing update loss
4      6     circuit       Verification reject
4      7     circuit       Circuit down, circuit fault
4      8     circuit       Circuit down, software fault
4      9     circuit       Circuit down, operator fault
4      10    circuit       Circuit up
4      11    circuit       Initialization failure, circuit fault
4      12    circuit       Initialization failure, software fault
4      13    circuit       Initialization failure, operator fault
4      14    node          Node reachability change
5      0     line,circuit  Locally initiated state change
5      1     line,circuit  Remotely initiated state change
5      2     line,circuit  Protocol restart received in maintenance mode
5      3     line,circuit  Send error threshold
5      4     line,circuit  Receive error threshold
5      5     line,circuit  Select error threshold
5      6     line,circuit  Block header format error
5      7     line,circuit  Selection address error
5      8     line,circuit  Streaming tributary
5      9     line,circuit  Local buffer too small
6      0     line          Data set ready transition
6      1     line          Ring indicator transition
6      2     line          Unexpected carrier transition
6      3     line          Memory access error
6      4     line          Communications interface error
6      5     line          Performance error

The following are examples of three DECnet entries in FULL format:

***********************************************
DECNET ENTRY
LOGGED ON 7-Dec 03:01:49   MONITOR UPTIME WAS 0 DAY(S) 9:9:33
DETECTED ON SYSTEM # 2102.
RECORD SEQUENCE NUMBER: 19.
***********************************************
Event type 4.10 Line up
From node 118. (MCB), occurred 7-DEC-1981 0:00:00.400
CIRCUIT = DMC-0
NODE = 121

***********************************************
DECNET ENTRY
LOGGED ON 7-Dec 03:01:50   MONITOR UPTIME WAS 0 DAY(S) 9:9:35
DETECTED ON SYSTEM # 2102.
RECORD SEQUENCE NUMBER: 20.
***********************************************
Event type 4.14 Node reachability change
From node 118. (MCB), occurred 7-DEC-1981 0:00:00.466
REMOTE NODE = 103 ()
STATUS = REACHABLE

***********************************************
DECNET ENTRY
LOGGED ON 7-Dec 03:02:02   MONITOR UPTIME WAS 0 DAY(S) 9:9:47
DETECTED ON SYSTEM # 2102.
RECORD SEQUENCE NUMBER: 21.
***********************************************
Event type 5.3 Send error threshold
From node 118. (MCB), occurred 7-DEC-1981 0:00:18.000
CIRCUIT = KDP-0-0

The following are examples of the same three DECnet entries above, but these are listed in SHORT format:

19.   03:01:49   DECNET Event type 4.10 Line up
                 From node 118. (MCB) occurred 7-DEC-1981 0:00:00.400
20.   03:01:50   DECNET Event type 4.14 Node reachability change
                 From node 118. (MCB) occurred 7-DEC-1981 0:00:00.466
21.   03:02:02   DECNET Event type 5.3 Send error threshold
                 From node 118. (MCB) occurred 7-DEC-1981 0:00:18.000

The following DECnet entry lists packet header information:

***********************************************
DECNET ENTRY
LOGGED ON 27-Feb-84 07:23:29-EST   MONITOR UPTIME WAS 1 DAY(S) 0:2:17
DETECTED ON SYSTEM # 2871.
RECORD SEQUENCE NUMBER: 120.
***********************************************
Event type 4.1 Node unreachable packet loss
From node 143. (GIDDN), uptime was 1 day(s) 16:56:39
Packet Header = 2 / 142 / 143 / 6

From left to right, the four fields listed with the packet header have the following meanings:

Field one (2) - is a hexadecimal value one byte long representing the message flags.

Field two (142) - is a decimal (unsigned) value two bytes long representing the destination node address.

Field three (143) - is a decimal (unsigned) value two bytes long representing the source node address.

Field four (6) - is a hexadecimal value one byte long representing the forwarding data.

Note that if the packet is a control packet, the packet header will contain only two fields: the message flags (Field one) and the source node address (Field three).

For more information on network event parameters, see Appendix E. For more information concerning DECnet Versions 3.0 and 4.0 entries, refer to the DECnet documentation for system managers and operators.

APPENDIX A
SPEAR MESSAGES

There are four general categories of SPEAR messages: User Validation Messages, Dialogue Usage Messages, Warning Messages, and Event File Messages. The following tables list these messages and suggested actions.

Table A-1: User Validation Messages

The following messages can occur because of an error on the user's part. Each message is preceded by the header:

?USER Validation failed

CODE or SEQUENCE not allowed in list of responses
    You have selected CODE or SEQUENCE as a response and have attempted to add another selection type.

Does not match any valid response
    Typed a response that did not match one of the list of valid responses.

End time must be later than begin time
    Typed an ending date/time that is prior to or the same as the beginning date/time in RETRIEVE or COMPUTE.

Invalid date format
    Typed date incorrectly. The correct format is dd-mmm-yy or -dd.

Invalid time format
    Typed time incorrectly. The correct format is hh:mm:ss.

Matches more than one valid response
    Typed a response that was not unique. Need to type more characters before pressing the RETURN key or ESCAPE key.

May not select all at this prompt
    You tried to select ALL when you must respond with specific names or numbers.

No recognition for this prompt
    Typed ESCAPE key where it is impossible to fill in the blanks.

Not a valid name or number
    If a name, typed a special character or more than the maximum number of characters. If a number, typed a special character or alphabetic character or more than the maximum number of digits.

That function is not available
    You typed a function name in the SPEAR library that does not exist in the same directory as SPEAR. This could happen if you do not have ANALYZE or if some of the programs are kept on tape.

Table A-2: Dialogue Usage Messages

The following messages can occur when you are responding to the dialogue incorrectly. They are meant to give you some insight as to what the correct response is to the current prompt.

Not one of the recognized types
    At RETRIEVE level, when specifying a device, you typed a ? after typing a few characters. SPEAR did not recognize the device as one of its physical devices.

Please select function first
    Typed a switch that requires some function to have been selected first (for example, /GO or /SHOW) at the SPEAR> prompt.

Unable to complete this response
    You typed an ESCAPE to a prompt that SPEAR does not know how to complete. This is true whenever the response is not one of a fixed list of possible responses, for example, time of day or file specification.
No default response for this prompt
    Typed the ESCAPE key or another delimiter where there is no default (at SPEAR> prompt, for example).

Table A-3: Warning Messages

The following is a list of warning messages you may receive during a SPEAR operation. Each message is introduced with the following sentence:

-- The following should be noted before proceeding

Impossible to input event records from the terminal!
    You specified TTY: in response to a request for a file specification.

The input file will be superseded!
    In RETRIEVE, you named the output file the same name as the input file. This means you will overwrite your input file if you proceed.

Will overwrite input file with ASCII output!
    In RETRIEVE, you specified the same name for both input and output files and also specified ASCII as the output format. If you proceed, the input file (which is binary) will be overwritten with ASCII output.

Binary output to terminal is unreadable!
    In RETRIEVE, you requested the BINARY report format and then specified TTY: in response to Output to:

Merging with self causes duplicate records!
    In RETRIEVE, you specified the same name for both the input file and the merge file. If you proceed, you will end up with a file containing duplicate records.

Will create an exact copy of the input file!
    In RETRIEVE, you selected all the events in the system event file and then requested them in BINARY format. This is a waste of effort because all you will have succeeded in doing is duplicating the system event file.

Will create an empty output file!
    In RETRIEVE, you have excluded everything during the selection process.

This function can cause SEVERE system degradation!
    You have turned on the KLSTAT switch, which slows down system operation to gather extra data into the system event file.

Table A-4: Event File Messages

The following messages can occur as the result of an error in the system event file. The message indicates a recoverable error.
Each message is preceded with the following header:

%SPEAR Event file error detected in module ______ routine ______

Bad header found - RESYNCHing
    Lost synchronization in file, resynchronizing in next file block. Some data has been lost.

EOF encountered while skipping an entry
    Error file is truncated for some reason. Some data has been lost.

Internal EOF found - RESYNCHing
    Internal end-of-file mark detected, but the file still has data. (This can happen if files are appended to each other.) No data is lost.

Premature EOF detected in error file!
    Encountered an EOF in the middle of a header or entry. File is truncated. Some data is lost.

You can also receive fatal error messages in the form:

?SPEAR Program error in module ______ routine ______

where the blanks are filled in with the module and routine names. These are SPEAR program errors over which you have no control. If you receive such an error, fill out a Software Performance Report describing the error and the situation leading up to the error.

Another error over which you have no control is an error from an internal program called XPORT. XPORT does not identify itself in the message. However, the message is preceded by a question mark, indicating, in this case, that this is a fatal error. If you receive an XPORT error message, you should also fill out a Software Performance Report.

Other possible messages you can receive originate from the operating system. For example:

?SPEAR Monitor call failed     (TOPS-20)
?SCNxxx message                (TOPS-10)

On TOPS-20, you should refer to the Monitor Calls Manual for a list of these messages. On TOPS-10, you should refer to the SCAN documentation for a list of SCAN messages.

APPENDIX B
COMMAND AND CONTROL FILES

Because of dialogue changes in RETRIEVE and SUMMARIZE, if you have existing SPEAR V1.0 command or control files, you must change them accordingly or they will not run.

For RETRIEVE, the changes from V1.0 to V2.0 are in the Selection type, Error and Nonerror fields.
There are no changes necessary if your command or control file specified a Selection type of Error, All. See Section 4.4.3 for the RETRIEVE dialogue changes. You can maintain the same functionality for an error selection changing the Vl.0 dialogue to the following V2.0 dialogue: SPEAR Vl.0 @SPEAR *RETRI.E.: VE *SERR: ERROR. SYS *INCLUDED *ERROR *DISK *RP06 *FINISHED *EARLIEST *LATEST *DSK:RETRIE.RPT */GO by SPEAR V2.0 @SPEAR *RETRIEVE *SERR:ERROR.SYS *INCLUDED *ERROR *DISK *RP06 *ALL (Here's the difference.) *FINISHED EARLIEST *LATEST DSK:RETRIE.RPT */GO To RETRIEVE the events for a specific device error type, replace the ALL in the previous V2.0 control file with one or more device error types, for example, Software, Bus, Channel-controller. For Nonerror selection, you can now select specific devices. Instead of Nonerror, specify Statistics, Configuration, Diagnostics, Other, or a combination of these separated by commas. SPEAR Vl.0 SPEAR V2.0 @SPEAR *RETRIEVE *SERR:ERROR.SYS *INCLUDED *NONERROR *EARLIEST *LATEST *DSK: RETRIE. RPT */GO @SPEAR RETRIEVE *SERR:ERROR.SYS *INCLUDED *STATISTICS,DIAGNOSTICS (Change) *DISK (Chang e) *RA60,RA80,RA8l (Change) *FINISHED (Change) *EARLIEST *LATE.:ST *DSK:RETRIE.RPT */GO B-1 COMMAND AND CONTROL FILES For SUMMARIZE, two new prompts have been added to the dialogue, Category and Show Error Distribution. You can maintain the same functionality by changing the VI.0 dialogue to the following V2.0 dialog ue: SPEAR VI.0 SPEAR V2.0 @SPEAR *SUMMARIZE *SERR:ERROR.SYS *EARLIEST *LATEST *DSK: SUMMAR. RPT */GO @SPEAR *SUMMARIZE *SERR: ERROR. SYS *ALL (Change) *EARLIEST *LA'rEST *YES (Change) *DSK: SUMMAR. RPT */GO To get summaries for a specific device or class of devices, replace ALL in the previous V2.0 dialogue with device sel~ction. For example: SPEAR V2.0 @SPEAR *SUMMARIZE *SERR: ERROR. SYS *DISK *RA60,RA80 *FINISHED *EARLIEST *LATEST *YES *DSK: SUMMAR. 
RPT */GO To suppress the error distribution charta, change the YES to NO in the dialogue. Because there are no changes in the dialogue for COMPUTE or KLSTAT, you need not change your previous control or command files for these functions. B-2 APPENDIX C EVENT CODES The following table contains the current list of TOPS-10 and TOPS-20 event codes along with their internal class. The dashes (---) indicate that the event code does not exist under the specified operating system. Table C-l: -10 Code 001 002 005 006 007 010 011 012 014 --- 015 016 017 021 030 031 033 040 042 045 050 052 054 055 056 057 061 062 063 064 --- 066 067 071 072 201 TOPS-10 and TOPS-20 Event Codes Name SYSTEMRELOAD MONITORBUGDATA EXTRACTEDCRASHINFO CHANNELERRORREPORT DAEMONSTARTED OLD DISK ERROR MASSBUSERR DX20ERR SOFTWAREEVENT STATISTICS CONFIGCHANGE SYSERRORLOG SOFTWAREREQDATA TAPE ERR FEDEVICE-ERR FERELOAD KSHALTSTATUS OLDDISKSTATS TAPESTATS DISKSTATS DLHARDWAREERROR KLI>ARNXMINT KSNXMTRAP KLORKSPARTRAP NXMMEMORYSWEEP PARMEMORYSWEEP CPUPARTRAP CPUPARINT KLCPUSTA'rus DEVICESTATUS MF20ERR OLDKLADDRESSFAIL KLADDRESSFAIL LP100ERR HARDCOPYERR NETCONSTARTED -20 Code In terna1 Class Subsystem 101 102 ERROR ERROR ERROR ERROR CONFIG ERROR ERROR ERROR ERROR STATISTICS CONFIG ERROR ERROR ERROR ERROR/CONFIG CONFIG ERROR STATISTICS STATISTICS STA'rrSTICS ERROR ERROR ERROR ERROR ERROR ERROR ERROR ERROR ERROR ERROR ERROR ERROR ERROR ERROR ERROR CONFIG MONITOR MONITOR MONITOR MAINFRAME SOFTWARE DISK DISK/TAPE DISK/TAPE SOFTWARE DISK/TAPE (ALL) SOFTWARE SOFTWARE T.APE MAIN/UNIT/COMM MAINFRAME MAINFRAME DISK TAPE DISK COMM MAINFRAME MAINFRAME MAINFRAME MAINFRAME MAINFRAME MAINFRAME MAINFRAME CRASH CRASH MAINFRAME MAINFRAME MAINFRAME UNITRECORD UNITRECORD NETWORK --------111 ----114 115 116 ----130 131 133 ------------------160 162 163 --164 ------- --- 201 C-1 EVENT CODES Table C-l: TOPS-l~ and TOPS-29 Event Codes (Cant. 
) -1~ Code Name 202 203 210 211 220 221 222 230 231 232 233 234 240 242 243 244 245 250 NODEDOWNLINELOAD NODEDOWNLINEDUMP NETHARDWAREERR NETSOFTWAREERR NETOPRLOGENTRY NNETTOPOLOGYCHANGE NETCHECKllREPORT NETLINESTATS NETNODESTATS OLDDN64STATS DN6XSTATS DN6XENABLEDISABLE DECnet Entry HSC 50 END PACKET HSC50 ERROR LOG KLIPA EVENT MSCP ERROR DIAGNOSTIC EVENT -20 Code Internal Class Subsystem 202 203 CONFIG CONFIG ERROR ERROR ERROR CONFIG CONFIG STA'rISTICS STATISTICS STATISTICS STATISTICS CONFIG ERROR ERROR ERROR ERROR ERROR DIAGNOSTIC NETWORK NETWORK NETWORK NETWORK NETWORK NETWORK NETWORK NETWORK NETWORK NETWORK NETWORK NETWORK NETWORK DISK/TAPE DISK/TAPE CI CI (ALL) 21~ 211 22~ 221 222 230 231 232 233 234 240 242 243 244 245 250 C-2 APPENDIX D DISK SUBSYSTEM ERROR BITS The following charts list the categories into which fall in the SUMMARIZE report for Disk Subsystems. the error bits For example, if the SUMMARIZE report states that your RP06 has six SK-SR (SEEK-SEARCH) errors, you may want to know what specific RP06 error bits are considered to be in this category. If you go to the SK-SR chart and look under device for RP04,5,6 (which means either RP04, RP05, or RP06), you will see that this chart shows that anyone of the three error bits listed is considered as a SEEK-SEARCH error. The headings have the following meanings: ERROR NAME The name Guide. DEVICE The device type. REG The register containing the error bit. BIT The position of the error bit. 
COMMENTS Any qualifiers if applicable listed in the KL10 The following is a list of the charts that will follow: TIMIN SK-SR READ CH-CO BUS SOFT MICRO UNSAF WRTLK OFFLI = TIMING SEEK-SEARCH READ-WRITE CHANNEL-CONTROLLER BUS HARDWARE DETECTED SOFTWARE ERROR MICROPROCESSOR DETECTED ERROR UNSAFE WRITE LOCK OFFLINE 0-1 Maintenance DISK SUBSYSTEM ERROR BITS *-*-*-*-*-*-*-*-*-*-* * * * * TIMIN * * *-*-*-*-*-*-*-*-*-*-* ImROR NAME DEVICE REG BIT OP INC DRIVE TIMING ERR INDEX ERROR RP04,5,6 RP04,5,6 RP04,5,6 ERR 1 ERR 1 ERR 2 13 12 11 INDEX UNSAFE DRIVE TIMING ERR OP INC RP07 RP07 RP07 ERR 3 ERR 1 ERR 1 06 12 13 OP INC RM03,5 ERR 1 13 OP INC DRIVE TIMING ERR RK07 RK07 RKER RKER 13 12 E0 E3 RL02 RL02 RLCS RLCS Comments See note after last chart See note after last chart *-*-*-*-*-*-*-*-*-*-* * * * * SK-SR * * *-*-*-*-*-*-*-*-*-*-* ERROR NAME DEVICE REG BIT SEEK INC OFF CYL HEADER COMP ERR RP04,5,6 RP04,5,6 RP04,5,6 ERR 3 ERR 3 ERR 1 14 15 07 SEEK INC LOSS CYL ERROR HEADER COMP ERR RP07 RP07 RP07 ERR 3 ERR 3 ERR 1 14 09 07 HEADER COMP ERR SEEK INC RM03,5 RM03,5 ERR 1 ERR 2 07 14 SEEK INCOMPLETE RK07 DRIVE OFF TRACK RK07 HEADER VERTICALRC RK07 RKER RKDS RKER 01 05 08 SEEK TIME OUT RLMP RLCS 12 HI RL02 RL02 Comments See note after last chart D-2 DISK SUBSYSTEM ERROR BITS *-*-*-*-*-*-*-*-*-*-* * * READ * * * * *-*-*-*-*-*-*-*-*-*-* ERROR NAME DEVICE REG BIT DATA CHECK HEADER CRC ERR FORMAT ERR RPQJ4,S,6 RPQJ4,S,6 RPQJ4,S,6 ERR 1 ERR 1 ERR 1 IS QJ8 QJ4 BAD SECTOR ERR DATA CHECK HEADER CRC ERR FORMAT ERR SYNC BYTE ERROR RPQJ7 RPQJ7 RPQJ7 RPQJ7 RPQJ7 ERR 3 ERR 1 ERR 1 ERR 1 ERR 3 IS IS QJ8 QJ4 QJ2 BAD SECTOR ERR DATA CHECK HEADER CRC ERR FORMAT ERR RMQJ3,S RMQJ3,S RMQJ3,5 RMQJ3,5 ERR 2 ERR 1 ERR 1 ERR 1 IS IS QJ8 QJ4 BAD SECTOR ERR DATA CHECK ECC HARD ERR FORMAT ERR RKQJ7 RKQJ7 RKQJ7 RKQJ7 RKER RKER RKER RKER QJ7 IS QJ6 QJ4 E2 RLQJ2 RLCS Comments See note after last chart *-*-*-*-*-*-*-*-*-*-* * * CH-CO * * * * *-*-*-*-*-*-*-*-*-*-* ERROR NAME DEVICE REG BIT Comments CHAN ERR OVER 
RUN RHIQJ RHIQJ CONI CONI 2QJ 22 and no drive errors CHAN ERR OVER RUN RH2QJ RH2QJ CONI CONI 22 26 and no drive errors IS TIMEOUT RD SUB INV MAP MAP PE DATA LATE RH78QJ RH78QJ RH78QJ RH78QJ RH78QJ MBA SR MBA SR MBA SR MBA SR MBA SR QJl QJ2 04 QJ5 11 and no drive errors NOM EX MEM SPE INV MAP MAP PE DATA LATE RH75QJ RH7SQJ RH75QJ RH75QJ RH75QJ MBA SR MBA SR MBA SR MBA SR MBA SR QJl 14 QJ4 QJ5 11 and no drive errors NON EX MEM DATA LATE WRITECHECK RKQJ7 RKQJ7 RKQJ7 RKCS2 RKCS2 RKCS2 11 15 14 and Not Data Check E4 RLQJ2 RLCS See note after last chart 0-3 DISK SUBSYSTEM ERROR BITS *-*-*-*-*-*-*-*-*-*-* * * BUS * * **-*-*-*-*-*-*-*-*-*-** ERROR NAME DEVICE REG BIT RAE MDPE PARITY ERR RH10 RH10 RH10 CONI CONI ER 1 29 18 03 RAE MDPE PARITY ERR RH20 RH20 RH20 CONI CONI ERR 1 24 18 03 MCPE NON EX DRIVE MDPE PARITY ERR RH780 RH780 RH780 RH780 MBA SR MBA SR MBA SR ERR 1 17 18 06 03 MCPE NON EX DRIVE MDPE PARITY ERR RH750 RH750 RH750 RH750 MBA SR MBA SR MBA SR ERR 1 17 18 06 03 PARITY ERR DATA PARITY ERROR RP07 RP07 ERR 1 ERR 3 03 03 NON EX DRIVE RK07 DR TO CNTRL PE RK07 CNTRL TO DR PE RK07 CONTROLLER TIMEOUT RK07 MULTIPLE DRIVE SEL RK07 UNIT FIELD ERR RK07 RKCS2 RKCSI RKER RKCSI RKCS2 RKCS2 12 13 03 11 09 08 DRIVE SEL ERR RLMP 08 RL02 Comments and no Class B device errors D-4 DISK SUBSYSTEM ERROR BITS *-*-*-*-*-*-*-*-*-*-* * * SOFT * * **-*-*-*-*-*-*-*-*-*-** ERROR NAME DEVICE REG BIT INVALID ADDR ERR AD DR OVERFLOW ERR REG MOD RFSD ILL REG ILL FUNCTION RP04,5,6 RP04,5,6 RP04,5,6 RP04,5,6 RP04,5,6 ERR ERR ERR ERR ERR 1 1 1 1 1 10 09 02 01 '00 INVALID ADDR ERR ADDR OVERFLOW ERR REG MOD RFSD ILL REG ILL FUNCTION PROG ERR RP07 RP07 RP07 RP07 RP07 RP07 ERR ERR ERR ERR ERR ERR 1 1 1 1 1 2 10 09 02 01 00 15 INVALID ADDR ERR PROGRAM ERROR ADR OVERFLOW ERR DRIVE TYPE ERR NONEXECUTIBLE FNC ILL FUNCTION RK07 RK07 RK07 RK07 RK07 RK07 RKER RKCS2 RKER RKER RKER RKER 10 10 09 05 02 00 Comments *-*-*-*-*-*-*-*-*-*-* * * MICRO * * **-*-*-*-*-*-*-*-*-*-** ERROR NAME DEVICE RP07 CROM PARITY 
ERR MP UNSAFE RP07 RP07 DEFECT SKIP ERR CONTROL LGIC FAIL RP07 LOSS OF BIT CLOCK RP07 MP HANDSHAKE RP07 SERDES DATA FAIL RP07 SYNC CLOCK FAIL RP07 RP07 RUNTIME OUT RP07 FAULT CODE REG ERR ERR ERR ERR ERR ERR ERR ERR ERR ERR BIT 2 2 3 3 3 3 3 3 3 2 Comments 14 13 13 11 HI 08 04 01 00 00-07 0-5 Any nonzero value DISK SUBSYSTEM ERROR BITS *-*-*-*-*-*-*-*-*-*-* * * UNSAF * * **-*-*-*-*-*-*-*-*-*-** BIT ERROR NAME DEVICE REG AC LOW DC LOW WR OS DC UN NO H SEL MULTI H SEL TRAN UNSF TRAN DET F C SW UNSF W-SEL UNSF C SK UNSF ACUN PLO UNS 30VU WRITE UN SF WR C UNSF RP04,5,6 RP04,5,6 RP05,6 RP05,6 RP04,5,6 RP04,5,6 RP04,5,6 RP04,5,6 RP04,5,6 RP04,5,6 RP04,5,6 RP04 RP04,5,6 RP04 RP04,5,6 RP04,5,6 ERR ERR ERR ERR ERR ERR ERR ERR ERR ERR ERR ERR ERR ERR ERR ERR 3 3 3 3 2 2 2 2 2 2 2 2 2 2 2 2 06 05 01 00 10 09 06 05 03 02 01 15 13 12 08 00 UNSAFE R/W 3 UNSAFE RP07 RP07 ERR 1 ERR 2 14 12 R/W 2 UNSAFE R/W 1 UNSAFE WRITE OVERRUN WRITE READY UNSAF WRITE CURENT FAIL DC UNSAFE RP07 RP07 RP07 RP07 RP07 RP07 ERR ERR ERR ERR ERR ERR 2 2 2 2 3 3 10 09 08 12 05 UNSAFE DEVICE CHK RM03,5 RM03,5 ERR 1 ERR 2 14 07 UNSAFE SPEED LOSS ACLO RK06,7 RK06,7 RK06,7 RKER RKDS RKDS 14 04 03 WRITE DATA ERR CURRENT HEAD ERR SPEN ERR WRITE GATE ERR RL01,2 RL01,2 RL01,2 RL01,2 RLMP RLMP RLMP RLMP 15 14 11 10 Comments REG 2<11-13>RD/WRT1-3~REG3<5>DC 11 and Not Write Locked 0-6 UNS DISK SUBSYSTEM ERROR BITS *-*-*-*-*-*-*-*-*-*-* * * WRTLK * * * * *-*-*-*-*-*-*-*-*-*-* ERROR NAME DEVICE REG BIT WRITE LOCK ERR RP04,5,6 ERR 1 11 WRITE LOCK ERR RP07 ERR 1 11 WRITE LOCK ERR RM03,5 ERR 1 11 WRITE LOCK ERR RK07 RKER 11 WRITE LOCK RL02 RLMP 13 Comments and Wri te Gate Error *-*-*-*-*-*-*-*-*-*-* * * OFFLI * * * * *-*-*-*-*-*-*-*-*-*-* ERROR NAME DEVICE REG BIT Comments MEDIUM ON LINE RP04,5,6 DS 12 OFFLINE when not true MEDIUM ON LINE RP07 DS 12 OFFLINE when not true MEDI UM ON LINE RM03,5 DS 12 OFFLINE when not true !***** RL02 NOTE **** NOTE THAT THESE 3 BITS (10,11,& 12) OF THE CS REG ARE GROUPED TO DETERMINE THE 
ERROR AS FOLLOWS (x means we don't care the state of the bit) 12 11 10 RESULT DLT CRC OPI o 0 1 E0 OPI xII HEADER CHECK El DATA CRC IF READ OPERATION E2 x 1 0 WRITE CHECK IS WRITE OPERATION E3 1 x 1 HEADER NOT FOUND E4 x o DATA LATE 1 !***** 0-7 APPENDIX E NETWORK EVENT PARAMETERS Network Management Layer Event Parameters - Class 0 Type 1 2 3 Keywords SERVICE o = LOAD 1 = DUMP STATUS Return code o = REQUESTED >0 = SUCCESSFUL <0 = FAILED Error detail (if error) Error message (optional) OPERATION o = INITIATED 1 = TERMINATED REASON o Receive timeout 1 = Receive error 2 Line state change by higher level 3 Unrecognized request 4 = Line open error Session Control Layer Event Parameters - Class 2 Type Keywords 0 REASON 0 = Operator command 1 = Normal operation OLD STATE 0 = ON 2 = SHUT 1 = OFF 3 RESTRICTED NEW STATE {(} = ON 2 SHUT 1 = OFF RESTRICTED 3 SOURCE NODE SOURCE PROCESS DESTINATION PROCESS USER PASSWORD (0 means password set; n parameter means not set) ACCOUNT 1 2 3 4 5 6 7 8 E-l NETWORK EVENT PARAMETERS Network Services Layer Event Parameters - Class 3 Type Keywords o MESSAGE Message flags Destination link address Source link address Data CURRENT FLOW CONTROL o No flow control 1 = Segment flow control 2 Message flow control 1 Routing Layer Event Parameters - Class 4 Type 1 2 3 4 5 6 7 Keywords PACKET HEADER Message flags Destination node address (not for control packet) Source node address Forwarding data (not for control packet) PACKET BEGINNING HIGHEST ADDRESS NODE EXPECTED NODE REASON o Line synchronization lost 1 Data errors 2 Unexpected packet type 3 Routing update checksum error 4 Adjacent node address change 5 Verification receive timeout 6 Version skew 7 Adjacent node address out of range 8 Adjacent node block size too small 9 Invalid verification seed value 10 Adjacent node listener received timeout 11 Adjacent node listener received invalid data RECEIVED VERSION STATUS o = REACHABLE 1 UNREACHABLE E-2 NETWORK EVENT PARAMETERS Data Link Layer Event 
Parameters - Class 5

Type   Keywords
0      OLD STATE
           0 = HALTED
           1 = ISTRT
           2 = ASTRT
           3 = RUNNING
           4 = MAINTENANCE
1      NEW STATE
           0 = HALTED
           1 = ISTRT
           2 = ASTRT
           3 = RUNNING
           4 = MAINTENANCE
2      HEADER
3      SELECTED TRIBUTARY
4      PREVIOUS TRIBUTARY
5      TRIBUTARY STATUS
           0 = Streaming
           1 = Continued send after timeout
           2 = Continued send after deselect
           3 = End streaming
6      RECEIVED TRIBUTARY
7      BLOCK LENGTH
8      BUFFER LENGTH
9      DTE
10     REASON
11     (Reserved)
12     (Reserved)
13     PARAMETER TYPE
14     CAUSE
15     DIAGNOSTIC

Physical Line Layer Event Parameters - Class 6

Type   Keywords
0      DEVICE REGISTER
1      NEW STATE
           0 = OFF
           1 = ON

APPENDIX F
GLOSSARY

The following is a list of terms explained within the context of this document.

Term
    Explanation

Body section
    The data portion of an entry in the system event file.

BUGCHK
    A recoverable error detected by the TOPS-20 operating system.

BUGHLT
    A non-recoverable error detected by the TOPS-20 operating system.

BUGINF
    A message informing you that a certain event relating to the TOPS-20 operating system has occurred.

CTY
    The system operator's terminal.

Dump format
    One of the three output forms of the RETRIEVE procedure.

Entry type
    The type of entry within a system event file, for example, a MASSBUS Device Error, or a Crash Restart Error.

ERROR.SYS
    The name of the system event file in both the TOPS-10 and TOPS-20 operating systems.

Event code
    The octal code designated to a particular event in the system event file.

FRU
    An acronym for Field Replaceable Unit. This is a piece of hardware that the Field Service engineer can replace on the spot.

Full format
    A complete and detailed listing of an event, in ASCII as translated with RETRIEVE.

Hard error
    A non-recoverable error.

Header section
    The top portion of an entry in the system event file, after SPEAR formats it.

MTTR
    An acronym for Mean Time To Repair. The average time it takes a Field Service engineer to isolate and repair a system malfunction.

NXM error
    An attempt to address a nonexistent memory location.

Parity error
    Indicates that one or more bits have been picked up or dropped to cause a nonparity condition.

RETRIE.RPT
    A file containing entries converted from binary to ASCII.

RETRIE.SYS
    A file in binary format containing entries extracted from the system event file.

Retry count
    The number of times an operation is tried, in addition to the first time.

Sequence number
    The number given to an entry in the system event file.

Short format
    A brief version of an entry in the system event file, after SPEAR has translated it.

Snapshot
    The information gathered by the operating system immediately after recovering from a crash.

Soft error
    A recoverable error.

Stopcode
    A message containing a 3-letter code printed at the CTY indicating that a serious error has occurred in the operating system's data base.

System event file
    The file where the operating system records hardware and software events.

Sweep
    After certain events occur, the operating system checks core looking for more of the same.
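Because the glossary defines an event code as an octal value, a program that reads the codes listed in Appendix C has to interpret them in base 8. The following Python sketch is illustrative only and is not part of SPEAR. The names SAMPLE_EVENT_CODES, event_code_value, and format_event_code are hypothetical, and the four sample codes are copied from Table C-1.

```python
# Hypothetical helpers (not part of SPEAR) for handling the octal event
# codes printed in Table C-1.

# A few codes and names copied from Table C-1, keyed by the octal text
# exactly as it is printed in the table.
SAMPLE_EVENT_CODES = {
    "222": "NETCHECK11REPORT",
    "230": "NETLINESTATS",
    "240": "DECnet Entry",
    "250": "DIAGNOSTIC EVENT",
}


def event_code_value(octal_text: str) -> int:
    """Interpret a printed event code such as '250' as a base-8 number."""
    return int(octal_text, 8)


def format_event_code(value: int) -> str:
    """Format an integer as a three-digit octal event code, e.g. 1 -> '001'."""
    return format(value, "03o")


if __name__ == "__main__":
    for code, name in SAMPLE_EVENT_CODES.items():
        print(code, event_code_value(code), name)
```

Keeping the printed octal text as the dictionary key avoids confusing a code such as 250 with the decimal number 250.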
F-2 INDEX -D- -AABS, 4-39 ACL, 4-39 ACU, 4-39 AOE, 4-38 AVAIL.SYS, 4-47 -BBody section, 2-5, F-l /BREAK switch, 4-4 BUGCHK, 2-2, 5-30, F-l BUGHLT, 2-2, 5-30, F-l BUGINF, 2-2, F-l -C- CAl, 4-5 Channel failures, 2-3 Chargeable downtime, 4-48 Checking error, 3-1 loop, 3-3 range, 3-3 software error, 3-3 sum, 3-3 validity, 3-3 Checksum, 3-3 CM, 5-14, 5-40 Command HELP, 4-3 Command and Control Files, B-1 Completing next field, 4-4 COMPUTE formulas, 4-48 COMPUTE full report, 4-53 COMPUTE function, 4-1, 4-47 COMPUTE procedure, 4-49 COMPUTE report, 4-47 COMPUTE summary report, 4-53 COMPUTE.STATISTICS, 4-47, 4-48 Computer-aided instruction, 4-5 CONFIG program, 5-14 Configuration status change, 5-14, 5-40 Controller failures, 2-3 Conventions record, 2-6 COR/CRC, 4-40 CPU failures, 2-3 CPU status block, 5-26 Crash extract, 5-5 CS/ITM, 4-40 CSF, 4-39 CSU, 4-39 CTRL/F, 4-4 CTRL/U, 4-4 CTRL/W, 4-4 CTY, F-l DAEMON started, 5-8 Data channel error, 5-8 DCK, 4-38 DCL, 4-39 DCU, 4-39 DECnet Entries, 5-56 Deleting current line, 4-4 Deleting previous field, 4-4 Detecting error, 3-1 Device status block, 5-28 Device types, 4-9 Dialogue SPEAR, 4-2 Dialogue usage messages, A-2 DIS, 4-39 Disk statistics, 5-20 DL10 communications error, 5-22 DN, 5-14, 5-40 DPA, 4-40 DTE, 4-38, 4-40 Dump format, F-l DX20 device error, 5-10, 5-37 -EECH, 4-38 Entries hardware, 2-2 performance, 2-4 software, 2-2 TOPS-10, 5-2 TOPS-20, 5-30 Entry descriptions, 5-1 Entry type, F-l Error bits, 0-1 Error checking, 3-1 Error detecting, 3-1 Error detectors hardware, 3-1 parity, 3-3 threshold, 3-3 timing, 3-3 Error register codes, 4-38 ERROR.SYS, F-l Event Codes, C-l Event codes, F-l Event file, 4-12 Event file messages, A-4 Executing SPEAR, 4-4 Exiting from SPEAR, 4-5 Extra error reporting, 4-46 -FFailures channel, 2-3 Index-l Failures (Cont.) 
controller, 2-3 CPU, 2-3 I/O device, 2-3 intermittent, 3-1 memory, 2-3 solid, 3-1 types of, 3-1 FCE, 4-40 Features HELP, 4-3 FEN, 4-39 FER, 4-38 Field completing next, 4-4 deleting previous, 4-4 File specifications, 4-4 Files indirect, 4-2 FMT, 4-40 Format full, 4-24 octal, 4-21 record, 2-5 short, 4-20 Front end reload, 5-18 Front end reloaded, 5-42 Front-end device report, 5-18, 5-41 FRU, F-1 Full format, 4-24, F-1 Function COMPUTE, 4-1, 4-47 INSTRUCT, 4-1 KLERR, 4-1, 4-24 KLSTAT, 4-1 RETRIEVE, 4-1, 4-8 SUMMARIZE, 4-1, 4-32 -G- Glossary, F-1 /GO switch, 4-4 -HHard error, F-1 Hardware entries, 2-2 Hardware error detectors, 3-1 HCE, 4-38 HCRC, 4-38 Header sample, 2-5 Header section, 2-5, F-1 HELP command, 4-3 Help features, 4-3 /HELP switch, 4-3, 4-4 -II/O device failures, 2-3 IAE, 4-38 ILF, 4-38, 4-40 ILR, 4-38, 4-40 INC/UPE, 4-40 Indirect files, 4-2 Input KLERR, 4-25 RETRIEVE, 4-8 INSTRUCT as a reference, 4-7 INSTRUCT function, 4-1 Intermittent failures, 3-1 Isolation techniques, 3-4 IXE, 4-39 -K- KL CPU status block, 5-45 KL10 parity interrupt, 5-22 KL10 parity trap, 5-24 KLERR entry, 4-10 KLERR front end report, 5-50 KLERR function, 4-1, 4-24 KLERR input, 4-25 KLERR output, 4-30 KLERR Procedure, 4-25 KLSTAT function, 4-1 KLSTAT mode, 5-45 KLSTAT procedure, 4-46 KLSTAT switch, 5-45 KS10 Halt status block, 5-18 KS10 NXM trap, 5-23 -LLibrary SPEAR, 4-1 Line Printer error, 5-29 Loop checking, 3-3 -M- Magtape statistics, 5-19 Magtape system error, 5-16 MASSBUS device error, 5-33 MASSBUS disk error, 5-9 MASSBUS disk registers, 4-38 Memory failures, 2-3 Memory sweep for NXM, 5-25 Memory sweep for parity, 5-26 MF20 device report, 5-47 MHS, 4-39 MSCP, F-l MSE, 4-39 MTTR, F-1 -N- Napierian logarithm, 4-48 NEF, 4-40 NETCON, 5-52 Network CHECK1l report, 5-54 Network control started, 5-52 Network down-line load, 5-53 Network entries, 5-52 Network Event Classes, 5-56 Index-2 Network event Classes, 4-9 Network event Parameters, E-I Network hardware error, 5-53 Network line 
statistics, 5-55 Network up-line dump, 5-52 NHS, 4-39 Non-reload monitor error, 5-3 Nonchargeable downtime, 4-48 NSG, 4-40 NXM error, F-l Returning to previous prompt, 4-4 Returning to SPEAR prompt, 4-4 /REVERSE switch, 4-4 RMR, 4-38, 4-40 RP06, 5-10 Running SPEAR, 4-1 -S- -0- Octal format, 4-21 OCYL, 4-39 OPE, 4-39 OPI, 4-38, 4-40 OPR, 5-15, 5-41 OT, 5-14, 5-40 -PPacket file, 4-12 PAR, 4-38, 4-40 Parity error, F-l Parity error detectors, 3-3 PEF/LRC, 4-40 Performance entries, 2-4 PLU, 4-39 PM, 5-14, 5-40 Procedure COMPUTE, 4-49 KLERR, 4-25 KLSTAT, 4-46 RETRIEVE, 4-11 SUMMARIZE, 4-40 Processor parity interrupt, 5-44 Processor parity trap, 5-43 PSU, 4-39 -QQuestion mark switch (I?), 4-4 -R- R&W, 4-39 Range checking, 3-3 Record conventions, 2-6 Record format, 2-5 Report COMPUTE, 4-47 COMPUTE full, 4-53 COMPUTE Summary, 4-53 SUMMARIZE, 4-32 RETRIE.RPT, F-l RETRIE.SYS, F-l RETRIEVE error class, 4-9 RETRIEVE function, 4-1, 4-8 RETRIEVE input, 4-8 RETRIEVE output, 4-10 RETRIEVE procedure, 4-11 Retry count, F-l SA, 4-48 Sample header, 2-5 Sample KLERR session, 4-29 Sample RETRIEVE session, 4-19 Sample SUMMARIZE session, 4-44 SE, 4-48 Section body, 2-5 header, 2-5 Separators, 4-3 Sequence number, F-l Setting student 10, 4-6 Short format, 4-20, F-l /SHOW switch, 4-4 SKI, 4-39 Snapshot, F-l Soft error, F-l Software entries, 2-2 Software error checking, 3-3 Software event, 5-13 Software requested data, 5-16 Solid failures, 3-1 SPEAR course menu, 4-7 SPEAR dialogue, 4-2 SPEAR library, 4-1 SPEAR messages, A-I SPEAR switches, 4-4 STOPCO, 2-2, F-2 Stopcodes, 2-2, F-2 Sum checking, 3-3 SUMMARIZE function, 4-1, 4-32 SUMMARIZE procedure, 4-40 SUMMARIZE report, 4-32 Sweep, F-2 Switch /GO, 4-4 /HELP, 4-3, 4-4 question mark, 4-4 /REVERSE, 4-4 /SHOW, 4-4 System availability, 4-48 System effectiveness, 4-48 System event file, 5-1, F-2 System log entry, 5-15, 5-41 System reload, 5-3 -TTOF, 4-39 Techniques isolation, 3-4 verification, 3-5 Terminators, 4-3 TGHA, 5-47 Index-3 Theory, F-2 
Threshold error detectors, 3-3
Time window, 3-5
Timing error detectors, 3-3
TOPS-10 entries, 5-2
TOPS-20 entries, 5-30
TOPS-20 system reloaded, 5-30
Total runtime, 4-48
TUF, 4-39
Types of failures, 3-1

-U-
UA, 4-48
Unit record error, 5-30
UNS, 4-38, 4-40
Usage cycle, 4-48
User availability, 4-48
User validation messages, A-1
UWR, 4-39

-V-
35V, 4-39
Validity checking, 3-3
Verification techniques, 3-5
30VU, 4-39
VUF, 4-39

-W-
Warning messages, A-3
WCF, 4-38
WCU, 4-39
WHY RELOAD?, 4-48
WLE, 4-38
WOF, 4-39
WRU, 4-39
WSU, 4-39

-X-
XPORT messages, A-4

TOPS-10/TOPS-20 SPEAR Manual
AA-J833B-TK

READER'S COMMENTS

NOTE: This form is for document comments only. DIGITAL will use comments submitted on this form at the company's discretion. If you require a written reply and are eligible to receive one under Software Performance Report (SPR) service, submit your comments on an SPR form.

Did you find this manual understandable, usable, and well-organized? Please make suggestions for improvement.

Did you find errors in this manual? If so, specify the error and the page number.

Please indicate the type of reader that you most nearly represent.

[ ] Assembly language programmer
[ ] Higher-level language programmer
[ ] Occasional programmer (experienced)
[ ] User with little programming experience
[ ] Student programmer
[ ] Other (please specify) ____________________

Name ____________________ Date ____________________
Organization ____________________ Telephone ____________________
Street ____________________
City ____________________ State __________ Zip Code or Country __________

No Postage Necessary if Mailed in the United States
POSTAGE WILL BE PAID BY ADDRESSEE

SOFTWARE PUBLICATIONS
200 FOREST STREET MR01-2/L12
MARLBOROUGH, MA 01752

Do Not Tear - Fold Here and Tape