This "TOPS-10 Crash Analysis Guide" (January 1989, for TOPS-10 Version 7.04 and GALAXY Version 5.1) is a procedural and reference manual for experienced TOPS-10 system programmers. Its primary purpose is to provide methods, tools, and procedures for analyzing TOPS-10 system crashes, diagnosing their causes (hardware or software), and suggesting solutions.
The document outlines a systematic approach to crash analysis:
- Introduction to Crashes: Explains system error recovery, distinguishes between fatal and non-fatal errors, and details how the system creates a "crash file" (a memory dump) when an unrecoverable error occurs.
- Tools and Information Sources: Key tools include FILDDT (File DDT) for examining crash files, EDDT (Exec DDT) for debugging the running monitor, CRSCPY for copying crash files, and SPEAR for error log reports. Essential information sources are CTY output, crash files, monitor source code listings, operator logs, and monitor table descriptions.
- Examining a Crash File: Covers how to create a crash file, use FILDDT to load monitor symbols and map memory addresses (translating virtual addresses to physical ones through the EPT/UPT), verify the dump's integrity, automate common steps with FILDDT command files, and extract stopcode information (date, time, CPU, job details).
- Locating the Failure: Shows how to interpret information from the crash file to reconstruct the system's state at the time of the crash. Topics include hardware mapping, paging pointers, extended addressing, monitor-resident user data, the Program Counter Word, processor modes (user/exec), the Priority Interrupt system, device interrupt service, traps (especially Page Fail Traps), clock level functions, and the state of the accumulators and push-down lists. It also describes the monitor's modular organization and works through examples of locating failures for specific stopcodes (IME, UIL, KAF).
- Examining Data Structures: Provides an in-depth look at the monitor's various data structures crucial for diagnosis, such as job-related tables (e.g., JBTSTS, PDB), CPU data structures (CDB), memory data structures (PAGTAB, MEMTAB), command processing tables, UUO processing tables, I/O data structures (JDA, DDB, LDB), terminal chunks, tape drives, disk structures (MFD, UFD, RIB, HOME block, SAT), and the software disk cache.
- Error Handling Routines: Describes how the monitor handles hardware and software errors, the types of hardware error messages, the APR Interrupt Routine, the Page Fail Trap Routine, saved hardware error information, and hardware error checking. It details "stopcodes" (symbolic error names) by type (HALT, STOP, JOB, CPU, DEBUG, INFO, EVENT), their processing by the DIE routine, and special stopcodes such as Keep-Alive Fail (KAF), Illegal Memory Reference (IME), and Executive UUO Error (EUE). It also covers errors detected by the RSX-20F front-end.
- Debugging the Monitor: Explains how to actively debug the monitor using FILDDT for minor patches to the running system and EDDT (a special DDT version) for more extensive debugging, including setting breakpoints and understanding multi-CPU debugging considerations.
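
Several of the topics above (the Program Counter Word, paging pointers, virtual-to-physical mapping) rest on two pieces of word arithmetic: a 36-bit word splits into two 18-bit halves, which DDT conventionally displays in octal as `left,,right`, and an 18-bit in-section address splits into a 9-bit page number and a 9-bit offset within a 512-word page. The following is a minimal Python sketch of that arithmetic only. It is not taken from the guide; the function names are hypothetical, and `page_map` is an assumed stand-in for whatever page-number translation an analyst reads out of the real map, not a model of the EPT/UPT layout.

```python
# Illustrative word arithmetic for 36-bit dump analysis.
# All names here are hypothetical; this does not model the real EPT/UPT format.

WORD_MASK = (1 << 36) - 1   # a machine word is 36 bits
HALF_MASK = (1 << 18) - 1   # each half word is 18 bits
PAGE_BITS = 9               # 512-word pages: 9-bit offset within a page
OFFSET_MASK = (1 << PAGE_BITS) - 1


def halves(word):
    """Return (left half, right half) of a 36-bit word."""
    word &= WORD_MASK
    return word >> 18, word & HALF_MASK


def fmt_halves(word):
    """Format a word in octal as 'left,,right', in the style of DDT output."""
    lh, rh = halves(word)
    return f"{lh:o},,{rh:o}"


def split_address(addr):
    """Split an 18-bit in-section address into (page number, offset in page)."""
    addr &= HALF_MASK
    return addr >> PAGE_BITS, addr & OFFSET_MASK


def physical_address(page_map, addr):
    """Translate an address using an assumed {virtual page: physical page} dict.

    Raises KeyError if the virtual page has no entry in page_map.
    """
    vpage, offset = split_address(addr)
    return (page_map[vpage] << PAGE_BITS) | offset


if __name__ == "__main__":
    print(fmt_halves(0o123000456))                       # octal halves of a word
    print([oct(x) for x in split_address(0o340123)])     # page and offset
    print(oct(physical_address({0o340: 0o17}, 0o340123)))
```

Because a page offset is exactly three octal digits, the split can also be done by eye on octal addresses in a dump; the sketch simply makes the rule explicit.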
The document assumes the reader is familiar with DDT commands and other TOPS-10 documentation. It includes a glossary of acronyms and illustrations of the monitor's address space layout.