Digital PDFs

Order Number: MISC-68363AED


This document, the "Firefox Error-Handling Specification, Revision 0.1" (dated December 1987), outlines how operating system software should detect and handle hardware errors within a Firefox workstation.

Its primary focus is detecting single-bit data errors, especially transient or single-point failures that do not affect critical system elements. It explicitly states that hard failures (detectable by diagnostics before OS startup) and multi-point failures generally cannot be recovered from reliably.

The specification details error handling procedures for the M-bus and specific hardware modules:

M-Bus Error Handling (FBIC): Describes the full process for M-bus errors: immediately copying the error-logging registers from all M-bus interfaces and re-enabling them, validating the logs, determining the dominant error type, and identifying the source of the error (e.g., a specific module or the backplane). Errors are classified as either requiring system shutdown and restart, when system-wide data integrity cannot be guaranteed (ARB, MCPE, MDPE, or a double error), or allowing system recovery.
L2001 Dual-CVAX Processor Module Error Handling: Covers errors related to its M-bus monitor and slave functions, I/O operations, and cache references, often leading to machine checks or logging invalid data.
L2002 Q-Bus Adapter Module Error Handling: Addresses errors occurring in its M-bus slave and master roles, Q-bus parity issues, and memory references.
L2003 Workstation I/O Module Error Handling: Focuses on its role as a passive M-bus slave, noting the absence of internal bus parity and recommending checksums/CRCs for data integrity checks.
L2007 Memory Module Error Handling: This section is explicitly marked as "TBD" (To Be Determined), indicating it's not covered in this revision.
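
The checksum/CRC recommendation for the L2003 module can be illustrated with a small sketch. This is not taken from the specification: the summary does not say which checksum or polynomial is intended, so CRC-32 (via Python's standard `zlib.crc32`) is an assumption, and the function names `attach_crc` and `verify_crc` are hypothetical.

```python
import zlib


def attach_crc(payload: bytes) -> bytes:
    """Append a 4-byte big-endian CRC-32 to the payload (CRC-32 is an assumption)."""
    crc = zlib.crc32(payload) & 0xFFFFFFFF
    return payload + crc.to_bytes(4, "big")


def verify_crc(frame: bytes) -> bool:
    """Recompute the CRC over the payload and compare it with the stored value."""
    if len(frame) < 4:
        return False
    payload, stored = frame[:-4], frame[-4:]
    return (zlib.crc32(payload) & 0xFFFFFFFF) == int.from_bytes(stored, "big")


# Simulate a single-bit error of the kind bus parity would otherwise catch.
frame = attach_crc(b"block transferred through the I/O module")
corrupted = bytes([frame[0] ^ 0x01]) + frame[1:]
```

Because the module has no internal bus parity, a software check of this kind is the only place a flipped bit in transferred data would be noticed; `verify_crc(frame)` accepts the intact frame and `verify_crc(corrupted)` rejects the one with a flipped bit.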

System Error Analysis and Recovery (Section 15.6) outlines the overall procedure:

A single processor acquires a semaphore to serialize error handling.
Error-logging registers from all M-bus interfaces are copied and then re-enabled.
Logs are validated; inconsistent logs indicate a hardware failure, so the system is shut down.
The specific module causing the error is identified.
Based on the error type and its impact on data integrity, the system decides whether to shut down and restart (for severe, non-recoverable errors) or to attempt recovery (e.g., terminating affected processes, reinitializing I/O devices, or flushing cache lines for less severe, transient errors).
Finally, the system attempts to resume normal operation.
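
The steps above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the specification's algorithm: the dictionary layout for an interface, the `ErrorLog` and `Action` types, the `"DOUBLE"` label for a double error, and the rule for choosing the dominant error are all hypothetical. Only the shutdown-class error names ARB, MCPE, and MDPE come from the summary.

```python
import threading
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Action(Enum):
    SHUTDOWN_AND_RESTART = "shutdown"
    RECOVER = "recover"


# Errors the summary lists as requiring shutdown; "DOUBLE" is a hypothetical
# label standing in for "a double error".
FATAL_ERRORS = {"ARB", "MCPE", "MDPE", "DOUBLE"}


@dataclass
class ErrorLog:
    module: str              # e.g. "L2001"; labels are illustrative
    error: Optional[str]     # logged error type, or None if nothing was logged
    consistent: bool = True  # result of a (hypothetical) log validation check


# Stands in for the semaphore that serializes error handling.
error_lock = threading.Lock()


def copy_and_rearm(interfaces):
    """Copy every interface's error log, then clear it to re-enable logging."""
    logs = [ErrorLog(i["module"], i["error"], i.get("consistent", True))
            for i in interfaces]
    for i in interfaces:
        i["error"] = None  # stands in for re-enabling the logging registers
    return logs


def handle_mbus_error(interfaces):
    """Return (action, suspected module) for one error event."""
    with error_lock:  # only one processor handles the error at a time
        logs = copy_and_rearm(interfaces)
        if not all(log.consistent for log in logs):
            # Inconsistent logs are themselves treated as a hardware failure.
            return Action.SHUTDOWN_AND_RESTART, None
        reported = [log for log in logs if log.error is not None]
        if not reported:
            return Action.RECOVER, None  # nothing logged, nothing to repair
        # Placeholder for "dominant error": prefer a shutdown-class error.
        dominant = next((log for log in reported if log.error in FATAL_ERRORS),
                        reported[0])
        if dominant.error in FATAL_ERRORS:
            return Action.SHUTDOWN_AND_RESTART, dominant.module
        return Action.RECOVER, dominant.module


interfaces = [{"module": "L2001", "error": None},
              {"module": "L2002", "error": "MDPE"}]
action, module = handle_mbus_error(interfaces)
```

In this sketch the recovery branch only reports which module to act on; the actual recovery work the summary mentions (terminating affected processes, reinitializing I/O devices, flushing cache lines) is left out.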

The document concludes by stating that simulations showed the error resolution algorithms correctly identified the faulty module in 94.5% of tested cases.


December 1987

21 pages

Original

0.9MB
