This document, the "Firefox Error-Handling Specification, Revision 0.1" (dated December 1987), outlines how operating system software should detect and handle hardware errors within a Firefox workstation.
It focuses on single-bit data errors, especially transient or single-point failures that do not affect critical system elements. It states explicitly that hard failures (those detectable by diagnostics before the operating system starts) and multi-point failures generally cannot be recovered from reliably.
The specification details error handling procedures for the M-bus and specific hardware modules:
- M-bus Error Handling (FBIC): Describes the process for handling M-bus errors: immediately copying the error-logging registers from all M-bus interfaces and re-enabling them, validating the logs, determining the dominant error type, and identifying the source of the error (e.g., a specific module or the backplane). Errors fall into two classes: those that require system shutdown and restart because system-wide data integrity cannot be guaranteed (such as ARB, MCPE, MDPE, or a double error), and those from which the system can recover.
- L2001 Dual-CVAX Processor Module Error Handling: Covers errors related to its M-bus monitor and slave functions, I/O operations, and cache references, often leading to machine checks or logging invalid data.
- L2002 Q-Bus Adapter Module Error Handling: Addresses errors occurring in its M-bus slave and master roles, Q-bus parity issues, and memory references.
- L2003 Workstation I/O Module Error Handling: Focuses on its role as a passive M-bus slave, noting the absence of internal bus parity and recommending checksums/CRCs for data integrity checks.
- L2007 Memory Module Error Handling: This section is explicitly marked as "TBD" (To Be Determined), indicating it's not covered in this revision.
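The L2003 recommendation to protect data with checksums or CRCs, given the absence of internal bus parity, can be illustrated with a minimal sketch. This is not code from the specification: the function names `frame_with_crc` and `verify_frame`, the 4-byte trailer layout, and the choice of CRC-32 are all assumptions made for illustration.

```python
import struct
import zlib


def frame_with_crc(payload: bytes) -> bytes:
    """Append a big-endian CRC-32 of the payload (hypothetical framing)."""
    crc = zlib.crc32(payload) & 0xFFFFFFFF
    return payload + struct.pack(">I", crc)


def verify_frame(frame: bytes) -> bool:
    """Return True if the trailing CRC-32 matches the payload before it."""
    if len(frame) < 4:
        return False  # too short to contain a CRC trailer
    payload = frame[:-4]
    (stored,) = struct.unpack(">I", frame[-4:])
    return (zlib.crc32(payload) & 0xFFFFFFFF) == stored
```

A receiver would call `verify_frame` on each block it reads back and treat a mismatch as a data error. A CRC-32 detects any single-bit corruption of the framed data, which matches the class of errors the specification targets.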
System Error Analysis and Recovery (Section 15.6) outlines the overall procedure:
- A single processor acquires a semaphore to serialize error handling.
- Error-logging registers from all M-bus interfaces are copied and then re-enabled.
- Logs are validated; if they are inconsistent, the system is shut down, because inconsistent logs indicate a hardware failure.
- The specific module causing the error is identified.
- Based on the error type and its impact on data integrity, the system decides whether to shut down and restart (for severe, non-recoverable errors) or to attempt recovery (e.g., terminating affected processes, reinitializing I/O devices, or flushing cache lines for less severe, transient errors).
- Finally, the system attempts to resume normal operation.
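The steps above can be sketched roughly as follows. This is an illustrative outline under stated assumptions, not the specification's algorithm: the `ErrorLog` fields, the `consistent` and `double_error` flags, the callbacks `read_logs` and `reenable_logging`, and the placeholder error type `"OTHER"` are hypothetical. A `threading.Lock` stands in for the serializing semaphore. Identification of the faulty module is omitted because the summary does not give its rules.

```python
import threading
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    SHUTDOWN = "shut down and restart"
    RECOVER = "attempt recovery"


# Error types the summary names as requiring shutdown and restart.
FATAL_TYPES = {"ARB", "MCPE", "MDPE"}


@dataclass(frozen=True)
class ErrorLog:
    module: str                 # hypothetical label, e.g. "L2002"
    error_type: str             # "ARB", "MCPE", "MDPE", or another type
    consistent: bool = True     # stand-in for the result of log validation
    double_error: bool = False  # stand-in for a logged double error


# Stand-in for the semaphore that serializes error handling.
_error_lock = threading.Lock()


def handle_error(read_logs, reenable_logging) -> Action:
    """Copy the logs, re-enable logging, validate, then classify."""
    with _error_lock:
        logs = list(read_logs())  # copy the error-logging registers first
        reenable_logging()        # then re-enable error logging
        # Inconsistent logs indicate a hardware failure: shut down.
        if not all(log.consistent for log in logs):
            return Action.SHUTDOWN
        # Errors that threaten system-wide data integrity: shut down.
        if any(log.double_error or log.error_type in FATAL_TYPES
               for log in logs):
            return Action.SHUTDOWN
        # Otherwise recovery may be attempted (terminate affected
        # processes, reinitialize I/O devices, flush cache lines).
        return Action.RECOVER
```

For example, a log carrying an `MDPE` entry yields `Action.SHUTDOWN`, while a consistent log with only a non-fatal entry yields `Action.RECOVER`.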
The document concludes by stating that simulations showed the error resolution algorithms correctly identified the faulty module in 94.5% of tested cases.