Document: VMS VAXcluster Manual
Order Number: AA-LA27A-TE
Pages: 152
Original Filename: http://bitsavers.org/pdf/dec/vax/vms/5.0/AA-LA27A-TE_VMS_5.0_VAXcluster_Manual_198804.pdf

OCR Text
VMS VAXcluster Manual

Order Number: AA-LA27A-TE

April 1988

This manual describes the procedures for setting up and managing VAXcluster configurations.

Revision/Update Information: This manual supersedes the Version 4.0 Guide to VAXclusters and the Version 4.6 VMS Local Area VAXcluster Manual.

Software Version: VMS Version 5.0

digital equipment corporation, maynard, massachusetts

April 1988

The information in this document is subject to change without notice and should not be construed as a commitment by Digital Equipment Corporation. Digital Equipment Corporation assumes no responsibility for any errors that may appear in this document.

The software described in this document is furnished under a license and may be used or copied only in accordance with the terms of such license.

No responsibility is assumed for the use or reliability of software on equipment that is not supplied by Digital Equipment Corporation or its affiliated companies.

Copyright © 1988 by Digital Equipment Corporation. All Rights Reserved. Printed in U.S.A.

The postpaid READER'S COMMENTS form on the last page of this document requests the user's critical evaluation to assist in preparing future documentation.

The following are trademarks of Digital Equipment Corporation: DEC, DEC/CMS, DEC/MMS, DECnet, DECsystem-10, DECSYSTEM-20, DECUS, DECwriter, DIBOL, EduSystem, IAS, MASSBUS, PDP, PDT, RSTS, RSX, UNIBUS, VAX, VAXcluster, VMS, VT.

ZK4477

HOW TO ORDER ADDITIONAL DOCUMENTATION

Direct mail orders:

USA & Puerto Rico*: Digital Equipment Corporation, P.O. Box CS2008, Nashua, New Hampshire 03061

Canada: Digital Equipment of Canada Ltd., 100 Herzberg Road, Kanata, Ontario K2K 2A6, Attn: Direct Order Desk

International: Digital Equipment Corporation, PSG Business Manager, c/o Digital's local subsidiary or approved distributor

In Continental USA, Puerto Rico, Alaska, and Hawaii call 800-DIGITAL. In Canada call 800-267-6215.

*Any prepaid order from Puerto Rico must be placed with the local Digital subsidiary (809-754-7575).

Internal orders should be placed through the Software Distribution Center (SDC), Digital Equipment Corporation, Westminster, Massachusetts 01473.

Production Note

This book was produced with the VAX DOCUMENT electronic publishing system, a software tool developed and sold by DIGITAL. In this system, writers use an ASCII text editor to create source files containing text and English-like code; this code labels the structural elements of the document, such as chapters, paragraphs, and tables. The VAX DOCUMENT software, which runs on the VMS operating system, interprets the code to format the text, generate a table of contents and index, and paginate the entire document. Writers can print the document on the terminal or line printer, or they can use DIGITAL-supported devices, such as the LN03 laser printer and PostScript® printers (PrintServer 40 or LN03R ScriptPrinter), to produce a typeset-quality copy containing integrated graphics.

® PostScript is a trademark of Adobe Systems, Inc.
Contents

PREFACE
NEW AND CHANGED FEATURES

CHAPTER 1  INTRODUCTION TO THE VAXCLUSTER ENVIRONMENT
  1.1  CLUSTER HARDWARE
  1.2  CLUSTER SOFTWARE
  1.3  CLUSTER CONFIGURATION TYPES
    1.3.1  CI-Only VAXcluster Configurations
    1.3.2  Local Area VAXcluster Configurations
    1.3.3  Mixed-Interconnect VAXcluster Configurations
    1.3.4  Cluster Security for Local Area and Mixed-Interconnect Configurations
  1.4  DECNET-VAX COMMUNICATIONS
  1.5  CLUSTER CONNECTION MANAGEMENT
    1.5.1  The Quorum Scheme
    1.5.2  Quorum Disk
  1.6  SHARED PROCESSING AND PRINTER RESOURCES
  1.7  SHARED DISK RESOURCES

CHAPTER 2  PREPARING THE CLUSTER OPERATING ENVIRONMENT
  2.1  DIRECTORY STRUCTURE ON A COMMON SYSTEM DISK
  2.2  INSTALLING THE VMS OPERATING SYSTEM IN THE VAXCLUSTER ENVIRONMENT
  2.3  CONFIGURING THE DECNET-VAX NETWORK
    2.3.1  Copying Remote Node Databases
    2.3.2  Enabling Cluster Alias Operations
  2.4  COORDINATING CLUSTER COMMAND PROCEDURES
    2.4.1  Building Common Command Procedures
    2.4.2  Using Node-Specific System Command Procedures
  2.5  COORDINATING SYSTEM FILES TO DEFINE THE CLUSTER USER ENVIRONMENT
    2.5.1  Coordinating User Accounts
    2.5.2  Preparing the MAIL Database
    2.5.3  Preparing the Rights Database
    2.5.4  Coordinating Shared System Files in Clusters with Multiple Common System Disks

CHAPTER 3  BUILDING AND MAINTAINING THE CLUSTER
  3.1  PLANNING CONFIGURATION PROCEDURES
    3.1.1  CLUSTER_CONFIG.COM Functions
    3.1.2  Determining Locations and Sizes for Satellite Page and Swap Files
    3.1.3  Selecting Boot Servers for Mixed-Interconnect Clusters
    3.1.4  Specifying Allocation Class Values in Mixed-Interconnect Clusters
  3.2  CONFIGURING THE CLUSTER
    3.2.1  Adding a Node to the Cluster
      3.2.1.1  Updating Network Data after Adding a Satellite
      3.2.1.2  Restoring a Satellite's Network Data
      3.2.1.3  Controlling Clusterwide Broadcast Messages on Satellites and Boot Servers
    3.2.2  Removing a Node from the Cluster
    3.2.3  Changing a Node's Characteristics
    3.2.4  Changing the Cluster Configuration Type
      3.2.4.1  Changing an Existing CI-Only Cluster to a Mixed-Interconnect Configuration
      3.2.4.2  Changing an Existing Local Area Cluster to a Mixed-Interconnect Configuration
    3.2.5  Converting a Standalone Node to a Cluster Node
    3.2.6  Creating a Duplicate System Disk
  3.3  RECONFIGURING THE CLUSTER AFTER A MAJOR CHANGE
    3.3.1  Updating MODPARAMS.DAT Files to Adjust Cluster Quorum
    3.3.2  Shutting Down the Cluster
    3.3.3  Changing Allocation Class Values on HSCs
    3.3.4  Rebooting the Cluster
  3.4  MAINTAINING THE CLUSTER
    3.4.1  Running AUTOGEN with the FEEDBACK Option
    3.4.2  Recording Configuration Data
    3.4.3  Monitoring Ethernet Activity in Local Area and Mixed-Interconnect Clusters
    3.4.4  Restoring Cluster Quorum after an Unexpected Node Failure
    3.4.5  Selecting Cluster Shutdown Options
      3.4.5.1  The REMOVE_NODE Option
      3.4.5.2  The CLUSTER_SHUTDOWN Option
      3.4.5.3  The REBOOT_CHECK Option
      3.4.5.4  The SAVE_FEEDBACK Option
    3.4.6  Performing Security Functions in Local Area and Mixed-Interconnect Clusters
      3.4.6.1  Maintaining Cluster Security Data
      3.4.6.2  Controlling Conversational Bootstrap Operations for Satellites

CHAPTER 4  SETTING UP AND MANAGING CLUSTER QUEUES
  4.1  CLUSTERWIDE QUEUES
  4.2  CLUSTER PRINTER QUEUES
    4.2.1  Setting Up Printer Queues
    4.2.2  Setting Up Clusterwide Generic Printer Queues
  4.3  CLUSTER BATCH QUEUES
    4.3.1  Setting Up Executor Batch Queues
    4.3.2  Setting Up Generic Batch Queues
  4.4  COMMAND PROCEDURES FOR ESTABLISHING QUEUES
    4.4.1  Starting Queues Using Node-Specific Command Procedures
    4.4.2  Starting Queues Using a Common Command Procedure
  4.5  SUMMARY OF COMMANDS FOR SETTING UP CLUSTER QUEUES

CHAPTER 5  SETTING UP AND MANAGING CLUSTER DISKS
  5.1  CLUSTER-ACCESSIBLE DISKS
    5.1.1  HSC Disks
    5.1.2  MSCP-Served Disks
    5.1.3  Dual-Pathed Disks
      5.1.3.1  Dual-Ported HSC Disks
      5.1.3.2  Dual-Ported DSA Disks
      5.1.3.3  Dual-Ported MASSBUS Disks
  5.2  CLUSTER DEVICE-NAMING CONVENTIONS
    5.2.1  Rules for Specifying Allocation Class Values
    5.2.2  Sample Configurations with Named Devices
  5.3  SHARED DISKS
  5.4  SETTING UP CLUSTER DEVICES
  5.5  VOLUME SHADOWING IN MIXED-INTERCONNECT CLUSTERS
    5.5.1  Mounting Shadow Sets
    5.5.2  Dismounting Shadow Sets
    5.5.3  Using Shadow Sets as Satellite System Disks

APPENDIX A  CLUSTER SYSGEN PARAMETERS
APPENDIX B  BUILDING A COMMON SYSUAF.DAT FILE FROM NODE-SPECIFIC FILES
APPENDIX C  VAXCLUSTER TROUBLESHOOTING INFORMATION
  C.1  DIAGNOSING FAILURES OF NODES TO BOOT OR TO JOIN THE CLUSTER
    C.1.1  Summary of Events for Nodes Booting and Joining the Cluster
    C.1.2  CI-Connected Node Fails to Boot
    C.1.3  Satellite Node Fails to Boot
    C.1.4  Node Fails to Join the Cluster
    C.1.5  Startup Procedures Fail to Complete
  C.2  DIAGNOSING CLUSTER HANGS
    C.2.1  Cluster Quorum Is Lost
    C.2.2  A Shared Cluster Resource Is Inaccessible
  C.3  DIAGNOSING CLUEXIT BUGCHECKS
  C.4  DIAGNOSING VAXPORT DEVICE PROBLEMS
    C.4.1  VAXport Communication Mechanisms
    C.4.2  Port Failures
      C.4.2.1  Verifying CI Port Functions
      C.4.2.2  Verifying CI Cable Connections
      C.4.2.3  Repairing CI Cables
    C.4.3  Analyzing Error Log Entries for VAXport Devices
      C.4.3.1  Error Log Entry Formats
      C.4.3.2  Device-Attention Entries
      C.4.3.3  Logged-Message Entries
      C.4.3.4  Error Log Entry Descriptions
    C.4.4  OPA0 Error Messages

INDEX

EXAMPLES
  2-1  Sample Interactive Network Configuration Session
  3-1  Sample Interactive CLUSTER_CONFIG.COM Session to Add a CI-Connected Node as a Boot Server
  3-2  Sample Interactive CLUSTER_CONFIG.COM Session to Add a Satellite Node with Local Page and Swap Files
  3-3  Sample NETNODE_UPDATE.COM File
  3-4  Sample Interactive CLUSTER_CONFIG.COM Session to Remove a Satellite Node with Local Page and Swap Files
  3-5  Sample Interactive CLUSTER_CONFIG.COM Session to Enable the Local System as a Disk Server
  3-6  Sample Interactive CLUSTER_CONFIG.COM Session to Change the Local System's ALLOCLASS Value
  3-7  Sample Interactive CLUSTER_CONFIG.COM Session to Enable the Local System as a Boot Server
  3-8  Sample Interactive CLUSTER_CONFIG.COM Session to Change a Satellite's Hardware Address
  3-9  Sample Interactive CLUSTER_CONFIG.COM Session to Convert a Standalone Node to a Cluster Boot Server
  3-10  Sample Interactive CLUSTER_CONFIG.COM CREATE Session
  3-11  Sample Interactive SYSMAN CONFIGURATION Session
  4-1  STARTQ Command Procedure for Node JUPITR
  4-2  STARTQ Command Procedure for Node SATURN
  4-3  STARTQ Command Procedure for Node URANUS
  4-4  Starting Queues Using a Common Command Procedure
  5-1  Shadow Set as Seen from Boot Server
  5-2  Shadow Set as Seen from Satellite
  C-1  CI Device-Attention Entry
  C-2  Ethernet Device-Attention Entry
  C-3  CI Logged-Message Entry

FIGURES
  1-1  Typical CI-Only VAXcluster Configuration
  1-2  Typical Local Area VAXcluster Configuration
  1-3  Typical Mixed-Interconnect VAXcluster Configuration
  2-1  Directory Structure on Common System Disk
  2-2  File Search Order on Common System Disk
  4-1  Sample Printer Configuration
  4-2  Printer Queue Configuration
  4-3  Cluster Printer Queue Configuration with Clusterwide Generic Printer Queue
  4-4  Printer Queue Configuration with Local Generic Queue
  4-5  Sample Batch Queue Configuration
  4-6  Batch Queue Configuration with Clusterwide Generic Queue
  5-1  CI-Only Configuration with Shared Disks
  5-2  Configuration with a Dual-Pathed HSC Disk
  5-3  Configuration with a Dual-Pathed DSA Disk
  5-4  Device Names in a Mixed-Interconnect Cluster
  C-1  A Correctly Connected Two-Node CI Cluster
  C-2  Crossed CI Cable Pair

TABLES
  1-1  VAXcluster Hardware Components
  2-1  Information Requested for CI-Only Configurations
  2-2  Information Requested for Local Area and Mixed-Interconnect Configurations
  3-1  Data Requested by CLUSTER_CONFIG.COM
  3-2  CLUSTER_CONFIG.COM CHANGE Options
  3-3  Summary of SYSMAN CONFIGURATION Commands for Cluster Authorization
  5-1  Specifying Values for MSCP_LOAD and MSCP_SERVE_ALL Parameters
  A-1  Cluster SYSGEN Parameters

Preface

Intended Audience

This document addresses persons responsible for setting up and managing VAXcluster configurations. To use the document as a guide to cluster management, you must have a thorough understanding of VMS system management concepts and procedures, as described in the Introduction to VMS System Management, the Guide to Setting Up a VMS System, and the Guide to Maintaining a VMS System.

Document Structure

The VMS VAXcluster Manual contains five chapters and three appendixes. Chapter 1 describes the VAXcluster environment. Chapter 2 explains how to prepare the cluster operating environment before building a cluster. Chapter 3 explains how to build a cluster once the necessary preparations are made, and how to reconfigure and maintain the cluster. Chapter 4 discusses cluster queue management concepts and procedures. Chapter 5 discusses cluster disk management concepts and procedures. Appendix A lists and defines cluster SYSGEN parameters. Appendix B provides guidelines for building a cluster common user authorization file. Appendix C provides VAXcluster troubleshooting information.

Associated Documents

This document is not a one-volume reference manual. The VMS utilities and commands discussed are described in detail in separate VMS Utility Reference Manuals and in the VMS DCL Dictionary. For additional information on the topics covered in this manual, refer to the following documents:

• Introduction to VMS System Management
• Guide to Setting Up a VMS System
• Guide to Maintaining a VMS System
• Guide to VMS File Applications
• VMS Networking Manual
• VAX Volume Shadowing Manual
• VMS Utility Reference Manuals

Conventions

RETURN
In examples, a key name (usually abbreviated) shown within a box indicates that you press a key on the keyboard; in text, a key name is not enclosed in a box. In this example, the key is the RETURN key. (Note that the RETURN key is not usually shown in syntax statements or in all examples; however, assume that you must press the RETURN key after entering a command or responding to a prompt.)
CTRL/C
A key combination, shown in uppercase with a slash separating two key names, indicates that you hold down the first key while you press the second key. For example, the key combination CTRL/C indicates that you hold down the key labeled CTRL while you press the key labeled C. In examples, a key combination is enclosed in a box.

$ SHOW TIME
05-JUN-1988 11:55:22
In examples, system output (what the system displays) is shown in black. User input (what you enter) is shown in red.

$ TYPE MYFILE.DAT
In examples, a vertical series of periods, or ellipsis, means either that not all the data that the system would display in response to a command is shown or that not all the data a user would enter is shown.

input-file, ...
In examples, a horizontal ellipsis indicates that additional parameters, values, or other information can be entered, that preceding items can be repeated one or more times, or that optional arguments in a statement have been omitted.

[logical-name]
Brackets indicate that the enclosed item is optional. (Brackets are not, however, optional in the syntax of a directory name in a file specification or in the syntax of a substring specification in an assignment statement.)

quotation marks, apostrophes
The term quotation marks is used to refer to double quotation marks ("). The term apostrophe (') is used to refer to a single quotation mark.

New and Changed Features

New VAXcluster software features for VMS Version 5.0 include the following:

• Support for MicroVAX class processors as VAXcluster members in mixed-interconnect cluster configurations. These systems can boot into a mixed-interconnect cluster over the Ethernet.
• Support for an increased number of cluster nodes.
• Enhanced Mass Storage Control Protocol (MSCP) Server functions. New server functions enable a disk-serving system to serve all suitable disks to the cluster early in the boot sequence, so that the disks become cluster accessible with minimal interruption whenever the serving system reboots. In addition, the server automatically serves any suitable disks that are added to the system later.
• Failover support for DSA disks using UDA/KDA/BDA controllers.
• A revised quorum disk scheme.
• A new command procedure, SYS$MANAGER:CLUSTER_CONFIG.COM, which you execute to perform cluster configuration functions. This procedure replaces the following VMS Version 4.0 and 4.6 procedures: MAKEROOT.COM, BOOT_CONFIG.COM, and SATELLITE_CONFIG.COM.

Note that the configuration information presented in this document is subject to change. For definitive information on supported VAXcluster configurations, refer to the current VAXcluster Software Product Description (SPD) document.

1 Introduction to the VAXcluster Environment

A VAXcluster environment is a highly integrated organization of VAX or MicroVAX systems, or a combination of these systems. As members of a cluster, the systems can share processing resources, queues, and disk storage under a single VMS security and management domain, and they can boot or fail independently.

Using procedures described in Chapter 2, system managers can tailor the cluster operating environment to create a common-environment or a multiple-environment cluster.

• In a common-environment cluster, the same resources are available on all nodes. User accounts are identical, the same known images are installed, the same logical names are defined, and mass storage devices and queues are shared.
• In a multiple-environment cluster, a group of nodes may share one set of resources, while another group shares a different set. Or an individual node may perform a specialized function using restricted resources, while other nodes are used for general time-sharing work. Although most cluster resources may be shared, user processes and system memory are node specific. When a process is created on a cluster node, the process must complete on that node, using memory local to the node. If the node should fail before the process completes, the process is terminated. However, users can recover from such a failure more quickly than on a standalone system, because they need not wait until the system is rebooted. Typically, they can log in on another cluster node to create a new process and continue working-provided that the resources required by the process (such as images and global sections) are available on that node. This chapter describes the key components and distinctive features of the VAXcluster environment. Topics include the following: • Cluster hardware and software components • Cluster configuration types • DECnet-VAX communications • Cluster connection management • Shared cluster resources Be sure you understand these topics before you attempt to perform the cluster setup operations described in Chapters 2 and 3. 1-1 Introduction to the VAXcluster Environment 1 .1 Cluster Hardware 1 .1 Cluster Hardware Basic VAXcluster hardware components are described in Table 1-1. Table 1-1 V AXcluster Hardware Components Component Function VAX processor A VAX or MicroVAX class processor running the VMS operating system. Any VAX processor in the cluster is considered an active node. Computer Interconnect (Cl) The Cl is a high-speed, dual-path bus that connects VAX processor nodes and intelligent 1/0 subsystems (HSCs) in a computer room environment. Cl Port Controller A microcoded, intelligent controller that connects VAX processors to the Cl. Each interface connects to the Cl bus, which consists of two transmitter and two receiver cables. Under normal operating conditions, both sets of cables are available to meet traffic demands. If one path becomes inoperative, then all traffic uses the remaining path. The VMS operating system periodically tests a failed path. As soon as a failed path becomes available, it will automatically be used for normal traffic. Star Coupler The Star Coupler is the common connection point for all nodes connected to a Cl. As with the Cl bus, the Star Coupler is dual pathed and contains separate components for each path. The star coupler connects all Cl cables from the individual nodes, creating a radial or "star" arrangement that has a maximum radius of 45 meters. It supports the physical connection or disconnection of nodes during normal cluster operations, without affecting the rest of the cluster. 1.2 Hierarchical Storage Controller (HSC) The HSC is a self-contained, intelligent, mass storage subsystem that enables cluster nodes to share DIGIT AL Standard Architecture (DSA) disks. Because the HSC is an intelligent controller, it optimizes physical disk operations. The HSC is considered a passive node. Ethernet The Ethernet is a bus that uses digital baseband signaling. The Ethernet is used both for DECnet-VAX transmissions, and, in some cluster configurations, for interprocessor System Communication Services (SCS). 
In the V AXcluster environment, the Ethernet and its circuit devices must be configured according to requirements specified in the V AXcluster Software Product Description (SPD) document. Cluster Software The software components used to implement VAXcluster functions are as follows: 1-2 • System Communication Services (SCS) • VAXport drivers • Connection Manager • Distributed File System and VMS Record Management Services (RMS) Introduction to the VAXcluster Environment 1 .2 Cluster Software • Distributed Lock Manager • Distributed Job Controller • Mass Storage Control Protocol (MSCP) Server and disk class driver(s) These components are always present on each cluster member, so that if one member fails, the cluster continues to function, because all the remaining members possess the necessary software components. The System Communication Services (SCS) software implements internode communication, according to DIGITAL's System Communication Architecture (SCA). The VAXport drivers (for example, P ADRIVER and PED RIVER) control the communication paths between local and remote ports. The Connection Manager dynamically defines and coordinates the cluster. The Connection Manager uses the system communication services and provides an acknowledged message delivery service for higher VMS software layers. The Connection Manager also maintains cluster integrity when nodes join or leave the cluster-that is, when cluster state transitions occur. The Distributed File System allows all processors to share disk mass storage, whether the disk is connected to an HSC or to a processor. A local disk may be made available to the entire cluster. All cluster-accessible disks appear as if they are local to every processor. The distributed file system and VMS Record Management Services (VMS RMS) provide the same access to disks and files clusterwide that is provided on a standalone system. VMS RMS files may be shared clusterwide to the record level. The Distributed Lock Manager is used for synchronization functions by the distributed file system, job controller, device allocation, and other cluster facilities. It is available to users to develop cluster applications. The Distributed Lock Manager implements the $ENQ and $DEQ system services to provide clusterwide synchronization of access to resources by allowing the locking and unlocking of resource names. (For detailed information on system services, refer to the VMS System Services Volume.) It also provides a queueing mechanism so that processes can be put into a wait state until a particular resource is available. As a result, cooperating processes can synchronize their access to shared objects such as files or records. If a processor in the cluster fails, all locks it holds are released. This mechanism allows processing to continue on the remaining processors. The Distributed Lock Manager also supports clusterwide deadlock detection. The Distributed Job Controller makes queues available clusterwide. A cluster operates with a common set of batch and print queues. Users can submit jobs to any queue within the cluster, provided that the necessary mass storage volumes and peripheral devices are accessible to the system on which the job executes. System managers can also set up generic batch queues that distribute batch processing workloads among nodes. 
The Mass Storage Control Protocol (MSCP) Server implements the MSCP protocol, which is used to communicate with a controller for local MASSBUS or UNIBUS disks, or for Digital Standard Architecture (DSA) disks, such as RA series disks. In conjunction with one or both of the disk class drivers (DUDRIVER, DSDRIVER), the MSCP Server implements this protocol on a processor, allowing the processor to function as a storage contoller. The 1-3 Introduction to the VAXcluster Environment 1 .2 Cluster Software processor submits 1/0 requests to locally accessed disks, such as UNIBUS, MASSBUS, and Unibus Disk Adapter (UDA) disks, and accepts the 1/0 requests from any node in the cluster. In this way, the MSCP Server makes locally connected disks available to all nodes in the cluster. The MSCP Server can also make HSC disks accessible over the Ethernet. 1.3 Cluster Configuration Types While site-specific processing needs and available hardware resources must determine how you configure your cluster, you always start with one of the following configuration types: • CI-only VAXcluster configuration • Local Area VAXcluster configuration • Mixed-interconnect VAXcluster configuration These configuration types are distinguished by the interconnect devices (Cl, Ethernet, or both) used for SCS interprocessor communications. Sections 1.3.1 through 1.3.3 describe each type of configuration. For complete information on currently supported configurations, including the type and number of nodes supported in each configuration type, and configuration requirements, refer to the VAXcluster Software Product Description (SPD) document. Depending on the type of configuration you plan to set up, one or more processor nodes may be required to perform specific functions. For example, in all local area and mixed-interconnect configurations, at least one node must perform both boot serving and disk serving functions. These functions are described in Section 1.3.2. Once you have determined which type of configuration best meets your needs, you can set up your cluster using the procedures described in Chapters 2 and 3. 1.3.1 Cl-Only VAXcluster Configurations A CI-only cluster uses the CI for interprocessor communication, with the Star Coupler as the common connection point for all cluster nodes (VAX processors and HSCs). Cluster nodes may be any VAX processors specified in the VAXcluster SPD, or they may be HSCs. Figure 1-1 shows how the components are typically configured. Note that any CI-only cluster may later be converted to a mixed-interconnect configuration. Refer to Section 3.2.4 for instructions. 1-4 Introduction to the VAXcluster Environment 1 .3 Cluster Configuration Types Figure 1-1 Typical Cl-Only VAXcluster Configuration Cl Cl - ZK-1640-84 1.3.2 Local Area VAXcluster Configurations In a local area cluster, interprocessor communication is carried out over the Ethernet by a VAXport driver that emulates certain CI port functions. A cluster node may be any VAX or MicroVAX processor specified in the VAXcluster SPD document. Because HSCs require CI connections, local area clusters do not include HSCs. A single Ethernet may support multiple local area clusters, each identified and secured by a unique group number and a cluster password. (For information on cluster security, see Section 1.3.4.) A local area cluster includes boot servers (boot nodes) and satellite nodes. A boot server is both a management center for the cluster and a major resource provider. 
Its system disk contains the cluster common files for startup, authorization, and queue setup, as well as the directory roots from which the satellite nodes are booted. (The system manager creates these directory roots-one for each satellite-using the CLUSTER_CONFIG.COM command procedure, described in Chapter 3.) A boot server makes available to the cluster such resources as user and application data disks, printers, and distributed batch processing facilities. 1-5 Introduction to the VAXcluster Environment 1 .3 Cluster Configuration Types Using DECnet Maintenance Operation Protocol (MOP), a boot server responds to downline load requests from satellites. When a satellite requests an operating system load, the boot server responds to the request and sends an image to the satellite that allows the satellite to load the VMS operating system and join the cluster. Note that because a boot server must serve its system disk to the cluster (and usually its data disks as well), a boot server is, by definition, always a disk server. The MSCP Server is therefore always loaded on a boot server, so that the node can serve its disks to the cluster. Boot servers should be the most powerful machines in the cluster. They should also use the highest bandwidth Ethernet adapters available. The satellite nodes are booted remotely from a boot server's system disk. Generally, these nodes are consumers of cluster resources, though they may also sometimes provide disk serving and batch processing resources. If satellite nodes are equipped with RD series disks, they may, for enhanced performance, use such local disks for paging and swapping. Figure 1-2 shows a typical local area cluster configuration. Note that any local area cluster may later be converted to a mixed-interconnect configuration. Refer to Section 3.2.4 for instructions. 1-6 Introduction to the VAXcluster Environment 1 .3 Cluster Configuration Types Figure 1-2 Typical Local Area VAXcluster Configuration DATA DISKS ETHERNET • • • LOCAL PAGE/SWAP DISK LOCAL PAGE/SWAP DISK ZK-6650-HC 1.3.3 Mixed-Interconnect VAXcluster Configurations Clusters with both CI and Ethernet interconnects are available for the first time with VMS Vers::m 5.0. A mixed-interconnect cluster may include VAX processors, HSCs, and Micro VAX satellites. Because the MSCP Server and disk class drivers allow VAX processors to serve HSC disks to the cluster, satellites can access the large amounts of storage available through HSC controllers. Mixed-interconnect clusters combine the advantages of both Ci-only and local area cluster configurations: • Use of HSCs for mass storage • Support for MicroVAX class processors as cluster members • High availability of system resources • Centralized cluster management 1-7 Introduction to the VAXcluster Environment 1.3 Cluster Configuration Types Figure 1-3 shows a typical mixed-interconnect configuration. Figure 1-3 Typical Mixed-,nterconnect V AXcluster Configuration ETHERNET ZK 6659 HC 1.3.4 Cluster Security for Local Area and Mixed-Interconnect Configurations Local area and mixed-interconnect clusters use a group number and a cluster password to allow multiple independent clusters to coexist on the same Ethernet and to prevent access to a cluster by unauthorized nodes. • 1-8 The group number uniquely identifies each mixed-interconnect and local area cluster on a single Ethernet. This number must be in the range from 1 to 4095 or from 61440 to 65535. 
Note that if you plan to have more than one of these clusters at your site, you must coordinate the assignment of group numbers among cluster system managers. Introduction to the VAXcluster Environment 1 .3 Cluster Configuration Types • The cluster password serves as an additional check to ensure the integrity of individual clusters on the same Ethernet that accidentally use identical group numbers. (Provided that each cluster's password is unique, the clusters will form independently.) The password also prevents an intruder who discovers the group number from joining the cluster. The password must be from 1 to 31 alphanumeric characters in length and may include dollar signs and underscores. Security data is maintained in the cluster authorization file, SYS$COMMON:[SYSEXE]CLUSTER_AUTHORIZE.DAT. This file is created during installation of the VMS operating system, if you indicate that you want to set up a local area or mixed-interconnect cluster. The installation procedure then prompts you for the cluster group number and password. Cluster security functions are described in detail in Chapter 3. (If you convert a CI-only cluster to a mixed-interconnect configuration, the file is created when you execute the CLUSTER_CQNFIG.COM command procedure described in Chapter 3.) 1 .4 DECnet-VAX Communications In any cluster configuration, DECnet-VAX communications are required for all processor nodes. Use of DECnet-VAX facilities ensures that system managers can access each node in the cluster from a single terminal, even if terminal-switching facilities are not available. In local area and mixed-interconnect clusters, DECnet is required both for system management functions and interprocessor communication. For example, DECnet is used for remote booting operations (downline loading of satellite nodes). In these configurations, DECnet and System Communication Services coexist on the same Ethernet. They share the same data link and physical link protocols, which are implemented by the Ethernet data link drivers, the Ethernet adapters, and the Ethernet itself. 1 .5 Cluster Connection Management Cluster integrity is controlled by a software component called the Connection Manager, which determines and coordinates cluster membership. The Connection Manager creates a cluster when the first active nodes are booted, and then reconfigures the cluster when nodes join or leave it. Cluster members can share various data and system resources, such as disk volumes. To achieve the coordination necessary to maintain resource integrity, the cluster nodes must share a clear sense of cluster membership. This sense of cluster membership is maintained by the Connection Manager. The integrity of shared resources, however, cannot be guaranteed unless their use is carefully coordinated in the cluster. In the unlikely event that a pair of nodes that are not members of the same cluster share some resource, cluster partitioning occurs. Partitioning is undesirable, because resource sharing between two clusters is not coordinated, and the integrity of the shared resource cannot be ensured. To prevent partitioning, the Connection Manager uses a scheme called quorum. 1-9 Introduction to the VAXcluster Environment 1.5 Cluster Connection Management 1.5.1 The Quorum Scheme The quorum scheme is based on the arithmetic principle that the whole cannot be divided into multiple parts in such a way that more than one part is greater than half of the whole. 
The quorum scheme functions as follows: • Each node in the cluster contributes a fixed number of votes towards quorum. The votes value is specified by the SYSGEN parameter VOTES. On satellites, the value is always set to zero by default. • Each active node in the cluster (including satellites) indirectly specifies an initial quorum value using the SYSGEN parameter EXPECTED_ VOTES. This parameter is the sum of all VOTES held by potential cluster members. It is used to derive an estimate of the correct quorum value for the cluster, according to the following formula: estimated quorum = (EXPECTED_VOTES + 2)/2 • During certain cluster state transitions, the system dynamically computes the cluster quorum to be the maximum of the following: The current cluster quorum value The largest of the values calculated from the following formula, where EV is the EXPECTED_VOTES value specified by each node: (EV+2)/2 The value calculated from the following formula, where V is the total of VOTES held by all cluster members: (V+2)/2 The cluster state transitions that cause cluster quorum to be recalculated occur when a node joins the cluster and when the cluster recognizes a quorum disk (see Section 1.5.2). • If the current number of votes ever drops below the quorum (because of nodes leaving the cluster), the cluster members suspend all process activity and all 1/0 operations to cluster-accessible disks until sufficient votes are added (nodes joining the cluster) to bring the total number of votes to a value greater than or equal to quorum. • As the cluster changes, the system only raises the cluster quorum value; it never lowers the value. (However, system managers can lower the value; for details, see Section 3.4.4.) For example, consider a cluster consisting of three nodes, each node having its VOTES parameter set to 1 and its EXPECTED_VOTES parameter set to 3. The Connection Manager dynamically computes the cluster quorum value to be 2. In this example, any two of the three nodes constitute a quorum and may run in the absence of the third node. No single node can constitute a quorum by itself. Therefore, there is no way the three cluster nodes can be partitioned and run as two independent clusters. 1-10 Introduction to the VAXcluster Environment 1.5 Cluster Connection Management 1.5.2 Quorum Disk A quorum disk acts as a virtual node, adding to the cluster votes total. By establishing a quorum disk in configurations with a small number of voting members, you can increase the availability of the cluster. Such configurations can tolerate the failure either of the quorum disk or of a processor node. To use a quorum disk, one or more nodes must have a direct (non-MSCPserved) connection to the disk. Such nodes are known as quorum disk watchers. Nodes that cannot access the disk directly rely on the quorum disk watchers for information about the status of votes contributed by the quorum disk. You should enable as quorum disk watchers any nodes that have an active direct connection to the quorum disk, or that have the potential for a direct connection. To enable a node as a quorum disk watcher, you use the CLUSTER_CONFIG.COM CHANGE function described in Section 3.2.3. The procedure prompts for the name of the quorum disk and specifies that name as a value for the SYSGEN parameter DISK_QUORUM in MODPARAMS.DAT. The procedure also sets an appropriate value for the QDKSVOTES parameter. 
The number of votes contributed by the quorum disk is equal to the smallest value of the SYSGEN parameter QDSKVOTES on any quorum disk watcher. Note: You can also enable the first installed cluster node as a quorum disk watcher by answering YES when the VMS installation procedure asks if the cluster will contain a quorum disk. For the quorum disk's votes to be counted in the cluster votes total, the following conditions must be met: • On one or more nodes capable of becoming watchers, you must specify the same device name as a value for DISK_QUORUM. The remaining nodes (nodes with a blank value for DISK_QUORUM) recognize the name specified by the first watcher node with which they communicate. • At least one watcher node must have a direct, active connection to the quorum disk. Thus, the quorum disk may be a dual-ported DSA disk, which has an active direct connection to only one node at a time. • The disk must contain a valid format file named QUORUM.DAT in the master file directory (MFD). The QUORUM.DAT file is created automatically after a system specifying a quorum disk has booted into the cluster. This file will be used on subsequent reboots. If no quorum disk is enabled when a node boots, the file will not be created on that node. • To permit recovery from failure conditions, the quorum disk must be mounted by all disk watchers. 1-11 Introduction to the VAXcluster Environment 1 .6 Shared Processing and Printer Resources 1 .6 Shared Processing and Printer Resources In any cluster configuration, nodes can share processing and printer resources. The ability to share resources allows for better workload balancing, because batch and print job processing can be distributed across the cluster. System managers control how jobs share batch processing and printer resources by setting up and maintaining clusterwide generic queues. The strategy used to set up and manage these queues will determine how well workloads are matched to available resources. Managers establish and maintain the queues with the same commands used to manage queues on a single-node system. All clusterwide queues are controlled by a single, cluster common job controller queue file (JBCSYSQUE.DAT), which must be accessible to the nodes participating in the clusterwide queue scheme. This file makes queues available across the cluster and enables jobs to execute on any queue from any node-provided that the necessary mass storage volumes can be accessed by the node on which the job executes. Procedures for setting up and managing cluster queues are described in Chapter 4. 1 .7 Shared Disk Resources A major advantage of cluster configurations is the ability to make disk resources accessible to all cluster nodes. A cluster-accessible disk can be used by any active node in the cluster that successfully mounts it. A disk that is not cluster accessible can be accessed only by the local node. Cluster-accessible disks offer the following advantages: • More efficient use of mass storage, because more than one node can use the same disk. • Access by users to their default work disks when logging in to any node on which the disks are accessible. • Clusterwide file sharing. Because nodes can share common versions of files, updates to a file are made only once to a single copy of the file. • Implementation of clusterwide job controller queues. Batch and print jobs can be processed on any node that has access to the disks. Procedures for setting up and managing cluster disks are described in Chapter 5. 
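To make the relationship between the quorum-related SYSGEN parameters described in Sections 1.5.1 and 1.5.2 concrete, the following MODPARAMS.DAT fragment is a minimal sketch for one voting member of a three-node cluster that also uses a quorum disk. It is an illustration only: the device name $1$DJA12 is an assumed example, and the values shown are not recommendations for any particular configuration.

! MODPARAMS.DAT fragment (illustrative sketch only; the values and the
! device name $1$DJA12 are assumptions, not recommendations)
VOTES = 1                  ! This node contributes one vote
EXPECTED_VOTES = 4         ! Three voting nodes plus one quorum disk vote
DISK_QUORUM = "$1$DJA12"   ! Quorum disk watched by this node
QDSKVOTES = 1              ! Votes contributed by the quorum disk

In practice you would let the installation procedure or CLUSTER_CONFIG.COM (described in Chapter 3) set these values, or adjust them in MODPARAMS.DAT and run AUTOGEN so that they take effect at the next reboot.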
1-12 2 Preparing the Cluster Operating Environment You must prepare the cluster operating environment on the first installed node before configuring other nodes in the cluster. You may prepare either a common-environment or a multiple-environment cluster. The operating environment you, choose depends mainly on the processing needs of your site. In a common-environment cluster, the operating environment is identical on each member node, because the nodes are run from the same system files. The nodes are set up with identical user accounts, the same known images are installed, the same logical names are defined, and mass storage devices and queues are shared. In effect, users in a common-environment cluster can log in to any node and work in the same operating environment. In a multiple-environment cluster, the environment varies from node to node, and users can work in environments that are specific to the node they are logged in to. A multiple-environment cluster is effective when you want to share data among member nodes, but when you want certain nodes to serve specialized needs. For example, you might want to set up a threenode cluster, in which the time-sharing environments on two nodes are the same, while the third node is set up exclusively for batch processing of large inventory jobs. In this case, the time-sharing nodes are set up with a common environment, sharing users, queues, and access to mass storage devices, while the third node runs in its own restricted environment. This chapter concentrates on the steps necessary to prepare a commonenvironment cluster. Approaches for preparing a multiple-environment cluster are also described, but are presented as general guidelines. Topics include the following: • Directory structure on a common system disk • Installing the VMS operating system in the VAXcluster environment • Configuring the DECnet-VAX network • Coordinating cluster command procedures • Coordinating system files to define the cluster user environment Once you have prepared the cluster operating environment on the first cluster node, you can build the cluster using the procedures described in Chapter 3. 2-1 Preparing the Cluster Operating Environment 2.1 Directory Stucture on a Common System Disk 2.1 Directory Stucture on a Common System Disk The VMS installation or upgrade procedure generates a common system disk, on which most operating system and optional product files are stored in a common root directory. The entire directory structure-that is, the common root plus each node's local root-is stored on the same disk. After the installation or upgrade completes, you use the CLUSTER_CONFIG.COM command procedure described in Chapter 3 to create a local root for each new cluster node and boot it into the cluster. Each local root contains, in addition to the usual system directories, a [SYSx.SYSCOMMON] directory that is an alias for [VMS$COMMON], the cluster common root directory in which cluster common files actually reside. When you add a node to the cluster, CLUSTER_CONFIG.COM sets up the alias. Figure 2-1 illustrates the directory structure set up for nodes JUPITR and SATURN, which are run from a common system disk. The disk's master file directory (MFD) contains the local roots (SYSO for JUPITR, SYSl for SATURN) and the cluster common root directory, [VMS$COMMON]. 
[Figure 2-1: Directory Structure on Common System Disk. The master file directory, device:[000000], contains the local roots [SYS0] (node JUPITR) and [SYS1] (node SATURN) and the cluster common root directory [VMS$COMMON]. Each local root contains its own system subdirectories ([SYSn.SYSx]) and a [SYSn.SYSCOMMON] directory that is a directory alias for [VMS$COMMON]; the common system subdirectories reside in [VMS$COMMON.SYSx]. Key: SYS$SPECIFIC = device:[SYSn.]; SYS$COMMON = device:[SYSn.SYSCOMMON.]; SYS$SYSROOT = device:[SYSn.], device:[SYSn.SYSCOMMON.] (n denotes a system root, x a system subdirectory).]

The logical name SYS$SYSROOT is defined as a search list that points to a local root first (SYS$SPECIFIC) and then to the common root (SYS$COMMON). Thus, the logical names for the system directories (SYS$SYSTEM, SYS$LIBRARY, SYS$MANAGER, and so forth) point to two directories: a local root (for example, SYS$SPECIFIC:[SYSEXE]) and a common root (for example, SYS$COMMON:[SYSEXE]). Figure 2-2 shows how directories on a common system disk are searched when the logical name SYS$SYSTEM is used in file specifications.

[Figure 2-2: File Search Order on Common System Disk. A reference to SYS$SYSTEM:file resolves through SYS$SYSROOT:[SYSEXE]file first to the node's local root, SYS$SPECIFIC:[SYSEXE]file ([SYS0.SYSEXE]file on JUPITR, [SYS1.SYSEXE]file on SATURN), and then to the common root, SYS$COMMON:[SYSEXE]file, that is, [VMS$COMMON.SYSEXE]file.]

It is important to keep this search order in mind when manipulating system files on a common system disk. Node-specific files must always reside and be updated in the appropriate node's system subdirectory. For example, MODPARAMS.DAT must reside in SYS$SPECIFIC:[SYSEXE], which is [SYS0.SYSEXE] on JUPITR, and [SYS1.SYSEXE] on SATURN. Thus, to create a new MODPARAMS.DAT for JUPITR when logged in on JUPITR, you would enter the following command:

$ EDIT SYS$SPECIFIC:[SYSEXE]MODPARAMS.DAT

Once the file is created, you could use the following command to modify it:

$ EDIT SYS$SYSTEM:MODPARAMS.DAT

However, to modify JUPITR's MODPARAMS.DAT when logged in on any other cluster node that boots from the same common system disk, you must enter the following command:

$ EDIT [SYS0.SYSEXE]MODPARAMS.DAT

If you want to modify records in the cluster common system authorization file in a cluster with a single cluster common system disk, you could enter the following commands on any cluster node:

$ SET DEFAULT SYS$COMMON:[SYSEXE]
$ RUN AUTHORIZE

But if, for example, you have set up a node-specific system authorization file (SYSUAF.DAT) for node JUPITR and you want to modify records in that file when logged in on another cluster node that boots from the same cluster common system disk, you must, before invoking AUTHORIZE, set your default directory to JUPITR's node-specific [SYSEXE] directory. For example:

$ SET DEFAULT [SYS0.SYSEXE]
$ RUN AUTHORIZE

2.2 Installing the VMS Operating System in the VAXcluster Environment

You must perform the installation or upgrade once for each system disk in the cluster. Because, however, several nodes normally run from the same cluster common system disk, you need not perform the installation or upgrade on each cluster node. You may want to set up a cluster that has a combination of one or more common system disks and one or more individual system disks. Again, you must do the installation or upgrade once for each system disk.
For example, if your cluster consists of ten nodes, four of which share one common system disk, four of which share a second common system disk, and each of the other two has its own system disk, you would do the installation or upgrade four times. Note that if your cluster includes multiple common system disks, you must later coordinate system files to define the cluster operating environment, as described in Section 2.5.4.

To perform the installation, follow instructions in the installation and operations guide for your processor. However, before you start the installation, be sure you have determined which cluster configuration type you want to create (CI-only, local area, or mixed-interconnect), because the installation procedure will request configuration-specific information. (Configuration types are described in Section 1.3.) Table 2-1 lists the information requested for CI-only configurations; Table 2-2 lists the information requested for local area and mixed-interconnect configurations. Typical responses are explained in the tables. Note that initial questions are the same for all configuration types.

If your system disk is on an HSC, you must obtain the HSC's disk allocation class value before starting the installation, because the installation procedure will request that information. (Allocation classes are discussed in detail in Section 5.2.) To obtain the value, enter a command sequence like the following at the HSC console. The information displayed will include the allocation class value.

[CTRL/C]
HSC> SHOW SYS
15-Apr-1988 14:31:43.41   Boot: 13-Apr-1988 11:31:11.41   Up: 51:00
DISK allocation class = 1        TAPE allocation class = 0
Start command file    Disabled
SETSHO - Program Exit

If you later want to change the allocation class value, follow the instructions in Section 3.3.

Note: While rebooting at the end of the installation procedure, the system will display messages warning that you must install required licenses. Be sure to install these licenses, as well as the DECnet-VAX license, as soon as the system is available. Procedures for installing the licenses are described in the release notes distributed with the software kit.

Table 2-1  Information Requested for CI-Only Configurations

Item: Will this node be a cluster member (Y/N)?
Response: Enter Y.

Item: What is the node's DECnet node name?
Response: Enter the DECnet node name, for example, JUPITR. The DECnet node name may be from 1 to 6 alphanumeric characters in length and may not include dollar signs or underscores.

Item: What is the node's DECnet node address?
Response: Enter the DECnet node address, for example, 2.2.

Item: Will the Ethernet be used for cluster communications (Y/N)?
Response: Enter N. The Ethernet is not used for cluster (SCS internode) communications in CI-only configurations.

Item: Will JUPITR be a disk server (Y/N)?
Response: Enter Y or N, depending on your configuration requirements. Refer to Section 1.3.3 and Chapter 5 for information on served cluster disks.

Item: Enter a value for JUPITR's ALLOCLASS parameter:
Response: If the system is connected to a dual-ported disk, enter a value from 1 to 255 that will be used on both sides. Otherwise, enter 0.

Item: Does this cluster contain a quorum disk [N]?
Response: Enter Y or N, depending on your configuration. If you enter Y, the procedure prompts for the name of the quorum disk. Enter the device name of the quorum disk.
Table 2-2 Information Requested for Local Area and Mixed-Interconnect Configurations Item Response Will this node be a cluster member (Y /N)? Enter Y. What is the node's DECnet node name? Enter DECnet node name-for example, JUPITR. The DECnet node name may be from 1 to 6 alphanumeric characters in length and may not include dollar signs or underscores. What is the node· s DECnet node address? Enter DECnet node address-for example, 2.2 Will the Ethernet be used for cluster communications (Y /N)? Enter Y. The Ethernet is required for cluster (SCS internode) communications in local area and mixed-interconnect configurations. Enter this cluster's group number: Enter a number in the range from 1-4095 or 61440-65535. Enter this cluster's password: Enter the cluster password. The password must be from 1 to 31 alphanumeric characters in length and may include dollar signs and underscores. 2-5 Preparing the Cluster Operating Environment 2.2 Installing the VMS Operating System in the VAXcluster Environment Table 2-2 (Cont.) 2.3 Information Requested for Local Area and Mixed-Interconnect Configurations Item Response Re-enter this cluster's password for verification: Re-enter the password. Will JUPITR be a disk server (Y /N)? Enter Y. In local area and mixedinterconnect configurations, the system disk is always served to the cluster. Refer to Section 1.3.3 and Chapter 5 for information on served cluster disks. Will JUPITR serve HSC disks (Y /N)? Enter a response appropriate for your configuration. Enter a value for JUPITR' s ALLOCLASS parameter: If the system will serve HSC disks, enter the HSC's allocation class value. If the system is connected to a dual-ported disk, enter a value from 1-255 that will be used on both sides. Otherwise, enter 0. Does this cluster contain a quorum disk [NJ? Enter Y or N, depending on your configuration. If you enter Y, the procedure prompts for the name of the quorum disk. Enter the device name of the quorum disk. Configuring the DECnet-VAX Network After you have installed the operating system and required licenses, you configure, tailor, and start the DECnet-VAX network. This process typically entails several operations: • Executing the SYS$MANAGER:NETCONFIG.COM command procedure. • Making remote node data available clusterwide. • Optionally defining an alias node identifier for the cluster. You establish an alias using NCP commands like those shown in step 4 for alias SOLAR. (For more information on alias node identifiers, refer to the VMS Networking Manual.) Note that if you plan to define an alias node identifier, you must specify that one cluster node operate as a router node when you execute NETCONFIG.COM. Note further that you must later enable alias operations for other cluster nodes, as described in Section 2.3.2. • Starting the network. To perform these operations, proceed as follows: 1 Log in as system manager. 2 Execute the command procedure NETCONFIG.COM, entering information about your node when prompted, and responding YES when the procedure asks whether you want to configure the network ("want these commands to be executed"). Note: When the procedure asks whether you want the network started, answer NO if you first want to define a cluster alias. 2-6 Preparing the Cluster Operating Environment 2.3 Configuring the DECnet-VAX Network Example 2-1 shows typical responses for a cluster network configuration session using NETCONFIG.COM. 
Example 2-1  Sample Interactive Network Configuration Session

$ @NETCONFIG.COM

        DECnet-VAX network configuration procedure

        This procedure will help you define the parameters needed to get
        DECnet running on this machine. You will be shown the changes
        before they are executed, in case you want to perform them manually.

What do you want your DECnet node name to be?     [JUPITR]: [RET]
What do you want your DECnet address to be?       [2.2]: [RET]
Do you want to operate as a router?               [NO (nonrouting)]: YES
Do you want a default DECnet account?             [YES]: [RET]

Here are the commands necessary to set up your system.

Do you want these commands to be executed?        [YES]: [RET]
The changes have been made.

If you have not already registered the DECnet-VAX key, then do so now.

After the key has been registered, you should invoke the procedure
SYS$MANAGER:STARTNET.COM to start up DECnet-VAX with these changes.

(If the key is already registered)
Do you want DECnet started?                       [YES]: NO
$

(In this example, [RET] indicates that you press the RETURN key to accept the default shown in brackets.)

3  NETCONFIG.COM creates, in the SYS$SPECIFIC:[SYSEXE] directory, the permanent remote node database file NETNODE_REMOTE.DAT, in which remote node data is maintained. To make this data available clusterwide, you must rename the file to the SYS$COMMON:[SYSEXE] directory:

   $ RENAME SYS$SPECIFIC:[SYSEXE]NETNODE_REMOTE.DAT -
   _$ SYS$COMMON:[SYSEXE]NETNODE_REMOTE.DAT

4  If you want to define an alias node identifier for the cluster, invoke the Network Control Program (NCP) Utility to do so. For example:

   $ RUN SYS$SYSTEM:NCP
   NCP> DEFINE NODE 2.1 NAME SOLAR
   NCP> DEFINE EXECUTOR ALIAS NODE SOLAR
   NCP> EXIT
   $

   The information you specify using these commands is entered in the DECnet-VAX permanent executor database and takes effect when you start the network.

5  Start the network:

   $ @SYS$MANAGER:STARTNET.COM

6  To ensure that the network is started each time the system boots, add the following line to your site-specific startup command file (for example, SYS$MANAGER:SYSTARTUP_V5.COM):

   $ @SYS$MANAGER:STARTNET.COM

For more detailed information on DECnet-VAX configuration issues and procedures, refer to the VMS Networking Manual.

2.3.1 Copying Remote Node Databases

Some sites with large networks maintain remote node data in a central database file. If this is the case at your site, and if you want to make the data available clusterwide, you can, after starting the network, copy remote node database entries from that central file. For example, if the file resides on node SATURN, you could enter the following NCP commands to copy entries from the permanent database on SATURN to the permanent database on your system disk, and then to update your volatile database:

NCP> COPY KNOWN NODES FROM SATURN USING PERMANENT TO PERMANENT
NCP> SET KNOWN NODES ALL

Note that only node names and addresses are copied. See the VMS Networking Manual for more information on copying node databases.

2.3.2 Enabling Cluster Alias Operations

If you have defined an alias node identifier for your cluster as described in Section 2.3, you can enable alias operations for other cluster nodes after the nodes have joined the cluster.
2.3.2 Enabling Cluster Alias Operations

If you have defined an alias node identifier for your cluster as described in Section 2.3, you can enable alias operations for other cluster nodes after the nodes have joined the cluster. To enable such operations (that is, to allow a node to accept incoming connect requests directed toward the cluster alias node identifier), follow these steps:

1 Log in as system manager and invoke the SYSMAN Utility:

$ RUN SYS$SYSTEM:SYSMAN

2 At the SYSMAN> prompt, enter the following commands:

SYSMAN> SET ENVIRONMENT/CLUSTER
%SYSMAN-I-ENV, current command environment:
        Clusterwide on local cluster
        Username LAZRUS will be used on nonlocal nodes
SYSMAN> SET PROFILE/PRIVILEGES=(OPER,SYSPRV)
SYSMAN> DO MCR NCP SET EXECUTOR STATE OFF
%SYSMAN-I-OUTPUT, command execution on node X...
SYSMAN> DO MCR NCP DEFINE EXECUTOR ALIAS INCOMING ENABLED
%SYSMAN-I-OUTPUT, command execution on node X...
SYSMAN> DO @SYS$MANAGER:STARTNET.COM
%SYSMAN-I-OUTPUT, command execution on node X...

2.4 Coordinating Cluster Command Procedures

You must coordinate your site-specific startup command procedures according to the type of cluster operating environment you want to prepare. For a common-environment cluster, these procedures should perform the same system startup and login functions for each cluster node. For a multiple-environment cluster, you may want some startup commands to remain specific to certain nodes.

Once you have created the common site-specific startup command procedures (for example, SYSTARTUP_V5.COM and SYLOGIN.COM), you can set up each of them as a common file on a cluster-accessible disk or as separate duplicate files. Using either approach, you can include a command in the node-specific startup file that invokes the common startup procedure. In a common-environment cluster, the node-specific startup file for each node invokes a common startup procedure, named, for example, SYSTARTUP_COMMON.COM. Thus, each startup procedure on each node would include a command similar to the following:

$ @device:[SYSMGR]SYSTARTUP_COMMON.COM

Certain startup functions, even in a common-environment cluster, are node specific. Therefore, you should include commands in the node-specific startup procedure on each node to do the following:

• Set up dual-ported and local disks
• Load device drivers
• Set up terminals
• Invoke the common startup command procedure

If the common startup procedure is on a local disk, the node-specific procedure must set up the local disk as a cluster-accessible disk before invoking the common procedure. If the procedure is not on the system disk, the disk on which it resides must be mounted before the procedure can be invoked. Alternatively, you could set up duplicate copies of the common procedure on a separate volume on each cluster node.

To set up a common SYLOGIN procedure, define the logical name SYS$SYLOGIN on each cluster node to be the full file specification of the procedure. If the common SYLOGIN file is on a cluster-accessible disk, you can include the command that defines SYS$SYLOGIN in the common startup procedure. If the cluster nodes use separate duplicate copies of SYLOGIN, you should include the definition in the node-specific startup procedure for each node. For example, the following command defines SYS$SYLOGIN to be the common file [SYSMGR]SYLOGIN on the cluster-accessible disk WORK5:

$ DEFINE/SYSTEM/EXEC SYS$SYLOGIN WORK5:[SYSMGR]SYLOGIN
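Taken together, a node-specific startup file that follows these guidelines might look like the following sketch. The device name DUA2:, the volume label WORK, and the logical name WORK5 are placeholders chosen for this illustration, not values prescribed by this manual:

$ ! Node-specific startup sketch (illustrative names throughout)
$ ! Set up this node's local or dual-ported disk
$ MOUNT/SYSTEM DUA2: WORK WORK5
$ ! Load node-specific device drivers and set terminal characteristics here
$ ! Point SYS$SYLOGIN at the common login procedure
$ DEFINE/SYSTEM/EXEC SYS$SYLOGIN WORK5:[SYSMGR]SYLOGIN
$ ! Invoke the common startup procedure
$ @WORK5:[SYSMGR]SYSTARTUP_COMMON.COM
$ EXIT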
Sections 2.4.1 and 2.4.2 present guidelines for using common and node-specific command procedures to build a cluster environment.

2.4.1 Building Common Command Procedures

The first step in preparing a common-environment cluster is to build cluster common startup and login command procedures. In a common-environment cluster, each cluster node executes the common procedures at startup time to set up the same operating environment on each cluster node. Because each node is set up using the common procedures, users can work in the same operating environment no matter which member node they are logged into.

To build these procedures for a cluster in which existing nodes are to be merged, you should compare both the node-specific SYSTARTUP and SYLOGIN command procedures on each node and make any adjustments required. For example, you can compare the procedures from each node and include commands that define the same logical names in the common startup command procedure. An easy method of comparing the existing procedures and creating common versions is to log in to each cluster node (in the single-system environment) and print the existing SYSTARTUP and SYLOGIN command procedure files. You can then use the file listings to compare the procedures. After you have chosen which commands to make common, you can build the common procedures on one of the cluster nodes.

The strategy for clusters being formed from newly installed VMS systems is basically the same as that used for clusters that are to include previously installed systems; that is, include common elements in a common command procedure file. With newly installed systems, however, the SYSTARTUP and SYLOGIN command procedure files are empty. You must therefore build the common procedures from scratch.

For example, you could build a common startup command procedure named SYSTARTUP_COMMON.COM and include the commands that you want to be common to all nodes. You must decide which of the following elements you want to include in the common procedure:

• Commands that install images.
• Commands that define logical names; for example, the logical name that refers to the location of SYLOGIN.COM.
• Commands that set up queues. (See Chapter 4 for information on setting up cluster queues.)
• Commands that set up and mount physically accessible mass storage devices. (See Chapter 5 for information on setting up cluster disks.)
• Commands that perform any other common site-specific startup functions.

See the Guide to Setting Up a VMS System for more information on startup command procedures.

In a common startup command procedure, the execution of commands that set up queues and mount cluster-accessible devices is node dependent. Therefore, you must include conditional DCL commands to control how these commands are executed. You can include commands that set up queues and mount cluster-accessible devices as part of the common startup procedure or as separate command procedures, such as STARTQ_COMMON.COM or MOUNT_COMMON.COM, that are invoked by the common procedure. Sample procedures for setting up queues and mounting cluster-accessible volumes are described in Chapter 4 and Chapter 5, respectively.
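For example, a common startup procedure can test the node name with the F$GETSYI lexical function and perform node-dependent steps only where they apply. The node name JUPITR and the device names in this sketch are illustrative only:

$ ! Sketch: node-dependent steps inside SYSTARTUP_COMMON.COM
$ node = F$GETSYI("NODENAME")          ! name of the node executing this procedure
$ IF node .NES. "JUPITR" THEN GOTO 10$
$ ! Only JUPITR has this dual-ported disk, so only JUPITR mounts it
$ MOUNT/SYSTEM DUA3: DATA DATA_DISK
$10$:
$ ! Commands common to every node continue here
$ EXIT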
Note: The job-controller queue file, JBCSYSQUE.DAT, must be set up as a common file on a cluster-accessible disk, accessible to all the nodes sharing queues. If you intend to set up common procedures such as SYSTARTUP_COMMON.COM or STARTQ_COMMON.COM as common files on a cluster-accessible disk volume, it is a good idea to locate these files on the same cluster-accessible volume containing JBCSYSQUE.DAT.

To build a common SYLOGIN.COM command procedure, include in a common SYLOGIN command file commands that define symbols or that perform other site-specific functions.

2.4.2 Using Node-Specific System Command Procedures

In a multiple-environment cluster, include elements that you want to remain unique to a node, such as commands to define node-specific logical names, in the node-specific versions of the SYSTARTUP and SYLOGIN files for that node. These files must be placed in the SYS$SPECIFIC root on each node.

For example, consider a three-node cluster consisting of nodes JUPITR, SATURN, and URANUS. The time-sharing environments on nodes JUPITR and SATURN are the same. URANUS is set up for specific turnkey accounts. In this case, you could create common SYSTARTUP and SYLOGIN command procedures for nodes JUPITR and SATURN that set up identical environments on these nodes. The command procedures for node URANUS, however, would be different, set up specifically for URANUS's turnkey environment.

2.5 Coordinating System Files to Define the Cluster User Environment

To prepare the cluster user environment, you must coordinate the following system files:

• SYSUAF.DAT
• NETPROXY.DAT
• RIGHTSLIST.DAT
• VMSMAIL_PROFILE.DATA
• JBCSYSQUE.DAT
• NETNODE_REMOTE.DAT (1)

(1) Depending on the network environment you have set up at your site, you may need to coordinate other network files. For detailed information on coordinating network files in the VAXcluster environment, see the VMS Networking Manual.

These files, which are part of the VMS operating system, contain information that controls such functions as user logins, proxy login access, mail, and access to files and job queues. By coordinating these files, you can define either a common-environment or a multiple-environment cluster.

To define a common-environment cluster, you use a common version of each system file and place the files in the SYS$COMMON:[SYSEXE] directory on a common system disk.

Note: If you want to set up a common-environment cluster with more than one common system disk (for example, in local area or mixed-interconnect configurations), you must coordinate files on each disk and ensure that the disks are mounted with each cluster reboot. Refer to Section 2.5.4 for instructions.

To define a multiple-environment cluster, you use node-specific versions of one or more system files. For example, if you want to allow only a certain group of users to log in to node URANUS, you would create a node-specific version of SYSUAF.DAT and place that file in URANUS's SYS$SPECIFIC:[SYSEXE] directory. That directory may be located in URANUS's root on a common system disk ([SYSB.SYSEXE] on JUPITR, for instance) or on an individual system disk that you have set up on URANUS.

Sections 2.5.1 through 2.5.3 describe the procedures for building a common version of system files. For information on individual system files, refer to the Guide to Setting Up a VMS System.
2.5.1 Coordinating User Accounts

In a common-environment cluster, you must coordinate the user accounts from each node and build common versions of the following files:

• SYSUAF.DAT
• NETPROXY.DAT

If you are setting up a common-environment cluster that consists of newly installed systems, you can follow instructions in the Guide to Setting Up a VMS System to build common SYSUAF.DAT and NETPROXY.DAT files. Because the SYSUAF.DAT file on new VMS systems is empty except for the four DIGITAL-supplied accounts, very little coordination is necessary.

If, however, the cluster is to include one or more systems that have been running with node-specific SYSUAF.DAT and NETPROXY.DAT files, you must create common versions of the files. Procedures for building a common SYSUAF.DAT file from node-specific files are described in Appendix B.

The procedure for creating a common NETPROXY.DAT file is basically the same as that for creating a common SYSUAF.DAT. The main difference is that less coordination is needed when merging the individual NETPROXY.DAT files. For example, UICs are not used in the NETPROXY records and therefore need not be coordinated. You should decide which existing proxy login records you want to keep on the cluster and include these records in the common NETPROXY.DAT file. As with the SYSUAF.DAT files, you can use the Convert Utility to merge the NETPROXY.DAT file from each node to create a common file.

Once you have created common SYSUAF.DAT and NETPROXY.DAT files, you can set up each of them as either a common file on a cluster-accessible disk or as separate duplicate files. Note, however, that if you elect to use duplicate files, you must update all copies whenever you make changes.

If your cluster is running from one common system disk, make sure that SYSUAF.DAT and NETPROXY.DAT are included in SYS$COMMON:[SYSEXE]. If your cluster is running from any other system disk configuration, you must decide where to locate SYSUAF.DAT and NETPROXY.DAT. Once you have placed these two files in a directory, you must define clusterwide logical names to point to them.

Assume that the disk WORK5: is a volume shared by all nodes in the cluster and that it contains cluster common SYSUAF.DAT and NETPROXY.DAT files. The following commands define system logical names that point to the location of the common files:

$ DEFINE/SYSTEM/EXEC SYSUAF WORK5:[SYSEXE]SYSUAF
$ DEFINE/SYSTEM/EXEC NETPROXY WORK5:[SYSEXE]NETPROXY

You must add the DEFINE commands to the common site-specific startup command file. After you have copied the files to the appropriate directory on the cluster-accessible disk volume, you should delete these files from the system disk.
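As noted above, the Convert Utility can merge the records of a node-specific file into the common copy. The following is a minimal sketch only; the input file location SATURN$DUA0:[SYSEXE] is an example, and Appendix B describes the full procedure used for SYSUAF.DAT:

$ ! Merge SATURN's proxy records into the common NETPROXY.DAT (sketch)
$ CONVERT/MERGE SATURN$DUA0:[SYSEXE]NETPROXY.DAT WORK5:[SYSEXE]NETPROXY.DAT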
2.5.2 Preparing the MAIL Database

In a common-environment cluster, you may want to prepare a common mail database to allow users to use the Mail Utility (MAIL) to send and read their MAIL messages from any node in the cluster. Each time MAIL executes in a single-system environment, it accesses a database file named SYS$SYSTEM:VMSMAIL_PROFILE.DATA. To set up VMSMAIL_PROFILE.DATA as a common file, define the logical name VMSMAIL_PROFILE to be the complete file specification of the common file by specifying the DEFINE command in the following format:

$ DEFINE/SYSTEM/EXEC VMSMAIL_PROFILE file-spec

You must make sure that you define the logical name before you invoke MAIL for the first time. When invoked for the first time, MAIL creates the database file, VMSMAIL_PROFILE.DATA, in SYS$SYSTEM by default. By defining VMSMAIL_PROFILE to be the location of a common file on a cluster-accessible disk, you cause MAIL to create and use that file.

If your cluster is running from one common system disk, define VMSMAIL_PROFILE to be SYS$COMMON:[SYSEXE]VMSMAIL_PROFILE and invoke the Mail Utility by entering the following two commands:

$ DEFINE/SYSTEM/EXEC VMSMAIL_PROFILE SYS$COMMON:[SYSEXE]VMSMAIL_PROFILE
$ MAIL

VMSMAIL_PROFILE.DATA will be created in the common system directory. You will no longer need to use the logical name or make changes to the site-specific startup command file.

If your cluster is running from any other system disk configuration, you must decide where to locate the common VMSMAIL_PROFILE.DATA file. (Typically, you would place this file in the same directory in which SYSUAF.DAT and NETPROXY.DAT reside; for example, WORK5:[SYSEXE].) You then define a logical name for the file and invoke the Mail Utility:

$ DEFINE/SYSTEM/EXEC VMSMAIL_PROFILE WORK5:[SYSEXE]VMSMAIL_PROFILE
$ MAIL

The DEFINE command defines VMSMAIL_PROFILE.DATA to be a file located in [SYSEXE] on the cluster-accessible disk volume WORK5. The first time MAIL is invoked, VMSMAIL_PROFILE.DATA is created in WORK5:[SYSEXE]. Subsequently, MAIL uses this file as the database. You must also add the DEFINE command to the common site-specific startup command file.

2.5.3 Preparing the Rights Database

In a common-environment cluster, you can create a common version of the rights database. The rights database is a file that associates users of the system or cluster with special names called identifiers. The rights database file, RIGHTSLIST.DAT, is the basis of the ACL-based protection scheme. For more information on ACLs, see the description in the Guide to VMS System Security.

The cluster or security manager maintains the rights database, adding and removing identifiers as needs change. By allowing groups of users to hold identifiers, the manager creates a different kind of group designation than the one used with the user's UIC. This alternative grouping allows the holders of the identifier to make more efficient use of resources. It also permits each user to be a member of multiple overlapping groups. For information on how the rights database is set up at the local node level, see the VMS Authorize Utility Manual.

If your cluster is running from one common system disk, the installation or upgrade procedure will place the RIGHTSLIST.DAT file in SYS$COMMON:[SYSEXE]. No further action is required on your part.

If your cluster is running from any other system disk configuration, copy SYS$SYSTEM:RIGHTSLIST.DAT to the directory in which you placed the SYSUAF, NETPROXY, and VMSMAIL_PROFILE system files. Then define a clusterwide logical name for the RIGHTSLIST.DAT file. For example:

$ DEFINE/SYSTEM/EXEC RIGHTSLIST WORK5:[SYSEXE]RIGHTSLIST

You must also add this DEFINE command to the common site-specific startup command file.
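For example, the security manager adds an identifier to the rights database and grants it to a user with the Authorize Utility; the identifier name PAYROLL and the username SMITH below are illustrative only:

$ RUN SYS$SYSTEM:AUTHORIZE
UAF> ADD/IDENTIFIER PAYROLL
UAF> GRANT/IDENTIFIER PAYROLL SMITH
UAF> EXIT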
2.5.4 Coordinating Shared System Files in Clusters with Multiple Common System Disks

To prepare a common user environment for any cluster configuration that includes more than one common system disk, you must coordinate the system files listed in Section 2.5. In local area and mixed-interconnect clusters, you must also coordinate the file SYS$MANAGER:NETNODE_UPDATE.COM.

Proceed as follows:

1 Edit the file [VMS$COMMON.SYSMGR]SYLOGICALS.COM on each system disk and define logical names that specify the location of the cluster common files. For example, if the files are to be located on $1$DJA16, you could define logical names like the following:

$ DEFINE/SYSTEM/EXEC SYSUAF $1$DJA16:[VMS$COMMON.SYSEXE]SYSUAF.DAT
$ DEFINE/SYSTEM/EXEC NETPROXY $1$DJA16:[VMS$COMMON.SYSEXE]NETPROXY.DAT
$ DEFINE/SYSTEM/EXEC RIGHTSLIST $1$DJA16:[VMS$COMMON.SYSEXE]RIGHTSLIST.DAT
$ DEFINE/SYSTEM/EXEC VMSMAIL_PROFILE $1$DJA16:[VMS$COMMON.SYSEXE]VMSMAIL_PROFILE.DATA
$ DEFINE/SYSTEM/EXEC NETNODE_REMOTE $1$DJA16:[VMS$COMMON.SYSEXE]NETNODE_REMOTE.DAT
$ DEFINE/SYSTEM/EXEC NETNODE_UPDATE $1$DJA16:[VMS$COMMON.SYSMGR]NETNODE_UPDATE.COM

2 To ensure that the system disks are correctly mounted with each reboot, follow these steps:

a. Copy the file SYS$EXAMPLES:CLU_MOUNT_DISK.COM to the directory [VMS$COMMON.SYSMGR].

b. Edit SYLOGICALS.COM and include commands to mount the system disks with appropriate volume labels. For example, if the system disks are $1$DJA16 and $1$DJA17, you would include commands like these:

$ @SYS$SYSDEVICE:[VMS$COMMON.SYSMGR]CLU_MOUNT_DISK.COM $1$DJA16: volume-label
$ @SYS$SYSDEVICE:[VMS$COMMON.SYSMGR]CLU_MOUNT_DISK.COM $1$DJA17: volume-label

3 In the site-specific file used for queue setup, specify the location of the job controller queue file (JBCSYSQUE.DAT), using a command like the following:

$ START/QUEUE/MANAGER $1$DJA16:[VMS$COMMON.SYSEXE]JBCSYSQUE.DAT

When you execute CLUSTER_CONFIG.COM to add nodes to a cluster with more than one common system disk, a different device name must be used for each system disk on which nodes are added. For this reason, CLUSTER_CONFIG.COM supplies as a default device name the logical volume name (for example, DISK$MARS_SYS1) of SYS$SYSDEVICE: on the local system. Different device names ensure that each node added will have a unique root directory specification, even if the system disks contain roots with the same name (for example, DISK$MARS_SYS1:[SYS10] and DISK$MARS_SYS2:[SYS10]).

3 Building and Maintaining the Cluster

After you have prepared the cluster operating environment as described in Chapter 2, you are ready to set up your site-specific configuration. This chapter provides information to help you build and maintain your cluster. Topics include the following:

• Planning configuration procedures
• Configuring the cluster
• Reconfiguring the cluster after a major change
• Maintaining the cluster

Before you attempt to configure your cluster, be sure you understand the discussions in Chapters 1 and 2.

3.1 Planning Configuration Procedures

The planning needed to configure a cluster depends on several factors:

• The configuration type (CI-only, local area, or mixed interconnect)
• The components to be included in the cluster
• The configuration function you want to execute

Because you must execute the command procedure SYS$MANAGER:CLUSTER_CONFIG.COM to perform all basic configuration functions, it is important that you understand the operations that the procedure can perform. These are described in Section 3.1.1.
If you intend to set up a local area or mixed-interconnect cluster, you must, before executing CLUSTER_CONFIG.COM, do the following:

• Determine locations and sizes for satellite page and swap files
• Select cluster boot servers
• Specify allocation classes for cluster nodes and disks (also applicable for CI-only configurations)

Guidelines are provided in Sections 3.1.2, 3.1.3, and 3.1.4. Note that some configuration functions, such as adding or removing a voting cluster node, require one or more additional operations. Refer to Section 3.3 for instructions.

3.1.1 CLUSTER_CONFIG.COM Functions

When you invoke CLUSTER_CONFIG.COM, the procedure displays a menu of configuration options. By selecting the appropriate option, you can configure the cluster easily and reliably, without invoking VMS utilities directly. You use CLUSTER_CONFIG.COM to perform these functions:

• Add a node to the cluster.
• Remove a node from the cluster.
• Change a cluster node's characteristics.
• Create a duplicate system disk.

Following is a summary of the operations that CLUSTER_CONFIG.COM performs for each configuration option:

ADD: Establish the new node's root directory on a cluster common system disk and generate the node's system parameter files (VAXVMSSYS.PAR and MODPARAMS.DAT) in its SYS$SPECIFIC:[SYSEXE] directory. Update the permanent and volatile remote node network databases for the system on which CLUSTER_CONFIG.COM is executed (the local system) to add the new node. If the new node is a satellite, update SYS$MANAGER:NETNODE_UPDATE.COM on the local system. Generate the new node's page and swap files (PAGEFILE.SYS and SWAPFILE.SYS). Optionally set up a cluster quorum disk. Set the allocation class (ALLOCLASS) value for the new node, if the node is being added as a disk server. Generate an initial (temporary) startup procedure for the new node. This initial procedure runs NETCONFIG.COM to configure the network, runs AUTOGEN to set appropriate SYSGEN parameter values for the node, and reboots the node with normal startup procedures.

REMOVE: Delete another node's root directory and its contents from the local system's system disk. If the node being removed is a satellite, update SYS$MANAGER:NETNODE_UPDATE.COM on the local system. Update the permanent and volatile remote node network databases on the local system.

CHANGE: Enable or disable the local system as a disk server; enable or disable the local system as a boot server; enable or disable the Ethernet for cluster communications on the local system; enable or disable a quorum disk on the local system; change the local system's ALLOCLASS value; change a satellite's Ethernet hardware address. The procedure displays the CHANGE menu and prompts for appropriate information.

CREATE: Duplicate the local system's system disk and remove all system roots from the new disk.

3.1.2 Determining Locations and Sizes for Satellite Page and Swap Files

When you add a node to the cluster, CLUSTER_CONFIG.COM prompts for the sizes and location of the node's page and swap files. (The default sizes supplied by the procedure are minimums.) Depending on the configuration of your system disk and your network, you may realize a performance improvement in local area and mixed-interconnect configurations by locating page and swap files for satellites on a satellite's local RD series disk, if such a disk is available.
To set up page and swap files on a satellite's local disk, CLUSTER_CONFIG.COM creates (in the satellite's [SYSx.SYSEXE] directory on the boot server's system disk) the command procedure SATELLITE_PAGE.COM. This procedure executes when AUTOGEN reboots the satellite at the end of CLUSTER_CONFIG.COM, and it performs the following functions:

• Mounts the satellite's local disk with a volume label in the format 'node'_SCSSYSTEMID.
• Installs the page and swap files on the local disk.

If you want to alter the volume label, follow these steps after the satellite has been added to the cluster:

1 Enter a DCL command in the following format:

$ SET VOLUME/LABEL=volume-label device-spec[:]

Note that the SET VOLUME command requires write access (W) to the index file on the volume. If you are not the volume's owner, you must have either a system UIC or the SYSPRV privilege.

2 Update SATELLITE_PAGE.COM to reflect the new label.

To relocate the satellite's page and swap files (for example, from the satellite's local disk to the boot server's system disk, or the reverse), or to change file sizes, the easiest way is to remove the satellite from the cluster and then add it again, using CLUSTER_CONFIG.COM.

3.1.3 Selecting Boot Servers for Mixed-Interconnect Clusters

While every mixed-interconnect cluster must have at least one boot server, multiple servers offer the following advantages:

• Higher availability: satellites can access served disks and boot, even if one of the boot servers is temporarily unavailable.
• Better workload balancing: the task of serving HSC disks to satellites can place a significant load on a boot server. With multiple boot servers, this workload is distributed across more processors and Ethernet adapters.

Use as boot servers the most powerful machines you have available. Processors with the power of a VAX 8530 or greater have sufficient CPU power to perform disk-serving functions without serious degradation in response time. Less powerful machines can become overloaded when serving many busy satellites, or when many satellites boot simultaneously.

Note, however, that two or more lower-powered boot servers provide better performance than a single high-powered server. Multiple servers give better availability, and they distribute the workload across more Ethernet adapters. If, for example, you have five VAX processors available (a VAX 8800, a VAX 8350, two VAX-11/785s, and a VAX-11/750), use all the machines as boot servers except the VAX-11/750. If you have several processors of roughly comparable power, it is reasonable to use them all as boot servers. This arrangement gives optimal load balancing, and if one machine fails or is shut down, others remain available to serve satellites.

After CPU power, the second most important factor in selecting a boot server is the speed of its Ethernet adapter. Boot servers should be equipped with the highest-bandwidth Ethernet adapters you have available for the machines.

3.1.4 Specifying Allocation Class Values in Mixed-Interconnect Clusters

Before setting up any mixed-interconnect cluster, you must determine allocation class values for the boot server(s) and HSCs. It is easiest to use the same value for all HSCs and all boot servers; you can arbitrarily choose a number between 1 and 255. Note, however, that to change the allocation class value on any CI-connected VAX processor or HSC, you must shut down and reboot the entire cluster. (See Section 3.3.)
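On a VAX node, the chosen allocation class is ultimately the SYSGEN parameter ALLOCLASS. CLUSTER_CONFIG.COM normally records the value for you, but the effect is simply a MODPARAMS.DAT entry that AUTOGEN applies at the next reboot; the value 1 in this sketch is only an example:

! MODPARAMS.DAT entry (sketch); AUTOGEN applies it when the node is rebooted
ALLOCLASS = 1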
Every device allocation class name (a name of the form $1$ddcu) must be unique across all boot servers and HSCs. For RA series disks, make sure that all the removable unit plugs on all disks of that allocation class are unique. As long as you have no more than 256 such disks, this is easy to accomplish. Assume, for instance, that 10 disks are dual pathed between the HSCs VOYGR1 and VOYGR2, and 10 others are dual pathed between the HSCs VIKNG1 and VIKNG2. Provided that all 20 disks have unique unit numbers, you can assign all four HSCs the same allocation class value.

If you have more than 256 HSC-connected disks, you must, to ensure unique disk names, use two or more allocation classes for the HSCs. You must also configure one or more nodes to serve HSC disks and assign allocation class values accordingly. To perform those operations, you can execute the CLUSTER_CONFIG.COM CHANGE function, described in Section 3.2.3.

Additionally, you must make sure that all locally connected disks have unique allocation class names. Consider the following example: if nodes SATURN and URANUS each have one BDA disk controller with a single-pathed RA81 disk connected to it, and if both controllers have an allocation class value of 1, the RA81 connected to SATURN with unit plug 0 will receive the device name $1$DUA0. Likewise, the RA81 connected to URANUS with unit plug 0 will be $1$DUA0. Because both disks have the same name, they appear to VMS software to be the same disk, and confusion or even corruption could result. You can avoid this potential problem by switching one disk's unit plug.

Note that because fewer unit numbers are available for MASSBUS or UNIBUS disks, fewer unique disk names are possible. To ensure that disk names remain unique in your cluster, you may have to relocate such disks or disqualify a node as a disk server.

3.2 Configuring the Cluster

To perform configuration functions, you execute CLUSTER_CONFIG.COM. Before invoking the procedure, be sure to verify the following:

• You are logged in to the system manager's account on an appropriate node. If you are building a new local area or mixed-interconnect cluster, you must be logged in on a node that you want to set up as a boot server. If you are adding a satellite node, you must be logged in on a boot server. Note that the process privileges SYSPRV, OPER, CMKRNL, BYPASS, and NETMBX are required, because the procedure performs sensitive system operations.
• The DECnet-VAX network is up and running.
• You have at hand the data listed in Table 3-1. Note that some items are configuration specific.
• If your configuration has two or more system disks, you have coordinated cluster common files, as described in Section 2.5.4.

Sections 3.2.1 through 3.2.6 provide examples of typical interactive CLUSTER_CONFIG.COM sessions. Section 3.3 describes tasks you must perform after executing CLUSTER_CONFIG.COM to make major configuration changes.

Caution: You may not initiate concurrent CLUSTER_CONFIG.COM sessions.

Table 3-1 Data Requested by CLUSTER_CONFIG.COM

Item: Device name of the cluster system disk on which root directories will be created.
How to specify or obtain: System manager specifies. Default is the logical volume name of SYS$SYSDEVICE: (for example, DISK$VAXVMSRL5:).

Item: Node's root directory name on the cluster system disk.
How to specify or obtain: System manager specifies. Name must be of the form SYSx.
For CI-connected nodes, x is a hexadecimal digit in the range 1 through 9 or A through D (for example, SYS1 or SYSA). For satellites, x must be in the range 10 through FFFF. The procedure supplies a valid default.

Item: Node's DECnet node name.
How to specify or obtain: Network manager supplies. The name must be from 1 to 6 alphanumeric characters and may not include dollar signs or underscores.

Item: Node's DECnet node address.
How to specify or obtain: Network manager supplies.

Item: Cluster group number and password, if CHANGE is run to enable cluster communications over the Ethernet.
How to specify or obtain: System manager specifies.

Item: If the node is a satellite, the satellite's Ethernet hardware address. The address has the form xx-xx-xx-xx-xx-xx. Note that you must include the dashes when you specify a hardware address.
How to specify or obtain: When the DECnet-VAX network is running on the boot server, proceed as follows:

• For MicroVAX II and VAXstation II satellites, enter the following commands at the satellite's console:

>>> B/100 XQ
Bootfile: READ_ADDR

• For MicroVAX 2000 and VAXstation 2000 satellites, enter the following commands at successive console-mode prompts (see note 1):

>>> T 53
?>>>
>>> B/100 ES
Bootfile: READ_ADDR

• For MicroVAX 3xxx series satellites, enter the following command at the satellite's console:

>>> SHOW ETHERNET

• For VAXstation 8000 satellites, enter commands as shown in the following example, and then construct the Ethernet hardware address from the values displayed by the system:

>>> E/P/L 20000218
>>> E/P/L 2000C21C
0000BC9A 87654321

In this example, the address is 21-43-65-87-9A-BC.

Note 1: If the second prompt appears as ?>>>, press RETURN.

Item: Workstation windowing system.
How to specify or obtain: System manager specifies. Workstation software must be installed before workstation satellites are added. If it is not, the procedure indicates that fact.

Item: Location and sizes of page and swap files.
How to specify or obtain: System manager specifies.

Item: Value for the local system's allocation class (ALLOCLASS) parameter.
How to specify or obtain: System manager specifies.

Item: Device name of quorum disk.
How to specify or obtain: System manager specifies.

3.2.1 Adding a Node to the Cluster

Once you have made the necessary preparations, you can execute CLUSTER_CONFIG.COM to add a new node to the cluster.

• If you are setting up a CI-only cluster, invoke CLUSTER_CONFIG.COM on an active cluster system and select the ADD function.
Examples 3-1 and 3-2 illustrate the use of CLUSTER_CONFIG.COM on node JUPITR to add, respectively, CI-connected node SATURN and satellite node EUROP A to the cluster. Caution: If either the local system or the new node should fail before the ADD function completes, you must, after normal conditions are restored, perform the REMOVE function to erase any invalid data, and then restart the ADD function. Example 3-1 Sample Interactive CLUSTER_CONFIG.COM Session to Add a Cl-Connected Node as a Boot Server $ ©CLUSTER_CONFIG.COM Cluster Configuration Procedure Use CLUSTER_CONFIG.COM to set up or change a VAXcluster configuration. To ensure that you have the required privileges, invoke this procedure from the system manager's account. Enter ? for help at any prompt. 1. ADD a node to the cluster. 2. REMOVE a node from the cluster. 3. CHANGE a cluster node's characteristics. 4. CREATE a second system disk for JUPITR. Enter choice [1] : ~ The ADD function adds a new node to the cluster. If the node being added is a voting member, EXPECTED_VOTES in all other cluster members' MODPARAMS.DAT must be adjusted, and the cluster must be rebooted. If the new node is a satellite, the network databases on JUPITR are updated. The network databases on all other cluster members must be updated. Example 3-1 Cont'd. on next page 3-7 Building and Maintaining the Cluster 3.2 Configuring the Cluster Example 3-1 (Cont.) Sample Interactive CLUSTER_CQNFIG.COM Session to Add a ClConnected Node as a Boot Server For instructions, see the VMS VAXcluster Manual. What is the node's DECnet node name? SATURN What is the node's DECnet address? 2.3 Will SATURN be a satellite [Y]? N Will SATURN be a boot server [Y]? ~ This procedure will now ask you for the device name of SATURN's system root. The default device name (DISK$VAXVMSRL5:) is the logical volume name of SYS$SYSDEVICE: . What is the device name for SATURN'S system root [DISK$VAXVMSRL5:]? What is the name of the new system root [SYSA]? ~ Creating directory tree SYSA ... %CREATE-I-CREATED, $1$DJA11:<SYSA> created %CREATE-I-CREATED, $1$DJA11:<SYSA.SYSEXE> created ~ System root SYSA created. Enter a value for SATURN's ALLOCLASS parameter: 1 Does this cluster contain a quorum disk [N]? Y What is the device name of the quorum disk? $1$DJA12 Updating network database ... Size of page file for SATURN [10000 blocks]? 50000 Size of swap file for SATURN [8000 blocks]? 20000 Will a local (non-HSC) disk on SATURN be used for paging and swapping? N If you specify a device other than DISK$VAXVMSRL5: for SATURN's page and swap files, this procedure will create PAGEFILE_SATURN.SYS and SWAPFILE_SATURN.SYS in the <SYSEXE> directory on the device you specify. What is the device name for the page and swap files [DISK$VAXVMSRL5:]? %SYSGEN-I-CREATED, $1$DJA11:<SYSA.SYSEXE>PAGEFILE.SYS;1 created %SYSGEN-I-CREATED, $1$DJA11:<SYSA.SYSEXE>SWAPFILE.SYS;1 created The configuration procedure has completed successfully. SATURN has been configured to join the cluster. Before booting SATURN, you must create a new default bootstrap command procedure for SATURN. See your processor-specific installation and operations guide for instructions. The first time SATURN boots, NETCONFIG.COM and AUTOGEN.COM will run automatically. The following parameters have been set for SATURN: VOTES = 1 EXPECTED_VOTES = 2 QDSKVOTES = 1 After SATURN has booted into the cluster, you must increment the value for EXPECTED_VOTES in every cluster member's MODPARAMS.DAT. 
You must then reconfigure the cluster, using the procedure described in the VMS VAXcluster Manual. 3-8 ~ Building and Maintaining the Cluster 3.2 Configuring the Cluster Example 3-2 Sample Interactive CLUSTER_CONFIG.COM Session to Add a Satellite Node with Local Page and Swap Files $ ©CLUSTER_CONFIG.COM Cluster Configuration Procedure Use CLUSTER_CONFIG.COM to set up or change a VAXcluster configuration. To ensure that you have the required privileges, invoke this procedure from the system manager's account. Enter ? for help at any prompt. 1. ADD a node to the cluster. 2. REMOVE a node from the cluster. 3. CHANGE a cluster node's characteristics. 4. CREATE a second system disk for JUPITR. Enter choice [1]: ~ The ADD function adds a new node to the cluster. If the node being added is a voting member, EXPECTED_VOTES in all other cluster members' MODPARAMS.DAT must be adjusted, and the cluster must be rebooted. If the new node is a satellite, the network databases on JUPITR are updated. The network databases on all other cluster members must be updated. For instructions, see the VMS VAXcluster Manual. What is the node's DECnet node name? EUROPA What is the node's DECnet address? 2.21 Will EUROPA be a satellite [Y]? ~ Verifying circuits in network database ... This procedure will now ask you for the device name of EUROPA's system root. The default device name (DISK$VAXVMSRL5:) is the logical volume name of SYS$SYSDEVICE: . What is the device name for EUROPA'S system root [DISK$VAXVMSRL5:]? What is the name of the new system root [SYS10]? ~ Allow conversational bootstraps on EUROPA [NO]? ~ The following workstation windowing options are available: ~ 1. No workstation software 2. VWS Workstation Software Enter choice [1] : 2 Example 3-2 Cont'd. on next page 3-9 Building and Maintaining the Cluster 3.2 Configuring the ·cluster Example 3-2 (Cont.) Sample Interactive CLUSTER_CQNFIG.COM Session to Add a Satellite Node with Local Page and Swap Files Creating directory tree SYS10 ... %CREATE-I-CREATED, $1$DJA11:<SYS10> created %CREATE-I-CREATED, $1$DJA11:<SYS10.SYSEXE> created System root SYS10 created. Will EUROPA be a disk server [NJ? ~ What is EUROPA's Ethernet hardware address? 08-00-2B-03-51-75 Updating network database ... Size of pagefile for EUROPA [10000 blocks]? 20000 Size of swap file for EUROPA [8000 blocks]? 12000 Will a local disk on EUROPA be used for paging and swapping? YES Creating temporary page file in order to boot EUROPA for the first time ... %SYSGEN-I-CREATED, $1$DJA11:<SYS10.SYSEXE>PAGEFILE.SYS;1 created This procedure will now wait until EUROPA joins the cluster. Once EUROPA joins the cluster, this procedure will ask you to specify a local disk on EUROPA for paging and swapping. Please boot EUROPA now. Waiting for EUROPA to boot ... (User enters boot command at satellite's console-mode prompt (>>>). For MicroVAX II, VAXstation II, and MicroVAX 3xxx series satellites, user enters B XQ. For MicroVAX 2000 and VAXstation 2000 satellites, user enters B ES. For VAXstation 8000 satellites, user enters B ET60) The local disks on EUROPA are: Device Name EUROPA$DUAO: EUROPA$DUA1: Device Status Online Online Error Count 0 0 Volume Label Free Blocks Which disk can be used for paging and swapping? EUROPA$DUAO: May this procedure INITIALIZE EUROPA$DUAO: [YES]? NO Mounting EUROPA$DUAO: ... PAGEFILE.SYS already exists on EUROPA$DUAO: *************************************** Directory EUROPA$DUAO: [SYSO.SYSEXE] PAGEFILE.SYS;1 23600/23600 Total of 1 file, 23600/23600 blocks. 
*************************************** Example 3-2 Cont'd. on next page 3-10 Trans Count Mnt Cnt Building and Maintaining the Cluster 3.2 Configuring the Cluster Example 3-2 (Cont.) Sample Interactive CLUSTER_CQNFIG.COM Session to Add a Satellite Node with Local Page and Swap Files What is the file specification for the page file on EUROPA$DUAO: [ <SYSO.SYSEXE>PAGEFILE.SYS ]? ~ %CREATE-I-EXISTS, EUROPA$DUAO:<SYSO.SYSEXE> already exists This procedure will use the existing pagefile, EUROPA$DUAO:<SYSO.SYSEXE>PAGEFILE.SYS;. SWAPFILE.SYS already exists on EUROPA$DUAO: *************************************** Directory EUROPA$DUAO: [SYSO.SYSEXE] SWAPFILE.SYS;1 12000/12000 Total of 1 file, 12000/12000 blocks. *************************************** What is the file specification for the swap file on EUROPA$DUAO: [ <SYSO.SYSEXE>SWAPFILE.SYS ]? ~ This procedure will use the existing swapfile, EUROPA$DUAO:<SYSO.SYSEXE>SWAPFILE.SYS;. AUTOGEN will now reconfigure and reboot EUROPA automatically. These operations will complete in a few minutes, and a completion message will be displayed at your terminal. The configuration procedure has completed successfully. 3.2.1.1 Updating Network Data after Adding a Satellite Whenever you add a satellite, CLUSTER_CONFIG.COM updates both the permanent and volatile remote node network databases on the boot server. However, the volatile databases on other cluster members are not automatically updated. To share the new data throughout the cluster, you must update the volatile databases on all other cluster members. Log in as system manager, invoke the SYSMAN Utility, and enter the following commands at the SYSMAN > prompt: SYSMAN> SET ENVIRONMENT/CLUSTER %SYSMAN-I-ENV, current command environment: Clusterwide on local cluster Username LAZRUS will be used on nonlocal nodes SYSMAN> SET PROFILE/PRIVILEGES=(OPER,SYSPRV) SYSMAN> DO MCR NCP SET KNOWN NODES ALL %SYSMAN-I-OUTPUT, command execution on node X... SYSMAN> EXIT $ 3-11 Building and Maintaining the Cluster 3.2 Configuring the Cluster 3.2.1.2 Restoring a Satellite's Network Data The first time you execute CLUSTER_CONFIG.COM to add a satellite, the procedure creates the file NETNODE_UPDATE.COM in the boot server's SYS$SPECIFIC:[SYSMGR] directory. 1 This file, which is updated each time you add or remove a satellite, or change its Ethernet hardware address, contains all essential network configuration data for the satellite. If an unexpected condition at your site should cause configuration data to be lost, you can use NETNODE_UPDATE.COM to restore it. You can also read the file when you need to obtain data about individual satellites. Note that you may want to edit the file occasionally to remove obsolete entries. Example 3-3 shows the contents of the file after satellite nodes EUROPA and GANYMD have been added to the cluster. 
Example 3-3 Sample NETNODE_UPDATE.COM File

$ run sys$system:ncp
define node EUROPA address 2.21
define node EUROPA hardware address 08-00-2B-03-51-75
define node EUROPA load assist agent sys$share:niscs_laa.exe
define node EUROPA load assist parameter $1$DJA11:<SYS10.>
define node EUROPA tertiary loader sys$system:tertiary_vmb.exe
define node GANYMD address 2.22
define node GANYMD hardware address 08-00-2B-03-58-14
define node GANYMD load assist agent sys$share:niscs_laa.exe
define node GANYMD load assist parameter $1$DJA11:<SYS11.>
define node GANYMD tertiary loader sys$system:tertiary_vmb.exe

1 For a common-environment cluster, you must rename this file to SYS$COMMON:[SYSMGR]NETNODE_UPDATE.COM, as described in Section 2.5.4.

3.2.1.3 Controlling Clusterwide Broadcast Messages on Satellites and Boot Servers

When a satellite node joins the cluster, broadcasts for all message classes are initially enabled for the satellite by default. Users can disable such broadcasts selectively by including a form of the DCL command SET BROADCAST in their LOGIN.COM files. For example, the following command would disable OPCOM and SHUTDOWN messages:

$ SET BROADCAST=(NOOPCOM, NOSHUTDOWN)

Note that broadcasts to the operator console terminal (OPA0:) on satellite workstation nodes are disabled by default and should remain disabled at all times. Users who want to receive broadcast messages can create a terminal window and then enter the DCL command REPLY/ENABLE. (This command requires OPER privilege.) For more detailed information on workstation operations, refer to the documentation supplied with the workstation software.

In large clusters, state transitions (nodes joining or leaving the cluster) will generate many multiline OPCOM messages on a boot server's console device. You can abbreviate such messages by including the DCL command REPLY/DISABLE=CLUSTER in the appropriate site-specific startup command file, or by entering the command interactively from the system manager's account.

3.2.2 Removing a Node from the Cluster

Before you can remove a node from the cluster, you must shut down the node. If possible, use the command procedure SYS$SYSTEM:SHUTDOWN.COM to perform an orderly shutdown. Otherwise, halt the machine. Note that because the REMOVE function deletes the node's entire root directory tree, it generates VMS RMS error messages while deleting directory files. You can ignore these messages.

Whenever you remove a voting member from the cluster, you must, after the REMOVE function completes, reconfigure the cluster, following instructions in Section 3.3.

Example 3-4 illustrates the use of CLUSTER_CONFIG.COM on node JUPITR to remove satellite node EUROPA from the cluster.

Note: If the page and swap files for the node being removed do not reside on the same disk as the node's root directory tree, the REMOVE function does not delete these files. It displays a message warning that the files will not be deleted, as in Example 3-4. If you want to delete the files, you must do so after the REMOVE function completes.

Example 3-4 Sample Interactive CLUSTER_CONFIG.COM Session to Remove a Satellite Node with Local Page and Swap Files

$ @CLUSTER_CONFIG.COM

Cluster Configuration Procedure

Use CLUSTER_CONFIG.COM to set up or change a VAXcluster configuration. To ensure that you have the required privileges, invoke this procedure from the system manager's account.

Enter ? for help at any prompt.

1. ADD a node to the cluster.
2. REMOVE a node from the cluster.
3. CHANGE a cluster node's characteristics.
4. CREATE a second system disk for JUPITR.

Enter choice [1]: 2

The REMOVE function disables a node as a cluster member.

    o It deletes the node's root directory tree.
    o It removes the node's network information from the network database.

If the node being removed is a voting member, you must adjust EXPECTED_VOTES in each remaining cluster member's MODPARAMS.DAT. You must then reconfigure the cluster, using the procedure described in the VMS VAXcluster Manual.

What is the node's DECnet node name? EUROPA
Verifying network database ...
Verifying that SYS10 is EUROPA's root ...

WARNING - EUROPA's page and swap files will not be deleted.
          They do not reside on $1$DJA11:.

Deleting directory tree SYS10 ...
%DELETE-I-FILDEL, $1$DJA11:<SYS10>SYSCBI.DIR;1 deleted (1 block)
%DELETE-I-FILDEL, $1$DJA11:<SYS10>SYSERR.DIR;1 deleted (1 block)

System root SYS10 deleted.
Updating network database ...
The configuration procedure has completed successfully.

3.2.3 Changing a Node's Characteristics

You select the CHANGE function when you want to accomplish any of the operations described in Table 3-2. When you select this function, CLUSTER_CONFIG.COM displays a menu of CHANGE options. Note that all operations except changing a satellite's Ethernet hardware address must be executed on the system whose characteristics you want to change (the local system). If you plan to set up a new local area or mixed-interconnect cluster, you must, before adding nodes, execute the CHANGE function to enable the first installed node as a boot server (see Example 3-7).

Caution: Whenever you enable or disable disk serving functions, you must run AUTOGEN with the REBOOT option to reboot the local system. For all other CHANGE operations (except changing a satellite's hardware address), you must reconfigure the cluster, following instructions in Section 3.3.

Table 3-2 CLUSTER_CONFIG.COM CHANGE Options

Option: Enable the local system as a disk server.
Operation performed: Load the MSCP server by setting, in MODPARAMS.DAT, the value of the MSCP_LOAD parameter to 1, and setting an appropriate value for the MSCP_SERVE_ALL parameter.

Option: Disable the local system as a disk server.
Operation performed: Set MSCP_LOAD to 0.

Option: Enable the local system as a boot server.
Operation performed: If you are setting up a local area or mixed-interconnect cluster, you must execute this operation once before you attempt to add nodes to the cluster. You thereby enable DECnet MOP service for the Ethernet adapter circuit that the node will use to service downline load requests from satellites. When you enable the node as a boot server, it automatically becomes a disk server (if it is not one already), because it must serve its system disk to satellites.

Option: Disable the local system as a boot server.
Operation performed: Disable DECnet MOP service for the node's Ethernet adapter circuit.

Option: Enable the Ethernet for cluster communications on the local system.
Operation performed: Load the VAXport driver PEDRIVER by setting the value of the NISCS_LOAD_PEA0 parameter to 1 in MODPARAMS.DAT. Create the cluster security database file, SYS$COMMON:[SYSEXE]CLUSTER_AUTHORIZE.DAT, on the local system's system disk.
Option: Disable the Ethernet for cluster communications on the local system.
Operation performed: Set NISCS_LOAD_PEA0 to 0.

Option: Enable a quorum disk on the local system.
Operation performed: Set, in MODPARAMS.DAT, an appropriate value for the SYSGEN parameter DISK_QUORUM; set the value of QDSKVOTES to 1 (default value).

Option: Disable a quorum disk on the local system.
Operation performed: Set, in MODPARAMS.DAT, a blank value for the SYSGEN parameter DISK_QUORUM; set the value of QDSKVOTES to 1.

Option: Change the local system's allocation class value.
Operation performed: Set a value for the node's ALLOCLASS parameter in MODPARAMS.DAT.

Option: Change a satellite's Ethernet hardware address.
Operation performed: Change a satellite's hardware address, in the event that its Ethernet device should need replacement. Both the permanent and volatile network databases, and NETNODE_UPDATE.COM, are updated on the local system. You must execute this operation on any node enabled as a boot server for the satellite.

Note: When CLUSTER_CONFIG.COM sets or changes values in MODPARAMS.DAT, the new values are always appended at the end of the file, so that they override earlier values. You may want to edit the file occasionally and delete lines that specify earlier values.

Examples 3-5 through 3-8 show the use of CLUSTER_CONFIG.COM to perform the following operations:

• Enable node URANUS as a disk server
• Change node URANUS's ALLOCLASS value
• Enable node URANUS as a boot server
• Specify a new hardware address for satellite node ARIEL, which boots from URANUS's system disk

Example 3-5 Sample Interactive CLUSTER_CONFIG.COM Session to Enable the Local System as a Disk Server

$ @CLUSTER_CONFIG.COM

Cluster Configuration Procedure

Use CLUSTER_CONFIG.COM to set up or change a VAXcluster configuration. To ensure that you have the required privileges, invoke this procedure from the system manager's account.

Enter ? for help at any prompt.

1. ADD a node to the cluster.
2. REMOVE a node from the cluster.
3. CHANGE a cluster node's characteristics.
4. CREATE a second system disk for URANUS.

Enter choice [1]: 3

CHANGE Menu

1. Enable URANUS as a disk server.
2. Disable URANUS as a disk server.
3. Enable URANUS as a boot server.
4. Disable URANUS as a boot server.
5. Enable Ethernet for cluster communications on URANUS.
6. Disable Ethernet for cluster communications on URANUS.
7. Enable a quorum disk on URANUS.
8. Disable a quorum disk on URANUS.
9. Change URANUS's ALLOCLASS value.
10. Change a satellite's hardware address.

Enter choice [1]: <RETURN>

Will URANUS serve HSC disks [Y]? <RETURN>
Enter a value for URANUS's ALLOCLASS parameter: 2
The configuration procedure has completed successfully.

URANUS has been enabled as a disk server. MSCP_LOAD has been set to 1 in MODPARAMS.DAT. Please run AUTOGEN to reboot URANUS:

$ @SYS$UPDATE:AUTOGEN GETDATA REBOOT

If you have changed URANUS's ALLOCLASS value, you must reconfigure the cluster, using the procedure described in the VMS VAXcluster Manual.

Example 3-6 Sample Interactive CLUSTER_CONFIG.COM Session to Change the Local System's ALLOCLASS Value

$ @CLUSTER_CONFIG.COM

Cluster Configuration Procedure

Use CLUSTER_CONFIG.COM to set up or change a VAXcluster configuration. To ensure that you have the required privileges, invoke this procedure from the system manager's account.

Enter ? for help at any prompt.

1. ADD a node to the cluster.
2. REMOVE a node from the cluster.
3. CHANGE a cluster node's characteristics.
4. CREATE a second system disk for URANUS.
Enter choice [1] : 3 CHANGE Menu 1. Enable URANUS as a disk server. 2. Disable URANUS as a disk server. 3. Enable URANUS as a boot server. 4. Disable URANUS as a boot server. 5. Enable Ethernet for cluster communications on URANUS. 6. Disable Ethernet for cluster communications on URANUS. 7. Enable a quorum disk on URANUS. 8. Disable a quorum disk on URANUS. 9. Change URANUS's ALLOCLASS value. 10. Change a satellite's hardware address. Enter choice [1] : 9 Enter a value for URANUS's ALLOCLASS parameter [2]: 1 The configuration procedure has completed successfully If you have changed URANUS'S ALLOCLASS value, you must reconfigure the cluster, using the procedure described in the VMS VAXcluster Manual. Example 3-7 Sample Interactive CLUSTER_CONFIG.COM Session to Enable the Local System as a Boot Server $ ©CLUSTER_CONFIG.COM Cluster Configuration Procedure Use CLUSTER_CONFIG.COM to set up or change a VAXcluster configuration. To ensure that you have the required privileges, invoke this procedure from the system manager's account. Enter ? for help at any prompt. 1. ADD a node to the cluster. 2. REMOVE a node from the cluster. 3. CHANGE a cluster node's characteristics. 4. CREATE a second system disk for URANUS. Enter choice [1] : 3 Example 3-7 Cont'd. on next page 3-17 Building and Maintaining the Cluster 3.2 Configuring the Cluster Example 3-7 (Cont.) Sample Interactive CLUSTER_CQNFIG.COM Session to Enable the Local System as a Boot Server CHANGE Menu 1. Enable URANUS as a disk server. 2. Disable URANUS as a disk server. 3. Enable URANUS as a boot server. 4. Disable URANUS as a boot server. 5. Enable Ethernet for cluster communications on URANUS. 6. Disable Ethernet for cluster communications on URANUS. 7. Enable a quorum disk on URANUS. 8. Disable a quorum disk on URANUS. 9. Change URANUS's ALLOCLASS value. 10. Change a satellite's hardware address. Enter choice [1] : 3 Verifying circuits in network database ... Updating permanent network database ... In order to enable or disable DECnet MOP service in the volatile network database, DECnet traffic must be interrupted temporarily. Do you want to proceed [Y]? ~ Enter a value for URANUS's ALLOCLASS parameter [1]: ~ The configuration procedure has completed successfully. URANUS has been enabled as a boot server. Disk serving and Ethernet capabilities are enabled automatically. If URANUS was not previously set up as a disk server, please run AUTOGEN to reboot URANUS: $ ©SYS$UPDATE:AUTOGEN GETDATA REBOOT If you have changed URANUS'S ALLOCLASS value, you must reconfigure the cluster, using the procedure described in the VMS VAXcluster Manual. Example 3-8 Sample Interactive CLUSTER_CONFIG.COM Session to Change a Satellite's Hardware Address $ ©CLUSTER_CONFIG.COM Cluster Configuration Procedure Use CLUSTER_CONFIG.COM to set up or change a VAXcluster configuration. To ensure that you have the required privileges, invoke this procedure from the system manager's account. Enter ? for help at any prompt. 1. ADD a node to the cluster. 2. REMOVE a node from the cluster. 3. CHANGE a cluster node's characteristics. 4. CREATE a second system disk for URANUS. Enter choice [1] : 3 Example 3-8 Cont'd. on next page 3-18 Building and Maintaining the Cluster 3.2 Configuring the Cluster Example 3-8 (Cont.) Sample Interactive CLUSTER_CONFIG.COM Session to Change a Satellite's Hardware Address CHANGE Menu 1. Enable URANUS as a disk server. 2. Disable URANUS as a disk server. 3. Enable URANUS as a boot server. 4. Disable URANUS as a boot server. 5. 
Enable Ethernet for cluster communications on URANUS. 6. Disable Ethernet for cluster communications on URANUS. 7. Enable a quorum disk on URANUS. 8. Disable a quorum disk on URANUS. 9. Change URANUS's ALLOCLASS value. 10. Change a satellite's hardware address. Enter choice [1] : 10 What is the node's DECnet node name? ARIEL What is the new hardware address [08-00-2B-06-81-44]? 08-00-3B-05-37-78 Updating network database ... The configuration procedure has completed successfully. 3.2.4 Changing the Cluster Configuration Type As your processing needs change, you may want to add satellites to an existing CI-only cluster, or you may want to add CI-connected processors or HSCs to an existing local area cluster. In either case, you can use CLUSTER_ CONFIG.COM to convert your existing cluster to a mixed-interconnect configuration. 3.2.4.1 Changing an Existing Cl-Only Cluster to a Mixed-Interconnect Configuration If you want to convert an existing CI-only cluster to a mixed-interconnect configuration, you must enable cluster communications over the Ethernet on all VAX processors, and you must enable one or more processors as boot servers. Proceed as follows: 1 Log in as system manager on each VAX processor, invoke CLUSTER_ CONFIG.COM, and execute the CHANGE function to enable the Ethernet for cluster communications. You must perform this operation on all VAX processors. 2 Execute the CHANGE function to enable one or more processors as boot servers. 3 Shut down and reboot the cluster, following instructions in Section 3.3. 3-19 Building and Maintaining the Cluster 3.2 Configuring the Cluster 3.2.4.2 Changing an Existing Local Area Cluster to a Mixed-Interconnect Configuration Before performing the operations described in this section, be sure that the VAX processors and HSCs you intend to include in your new mixedinterconnect configuration are correctly installed and checked for proper operation. The method you use to convert an existing local area cluster to a mixedinterconnect configuration depends on whether your current boot server is a CI-capable VAX processor. Note that the following procedures assume that the system disk containing satellite node roots will reside on an HSC. If the boot server is a CI-capable processor, proceed as follows: 1 Log in as system manager on the boot server and perform an image backup operation to back up the current system disk to a disk on an HSC. (For complete information on backup operations, refer to the VMS Backup Utility Manual.) 2 Modify the system's default bootstrap command procedure to boot the system from the HSC disk, following instructions in the appropriate processor-specific installation and operations guide. 3 Shut down the cluster. Shut down the satellites first, then shut down the boot server. 4 Boot the boot server from the newly created system disk on the HSC. 5 Reboot the satellites. If your current boot server is not a CI-capable processor, proceed as follows: 3-20 1 Shut down the old local area cluster. Shut down the satellites first, then shut down the boot server. 2 Install the VMS operating system on the new CI-connected VAX processor's HSC system disk. When the installation procedure asks if you want to enable the Ethernet for cluster communications, answer YES. 3 When the installation completes, log in as system manager and configure and start the DECnet-VAX network, as described in Chapter 2. 4 Execute the CLUSTER_CONFIG.COM CHANGE function to enable the node as a boot server. 
5 Log in as system manager on the newly added CI-connected node and execute CLUSTER_CONFIG.COM's ADD function to add the former local area cluster members (including the former boot server) as satellites on the new HSC system disk. Building and Maintaining the Cluster 3.2 Configuring the Cluster 3.2.5 Converting a Standalone Node to a Cluster Node You execute CLUSTER_CONFIG.COM on a standalone node to perform the following operations: • Add the standalone node to an existing cluster. • Set up the standalone node to form a new cluster, if the node was not set up as a cluster node during installation of the VMS operating system. Example 3-9 illustrates the use of CLUSTER_CONFIG.COM on standalone node PLUTO to convert PLUTO to a cluster boot server. Example 3-9 Sample Interactive CLUSTER_CONFIG.COM Session to Convert a Standalone Node to a Cluster Boot Server $ ©CLUSTER_CONFIG.COM Cluster Configuration Procedure This procedure sets up this standalone node to join an existing cluster or to form a new cluster. What is the node's DECnet node name? PLUTO What is the node's DECnet address? 2.5 Will the Ethernet be used for cluster communications (Y/N)? Y Enter this cluster's group number: 3378 Enter this cluster's password: Re-enter this cluster's password for verification: Will PLUTO be a boot server [Y]? ~ Verifying circuits in network database ... Enter a value for PLUTO's ALLOCLASS parameter: 1 Does this cluster contain a quorum disk [N]? ~ AUTOGEN computes the SYSGEN parameters for your configuration and then reboots the system with the new parameters. 3.2.6 Creating a Duplicate System Disk To duplicate a cluster system disk, proceed as follows, after you have coordinated cluster common files, as described in Section 2.5.4. 1 Log in as system manager. 2 Place a blank disk in an appropriate drive and spin up the disk. 3 Invoke CLUSTER_CONFIG.COM and select the CREATE function. The procedure will prompt you for the device names of the current and new system disks. It will then back up the current system disk to the new one, delete all directory roots from the new disk, and mount that disk clusterwide. Note that you will see VMS RMS error messages while the procedure deletes directory files. You can ignore these messages. Example 3-10 shows a typical interactive CREATE session on node JUPITR. 3-21 Building and Maintaining the Cluster 3.2 Configuring the Cluster Example 3-10 Sample Interactive CLUSTER_CONFIG.COM CREATE Session $ @CLUSTER_CONFIG.COM Cluster Configuration Procedure Use CLUSTER_CONFIG.COM to set up or change a VAXcluster configuration. To ensure that you have the required privileges, invoke this procedure from the system manager's account. Enter? for help at any prompt. 1. ADD a node to the cluster. 2. REMOVE a node from the cluster. 3. CHANGE a cluster node's characteristics. 4. CREATE a second system disk for JUPITR. Enter choice [1] : 4 The CREATE function generates a duplicate system disk. o It backs up the current system disk to the new system disk. o It then removes from the new system disk all system roots. WARNING - Do not proceed unless you have defined appropriate logical names for cluster common files in your site-specific startup procedures. For instructions, see the VMS VAXcluster Manual. Do you want to continue [N]? YES This procedure will now ask you for the device name of JUPITR's system root. The default device name (DISK$VAXVMSRL5:) is the logical volume name of SYS$SYSDEVICE: . What is the device name of the current system disk [DISK$VAXVMSRL5:]? 
~ What is the device name for the new system disk? $1$DJA16: %DCL-I-ALLOC, _$1$DJA16: allocated %MOUNT-I-MOUNTED, SCRATCH mounted on _$1$DJA16: What is the unique label for the new system disk [JUPITR_SYS2]? ~ Backing up the current system disk to the new system disk ... Deleting all system roots ... Deleting directory tree SYS1 ... %DELETE-I-FILDEL, $1$DJA16:<SYSO>DECNET.DIR;1 deleted (2 blocks) System root SYS1 deleted. Deleting directory tree SYS2 ... %DELETE-I-FILDEL, $1$DJA16:<SYS1>DECNET.DIR;1 deleted (2 blocks) System root SYS2 deleted. All the roots have been deleted. %MOUNT-I-MOUNTED, JUPITR_SYS2 mounted on _$1$DJA16: The second system disk has been created and mounted clusterwide. Satellites can now be added. 3-22 Building and Maintaining the Cluster 3.2 Configuring the Cluster 3.3 Reconfiguring the Cluster after a Major Change Because the following operations affect the integrity of the entire cluster, you must reconfigure the cluster after executing any of them. • Adding or removing a voting cluster member • Enabling or disabling the Ethernet for cluster communications • Enabling or disabling a quorum disk • Changing allocation class values • Changing the cluster group number or password (see Section 3.4.6) In all cases, you must shut down and reboot the entire cluster. Note that if you add or remove a voting member, or if you enable or disable a quorum disk, you must update MODPARAMS.DAT files before shutting down the cluster. To perform these reconfiguration tasks, follow instructions in Sections 3.3.1 through 3.3.4. 3.3.1 Updating MODPARAMS.DAT Files to Adjust Cluster Quorum Whenever you add or remove a voting cluster node, or whenever you enable or disable a quorum disk, you must edit MODPARAMS.DAT in all other cluster members' [SYSn.SYSEXE] directories and adjust the value for the SYSGEN parameter EXPECTED_VOTES appropriately. For example, if you add a voting node, or if you enable a quorum disk, you must increment the value by the number of votes assigned to the new member (usually 1). If you add a voting node with 1 vote and enable a quorum disk with 1 vote on that node, you must increment the value by 2. You must then prepare to shut down and reboot the entire cluster. To ensure that the new values take effect when you reboot, log in on each node as system manager and run AUTOGEN to propagate the values to the node's VAXVMSSYS.P AR file. Enter the following command: $ ©SYS$UPDATE:AUTOGEN GETDATA SETPARAMS Be sure not to specify the SHUTDOWN or REBOOT options. Caution: Do not perform this operation until you are ready to shut down and reboot the entire cluster. If a node should fail or crash, and then reboot with the new parameters, normal cluster operations can be seriously compromised. 3-23 Building and Maintaining the Cluster 3.3 Reconfiguring the Cluster after a Major Change 3.3.2 Shutting Down the Cluster After you have run AUTOGEN to set parameter values correctly, you must shut down the entire cluster. Log in as system manager on each node locally and enter the following command to perform an orderly shutdown: $ ©SYS$SYSTEM:SHUTDOWN When you are prompted for the shutdown options, specify CLUSTER for cluster shutdown. Note that you must run the shutdown procedure and specify this option on each node. You cannot shut down the entire cluster from one node. 3.3.3 Changing Allocation Class Values on HSCs If it is necessary to change allocation class values on any HSC controller, you must do so while the entire cluster is shut down. 
Enter a command sequence like the following at the appropriate HSC consoles:

[CTRL/C]
HSC> RUN SETSHO
SETSHO> SET ALLOCATE DISK 1
SETSHO> EXIT
SETSHO-Q Rebooting HSC; Y to continue, CTRL/Y to abort: ? Y

3.3.4 Rebooting the Cluster

After all HSCs have been set and rebooted, reboot each cluster node. Watch the console listings for unusual messages or warnings.

Caution: In local area and mixed-interconnect clusters, you must reboot boot servers before rebooting satellites.

Note that several new messages may appear. For example, if you have used the CLUSTER_CONFIG.COM CHANGE function to enable cluster communications over the Ethernet, one message will report that the Local Area VAXcluster security database is being loaded. Then, for every disk-serving node, you will see a message reporting that the MSCP Server is being loaded, followed by a list of all the disks being served by that node. You should verify that all disks are being served in the manner that you specified when you designed the configuration.

3.4 Maintaining the Cluster

Once your cluster is up and running, you can implement routine site-specific maintenance operations, for example, backing up disks or adding user accounts. You should also plan to run AUTOGEN with the FEEDBACK option on a regular basis, as described in Section 3.4.1.

You should also maintain records of current configuration data, especially any changes to hardware or software components. Section 3.4.2 lists items that should be included in your records.

If you are managing a local area or mixed-interconnect cluster, it is important to monitor Ethernet activity. Section 3.4.3 provides information to help you set up a monitoring procedure.

From time to time conditions may occur that require the following special maintenance operations:

• Restoring cluster quorum after an unexpected node failure
• Executing conditional shutdown operations
• Performing security functions in local area and mixed-interconnect clusters

These operations are discussed in Sections 3.4.4, 3.4.5, and 3.4.6.

3.4.1 Running AUTOGEN with the FEEDBACK Option

In VMS Version 5.0, AUTOGEN has been enhanced with a mechanism called feedback. This new mechanism examines data collected during normal system operation, and it adjusts system parameters on the basis of the collected data whenever you run AUTOGEN with the FEEDBACK option. DIGITAL strongly recommends that you use the new feedback mechanism.

Without feedback, it is difficult for AUTOGEN to anticipate patterns of resource usage, particularly in complex configurations. Factors such as the number of nodes and disks in the cluster, and the types of applications being run, require adjustment of system parameters for optimal performance. You should therefore run AUTOGEN with feedback frequently.

As a cluster grows, settings for many parameters must be adjusted. The settings AUTOGEN chooses for a cluster with 3 CI-connected VAX processors and 5 satellites will no longer be appropriate when you add more processors or satellites. In summary, you should rerun AUTOGEN whenever you make significant changes in your configuration. For detailed information on AUTOGEN, refer to the Guide to Setting Up a VMS System.

3.4.2 Recording Configuration Data

Effective maintenance of a VAXcluster configuration requires that you keep accurate records on the current status of all hardware and software components and on any changes made to those components.
Changes to cluster components can have a significant effect on the operation of the entire cluster. And if a failure should occur, you will need to consult your records when diagnosing problems. At a minimum, your configuration records should include the following: • SCSNODE and SYSSYSTEMID parameter values for all cluster nodes. • DECnet names and addresses for all cluster nodes. • Current values for cluster-related SYSGEN parameters, especially ALLOCLASS values for HSCs and VAX processors. (Cluster SYSGEN parameters are described in Appendix A.) • Default bootstrap command procedures for all CI-connected nodes. • Names of Ethernet adapter circuits. • Names of cluster disk and tape devices. • In local area and mixed-interconnect clusters, Ethernet hardware addresses for satellite nodes. 3-25 Building and Maintaining the Cluster 3.4 Maintaining the Cluster • Serial numbers of all hardware components. • Changes to any hardware or software components (including site-specific command procedures) along with dates and times when changes were made. Maintaining current records for your configuration is necessary both for routine operations and for eventual troubleshooting activities. 3.4.3 Monitoring Ethernet Activity in Local Area and Mixed-Interconnect Clusters In local area and mixed-interconnect clusters it is important that you monitor Ethernet activity on a regular basis. Using NCP commands like those shown in the accompanying example, (where BNA-0 is the line-id of the Ethernet line), you can set up a convenient monitoring procedure to report activity for each 12-hour period. Note that DECnet event logging for event 0.2 (automatic line counters) must be enabled. (For detailed information on DECnet-VAX event logging, refer to the VMS Network Control Program Manual.) NCP> DEFINE LINE BNA-0 COUNTER TIMER 43200 NCP> SET LINE BNA-0 COUNTER TIMER 43200 Every timer interval (in this case 12 hours) DECnet will create an event that sends counter data to the DECnet event log. If you experience a performance degradation in your cluster, check the event log for increases in counter values that exceed normal variations for your cluster. If all nodes show the same increase, there may be a general problem with your Ethernet configuration. If, on the other hand, only one node shows a deviation from usual values, there is probably a problem with that node or its Ethernet interface device. 3.4.4 Restoring Cluster Quorum after an Unexpected Node Failure During the life of a cluster, nodes join and leave the cluster. For example, you may need to add more processors to the cluster to extend the cluster's processing capabilities, or a node may shut down unexpectedly as the result of a hardware or fatal software error. The connection management software coordinates these cluster transitions and controls cluster operation. When a cluster node shuts down unexpectedly, the remaining nodes, with the help of the Connection Manager, reconfigure the cluster, excluding the node that shut down. The cluster will survive the failure of the node and continue to process, as long as the cluster votes total is greater than the cluster quorum value. If the cluster votes total falls below the cluster quorum value, the cluster suspends the execution of all processes. For process execution to resume, the cluster votes total must be restored to a value greater than or equal to the cluster quorum value. Often, the required votes are added as nodes join or rejoin the cluster. 
However, waiting for a node to join the cluster and raising the votes value is not always a simple or convenient remedy. An alternative solution, for example, might be to shut down and reboot all the nodes with a lower quorum value. In any case, it is important to be aware of cluster state changes in order to prevent potential problems. 3-26 Building and Maintaining the Cluster 3.4 Maintaining the Cluster Following the failure of a node, you may want to run the Show Cluster Utility and examine values for the VOTES, EXPECTED_VOTES, CL_VOTES, and CL _QUORUM fields. (See the VMS Show Cluster Utility Manual for a complete description of these fields.) The VOTES and EXPECTED_VOTES fields show the settings for each cluster member; the CL_VOTES and CL_ QUORUM fields show the cluster votes total and the current cluster quorum value. To examine these values, enter the following commands: $ SHOW CLUSTER/CONTINUOUS COMMAND> ADD VOTES,EXPECTED_VOTES,CL_VOTES,CL_QUORUM Note: If you want to enter SHOW CLUSTER commands interactively, you must specify the /CONTINUOUS qualifier as part of the SHOW CLUSTER command string. If you do not specify this qualifier, SHOW CLUSTER will display cluster status information returned by the DCL command SHOW CLUSTER and will return you to the DCL command level. If the display from the Show Cluster Utility shows the CL_VOTES value equal to the CL_QUORUM value, the cluster will not survive the failure of any remaining voting node. If one of these nodes shuts down, all process activity in the cluster will stop. To prevent the disruption of cluster process activity, you can lower the cluster quorum value. You can use the DCL command SET CLUSTER/EXPECTED_ VOTES to adjust the cluster quorum to a value you specify. If you do not specify a value, the system calculates an appropriate value for you. You need enter the command on only one node to propagate the new value throughout the cluster. When you enter the command, the system reports the new value. Note that you normally use the SET CLUSTER/EXPECTED_VOTES command only when a node is leaving the cluster for an extended period. (For more information on this command, see the VMS DCL Dictionary.) If, for example, you want to change expected votes to set the cluster quorum to 2, enter the following command: $ SET CLUSTER/EXPECTED_VOTES=3 The resulting value is (3 + 2)/2 = 2. Note that no matter what value you specify for the SET CLUSTER /EXPECTED_VOTES command, you cannot increase quorum to a value that is greater than the number of the votes present, nor can you reduce quorum to a value that is half or fewer of the votes present. To make the new value active clusterwide, you must adjust the SYSGEN parameter EXPECTED_VOTES in MODPARAMS.DAT files on each cluster node, and then reconfigure the cluster, following instructions in Section 3.3. When a node that was previously a cluster member is ready to rejoin, you must reset the SYSGEN parameter EXPECTED_VOTES to its original value in MODPARAMS.DAT on all nodes and then reconfigure the cluster, following instructions in Section 3.3. You do not need to use the SET CLUSTER /EXPECTED_VOTES command to increase cluster quorum, because the quorum value will be increased automatically when the node rejoins the cluster. You can also reduce cluster quorum by selecting one of the cluster-related shutdown options described in Section 3.4.5. 
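To pull the commands in this section together, the following sequence shows one possible recovery after two voting nodes in a five-vote cluster have failed. This is only a sketch; the vote counts shown are illustrative. First examine the quorum-related fields:

$ SHOW CLUSTER/CONTINUOUS
COMMAND> ADD VOTES,EXPECTED_VOTES,CL_VOTES,CL_QUORUM

Suppose the display shows CL_VOTES and CL_QUORUM both equal to 3. After leaving the display and returning to DCL command level, you could then lower the cluster quorum:

$ SET CLUSTER/EXPECTED_VOTES=3

The new quorum value is (3 + 2)/2 = 2, so the three remaining voting nodes can now survive the failure of one more voting node.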
3-27 Building and Maintaining the Cluster 3.4 Maintaining the Cluster 3.4.5 Selecting Cluster Shutdown Options The VMS operating system provides four options for shutting down cluster nodes: • REMOVE_NODE • CLUSTER_SHUTDOWN • REBOOT_CHECK • SAVE__FEEDBACK Sections 3.4.5.1 through 3.4.5.4 explain these options. If you do not select any option (if you select the default SHUTDOWN option NONE) the SHUTDOWN procedure will default to the normal behavior for shutting down a standalone system. If you want to shut down a node that you expect to rejoin the cluster shortly, you can select the default option. In that case, cluster quorum will not be adjusted, because it is assumed that the node will soon rejoin the cluster. 3.4.5.1 The REMOVE_NODE Option If you want to shut down a cluster node that you expect will not be rejoining the cluster for an extended period, select the REMOVE_NODE option. For example, a node may be waiting for new hardware, or you may decide that you want to use a node standalone indefinitely. When you use the REMOVE_NODE option, the active quorum in the remainder of the cluster will be adjusted downward to reflect the fact that the removed node's votes will no longer be contributing to the quorum value. The SHUTDOWN procedure readjusts the quorum by issuing the SET CLUSTER/EXPECTED_VOTES command, which is subject to the usual constraints described in Section 5.4. Note that it is still the responsibility of the system manager to change the SYSGEN parameter EXPECTED_VOTES on the remaining nodes, to reflect the new configuration. 3.4.5.2 The CLUSTER_SHUTDOWN Option If you want to shut down the entire cluster, select the CLUSTER_ SHUTDOWN option. When you select this option, the node will suspend activity, just short of shutting down completely, until all nodes in the cluster have reached the same point in the SHUTDOWN procedure. When this condition occurs, all nodes shut down together. Note that when you select the CLUSTER_SHUTDOWN option to perform a clusterwide shutdown operation, you must still shut down each node in the cluster by invoking the SHUTDOWN.COM procedure at each node's console. If any one node in the cluster is not completely shut down, clusterwide shutdown cannot occur. Instead, operations on all other nodes in the cluster are suspended. 3-28 Building and Maintaining the Cluster 3.4 Maintaining the Cluster 3.4.5.3 The REBOQT_CHECK Option When you select the REBOOT_CHECK option, the SHUTDOWN procedure checks for the existence of basic system files that are needed to reboot the system successfully and notifies you if any files are missing. You should replace such files before proceeding. If all files are present, the following success message appears: %SHUTDOWN-I-CHECKOK, Basic reboot consistency check completed. Note that you can select the REBOOT_CHECK option separately or in conjunction with either the REMOVE_NODE or CLUSTER_SHUTDOWN option. If you select REBOOT_CHECK with one of the other options, be sure to separate the option list with a comma. 3.4.5.4 The SAVE_FEEDBACK Option You select the SAVE_FEEDBACK option to enable AUTOGEN feedback operation. Note that you should select this option only when your system has been running long enough to reflect your typical workload. For detailed information on AUTOGEN feedback, see the Guide to Setting Up a VMS System. 
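For example, to shut down a node that will be out of the cluster for some time, and to confirm that it has the files needed for a later reboot, you could combine the REMOVE_NODE and REBOOT_CHECK options when the SHUTDOWN procedure prompts for them. The following is only a sketch; the surrounding questions are omitted, and the exact wording of the options prompt may differ on your system:

$ @SYS$SYSTEM:SHUTDOWN
   (answer the usual questions about shutdown time and reason)
Shutdown options [NONE]: REMOVE_NODE,REBOOT_CHECK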
3.4.6 Performing Security Functions in Local Area and Mixed-Interconnect Clusters

Because multiple local area and mixed-interconnect clusters may coexist on a single Ethernet, mechanisms are provided to ensure the integrity of individual clusters and to prevent access to a cluster (accidental or deliberate) by an unauthorized node. Cluster security mechanisms prevent problems that could otherwise occur under circumstances like the following:

• When setting up a new cluster, the system manager specifies a group number identical to that of an existing cluster on the same Ethernet. (This condition is not as unlikely as it may at first appear, because system managers will probably not assign group numbers randomly.) However, provided each cluster's password is unique, the new cluster will form independently.

• A satellite node user with access to a local system disk tries to join a cluster by executing a conversational SYSBOOT operation at the satellite's console.

The following mechanisms are designed to help system managers perform security functions:

• A cluster authorization file (SYS$COMMON:[SYSEXE]CLUSTER_AUTHORIZE.DAT), initialized during installation of the VMS operating system or during execution of the CLUSTER_CONFIG.COM CHANGE function. The file is maintained with the SYSMAN Utility.

• Control of conversational bootstrap operations on satellite nodes.

These mechanisms are discussed in Sections 3.4.6.1 and 3.4.6.2.

3.4.6.1 Maintaining Cluster Security Data

Security data is maintained in the cluster authorization file, SYS$COMMON:[SYSEXE]CLUSTER_AUTHORIZE.DAT, which contains the cluster group number and (in encrypted form) the cluster password. The file is accessible only to users with the SYSPRV privilege.

Under normal conditions, you need not alter records in the CLUSTER_AUTHORIZE.DAT file interactively. If, however, you suspect a security breach, you may want to change the cluster password. In that case, you use the SYSMAN Utility to make the change.

Note that if your configuration has multiple system disks, each disk must have a copy of CLUSTER_AUTHORIZE.DAT. You must run the utility to update all copies.

Caution: If you change either the group number or the password, you must reboot the entire cluster. For instructions, see Section 3.3.

To invoke the SYSMAN Utility, log in as system manager on a boot server and enter the following command:

$ RUN SYS$SYSTEM:SYSMAN
SYSMAN>

When the utility responds with the SYSMAN> prompt, you can enter any of the CONFIGURATION commands listed in Table 3-3.

Table 3-3 Summary of SYSMAN CONFIGURATION Commands for Cluster Authorization

HELP
    Qualifiers: None
    Function: Explains the command's functions.

CONFIGURATION SET CLUSTER_AUTHORIZATION
    Function: Updates the cluster authorization file, SYS$COMMON:[SYSEXE]CLUSTER_AUTHORIZE.DAT. (The SET command will create this file if it does not already exist.)
    Qualifiers:
        /GROUP_NUMBER: Specifies a cluster group number. The group number must be in the range from 1 to 4095 or 61440 to 65535.
        /PASSWORD: Specifies a cluster password. The password may be from 1 to 31 characters in length and may include alphanumeric characters, dollar signs, and underscores.

CONFIGURATION SHOW CLUSTER_AUTHORIZATION
    Qualifiers: None
    Function: Displays the cluster group number.

Example 3-11 illustrates the use of the SYSMAN Utility to change the cluster password.
3-30 Building and Maintaining the Cluster 3.4 Maintaining the Cluster Example 3-11 Sample Interactive SYSMAN CONFIGURATION Session $ RUN SYS$SYSTEM:SYSMAN SYSMAN> SET ENVIRONMENT/CLUSTER %SYSMAN-I-ENV, current command environment: Clusterwide on local cluster Username LAZRUS will be used on nonlocal nodes SYSMAN> SET PROFILE/PRIVILEGES=SYSPRV SYSMAN> CONFIGURATION SET CLUSTER_AUTHORIZATION/PASSWORD=newpassword %SYSMAN-I-CAFOLDGROUP, existing group will not be changed %SYSMAN-I-CAFREBOOT, cluster authorization file updated The entire cluster should be rebooted. SYSMAN> EXIT $ 3.4.6.2 Controlling Conversational Bootstrap Operations for Satellites When you add a satellite node to the cluster using CLUSTER_CONFIG.COM, the procedure asks whether you want to allow conversational bootstrap operations for the satellite (default is NO). If you press RETURN, SYSGEN parameter NISCS_CONV_BOOT in the satellite's SYSGEN parameter file remains set to 0 to disable such operations. The parameter file, VAXVMSSYS.PAR, resides in the satellite's root directory on a boot node's system disk (device:[SYSx.SYSEXE]). You may later enable conversational bootstrap operations for a given satellite at any time by setting this parameter to 1. For example, to enable such operations for a satellite booted from root 10 on device $1$DJA11, you would proceed as follows: 1 Log in as system manager on the boot server. 2 Invoke the System Generation Utility (SYSGEN) and enter the following commands: $ RUN SYS$SYSTEM:SYSGEN SYSGEN> USE $1$DJA11:[SYS10.SYSEXE]VAXVMSSYS.PAR SYSGEN> SET NISCS_CONV_BOOT 1 SYSGEN> WRITE $1$DJA11:[SYS10.SYSEXE]VAXVMSSYS.PAR SYSGEN> EXIT $ 3-31 4 Setting Up and Managing Cluster Queues On a standalone system, print and batch job processing is limited to a single processor and local devices. In VAXcluster configurations, however, nodes can share device and processing resources. This ability to share resources allows for better workload balancing because batch and print job processing can be distributed across the cluster. You control how jobs share device and processing resources in a cluster by setting up and maintaining cluster queues. The strategy you use to set up and manage these queues will determine how well you match workloads to your cluster's device and processor resources. You establish and control cluster queues with the same commands you use to manage queues on a standalone VMS system. These commands are described in the VMS DCL Dictionary. The sections that follow describe how to set up cluster queues. The chapter assumes some knowledge of queue management on a standalone system, as described in the Guide to Setting Up a VMS System. 4.1 Clusterwide Queues Clusterwide queues are controlled by a clusterwide job controller queue file. This file makes queues available across the cluster and enables jobs to execute on any queue from any node, provided that the necessary mass storage volumes can be accessed by the node on which the job executes. There can be only one job controller queue file on a cluster. If there is such a queue file, it must be on a disk that is accessible to the nodes participating in the clusterwide queue scheme. You control which nodes in the cluster share clusterwide queues by specifying the location of the job controller queue file, JBCSYSQUE.DAT, with the DCL command START/QUEUE/MANAGER. 
You could use the following command string, for example, to set up a clusterwide queue:

$ START/QUEUE/MANAGER SYS$COMMON:[SYSEXE]JBCSYSQUE.DAT

All nodes using queues must specify the same queue file in the START/QUEUE/MANAGER command.

4.2 Cluster Printer Queues

To establish printer queues, you should first decide on the type of queue configuration that will best suit your system. On a cluster, you have several alternatives that depend on the number and type of print devices you have on each node, and how you want print jobs to be processed. For example, make these decisions:

• Whether to set up generic printer queues that are local to each node
• Which printer queues should be assigned to any local generic queues
• Whether to set up any clusterwide generic queues that will distribute print job processing across the cluster

Once you determine the strategy for your system, you can create a command procedure that will set up your queues. Figure 4-1 shows the printer configuration for a cluster consisting of the active nodes JUPITR, SATURN, and URANUS. The sections that follow use this example configuration to illustrate various methods for establishing and naming cluster printer queues. Sample command procedures are also included in Section 4.4 to serve as a guide to setting up queues.

Figure 4-1 Sample Printer Configuration (nodes JUPITR, SATURN, and URANUS)

4.2.1 Setting Up Printer Queues

You should set up printer queues using the same procedures that you would use for a single-node system (see the Guide to Setting Up a VMS System). However, since each local node is part of the cluster system, you must provide a unique name for each queue you create in a cluster.

You assign a unique name to a printer queue by specifying the DCL command INITIALIZE/QUEUE in the following format:

INITIALIZE/QUEUE/ON=node::device queue-name

The /ON qualifier specifies the node and printer that the queue is assigned to. The commands in the following example make local printer queue assignments for the cluster node JUPITR shown in Figure 4-2:

$ INITIALIZE/QUEUE/ON=JUPITR::LPAO/START JUPITR_LPAO
$ INITIALIZE/QUEUE/ON=JUPITR::LPBO/START JUPITR_LPBO

Figure 4-2 Printer Queue Configuration 1

4.2.2 Setting Up Clusterwide Generic Printer Queues

The clusterwide job controller queue file enables you to establish generic queues that function throughout the cluster. Jobs queued to clusterwide generic queues are placed in any assigned printer queue that is available, regardless of its location in the cluster. However, the file queued for printing must be accessible to the node to which the printer is connected.

Figure 4-3 illustrates a clusterwide generic printer queue, in which the queues for all LPAO printers in the cluster are assigned to a clusterwide generic queue named SYS$PRINT.

Figure 4-3 Cluster Printer Queue Configuration With Clusterwide Generic Printer Queue

The following command initializes and starts the clusterwide generic queue SYS$PRINT:

$ INITIALIZE/QUEUE/GENERIC=(JUPITR_LPAO,SATURN_LPAO,URANUS_LPAO)/START SYS$PRINT

Jobs queued to SYS$PRINT are placed in whichever assigned printer queue is available. Thus, in this example, a print job from node JUPITR that is queued to SYS$PRINT may in fact be queued to JUPITR_LPAO, SATURN_LPAO, or URANUS_LPAO.
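From a user's point of view, jobs reach the clusterwide generic queue with the ordinary PRINT command. For example, the following command (REPORT.LIS is a hypothetical file) queues a job to SYS$PRINT; because SYS$PRINT is also the default print queue, the /QUEUE qualifier could be omitted:

$ PRINT/QUEUE=SYS$PRINT REPORT.LIS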
In addition to creating a queue for each local printer, you may want to establish at least one local generic queue for similar devices on the local node. The following commands set up the local generic queue for node JUPITR shown in Figure 4-4. $ INITIALIZE/QUEUE/GENERIC=(JUPITR_LPAO,JUPITR_LPBO)/START JUPITR_PRINT $ DEFINE/SYSTEM SYS$PRINT JUPITR_PRINT 4-4 Setting Up and Managing Cluster Queues 4.2 Cluster Printer Queues Figure 4-4 Printer Queue Configuration With Local Generic Queue ...l ..L.il. ZK-1633-84 In Figure 4-4 the generic printer queue JUPITR_pRINT is set up and explicitly assigned the printer queues JUPITR_LP AO and JUPITR_LPBO. In a single-node environment, you would name the generic queue SYS$PRINT, because print jobs are queued to SYS$PRINT by default. In a cluster, however, the separate nodes cannot have independent queues with the same name; therefore, you cannot create multiple generic queues named SYS$PRINT. To get around this problem, you can create a generic queue, assign it a unique queue name, and then establish a systemwide logical name equating SYS$PRINT to the generic queue name. This logical name assignment is systemwide on the local node, affecting operations on that node. Thus, only print jobs from users on JUPITR are queued to JUPITR_ PRINT by default. Because print jobs on each cluster node are queued to SYS$PRINT by default, you might want to establish SYS$PRINT as a clusterwide generic printer queue that distributes print job processing throughout the cluster. 4-5 Setting Up and Managing Cluster Queues 4.3 Cluster Batch Queues 4.3 Cluster Batch Queues Before you establish batch queues, you should first decide on the type of queue configuration that will best suit your cluster. As system manager, you are responsible for setting up batch queues to maintain efficient batch job processing on the cluster. For example, you should do the following: • Determine what type of processing will be performed on each node • Set up local batch queues that conform to these processing needs • Decide whether to set up any clusterwide generic queues that will distribute batch job processing across the cluster Once you determine the strategy that best suits your system needs, you can create a command procedure that will set up your queues. Figure 4-5 shows the batch queue configuration for a cluster consisting of the active nodes JUPITR, SATURN, and URANUS. The sections that follow will use this example configuration to illustrate various methods for establishing and naming cluster batch queues. Sample command procedures for this configuration are also included in Section 4.4 to serve as a guide to setting up queues. Figure 4-5 Sample Batch Queue Configuration ZK-1635-84 4-6 Setting Up and Managing Cluster Queues 4.3 Cluster Batch Queues 4.3.1 Setting Up Executor Batch Queues Generally, you set up executor batch queues on each cluster node using the same procedures you use for a single-node system. For more detailed information on how this is done, see the Guide to Setting Up a VMS System. You assign a unique name to a batch queue by specifying the DCL command INITIALIZE/QUEUE in the following format: INITIALIZE/QUEUE/ON=node:: queue-name The /ON qualifier specifies the node on which the batch queue runs. 
The commands in the following example make local batch queue assignments for the cluster node JUPITR shown in Figure 4-5:

$ INITIALIZE/QUEUE/BATCH/ON=JUPITR::/START JUPITR_BATCH
$ INITIALIZE/QUEUE/BATCH/ON=JUPITR::/START JUPITR_TEXT

In a single-node environment, you would name one batch queue SYS$BATCH, because batch jobs are queued to SYS$BATCH by default. You may decide to follow this convention for each node in the cluster. In a cluster, however, the separate nodes cannot have independent queues with the same name; therefore, you cannot create a queue named SYS$BATCH for each node in the cluster. To get around this problem, you can create a queue, assign it a unique queue name, and then establish a systemwide logical name equating SYS$BATCH to the queue name as follows:

$ INITIALIZE/QUEUE/BATCH/ON=JUPITR::/START JUPITR_BATCH
$ DEFINE/SYSTEM SYS$BATCH JUPITR_BATCH

This logical name definition is systemwide on the local node, affecting only operations on that node. Thus, only batch jobs from users on JUPITR are queued to JUPITR_BATCH by default.

Because batch jobs on each cluster node are queued to SYS$BATCH by default, you should consider establishing SYS$BATCH as a clusterwide generic batch queue that distributes batch job processing throughout the cluster. Note, however, that you should do this only if you have a common-environment cluster. Guidelines for establishing clusterwide generic batch queues are presented in the following section.

4.3.2 Setting Up Generic Batch Queues

Unlike a printer queue, a batch queue can be set up to allow more than one job to execute simultaneously. For this reason, it is often not necessary on a single-node system to create multiple batch queues of the same type and assign them to a generic batch queue. On a cluster, however, where you have multiple processors, you may want to distribute batch processing across the nodes to balance the use of processing resources. You can achieve this workload distribution by assigning local batch queues to one or more clusterwide generic batch queues. These generic batch queues control batch processing over the cluster by placing batch jobs in assigned batch queues that are available.

Figure 4-6 Batch Queue Configuration With Clusterwide Generic Queue (SYS$BATCH)
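In a configuration like the one shown in Figure 4-6, a user with no preference about where a job runs can simply submit it to the clusterwide generic queue. The command procedure name NIGHTLY.COM below is hypothetical:

$ SUBMIT/QUEUE=SYS$BATCH NIGHTLY.COM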
However, because cluster nodes boot separately rather than simultaneously, a booting node must start only its own local queues. As a rule, the startup command procedure for each active cluster node must initialize every queue in the cluster, but start only its local queues and any clusterwide generic queues. You should include commands to establish queues in the SYSTARTUP procedure or in a separate command procedure file named, for example, STARTQ.COM that is invoked by your SYSTARTUP procedure. DIGITAL suggests that you set up your STARTQ command procedure(s) as a common file on a shared disk. In this case, the common STARTQ.COM file may reside on the same disk as the job controller queue file. 4.4.1 Starting Queues Using Node-Specific Command Procedures For each node in the cluster, either add node-specific queue commands to the node-specific SYSTARTUP procedure or create a STARTQ command procedure that is invoked by the node-specific SYSTARTUP procedure. Examples 4-1 through Example 4-3 illustrate the use of separate nodespecific command procedures to initialize and start the printer configuration shown in Figure 4-1 and the batch configuration shown in Figure 4-5. Example 4-1 STARTQ Command Procedure for Node JUPITR $ SET NOON $ $ STARTQ Command Procedure for Node JUPITR $ ! $ ! Start job queue manager. $ ! $START/QUEUE/MANAGER WORK1:[CLUSMAN] $ $ Initialize and start local printer queues. $ $ INITIALIZE/QUEUE/ON=JUPITR: :LPAO/START $ INITIALIZE/QUEUE/ON=JUPITR: :LPBO/START JUPITR_LPAO JUPITR_LPBO $ $ ! Initialize remote printer queues. $ ! $ INITIALIZE/QUEUE/ON=SATURN: :LPAO SATURN_LPAO $ INITIALIZE/QUEUE/ON=SATURN: :LPBO SATURN_LPBO $ INITIALIZE/QUEUE/ON=SATURN: :LPCO SATURN_LPCO $ INITIALIZE/QUEUE/ON=URANUS: :URANUS_LPAO URANUS_LPAO $ ! $ ! Initialize and start clusterwide generic printer queue. $ ! $ INITIALIZE/QUEUE/GENERIC=(JUPITR_LPAO,SATURN_LPAO,URANUS_LPAO /START SYS$PRINT Example 4-1 Cont'd. on next page 4-9 Setting Up and Managing Cluster Queues 4.4 Command Procedures for Establishing Queues Example 4-1 (Cont.) STARTQ Command Procedure for Node JUPITR $ ! $ ! Initialize batch queues on local node. $ ! $ INITIALIZE/QUEUE/BATCH/ON=JUPITR: :/START $ INITIALIZE/QUEUE/BATCH/ON=JUPITR: :/START JUPITR_BATCH JUPITR_TEXT $ ! $ ! Initialize queues from other nodes. $ ! $ INITIALIZE/QUEUE/BATCH/ON=SATURN:: $ INITIALIZE/QUEUE/BATCH/ON=SATURN:: $ INITIALIZE/QUEUE/BATCH/ON=URANUS:: SATURN_BATCH SATURN_TEXT URANUS_BATCH $ ! $ ! Initialize clusterwide generic batch queue. $ ! $ INITIALIZE/QUEUE/BATCH/GENERIC=(JUPITR_BATCH,SATURN_BATCH,URANUS_BATCH)/START SYS$BATCH Example 4-2 STARTQ Command Procedure for Node SATURN $ SET NOON $ $ STARTQ Command Procedure for Node SATURN $ ! $ ! Start job queue manager. $ ! $START/QUEUE/MANAGER WORK1: [CLUSMAN] $ $ ! Initialize and start local printer queues. $ ! $ INITIALIZE/QUEUE/ON=SATURN: :LPAO/START SATURN_LPAO $ INITIALIZE/QUEUE/ON=SATURN: :LPBO/START SATURN_LPBO $ INITIALIZE/QUEUE/ON=SATURN: :LPCO/START SATURN_LPCO $ $ Initialize remaining printer queues. $ $ INITIALIZE/QUEUE/ON=JUPITR: :LPAO JUPITR_LPAO $ INITIALIZE/QUEUE/ON=JUPITR: :LPBO JUPITR_LPBO $ INITIALIZE/QUEUE/ON=URANUS: :URANUS_LPAO URANUS_LPAO Example 4-2 Cont'd. on next page 4-10 Setting Up and Managing Cluster Queues 4.4 Command Procedures for Establishing Queues Example 4-2 (Cont.) STARTQ Command Procedure for Node SATURN $ ! Initialize and start clusterwide generic printer queue. $ ! 
$ INITIALIZE/QUEUE/GENERIC=(JUPITR_LPAO,SATURN_LPAO,- URANUS_LPAO)/START SYS$PRINT $ $ Initialize batch queues on local node. $ $ INITIALIZE/QUEUE/BATCH/ON=SATURN: :/START SATURN_BATCH $ INITIALIZE/QUEUE/BATCH/ON=SATURN::/START SATURN_TEXT $ $ ! Initialize queues from other nodes. $ ! $ INITIALIZE/QUEUE/BATCH/ON=JUPITR:: JUPITR_BATCH $ INITIALIZE/QUEUE/BATCH/ON=JUPITR:: JUPITR_TEXT $ INITIALIZE/QUEUE/BATCH/ON=URANUS:: URANUS_BATCH $ $ ! Initialize clusterwide generic batch queue. $ ! $ INITIALIZE/QUEUE/BATCH/GENERIC=(JUPITR_BATCH,SATURN_BATCH,- URANUS_BATCH) SYS$BATCH Example 4-3 STARTQ Command Procedure for Node URANUS $ SET NOON $ $ STARTQ Command Procedure for Node URANUS $ ! $ ! Start job queue manager. $ ! $ START/QUEUE/MANAGER WORK1: [CLUSMAN] $ $ Initialize and start local printer queue. $ $ INITIALIZE/QUEUE/ON=URANUS: :LPAO/START URANUS_PRINT $ $ Initialize remaining printer queues. $ $ INITIALIZE/QUEUE/ON=JUPITR: :LPAO $ INITIALIZE/QUEUE/ON=JUPITR: :LPBO JUPITR_LPAO JUPITR_LPBO $ INITIALIZE/QUEUE/ON=SATURN: :LPAO SATURN_LPAO $ INITIALIZE/QUEUE/ON=SATURN: :LPBO SATURN_LPBO $ INITIALIZE/QUEUE/ON=SATURN: :LPCO SATURN_LPCO $ $ ! Initialize and start clusterwide generic printer queue. $ ! $ INITIALIZE/QUEUE/GENERIC=(JUPITR_LPAO,SATURN_LPAO,- URANUS_LPAO)/START SYS$PRINT $ $ ! Initialize batch queues on local node. $ ! $ INITIALIZE/QUEUE/BATCH/ON=URANUS: :/START URANUS_BATCH Example 4-3 Cont'd. on next page 4-11 Setting Up and Managing Cluster Queues 4.4 Command Procedures for Establishing Queues Example 4-3 (Cont.) STARTQ Command Procedure for Node URANUS $ $ $ Initialize queues from other nodes. $ INITIALIZE/QUEUE/BATCH/ON=JUPITR:: $ INITIALIZE/QUEUE/BATCH/ON=JUPITR:: $ INITIALIZE/QUEUE/BATCH/ON=SATURN:: $ INITIALIZE/QUEUE/BATCH/ON=SATURN:: $ ! $ ! JUPITR_BATCH JUPITR_TEXT SATURN_BATCH SATURN_TEXT Initialize clusterwide generic batch queue. $ ! $ INITIALIZE/QUEUE/BATCH/GENERIC=(JUPITR_BATCH,SATURN_BATCH,URANUS_BATCH) SYS$BATCH In Examples 4-1 through 4-3, each command procedure performs the following operations for the specific node: 4.4.2 • Starts the system job queue manager • Specifies the location of the job controller queue file • Initializes and starts each local queue on the local node • Initializes all other queues from other nodes • Initializes and starts the clusterwide generic printer queue SYS$PRINT • Initializes and starts the clusterwide generic batch queue SYS$BATCH Starting Queues Using a Common Command Procedure You can create a common command procedure, named for example, STARTQ.COM, and store it on a shared disk. Using this method, each node can share the same copy of the common STARTQ.COM procedure. Each node invokes the common STARTQ.COM procedure from the common version of SYSTARTUP. You can also include the commands to set up queues in the common SYSTARTUP file instead of in a separate STARTQ.COM file. Example 4-4 illustrates the use of a common STARTQ command procedure on a shared disk to initialize and start the printer queues shown in Figure 4-1. 4-12 Setting Up and Managing Cluster Queues 4.4 Command Procedures for Establishing Queues Example 4-4 Starting Queues Using a Common Command Procedure $ ! $ ! Compute the name of the executing node. $ ! $ NODE = F$GETSYI( 11 NODENAME 11 ) $ ! $ JUPITR_START = 11 /NOSTART" $ SATURN_START = "/NOSTART" $ URANUS_START = 11 /NOSTART" $ $ ! Redefine one of the previous symbols. $ ! $ 'NODE'_START = "/START" $ ! $ SET NOON $ ! $ ! Start up the job controller. $ ! 
$ START/QUEUE/MANAGER WORK!: [CLUSMAN] $ $ $ Set up printer queues. Initialize all nodes. Start local node only. $ $ INITIALIZE/QUEUE/ON=JUPITR: :LPAO 'JUPITR_START' $ INITIALIZE/QUEUE/ON=JUPITR: :LPBO 'JUPITR_START' JUPITR_LPAO JUPITR_LPBO $ $ INITIALIZE/QUEUE/ON=SATURN: :LPAO 'SATURN_START' $ INITIALIZE/QUEUE/ON=SATURN: :LPBO 'SATURN_START' $ INITIALIZE/QUEUE/ON=SATURN: :LPCO 'SATURN_START' SATURN_LPAO SATURN_LPBO SATURN_LPCO $ $ INITIALIZE/QUEUE/ON=URANUS: :LPAO 'URANUS_START' URANUS_PRINT $ $ ! Set up main batch queues. $ ! $ INITIALIZE/QUEUE/BATCH/ON=JUPITR: :/JOB=6/WSEXTENT=500 'JUPITR_START' JUPITR_BATCH $ $ INITIALIZE/QUEUE/BATCH/ON=SATURN: :/JOB=5/WSEXTENT=600 'SATURN_START' SATURN_BATCH $ $ INITIALIZE/QUEUE/BATCH/ON=URANUS/JOB=6/WSEXTENT=600 'URANUS_START' URANUS_BATCH $ $ ! Set up batch processing queues. $ ! $ INITIALIZE/QUEUE/BATCH/ON=JUPITR: :/JOB=2/WSEXTENT=1500 'JUPITR_START' JUPITR_TEXT $ $ INITIALIZE/QUEUE/BATCH/ON=SATURN: :/JOB=2/WSEXTENT=1500 'SATURN_START' SATURN_TEXT $ $ ! Set up clusterwide generic batch processing queue. $ ! . $ INITIALIZE/QUEUE/BATCH/GENERIC=(JUPITR_BATCH,SATURN_BATCH,URANUS_BATCH) SYS$BATCH 4-13 Setting Up and Managing Cluster Queues 4.4 Command Procedures for Establishing Queues The command procedure in Example 4-4 performs the same queue setup operations as the command procedures shown in Examples 4-1 through 4-3. However, the common STARTQ file in this example executes a common set of commands that function according to the node executing them. A set of conditional symbols are assigned to control whether queues are started. In this way, each node initializes all the queues in the cluster but starts only its own. 4.5 Summary of Commands for Setting Up Cluster Queues Following is a summary of commands used to set up cluster queues. • Start the system job queue manager $ START/QUEUE/MANAGER file-spec • Set up printer queues $ INITIALIZE/QUEUE/ON=node: :device queue-name $ INITIALIZE/QUEUE/ON=node: :device/START queue-name • Set up generic printer queues $ • INITIALIZE/QUEUE/GENERIC=(queue1,queue2 ... )/START queue-name Set up batch queues $ INITIALIZE/QUEUE/BATCH/ON=node:: queue-name $ INITIALIZE/QUEUE/BATCH/ON=node: :/START queue-name • Set up generic batch queues $ INITIALIZE/QUEUE/BATCH/GENERIC=(queue1,queue2 ... )/START queue-name 4-14 5 Setting Up and Managing Cluster Disks In any VAXcluster configuration, there are two types of disk and tape devices: • Restricted-access devices, which are accessible only by the local node or nodes to which they are directly connected. • Cluster-accessible devices, which are accessible by any node in the cluster. A disk or magnetic tape device connected to an HSC is by design a clusteraccessible device. Any other disk device, such as a MASSBUS, UNIBUS, or BI disk, is a restricted-access device, unless you explicitly set it up as a cluster-accessible device. As system manager, you are responsible for planning, organizing, and setting up the proper cluster device configuration for your site. You must decide which disk devices should have access restricted to the local node, and which should be accessible to the cluster. For example, you may want to restrict access to a particular disk to the users on the node directly connected to the device. Or, you may decide to set up a disk as a cluster-accessible device, so that any user on any cluster node can allocate and use it. Once you have planned your configuration strategy, you can use the procedures outlined in this chapter to set up and manage cluster disks. 
Topics include the following: 5.1 • Cluster-accessible disks • Cluster device-naming conventicns • Shared disk volumes • Setting up cluster devices • Volume shadowing in mixed-interconnect clusters Cluster-Accessible Disks A cluster-accessible disk is a disk that every node in the cluster can recognize and access. The following types of disks are cluster accessible: • HSC disks • MSCP-served disks • Dual-pathed disks Figure 5-1 illustrates how disks might be configured in a typical CI-only cluster. The HSC disks and the dual-ported MSCP-served local disk are considered cluster accessible. 5-1 Setting Up and Managing Cluster Disks 5.1 Cluster-Accessible Disks Figure 5-1 Cl-Only Configuration With Shared Disks HSC DISKS 5.1.1 HSC Disks An HSC disk is a DIGITAL Storage Architecture (DSA) disk that is connected to an HSC. If an HSC is connected in a cluster, its disks are automatically accessible by ~ny node in the cluster. You can also set up HSC disks to be dual pathed between two HSCs. Dual-pathed disks are described in Section 5.1.3. 5.1.2 MSCP-Served Disks MSCP is the protocol used to communicate between a VAX host and a DSA controller. The MSCP Server enables a VAX processor to make locally connected disks such as MASSBUS, UNIBUS, or BI disks available to all other cluster members. Unlike HSC devices, controllers for locally connected disks are not automatically cluster accessible. Access to these devices is restricted to the local node unless you explicitly set them up as cluster accessible, using the MSCP Server. To make a disk accessible to all cluster nodes, the MSCP Server must be loaded on the local node, and it must be instructed to make the disk available clusterwide. These functions are enabled with the SYSGEN parameters MSCP_LOAD and MSCP_SERVE_ALL. By specifying appropriate values for these parameters in a node's MODPARAMS.DAT file, and then running AUTOGEN to reboot the node, you enable the node to serve all suitable disks to the cluster early in the boot sequence. (You can also use the CLUSTER_ CONFIG.COM CHANGE function to perform these operations.) The served disks thus become accessible with minimal interruption whenever the serving node reboots. Further, the MSCP Server automatically serves any suitable disk that is added to the system later. For example, if new drives are attached 5-2 Setting Up and Managing Cluster Disks 5.1 Cluster-Accessible Disks to an HSC controller, the disks become available within seconds after the cables are connected. Table 5-1 shows the values you can specify for the parameters to configure the MSCP Server. Initial values are determined by your responses when you execute the VMS installation or upgrade procedure, or when you execute the CLUSTER_CONFIG.COM command procedure described in Chapter 3 to set up your configuration. Note that if you later change the values, you must reboot the system on which the values are changed, before the new values can take effect (see Section 3.2.3). Table 5-1 Specifying Values for MSCP_LOAD and MSCP_SERVE_ ALL Parameters Parameter Value Function MSCP_LOAD 0 Do not load the MSCP Server (default value). Load the MSCP Server with attributes specified by MSCP_SERVE_ALL parameter. MSCP_SERVE_ALL 5.1.3 0 Do not serve any disks (default value). 1 Serve all available disks. 2 Serve only locally-connected (non-HSC) disks. Dual-Pathed Disks A dual-pathed disk is a dual-ported disk that is accessible to all the nodes in the cluster, not just to the nodes that are physically connected to the disk. 
Dual-pathed disks can be any of the following: • Dual-ported HSC disks • Dual-ported DSA disks using UDA/KDA/BDA controllers • Dual-ported MASSBUS disks The term dual-pathed refers to the two paths through which cluster nodes can access a disk to which they are not directly connected. If one path fails, the disk is accessed over the other path. (Note that with a dual-ported MASSBUS disk, a node directly connected to the disk always accesses it locally.) 5.1.3.1 Dual-Ported HSC Disks By design, HSC disks are cluster accessible. Therefore, if they are dual ported, they are automatically dual pathed. CI-connected cluster nodes can access a dual-pathed HSC disk by way of a path through either HSC connected to the device. For each dual-ported HSC disk, you can control failover to a specific port using the port select buttons on the front of each drive. By pressing either port select button (A or B) on a particular drive, you can cause the device to fail over to the specified port. With the port select buttons, you can select alternate ports to balance the disk controller workload between two HSCs. For example, you could set half of your disks to use Port A and set the other half to use Port B. The port select buttons also enable you to fail over all the disks to an alternate port manually when you anticipate the shutdown of one of the HSCs. 5-3 Setting Up and Managing Cluster Disks 5.1 Cluster-Accessible Disks 5.1.3.2 Dual-Ported DSA Disks A dual-ported DSA disk be failed over between the two VAX systems that serve it to the cluster. However, because a DSA disk can be online to only one controller at a time, only one of the systems can use its local connection to the disk. The second system accesses the disk through the MSCP Server. If the system that is currently serving the disk fails, the other system detects the failure and fails the disk over to its local connection. The disk is thereby made available to the cluster once more. 5.1.3.3 Dual-Ported MASSBUS Disks In clusters with only two active nodes, a dual-ported MASSBUS disk is considered cluster accessible if it is connected between the two nodes, and if it has the same device name on both nodes. The Distributed File System synchronizes access to files on the disk. To set up a dual-ported MASSBUS disk in a two-node cluster, enter the DCL command SET DEVICE in the following format before mounting the disk: $ SET DEVICE/DUAL_PORT device-name Note: A MASSBUS disk may be used either as a dual-ported disk or as a system disk, but not both. In clusters with more than two active nodes, you can set up a dual-ported MASSBUS disk to be cluster accessible through the MSCP Server on either or both nodes to which the disk is connected. Be sure, however, not to use the SYSGEN commands AUTOCONFIGURE or CONFIGURE to configure a dual-ported MASSBUS disk that is already available on the system through the MSCP Server. Establishing a local connection to the disk when a remote path is already known creates two uncoordinated paths to the same disk. Use of these two paths can corrupt files and data on any disk mounted on the drive. If the local path to the disk is not found during the system bootstrap procedure, the MSCP Server path from the remote node is the only available access to the drive. The local path is not found during a boot if any of the following conditions exist: • The port select switch for the drive is not enabled for the local node. • The disk, cable, or adapter hardware for the local path is broken. 
• There is sufficient activity on the other port to "mask" the existence of the port.
• The system is booted in such a way that the SYSGEN command AUTOCONFIGURE ALL in the site-independent startup procedure (SYS$SYSTEM:STARTUP.COM) was not executed.

Use of the disk is still possible through the MSCP Server path.

Caution: Under these conditions, do not attempt to add the local path back into the system I/O database using the SYSGEN commands AUTOCONFIGURE or CONFIGURE. SYSGEN is currently unable to detect the presence of the disk's MSCP path and would incorrectly build a second set of data structures to describe it. Subsequent events could lead to incompatible and uncoordinated file operations, which might corrupt the volume. To recover the local path to the disk, you must reboot the system connected to that local path.

Note that if the disk is not dual ported or is never MSCP served on the remote host, this restriction does not apply.

5.2 Cluster Device-Naming Conventions

To manage cluster devices properly, you must understand the conventions used to identify them. Every cluster device is identified by a unique name, which provides a reliable way to access it in the cluster. Devices that are local to a cluster node can be accessed by that node through the traditional device name (for example, DJA1) or through a cluster device name in the format node$device (for example, JUPITR$DJA1).

However, a device that is dual pathed between two nodes must be identified by a unique, path-independent name that includes an allocation class. The allocation class is a numeric value from 0 to 255 that is used to create a device name in the following format:

$allocation-class$device-name

For example, the allocation class device name $1$DJA16 identifies a disk that is dual ported between two nodes (VAX or HSC) that both have an allocation class value of 1. Each time a node that is not directly connected to such a disk tries to access the disk, the choice of which path to take is made arbitrarily, because no particular path to the disk is ever guaranteed. Because the access path is chosen without regard to the names of the nodes (VAX or HSC) serving the disk, an allocation class device name is required to identify the disk uniquely.

5.2.1 Rules for Specifying Allocation Class Values

Allocation classes play an important role in determining strategies for configuring and naming disks. In fact, the VMS operating system uses allocation class values above all other available information when determining the configuration of cluster devices. The following rules apply for specifying allocation class values:

• VAX or HSC nodes connecting a dual-pathed disk must have the same non-zero allocation class value.
• All cluster-accessible disks on nodes with a non-zero allocation class value must have unique names. For example, if two VAX nodes have the same allocation class value, it is invalid for both nodes to have a disk named DJA0. This restriction also applies to HSCs.
• Single-ported disks with an allocation class value of zero can have the same unit number on different cluster nodes.

Note that 0 is the default allocation class value. Any node in a CI-only cluster that is not connected to a dual-pathed disk should be assigned this value.
In a mixed-interconnect cluster, however, all of the following must have a non-zero allocation class value:

• HSCs
• Systems serving HSC disks
• Systems connected to dual-pathed disks

Failure to set allocation class values correctly may cause both disk corruption and locking conflicts that can suspend normal cluster operations.

To assign an allocation class value to a VAX node that supports dual-pathed devices, specify the value with the SYSGEN parameter ALLOCLASS. To assign an allocation class for an HSC, use the HSC console to enter a command in the following format, where n is the allocation class value:

SET ALLOCATE DISK n

For complete information on HSC console commands, refer to the HSC hardware documentation.

5.2.2 Sample Configurations with Named Devices

Figures 5-2 and 5-3 show how cluster device names are specified for the following:

• Dual-pathed HSC disks
• Dual-pathed DSA disks

Figure 5-4 shows how device names are typically specified in a mixed-interconnect cluster. This figure also shows relevant SYSGEN parameter settings in MODPARAMS.DAT.

A typical configuration with a dual-pathed HSC disk is illustrated in Figure 5-2. Note that the allocation class value (1) is the same on all nodes, and that the disk's device name ($1$DJA17) is constructed using that value. VAX nodes JUPITR and SATURN can access the disk through either of the HSCs VOYGR1 or VOYGR2.

Figure 5-2 Configuration with a Dual-Pathed HSC Disk

Figure 5-3 shows a configuration with a dual-pathed DSA disk.

Figure 5-3 Configuration with a Dual-Pathed DSA Disk

Nodes URANUS and NEPTUN can access the disk either locally or through the other node's MSCP Server. When satellite node ARIEL accesses the disk, however, it arbitrarily chooses a path through either URANUS or NEPTUN. If ARIEL tries to access the disk by using the node-specific device name URANUS$DJA8, and this disk is not currently accessible through URANUS, access will fail. But if ARIEL uses the allocation class device name $1$DJA8, it can access the disk through NEPTUN. As a general rule, you should always use a path-independent, allocation class device name to identify dual-pathed cluster disks.

Figure 5-4 illustrates the use of device names in a mixed-interconnect cluster.

Figure 5-4 Device Names in a Mixed-Interconnect Cluster

In this configuration, a set of disks is dual-pathed to the HSC controllers named VOYGR1 and VOYGR2, and these controllers are connected to VAX processor JUPITR. Because ALLOCLASS is set to the same value (1) on JUPITR and on both HSCs, JUPITR can serve the disks on VOYGR1 and VOYGR2 to all satellite nodes in the cluster.

Disks on the HSCs have allocation class names of the form $1$ddcu. For example, the disk DUA17 is named $1$DUA17. On CI-connected nodes, VMS software would also recognize the disk as JUPITR$DUA17 and as either VOYGR1$DUA17 or VOYGR2$DUA17. On satellites, it would recognize the disk as JUPITR$DUA17 or as $1$DUA17.
This example shows why you should always use an allocation class name like $1$DUA17 when configuring cluster devices: the allocation class name is the only name that all cluster nodes recognize at all times.

Note that, for optimal availability, two or more CI-connected VAX processors should serve HSC disks to the cluster. For example, because MSCP_SERVE_ALL is set to 1 on nodes JUPITR, SATURN, and URANUS, and because ALLOCLASS is set to the same value on those nodes and on the HSCs, JUPITR, SATURN, and URANUS can serve disks on the HSCs. But because MSCP_SERVE_ALL is set to 2 on node NEPTUN, that node can serve only its local disks.

5.3 Shared Disks

A shared disk is a disk that is mounted on a cluster-accessible device by one or more nodes in the cluster. Shared disks play a key role in common-environment clusters, because when you place system files or command procedures on a shared disk, cluster nodes can share a single copy of each common file (see Chapter 2). Note, however, that a shared disk is a single point of failure for data access by the nodes sharing the disk.

To mount cluster-accessible disks that are to be shared among all cluster nodes, specify the same MOUNT command on each node, or specify the MOUNT command with the /CLUSTER qualifier on one node. When you execute MOUNT/CLUSTER on one node, the disk is mounted on every node in the cluster at the time the command executes. Note that only system or group disks can be mounted clusterwide. Thus, if you specify MOUNT/CLUSTER without the /SYSTEM or /GROUP qualifier, /SYSTEM is assumed. Also note that each cluster disk mounted with the /SYSTEM, /GROUP, or /SHARED qualifier must have a unique volume label.

If you want to mount a shared disk on some but not all the nodes in the cluster, execute the same MOUNT command (without the /CLUSTER qualifier) on each node sharing the disk.

For example, suppose you want all the nodes in a three-node cluster to share a disk named COMPANYDOCS. To share the disk, each of the three nodes could execute identical MOUNT commands, or one of the three nodes could mount COMPANYDOCS using the MOUNT/CLUSTER command, as follows:

$ MOUNT/SYSTEM/CLUSTER/NOASSIST $1$DUA4: COMPANYDOCS

If you want just two of the three nodes to share the disk, those two nodes must both mount the disk with the same MOUNT command. For example:

$ MOUNT/SYSTEM/NOASSIST $1$DUA4: COMPANYDOCS

To mount the disk at startup time, include the MOUNT command either in a common command procedure that is invoked at startup time or in the node-specific startup command procedure.

5.4 Setting Up Cluster Devices

To implement your plans for configuring cluster disks, you can create command procedures to set up and mount them. You may want to include commands that set up and mount cluster disks in a separate command procedure file that is invoked by a site-specific SYSTARTUP procedure. Depending on your cluster environment, you can set up your command procedure in either of the following ways:

• As a separate file specific to each node in the cluster
• As a common node-independent file

You can set up the common procedure as a shared file on a shared disk, or you can make duplicate copies of the common procedure and store them as separate files. With either method, each node can invoke the common procedure from the site-specific SYSTARTUP procedure.
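The following is a minimal sketch of such a common procedure. The file name CLUSTER_MOUNT.COM, the device $1$DUA5:, and the volume label TOOLSDISK are assumptions used only for illustration; substitute the devices and labels planned for your configuration.

$! CLUSTER_MOUNT.COM -- sample common procedure to mount shared cluster disks.
$! Each node invokes it from its site-specific SYSTARTUP procedure, for example:
$!     $ @SYS$MANAGER:CLUSTER_MOUNT.COM
$ SET NOON                              ! do not abort if a volume is already mounted
$! Mount a shared data disk clusterwide, using its allocation class name
$ MOUNT/SYSTEM/CLUSTER/NOASSIST $1$DUA4: COMPANYDOCS
$! Mount a disk shared only by the nodes that execute this command
$ MOUNT/SYSTEM/NOASSIST $1$DUA5: TOOLSDISK
$ EXIT

Because the procedure is node independent, the same file can be stored on a shared disk and invoked unchanged by every node.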
The MSCPMOUNT.COM example in the SYS$EXAMPLES directory on your system shows a sample common command procedure used to mount cluster disks.

5.5 Volume Shadowing in Mixed-Interconnect Clusters

If shadowing is to be used anywhere in a mixed-interconnect cluster, all CI-connected VAX nodes must have the SYSGEN parameter SHADOWING set to 1. This setting causes them to use the shadowing driver, DSDRIVER. The MSCP Server serves the shadow set virtual unit to the satellites. Example 5-1 shows how the shadow set appears when you enter the DCL command SHOW DEVICES D on a boot server.

Example 5-1 Shadow Set as Seen from Boot Server

Device Name          Device Status    Error  Volume    Free    Trans  Mnt
                                      Count  Label     Blocks  Count  Cnt
$1$DUA111: (VOYGR1)  ShadowSetMember  0      (member of $1$DUS111:)
$1$DUA151: (VOYGR1)  ShadowSetMember  0      (member of $1$DUS111:)
$1$DUS111: (VOYGR1)  Mounted          0      VMS08JUL  244688  118    21

Satellites must have the SHADOWING parameter set to 0. This setting causes them to use the non-shadowing driver, DUDRIVER. Satellites access the shadow set by mounting the virtual unit, and they can see the virtual unit through the MSCP Server. The shadow set appears to have the same characteristics as any other disk device, as shown in Example 5-2. However, while satellites can see shadow set member units, they cannot access them individually.

Example 5-2 Shadow Set as Seen from Satellite

Device Name          Device Status    Error  Volume    Free    Trans  Mnt
                                      Count  Label     Blocks  Count  Cnt
$1$DUA111: (SATURN)  Online           0      (remote shadow member)
$1$DUA151: (SATURN)  Online           0      (remote shadow member)
$1$DUS111: (SATURN)  Mounted          0      VMS08JUL  244688  121    21

In mixed-interconnect clusters, it is recommended that at least two boot servers serve the shadow set, so that if one server should fail, another is available to keep the shadow set intact. For complete information on volume shadowing, see the VAX Volume Shadowing Manual.

5.5.1 Mounting Shadow Sets

Satellites have no knowledge of shadow set configuration, and they cannot issue any shadow set maintenance commands using the /SHADOW qualifier. All commands that create, modify, and dissolve shadow sets must be entered on a CI-connected node. For example, you must enter a command like the following on a CI-connected node:

$ MOUNT/SYSTEM $1$DUS111:/SHADOW=($1$DUA111,$1$DUA151) VMS08JUL

When a shadow set virtual unit is created by a MOUNT command on a CI-connected node, the MSCP Server automatically serves the virtual unit to other CI-connected nodes. A MOUNT/SYSTEM command entered on a CI-connected node forms the shadow set on that node. Once the shadow set is formed, you can use the MOUNT/CLUSTER command to mount it on all CI-connected nodes and satellites.

For example, to mount clusterwide the shadow set shown in Example 5-1, you must enter two commands. First, enter the following command on any CI-connected node:

$ MOUNT/SYSTEM $1$DUS111:/SHADOW=($1$DUA111,$1$DUA151) VMS08JUL

This command creates the virtual unit, forms the shadow set, and mounts it on the CI-connected node. The virtual unit is automatically served after it is created. Next, enter the following command:

$ MOUNT/CLUSTER $1$DUS111: /SHADOW=VMS08JUL

This command mounts the shadow set on the remaining CI-connected nodes and on satellites.

5.5.2 Dismounting Shadow Sets

Be careful when dismounting shadow sets.
The shadow set virtual unit must always be dismounted on all satellites before being dismounted (and possibly dissolved) on the CI-connected VAX nodes. If these nodes dismount the shadow set before the satellites do, the shadow set will be dissolved. The satellites will then have the virtual unit mounted but will have no path (through a CI-connected node) to the member units. The satellites will therefore place the virtual unit in mount verification. This condition can result in suspended operations and require a cluster reboot, because satellites may hold locks that must be released before the CI-connected node can rebuild the shadow set.

If this condition occurs, you can remount the shadow set on a CI-connected serving node. When that node reforms the shadow set, the satellites can once again access the volume, provided that the CI-connected node has been able to rebuild the shadow set. In general, you should use the command DISMOUNT/SYSTEM, rather than DISMOUNT/CLUSTER, to dismount shadow sets in mixed-interconnect clusters.

5.5.3 Using Shadow Sets as Satellite System Disks

A satellite system disk can be a shadow set. The system device parameter in the DECnet database for satellites must be the device name of the shadow set virtual unit (for example, $1$DUS111). No description of shadow set member units is needed.

A Cluster SYSGEN Parameters

For systems to boot properly into a cluster, certain system parameters must be set on each cluster node. Table A-1 lists SYSGEN parameters used in cluster configurations.

Table A-1 Cluster SYSGEN Parameters

Parameter         Description

ALLOCLASS         Specifies a numeric value from 0 to 255 to be assigned as the allocation class for the node. The default value is 0.

DISK_QUORUM       The name, in ASCII, of an optional quorum disk. ASCII spaces indicate that no quorum disk is being used. DISK_QUORUM must be defined on one or more cluster nodes capable of having a direct (non-MSCP served) connection to the disk. These nodes are called quorum disk watchers. The remaining nodes (nodes with a blank value for DISK_QUORUM) recognize the name defined by the first watcher node with which they communicate.

EXPECTED_VOTES    Specifies a setting that is used to derive the initial quorum value. This setting is the sum of all VOTES held by potential cluster members. By default, the value is 1. The connection manager sets a quorum value to a number that will prevent cluster partitioning (see Section 1.5). To calculate quorum, the system uses the following formula: estimated quorum = (EXPECTED_VOTES + 2)/2

MSCP_LOAD         Controls whether the MSCP Server is loaded. Specify 1 to load the server. By default, the value is set to zero, and the server is not loaded.

MSCP_SERVE_ALL    Specifies MSCP disk serving functions when the MSCP Server is loaded. The default value of zero specifies that no disks are served. A value of 1 specifies that all available disks are served. A value of 2 specifies that only locally connected (non-HSC) disks are served.

NISCS_CONV_BOOT   Specifies whether conversational bootstraps are enabled on the node. The default value of zero specifies that conversational bootstraps are disabled. A value of 1 enables conversational bootstraps.

NISCS_LOAD_PEA0   Specifies whether the VAXport driver PEDRIVER is to be loaded to enable cluster communications over the Ethernet. The default value of zero specifies that the driver is not loaded.
A value of 1 specifies that the driver is loaded.

NISCS_PORT_SERV   Specifies whether data checking is enabled for the node. The default value of zero specifies that data checking is disabled.

QDSKVOTES         Specifies the number of votes contributed to the cluster votes total by a quorum disk. The maximum is 127, the minimum is 0, and the default is 1. This parameter is used only when DISK_QUORUM is defined.

QDSKINTERVAL      Specifies the disk quorum polling interval, in seconds. The maximum value is 32767, the minimum value is 1, and the default is 10. Lower values trade increased overhead cost for greater responsiveness. DIGITAL recommends that this parameter be set to the same value on each cluster node.

RECNXINTERVAL     Specifies, in seconds, the interval during which the connection manager attempts to reconnect a broken connection to another VMS system. If a new connection cannot be established during this period, the connection is declared irrevocably broken, and either this system or the other must leave the cluster. This parameter trades faster response to certain types of system failures against the ability to survive transient faults of increasing duration. DIGITAL recommends that this parameter be set to the same value on each cluster node.

VAXCLUSTER        Controls whether the system should join or form a cluster. This parameter accepts the following three values:
                  • 0 - Specifies that the system will not participate in a cluster.
                  • 1 - Specifies that the system should participate in a cluster if hardware supporting SCS is present (CI, UDA, HSC50).
                  • 2 - Specifies that the system should participate in a cluster.
                  You should always set this parameter to 2 on systems intended to run in a cluster, 0 on systems that boot from a UDA and are not intended to be part of a cluster, and 1 (the default) otherwise.

VOTES             Specifies the number of votes towards a quorum to be contributed by the node. By default, the value is 1.

SCS Parameters

PANUMPOLL         Specifies the number of ports to poll at each interval. DIGITAL recommends that this parameter be set to the same value on each cluster node.

PASTIMOUT         Specifies the interval at which the CI port driver performs time-based bookkeeping operations. This interval is also the period after which a start handshake datagram is assumed to have timed out. Normally the default value is adequate. DIGITAL recommends that this parameter be set to the same value on each cluster node.

PASTDGBUF         Specifies the number of datagram receive buffers to queue for the CI port driver's configuration poller; that is, the maximum number of start handshakes that can be in progress simultaneously. Normally the default value is adequate. DIGITAL recommends that this parameter be set to the same value on each cluster node.

PAMAXPORT         Specifies the maximum number of CI ports the CI port driver polls for a broken port-to-port virtual circuit or a failed remote node. You can decrease this parameter in order to reduce polling activity if the hardware configuration has fewer than 16 ports. For example, if the configuration has a total of five ports assigned port numbers 0 through 4, you should set PAMAXPORT to 4. The default for this parameter is 15 (poll for all possible ports 0 through 15). DIGITAL recommends that this parameter be set to the same value on each cluster node.
PANOPOLL          Disables CI polling for ports if set to 1. (The default is 0.) When PANOPOLL is set, a system will not promptly discover that another system has shut down or powered down, and will not discover a new system that has booted. This parameter is useful when you want to bring up a system detached from the rest of the cluster for checkout purposes. It is roughly equivalent to uncabling the system from the star coupler. PANOPOLL = 0 is the normal setting and is required if you are booting from an HSC.

PAPOLLINTERVAL    Specifies, in seconds, the polling interval the computer interconnect (CI) port driver uses to poll for a newly booted system, a broken port-to-port virtual circuit, or a failed remote node. This parameter trades polling overhead against quick response to virtual circuit failures. DIGITAL recommends that you use the default value for this parameter and that it be set to the same value on each cluster node.

PAPOOLINTERVAL    Specifies, in seconds, the interval at which the PA port driver checks for available nonpaged pool after a failure to allocate.

PASANITY          Controls whether the port sanity timer is enabled to permit remote systems to detect a system that has been halted or retained at IPL 7 for a prolonged period. This parameter is normally set to 1 and should be set to 0 only when debugging with XDELTA. Normally the default value is adequate. PASANITY is a dynamic parameter (altered the next time the port is initialized) and has a default value of 1.

PRCPOLINTERVAL    Specifies, in seconds, the polling interval used to look for SCS applications, such as the connection manager and MSCP disks, on other nodes. Each node is polled, at most, once each interval. This parameter trades polling overhead against quick recognition of new systems or servers as they appear. DIGITAL recommends that you set this parameter to 15, which is the default.

SCSBUFFCNT        Specifies the number of computer interconnect (CI) buffer descriptors configured for all CI ports on the system.

SCSCONNCNT        Specifies the total number of SCS connections that are configured for use by all system applications. Normally, the default value is adequate.

SCSMAXMSG         Specifies the SCS maximum sequenced message size. Normally, the default value is adequate.

SCSMAXDG          Specifies the maximum number of bytes of application data in one datagram. Normally, the default value is adequate.

SCSFLOWCUSH       Specifies the lower limit for receive buffers at which point SCS starts to notify the remote SCS of new receive buffers. For each connection, SCS tracks the number of receive buffers available. SCS communicates this number to the SCS at the remote end of the connection. However, SCS does not need to do this for each new receive buffer added. Instead, SCS notifies the remote SCS of new receive buffers if the number of receive buffers falls as low as the SCSFLOWCUSH value. Normally the default value is adequate.

SCSSYSTEMID       Specifies the low-order 32 bits of the 48-bit system identification number. This parameter is not dynamic and must be the same as the DECnet node number (1024 * DECnet area + DECnet node number).

SCSSYSTEMIDH      Specifies the high-order 16 bits of the 48-bit system identification number. This parameter must be set to 0. It is reserved by DIGITAL for future use.

SCSNODE           Specifies the SCS system name. This parameter is not dynamic.
You should use a name that is the same as the DECnet node name (limited to six characters), since the name must be unique among all systems in the cluster. Note that once a node has been recognized by another node in the cluster, you cannot change the SCSSYSTEMID or SCSNODE parameter without changing both.

SCSRESPCNT        Specifies the total number of response descriptor table entries configured for use by all system applications.

B Building a Common SYSUAF.DAT File from Node-Specific Files

This appendix provides guidelines for building a common user authorization file from node-specific files. For more detailed information on how to set up a node-specific authorization file, see the descriptions in the VMS Authorize Utility Manual and in the Guide to Setting Up a VMS System.

To build a common SYSUAF.DAT file, proceed as follows:

1 Print a listing of SYSUAF.DAT on each node. To print this listing, invoke AUTHORIZE and specify the AUTHORIZE command LIST as follows:

$ SET DEF SYS$SYSTEM
$ RUN AUTHORIZE
UAF> LIST/FULL [*,*]

2 Use the listings to compare the accounts from each node. On the listings, mark down any necessary changes. One such change is to delete any accounts that you no longer need.

You should also make sure that each user account in the cluster has a unique UIC. For example, node VENUS of the cluster may have a user account JONES that has the same UIC as user account SMITH on node MARS. When nodes VENUS and MARS are joined to form a cluster, accounts JONES and SMITH will exist in the cluster environment with the same UIC. If the UICs of these accounts are not differentiated, each user will have the same access rights to various objects in the cluster. In this case you should assign each account a unique UIC.

Make sure that accounts that perform the same type of work have the same group UIC. Accounts in a single-system environment probably follow this convention. However, there may be groups of users on each node that will perform the same work in the cluster but have group UICs unique to their local node. As a rule, the group UIC for any given work category should be the same on each node in the cluster. For example, data entry accounts on node VENUS should have the same group UIC as data entry accounts on node MARS and node RED.

Note that if you change the UIC for a particular user, you should also change the owner UICs for that user's existing files and directories. You can use the DCL commands SET FILE and SET DIRECTORY to make these changes. These commands are described in detail in the VMS DCL Dictionary.

3 Choose the SYSUAF.DAT from one of the nodes to be a master SYSUAF.DAT.

4 Merge the SYSUAF.DAT files from the other nodes into the master SYSUAF.DAT by running the Convert Utility (CONVERT) on the node that owns the master SYSUAF.DAT. (See the VMS Convert and Convert/Reclaim Utility Manual for a description of CONVERT.) To use CONVERT to merge the files, each SYSUAF.DAT file must be accessible to the node that is running CONVERT.

To merge the UAFs into the master SYSUAF.DAT file, specify the CONVERT command in the following format:

$ CONVERT SYSUAF1,SYSUAF2,...SYSUAFn MASTER_SYSUAF

Note that if a given username appears in more than one source file, only the first occurrence of that name will appear in the merged file.
The command in the following example adds the SYSUAF.DAT files from two cluster nodes to the master SYSUAF.DAT in the current default directory:

$ SET DEFAULT SYS$SYSTEM
$ CONVERT [SYS1.SYSEXE]SYSUAF.DAT,[SYS2.SYSEXE]SYSUAF.DAT SYSUAF.DAT

The CONVERT command in this example adds the records from the files [SYS1.SYSEXE]SYSUAF.DAT and [SYS2.SYSEXE]SYSUAF.DAT to the file SYSUAF.DAT on the local node. After you run CONVERT, you are left with a master SYSUAF.DAT that contains records from the other SYSUAF.DAT files.

5 Use AUTHORIZE to modify the accounts in the master SYSUAF.DAT according to the changes you marked on the initial listings of the SYSUAF.DAT files from each node.

C VAXcluster Troubleshooting Information

This appendix contains information to help you perform troubleshooting operations for the following:

• Failures of nodes to boot or to join the cluster
• Cluster hangs
• CLUEXIT bugchecks
• VAXport device problems

C.1 Diagnosing Failures of Nodes to Boot or to Join the Cluster

Before you initiate diagnostic procedures, be sure to verify that these conditions are met:

• All cluster hardware components are correctly connected and checked for proper operation.
• Cluster nodes and mass storage devices are configured according to requirements specified in the VAXcluster Software Product Description (SPD) document.

When attempting to add a new or recently repaired CI-connected node to the cluster, you must verify that the CI cables are correctly connected, as described in Section C.4.2.2.

When attempting to add a satellite node to a local area or mixed-interconnect cluster, you must verify that the Ethernet is configured according to requirements specified in the VAXcluster SPD document, and that the machine's memory resources and Ethernet adapter device meet the requirements specified in that document. You must also verify that you have correctly configured and started the DECnet-VAX network, following the procedures described in Section 2.3.

If, after performing preliminary checks and taking appropriate corrective action, you find that a node still fails to boot or to join the cluster, you can follow the procedures in Sections C.1.2 through C.1.4 to attempt recovery.

C.1.1 Summary of Events for Nodes Booting and Joining the Cluster

To perform diagnostic and recovery procedures effectively, you must understand the events that occur when a node boots and attempts to join the cluster. This section outlines those events and shows typical messages displayed at the console.

Note that events vary, depending on whether a node is the first node to boot in a new cluster or whether it is booting in an active cluster. Note further that some events (such as loading the cluster security database) occur only in local area and mixed-interconnect clusters.

The normal sequence of events is as follows:

1 The node boots. If the node is a satellite, a message like the following shows the name and Ethernet address of the boot server that has downline loaded the satellite:

%VAXcluster-I-SYSLOAD, system loaded from node X...
(XX-XX-XX-XX-XX-XX)

For any booting node, the VMS "banner message" is displayed in the following format:

VAX/VMS Version n.n DD-MMM-YYYY hh:mm.ss

2 The node attempts to form or join the cluster, and the following message appears:

waiting to form or join a VAXcluster system

If the node is a member of a local area or mixed-interconnect cluster, the cluster security database is loaded. Optionally, the MSCP Server may be loaded:

%VAXcluster-I-LOADSECDB, loading the cluster security database
%MSCPLOAD-I-LOADMSCP, loading the MSCP disk server

3 If the node discovers a cluster, the node attempts to join. If a cluster is found, the Connection Manager displays one or more messages in the following format:

%CNXMAN, Sending VAXcluster membership request to system X...

Otherwise, the Connection Manager forms the cluster when it has enough votes to establish quorum (that is, when enough voting nodes have booted).

4 As the booting node joins the cluster, the Connection Manager displays a message in the following format:

%CNXMAN, now a VAXcluster member -- system X...

Note that if quorum is lost while the node is booting, or if a node is unable to join the cluster within two minutes of booting, the Connection Manager displays messages like the following:

%CNXMAN, Discovered system X...
%CNXMAN, Deleting CSB for system X...
%CNXMAN, Established "connection" to quorum disk
%CNXMAN, Have connection to system X...
%CNXMAN, Have "connection" to quorum disk

The last two messages show any connections that have already been formed. If the cluster includes a quorum disk, you may also see messages like the following:

%CNXMAN, Using remote access method for quorum disk
%CNXMAN, Using local access method for quorum disk

The first message indicates that the Connection Manager is unable to access the quorum disk directly, either because the disk is unavailable or because it is accessed through the MSCP Server. Another node in the cluster that can access the disk directly must verify that a reliable connection to the disk exists.

The second message indicates that the Connection Manager can access the quorum disk directly and can supply information about the status of the disk to nodes that cannot access the disk directly. Note that the Connection Manager may not see the quorum disk initially, because the disk may not yet be configured. In that case, the Connection Manager first uses remote access, then switches to local access.

5 Once the node has joined the cluster, normal startup procedures execute. One of the first functions is to start the OPCOM process:

%%%%%%%%%%%  OPCOM  15-APR-1988 16:33:55.33  %%%%%%%%%%%
Logfile has been initialized by operator _X...$OPA0:
Logfile is SYS$SYSROOT:[SYSMGR]OPERATOR.LOG;17

%%%%%%%%%%%  OPCOM  15-APR-1988 16:33:56.43  %%%%%%%%%%%
16:32:32.93 Node X... (csid 0002000E) is now a VAXcluster member

When other nodes join the cluster, OPCOM displays messages like the following:

%%%%%%%%%%%  OPCOM  15-APR-1988 16:34:25.23  %%%%%%%%%%%
(from node X... at 16:34:25.23)
16:34:24.42 Node X... (csid 000100F3) received VAXcluster membership request from node X...

As startup procedures continue, various messages report startup events.

Note: For troubleshooting purposes, you may want to include in your site-specific startup procedures messages announcing each phase of the startup process, for example, mounting disks or starting queues.
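As a minimal sketch, such phase announcements might look like the following lines in a site-specific startup procedure. The message text and the procedure name CLUSTER_MOUNT.COM are assumptions used only for illustration.

$! Announce startup phases on the console to aid troubleshooting
$ WRITE SYS$OUTPUT "''F$TIME()'  Mounting cluster disks..."
$ @SYS$MANAGER:CLUSTER_MOUNT.COM     ! hypothetical disk-mounting procedure
$ WRITE SYS$OUTPUT "''F$TIME()'  Starting queues..."
$ START/QUEUE/MANAGER
$ WRITE SYS$OUTPUT "''F$TIME()'  Site-specific startup complete"

If a node then hangs during startup, the last message displayed on the console indicates which phase did not complete.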
C.1.2 CI-Connected Node Fails to Boot

If a CI-connected node fails to boot, perform the following checks:

• Verify that the node's SCSNODE and SCSSYSTEMID parameters are unique in the cluster. If they are not, you must either alter both values or reboot all other nodes.

• Verify that you are using the correct bootstrap command file. This file must specify the internal bus node number (if applicable), the HSC node number, and the HSC disk from which the node is to boot. Refer to your processor-specific installation and operations guide for information on setting values in default bootstrap command procedures.

• Verify that the SYSGEN parameter PAMAXPORT is set to a value greater than or equal to the largest CI port number.

• Verify that the HSC is ONLINE. The ONLINE switch on the HSC Operator Control Panel should be depressed.

• Verify that the disk is available. The correct port switches on the disk's operator control panel should be depressed.

• Verify that the node has access to the HSC. The SHOW HOSTS command of the HSC SETSHO Utility displays status for all VAX nodes (hosts) in the cluster. (For complete information on the SETSHO Utility, consult the HSC hardware documentation.) If the node in question appears in the display as DISABLED, use the SETSHO Utility to set the node to the ENABLED state.

• Verify that the HSC allows access to the boot disk. Invoke the SETSHO Utility to ensure that the boot disk is available to the HSC. The utility's SHOW DISKS command displays the current state of all disks visible to the HSC and displays all disks in the no-host-access table. If the boot disk appears in the no-host-access table, use the SETSHO Utility to set the boot disk to host-access. If the boot disk is AVAILABLE or MOUNTED and host-access ENABLED, but does not appear in the no-host-access table, contact your Field Service representative and explain both the problem and the steps you have taken.

C.1.3 Satellite Node Fails to Boot

To boot successfully, a satellite must communicate with a boot server over the Ethernet. You can use DECnet event logging to verify this communication. Proceed as follows:

1 Log in as system manager on the boot server.

2 If event logging for management layer events is not already enabled, enter the following NCP commands to enable it:

NCP> SET LOGGING MONITOR EVENT 0.*
NCP> SET LOGGING MONITOR STATE ON

3 Enter the following DCL command:

$ REPLY/ENABLE=NETWORK

This command enables the terminal to receive DECnet messages reporting downline load events.

4 Boot the satellite. If the satellite and the boot server can communicate, and if all boot parameters are correctly set, messages like the following are displayed at the boot server's terminal:

DECnet event 0.3, automatic line service
From node 2.4 (URANUS), 15-APR-1988 09:42:15.12
Circuit QNA-0, Load, Requested, Node = 2.42 (OBERON)
File = SYS$SYSDEVICE:<SYS10.>, Operating system
Ethernet address = 08-00-2B-07-AC-03

DECnet event 0.3, automatic line service
From node 2.4 (URANUS), 15-APR-1988 09:42:16.76
Circuit QNA-0, Load, Successful, Node = 2.42 (ARIEL)
File = SYS$SYSDEVICE:<SYS11.>, Operating system
Ethernet address = 08-00-2B-07-AC-13

If the satellite cannot communicate with the boot server, no message for that satellite appears. There may be a problem with an Ethernet cable connection or adapter service.
If the satellite's data in the DECnet database is incorrectly specified (for example, if the hardware address is incorrect), a message like the following displays the correct address and indicates that a load was requested:

DECnet event 0.7, aborted service request
From node 2.4 (URANUS), 15-APR-1988 09:42:09.67
Circuit QNA-0, Line open error, Ethernet address = 08-00-2B-03-29-99

Note the absence of the node name, node address, and system root.

If a satellite fails to boot, perform the following checks:

• Verify that the boot device is available. This check is particularly important for local area and mixed-interconnect clusters in which satellites boot from multiple system disks.

• Verify that the satellite's SCSNODE and SCSSYSTEMID values and its DECnet node name and address are unique in the cluster.

• Verify that the DECnet-VAX network is up and running.

• Verify that circuit service is enabled for the boot server's Ethernet adapter device. Invoke the NCP Utility and enter an NCP command in the following format, where circuit-id is the name of the Ethernet adapter circuit that the boot server uses to service downline load requests from satellites:

NCP> SHOW CIRCUIT circuit-id

If service is not enabled, you can enter NCP commands like the following to enable it:

NCP> SET CIRCUIT circuit-id STATE OFF
NCP> DEFINE CIRCUIT circuit-id SERVICE ENABLED
NCP> SET CIRCUIT circuit-id SERVICE ENABLED STATE ON

The DEFINE command updates the permanent database and ensures that service is enabled the next time you start the network. Note that DECnet traffic will be interrupted while the circuit is off.

• Verify that you have specified the correct Ethernet hardware address for the satellite. Proceed as follows:

1 Enter an NCP command in the following format on the boot server, specifying the satellite's node name:

NCP> SHOW NODE X... CHARACTERISTICS

The system displays data like the following:

Node Volatile Characteristics as of 15-APR-1988 13:15:28

Remote node = 2.41 (ARIEL)

Hardware address      = 08-00-2B-03-27-95
Tertiary loader       = SYS$SYSTEM:TERTIARY_VMB.EXE
Load Assist Agent     = SYS$SHARE:NISCS_LAA.EXE
Load Assist Parameter = DISK$VAXVMSRL5:<SYS12.>

2 At the satellite's console prompt (>>>), enter the commands shown in Table 3-1 to display the satellite's current Ethernet hardware address.

3 Compare the hardware address values displayed by NCP and at the satellite's console. The values should be identical and should also match the value shown in the file SYS$MANAGER:NETNODE_UPDATE.COM. If the values do not match, you must make appropriate adjustments. For example, if you have recently replaced the satellite's Ethernet adapter device, you must execute CLUSTER_CONFIG's CHANGE function to update the network database and NETNODE_UPDATE.COM on the appropriate boot server.

• Verify that the satellite's load assist parameter specifies the correct device and root directory name and that the satellite's root is unique in the cluster. If changes are needed, you can use CLUSTER_CONFIG.COM to remove the satellite and then add it again with correct values.

C.1.4 Node Fails to Join the Cluster

If a node boots but fails to join the cluster, proceed as follows:

• Verify that VAXcluster software has been loaded. Look for Connection Manager (%CNXMAN) messages like those shown in Section C.1.1.
If no such messages are displayed, it is likely that VAXcluster software was not loaded at boot time. Reboot the node in conversational mode. At the SYSBOOT> prompt, set the VAXCLUSTER parameter to 2. (In local area or mixed-interconnect clusters, you must also set NISCS_LOAD_PEA0 to 1.) Note that these parameters should also be set in the node's MODPARAMS.DAT file. For more information on booting a node in conversational mode, consult your processor-specific installation and operations guide.

• In local area and mixed-interconnect clusters, verify that the cluster security database file (SYS$COMMON:CLUSTER_AUTHORIZE.DAT) exists and that you have specified the correct group number for this cluster.

• Verify that the node has booted from the correct disk and system root. If %CNXMAN messages are displayed, and if after the conversational reboot the node still does not join the cluster, check the console output on all active cluster nodes and look for messages indicating that one or more nodes found a remote system that conflicted with a known or local system. Such messages suggest that two nodes have booted from the same system root. Review the boot command files for all CI-connected nodes and ensure that all are booting from the correct disks and from unique system roots.

If you find it necessary to modify the node's bootstrap command procedure (console media), you may be able to do so on another processor that is already running in the cluster. Replace the running processor's console media with the media to be modified, and use the Exchange Utility and a text editor to make the required changes. Consult the appropriate processor-specific installation and operations guide for information on examining and editing boot command files.

• Verify that the node's SCSNODE and SCSSYSTEMID parameters are unique in the cluster. To be eligible to join a cluster, a node must have unique SCSNODE and SCSSYSTEMID parameter values. Check that the current values do not duplicate any values set for existing cluster nodes. Note that if you discover that one or the other value is not unique, you must alter both values or reboot all other cluster nodes. To check or modify values, you can perform a conversational bootstrap operation. However, for reliable future bootstrap operations, you must specify appropriate values for these parameters in the node's MODPARAMS.DAT file.

C.1.5 Startup Procedures Fail to Complete

If a node boots and joins the cluster but appears to hang before startup procedures complete (that is, before you are able to log in to the system), be sure that you have allowed sufficient time for the startup procedures to execute. If the startup procedures fail to complete after a period that is normal for your site, try to access the procedures from another cluster node and make appropriate adjustments. For example, verify that all required devices are configured and available.

One potential cause of such a failure is the lack of some system resource, such as NPAGEDYN or page file space. If you suspect that the value for the NPAGEDYN parameter is set too low, you can perform a conversational bootstrap operation to increase it. Use SYSBOOT to check the current value, and then double the value. If this procedure is unsuccessful, double the value once more.
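The following SYSBOOT dialogue is a minimal sketch of that adjustment. The value 800000 is only a placeholder for roughly double whatever value SHOW displays on your node.

SYSBOOT> SHOW NPAGEDYN
SYSBOOT> SET NPAGEDYN 800000
SYSBOOT> CONTINUE

SHOW displays the current value, SET supplies the increased value, and CONTINUE resumes the bootstrap. Because a conversational change is not permanent, also record the final value in the node's MODPARAMS.DAT file so that it is preserved across future AUTOGEN runs.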
If you suspect a shortage of page file space, and if another cluster node is available, you can log in on that node and use the System Generation Utility (SYSGEN) to provide adequate page file space for the problem node. (Note that insufficient page file space on the booting node may cause other nodes to hang.) If the node still cannot complete the startup procedures, contact your DIGITAL Field Service representative.

C.2 Diagnosing Cluster Hangs

Conditions like the following can cause a VAXcluster member system to suspend process or system activity (that is, to hang):

• Cluster quorum is lost.
• A shared cluster resource is inaccessible.

Sections C.2.1 and C.2.2 discuss these conditions.

C.2.1 Cluster Quorum Is Lost

The VAXcluster quorum scheme coordinates activity among cluster member systems and ensures the integrity of shared cluster resources. (The quorum scheme is described fully in Section 1.5.1.) Quorum is checked after any change to the cluster configuration, for example, when a voting node leaves or joins the cluster. If quorum is lost, process creation and I/O activity on all nodes in the cluster are blocked.

Information about the loss of quorum and clusterwide events that cause loss of quorum are sent to the OPCOM process, which broadcasts messages to designated operator terminals. The information is also broadcast to each cluster node's operator console (OPA0), unless broadcast activity is explicitly disabled on that terminal. Because, however, quorum may be lost before OPCOM has been able to inform the operator terminals, the messages sent to OPA0 are the most reliable source of information about events that may cause loss of quorum. If quorum is lost, you can follow instructions in Section 3.4.4 to recover.

C.2.2 A Shared Cluster Resource Is Inaccessible

Access to shared cluster resources is coordinated by the Distributed Lock Manager. If a particular process is granted a lock on a resource (for example, a shared data file), other processes in the cluster that request incompatible locks on that resource must wait until the original lock is released. If the original process retains its lock for an extended period, other processes waiting for the lock to be released may appear to hang.

Occasionally a system activity must acquire a restrictive lock on a resource for an extended period. For example, to perform a volume rebuild, system software takes out an exclusive lock on the volume being rebuilt. While this lock is held, no processes can allocate space on the disk volume. If they attempt to do so, they may appear to hang.

Access to files that contain data necessary for the operation of the system itself is coordinated by the Distributed Lock Manager. For this reason, a process that acquires a lock on one of these resources and is then unable to proceed may cause the cluster to appear to hang. For example, this condition may occur if a process locks a portion of the system authorization file (SYS$SYSTEM:SYSUAF.DAT) for write access. Any activity that requires access to that portion of the file, such as logging in to an account with the same or similar username or sending mail to that username, will be blocked until the original lock is released. Normally this lock would be released quickly, and users would not notice the locking operation. However, if the process holding the lock is itself unable to proceed, other processes could enter a wait state.
Because the authorization file is used during login and for most process creation operations (for example, batch and network jobs), blocked processes could rapidly accumulate in the cluster. Because the Distributed Lock Manager is functioning normally under these conditions, users are not notified by broadcast messages or other means that a problem has occurred.

C.3 Diagnosing CLUEXIT Bugchecks

The VMS operating system performs bugcheck operations only when it detects conditions that could compromise normal system activity or endanger data integrity. A CLUEXIT bugcheck is a type of bugcheck initiated by the Connection Manager, the VAXcluster software component that manages the interaction of cooperating VAXcluster member systems. Most such bugchecks are triggered by conditions resulting from hardware failures (particularly failures in communications paths), configuration errors, or system management errors.

The conditions that most commonly result in CLUEXIT bugchecks are as follows:

• The cluster connection between two nodes is broken for longer than RECNXINTERVAL seconds. Thereafter, the connection is declared irrevocably broken. If the connection is later reestablished, either or both of the nodes shut down with a CLUEXIT bugcheck. This condition can occur upon power failure recovery with battery backup, after the repair of an SCS communication link, or after the node was halted for a period longer than RECNXINTERVAL seconds and was restarted with a CONTINUE command entered at the operator console. You must determine the cause of the interrupted connection and correct the problem. For example, if powerfail recovery takes longer than RECNXINTERVAL seconds, you may want to increase the value of the RECNXINTERVAL parameter on all nodes.

• Cluster partitioning occurs. A member of a cluster discovers or establishes connection to a member of another cluster, or a foreign cluster is detected in the quorum file. In this case, you must review the setting of EXPECTED_VOTES on all nodes.

• The value specified for the SYSGEN parameter SCSMAXMSG on a node is too small. Verify that the value of SCSMAXMSG on all cluster nodes is set to a value that is at least the default value.

C.4 Diagnosing VAXport Device Problems

The following sections present information on the CI and Ethernet VAXport devices. Information is also provided on entries in the system error log and on corrective actions to take when errors occur. Topics include the following:

• VAXport communication mechanisms
• Port failures
• VAXcluster error log entries
• OPA0 error messages

C.4.1 VAXport Communication Mechanisms

This section describes CI and Ethernet port communication mechanisms and System Communications Services (SCS) connections.

Port Polling

Shortly after a CI-connected system boots, the CI port driver (PADRIVER) begins configuration polling to discover other active ports on the CI. Normally the poller runs every five seconds (the default value of the SYSGEN parameter PAPOLLINTERVAL). In the first polling pass, all addresses are probed over cable path A; on the second pass, all addresses are probed over path B; on the third pass, path A is probed again, and so on. The poller probes by sending request id (REQID) packets to all possible port numbers, including itself. Active ports receiving the REQIDs return id packets (IDREC) to the port issuing the REQID. A port may respond to a REQID even if the system attached to the port is not running.
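If you want to confirm the polling-related parameter values that a node will use, a quick check from SYSGEN might look like the following sketch. USE CURRENT selects the values stored on the current system image; omit it to examine the active values in memory. The values displayed depend on the node.

$ RUN SYS$SYSTEM:SYSGEN
SYSGEN> USE CURRENT
SYSGEN> SHOW PAPOLLINTERVAL
SYSGEN> SHOW PAMAXPORT
SYSGEN> SHOW PANOPOLL
SYSGEN> EXIT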
In any CI-only, local area, or mixed-interconnect cluster, the port drivers perform a start handshake when a pair of ports and port drivers has successfully exchanged id packets. The port drivers exchange datagrams containing information about the systems, such as the type of CPU and the operating system version. If this exchange is successful, each system declares a virtual circuit open. An open virtual circuit is prerequisite to all other activity.

Ethernet Communications

In local area and mixed-interconnect clusters, a multicast scheme is used to locate cluster nodes on the Ethernet. Every three seconds the Port Emulator driver (PEDRIVER) sends HELLO messages to a cluster-specific multicast address that is derived from the cluster group number. The driver also enables the reception of these messages from other nodes. When the driver receives a HELLO message from a node with which it does not currently share an open virtual circuit, it attempts to create a circuit. HELLO messages received from a node with a currently open virtual circuit indicate that the remote node is operational.

A standard three-message exchange handshake is used to create a virtual circuit. The handshake messages contain information about the transmitting node and its record of the cluster password. These parameters are verified at the receiving system, which continues the handshake only if its verification is successful. Thus, each node authenticates the other. After the final message, the virtual circuit is opened for use by both nodes.

System Communications Services (SCS) Connections

System services such as the disk class driver, the VAXcluster Connection Manager, and the MSCP Server communicate between nodes with a protocol called System Communications Services (SCS). Primarily, SCS is responsible for the formation and breaking of intersystem process connections and for flow control of message traffic over those connections. In VMS Version 5.0, SCS is implemented in the VAXport driver (for example, PADRIVER, PBDRIVER, PEDRIVER) and in a loadable piece of the system called SCSLOA.EXE (loaded automatically during system initialization).

When a virtual circuit has been opened, a VMS system periodically probes a remote node for system services that the remote system may be offering. The SCS directory service, which makes known the services that a node is offering, is always present on both VMS and HSC systems. As system services discover their counterparts on other systems, they establish SCS connections to each other. These connections are full duplex and are associated with a particular virtual circuit. Multiple connections are typically associated with a virtual circuit.

C.4.2 Port Failures

Taken together, SCS, the VAXport drivers, and the port itself support a hierarchy of communications paths. Working up from the most fundamental level, these are as follows:

• The physical wires. The Ethernet is a single coaxial cable. The CI has two pairs of transmit and receive cables (Path A transmit and receive and Path B transmit and receive). For the CI, VMS software normally sends traffic in automatic path select mode. The port chooses the free path or, if both are free, an arbitrary path (implemented in the cables and Star Coupler, and managed by the port).

• The virtual circuit (implemented partly in the CI port or Ethernet Port Emulator driver (PEDRIVER) and partly in SCS software).
• The SCS connections (implemented in system software).

Failures can occur at each communications level and in each component. Failures at one level translate into failures at other levels as follows:

• Wires. If the Ethernet fails or is disconnected, Ethernet traffic stops or is interrupted, depending on the nature of the failure. For the CI, either Path A or B can fail while the virtual circuit remains intact. All traffic is directed over the remaining good path. When the wire is repaired, the repair is detected automatically by port polling, and normal operations resume on all ports.

• Virtual circuit. If no path works between a pair of ports, the virtual circuit fails and is closed. A path failure is discovered as follows: for the CI, when polling fails, or when attempts are made to send normal traffic and the port reports that neither path yielded transmit success; for the Ethernet, when no multicast HELLO message or incoming traffic is received from another node. When a virtual circuit fails, every SCS connection on it fails. The software automatically reestablishes connections when the virtual circuit is reestablished. Normally, reestablishing a virtual circuit takes several seconds after the problem is corrected.

• CI port. If a port fails, all virtual circuits to that port fail, and all SCS connections on those virtual circuits fail. If the port is successfully reinitialized, virtual circuits and connections are reestablished automatically. Normally, port reinitialization and reestablishment of connections take several seconds.

• Ethernet adapter. If an Ethernet adapter device fails, attempts are made to restart it. If repeated attempts fail, all virtual circuits time out, and their connections are broken.

• SCS connection. When the software protocols fail or, in some instances, when the software detects a hardware malfunction, a connection is terminated. Other connections are normally unaffected, as is the virtual circuit. Breaking of connections is also used under certain conditions as an error recovery mechanism, most commonly when there is insufficient nonpaged pool available on the system.

• System. If a system fails because of operator shutdown, bugcheck, or halt and reboot, all other systems in the cluster record the failure as failures of their virtual circuits to the port on the failed system.

C.4.2.1 Verifying CI Port Functions

Before you boot in a cluster a CI-connected system that is new, just repaired, or suspected of having a problem, you should have DIGITAL Field Service verify that the system runs correctly on its own.

To diagnose communication problems, you can invoke the Show Cluster Utility and tailor the SHOW CLUSTER report by entering the SHOW CLUSTER command ADD CIRCUIT CABLE_ST, as sketched after the following list. This command adds a class of information about all the virtual circuits as seen from the system on which you are running SHOW CLUSTER. Primarily, you are checking whether there is a virtual circuit in the OPEN state to the failing system. Common causes of failure to open a virtual circuit and keep it open are the following:

• Port errors on one side or the other
• Cabling errors
• A port set off line because of software problems
• Insufficient nonpaged pool available on both sides
• Failure to set correct values for the SYSGEN parameters SCSNODE, SCSSYSTEMID, PAMAXPORT, PANOPOLL, PASTIMOUT, and PAPOLLINTERVAL
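As a sketch of that check, you might run the utility interactively as follows; the Command> prompt shown for the continuous display is an assumption about your terminal setup, and the exact report format depends on your configuration.

$ SHOW CLUSTER/CONTINUOUS
Command> ADD CIRCUIT CABLE_ST
Command> EXIT

While the display is active, check whether the circuit to the failing system appears in the OPEN state; EXIT returns you to DCL.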
Run SHOW CLUSTER from each active system in the cluster to verify whether each system's view of the failing system is consistent with every other system's view. If all the active systems have a consistent view of the failing system, the problem may be in the failing system. If, on the other hand, only one of several active systems detects that the newcomer is failing, that particular system may be experiencing a problem.

If no virtual circuit is open to the failing system, check the bottom of the SHOW CLUSTER display for information on circuits to the port of the failing system. Virtual circuits in partially open states are shown at the bottom of the display. If the circuit is shown in a state other than OPEN, communications between the local and remote ports are taking place, and the failure is probably at a higher level than in port or cable hardware.

Next, check that both Paths A and B are good to the failing port. The loss of one path should not prevent a system from participating in a cluster.

C.4.2.2 Verifying CI Cable Connections

Whenever the configuration poller finds that no virtual circuits are open and that no handshake procedures are currently opening virtual circuits, the poller analyzes its environment. It does so by using the send-loopback-datagram facility of the CI port. The send-loopback-datagram facility tests the connections between the CI port and the Star Coupler by routing messages across them. The messages are called loopback datagrams. (The port processes other self-directed messages without using the Star Coupler or external cables.)

The configuration poller makes entries in the error log whenever it detects a change in the state of a circuit. Note, however, that it is possible for two changed-to-failed-state messages to be entered in the log without an intervening changed-to-succeeded-state message. Such a series of entries means that the circuit state continues to be faulty.

The following paragraphs discuss various incorrect CI cabling configurations and the entries made in the error log when these configurations exist. Figure C-1 shows a two-node configuration with all cables correctly connected. Figure C-2 shows a CI cluster with a pair of crossed cables.

Figure C-1 A Correctly Connected Two-Node CI Cluster
(Diagram: the local CI port's TA/RA and TB/RB cable pairs run straight through the Star Coupler to the remote CI port.)

Figure C-2 Crossed CI Cable Pair
(Diagram: the same two-node configuration, but with one transmit/receive cable pair crossed between the local CI port and the Star Coupler.)

If a pair of transmitting cables or a pair of receiving cables is crossed, a message sent on TA is received on RB, and a message sent on TB is received on RA. This is a hardware error condition from which the port cannot recover. An entry is made in the error log to say that a single pair of crossed cables exists. The entry contains the following lines:

DATA CABLE(S) CHANGE OF STATE
PATH 1. LOOPBACK HAS GONE FROM GOOD TO BAD

If this situation exists, you can correct it by reconnecting the cables properly. The cables could be misconnected in several places. The coaxial cables that connect the port boards to the bulkhead cable connectors can be crossed, or the external cables can be misconnected at the bulkhead or the Star Coupler.

The information in Figure C-2 can be represented more simply. Configuration 1 shows the cables positioned as in Figure C-2, but it does not show the Star Coupler or the nodes.
The letters LOC and REM indicate the pairs of transmitting (T) and receiving (R) cables on the local and remote nodes, respectively.

Configuration 1
(Diagram: one cable pair between LOC and REM is crossed, as in Figure C-2; straight connections are shown as = and crossed connections as x.)

The pair of crossed cables causes loopback datagrams to fail on the local node, but succeed on the remote node. Crossed pairs of transmitting cables and crossed pairs of receiving cables cause the same behavior.

Note that only an odd number of crossed-cable pairs causes these problems. If an even number of cable pairs is crossed, communications succeed. An error log entry is made in some cases, however, and the contents of the entry depend on which pairs of cables are crossed.

Configuration 2 shows two-node clusters with the combinations of two crossed-cable pairs. These crossed pairs cause the following entry to be made in the error log of the node that has the cables crossed:

DATA CABLE(S) CHANGE OF STATE
CABLES HAVE GONE FROM UNCROSSED TO CROSSED

Loopback datagrams succeed on both nodes, and communications are possible.

Configuration 2
(Diagram: the two arrangements in which both crossed pairs belong to the same node.)

Configuration 3 shows the possible combinations of two pairs of crossed cables that cause loopback datagrams to fail on both nodes in the cluster. Communications can still take place between the nodes. An entry stating that cables are crossed is made in the error log of each node.

Configuration 3
(Diagram: two arrangements of two crossed pairs, one pair crossed at each node.)

Configuration 4 shows the possible combinations of two pairs of crossed cables that cause loopback datagrams to fail on both nodes in the cluster, but allow communications. No entry stating that cables are crossed is made in the error log of either node.

Configuration 4
(Diagram: the remaining arrangements of two crossed pairs, one pair crossed at each node.)

Configuration 5 shows the possible combinations of three pairs of crossed cables. In each case, loopback datagrams fail on the node that has only one crossed pair of cables. Loopback datagrams succeed on the node with both pairs crossed. No communications are possible.

Configuration 5
(Diagram: the four arrangements of three crossed pairs; in each, one node has one crossed pair and the other node has both of its pairs crossed.)

If all four cable pairs between two nodes are crossed, communications succeed, loopback datagrams succeed, and no crossed-cable message entries are made in the error log. Such a condition might be detected by noting error log entries made by a third system in the cluster, but only if the third node has one of the crossed-cable cases described.

C.4.2.3 Repairing CI Cables

This section describes some ways in which DIGITAL Field Service can make repairs on a running system. This information is provided to aid system managers in scheduling repairs.

For cluster software to survive cable-checking activities or cable-replacement activities, you must be sure that either Path A or Path B is intact at all times between each port and every other port in the cluster. You can, for example, remove Path A and Path B in turn from a particular port to the Star Coupler.

To make sure that the configuration poller finds a path that was previously faulty but is now operational, follow these steps:

1 Remove Path B.
2 After the poller has discovered that Path B is faulty, reconnect Path B.
3 Wait two poller intervals, and then enter the DCL command SHOW CLUSTER to make sure that the poller has reestablished Path B. Or, enter the DCL command SHOW CLUSTER/CONTINUOUS followed by the SHOW CLUSTER command ADD CIRCUITS, CABLE_ST, and wait until SHOW CLUSTER tells you that Path B has been reestablished.
4 Remove Path A.
5 After the poller has discovered that Path A is faulty, reconnect Path A.
6 Wait two poller intervals to make sure that the poller has reestablished Path A.
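The check in step 3 can be left running in a second session for the duration of the repair. The following is a minimal sketch of such a session, using the CIRCUITS class and CABLE_ST field named in step 3; it assumes commands are entered at the SHOW CLUSTER command prompt.

    $ SHOW CLUSTER/CONTINUOUS
    Command> ADD CIRCUITS, CABLE_ST

Watch the cable-state information for the circuit to the port under repair; it should show the path as good again within approximately two poller intervals after the cable is reconnected.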
If both paths are lost at the same time, the virtual circuits are lost between the port with the broken cables and all other ports in the cluster. This condition will in turn result in the loss of SCS connections over the broken virtual circuits. However, recovery from this situation is automatic after an interruption in service on the affected node. The length of the interruption varies, but it is usually approximately two poller intervals (or 10 seconds) at the default SYSGEN parameter settings.

C.4.3 Analyzing Error Log Entries for VAXport Devices

To anticipate and avoid potential problems, you must monitor events recorded in the error log. From the total error count, displayed by a DCL command in the format SHOW DEVICE device-name, you can determine whether errors are increasing. If so, you should examine the error log. The DCL command ANALYZE/ERROR_LOG invokes the Error Log Utility to report the contents of an error log file. (For more information on the Error Log Utility, see the VMS Error Log Utility Manual.)

Note that some error log entries are informational only and require no action. For example, if you shut down a system in the cluster, all other active systems that have open virtual circuits between themselves and the system that has been shut down make entries in their error logs. Such systems record up to three errors for the event: Path A received no response; Path B received no response; the virtual circuit is being closed. These messages are normal and reflect the change of state in the circuits to the system that has been shut down.

On the other hand, some error log entries are made for problems that degrade operation, or for nonfatal hardware problems. The VMS operating system might continue to run satisfactorily under these conditions. The purpose of detecting these problems early is to prevent nonfatal problems (such as loss of a single CI path) from becoming serious problems (such as loss of both paths).

C.4.3.1 Error Log Entry Formats

Errors and other events on the CI or Ethernet cause VAXport drivers to enter information in the system error log. The two formats used for error log entries are the device-attention format and the logged-message format. Sections C.4.3.2 and C.4.3.3 describe those formats.

Device-attention entries for the CI record events that, in general, are indicated by the setting of a bit in a hardware register. For the Ethernet, device-attention entries typically record errors on an Ethernet adapter device. Logged-message entries record the receipt of a message packet that contains erroneous data or that signals an error condition.
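To see whether a VAXport device is accumulating errors, and to examine the corresponding entries, you might use a sequence like the following. This is a minimal sketch: the device name PAA0 is illustrative, the /SINCE value is an example, and the error log file shown is the default location.

    $ SHOW DEVICE PAA0                                         ! note the error count
    $ ANALYZE/ERROR_LOG/SINCE=YESTERDAY SYS$ERRORLOG:ERRLOG.SYS

If the error count reported by SHOW DEVICE rises between checks, format the recent entries and compare them with the descriptions in Section C.4.3.4.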
C.4.3.2 Device-Attention Entries

Examples C-1 and C-2 show device-attention entries for the CI and Ethernet, respectively. The left column gives the name of a device register or a memory location. The center column gives the value contained in that register or location, and the right column gives an interpretation of that value.

Example C-1 CI Device-Attention Entry

    **************************** ENTRY 83. ****************************
    ERROR SEQUENCE 10.                      LOGGED ON:  SID 0150400A
    DATE/TIME 15-APR-1988 11:45:27.61       SYS_TYPE 01010000
    DEVICE ATTENTION  KA780
    SCS NODE: MARS

    CI SUB-SYSTEM, MARS$PAA0: - PORT POWER DOWN

    CNF            00800038    ADAPTER IS CI
                               ADAPTER POWER-DOWN
    PMCSR          000000CE    MAINTENANCE TIMER DISABLE
                               MAINTENANCE INTERRUPT ENABLE
                               MAINTENANCE INTERRUPT FLAG
                               PROGRAMMABLE STARTING ADDRESS
                               UNINITIALIZED STATE
    PSR            80000001    RESPONSE QUEUE AVAILABLE
                               MAINTENANCE ERROR
    PFAR           00000000
    PESR           00000000
    PPR            03F80001
    UCB$B_ERTCNT   32          50. RETRIES REMAINING
    UCB$B_ERTMAX   32          50. RETRIES ALLOWABLE
    UCB$L_CHAR     0C450000    SHAREABLE
                               AVAILABLE
                               ERROR LOGGING
                               CAPABLE OF INPUT
                               CAPABLE OF OUTPUT
    UCB$W_STS      0010        ONLINE
    UCB$W_ERRCNT   000B        11. ERRORS THIS UNIT

The following notes describe the parts of the entry, from top to bottom:

1  The first two lines are the entry heading. These lines contain the number of the entry in this error log file, the sequence number of this error, and the identification number (SID) of this system's CPU. Each entry in the log file contains such a heading.

2  The next line contains the date and time, and the system type.

3  The next two lines contain the entry type, the processor type (KA780), and the system's SCS node name.

4  The line CI SUB-SYSTEM, MARS$PAA0: - PORT POWER DOWN contains the name of the subsystem and the device that caused the entry, and the reason for the entry. The CI subsystem's device PAA0 on node MARS was powered down.

The next 15 lines contain the names of hardware registers in the port, their contents, and interpretations of those contents. See the appropriate CI hardware manual for a description of all the CI port registers.

The CI port can recover from many errors, but not all. When an error occurs from which the CI cannot recover, the port notifies the port driver. The port driver logs the error and attempts to reinitialize the port. If the port fails after 50 such initialization attempts, the driver takes it off line, unless the system disk is connected to the failing port or this system is supposed to be a cluster member. If the CI port is required for system disk access or cluster participation and all 50 reinitialization attempts have been used, then the system bugchecks with a CIPORT-type bugcheck. Once a CI port is off line, you can put the port back on line only by rebooting the system.

5  The UCB$B_ERTCNT field contains the number of reinitializations that the port driver can still attempt. The difference between this value and UCB$B_ERTMAX is the number of reinitializations already attempted.

6  The UCB$B_ERTMAX field contains the maximum number of times the port can be reinitialized by the port driver.

7  The UCB$W_ERRCNT field contains the total number of errors that have occurred on this port since it was booted. This total includes both errors that caused reinitialization of the port and errors that did not.
Example C-2 Ethernet Device-Attention Entry

    **************************** ENTRY 80. ****************************
    ERROR SEQUENCE 26.                      LOGGED ON:  SID 08000000
    DATE/TIME 15-APR-1988 11:30:53.07       SYS_TYPE 01010000
    DEVICE ATTENTION  KA630
    SCS NODE: PHOBOS

    NI-SCS SUB-SYSTEM, PHOBOS$PEA0:
    FATAL ERROR DETECTED BY DATALINK

    STATUS1        0000002C
    STATUS2        00000000
    DATALINK UNIT  0001
    DATALINK NAME  41515803 00000000
                   00000000 00000000    DATALINK NAME = XQA1:
    REMOTE NODE    00000000 00000000
                   00000000 00000000
    REMOTE ADDR    00000000 0000
    LOCAL ADDR     000400AA 4C07         ETHERNET ADDR = AA-00-04-00-07-4C
    ERROR CNT      0001                  1. ERROR OCCURRENCES THIS ENTRY
    UCB$W_ERRCNT   0007                  7. ERRORS THIS UNIT

The following notes describe the parts of the entry, from top to bottom:

1  The first two lines are the entry heading. These lines contain the number of the entry in this error log file, the sequence number of this error, and the identification number (SID) of this system's CPU. Each entry in the log file contains such a heading.

2  The next line contains the date and time, and the system type.

3  The next two lines contain the entry type, the processor type (KA630), and the system's SCS node name.

4  This line shows the name of the subsystem and component that caused the entry.

5  This line shows the reason for the entry. The Ethernet driver has shut down the datalink because of a fatal error. The datalink will be restarted automatically if possible.

6  STATUS1 and STATUS2 show the I/O completion status returned by the Ethernet driver. If a message transmit was involved, the status applies to that transmit.

7  DATALINK UNIT shows the unit number of the Ethernet device on which the error occurred.

8  DATALINK NAME is the name of the Ethernet device on which the error occurred.

9  REMOTE NODE is the name of the remote node to which the packet was being sent. If zero, no remote node was available or no packet was associated with the error.

10 REMOTE ADDR is the Ethernet address of the remote node to which the packet was being sent. If zero, no packet was associated with the error.

11 LOCAL ADDR is the Ethernet address of the local node.

12 ERROR CNT. Because some errors can occur at extremely high rates, some error log entries represent more than one occurrence of an error. This field indicates how many. The errors counted occurred in the 3 seconds preceding the time stamp on the entry.

C.4.3.3 Logged-Message Entries

Logged-message entries are made when the CI or Ethernet port receives a response that contains either data that the port driver cannot interpret or an error code in the status field of the response. Example C-3 shows a CI logged-message entry with an error code in the status field PPD$B_STATUS.

Example C-3 CI Logged-Message Entry

    **************************** ENTRY 3. ****************************
    ERROR SEQUENCE 3.                       LOGGED ON:  SID 01188542
    ERL$LOGMESSAGE, 15-APR-1988 13:40:25.13
    KA780  REV #3.  SERIAL #1346.  MFG PLANT 15.

    CI SUB-SYSTEM, MARS$PAA0: DATA CABLE(S) STATE CHANGE -
    PATH #0. WENT FROM GOOD TO BAD

    LOCAL STATION ADDRESS,  000000000002 (HEX)
    LOCAL SYSTEM ID,        000000000001 (HEX)
    REMOTE STATION ADDRESS, 000000000004 (HEX)
    REMOTE SYSTEM ID,       0000000000A9 (HEX)

    UCB$B_ERTCNT   32       50. RETRIES REMAINING
    UCB$B_ERTMAX   32       50. RETRIES ALLOWABLE
    UCB$W_ERRCNT   0001     1. ERRORS THIS UNIT
    PPD$B_PORT     04       REMOTE NODE #4.
    PPD$B_STATUS   A5       FAIL
                            PATH #0., NO RESPONSE
                            PATH #1., "ACK" OR NOT USED
                            NO PATH
    PPD$B_OPC      05       IDREQ
    PPD$B_FLAGS    03       RESPONSE QUEUE BIT
                            SELECT PATH #0.
    "CI" MESSAGE   00000000 00000000 80000004 0000FE15
                   4F503000 00000507 00000000 00000000
                   00000000 00000000 00000000 00000000
                   00000000 00000000 00000000 00000000
                   00000000

The following notes describe the parts of the entry, from top to bottom:

1  The first two lines are the entry heading.
These lines contain the number of the entry in this error log file, the sequence number of the error, and the identification number (SID) of the system's CPU. Each entry in the log file contains a heading.

2  The next line contains the entry type, the date, and the time.

3  The next line contains the processor type (KA780), the hardware revision number of the CPU (REV #3), the serial number of the CPU (SERIAL #1346), and the plant number (15).

4  The line CI SUB-SYSTEM, MARS$PAA0: contains the name of the subsystem and the device that caused the entry.

5  The next line gives the reason for the entry (one or more data cables have changed state), and a more detailed reason for the entry. Path 0, which the port used successfully before, cannot be used now.

Note: ANALYZE/ERROR_LOG uses the notation path 0 and path 1; cable labels use the notation path A (= 0) and path B (= 1).

6  The local and remote station addresses are the port numbers (range 0 through 15) of the local and remote ports. The port numbers are set in hardware switches by Field Service.

7  The local and remote system IDs are the SCS system IDs set by the SYSGEN parameter SCSSYSTEMID for the local and remote VAX systems. For HSCs, the system ID is set with the HSC console.

8  The rest of the entry, which consists of the entry fields that begin with UCB$, gives information on the contents of the unit control block (UCB) for this CI device. The fields that follow, which begin with PPD$, are fields in the message packet that the local port has received.

9  PPD$B_PORT contains the station address of the remote port. In a loopback datagram, however, this field contains the local station address.

10 The PPD$B_STATUS field contains information on the nature of the failure that occurred during the current operation. When the operation completes without error, ERF prints the word NORMAL beside this field; otherwise, ERF decodes the error information contained in PPD$B_STATUS. Here a NO PATH error occurred because of a lack of response on path 0, the selected path.

11 The PPD$B_OPC field contains the code for the operation that the port was attempting when the error occurred. The port was trying to send a request-for-id message.

12 The PPD$B_FLAGS field contains bits that indicate, among other things, the path that was selected for the operation.

13 The "CI" MESSAGE is a hexadecimal listing of bytes 16 through 83 (decimal) of the response (message or datagram). Because responses are of variable length, depending upon the port opcode, bytes 16 through 83 may contain either more or fewer bytes than actually belong to the message. Here the request-for-id contains no information in bytes 16 through 83.

C.4.3.4 Error Log Entry Descriptions

This section describes error log entries for the CI and Ethernet ports. Each entry shown is followed by a brief description of what the associated VAXport driver (PADRIVER, PBDRIVER, PEDRIVER) does, and the suggested action a system manager should take. In cases where Software Performance Reports with crash dumps are requested, it is important to capture the crash dumps as soon as possible after the error. For CI entries, note that path A and path 0 are the same path, and that path B and path 1 are the same path.
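When an entry calls for a Software Performance Report with a crash dump, remember that the dump file can be overwritten by a later system failure, so copy it as soon as the system reboots. The following is a minimal sketch of one way to do this with the System Dump Analyzer; the destination file name is illustrative.

    $ ANALYZE/CRASH_DUMP SYS$SYSTEM:SYSDUMP.DMP
    SDA> COPY SYS$SYSTEM:SAVEDUMP.DMP
    SDA> EXIT

The copy, together with the relevant error log file, can then be submitted with the SPR.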
BIIC FAILURE
Explanation: The VAXport driver attempts to reinitialize the port; after 50 failing attempts, it marks the device off line.
User Action: Call DIGITAL Field Service.

CI PORT TIMEOUT
Explanation: The VAXport driver attempts to reinitialize the port; after 50 failing attempts, it marks the device off line.
User Action: First, increase the SYSGEN parameter PAPOLLINTERVAL. If the problem disappears and you are not running privileged user-written software, submit an SPR. Otherwise, call DIGITAL Field Service.

11/750 CPU MICROCODE NOT ADEQUATE FOR PORT
Explanation: The VAXport driver sets the port off line with no retries attempted. In addition, if this port is needed because the system is booted from an HSC or is participating in a cluster, the system bugchecks with a UCODEREV code bugcheck.
User Action: Read the appropriate section in the current VAXcluster SPD for information on required CPU microcode revisions. Call Field Service if necessary.

PORT MICROCODE REV NOT CURRENT, BUT SUPPORTED
Explanation: The VAXport driver detected that the microcode is not at the current level, but will continue normally. This error is logged as a warning only.
User Action: Contact Field Service when convenient to have the microcode updated.

PORT MICROCODE REV NOT SUPPORTED
Explanation: The VAXport driver sets the port off line without attempting any retries.
User Action: Read the VAXcluster SPD for information on the required CI port microcode revisions. Contact Field Service if necessary.

DATA CABLE(S) STATE CHANGE
CABLES HAVE GONE FROM CROSSED TO UNCROSSED
Explanation: The VAXport driver logs this event.
User Action: No action needed.

DATA CABLE(S) STATE CHANGE
CABLES HAVE GONE FROM UNCROSSED TO CROSSED
Explanation: The VAXport driver logs this event.
User Action: Check for crossed-cable pairs. See Section C.4.2.2.

DATA CABLE(S) STATE CHANGE
PATH 0. WENT FROM BAD TO GOOD
Explanation: The VAXport driver logs this event.
User Action: No action needed.

DATA CABLE(S) STATE CHANGE
PATH 0. WENT FROM GOOD TO BAD
Explanation: The VAXport driver logs this event.
User Action: Check path A cables to see that they are not broken or improperly connected.

DATA CABLE(S) STATE CHANGE
PATH 0. LOOPBACK IS NOW GOOD, UNCROSSED
Explanation: The VAXport driver logs this event.
User Action: No action needed.

DATA CABLE(S) STATE CHANGE
PATH 0. LOOPBACK WENT FROM GOOD TO BAD
Explanation: The VAXport driver logs this event.
User Action: Check for crossed-cable pairs or faulty CI hardware. See Sections C.4.2.1 and C.4.2.2.

DATA CABLE(S) STATE CHANGE
PATH 1. WENT FROM BAD TO GOOD
Explanation: The VAXport driver logs this event.
User Action: No action needed.

DATA CABLE(S) STATE CHANGE
PATH 1. WENT FROM GOOD TO BAD
Explanation: The VAXport driver logs this event.
User Action: Check path B cables to see that they are not broken or improperly connected.

DATA CABLE(S) STATE CHANGE
PATH 1. LOOPBACK IS NOW GOOD, UNCROSSED
Explanation: The VAXport driver logs this event.
User Action: No action needed.

DATA CABLE(S) STATE CHANGE
PATH 1. LOOPBACK WENT FROM GOOD TO BAD
Explanation: The VAXport driver logs this event.
User Action: Check for crossed-cable pairs or faulty CI hardware. See Sections C.4.2.1 and C.4.2.2.

DATAGRAM FREE QUEUE INSERT FAILURE
Explanation: The VAXport driver attempts to reinitialize the port; after 50 failing attempts, it marks the device off line.
User Action: Call DIGITAL Field Service. This error is caused by a failure to obtain access to an interlocked queue.
Possible sources of the problem are CI hardware failures, or memory, SBI (11/780), CMI (11/750), or BI (8200, 8300, and 8800) contention.

DATAGRAM FREE QUEUE REMOVE FAILURE
Explanation: The VAXport driver attempts to reinitialize the port; after 50 failing attempts, it marks the device off line.
User Action: Call DIGITAL Field Service. This error is caused by a failure to obtain access to an interlocked queue. Possible sources of the problem are CI hardware failures, or memory, SBI (11/780), CMI (11/750), or BI (8200, 8300, and 8800) contention.

FAILED TO LOCATE PORT MICRO-CODE IMAGE
Explanation: The VAXport driver marks the device off line and makes no retries.
User Action: Make sure the console volume contains the microcode file CI780.BIN (for the CI780, CI750, or CIBCI) or the microcode file CIBCA.BIN (for the CIBCA-AA), and then reboot the system.

HIGH PRIORITY COMMAND QUEUE INSERT FAILURE
Explanation: The VAXport driver attempts to reinitialize the port; after 50 failing attempts, it marks the device off line.
User Action: Call DIGITAL Field Service. This error is caused by a failure to obtain access to an interlocked queue. Possible sources of the problem are CI hardware failures, or memory, SBI (11/780), CMI (11/750), or BI (8200, 8300, and 8800) contention.

MSCP ERROR LOGGING DATAGRAM RECEIVED
Explanation: On receipt of an error message from the HSC, the VAXport driver logs the error and takes no other action. This logged information is a duplicate of the messages logged on the HSC console.
User Action: It is recommended that you disable the sending of HSC informational error log datagrams with the appropriate HSC console command, because these datagrams take considerable space in the error log data file. They are useful to read only if they are not captured on the HSC console for some reason (for example, if the HSC console ran out of paper).

INAPPROPRIATE "SCA" CONTROL MESSAGE
Explanation: The VAXport driver closes the port-to-port virtual circuit to the remote port.
User Action: Submit a Software Performance Report to DIGITAL, including the error logs and the crash dumps from the local and remote systems.

INSUFFICIENT NON-PAGED POOL FOR INITIALIZATION
Explanation: The VAXport driver marks the device off line and makes no retries.
User Action: Reboot the system with a larger value for NPAGEDYN or NPAGEVIR.

LOW PRIORITY CMD QUEUE INSERT FAILURE
Explanation: The VAXport driver attempts to reinitialize the port; after 50 failing attempts, it marks the device off line.
User Action: Call DIGITAL Field Service. This error is caused by a failure to obtain access to an interlocked queue. Possible sources of the problem are CI hardware failures, or memory, SBI (11/780), CMI (11/750), or BI (8200, 8300, and 8800) contention.

MESSAGE FREE QUEUE INSERT FAILURE
Explanation: The VAXport driver attempts to reinitialize the port; after 50 failing attempts, it marks the device off line.
User Action: Call DIGITAL Field Service. This error is caused by a failure to obtain access to an interlocked queue. Possible sources of the problem are CI hardware failures, or memory, SBI (11/780), CMI (11/750), or BI (8200, 8300, and 8800) contention.

MESSAGE FREE QUEUE REMOVE FAILURE
Explanation: The VAXport driver attempts to reinitialize the port; after 50 failing attempts, it marks the device off line.
User Action: Call DIGITAL Field Service. This error is caused by a failure to obtain access to an interlocked queue.
Possible sources of the problem are CI hardware failures, or memory, SBI (11/780), CMI (11/750), or BI (8200, 8300, and 8800) contention.

MICRO-CODE VERIFICATION ERROR
Explanation: The VAXport driver detected an error while reading the microcode that it just loaded into the port. The driver attempts to reinitialize the port; after 50 failing attempts, it marks the device off line.
User Action: Call DIGITAL Field Service.

NO PATH-BLOCK DURING "VIRTUAL CIRCUIT" CLOSE
Explanation: The VAXport driver attempts to reinitialize the port; after 50 failing attempts, it marks the device off line.
User Action: Submit a Software Performance Report to DIGITAL including the error log and a crash dump from the local system.

NO TRANSITION FROM UNINITIALIZED TO DISABLED
Explanation: The VAXport driver attempts to reinitialize the port; after 50 failing attempts, it marks the device off line.
User Action: Call DIGITAL Field Service.

PORT ERROR BIT(S) SET
Explanation: The VAXport driver attempts to reinitialize the port; after 50 failing attempts, it marks the device off line.
User Action: For CI microcode version 7 or later, a maintenance timer expiration bit may mean that the PASTIMOUT SYSGEN parameter is set too low, especially if the local node is running privileged user-written software. For all other bits, call DIGITAL Field Service.

PORT HAS CLOSED "VIRTUAL CIRCUIT"
Explanation: The VAXport driver closes the virtual circuit that the local CI port opened to the remote port.
User Action: Check the PPD$B_STATUS field of the error log entry for the reason the virtual circuit was closed. This error is normal if the remote system crashed or was shut down.

PORT POWER DOWN
Explanation: The VAXport driver halts port operations and then waits for power to return to the port hardware.
User Action: Restore power to the port hardware.

PORT POWER UP
Explanation: The VAXport driver reinitializes the port and restarts port operations.
User Action: No action needed.

RECEIVED "CONNECT" WITHOUT PATH-BLOCK
Explanation: The VAXport driver attempts to reinitialize the port; after 50 failing attempts, it marks the device off line.
User Action: Submit a Software Performance Report to DIGITAL including the error log and a crash dump from the local system.

REMOTE SYSTEM CONFLICTS WITH KNOWN SYSTEM
Explanation: The configuration poller discovered a remote system with SCSSYSTEMID and/or SCSNODE equal to that of another system to which a virtual circuit is already open.
User Action: Shut the new system down as soon as possible and reboot it with a unique SCSSYSTEMID and SCSNODE. Do not leave the new system up any longer than necessary. If you are running a cluster, and two systems with conflicting identity are polling when any other virtual circuit failure takes place in the cluster, then systems in the cluster may crash with a CLUEXIT bugcheck.

RESPONSE QUEUE REMOVE FAILURE
Explanation: The VAXport driver attempts to reinitialize the port; after 50 failing attempts, it marks the device off line.
User Action: Call DIGITAL Field Service. This error is caused by a failure to obtain access to an interlocked queue. Possible sources of the problem are CI hardware failures, or memory, SBI (11/780), CMI (11/750), or BI (8200, 8300, and 8800) contention.
SCSSYSTEMID MUST BE SET TO NON-ZERO VALUE
Explanation: The VAXport driver sets the port off line without attempting any retries.
User Action: Reboot the system with a conversational boot and set SCSSYSTEMID to the correct value. At the same time, check that SCSNODE has been set to the correct nonblank value.

SOFTWARE IS CLOSING "VIRTUAL CIRCUIT"
Explanation: The VAXport driver closes the virtual circuit to the remote port.
User Action: Check error log entries for the cause of the virtual circuit closure. Faulty transmission or reception on both paths, for example, causes this error and may be detected from the one or two previous error log entries noting bad paths to this remote node.

SOFTWARE SHUTTING DOWN PORT
Explanation: The VAXport driver attempts to reinitialize the port; after 50 failing attempts, it marks the device off line.
User Action: Check other error log entries for the possible cause of the port reinitialization failure.

UNEXPECTED INTERRUPT
Explanation: The VAXport driver attempts to reinitialize the port; after 50 failing attempts, it marks the device off line.
User Action: Call DIGITAL Field Service.

UNRECOGNIZED "SCA" PACKET
Explanation: The VAXport driver closes the virtual circuit to the remote port. If the virtual circuit is already closed, the port driver inhibits datagram reception from the remote port.
User Action: Submit a Software Performance Report to DIGITAL, including the error log file that contains this entry and the crash dumps from both the local and remote systems.

VIRTUAL CIRCUIT TIMEOUT
Explanation: The VAXport driver closes the virtual circuit that the local CI port opened to the remote port. This closure occurs if the remote node is running CI microcode version 7 or later, and the remote node has failed to respond to any messages sent by the local node.
User Action: This error is normal if the remote system halted, crashed, or was shut down. This error may mean that the local node's PASTIMOUT SYSGEN parameter is set too low, especially if the remote node is running privileged user-written software.

INSUFFICIENT NON-PAGED POOL FOR VIRTUAL CIRCUITS
Explanation: The VAXport driver closes virtual circuits because of insufficient pool.
User Action: Enter the DCL command SHOW MEMORY to determine pool requirements, and then adjust the appropriate SYSGEN parameters.
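The following sequence is a minimal sketch of that user action. The increment added to NPAGEDYN is illustrative only; choose a value based on the pool usage and failure counts that SHOW MEMORY reports.

    $ SHOW MEMORY/POOL/FULL
    $ EDIT SYS$SYSTEM:MODPARAMS.DAT     ! add a line such as:  ADD_NPAGEDYN = 100000
    $ @SYS$UPDATE:AUTOGEN GETDATA REBOOT

Recording the increase in MODPARAMS.DAT and rerunning AUTOGEN, rather than setting NPAGEDYN directly with SYSGEN, keeps the change from being lost the next time AUTOGEN runs.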
Note: The following descriptions apply only to Ethernet devices.

FATAL ERROR DETECTED BY DATALINK
Completion status: SS$_ABORT (0000002C)
Explanation: The Ethernet driver has shut down the device because of a fatal error and is returning all outstanding transmits with SS$_OPINCOMPL. The Ethernet device is automatically restarted, and all the aborted transmits are logged in the error log.
User Action: Infrequent occurrences of this error are probably not a problem. If they occur frequently, or are accompanied by connections to remote nodes being lost and reestablished, there is probably a hardware problem. Check for the proper Ethernet adapter revision level or call DIGITAL Field Service.

TRANSMIT ERROR FROM DATALINK
Completion status: SS$_OPINCOMPL (000002D4)
Explanation: The Ethernet driver is in the process of restarting the datalink because an error forced the driver to shut down the controller and all users (see FATAL ERROR DETECTED BY DATALINK).
Completion status: SS$_DEVREQERR (00000334)
Explanation: The Ethernet controller tried to transmit the packet 16 times and failed because of defers and/or collisions. This condition indicates that Ethernet traffic is very heavy.
Completion status: SS$_DISCONNECT (0000204C)
Explanation: There was a loss of carrier during or after the transmit.
User Action: The Port Emulator automatically recovers from any of these errors, but excessive numbers of them indicate either that the Ethernet controller is faulty or that the Ethernet is overloaded. If you suspect either of these conditions, contact DIGITAL Field Service.

INVALID CLUSTER PASSWORD RECEIVED
Explanation: A node is trying to join the cluster using the correct cluster group number for this cluster, but an invalid password. The Port Emulator discards the message. The probable cause is another cluster on the Ethernet using the same cluster group number.
User Action: Provide all clusters on the same Ethernet with unique cluster group numbers.

NISCS PROTOCOL VERSION MISMATCH RECEIVED
Explanation: A node is trying to join the cluster using a version of the cluster Ethernet protocol that is incompatible with the one in use on this cluster.
User Action: Install a version of the VMS operating system that uses a compatible protocol, or change the cluster group number so that the node joins a different cluster.

C.4.4 OPA0 Error Messages

VAXport drivers detect certain error conditions and attempt to log them. Under some circumstances, attempts to log the error to the error logging device may fail. Such failures may occur because the error logging device is not accessible when attempts are made to log the error condition. Because of the central role that the VAXport device plays in clusters, the loss of error-logged information in such cases makes it difficult to diagnose and fix problems.

A second, redundant method of error logging captures at least some of the information about VAXport device error conditions that would otherwise be lost. This second method consists of broadcasting selected information about the error condition to OPA0, in addition to the port driver's attempt to log the error condition to the error logging device. The VAXport driver attempts both OPA0 error broadcasting and standard error logging under any of the following circumstances:

• The system disk has not yet been mounted.
• The system disk is undergoing mount verification.
• During mount verification, the system disk drive contains the wrong volume.
• Mount verification for the system disk has timed out.
• The local system is participating in a cluster, and quorum has been lost.

Note the implicit assumption that the system disk and the error logging device are one and the same.

This second method of reporting errors is also not entirely reliable. Because of the way OPA0 error broadcasting is performed, some error conditions may not be reported. This situation occurs whenever a second error condition is detected before the VAXport driver has been able to broadcast the first error condition to OPA0. In such a case, only the first error condition is reported to OPA0, because that condition is deemed to be the more important one.

Certain error conditions are always broadcast to OPA0, regardless of whether the error logging device is accessible. In general, these are errors that cause the port to shut down either permanently or temporarily.
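As a quick way to confirm the assumption noted above on your own system, you can check where error logging and the system disk actually reside; both commands below are standard DCL, and the logical names shown are the usual system-defined ones.

    $ SHOW LOGICAL SYS$ERRORLOG
    $ SHOW DEVICE SYS$SYSDEVICE

If SYS$ERRORLOG translates to a directory on the system disk (the usual arrangement), any condition that makes the system disk inaccessible also prevents standard error logging, which is why the OPA0 broadcasts matter.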
One OPA0 error message for each error condition is always logged. The text of each error message is similar to the text in the summary displayed by formatting the corresponding standard error log entry using the Error Log Utility. (See Section C.4.3.4 for a list of Error Log Utility summary messages and their explanations.) Many of the OPA0 error messages contain some optional information, such as the remote port number, CI packet information (flags, port operation code, response status, and port number fields), or specific CI port registers.

Following is a list of OPA0 error messages, subdivided by error type. See the CI hardware documentation for a detailed description of the CI port registers (CNF = Configuration Register; PMC = Port Maintenance and Control Register; PSR = Port Status Register), which are optionally displayed for certain of the error conditions. The codes ALWAYS and INACCESSIBLE specify whether the message is always logged on OPA0 or is logged only when the system device is inaccessible.

Software Errors During Initialization (Always Logged on OPA0)

%Pxxn, Insufficient Non-Paged Pool for Initialization
%Pxxn, Failed to Locate Port Micro-code Image
%Pxxn, SCSSYSTEMID has NOT been set to a Non-Zero Value

Hardware Errors (Always Logged on OPA0)

%Pxxn, BIIC failure - BICSR/BER/CNF xxxxxx/xxxxxx/xxxxxx
%Pxxn, Micro-code Verification Error
%Pxxn, Port Transition Failure - CNF/PMC/PSR xxxxxx/xxxxxx/xxxxxx
%Pxxn, Port Error Bit(s) Set - CNF/PMC/PSR xxxxxx/xxxxxx/xxxxxx
%Pxxn, Port Power Down
%Pxxn, Port Power Up
%Pxxn, Unexpected Interrupt - CNF/PMC/PSR xxxxxx/xxxxxx/xxxxxx
%Pxxn, CI Port Timeout
%Pxxn, CI port ucode not at required rev level. RAM/PROM rev is xxxx/xxxx
%Pxxn, CI port ucode not at current rev level. RAM/PROM rev is xxxx/xxxx
%Pxxn, CPU ucode not at required rev level for CI activity

Queue Interlock Failures (Always Logged on OPA0)

%Pxxn, Message Free Queue Remove Failure
%Pxxn, Datagram Free Queue Remove Failure
%Pxxn, Response Queue Remove Failure
%Pxxn, High Priority Command Queue Insert Failure
%Pxxn, Low Priority Command Queue Insert Failure
%Pxxn, Message Free Queue Insert Failure
%Pxxn, Datagram Free Queue Insert Failure

Errors Signaled with a CI Packet

%Pxxn, Unrecognized SCA Packet - FLAGS/OPC/STATUS/PORT xx/xx/xx/xx (ALWAYS)
%Pxxn, Port has Closed Virtual Circuit - REMOTE PORT xxx (ALWAYS)
%Pxxn, Software Shutting Down Port (ALWAYS)
%Pxxn, Software is Closing Virtual Circuit - REMOTE PORT xxx (ALWAYS)
%Pxxn, Received Connect Without Path-Block - FLAGS/OPC/STATUS/PORT xx/xx/xx/xx (ALWAYS)
%Pxxn, Inappropriate SCA Control Message - FLAGS/OPC/STATUS/PORT xx/xx/xx/xx (ALWAYS)
%Pxxn, No Path-Block During Virtual Circuit Close - REMOTE PORT xxx (ALWAYS)
%Pxxn, HSC Error Logging Datagram Received - REMOTE PORT xxx (INACCESSIBLE)
%Pxxn, Remote System Conflicts with Known System - REMOTE PORT xxx (ALWAYS)
%Pxxn, Virtual Circuit Timeout - REMOTE PORT xxx (ALWAYS)
%Pxxn, Parallel Path is Closing Virtual Circuit - REMOTE PORT xxx (ALWAYS)
%Pxxn, Insufficient Non-paged Pool for Virtual Circuits (ALWAYS)

Cable Change-of-State Notification

%Pxxn, Path #0. Has gone from GOOD to BAD - REMOTE PORT xxx (INACCESSIBLE)
%Pxxn, Path #1. Has gone from GOOD to BAD - REMOTE PORT xxx (INACCESSIBLE)
%Pxxn, Path #0. Has gone from BAD to GOOD - REMOTE PORT xxx (INACCESSIBLE)
%Pxxn, Path #1. Has gone from BAD to GOOD - REMOTE PORT xxx (INACCESSIBLE)
%Pxxn, Cables have gone from UNCROSSED to CROSSED - REMOTE PORT xxx (INACCESSIBLE)
%Pxxn, Cables have gone from CROSSED to UNCROSSED - REMOTE PORT xxx (INACCESSIBLE)
%Pxxn, Path #0. Loopback has gone from GOOD to BAD - REMOTE PORT xxx (ALWAYS)
%Pxxn, Path #1. Loopback has gone from GOOD to BAD - REMOTE PORT xxx (ALWAYS)
%Pxxn, Path #0. Loopback has gone from BAD to GOOD - REMOTE PORT xxx (ALWAYS)
%Pxxn, Path #1. Loopback has gone from BAD to GOOD - REMOTE PORT xxx (ALWAYS)
%Pxxn, Path #0. Has become working but CROSSED to Path #1. - REMOTE PORT xxx (INACCESSIBLE)
%Pxxn, Path #1. Has become working but CROSSED to Path #0. - REMOTE PORT xxx (INACCESSIBLE)

Note that if the port driver can identify the remote SCS node name of the affected system, the driver replaces the "REMOTE PORT xxx" text with "REMOTE SYSTEM X...", where X... is the value of the SYSGEN parameter SCSNODE on the remote system. If the remote SCS node name is not available, the port driver uses the existing message format.

Two other messages concerning the CI port appear on OPA0. They are as follows:

%Pxxn, CI port is reinitializing (xxx retries left.)
%Pxxn, CI port is going off line.

The first message indicates that a previous error requiring the port to shut down is recoverable, and that the port will be reinitialized. The "xxx retries left" information specifies how many more reinitializations are allowed before the port must be left permanently off line. Each reinitialization of the port (for reasons other than power failure recovery) causes approximately 2K bytes of nonpaged pool to be lost.

The second message indicates that a previous error is not recoverable, and that the port will be left off line. In this case, the only way to recover the port is to reboot the system.
C-32 Index BOOT_CONFIG.COM command procedure A Adding a Cl-connected node• 3-7 Adding a satellite node• 3- 7 Alias node identifier See DECnet-VAX network Alias operations See DECnet-VAX network Allocation class• 5-5 to 5-9 assigning value to HSCs • 5-6 assigning value to nodes• 5-6 device name• 5-5 rules for specifying• 5-5 sample configurations• 5-6 Allocation class identifier• 5-5 Allocation class value determining in mixed-interconnect V AXcluster configuration • 3-4 Authorize Utility (AUTHORIZE) • B-1, B-2 A UTOGEN running with feedback option in mixedinterconnect V AXcluster configuration• 3-25 AUTOGEN.COM command procedure executed during CLUSTER_CONFIG.COM ADD phase•3-2 B Batch queue • 4-6 to 4-8 assigning unique name to• 4- 7 clusterwide generic • 4- 7 to 4-8 initializing• 4- 7 sample configuration• 4-6 setting up• 4- 7 to 4-8 starting• 4- 7 SYS$BATCH•4-7 Boot node See Boot server Boot server function in Local Area VAX cluster configuration •1-5 functions • 1-5 sample interactive ADD session• 3-21 Broadcast messages controlling • 3-12 disabling while adding nodes• 3-6 OPCOM messages• 3-12 shutdown messages • 3-1 2 Building a cluster• 3-1 to 3-24 c Cl (computer interconnect) analyzing error log entry• C-16 communication path failure• C-11 communication path hierarchy• C-10 error log entry• C-16, C-21 port•C-9 loopback datagram facility• C-12 polling•C-9 Cl Cable repairing• C-15 Cl-connected node adding•3-6 Cl Port verifying function• C-11 CLUEXIT bugcheck diagnosing • C-8 Cluster-accessible disk • 1-12, 5-1, 5-1 to 5-5 and MSCP Server• 5-1 , 5-2 MASSBUS disk• 5-1, 5-2 setting up • 5-1 UDA disk• 5-1, 5-2 UNIBUS disk• 5-1 , 5-2 Cluster authorization file (CLUSTER_ AUTHORIZE.DAT) See Security functions function in Local Area V AXcluster configuration •1-9 function in mixed-interconnect V AXcluster configuration • 1-9 Cluster common files • 1-5 Cluster queues• 1-12 Cluster SYSGEN parameters• A-1 to A-2 CLUSTER_CONFIG.COM command procedure adding nodes• 3-6 lndex-1 Index CLUSTER_CQNFIG.COM command procedure (cont'd.) converting standalone node to cluster node• 3-21 functions• 3-2 modifying satellite Ethernet hardware address• 3-14 preparing to execute• 3-5 removing satellite nodes • 3-13 required information• 3-5 sample interactive CREA TE session• 3-21 system files created during ADD phase for satellite node• 3-2 Common command procedures coordinating• 2-9 to. 
2-11 creating• 2-10 executing • 2-10 on cluster-accessible disks• 2-9 setting up• 2-10 SYLOGIN.COM • 2-11 Common-environment cluster• 2-1 creating• 2-9 preparing environment• 2-10 preparing operating environment• 2-1 Common file coordinating for multiple boot servers• 2-14 coordinating for multiple system disks• 2-14 job controller• 4-1 , 4-9 mail database• 2-13 NETPROXY.DAT•2-12 rights database• 2-14 RIGHTSLIST .DAT• 2-14 system• 2-11 SYSUAF .DAT• 2-12 VMSMAIL_PROFILE.DAT A• 2-13 Common system disk directory structure• 2-2 Computer interconnect (Cl)• 1-2 Connection manager restoring quorum after unexpected node failure • 3-26 Connection Manager• 1-9 to 1-11 Conversational bootstrap See Security functions Convert Utility (CONVERT) and exceptions file• B-2 to merge SYSUAF.DAT files•B-1 Crossed cable• C-12 lndex-2 D DECnet-VAX network alias node identifier, defining for cluster• 2-6 alias operations, enabling for satellite nodes• 2-8 circuit service, enabling for cluster boot server• 2-6 cluster functions• 1-9 configuring using NETCONFIG.COM command procedure• 2-6 copying remote node databases in V AXcluster environments• 2-8 making databases available clusterwide • 2-7 maximum address value, defining for cluster boot server• 2-6 modifying satellite Ethernet hardware address• 3-14 NETCONFIG.COM command procedure, sample interactive session• 2-6 NETNODE_REMOTE.DAT file, renaming to SYS$COMMON directory• 2-7 Network Control Program (NCP) • 2-7 remote node data, making available clusterwide •2-6 restoring satellite configuration data• 3-12 restoring satellite network configuration data• 3-12 starting the network• 2-7 tailoring• 2-6 Device cluster setting up• 5-10 disk managing • 5-1 to 5-12 naming conventions• 5-5 to 5-9 Device driver loading• 2-9 Device name• 5-5 to 5-9 allocation class • 5-5 and allocation • 5-5 to 5-9 Directory structure on common system disk• 2-2 Disk See also Dual-pathed disk See also Dual-ported disk cluster-accessible• 5-1, 5-1 to 5-5 storing common procedures on• 2-9 command procedures for setting up• 2-10 device naming conventions • 5-5 to 5-9 Index Disk (cont'd.) 
directory structure on common system disk• 2-2 HSC • 5-1 , 5-6 managing• 5-1 to 5-12 MASSBUS • 5-1, 5-2 dual-ported• 5-4 mounting • 5-10 MSCP-served • 5-1 paths•5-5 quorum• 1-11 restricted access • 5-1 setting up•2-10, 5-10 UDA•5-1, 5-2 UNIBUS• 5-1, 5-2 Disk class driver• 1-3 Disk controller• 1-2 Distributed file system • 1-3 Distributed job controller• 1-3 Distributed lock Manager• 1-3 Distributed processing• 1-12, 4-1 DSA disk dual-ported • 5-4 failover • 5-4 Dual-pathed disk• 5-2, 5-3 to 5-5 DSA•5-4 HSC•5-3,5-6 MASSBUS • 5-4 Dual-ported disk• 5-2 MASSBUS • 5-4 setting up• 2-9 Duplicate cluster system disk creating• 3-21 E Environment creating common-environment cluster• 2-1, 2-9 multiple-environment cluster• 2-1 user defining• 2-11 Ethernet error log entry • C-2 1 monitoring activity• 3..:....26 port•C-10 communication• C-10 Ethernet hardware address See Satellite node Exceptions file and CONVERT• B-2 use of•B-2 F Failover dual-ported DSA disk• 5-4 Failure of node to boot or join the cluster• C-1 File access controlling• 2-11 File system coordinating• 2-11 to 2-12 G Generic queue clusterwide batch• 4-7 to 4-8 clusterwide printer• 4-3 to 4-5 establishing local• 4-3 H Hang condition diagnosing • C-7 Hardware component computer interconnect (Cl)• 1-2 Ethernet • 1-2 hierarchical storage controller• 1-2 HSC• 1-2 optional• 1-2 star coupler• 1-2 V AXcluster • 1-2 VAX processor• 1-2 Hierarchical Storage Controller (HSC) changing allocation class values• 3-24 HSC disk• 1-2, 5-1, 5-2 dual-pathed • 5-3, 5-6 I INITIALIZE/QUEUE/BATCH command• 4-7 lndex-3 Index J JBCSYSQUE.DAT as common file• 2-10 sharing• 2-11 specifying location of• 4-1 Job controller• 1-3 Job-controller queue file • 1-12, 2-10, 4-1 , 4-9 K Known images installing• 2-10 L Local Area V AXcluster configuration boot server• 1-5 creating cluster security database• 1-8 monitoring Ethernet activity• 3-26 Local disk setting up• 2-9 Logical name defining• 2-10 defining for NETPROXY .DAT• 2-12 defining for SYLOGIN.COM • 2-9 defining for SYSUAF.DAT•2-12 defining for VMSMAIL_PROFILE.DAT A• 2-13 Login controlling• 2-11 M MAIL Database preparing common file• 2-13 Mail Utility (MAIL) controlling• 2-11 preparing common database• 2-13 MASSBUS disk• 5-1 as cluster-accessible device• 5-1 , 5-2 dual-pathed • 5-4 dual-ported• 5-4 Mixed-Interconnect V AXcluster configuration• 1-7 to 1-8 lndex-4 Mixed-Interconnect V AXcluster configuration (cont'd.) 
changing allocation class values on HSCs • 3-24 creating cluster security database• 1-8 determining allocation class value• 3-4 monitoring Ethernet activity• 3-26 MSCP-served HSC disk• 1-7 running AUTOGEN with feedback option• 3-25 updating MODPARAMS.DAT files• 3-23 volume shadowing• 5-10 to 5-12 MODPARAMS.DAT updating in mixed-interconnect VAX cluster configuration• 3-23 MODPARAMS.DAT file created during CLUSTER_CONFIG.COM ADD phase•3-2 Mounting disks • 5-10 MSCP Server• 1-3 for cluster-accessible disks• 5-1 , 5-2 initializing• 5-2 MSCP_LQAD parameter• 5-2 MSCP_SERVE_ALL parameter• 5-2 Multiple-environment cluster• 2-1 creating• 2-9 operating environment• 2-1 setting up operating environment• 2-11 N Naming devices• 5-5 to 5-9 NETCONFIG.COM command procedure See DECnet-VAX network NETNODE_REMOTE.DAT sharing• 2-11 NETNODE_UPDATE.COM command procedure See DECnet-VAX network NETPROXY.DAT building common version• 2-12 to 2-13 defining logical name for• 2-12 setting up• 2-12 sharing• 2-11 Network See DECnet-VAX network Network Control Program (NCP) See DECnet-VAX network Node HSC• 1-2 passive• 1-2 Node-specific startup functions• 2-11 Index Queue (cont'd.) 0 OP AO: workstation operator console terminal See Workstation node OPCOM messages See Broadcast messages Operating system coordinating files• 2-11 to 2-12 installing• 2-4 upgrading• 2-4 p Page file (PAGEFILE.SYS) created during CLUSTER_CONFIG.COM ADD phase•3-2, 3-3 Partitioning of cluster• 1-9, C-9 Port select button • 5-3 Preparation of common-environment cluster• 2-1 of common MAIL Database• 2-13 of common Rights Database• 2-14 of multiple-environment cluster• 2-1 Preparing cluster operating environment • 2-1 to 2-15 Preparing operating environment multiple-environment• 2-1 Printer queue• 4-1 to 4-5 assigning unique name to• 4-2 clusterwide generic• 4-3 to 4-5 establishing local generic• 4-3 initializing • 4-3 sample configuration• 4-2 setting up• 4-1 to 4-3 starting • 4-3 SYS$PRINT • 4-5 Proxy login controlling• 2-11 records• 2-12 Q controlling• 1-12, 4-1 job controller• 2-10 queue file• 1-12 job controller queue file• 4-1 printer See Printer queue setting up• 2-10 sharing• 2-10 single-node and cluster• 4-1 to 4-14 Quorum equation • 1-10 loss of quorum causes cluster hang condition• C-7 lowering value• 3-27 reasons for loss• C-7 QUORUM.DAT file• 1-11 Quorum disk • 1-11 Quorum disk mounting• 1-11 Quorum disk watcher• 1-11 Quorum file• 1-11 Quorum Scheme• 1-10 R RD series disk See Satellite node Recovering from failure satellite node fails to boot • C-4 Remote network node data controlling• 2-11 Remote node databases copying• 2-8 Removing a satellite node • 3-13 Resource sharing in cluster• 1-9 Restricted access disk• 5-1 Rights Database preparing common file• 2-14 RIGHTSLIST.DAT preparing common version of• 2-14 sharing• 2-11 RMS VMS RMS distributed file system • 1-3 Rules for allocation classes • 5-5 Queue batch See Batch queue command procedures• 2-10, 4-9 to 4-14 lndex-5 Index s Satellite node adding•3-6 disabling conversational bootstrap operations• 3-31 functions• 1-6 maintaining network configuration data• 3-12 modifying Ethernet hardware address• 3-14 obtaining Ethernet hardware address• 3-5 RD series disk used for local paging and swapping • 1-6 removing • 3-13 restoring network configuration data• 3-12 shutting down before removing from cluster• 3-13 system files created during CLUSTER_ CONFIG.COM ADD phase• 3-2 SCS (System Communications Services)• C-10 SCS SYSGEN parameters• A-2 to A-4 Security functions cluster authorization 
file (CLUSTER_ AUTHORIZE.DAT)• 3-30 Cluster_Authorize Utility (CLUSTER_ AUTHORIZE) sample interactive session• 3-30 controlling conversational bootstrap operations on satellite nodes • 3-3 1 overview• 3-29 SYSMAN Utility altering cluster security data• 3-30 SET CLUSTER/EXPECTED_VOTES command• 3-27 SET DEVICE/DUAL_PORT command• 5-4 Setup procedure coordinating cluster common files for multiple boot servers• 2-14 coordinating cluster common files for multiple system disks• 2-14 SHADOWING parameter setting on Cl-connected nodes in mixedinterconnect V AXcluster configuration• 5-10 setting on satellite nodes in mixed-interconnect V AXcluster configuration • 5-10 Shared command procedure files• 2-9 Shared disk volume • 5-9 for job controller queue file• 4-9 mounting• 5-9 Shared file JBCSYSQUE.DAT • 2-11 lndex-6 Shared file (cont'd.) NETPROXY.DAT•2-11, 2-12 RIGHTSLIST.DAT• 2-11 SYSUAF.DAT•2-11, 2-12 VMSMAIL_PROFILE.DATA • 2-11 Shared queues• 4-1 to 4-14 Show Cluster Utility (SHOW CLUSTER)• 3-26 Shutdown messages See Broadcast messages Shutting down the cluster• 3-27 Site-specific startup command file elements• 2-11 Standalone node converting to cluster node• 3-21 Star coupler• 1-2 ST ART /QUEUE/MANAGER command• 4-1 Startup node-specific function• 2-11 Startup command file building common version• 2-10 coordinating• 2-9 to 2-11 creating common version• 2-1 O site-specific elements• 2-11 Swap file (SWAPFILE.SYS) created during CLUSTER_CQNFIG.COM ADD phase• 3-2, 3-3 SYLOGIN.COM building common version• 2-11 coordinating• 2-9 to 2-11 creating common version of• 2-10 defining logical name for• 2-9 SYS$BATCH redefining• 4-7 SYS$PRINT redefining for local generic queues• 4-5 SYSGEN parameters Cluster parameters• A-1 to A-2 SCS parameters• A-2 to A-4 SYSMAN Utility See Security functions SYSTARTUP.COM to set up queues • 4-9 System command procedures coordinating• 2-9 to 2-11 System communications services See SCS System disk directory structure on common system disk • 2-2 Index System file VAXcluster (cont'd.) 
building common versions• 2-11 coordinating• 2-11 to 2-12 SYSUAF.DAT building common version• 2-12 to 2-13 defining logical name for• 2-12 printing listing of• 8-1 setting up• 2-12 sharing• 2-11 using CONVERT to merge• B-1 communication mechanisms• 1-9 configuration data recording• 3-25 Connection Manager• 1-3 devices• 5-1 to 5-12 diagnosing CLUEXIT bugcheck • C-8 diagnosing cluster hang condition• C-7 distributed file system • 1-3 Distributed Job Controller• 1-3 Distributed Lock Manager• 1-3 error log entries for V AXport device• C-16 failure of node to boot• C-1 failure of node to join the cluster• C-1, C-6 hang condition • C-7 to C-8 overview• 1-1 to 1-12 planning configuration functions• 3-1 preparing operating environment• 2-1 to 2-15 queues•4-1 to 4-14 Quorum reasons for loss • C-7 recording configuration data• 3-25 recovering from startup procedure failure• C-7 resource access • 1-3 resource locking • 1-3 satellite node boot failure • C-4 System Communication Services• 1-3 troubleshooting• C-1 to C-32 V AXport device error log entries • C-16 V AXport driver• 1-3 VAXCluster local configuration monitoring Ethernet activity• 3-26 mixed-interconnect configuration monitoring Ethernet activity• 3-26 VAXVMSSYS.PAR file created during CLUSTER_CQNFIG.COM ADD phase•3-2 Virtual circuit• C-9 VMSMAIL_PROFILE.DAT A defining logical name for• 2-13 preparing common version of• 2-13 sharing• 2-11 Volume label modifying for satellite's local disk• 3-3 Volume shadowing in mixed-interconnect V AXcluster configuration • 5-10 to 5-12 T Terminal setting up• 2-9 Troubleshooting• C-1 to C-32 u UDA disk• 5-1 as cluster-accessible device• 5-1 , 5-2 UNIBUS disk• 5-1 as cluster-accessible device• 5-1, 5-2 Upgradedsy~ems•2-4 User accounts comparing• B-1 coordinating• 2-12 to 2-13, 8-1 group UIC • B-1 User environment defining• 2-11 User identification code changing for directories• 8-1 changing for files • 8-1 coordinating• B-1 coordination • B-1 v VAXcluster boot events• C-1 building• 3-1 to 3-24 changing configuration type• 3-19 changing from Cl-only to mixed-interconnect configuration • 3-1 9 changing from local area to mixed-interconnect configuration • 3-20 lndex-7 Index w Workload balancing• 1-12, 4-1 Workstation node controlling broadcasts to operator console terminal (OPAO:) • 3-12 lndex-8 Reader's Comments VMS VAXcluster Manual AA-LA27A-TE Please use this postage-paid form to comment on this manual. If you require a written reply to a software problem and are eligible to receive one under Software Performance Report (SPR) service, submit your comments on an SPR form. Thank you for your assistance. I rate this manual's: Accuracy (software works as manual says) Completeness (enough information) Clarity (easy to understand) Organization (structure of subject matter) Figures (useful) Examples (useful) Index (ability to find topic) Page layout (easy to find information) Excellent Good Fair Poor D D D D D D D D D D D D D D D D D D D D D D D D D D D D D D D D I would like to see more /less What I like best about this manual is What I like least about this manual is I found the following errors in this manual: Page Description Additional comments or suggestions to improve this manual: I am using Version _ _ _ of the software this manual describes. Name/Title Dept. Date Company Mailing Address Phone I ·-;;~~;;:d Het Ta~ ------------------~lllr-------;~;~--_d in the United States BUSINESS REPLY MAIL FIRST CLASS PERMIT NO. 33 MAYNARD MASS. 