XX-AFCC3-76
August 1981
59 pages
Document:
Proposal for enhancement of UNIX on the VAX
Order Number:
XX-AFCC3-76
Revision:
0
Pages:
59
Original Filename:
joy2.pdf
OCR Text
Proposals for enhancement of UNIX* on the VAX

July 21, 1981
Revised August 31, 1981

William Joy and Robert Fabry

Computer Systems Research Group
Computer Science Division
Department of Electrical Engineering and Computer Science
University of California, Berkeley
Berkeley, CA 94720
(415) 842-7780

CSRG TR/4 - August 31, 1981 - Joy/Fabry

ABSTRACT

This report describes several proposals for enhancements to the UNIX system on the VAX to meet the needs of the users in the ARPA research community. The areas covered in this report include inter-process communication and networking facilities, segmentation and shared-file access, file system facilities and performance improvements, system support for large software projects and software distribution, standardization of system facilities, operational support, and ongoing software efforts. An appendix provides an index to the document in a summary of proposed system facilities.

We welcome comments on these proposals, either by U.S. Mail to the address given above, or electronically. Our ARPANET addresses are wnj@berkeley and fabry@berkeley. Our uucp addresses are ucbvax!wnj and ucbvax!fabry. Electronic mail is preferred.

* UNIX is a trademark of Bell Laboratories.

TABLE OF CONTENTS

1. Introduction

2. Interprocess communications and networking
    Goals
    Assumptions
    Addresses and sockets
    Datagram facilities
    Circuit facilities
    Multiplexing facilities
    Portals
        Portal protocols
        Portal activation
        Portal examples
    Providing network accessible services
    Non-blocking and interrupt-driven i/o
    More details about circuits
        Record mode
        Urgent data
        Failure of circuits
        Circuits simulating pipes
        Closing
    Watermarks, options and status inquiries
    Extensions being considered
    Status of the implementation
    Alternatives and comparison

3. Memory management facilities
    Standard UNIX facilities
    Previous VAX enhancements
    Goals
    Motivations for segments
    Allocating segments
    Segment sizes and rounding
    Segment protections
    Freeing segments
    Giving the system advice
    Special segments
    How exec can be written
    Simulating copy-on-write
    Special requirements: growing stacks
    Huge processes and page table sizes
    Page replacement algorithms for VAX
    Status and related changes
    Alternatives and comparison

4. File system performance enhancements
    Standard UNIX file system
    Previous VAX enhancements
    Goals
    Major problems
    Description of approach
    Policies for new file system
    Measurements of program speeds
    Estimates of file system performance
    Buffering and page caching
    Fragmentation in the new organization
    Status
    Alternatives and comparison

5. New file system facilities
    Symbolic links
    Naming directories
    Locking primitives
    Append access and no-delay opens
    Truncate
    Rename
    Per-file cache flushing
    Status

6. Software projects and distribution support
    Current UNIX facilities
    Goals
    Components of the proposal
    CMU project notion
    Strong naming support for projects
    Makefile standards
    Reviving the UNIX group facility
    Source revision control
    Notification/update facilities
    Role of unique-identifiers for files
    Towards site-independent programs
    Status

7. Standards
    Manual format
    Libraries
    Mail
    Signals
    Terminal driver interface
    Control; cleaned up ioctls
    Debugging information format
    Screen environment support
    Other areas

8. Operational support
    Standard UNIX facilities
    Current VAX facilities
    Overview of needs
    Operator notion
    Clean localization of system
    Error logging
    Dump/restore needs
    Archive/retrieve design

9. Miscellaneous topics
    Software census and contribution to standard system
    Electronic forum for system users
    Hardware support; new and dual processors
    Debuggers
    Fortran 77
    Detaching jobs
    UNIX and VMS: performance and facilities

10. Index and summary of proposed system facilities

1. Introduction

This report presents our proposals for enhancements to UNIX on the VAX. Succeeding sections describe proposals for various parts of the system. The rest of this section outlines these proposals.

Section 2 describes a proposal for interprocess communication on UNIX and an interface using these IPC facilities to networks, both local and long haul. We expect that there will be many different networks interfaced to UNIX and that the facilities described here can be used to easily interface to these different networks.

Section 3 describes the proposed extensions to UNIX memory management. Current large scale AI and image processing programs are generally limited by architectural or system constraints to a few Megabytes of address space; by the end of the decade we expect that similar large programs may routinely use address spaces as large as a Gigabyte. VLSI design programs for large designs may likewise use enormous amounts of both space and time. The proposals in this section address the management of extremely large address spaces and propose a segment based view of virtual memory. Facilities to provide segment reference control and copy-on-write like facilities are also described. Special needs of programs that do involved stack manipulations are also addressed.

Section 4 describes proposed changes to the UNIX file system organization to provide greater throughput.
The file system design focuses on information organization for maximum locality of access and high data throughput across a range of mass storage technologies.

Section 5 describes file system facilities that are needed for various applications but not provided by the current file system. Examples include locking of files to control concurrent access and symbolic links.

Section 6 describes system support for software projects and software distributions. It builds on the CMU project implementation, combining it with other facilities: source revision control, strong naming of projects, enhanced UNIX groups, standards for Makefiles, and automated distribution facilities. The proposed facilities provide for convenient distribution of large bodies of software.

Section 7 describes areas of the system where standardization on a single set of facilities will benefit the user community. New standards are suggested to cover the format of the system documentation, contents of system libraries, mail processing protocols and formats, the primitives for handling software signals, the interface of the terminal driver, the format of information used by debuggers, and the environment for screen management support.

Section 8 describes issues in operational support of the system. Several new facilities to be integrated or provided in the standard system are described: the notion of an operator, clean localization of the system (making more of the binaries cpu site independent), error logging, enhancements to dump and restore procedures, and provision of new archival and retrieval facilities.

Section 9 covers miscellaneous topics including the construction of a software availability database, hardware support, and the status of various system programs that are being worked on including debuggers and the FORTRAN 77 system.

We conclude in section 10 with a table of the proposed kernel facilities.
2. Interprocess communications and networking

This section describes our proposed inter-process communications facilities for UNIX. Our proposal constructs an IPC framework that can be used to build a number of different protocols for communication, and to support different distributed operating systems and applications.

Initially we intend to add the facilities described here to UNIX. We will then begin to implement portions of UNIX itself using the IPC as an implementation tool. This will involve layering structure on top of the IPC facilities. The eventual result will be a distributed UNIX kernel based on the IPC framework.

The IPC mechanism is based on an abstraction of a space of communicating entities communicating through one or more sockets. Each socket has a type and an address. Information is transmitted between sockets by send and receive operations. Sockets of specific type may provide other control operations related to the particular protocol of the socket.

In providing access to the communications space, we will initially support only three socket types, but have specifically designed the facilities so that new socket types may be easily added. The initially proposed socket types provide virtual circuits and datagrams. Circuits are two-way reliable data streams, and datagrams are unreliable one-way messages that are sent without explicit acknowledgment and often with limitations on length. These facilities admit simple and efficient implementations both in the single machine case and when interfacing to network protocols, and this is why they were chosen initially.

The first version of the IPC facilities for UNIX will support an IPC address space that is an extension of the TCP/IP address space, a comparatively flat 32 bit address space with additional addressing available at each node.
We expect to add generic addressing, broadcasting and multiplexing as needed and to experiment with the amount of late binding in the "addressing" scheme. The flexibility to allow this is explicitly provided by our basic model.

We expect that in constructing a distributed UNIX system on top of the basic model we will provide services such as migration of processes, but we do not insist that the address space underlying the IPC have the ability to directly and transparently support migration; we will layer it on while implementing UNIX if necessary.

When we use the facilities described here to implement networked versions of the UNIX system we will build on the IPC address space to derive resource identifiers (larger objects that contain addresses, rights and authentication) and use encryption and other well-known techniques to create protection domains and do authentication. The reader is assumed to be familiar with such techniques.

To support multiplexing of communications in UNIX both a synchronous facility based on the ADA select statement and an asynchronous software-interrupt (signal) based facility are provided. These facilities are not part of the basic IPC model, but of its embedding in the UNIX system.

The IPC facilities are integrated into the current UNIX name space by portals, entries in the file system that invoke server processes when accessed. These entries are designed to be used by naive processes that are unaware of the use of communication. The basic IPC communications facilities and portals may be used to provide services on a single machine and in a networked environment.

A more complete description of the motivation of the IPC architecture described here, measurements of a prototype implementation, comparisons with other work and a complete bibliography are given in CSRG TR/3: "An IPC Architecture for UNIX".

2.1. Goals

We see at least four distinct areas where UNIX IPC will be important:

* In supporting inter-process communication within a single machine.
* In supporting access to the facilities of the available local and long-haul networks.
* In constructing services on a tightly coupled set of machines to make the facilities of all machines available to users.
* In constructing servers for autonomous machines, which allow access to resources while retaining local administrative control.
* In providing uniform access to IPC objects and current UNIX objects.

In meeting these needs we wish to keep, as at present, the UNIX kernel largely as an i/o multiplexor. We wish to place facilities unrelated to the basic IPC mechanisms (such as name servers and authenticators) outside the kernel.

2.2. Assumptions

Our design is based on the layered models for distributed systems, such as the ISO Open Systems Architecture. We assume that the system facilities are built on services provided by network layers in that model and make assumptions in our design about the internetwork:

* The internetwork provides datagram services and perhaps virtual circuits.
* The internetwork provides origin and destination addresses in all messages.
* All entities with which we wish to communicate can be given internetwork addresses.

The facilities to be provided by the kernel to the users' processes include:

+ Datagram and virtual circuit access to the network.
+ Buffering and multiplexing of communications.
+ Creation of servers when they are referred to, so that they need not pre-exist.
+ Translation of access to names in the UNIX name space into accesses to server processes.
+ Translations of system calls into protocol when communicating with servers that simulate UNIX objects such as file and directory hierarchies.

Facilities not to be provided by the kernel are:

- A network name server.
- Control of information access and protection in the network.
- Transmission of structured information and data representation conversion.

Such facilities are desirable, but will be implemented outside the kernel so that application-specific and site-specific facilities can be created.

2.3. Addresses and sockets

We assume that the transport layer of the system provides us with an internetwork wide address space. Each message to be sent includes source and destination addresses. The type in_addr will be used to refer to an internetwork address. We expect, but do not require, that such addresses be of fixed length. For definiteness the reader may assume that an in_addr has the following form:

    typedef struct in_addr {
            int     ipaddr;         /* internet address */
            int     moreprecise;    /* sub-addressing at destination */
    } in_addr;

We expect that some internetwork addresses will be generic and some will be location independent. The resources available in this way will vary from network to network.

Our proposal uses a socket abstraction in both the circuit and datagram implementations. Sockets are the destination of all internetwork communication. If a socket is not active (no process is servicing it) when communication is attempted to the socket then the information may be discarded or a server may be created to service the socket. The types of sockets available are represented by the type in_proto:

    typedef enum in_proto { SOCK_DG, SOCK_CALL, SOCK_VC } in_proto;

Each socket has some buffering associated with it. SOCK_DG datagram sockets buffer incoming datagrams; SOCK_CALL call director sockets buffer incoming and outgoing calls; SOCK_VC virtual circuit sockets have a queue for incoming data on their circuit and logically reference a matching SOCK_VC socket where transmitted data is stored.

Active sockets are referenced by small integer "file descriptors".
A set of file descriptors is represented by the type fd_set that is represented by a bit string and is used in the select primitive for synchronous i/o multiplexing.

2.4. Datagram facilities

A datagram is a short piece of data sent to a specific socket address. No guarantee of reliable delivery is made for datagrams, and they are typically limited in length to just over 512 characters per datagram.

A socket for receipt of datagrams may be created by using the socket system call:

    in_addr addr;
    in_addr pref;
    int s;

    s = socket(SOCK_DG, &addr, &pref);

The returned s is a descriptor for a socket, and the returned addr is the address of the created socket. If the third argument to the socket call is a 0, then the system chooses an address for the created socket. You can specify pref if you wish to set up a specific, well-known socket, e.g. for a server. If an error occurs then a -1 value is returned for s as is normal in UNIX.

To send a datagram from a socket the system provides a send primitive, which is invoked

    in_addr dest;
    char *msg;
    int len;

    ... initialize values of s, dest, msg, len ...
    send(s, &dest, msg, len);

to send msg of len bytes to dest. The value of dest must be initialized before this call from well known data (e.g. the network equivalent of "411" and "555-1212" or "15.000Mhz") or by obtaining it from another process.

A datagram can be received by a receive system call:

    int d;
    in_addr source;
    char msg[MAXMSG];
    int len;

    ... initialize socket d with addr dest as above ...
    len = receive(d, &source, msg, MAXMSG);

that returns, in the supplied message buffer msg, len bytes from the source address returned in source. If the datagram would not fit in the supplied buffer, then the remainder is discarded and the len gives the length of the datagram before truncation. Each receive call removes a single datagram from the buffer space associated with the socket.
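The truncation rule for receive can be illustrated with a small user-level sketch. This is not part of the proposal: deliver_datagram is a hypothetical helper, assumed here only to model how one queued datagram might be handed to a caller whose buffer may be too small.

```c
#include <string.h>

/*
 * Hypothetical helper (an assumption for illustration, not a proposed
 * system call): copy one queued datagram into the caller's buffer the
 * way receive is described as behaving.
 */
int deliver_datagram(const char *dgram, int dgram_len, char *buf, int buf_len)
{
    /* Copy only as much as fits in the caller's buffer. */
    int ncopy = dgram_len < buf_len ? dgram_len : buf_len;

    memcpy(buf, dgram, (size_t)ncopy);  /* the remainder is discarded */
    return dgram_len;                   /* length before truncation */
}
```

Under this reading, a caller can detect a truncated datagram by comparing the returned length with the size of the buffer it supplied.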
The following example shows a time server program that creates an internetwork datagram socket to which a message can be sent causing a message with the time to be returned. It could be used by a small computer on a network to obtain the time of day from a central server.

    #include <inet.h>           /* defines in_addr, SOCK_DG, etc. */
    #include <types.h>
    #include <wellknown.h>      /* defines WWV_ADDR and others */

    /* tsaddr is the well-known-address of the time server */
    in_addr tsaddr = WWV_ADDR;

    main()
    {
            char buf[1];
            int len;
            in_addr addr;
            int s;
            char *ctime(), *timestr;
            time_t t;

            s = socket(SOCK_DG, 0, &tsaddr);
            if (s < 0) {
                    printf("can't get socket\n");
                    exit(1);
            }
            for (;;) {
                    /*
                     * We receive a datagram and discard its contents,
                     * to get the address of the sender.  A more sophisticated
                     * time server might handle several requests based
                     * on the contents of the received datagram.
                     */
                    receive(s, &addr, buf, sizeof (buf));
                    time(&t);                   /* get binary time */
                    timestr = ctime(&t);        /* convert to string form */
                    send(s, &addr, timestr, strlen(timestr));
            }
    }

Here the socket call associates this process with the time server socket whose address is specified, returning -1 if there is something wrong with tsaddr (i.e. not providable on this machine) or if the socket is already in use (e.g. by another instance of the time server). If the socket is openable the server loops reading a packet from the socket for the sole purpose of obtaining the address it came from and sending back the time without further ado.

2.5. Circuit facilities

To use a virtual circuit one first obtains a SOCK_CALL call director socket that is associated with a specific network address. Calls may be placed from and answered at this SOCK_CALL socket. Each call placed or answered yields a distinct new virtual circuit socket that allows for the reliable, flow-controlled transmission of arbitrary amounts of data to and from the party at the other end of the circuit. Circuits allow specially marked urgent information to be sent, give out-of-band notification of the presence of urgent data, and allow record boundaries to be marked in the stream. These circuit options are described in section 2.8.

Processes can send and receive data on a circuit with the normal UNIX read and write calls. Conversations are flow controlled by the underlying mechanisms; if the sender writes data faster than the receiver can accept it, the sender will block. If the receiver reads data when none is available, it will block pending receipt of more data. In the default stream mode, a read returns as soon as data is available and the system does not preserve any boundaries within the information stream. A record oriented mode for data transmission will be described in section 2.8.

So that incoming and outgoing calls may be queued, a process must have access to a call director socket to place or receive a call. A SOCK_CALL socket is created with a socket call:

    int s;
    in_addr addr, pref;

    s = socket(SOCK_CALL, &addr, &pref);

The returned s is a "file" descriptor for a socket for establishing virtual circuits, by calling and receiving calls. When calls are placed or answered additional descriptors are obtained for the SOCK_VC virtual circuit sockets corresponding to the calls.

A call is received by doing:

    int t;
    in_addr caller;

    t = answer(s, &caller);

This returns a descriptor for the new SOCK_VC socket for the conversation with caller. Several answer calls may be done on a single call director socket; each yields a SOCK_VC virtual circuit socket representing a single conversation.

To place a call establishing a circuit one must first have access to a SOCK_CALL call director socket at some address. Assuming the SOCK_CALL socket exists as s created as above, a call could be placed by:

    int t;
    in_addr callee;

    ... initialize callee ...
    t = call(s, &callee);

After placing a call, a new descriptor is obtained corresponding to the new SOCK_VC virtual circuit socket. If the call fails then a value of -1 is returned. When the conversation with callee is complete, the virtual circuit socket t can be closed.

Both call and answer may be done at a single SOCK_CALL socket.

The following example uses the circuit facilities to build a telnet server creating server processes (login commands) each time someone connects to the telnet socket:

    #include <inet.h>
    #include <signal.h>
    #include <wellknown.h>

    in_addr teladdr = TELNET_ADDR;

    main()
    {
            void reaper();
            int s = socket(SOCK_CALL, 0, &teladdr);

            if (s < 0) {
                    printf("can't get socket\n");
                    exit(1);
            }
            sigset(SIGCHLD, reaper);
            for (;;) {
                    int t = answer(s, 0);

                    if (fork() == 0) {
                            /* child: make the circuit stdin, stdout and stderr */
                            dup2(t, 0);
                            dup2(0, 1);
                            dup2(0, 2);
                            close(s);
                            close(t);
                            execl("/etc/tellogin", 0);
                            exit(1);
                    }
                    close(t);
            }
    }

    #include <wait.h>

    /* reaper() allows all children which have died to exit. */
    void
    reaper()
    {
            while (wait3(0, WNOHANG, 0) >= 0)
                    continue;
    }

Here the basic server answers to the telnet socket it created. Each time a connection is made to the virtual circuit socket a new instance of a special login server /etc/tellogin is created. When a login is complete, the child exits and the reaper routine is called with a signal; it collects the terminated children.

2.6. Multiplexing facilities

In writing communications oriented programs it is often desirable to process information arriving from more than one source. The proposed IPC facilities provide three mechanisms for use in handling communication with more than one party: a synchronous facility based on the select statement, a facility for preventing i/o operations from blocking, and an asynchronous facility based on software interrupts. The latter two facilities will be described in section 2.9. We here describe multiplexing with select.

Multiplexing facilities are generally useful for UNIX and we expect they will be gradually made available for more system services and devices. We expect to provide them for terminals with the first release of the IPC.

To support synchronous processing of information from more than one source we provide a select call, of the form:

    int nfds, nready;
    fd_set reads, writes;

    nready = select(nfds, &reads, &writes, timeout);

The select call is provided with a structure describing file descriptors that are interesting; reads for descriptors where readability is interesting and writes for descriptors where writability is interesting. The system examines each specified descriptor to see if there is an input or output operation possible on it, and returns in reads and writes sets of all such descriptors. Nfds gives the count of descriptors representable by type fd_set so that the size of the second and third arguments to select need not be fixed in the system, but may vary from program to program. Either reads or writes may be specified as 0 to denote that no descriptors are interesting to read or write. If no descriptor comes ready within timeout milliseconds, the select returns, returning a value of 0. Timeout may be 0 for immediate return or -1 to not return prematurely.

The name select is chosen from the name of the statement in the ADA language whose semantics are similar. The select statement is also similar to the await mechanism provided in extensions to UNIX at BBN.
The difference is the way that the interesting sockets are described and returned. With await the system keeps a list of interesting file descriptors internally, instead of having it specified at each call, and the return value is an array of integers instead of a bit mask. Await does not provide the timeout facility. Library routines to simulate await could easily be implemented using the facilities of select.

An important point in the semantics of select is that it imposes no bias. The mechanism for selecting among sockets that can be processed is left to the user.

The previous example program made use of an asynchronous facility for handling process termination. A reasonable extension to UNIX would be to provide a record on a special circuit when child processes terminate. This program could then be written using select to service the two circuits synchronously. Assume that a call waitsocket yields a socket on which messages of type child_status are placed when child processes terminate. A revised version of the previous example is shown below. Here we have used standard library routines setfd that adds an element to a bit-set of type fd_set and a routine getfd that destructively removes an element from one of these sets returning the value -1 when the set is empty.

    #include <inet.h>
    #include <signal.h>
    #include <wellknown.h>

    #define FOREVER -1

    in_addr teladdr = TELNET_ADDR;
    fd_set sandp, choose;

    main()
    {
            int s = socket(SOCK_CALL, 0, &teladdr);
            int p = waitsocket();
            int i, t;

            if (s < 0 || p < 0) {
                    printf("can't get socket\n");
                    exit(1);
            }
            setfd(&sandp, s);
            setfd(&sandp, p);
            for (;;) {
                    choose = sandp;
                    select(NOFILE, &choose, 0, FOREVER);
                    while ((i = getfd(&choose)) >= 0) {
                            if (i == p) {
                                    /* a child terminated; consume its status record */
                                    child_status chstatus;

                                    read(p, &chstatus, sizeof (chstatus));
                                    continue;
                            }
                            t = answer(s, 0);
                            if (fork() == 0) {
                                    dup2(t, 0);
                                    dup2(0, 1);
                                    dup2(0, 2);
                                    close(s);
                                    close(t);
                                    execl("/etc/tellogin", 0);
                                    exit(1);
                            }
                            close(t);
                    }
            }
    }

2.7. Portals

The mechanism whereby services may be created in the UNIX file system name space involves creating a bridge between the file system name space and an IPC socket called a portal. Portals are client/server links and as such are asymmetric. The client accessing the portal may well be unaware that the object referenced is not a traditional UNIX object; in all but the most trivial cases, the server of the portal is interpreting a protocol and is cognizant of the existence of the portal.

A portal is created by the call

    typedef enum portal_kind {
            PORTAL_CALL, PORTAL_FILE, PORTAL_DEV, PORTAL_DIR
    } portal_kind;

    portal_kind kind;
    char *name;
    int mode;
    char *server;
    int s;

    s = portal(kind, name, mode, server);

where name is the pathname for the portal, mode is the UNIX protection mode for name, and server is a string specifying the server to be invoked when the portal is accessed. The kind specifies the type of portal, and thereby specifies the protocol generated by the kernel for operations by client processes on it. The s returned is a descriptor for a SOCK_CALL call director socket to which the kernel will place calls when opens are done on name.

UNIX protection modes are used to control access to the sockets associated with a portal. The call director socket for a portal is not accessible using internetwork addresses. It is therefore accessible only using a reference through the file system name space.

2.7.1. Portal protocols

The portal types are implemented by the kernel by translating system calls applied to the file descriptors returned from opens on a portal into protocol records on the SOCK_VC sockets the server receives when it answers calls.
The exact specification of these protocols is beyond the scope of this paper, but we outline the basic nature of the protocols here.

A PORTAL_CALL portal acts like a virtual circuit socket, and simply passes calls onto the underlying SOCK_CALL socket.

A PORTAL_FILE translates reads and writes on the underlying SOCK_VC resulting from an open into a record-oriented request packet to the server. The kernel expects an appropriate reply to complete the operation for the client. Operations fstat and lseek are also possible on descriptors obtained by clients by opening a PORTAL_FILE.

A PORTAL_DEV is like a PORTAL_FILE, but also allows control operations, a generalization of ioctl to be described in section 7.6. A PORTAL_DEV thus can be used to simulate a general UNIX device, such as a terminal.

A PORTAL_DIR can be used to simulate a UNIX directory, as calls such as open, unlink and creat are translated into appropriate protocol. A result of such a call is often another connection to a service process to provide a file interface via the PORTAL_FILE or PORTAL_DEV protocol. The system call chdir to remote directories can be supported by allowing the current directory to be a connection to a server implementing the PORTAL_DIR protocol.

2.7.2. Portal activation

The service process need not exist when a portal is first referenced. If it does not, a socket is created and associated with the in-core information about the file system entry for the portal. The server string is taken as a path name of the server program and that server is created in the environment of the process referencing it, receiving as descriptor 0 the socket associated with the portal, inheriting the current directory and user-id of the accessing process. The server process may be set-user-id to allow it to run in a different protection domain.
The server process created has as parent the process that created it but is marked to not notify the parent when it finishes execution, since the accessing process is not aware of its presence. The portal process may service more than one request on the descriptor or exit at any time. Processes accessing a portal may wait for the server to service them much as callers wait for an answer to occur on a virtual circuit.

When a portal is created the portal call returns a descriptor for the portal. Portals thus are created live. If the pointer to the server in a portal call is 0, this portal is accessible only while it is live; the portal will be closed if the server dies. A process may thus establish a portal that it will serve and bypass the server creation mechanism.

2.7.3. Portal examples

The example given below shows a mail server utility that looks up forwarding addresses:

    main()
    {
            int p;
            char *lookup();

            unlink("forwarding");
            p = portal(PORTAL_CALL, "forwarding", 0666, 0);
            for (;;) {
                    int s, len;
                    char name[128];
                    char *addr;

                    s = answer(p, 0);
                    recordmode(s, 1);
                    len = read(s, name, sizeof (name));
                    addr = lookup(name);
                    write(s, addr, strlen(addr));
                    close(s);
            }
    }

The server creates a portal named forwarding of virtual circuit type. If you want to look up a forwarding address you can do:

    char buf[128];
    FILE *f = fopen("forwarding", "r+");

    recordmode(fileno(f), 1);
    fprintf(f, "jones\n");
    fgets(buf, sizeof (buf), f);

We could also write a server to be created automatically instead of manually. We would create the portal using a call:

    portal(PORTAL_CALL, "/etc/forwarding", 0666, "/etc/forwarder");

Then when the file /etc/forwarding is first referenced, a /etc/forwarder will be created to service it.
This portal would normally be created by a shell command:

        $ portal call /etc/forwarding /etc/forwarder

The server /etc/forwarder would be created with descriptor 0 referring to the portal /etc/forwarding, and would be written:

        main()
        {
            char *lookup();

            for (;;) {
                int s, len;
                char name[128];
                char *addr;

                s = answer(0, 0);       /* descriptor 0 is the portal */
                recordmode(s, 1);
                len = read(s, name, sizeof (name));
                addr = lookup(name);
                write(s, addr, strlen(addr));
                close(s);
            }
        }

A server could be created in internetwork space by using a socket instead of a portal, or automatically created on reference in internetwork address space using an association. These facilities are discussed in the next section.

2.8. Providing network accessible services

Recall that portals are not accessible using the internetwork addressing mechanisms, so that UNIX protection applies to them. It is thus necessary to provide a separate facility to allow servers to be dynamically created as a result of internetwork address space references. The call

        in_addr addr;
        in_proto kind;
        char *server;

        associate(&addr, kind, server);

specifies that a server of type kind is to be provided for internetwork address addr; the address must be on the current machine. A reference to the address addr causes the specified server to be created and given access to the newly created socket of type kind, either SOCK_DG or SOCK_CALL. The created process will be run with the user-id and group-id of the user who supplied the association, from the root directory of the file system, and with the system initialization process as parent. The power to create associations may be limited administratively on a particular machine. It is likely that certain internetwork addresses will be reserved to privileged user-ids, and that normal users would not be allowed to specify these addresses for associations.
An association may be removed by a

        disassociate(&addr);

As an example of the use of associations, assume that an internetwork registry exists on the local network and we wish to create a service program that will be known to the registry. The program given below creates an association for the server and registers it with the registry. This program could be invoked as

        $ register servicename program

to register servicename to access program. We assume that the registry operates by accepting a call from the program followed by three records on the connection: the operation type as the first record, consisting of the word register for registration requests. For registrations the second record is the name to be registered, and the third record is the internetwork address.

Note: in this example we use printf to print error messages; in a production program we would use the C library routine perror that looks up an error message, and can yield more precise system characterizations of the error. We use printf here since the error messages in the source can help understand the program while calls to perror would all have the form

        perror(x);

where x would be s or t. This is not enlightening to the code reader.

        #include <inet.h>
        #include <wellknown.h>

        in_addr registry = REGISTRY_ADDR;       /* well-known */
        in_addr serviceaddr;
        char    response[128];

        /*
         * register servicename program
         */
        main(argc, argv)
            int argc;
            char *argv[];
        {
            int s, t;
            char *servicename, *program;

            if (argc != 3) {
                printf("usage: register servicename program\n");
                exit(1);
            }
            servicename = argv[1];
            program = argv[2];
            /*
             * Get a socket to call the registry with.
             * Since both this and the socket to be registered
             * are assumed to be call director sockets we simplify
             * the program by just registering the socket we are talking on.
             */
            s = socket(SOCK_CALL, &serviceaddr, 0);
            if (s < 0) {
                printf("no sockets available\n");
                exit(1);
            }
            t = call(s, &registry);
            if (t < 0) {
                printf("registry doesn't answer\n");
                exit(1);
            }
            if (associate(&serviceaddr, SOCK_CALL, program) < 0) {
                printf("can't associate service\n");
                exit(1);
            }
            recordmode(t, 1);
            write(t, "register", 8);
            write(t, servicename, strlen(servicename));
            write(t, &serviceaddr, sizeof (serviceaddr));
            closesend(t);
            if (read(t, response, sizeof (response)) < 0) {
                printf("no response from registry\n");
                exit(1);
            }
            if (strcmp(response, "ok") != 0) {
                printf("error registering: %s\n", response);
                disassociate(&serviceaddr);
                exit(1);
            }
        }

We note in passing that the placement of the service name in the registry and the placement of the association of the name in the local association table would ideally be done as a single distributed atomic operation.

2.9. More details about circuits

We now describe the rest of the facilities and attributes of virtual circuits that were not yet described. The calls described in the following sub-sections are written as library routines, and will use the ioctl-like system control interface (see also section 7.8).

2.9.1. Record mode

Circuits support a record mode, where each piece of data written on the circuit is considered a single record, and reads return complete records. This allows records to be read and written conveniently. The call

        recordmode(s, 1);

sets a virtual circuit socket to be in record mode. A newly created virtual circuit socket is not in record mode. Record mode may be disabled by doing

        recordmode(s, 0);

If you read only part of a record while in record mode because the buffer supplied to read or the read buffering of the socket is insufficiently large to contain the entire record, then the remainder of the record is made available on successive reads.
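The partial-read rule can be illustrated with a small user-level model. The sketch below is not the proposed kernel interface: rec_stream, rs_write, rs_read and rs_between are hypothetical names invented for this illustration, and the queue lives in ordinary memory rather than in a circuit. It is meant to show only the intended semantics, assuming both ends are in record mode: each write queues one record, a read never crosses a record boundary, and a short read leaves the remainder of the record for the next read.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define RS_MAXREC 8     /* records the model can queue (arbitrary) */
#define RS_MAXLEN 64    /* longest record the model accepts (arbitrary) */

/* Hypothetical in-memory model of one direction of a record-mode circuit. */
struct rec_stream {
    char data[RS_MAXREC][RS_MAXLEN];    /* queued records, oldest at head */
    int  len[RS_MAXREC];                /* length of each queued record */
    int  head;                          /* index of the oldest record */
    int  count;                         /* number of records queued */
    int  off;                           /* bytes already read from the oldest record */
};

/* Queue buf[0..n) as one record; returns n, or -1 if it does not fit. */
int rs_write(struct rec_stream *rs, const char *buf, int n)
{
    int slot;

    assert(rs != NULL && buf != NULL);
    if (n < 0 || n > RS_MAXLEN || rs->count == RS_MAXREC)
        return -1;
    slot = (rs->head + rs->count) % RS_MAXREC;
    memcpy(rs->data[slot], buf, (size_t)n);
    rs->len[slot] = n;
    rs->count++;
    return n;
}

/* Read at most n bytes without crossing a record boundary; -1 if nothing is queued. */
int rs_read(struct rec_stream *rs, char *buf, int n)
{
    int left;

    assert(rs != NULL && buf != NULL);
    if (rs->count == 0 || n < 0)
        return -1;
    left = rs->len[rs->head] - rs->off;
    if (n > left)
        n = left;
    memcpy(buf, rs->data[rs->head] + rs->off, (size_t)n);
    rs->off += n;
    if (rs->off == rs->len[rs->head]) {     /* record fully consumed */
        rs->head = (rs->head + 1) % RS_MAXREC;
        rs->count--;
        rs->off = 0;
    }
    return n;
}

/* 1 if the reader is at a boundary between records, 0 if inside one. */
int rs_between(const struct rec_stream *rs)
{
    assert(rs != NULL);
    return rs->off == 0;
}
```

In this model a 3-byte read of the 5-byte record "jones" returns "jon", and the next read returns only "es" even if a larger buffer is supplied and another record is already queued.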
The call

        recordbetween(s);

returns 1 if the specified stream is at a boundary between records, or 0 if it is not. If only the writer is in record mode, then reads will never return data across record boundaries. If only the reader is in record mode then data will normally be aggregated to requested lengths before being presented to the reader. A record may be created from data presented in multiple write calls by turning record mode off, writing data as required, and turning record mode on just before the last write in the record.

2.9.2. Urgent data

Circuits support a notion of urgent data. A circuit can be set into urgent mode by doing

        urgentmode(s, 1);

or disabled by specifying a second argument 0. Data transmitted while in urgent mode is marked, and causes the recipient of the data to process it specially.

By default, urgent data arriving on a circuit causes generation of a signal SIGURG. This signal may be ignored if urgent data is to be processed synchronously. The set of channels with urgent data may be determined by doing

        fd_set whichareurg;

        ... initialize whichareurg to interesting sockets ...
        urgentsockets(NOFILE, &whichareurg);

This selects out of the sockets in the bit-mask whichareurg those with pending urgent data; all other bits are cleared. While a socket has pending urgent data the

        urgentpending(s);

call will return true. When the next byte to be read is part of urgent data the predicate

        urgentnext(s);

will return true. The normal way of processing urgent data is to read out records from the input until the urgentpending flag drops. Then the last piece of urgent data will remain in the input buffer. A single read call never returns both urgent and non-urgent data; it therefore suffices to check urgentnext before each call to read to determine the type of the data to be read.

2.9.3.
Failure of circuits

If a permanent failure occurs in a circuit the circuit will be marked invalid. A process that attempts to read from or write to a failed circuit will be given an error indication and then sent a signal indicating a broken connection if further reads or writes are attempted. When processing circuits asynchronously a notification is sent immediately when a circuit fails; see section 3.5.3.

2.9.4. Circuits simulating pipes

A circuit can be used to simulate a pipe directly as the semantics are upward compatible; the reverse direction of the circuit will not be used, and can be severed to prevent accidental use. If the circuit fails, the signal sent on the next access to the circuit performs the same function as the SIGPIPE signal for pipes.

2.9.5. Closing

The call

        closesend(t);

reports to the other party in a call that the call is no longer needed by sending an end-of-file on the connection. The call will continue while the other party is sending, and more data can be received on t, but no more data may be sent. When all copies of the descriptor t created in fork or by dup have been destroyed, the circuit will be shut down after allowing the write buffers to drain.

Calls pending when a call director socket closes cause a new server to be created to service it if the socket has a server via a portal or an association;

2.10. Non-blocking and interrupt-driven i/o

To support servers and other processes that wish to not block in doing communications processing, a call to set a socket or other UNIX file descriptor into a non-blocking mode is provided:

        nonblocking(s, 1);

After setting a socket non-blocking, operations that would block because of insufficient buffering on output or lack of available data on input will return a new error ENBLOCK. This is normally returned to a caller in C as a -1 return from a system call, with the global variable errno set to ENBLOCK.
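The error-return convention can be sketched with an analogue found on later POSIX systems, used here only as an illustration and not as the proposed interface. We assume a POSIX system: fcntl with O_NONBLOCK stands in for the proposed nonblocking call, and the EAGAIN/EWOULDBLOCK error stands in for ENBLOCK. The helper names set_nonblocking and try_read are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Put fd into non-blocking mode (the role played by nonblocking(fd, 1)
 * in the proposal). Returns 0 on success, -1 on error. */
int set_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL, 0);

    if (flags < 0)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK) < 0 ? -1 : 0;
}

/* Try to read up to n bytes. Returns the byte count, 0 at end-of-file,
 * -1 on a real error, or -2 if the read would have blocked. */
ssize_t try_read(int fd, char *buf, size_t n)
{
    ssize_t r;

    assert(buf != NULL);
    r = read(fd, buf, n);
    if (r < 0 && (errno == EAGAIN || errno == EWOULDBLOCK))
        return -2;      /* no data yet; the caller may retry later */
    return r;
}
```

A caller that sees the would-block result simply does other work and retries once the descriptor is reported ready.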
The operation can be retried later, as select will report the socket ready when it becomes unconstipated.

A call placed on a non-blocking call director socket will immediately return a SOCK_VC virtual circuit socket descriptor, even though the call is not complete. The returned file descriptor will select as ready for writing when the call completes or fails to connect. At that point a socketstatus operation can be done on the circuit socket to determine the status of the call. A timeout may be used with the select to limit the length of time spent waiting for a call to complete.

Certain applications may require that they be notified immediately whenever input/output is possible. If such asynchronous operations are required, this can be enabled by doing:

        asynchronous(s, 1);

Then when input is available or output becomes possible after a blockage the process that is doing asynchronous processing on the socket is notified with a SIGIO signal. A select with a timeout of 0 can be used to identify the subset of the asynchronous sockets that need service.

Asynchronous can also be used in addition to nonblocking when placing and receiving calls. The sequence:

        in_addr addr, dest;
        int s, c;

        ... initialize dest in some manner ...
        s = socket(SOCK_CALL, &addr, 0);
        nonblocking(s, 1);
        asynchronous(s, 1);
        c = call(s, &dest);

places a call on the socket s and immediately returns a descriptor c because the socket s is marked non-blocking. Because s is marked asynchronous, a SIGIO is posted when the call to dest succeeds or fails and the call socket c will appear in a select as ready for writing. A socketstatus call, described below, can be used to determine whether the call succeeded or failed.

A similar technique can be used with answer; if a call were placed to socket s in the example above then a SIGIO would also be generated, and the socket s would show as being readable, the data being the incoming call. An answer could be used to establish the connection.

SOCK_VC virtual circuit sockets marked asynchronous cause SIGIO to be sent immediately when the circuit fails.

Because of the specialized nature of asynchronous i/o and to avoid semantic and implementation difficulties only one process may mark a socket asynchronous at a time.

2.11. Status inquiries, watermarks, and options

A socketstatus operation can be used to get information about a socket:†

        in_status state;

        socketstatus(s, &state);

in the following structure:

        typedef struct in_status {
            in_proto  protocol;     /* SOCK_DG, SOCK_CALL or SOCK_VC */
            in_addr   source;       /* socket address */
            in_addr   dest;         /* destination address, for circuits */
            in_state  state;        /* state of the connection */
            fd_waterm srcwm;        /* watermarks for sending */
            fd_waterm rcvwm;        /* watermarks for receiving */
        } in_status;

The protocol field tells the protocol the socket supports; the currently defined protocols are SOCK_DG for datagram protocols, SOCK_CALL for call director sockets where call and answer are possible, and SOCK_VC for the virtual circuit sockets resulting from call and answer. The field source is the address of this socket. The field dest is used only for SOCK_VC sockets, where sockets obtained by call or answer report peer addresses. The field state shows the state of a call in a SOCK_VC, and has the values:

        IN_CALLING      Call is pending
        IN_CALLFAILED   Call failed
        IN_OPEN         Call has succeeded and circuit is open
        IN_CLOSING      Call is closing
        IN_CLOSED       Call has closed
        IN_BROKEN       Call broke due to some failure

The watermark fields specify the amount of transmit and receive buffering in this file descriptor. Each has the following structure:

        typedef struct fd_waterm {
            int lowat;
            int hiwat;
            int timeout;
        } fd_waterm;

The hiwat watermark reflects the total amount of buffering available. The lowat and timeout are used in non-blocking input/output.
On output, a non-blocking sender will receive an error when the high water mark is reached and the data is not transmissible within timeout milliseconds. The sender will be notified when the amount of output pending drops to the lowat watermark. A receiver will be notified if lowat data accumulates, or if any data has accumulated and timeout time has elapsed. The lowat and hiwat are in bytes, and the timeout is measured in milliseconds. Reasonable defaults for the various fields are set by the system. The watermarks may be set by the user by

        fd_waterm rdwm, wtwm;

        watermarks(s, &rdwm, &wtwm);

where either the second or third argument may be specified as 0 to specify that the read or write watermarks are not to be changed.‡

† The socketstatus call is implemented as an ioctl.
‡ The watermarks call is implemented as an ioctl.

The interpretation of options for data transmissions such as priority and security classifications varies from network to network and tends to be interpreted in ways that are hard to generalize to different networks. This is akin to device control, where different devices will allow different operations. Instead of specifying all possible options with each message to be sent, which would involve complicated processing for each message, we will use per-socket state to localize most of the option setting to the socket setup phase.

UNIX currently provides an ioctl operation to deal with device specific control operations, and we wish to use a similar mechanism for socket option specification. See section 7.7 for a discussion of some problems with ioctl, and a description of the control operation to be used here. We define control operations on sockets to set options. For example:

        control(f, "precedence", "high", -1, 0, 0);

could set the precedence of the circuit f to be high and

        char security[32];
        int slen;

        slen = control(f, "security", 0, 0, security, sizeof (security));

might return the current security of f as a character string to security.

The watermarks primitive of the previous section might be implemented by:

        watermarks(s, rdwm, wtwm)
            int s;
            fd_waterm *rdwm, *wtwm;
        {
            if (rdwm)
                control(s, "readwm", (char *)rdwm, sizeof (*rdwm), 0, 0);
            if (wtwm)
                control(s, "writewm", (char *)wtwm, sizeof (*wtwm), 0, 0);
        }

We intend to study the appropriate standard set of control operations for sockets and provide suggestions for such a set at a later date.

2.12. Extensions being considered

The facilities described here provide basic access to the communications model described at the beginning of section 2. They can be used to provide higher-level facilities such as location-independent resources and resource access with different naming, protection and error-recovery strategies.

The facilities can also be extended in two ways: by extending the communications facilities (more sophisticated addressing; more protocols), or by extending the interface provided by the UNIX kernel to application processes (building higher level facilities than provided by the communications facilities).

We expect that additional socket types corresponding to different communication models will be desirable. For example a reliably-delivered-message abstraction seems useful, independent of the connection implied by a virtual circuit. This abstraction could be provided by a SOCK_RDM socket type given a definition of the semantics of failure to deliver.

At the UNIX level we expect to provide additional facilities for controlling and debugging communications.
We expect that it will be desirable to be able to control all aspects of the input/output behavior of selected processes to debug them or simulate any desired environment. We expect to provide hooks for a controlling process to monitor the requests made by a process and to be able to interpose itself in communications to take traces or redirect data.

The ability for processes to exchange access to existing sockets seems desirable to many systems builders. This can be provided by allowing processes to yield sockets to other processes that wish to take them. We believe that this facility is properly part of UNIX, not part of the underlying communications mechanism. We intend to provide such facilities in the network operating system version of UNIX. Similarly, we believe that the migration of processes can be provided without the aid of special mechanisms in the communications media.

2.13. Status of the implementation

We have implemented a prototype of the mechanism described here that supports single-machine pipes and datagrams, and have been using it on our development machine for several months. It is significantly faster than the older IPC mechanisms of UNIX (mpx and pipes) and simple to implement.

We are working on a full implementation of this IPC that we will interface to TCP/IP running on the ARPANET and also to our local area networking hardware (3M ETHERNET). We expect that this implementation will be in a form suitable for testing at other sites in the fall of 1981.

2.14. Alternatives and comparison

We are considering alternatives to the urgent data handling mechanism here. A reader of an earlier version of this proposal pointed out that a more convenient mechanism might be a non-blocking readurgent call.

Rashid at CMU has implemented a message-based IPC for UNIX that also serves as the basis for the SPICE machine operating system on the PERQ.
The CMU IPC differs from our proposal in several ways:

*   It provides reliably delivered messages rather than datagrams and circuits. The messages have attributes as being either reliable or unreliable and have headers that contain many of the fields found in the TCP protocol. With the mechanisms proposed here messages can be constructed by applications either based on datagrams or on top of circuits. A new socket type could be added to implement reliable messages in the primitives layer.

*   The targets of message transmission are not fixed in location, but may be moved from machine to machine in a way transparent to user processes. In our proposal, such migrations are the responsibility of the application programs, which communicate about such movements using the internetwork address space for reference.

*   The CMU IPC will do data representation conversions and scatter and gather data to and from the process address space when messages are sent or received. In our proposal such facilities are the function of application libraries, not of the UNIX kernel.

*   Selection facilities are built into several IPC calls. In our proposal they are available as a separate select facility that can be used with other UNIX file descriptors.

We expect to compare the facilities, performance, and usage of the CMU and Berkeley IPC proposals more in the near future.

3. Memory management facilities

In this section we describe proposed enhancements to the memory management facilities of UNIX to allow UNIX applications programs to take advantage of the large address space available in the VAX architecture.

3.1. Standard UNIX facilities

The standard version 7 UNIX system has simple memory management facilities. Each process has four areas of memory: a pure code area known as the "text" segment, a private area filled with initialized data values known as the "data" segment, a private area filled with zero known as the "bss" segment, and a stack in its own "stack" segment. Most UNIX implementations provide these four areas using only two base-bounds memory management regions: the text segment is placed before the data and then the bss segment in one region, and the stack in the other.

The only use of shared memory in standard UNIX is the pure code "text" area shared by default among all users.

Processes may grow by expanding their stack region when making calls and by allocating stack-local variables, or by allocating more memory beyond the end of the "bss" segment.

3.2. Previous VAX enhancements

The current VAX system pages the regions described in the previous section in a way transparent to application programs. It also demand-loads the initial contents of the pure code "text" and initialized "data" segments, making copies of the pages of the files from which these segments are initialized on first reference.

Facilities are provided in the current system for users to read from files in the copy-on-reference fashion used by the system to set up newly executing programs. This vread facility has not, however, proved useful or popular, and it and the vwrite and vadvise facility will be deleted in the new system and their function replaced by mechanisms described here.

3.3. Goals

A strong motivation for use of the VAX is the large address space available. Each process can have up to 2^30 bytes of data in each of two regions available to it, giving a maximum per-process address space of 2 Gigabytes.

To use such a large address space it is necessary to avoid making copies of the data in the space. It is necessary that the system obtain the data from and share it with file data whenever appropriate.
Good performance from the system algorithms is necessary if extremely large address space programs are to be run.

The major goals of our memory management facility design are:

*   To support the extremely large address spaces possible with the VAX hardware. We would like to be able to run a 2 Gigabyte process address space on machines with as little as 2 Megabytes of physical memory.

*   To support shared access to data and the special requirements of the large VAX applications such as image processing and LISP systems. Such programs often need special treatment from the paging algorithms in the system and want to gain control and recover after otherwise fatal errors such as stack overflows and protection violations.

*   To have reasonable performance on huge virtual jobs. This will require support from the file system, which must provide high bandwidth access to file data, and support from the user, who can help by organizing his process to have as well-behaved virtual memory behavior as possible, and by giving the system advice about the behavior of his program.

*   To develop facilities that are portable to different machines with possibly different memory management architectures. We expect that the demanding nature of research applications will cause them to be run on a wide variety of processors, some of which can run this version of UNIX if its primitives are portable.

3.4. Motivations for segments

To achieve the goals described above and manage an extremely large address space, we are basing our memory management design on segment level primitives, not on page level primitives. Segment based facilities seem desirable for at least two major reasons:

+   Programs written using segments can be ported easily to machines that have only page level memory-management control. The VAX is an example; it does not have segmentation, so this will be simulated.
    Programs written using extensive page-level controls tend to be less portable. Our design thus attempts to encourage a portable programming style.

+   Segments provide a clean structuring of the address space with useful granularity, and offer useful places for placement of instrumentation to gather page-reference information. Memory usage is likely to break down naturally and somewhat independently into usage in different segments.

3.5. Allocating segments

Segments are represented by their base virtual addresses. On a machine with a uniform address space this will just be some number in the address-space-range of the machine. On a machine with segmentation hardware the address will be a (segment,offset) pair.

The basic segment allocation primitive takes as argument a file descriptor and a range of locations in that "file" and returns a virtual address that is the base of the mapped range. The primitive segalloc is invoked:

        int fd;
        off_t offset;
        int len;
        enum seg_share { SEG_PRIVATE, SEG_SHARED } share;
        caddr_t pref;
        caddr_t va;

        va = segalloc(fd, offset, len, share, &pref);

The argument fd specifies the file or special device to be mapped into the address space of the calling process. The arguments offset and len give the offset into fd and count of bytes to be mapped. If fd describes a file then its length is made to be at least offset+len bytes by extending it with 0 data if necessary. If share is SEG_SHARED, addresses of bytes starting at the returned address refer to the contents of the file or device represented by fd starting at offset. For shared segments, writing to these bytes is permitted if the file fd is available for writing, and is equivalent to writing on the associated file or device. If share is SEG_PRIVATE the returned space refers to private data storage that is initialized from the corresponding file or device data. The virtual memory returned from a segalloc of SEG_PRIVATE space is, by default, readable and writable.

The final argument pref may be used to give the address of a variable containing a preferred address for the segment. If the argument is 0 then the system chooses a location for the segment in a way not specified externally. The use of pref arguments is machine specific, and is regularly used only by system specific routines and special applications.

3.6. Segment sizes and rounding

Memory management hardware on most machines does not permit exact, bit-length control over how much address space is available to processes. Thus the system does not promise that exactly and only the range [va,va+len) will be accessible after a call to segalloc returns a value va. There may be some extra locations accessible outside this range, but accessing them should be considered an error. In our proposed VAX implementation, memory will be available to a 1024 byte boundary on both ends of the mapped region for SEG_PRIVATE data, and to a 65536 byte boundary for SEG_SHARED data.

To take advantage of the memory management hardware on a particular machine, the system may have to align the mapped data, e.g. on page boundaries. Because the VAX has no indirect page table entries, and to simplify the system, reduce the amount of work involved in running large programs, and to make sharing of page-table-pages possible, the VAX implementation will align all mapped regions on 65536 byte boundaries so that:

        (va & 0xffff) == (offset & 0xffff)

That is, the low 16 bits of the returned address from segalloc will agree with the low 16 bits of the offset mapped. This allows the "second-level" page tables of the VAX to be used to achieve page-table-page sharing. As we will see below, page table size for large processes can be substantial, so making page-table-page sharing possible is a desirable goal.

3.7.
Segment protections

The default protection mode for a shared segment is inherited from that of the file descriptor fd. On the VAX, this must either be read or read-write since the VAX does not support write-only memory, and users cannot be permitted to map files to be readable simply because they have write access to them.

The protection assigned to a segment may be changed with a segchmod call

        segchmod(va, mode)

where mode is chosen from:

        SEG_NA  no access
        SEG_R   read access
        SEG_W   write access
        SEG_X   execute access

The last three accesses may be combined, as in SEG_R|SEG_W to give read-write access.

All machines are expected to support SEG_NA, SEG_R, and SEG_R|SEG_W. On machines that do not support execute-only access, SEG_X will be folded to SEG_R access. The VAX has a restriction that SEG_W access is not permitted without SEG_R, since the hardware does not support write-only access.

3.8. Freeing segments

To free the address space occupied by a segment a program can issue the segfree call:

        segfree(va)

passing the address returned by segalloc. The address space previously allocated to the segment is then returned and made available for allocation by future segalloc calls.

3.9. Giving the system advice

Large virtual memory programs often have repetitive or predictable behavior. Authors of such programs are often aware of this behavior. We provide a segadvise call, of the form:

        segadvise(va, advice)

The advice to be given to the system about the segment at va is required to have no semantic effect on the result of the program.* Typical calls to segadvise might instruct the system that pre-fetching of a set of pages seems desirable, that the program is finished using a particular section of virtual memory and that it can be reasonably swapped out, or that the program will be referencing many pages quickly with little rereferencing (e.g. LISP garbage collection.)
A facility similar to segadvise called vadvise has been successfully used in the current system.

* This excludes timing-dependent programs, whose output may differ from run to run, and may notice the timing improvements obtained when good advice is interpreted properly.

3.10. Special segments

Calls to allocate segments may access two special files. The first is normally available as /dev/text, which is a special device that indirects to yield a handle on the file containing the program that is running. This makes it possible to re-map pages of the running program conveniently. The other special file is /dev/zero, which is a special interface to swap space, and that will give a distinct piece of swap space to be initialized with zeros each time it is mapped in.

3.11. How exec can be written

Using the facilities above we can now give code showing how the exec system call creates a new process image. First we should explain that process images in the new system will have a 65536 byte hole between the end of each segment and the start of the next. The first 65536 bytes of process address space are not mapped, and serve to catch indirect references through uninitialized pointers. After this 65536 byte gap comes the beginning of the process image file, starting with the process header and continuing through the process pure "text" space. There is then another 65536 byte gap before the "data" space, another 65536 byte rounding virtual hole, and then the "bss" uninitialized variables.†

† The virtual holes preserve alignment between the data file and the address space it is mapped to.

The following C code could be used in the system to set up these segments, starting in an empty process virtual memory. The exec code here is very VAX specific, and uses macros defined in the system header file <a.out.h>.
The symbol SEG_TEXTFD stands for an instance of the file /dev/text, and SEG_ZEROFD stands for an instance of the file /dev/zero.

    #define SEGRND 65536    /* rounding to segment boundary */
    caddr_t pref;

    /* allocate program data (text segment) starting at SEGRND */
    pref = SEGRND;
    segalloc(SEG_TEXTFD, 0, N_TXTOFF(e)+e.a_text, SEG_SHARED, &pref);

    /* allocate initialized (data) segment, after text and SEGRND hole */
    pref += SEGRND + N_TXTOFF(e) + e.a_text;
    segalloc(SEG_TEXTFD, N_DATAOFF(e), N_SYMOFF(e)-N_DATAOFF(e), SEG_PRIVATE, &pref);

    /* allocate uninitialized (bss) segment, after another hole */
    pref += SEGRND + e.a_data;
    segalloc(SEG_ZEROFD, 0, e.a_bss, SEG_PRIVATE, &pref);

The system would also have to set up the stack for the new process, but this operation is not shown here.

3.12. Simulating copy-on-write

A user program can build a "copy-on-write" like facility at the segment level if the hardware permits restartable instructions, or with more work if it does not. The facility can be implemented by establishing a handler for the "Memory fault" and "Bus error" signals. If a fault then occurs on a protection violation, the signal handling routine will get control. It can modify the accessibility of the referenced data by re-mapping the segment to be modified as SEG_PRIVATE data, and return to the code that was interrupted. This style of copy-on-write support makes it possible to build copy-on-write like facilities even on machines where instructions are not restartable, provided the code that can fault is written in a way that the user-supplied signal handling routine can back up.

A user program may also monitor both references and modifications to segments by using access modes. For example, after a garbage collection, a LISP system may mark its segments read-only, and make them writable only after a write is noted.
Then, when the next garbage collection is to be done, the system can know that certain sections of address space have not been referenced or modified respectively and avoid garbage collection overhead.

3.13. Special requirements for stacks

Some VAX applications will need to maintain complex stacks. For instance, INTERLISP uses a spaghetti stack and wishes to regain control if the piece of stack being used is exhausted. This requires that the system deliver a signal to the process on a different stack when the first stack overflows. A similar need arises in languages that support multiple tasks and that provide a fixed size stack per task. If the system were to deliver signals to such a process on the per-task stack, then the size of stack needed would depend on system parameters, an undesirable situation.

To support these applications, we are proposing to extend the system to allow specification of a stack for delivering signals. The call

    caddr_t asp;
    int onsigstack;

    sigstack(asp, onsigstack)

provides the system with a stack pointer asp to be used in delivering signals. The call also informs the system whether the process wishes to consider itself "on" the signal stack, using the integer parameter onsigstack.

When a signal is to be dispatched, the system first checks to see if the process is on its signal stack. If not, then the current stack pointer value is saved and the system arranges for it to be restored on return from the signal handling routine. The stack pointer is set to the signal stack location and the kernel remembers that the user process is on the signal stack.

In normal usage, a process will take a signal on the signal stack, run a small amount of code, and then return to the pre-signal frame. The return from the signal handler resets the signal stack automatically. If the process wishes to take a non-local exit from the signal routine, then it must inform the system of the restoration of the signal stack to be performed using a sigstack call. If the process wishes to invoke code from the signal stack that uses a different stack, then the process should provide the system with a new sigstack so that signals can be delivered there during the nested invocation; this is necessary because the system would otherwise have no way of finding the top of the signal stack.*

3.14. Huge processes and page table sizes

In running huge processes on the VAX an important concern is the amount of physical memory required for processes that use large amounts of virtual memory. Whether the virtual memory is used or not, it is required to have a certain amount of physical memory allocated to page tables for resident processes. In the current UNIX system, the kernel keeps all the page tables for resident processes in non-paged memory. Large VAX systems currently see as much as 16 Megabytes of active virtual memory, and since 1 byte of page tables is needed for every 128 bytes of resident virtual memory, this means that as much as 512k bytes of memory is occupied by user page tables. While this is acceptable for running virtual loads of 16 Megabytes, it will certainly not be acceptable when processes as large as a Gigabyte are run, since a Gigabyte process will require 8 Megabytes of page tables.

The new UNIX system on the VAX will consider the address space to be composed of 65536 byte virtual pieces. A single process address space will have 32768 of these pieces, which can be allocated to its various segments. The system will control page table space at this granularity. Only the descriptive information required to locate and manage the page table pages describing the 65536 byte pieces of virtual memory need be resident with a process.
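Two of the figures in this section follow from simple ratios, which the sketch below reproduces. It assumes the ratio stated in the text (1 byte of page table per 128 bytes of virtual memory) and a piece size of 65536 bytes, the value for which 32768 pieces cover exactly 2 Gigabytes. The function and macro names are hypothetical.

```c
#include <stdint.h>

/* From the text: 1 byte of page table per 128 bytes of virtual memory. */
#define VM_BYTES_PER_PT_BYTE 128u
/* Assumed piece size: 65536 bytes (2^16). */
#define PIECE_BYTES 65536u

/* Bytes of page table needed to describe virtual_bytes of memory. */
uint64_t page_table_bytes(uint64_t virtual_bytes)
{
    return virtual_bytes / VM_BYTES_PER_PT_BYTE;
}

/* Number of fixed-size pieces in an address space of the given size. */
uint64_t pieces_in(uint64_t address_space_bytes)
{
    return address_space_bytes / PIECE_BYTES;
}
```

For a 1 Gigabyte process page_table_bytes gives 8 Megabytes, and a 2 Gigabyte address space divides into 32768 pieces, matching the figures quoted above.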
It is conservatively estimated that each of these 65536 byte virtual pieces will require 16 bytes of physical memory when the associated process is resident. Thus a Gigabyte process will require roughly a quarter Megabyte of resident information describing these second level page table entries.

* Since, unlike the hardware interrupt stack pointer, the signal stack pointer is not kept in a register separate from the normal stack pointer.

3.15. Page replacement algorithms for VAX

The VAX lacks the reference gathering hardware needed to gather the information used by many page replacement algorithms. This forces the system to use software to gather reference information and makes such information gathering much more expensive. A variant of the clock global replacement algorithm is being used in the current system to do replacement with minimal amounts of reference information, and a good deal of experience with this algorithm has been obtained. We are experimenting with a special low-level coding of the reference gathering code in the system, which may make reference gathering several times cheaper. If this works out, then it may be possible to experiment with some other page replacement algorithms.

We have taken traces of programs typical of image processing and other scientific work. Many of the programs that run on large data sets exhibit regular patterns in their virtual memory behavior. The segadvise call can be used to inform the system of the presence of such behavior. We hope to experiment with algorithms in the system to detect patterns of behavior and to adapt the page replacement and pre-fetching algorithms accordingly.

In particular, we have already experimented with giving the system advice that a program is sequential, and with advice that a program is likely to have little re-reference to its pages.
The former is true of multi-dimensional FFT's running on large data sets, and the latter is true of a LISP system running a large, non-compacting garbage collection. In both cases we observe substantial improvements in running times and reduced overheads in the system because of the advice from the user programs. We expect to experiment with such advice for other large programs.

In the 4.1bsd release of the system we fixed a problem with the placement of pre-paged pages. In the new release, pre-paged pages are placed at the bottom of the "free list", not in the clock loop. This allows us to pre-page more pages, and to use the pre-paged pages more effectively. We have measured the 4.1bsd system on the benchmarks that Dave Kashtan ran of UNIX and VMS paging. The 4.1 system and the VMS measurements are nearly identical for all benchmarks, with the 4.1 system faster on benchmarks that are inherently sequential if the system is told to expect sequential behavior.

3.16. Status and related changes

Implementation of these proposals will proceed in parallel with the higher-performance file system effort (described in section 4), which is currently underway. We expect that we will have a prototype system with a higher-performance file system and the new memory management facilities sometime in late 1981.

There are some related changes that will have to be made to support the new memory management facilities:

+ A new load format will have to be created that allows for the segment placement implied by the new primitives.
+ The debuggers will have to be changed to understand the mappings and the new segmentation.
+ The core file images will have to be changed to include segment data.
+ The file system performance enhancements will need to be in place to take full advantage of the new memory management facilities.
We will use instrumentation facilities already in place in the 4.1bsd system to measure and analyze system performance using the new facilities. We have sample programs that are large VAX applications that will be measured under the new facilities to tune and debug them.

3.17. Alternatives and comparison

We considered using a TENEX "pmap" like facility for controlling virtual memory. Such a facility has been implemented for UNIX on VAX by John Reiser of Bell Laboratories. We decided that the needs of programs could be met without the additional internal complexity of pmap, which was felt to be a hindrance when such enormous address spaces are to be supported. If individual mapping of 512 byte pages were permitted in a 2 Gigabyte address space, then the system would have 4 million pages to deal with for a single process. Thus we went to the 65536 byte granularity in memory management, as this will allow us to handle these gigantic programs even on small machines.

We have considered providing different page-replacement algorithms for the system, including a working-set dispatcher, but feel that the data consumptive nature of the most demanding applications will be satisfied only by algorithms that can be told of or adapt to trends in memory referencing. We feel that the current global replacement algorithm will work adequately in the large process environment and admits the hooks that are needed for exploitation of patterns of reference.

4. File system performance enhancements

This section describes the proposed changes to the file system organization and algorithms to increase performance. We defer discussion of changes to the user interface to the file system to the next section.

4.1. Standard UNIX file system

The traditional UNIX system, which runs on the PDP-11, has simple and elegant file system facilities.
File system input/output is buffered by the kernel so that there are no alignment constraints on data transfers and all operations are made to appear synchronous. All transfers to the disk are in 512 byte blocks, which may be placed arbitrarily within the data area of the file system. No constraints other than available space are placed on file growth.

4.2. Previous VAX enhancements

The current VAX system has improved the standard UNIX file system in two notable ways:

+ The file system has been made crash-recoverable by changing it so that all modifications of critical information are staged so that they can either be completed or abandoned cleanly by a repair program after a crash.
+ The file system performance has been improved by nearly a factor of 2 by changing the basic block size from 512 to 1024 bytes.

4.3. Goals

We expect that large virtual memories will be constructed by mapping files from the file system, using the mechanisms described in the previous section. Paging of data in and out of the file system is likely to occur frequently. We therefore need a file system that provides higher bandwidth than the current one, which provides only about 40k bytes per second per arm. The primary means for improving file system performance are to improve the locality of reference to minimize seek latency and to improve the layout of data to make larger data transfers possible.

4.4. Major problems

A typical 150 Megabyte UNIX file system consists of 4 Megabytes of file system indexing information and 146 Megabytes of file system data. A major problem with this organization is that the indexing "inode" information is segregated from the data by being at one end of the disk space allocated to the file system. Thus accessing a file almost certainly involves long seeks.
Files in a single directory are not typically allocated slots in consecutive locations in the 4 Megabytes of indexing information, causing many non-consecutive blocks to be accessed in executing common hierarchical operations, such as gathering information about or data from the files in a single directory.

The allocation of data blocks to files is also a major problem. The current file system never transfers more than 1024 bytes per disk read or write, and often finds that the next sequential data is not on the same cylinder, causing seeks between these 1024 byte transfers. The combination of the small block size, limited read-ahead in the system, and many seeks severely limits file system throughput.

4.5. Description of approach

We propose to reorganize the file system by dividing the space for a file system into areas called cylinder groups, each of which contains a few cylinders. Each cylinder group will have some inode slots for files and a bit map and other summary information describing the usage of data blocks within that group of cylinders. Performance will be increased by laying out the hierarchical file system data so that related information is in the same cylinder group, minimizing seek distance. Data will be laid out so that larger blocks can be read in single reads, greatly increasing file system throughput.

As an example a file system of 300000 sectors (150 Megabytes) could be divided into 100 cylinder groups of 1.5 Megabytes each. Each cylinder group would have about 256 inode slots and a bit-map describing availability of its blocks and inodes. The file system data storage would be divided into 4096 byte data blocks. Small files will receive only a fraction of one of these blocks. In large files several 4096 byte blocks could be allocated consecutively so that large data transfers are possible.

4.6. Policies for new file system

The system will provide on-line layout policies that try to limit seeks. Directories, which can be allocated in any of the inode slots, will normally be allocated in the cylinder group that has the most free space, extrapolating a mean size for each of the directories currently in the cylinder groups. File indexing "inode" slots will normally be allocated in the cylinder group where their directories are located; if there is no room there, then they will be allocated using an overflow policy similar to that used in a hash table with internal rehash.

Blocks will be allocated in a device-dependent way. On most devices we prefer to place newly allocated blocks adjacent to the previous block in the same file. If this adjacent block is not available, then the new block will be located rotationally well-positioned on the same cylinder as the previous block. If no blocks are found on the same cylinder as the previous block, then the system will look somewhere else in the same cylinder group. If this also fails to find an allocatable block then the system will look in another cylinder group that has a reasonable amount of space to locate another free set of blocks.

4.7. Measurements of program speeds

To formulate performance goals for the file system it is important to understand the speed of various programs consuming data, and the limiting performance of the current file system organization using differing block sizes. Basic times for operations on the VAX 11/780 with a single memory controller and currently available disk hardware are given in the following table:

    Procedure call             20 usec
    Examine 512 bytes         110 usec
    Trivial system call       140 usec
    Copy 512 bytes            220 usec
    Context switch            220 usec
    Write system call           1 msec
    Disk rotation time         18 msec
    Seek time               10-50 msec

The limiting overhead in data intensive operations is often the memory bandwidth.
When no input/output is taking place data can be fetched from memory at 4.5 Mb/second, using the VAX string instructions. If any processing is to take place on the data, or if any input/output is taking place on the machine, then the available bandwidth is reduced. Measurements of basic operations and common programs are given in the following table:

    Operation                     Data rate
    Fetch data                    4.5 Mb/cpu sec
    Fetch with mba active         3.5 Mb/cpu sec
    Fetch with 2 mbas active      2.6 Mb/cpu sec
    CRC                           300 Kb/cpu sec
    Loader ld                     100 Kb/cpu sec
    Cat program                    42 Kb/cpu sec
    egrep program                  38 Kb/cpu sec
    ed read/write                  23 Kb/cpu sec
    make of system                 22 Kb/cpu sec
    fgrep/grep programs            20 Kb/cpu sec
    Assembler as                   15 Kb/cpu sec
    Compiler cc                    10 Kb/cpu sec
    Peephole optimizer c2           8 Kb/cpu sec
    Lisp compiler liszt             8 Kb/cpu sec
    Troff running -me macros        3 Kb/cpu sec

The measurements of fetching of data from memory in blocks show the effect of running high bandwidth devices during memory-intensive cpu operations, where each active i/o device reduces the available bandwidth by about 1 Mb/sec. The CRC instruction timing shows the speed of a data intensive microcode loop that involves a fair amount of calculation. This program runs at 1/3 the speed of most currently available disks.

The fastest standard UNIX program we could find, aside from the file copying programs, was the UNIX loader. When loading large programs the loader does not process each byte of data individually. This leads to much higher bandwidth than the cat program, which is the simplest possible program that uses the character at a time primitives of the standard i/o library. The cat program is a loop:

    int c;

    while ((c = getchar()) != EOF)
        putchar(c);

The egrep program is the fastest example we could find of a program that non-trivially processes all its input data.
It is a program for scanning a file for any of a set of patterns, written using a powerful algorithm.

More typical of UNIX utility speeds are the programs ed, make remaking a large program (the system), the more simple pattern searching programs fgrep and grep, and the assemblers and compilers as, cc, c2 and liszt. These programs range in speed from about 8 to 25 Kb/cpu second on a 11/780. Slowest of all are programs that do substantial processing on each input character, such as the typesetting program troff. Troff is further slowed by extensive macro interpretation.

4.8. Estimates of file system performance

The observed performance of the constant block size file systems is given in the next table, and extrapolated for the 2048 and 4096 byte block sizes:

    Block size      Throughput
    512 bytes        20 Kb/sec/arm
    1024 bytes       40 Kb/sec/arm
    2048 bytes       80 Kb/sec/arm
    4096 bytes      160 Kb/sec/arm

We can estimate the performance of our new file system using a basic block size of 4096 bytes and with some pessimistic assumptions about data layout. We assume that the file system will be unable to allocate consecutive 4096 byte blocks, but will be able to place an average of 4 consecutive blocks in a cylinder before a seek is required. We assume that the seek to be required is a long seek. Under these assumptions and in the sequential access case we expect that the new file system will provide 35-40% disk utilization and about 300-350 Kb/sec/arm.

The degree to which this file system organization will improve on the 4096 byte block version of the current file system organization will depend on whether the patterns of file access allow the locality of layout under the new organization to be beneficial. Large applications are expected to benefit greatly if their data requirements have locality. There is little we can do for uncorrelated requests under any organization.

4.9. Buffering and page caching

The current version of UNIX transfers data from the disk into buffers in the kernel address space and then copies these buffers to user address space. If the buffers in both address spaces are properly aligned, then this transfer can be effected without copying using the memory management hardware. This is especially desirable when large amounts of data are to be transferred. If the buffers in the process address space are properly aligned (on 1024 byte boundaries) we intend to transfer the data to the user programs without copying.

Further, even in the absence of copy-on-write, we can remember that pages in user address space are copies of pages from a file and, if the pages are still in core and not modified when we need that file page again, can reuse the page. If the user issues another read request specifying the same buffer we can reclaim unmodified pages from the user and place them in a kernel file system cache.

4.10. Fragmentation in the new organization

In this section, for definiteness, we assume that the desired file system block size is 4096 bytes and that the disk sector size is 512 bytes; these are variables in the file system design, but it is easier to use the numbers for reference.

In UNIX, each file has an array of indices of file system blocks. For the purposes of this section, assume that the first 8 blocks of the file are described by the basic file indexing (inode) structure.* The inode structure also contains other pointers to indirect blocks containing further block indices. In a file system with a 512 byte basic block size, a singly indirect block contains 128 further block addresses of four bytes each, a double indirect block contains 128 addresses of further single indirect blocks, etc.

* The actual number may vary from system to system, but is usually in the range [figures illegible in the scanned source].
The following table shows the effect of increasing the file system block size on the amount of wasted space in the file system. The machine measured to obtain these figures was our largest time sharing system, and had roughly 1 Gigabyte of on-line storage. The active user file systems containing roughly 500 Megabytes of formatted space were measured.

    Space used    % waste    Organization
    421.3 Mb        0.0      Raw data
    439.0 Mb        4.2      512 byte rounding of data
    450.4 Mb        6.9      512 byte block UNIX file system
    470.9 Mb       11.8      1024 byte block UNIX file system
    515.5 Mb       22.4      2048 byte block UNIX file system
    613.2 Mb       45.6      4096 byte block UNIX file system

Here we measure the space wasted as the percentage of space on the disk not containing file system data, ignoring the fixed amount of space for the inodes. As the block size on the disk is increased, the fragmentation rises quickly, to an intolerable 45.6% waste with 4096 byte file system blocks, since there are so many small files.

To avoid the fragmentation in storing small files, we allow the file system space allocator to divide a single file system block into a few fragments. Our file system block size is 4096 bytes, composed of four 1024 byte fragments, the size of the blocks in the current file system. We allow the space allocator to break up a file system block and allocate these smaller pieces to files.

It suffices to allocate fragments only to files that are less than 8 file system blocks long (the files that require no indirect blocks). On the system measured above, fully 97% of all files were in this category, and they used about 1/2 of the space in the file systems. Such a small file is represented by up to 7 full file system blocks of data and then possibly some additional data. The full file system blocks are represented in the normal way. If there remains data that will fit in 3 or fewer 1024 byte pieces, we find an unallocated fragment of a file system block and store the data there.
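The rule for small files can be sketched as a function from file size to a count of full blocks and trailing fragments. The sketch assumes the 4096 byte block and 1024 byte fragment sizes used in this section, and it assumes that a file too long to qualify simply rounds up to whole blocks, which the text implies but does not spell out. The names file_layout and layout_for_size are hypothetical.

```c
#include <stdint.h>

#define BLOCK_BYTES       4096u
#define FRAG_BYTES        1024u
#define FRAGS_PER_BLOCK   (BLOCK_BYTES / FRAG_BYTES)
#define SMALL_FILE_BLOCKS 8u   /* files shorter than this may end in fragments */

struct file_layout {
    uint64_t full_blocks;   /* whole 4096 byte blocks */
    uint64_t fragments;     /* trailing 1024 byte fragments, 0 to 3 */
};

/* Compute the blocks and fragments needed to hold a file of `size` bytes. */
struct file_layout layout_for_size(uint64_t size)
{
    struct file_layout l;
    uint64_t rem = size % BLOCK_BYTES;

    l.full_blocks = size / BLOCK_BYTES;
    l.fragments = (rem + FRAG_BYTES - 1) / FRAG_BYTES;   /* round up */

    /* A tail needing all four fragments, or the tail of a file that is
     * too long to qualify, is stored in a full block instead. */
    if (l.fragments > 0 &&
        (l.fragments == FRAGS_PER_BLOCK || l.full_blocks >= SMALL_FILE_BLOCKS)) {
        l.full_blocks++;
        l.fragments = 0;
    }
    return l;
}
```

A 5000 byte file, for example, occupies one full block and one fragment, while a 3073 byte file needs a whole block because its data does not fit in three fragments.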
If we have to fragment a file system block to obtain the space for this small amount of data, another file may yet use the remaining fragments.

The fragmentation in this organization is less than that of the current 1024 byte file system organization, and only slightly more than the 512 byte block UNIX file system: 8.2%. A 512/4096 byte hybrid file system keeps more indexing information, but uses even less space than the 512 byte block traditional UNIX file system: 5.4%. The new organization is efficient because it uses little space for small files and also uses little indexing information.

4.11. Status

We have done a good deal of measurement of the static characteristics of current file systems and examined the dynamic characteristics of applications programs. We have constructed utilities to build file systems in the new format and are working on a user-level implementation of the new file system format. After our development 11/750 arrives in late July 1981, we intend to convert it to the new file system format and to debug the new system algorithms on this machine. Integration of the new memory management facilities described in section 3 will then take place in a system supporting the new file system organization.

4.12. Alternatives and comparison

We considered converting UNIX to an extent based file system much like the DEMOS file system. This approach was rejected because it did not seem necessary to get the performance we needed, and because we expected that some sites might wish to experiment with file organizations that allowed data pages to be shared between files. This is much more easily handled under a block level organization than an extent based organization. Similarly if a copy-on-write facility were ever to be implemented for UNIX it would benefit greatly from a block at a time indexing scheme.
We are planning to compare the performance of this file system with the VMS file system and other file systems for similar machines. The current comparison shows that the UNIX file system is slower than the VMS file system, but we expect that the new version of the UNIX file system will be faster.

5. New file system facilities

This section describes new facilities to be provided by the file system in support of the other facilities proposed in this report and to solve other minor problems.

5.1. Symbolic links

The current UNIX system supports multiple "links" to files in the same file system. This link concept is fundamental; files do not live in directories, but exist separately and are referenced by links. When all the links are removed, the file is deallocated. This style of links does not support references across physical file systems, nor does it support inter-machine linkage. We propose to include symbolic links to support such usage.

A special file type, the "symbolic link" file, will contain a pathname. When the system encounters this file while interpreting a name, the contents of the symbolic link file will be prepended to the rest of the pathname, and this name will be interpreted to yield the resulting full pathname. If the symbolic link file contains an absolute pathname, then this absolute pathname will be used. The symbolic link will otherwise be taken starting at the location of the link in the file hierarchy.*

We are currently investigating the best way to implement symbolic links in UNIX, looking especially at systems for other machines which implement links (notably MULTICS). Symbolic links have previously been implemented for UNIX by Jim Kulp at IIASA in Austria.
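The interpretation rule for symbolic links given above amounts to a string substitution, which can be sketched as follows. The function expand_symlink and its parameters are hypothetical and cover only the name rewriting: link_dir is the directory holding the link, target is the contents of the link file, and rest is the unresolved remainder of the pathname. All three are assumed to be non-null strings.

```c
#include <stddef.h>
#include <stdio.h>

/* Build in `out` the pathname that results from meeting a symbolic link.
 * An absolute target replaces everything resolved so far; a relative
 * target is taken from the directory containing the link.
 * Returns 0 on success, -1 if the result does not fit in outsize bytes. */
int expand_symlink(const char *link_dir, const char *target,
                   const char *rest, char *out, size_t outsize)
{
    const char *sep = (rest[0] != '\0') ? "/" : "";
    int n;

    if (target[0] == '/')
        n = snprintf(out, outsize, "%s%s%s", target, sep, rest);
    else
        n = snprintf(out, outsize, "%s/%s%s%s", link_dir, target, sep, rest);

    return (n < 0 || (size_t)n >= outsize) ? -1 : 0;
}
```

The resulting name is then interpreted again from the start, so a target that itself names a symbolic link is handled by repeating the same step.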
To incorporate them he also provided a way for system utilities to refer to the links themselves as well as the object referenced by the links. Incorporating them also involves some changes to utilities such as du, ls, and find, so that they can treat such links in a desirable way. To gain access to the link itself, not the file object referenced by the link, a special quoting convention can be provided. We could say that a file name that ends with the character '#' refers to the symbolic link itself. It also might be useful to provide a mode in which the system does not interpret symbolic links. Thus a program that wishes to traverse a hierarchy without taking indirections can disable symbolic links.

One set of possible calls for symbolic link routines would be:

    symlink(name1, name2)
    char *name1, *name2;

that creates a symbolic link name2 whose contents are the string name1;

    symunlink(name2)
    char *name2;

that removes the symbolic link name2 and not the name1 specified when the link was created. This can also be used with non-symbolic links in a program that wishes to remove the links themselves, not the linked to files.

    symfollow(wanted)
    int wanted;

that can be called with 0 to disable following of symbolic links.

* Naming directory references, described in the next section, are considered to be absolute pathnames.

5.2. Naming directories

To support the project notion (to be described in section 6), and as a base for communication between processes in a single session, we propose to add a per-process "naming directory". This will be a normal UNIX directory with a very short name "@", a prefix-character to pathnames much like the character "/" which refers to the root directory. It represents a third point in the file system from which names spring, augmenting the current "current directory" and "root directory" notions.
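The three starting points for name interpretation can be sketched as a choice made on the first character of a pathname. This is only an illustration, assuming that the prefix character is "@" as in the "@visi/..." form used for projects; the enum start_dir and the function lookup_start are hypothetical names.

```c
#include <stddef.h>

enum start_dir { START_ROOT, START_NAMING, START_CURRENT };

/* Decide where interpretation of `path` begins. If `rest` is not NULL,
 * *rest is set to the part of the name that follows the prefix character. */
enum start_dir lookup_start(const char *path, const char **rest)
{
    enum start_dir start = START_CURRENT;

    if (path[0] == '/')
        start = START_ROOT;      /* root directory */
    else if (path[0] == '@')
        start = START_NAMING;    /* per-process naming directory */

    if (rest != NULL)
        *rest = (start == START_CURRENT) ? path : path + 1;
    return start;
}
```

A name beginning with "@" is thus looked up from the naming directory, a name beginning with "/" from the root, and any other name from the current directory.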
The naming directory concept is derived from the similar one in the Apollo DOMAIN operating system and from the uses of logical name tables in systems. VMS and device translations in various PDP-10 based opera ting The naming directory will support the project notion descr ibed in section 8. A project is a hierarchy of sour ce and binary programs, libra ry routines and documnentation. The proposed norm al way of accessing such a hiera rchy is to place a symbolic link from your naming directory to the root of the project. Thus the project “visi" might have one would place a symbolic link its root directory '/h2/visi" in whic named **visi" in ones naming ceforth reference the project files as “@visi/..."". The neming directory will support screen-oriented and front ends through convention s on communication. h case directory and hen- command interpreters For instance, the write command can be changed to look in the target users nami ng directory for a file named "writeportal’ and to open that file to communicate if it exists. In this way a write command can communicate with a screen manager process (such as, say, the CMU emacs edito r) to obtain window space. This is greatly preferable to the current state where such writes greatly disrupt the state of the screen, The naming directory implementation is simple: If a path name begins with the character “®" the search begin s not at the current directory but at the nami ng directory. directory. A new system call chnamdir changes the current naming For backwards compatibility, the use of naming directories in suppo rt of projects can be simulated by a set of library routines that inter pret the UNIX system calls that take flle names. Other uses of naming directories to support screen-oriented programming environments are possiblie only on the newer ver- sion of UNIX supporting IPC facilities. 5.3. 
Locking primitives

Many sites have expressed the desire for some file locking primitives. It is desirable that it be possible to lock files so that other concurrent access can be prohibited to maintain consistency.

The new UNIX 3.0 system from Bell Labs implements a flag to the open call that causes a file creation to fail if the file already exists. This makes it possible to test for a lock by attempting to create it. In the current system, the lock setting has bugs when used by the super-user unless the link primitive is used.*

* I.e. a file is created whose name is unique to the current process and the current process tries to link it to the lock file. The link operation is atomic.

Mike Accetta at CMU has implemented a set of ioctl calls that provide file level locking. There are ioctls to return a structure giving the count of processes reading and writing the file, to set the file in exclusive write mode, prohibiting further attempts at access to write, to set the file in exclusive update mode, prohibiting further access to read or to write, and to clear the exclusive locks.

John Bass at ONYX Systems has implemented granular file locking. This allows sequences of bytes within files to be locked, and detects deadlock conditions. The deadlock detection in Bass's scheme cannot work in a distributed system, and thus we feel that this aspect of the scheme should be avoided. This scheme could yet be implemented by timing out requests.

We are continuing to investigate the form of locking which should be integrated into the kernel of a distributed system. We have so far found no locking primitives which seem suitable.

5.4. Append access and no-delay opens

To atomically append to files, the append mode of access supported by most operating systems has been added to UNIX 3.0.
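Append-mode access survives in later systems as the POSIX O_APPEND flag to open. The sketch below uses that flag as an assumption about what such an option looks like in use; the text here does not spell out the UNIX 3.0 interface. The helper name `append_line` and the file mode 0644 are hypothetical choices for illustration.

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical helper: append one line to `path`, creating the file if
 * it does not exist.  With O_APPEND every write is positioned at the
 * current end of file, so concurrent appenders do not overwrite each
 * other's data.
 * Returns 0 on success, -1 on failure.
 */
int append_line(const char *path, const char *line)
{
    size_t len;
    ssize_t n;
    int fd;

    assert(path != NULL && line != NULL);

    fd = open(path, O_WRONLY | O_APPEND | O_CREAT, 0644);
    if (fd < 0)
        return -1;

    len = strlen(line);
    n = write(fd, line, len);
    if (close(fd) < 0)
        return -1;
    return (n == (ssize_t)len) ? 0 : -1;
}
```

Without the append flag, two writers that each seek to the end and then write can interleave so that one record lands on top of the other; the flag moves the positioning step into the system.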
A further open option to allow opening of communications lines without waiting for carrier has also been provided. We feel that these facilities are, indeed, useful, and propose to adopt the UNIX 3.0 open mode (as extended by the open locking options described above) into the standard VAX system.

5.5. Truncate

The current UNIX system lacks a primitive to truncate the logical length of a file. This makes implementation of certain FORTRAN 77 facilities expensive. Also, a convenient way of modifying files with mapping is to allocate a segment for them and then write data into the segment, and unmap and truncate the file. This is possible only if there is a system call

    truncate(name, length);
    char *name;

that removes portions of the file after the specified length. This can be simulated (albeit slowly) on older UNIX systems as it is currently in the FORTRAN 77 i/o library.

5.6. Rename

Programs that create new versions of data files typically create the new version in another file and then do

    unlink("cur"); link("new", "cur"); unlink("new");

This sequence of operations leaves a window where there is no instance of the file cur, causing occasional mysterious anomalies. This can be solved by providing a system primitive:

    rename(newname, oldname);
    char *newname, *oldname;

that does what the preceding sequence does, but atomically, so that there is always an instance of newname. We propose to add this to the standard version of UNIX.

5.7. Per-file cache flushing

The current system makes no provisions for flushing the file system cache of blocks from a file. This makes it difficult to write application programs that attempt to be certain to leave data bases in a consistent state. We feel that an operation to flush all the buffers associated with a particular file would be valuable.
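The rename and per-file flush ideas combine naturally when a program replaces a data file. The sketch below uses the POSIX calls rename and fsync as stand-ins for the primitives proposed in this section; it is an illustration under those assumptions, not the proposed interface. Note that POSIX rename takes the existing name first and the destination name second. The helper name `replace_file` and the caller-supplied temporary name are hypothetical.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical helper: replace the contents of `cur` with `data`.
 * The new version is written to `tmp`, flushed to disk, and then renamed
 * over `cur`, so a reader finds either the old file or the new one and
 * never a missing file.
 * Returns 0 on success, -1 on failure.
 */
int replace_file(const char *cur, const char *tmp, const char *data)
{
    size_t len;
    int fd;

    assert(cur != NULL && tmp != NULL && data != NULL);

    fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;

    len = strlen(data);
    /* Write the new version and force its blocks out of the cache. */
    if (write(fd, data, len) != (ssize_t)len || fsync(fd) < 0) {
        close(fd);
        unlink(tmp);
        return -1;
    }
    if (close(fd) < 0) {
        unlink(tmp);
        return -1;
    }
    /* POSIX rename(from, to): atomically makes `tmp` visible as `cur`. */
    if (rename(tmp, cur) < 0) {
        unlink(tmp);
        return -1;
    }
    return 0;
}
```

Flushing before the rename matters: if the name were switched first and the system crashed before the data blocks reached the disk, the current name could refer to an incomplete file.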
This will involve remembering, in the buffer cache, which file each buffered block belongs to and also identifying such blocks in the virtual memory of processes. This operation can either be an ioctl or a new system call of the form:

    syncfd(fd)
    int fd;

5.8. Status

Symbolic links have been implemented for UNIX before and are also implemented in other operating systems. They require changes to a few programs that are concerned with traversing the file system hierarchy and other than that affect only one routine in the kernel: namei.

Naming directories are extremely simple to implement. They will affect a few user programs that use file names beginning with @ (e.g. Rand's MH program that names a file just "@"), and a few programs that do detailed manipulation of path names (e.g. "csh", which attempts to figure out what directory you are in after a "chdir", will have to understand the effect of "@").

The truncate system call implementation is tricky, since the operation has to be carefully staged so that no duplicate blocks appear if the system crashes during a truncate. The operation is a superset of a creat system call, and the code can be combined.

Per-file cache flushing can be added easily when the system is changed over to the new file system organization described in the preceding section.

6. Software projects and distribution support

This section describes a set of changes that extend the conventions for use of UNIX to simplify software interchange. The underlying structure for the proposal was proposed and implemented by Steve Shafer at CMU: the project notion. The proposal defined here integrates the ideas proposed by CMU with some changes based on experience with the project notion at Berkeley. It also includes other facilities and standards useful in software support and distribution.

6.1.
Current UNIX facilities

Developing large software projects on the current UNIX system requires establishment of conventions for locating parts of the project within the file system hierarchy. Special conventions are often developed per-project, and much "bailing-wire" is needed to hold the project files together. Cries of anguish are often heard if file system hierarchies are moved from disk to disk to alleviate space shortages, as users scurry to convert absolute, and now invalid, path names into new and no more robust names.

With the current system, large software modules to be distributed to other sites often require local customization. Header files have to be edited to reflect true path names where software is or will be stored. It is difficult to install software that is finicky about the locations of commands.

While it is possible for each software effort to develop its own set of conventions and procedures for dealing with this environment, it seems extremely desirable to develop system support and tools for a more robust and portable notion. We will call organized groups of related programs to be managed and ported "projects", following the work done at CMU by Steve Shafer.

6.2. Goals

The goals of this proposal are:

* To support the development of large packages of software by providing a framework for development, based on the framework used by the developers of UNIX.
* To support maintenance of software by adopting conventions for building executable versions of software and for storing the source code and documentation that make this accessible to standard utilities.
* To support distribution of software by making it easy to install software modules in different parts of the file system hierarchy while retaining short and significant names for the various files.

Support for co-existence of several versions of a single package (old, current, new, experimental, etc.) for use by different users or at different times is important.

6.3.
Components of the proposal

The basis for this proposal is a hierarchy of directories and files called a "project". Projects will be supported by conventional use of the naming directory and symbolic link facilities described in the previous section, which give them mnemonic names, and allow different versions to co-exist with different instances selected by different users. Conventions for makefiles and the use of source revision control facilities will allow reconstruction of the programs in a project to be done automatically and allow information to be obtained that describes the current state or history of any file in a project. Facilities for distribution of notification of changes to projects and automatic update of remote copies of software over networks can be developed based on standard descriptions of project structure.

6.4. CMU project notion

Following Shafer, we create a UNIX hierarchy for each group of related programs or project. This hierarchy mimics the normal /usr file system subdirectories in function and includes directories

    bin       containing binaries of project programs.
    exp       containing directories for users in the project.
    include   containing header files for use in the project.
    lib       containing subroutines and shared data files.
    man       containing manual entries for project components.
    src       containing source code for project commands.

Each project also has a normal UNIX group and a bulletin board associated with it. The addition of commands and system facilities to help maintain such hierarchies and the large efforts associated with them is the topic of the rest of this section.

6.5. Strong naming support for projects

There are several important naming requirements for projects:

* It should be easy for users to choose the projects to include in their working environments, and to name files in these hierarchies.
* References from files and libraries in a multi-project environment should clearly denote the projects they are referencing. Thus if a script needs a special version of a standard program, this should be clearly marked in the script.
* Projects should be located in a way that is independent of their absolute placement in the UNIX hierarchy, so that they can be easily transported from machine to machine.

The current CMU project implementation uses search paths, which are part of the UNIX "environment" and interpreted by special library routines, to locate commands in projects. This has the problem that the components to be referenced in source code, scripts and makefiles are not explicit even when exactly one component is desired, and that there are no non-absolute names for project components.

We propose to use the naming directory facility and symbolic links, described in the previous section, to support strong naming for projects. Users would place symbolic links in their naming directories to projects that they wished to use. Thus an entry "visi" in my directory on the "ucbarpa" VAX might be a symbolic link to "/ra/visi", while someone who was developing a new version of this project might have "visi" linked to "/ra/visi.new". If each of us ran a program "mkpla" written by an individual that referenced "@visi/bin/plot", then we would get the versions of the plot routine that we desired: I would get the current version, while the developer could get the newest experimental version.

This facility is similar in usage to the name table translations on other systems, but since the naming directory is accessible in the UNIX file system it requires much less system mechanism. It is advantageous to put naming support for these directories into the operating system so that it will work in all programs. This provides much stronger support for the project notion.

6.6.
Makefile standards

Maintenance and distribution is made substantially easier when all project programs and data bases can be reconstructed by standard makefile descriptions. The current system distribution makefile descriptions support:

    make install   Build a new version of the components in this directory and install them.
    make clean     Remove unnecessary binaries from this directory, to minimize space usage.
    make           Just make the new components, don't install them.

We propose that all distributed commands should be controlled by makefiles that accept these standard entry points. These constitute a minimum acceptable set of controls for all components. We find the use of these standard makefile entry points preferable to manual operation of commands and manual installation.

6.7. Reviving the UNIX group facility

The UNIX group mechanism is designed to support work among groups of users. Thus all the developers in a project could belong to the same project group. Currently, however, a user may only be in one group at a time and must lose command context when changing groups.

Steve Zimmerman at CCA has implemented a version of the group mechanism that allows users to be in all their groups at the same time. Files created are then placed in the group of the containing directory, not the group of the current user (which is no longer uniquely defined!). This change enhances the group facility and makes groups much more useful with projects. We propose that in the next version of the system users be allowed to be in multiple groups at a time.

6.8. Source revision control

It is important to have facilities to retain records of old versions of programs and changes made to them. The current CMU project implementation uses a whist command to annotate source code with commentary about changes. This is useful, but the inclusion of SCCS-like facilities for control of versions is also needed.
Walter Tichy at Purdue is completing a new "Revision Control System" (RCS) which has facilities like SCCS. We propose that both SCCS and RCS should be integrated into the project mechanism. It should be possible to distribute RCS to all users of UNIX on the VAX; SCCS is less widely available because of licensing constraints. Both SCCS and RCS should be modified to include facilities like the current whist.

6.9. Notification/update facilities

A standard method of providing notification of changes to project software is desirable. CMU uses a post command that puts messages on bulletin boards, and has software for distributing changes on a local network. We propose that methods for automatic distribution in large and local nets be developed and be standardized. Methods of notification should be supported by databases associated with mailers and should allow different ways of storing news associated with projects to be used, including:

    news      a derivative of the standard msgs program, developed at LBL
    netnews   a program developed for the USENET, a phone network of UNIX systems
    post      as used at CMU
    mhnews    a news system based on the Rand MH program

It is important that projects be able to retain information about software that has been distributed, and be provided some support for taking bug reports and suggestions (e.g. standard mail boxes for projects at sites where they are installed that can be set up to forward suggestions).

6.10. Role of unique-identifiers for files

A difficult problem in distributing large software systems is identifying files and making sure that the correct pieces are available for construction of a system. The system can aid this by providing unique identifiers for files that can be preserved when the files are copied from machine to machine.
It is also useful for the source code to be stamped with revision numbers to be retrieved by programs such as the what program of the current system.

So that systems that maintain software versions can be constructed for a distributed environment, we propose that all incarnations of UNIX files be assigned identifiers unique in space and time that can be retained when the files are copied and restored by the source code management utilities when older versions of files are reconstructed. This is not used by any current programs, but current research in automatic construction of distributed software by Eric Schmidt at PARC suggests that such identifiers are valuable. We also propose that a system call be provided to return such a unique identifier.

6.11. Towards site-independent programs

One difficulty with current programs is that they tend to build in site dependencies. A particularly bad example is mail programs that deal with multiple networks, which tend to have a good deal of local knowledge built into them, and hence must be modified and recompiled each time they are moved from cpu to cpu. It is extremely valuable for programs to be site-independent, and to make system databases available for program inspection at each site to allow site-specific program actions.

We propose (in the section on operations below) to make the standard programs in the system more machine independent by making information such as the current system name and network connections, information about users and information about locally available resources available in standard files accessed by library routines. We propose that projects should develop similar site-specific data bases so project binaries and libraries are as cpu-independent as possible.

6.12. Status

A version of projects is running at CMU and on the PDP-11 UNIX systems at Berkeley.
We expect to consult with the staff at CMU about the proposal in this section, and to work with the people at CMU, Walter Tichy at Purdue and the people at CCA to integrate and evaluate the new project proposal. We propose to provide naming directories and symbolic links soon so that these can be tested with the new project implementation at CMU, CCA and Purdue. We propose to provide the unique-id facility for files with the first release of the new file system organization.

We also propose to work with CMU to develop a new document describing the enhanced notion of projects described here and develop notification and update standards and procedures based on those used at CMU.

7. Standards

This section describes areas of the system where there are nagging problems that will get worse if some attempt at standardization is not made. The problems are not unique to the VAX system; all versions of UNIX could benefit from standardizing on solutions to problems such as those discussed here.

The typical alternative here is to continue with the status quo. This has the advantage of backwards compatibility but will tend to create more problems than it solves. We prefer to adopt clear improvements on the current approaches, getting a simpler and cleaner system in the long run in exchange for some short term revisions.

7.1. Manual format

There are several goals in proposing a new standard for the manuals. There is the obvious desire to keep the manual stable, as the cost of printing the manuals is prohibitive for some. On the other hand, we desire to keep manuals up to date, and quickly include new facilities in the manual.
Our proposal is to define a base system that is represented in the manual and to set up facilities for the addition of sections of project documentation to the manual. The commands key and toc that CMU implemented as part of their project implementation provide some needed facilities. CMU also printed abridged manuals by default, treating maintenance commands such as the games as projects. This seems reasonable. A useful form of an abridged manual would include a table of contents for all available documentation, so omitted pages could be run off on line and later obtained separately.

We propose that a new format be adopted with a release of the system in early 1982, with advance notification of the format change. This will allow documentation to be prepared for projects to be distributed with this version of the system. We expect that a preliminary version of the project system can be made available to sites in late 1981 so that shared software projects and their documentation can be put in a suitable format.

7.2. Libraries

It is important that the standard libraries contain only a prescribed set of functions so that programs do not have hidden dependencies on locally modified routines. We propose to develop a list of what is in the standard C library and to put new facilities to be added to the ARPA standard system in an ARPA standard library so that the dependencies of newly developed programs on facilities of the ARPA standard system will be explicit.

We feel that it is important to support convenient naming of project specific libraries, and propose that the loader support the project general library notion by taking the form "-l@X" to be the library "@X/lib/libX.a", and the form "-l@X/Y" to be the library "@X/lib/libY.a".

7.3. Mail

UNIX mail is confusing because of the presence of many mailers, mail systems, and network interfaces.
Several important new standards need to be handled, such as the new Internet Mail formats, the new Mail transfer protocol, interface of the mail system to UUCP, and to CSNET, etc. Currently, there are 4 low-level mail handling systems in general use on UNIX:

    MMDF          Developed at Delaware and that is the basis for Phonenet. This system has a good architecture for mail services. We don't have any experience with using this program but intend to learn more about it soon. It currently does not handle uucp traffic.
    delivermail   Developed at Berkeley, this is a mail routing program that manages mail going to different networks. It can handle the ARPANET, uucp and local network mail simultaneously.
    BBN MAIL      The new mail system at BBN handles the new MTP protocol, as well as local net mail forwarding.
    RAND MH       The low level facilities underlying Rand's MH system provide groups, aliases and mail transmission facilities.

Each of these programs currently provides facilities provided by none of the others. On the other hand, the programs all provide similar facilities and it is clearly disadvantageous for all four of these systems (and perhaps others) to be developed independently to meet the same needs. We hope that the persons responsible for these systems will investigate the facilities of the other systems. It would be valuable to standardize on a single mail delivery system, a single format for storing incoming mail, and a single data base format for mail forwarding and mail groups.

The many existing mail reader interfaces should be changed to work with the new standard delivery programs. Many of them inadequately process the header information. Fixes for many of these are available in the community (e.g. from CMU and CCA for the Mail program), and should be incorporated as part of the changeover to a new standard mail system.
We intend to pursue the selection of a single standard low-level mail system for the VAX.

7.4. Signals

The signal handling mechanisms of UNIX version 7 are inadequate for safe processing of asynchronous events, having race conditions in them that make them unsafe. Newer mechanisms were provided in the 4bsd release of the VAX system that give clean and safe semantics to signals, treating them as software interrupts that are blocked while they are being processed.

We propose that the newer implementation of the signal handling mechanism be incorporated as the standard one in the VAX system. There are some minor incompatibilities in the way in which interrupted system calls are restarted, but these incompatibilities are felt to be less bothersome than continuing to use a standard implementation of signals that is neither safe to use nor clean.

7.5. Terminal driver interface

The current system supports two different terminal drivers, one that is standard from version 7 UNIX and one a more fully functional terminal driver typical of PDP-10 systems. The new UNIX to be released by Bell Laboratories, UNIX 3.0, has yet another terminal driver interface.

The UNIX 3.0 terminal driver interface is clean, and could be adopted as a standard interface. Programs that wish to use the older version 7 terminal driver interface can use a compatibility interface package. We propose to provide the facilities of the current new terminal driver and the needs of the INTERLISP implementors for special hooks in the terminal driver with extensions to the UNIX 3.0 driver.

7.6. Control; cleaned up ioctls

The current UNIX ioctl system call suffers from a lack of specification of the lengths of the control information being exchanged. We propose to define a new operation that has ioctl's semantics but with full parameter specification.
This control operation will have the form

    int f;
    char *request;
    char *idata;
    int ilen;
    char *odata;
    int olen;
    int reslen;

    reslen = control(f, request, idata, ilen, odata, olen);

Here f is a UNIX file descriptor, request is a null-terminated string specifying the request, idata is a string containing input for the request of length ilen, and odata provides a place for storing the corresponding result value of maximum length olen. The returned reslen is the length of the result, which may be shorter than olen. To allow for the easy use of null-terminated strings in idata, an ilen of -1 will be interpreted by the C library as indicating that idata is a null-terminated string.

We believe that this control primitive, with its much cleaner interface, will provide a much more stable base for definition of device-specific controls than ioctl.

7.7. Debugging information format

The information present in the current symbol table in the UNIX executable files is inadequate for construction of symbolic debuggers. It does not contain enough information about variable types. A new debugger is being written by a student at Berkeley, and is suffering from the lack of this information. We feel it is desirable to have a symbol table format for UNIX that includes adequate information.

We propose to work with other interested parties to define a new symbol table format that permits the representation of all information about the standard languages C, Pascal and FORTRAN 77. It is expected that the ADA implementations for the VAX will require significantly greater complexity in the symbol table information, and we do not propose to handle ADA, although input from ADA implementors would be valuable in defining the new format. The new format should be portable to machines other than the VAX, and should work, for example, also on PDP-11, C/70 and 68000 based UNIX systems.
The new debugger will not be constrained by VAX licensing and should be easy to port to work on these machines as well.

7.8. Screen environment support

Programs that wish to build screen oriented command environments are rudely interrupted by current UNIX commands for inter-user communication such as write, wall and the mail arrival notification daemon. Programs that are to run in windows also need a communications path to a screen manager.

The naming directory can be used by programs such as write and wall to locate a hook for sending information through a screen manager to the terminal. Conventional hooks could be placed there for processes that wish to communicate to the user: "@writeportal", "@mailportal", etc.

We propose to investigate an appropriate set of conventions for these programs to use and to develop these conventions in cooperation with other sites that are working on screen oriented programs. We also propose to investigate providing a facility whereby the messages that the kernel sends to user processes are sent to a place other than the current "/dev/tty". Such messages include messages that tape devices are offline and that file systems are full, and also corrupt the screen of screen managers.

7.9. Other areas

There are undoubtedly other areas where development of new standards and interfaces can benefit the users of UNIX, and we welcome input about and proposals for such standards.

8. Operational support

This section discusses needs for operational support of the system, including file system backup and retrieval procedures and error logging.

8.1. Standard UNIX facilities

The standard UNIX/32V system provides dump and restore procedures for file system backup and accounting gathering for login time and process resource usage.
The system must be manually rebooted after a crash and manual procedures instituted to reconstruct any file systems that are damaged. The standard system does not handle bad media and does not record error messages that are printed on the console.

8.2. Current VAX facilities

The VAX system has been enhanced substantially from the standard version 7 UNIX system. A new installation and setup guide exists for the VAX system that clearly explains the operational procedures. The dump program has access to a table describing how often file systems should be backed up, and it is thus much easier to tell when the file systems need to be backed up.

The system automatically restarts after a crash, and runs an automatic repair program. The system performs critical disk operations in a careful way, doing some disk operations synchronously so that the post-crash repair program can either finish or back out each incomplete operation.

A simple description file describes each VAX CPU and can be used to load a system containing exactly those drivers required. The system supports multiple instances of all standard devices and placement of devices on multiple MASSBUS or UNIBUS adapters. Full ECC recovery and DEC standard bad block handling on disks are supported. System sizing is simplified by automatic extrapolation of needed table size from a constant "maximum active users" in the description file.

Error messages printed on the system console are in a more readable format than those printed by standard version 7. They are saved in a buffer in memory that is retrieved after a system crash and stored in a disk file for later examination. Device error bits are decoded symbolically in the error messages.

8.3. Overview of needs

Most operational needs are addressed in the current version of the system.
Remaining needs include support for an "operator", who can execute maintenance functions but with less privilege than the "super-user", clean localization of site-specific information to make the system binaries more portable, standard error logging for quicker repair, improvements to the dump/restore program and provisions for user archival and retrieval of files to relieve pressure on disk space.

8.4. Operator notion

In the current standard system, a person who is to do such maintenance operations as file system dumps and restores is required to have super-user privileges, allowing unrestricted access to all system facilities. This is undesirable on many systems, and several sites have implemented a notion of an "operator" with maintenance privileges but not all privileges. We feel that this notion is a useful one. We propose to integrate the changes made at CMU for the support of an operator into the standard system.

8.5. Clean localization of system

The standard system has commands that have to be recompiled per-site because they contain site-dependent information. There should be a standard provision and use of the information needed by these programs so that they could be site-independent. The programs are typically part of the mail system or have compiled machine names into them.

The information about users on the system is also not cleanly parameterized. Some systems put information about users into the GECOS field of the password file, but this seems less than desirable. We propose to develop a standard form for a user information file. Any such data base should be extensible, and contain, at minimum, the information accessed by the current finger command.

Other system information such as the terminal type databases currently exists in several files because of the evolutionary path by which these files were developed.
We propose to compress this information into single data bases where appropriate to make maintenance of this information simpler.

Finally, we propose to add a new standard directory /local, which on each system will contain all the local files and databases. Databases that currently exist in other directories with long-term associations, such as /etc/passwd, will be replaced by symbolic links to their counterparts in /local.

8.6. Error logging

The current UNIX system does not produce error log information in a format that DEC field service is used to. More seriously, the system does not log recovered soft errors, so that impending problems can go undiscovered when evidence of their onset would otherwise be available. At least one site (UCLA) had many problems with their VAX that might have been avoided or alleviated if full error logging were available.

Don Markuson at CMU is working on an implementation of error logging in UNIX. He is cooperating with the UNIX group at DEC, which previously produced a system written by Fred Cantor called "v8m", which was a modified version 6 UNIX that supported error logging.

8.7. Dump/restore needs

The dump and restore programs have been modified by CMU and Wisconsin to do multiple dumps per tape and to restore hierarchies respectively. We feel that these modifications should be combined and incorporated back into the standard system.

8.8. Archive/retrieve design

UNIX sorely needs a system whereby users can request portions of the file system hierarchy be safely archived on tape, so that they can later request them be restored. Our group at Berkeley is working on two programs, archive and retrieve, that will meet this need. The archive command will take a list of file names and queue them for archival.
When the files are archived, an entry will be made in a data base associated with the user noting information that can later be used to retrieve the file, and the user will receive mail notification that the archival has taken place. A retrieve command will queue a request for file retrieval from an archive tape. The file will later be retrieved when an extraction program is run by an operator.

So that users may have confidence in the archive/retrieve procedure, we intend that an option be available to make multiple copies (normally 2) of each archive tape, and that this procedure result in tapes which are stored in separate locations. Provision of manpower to make the turnaround time on archive and retrieve requests sufficiently low to encourage use should be paid back in lowered disk space usage. We intend that archive and retrieve will store files on magnetic tape in tar format and maintain an on-line database of the files that have been stored.

9. Miscellaneous topics

This concluding section contains discussion of several topics of general interest that didn't fit naturally in any of the other sections.

9.1. Software census and contribution to standard system

We are currently preparing to mail questionnaires to all users of the VAX system asking them to tell us the software they have brought up on the VAX that they are willing to share with the general VAX community. We hope to take the information gathered by this "VAX software census" and place it in an on-line data base. We hope that this information will eventually be available through CSNET for general examination and update by authorized users.

We are also interested in finding out what software efforts are going on.
Our questionnaire will ask both what kinds of software are being developed and what software the different sites are interested in porting to UNIX. We hope that this procedure will make us aware of the software that is available, and help us to tell what software should be made available in a standard system.

9.2. Electronic forum for system users

We are interested in creating an electronic forum for users of the VAX UNIX system. The forum "unix-wizards@sri-unix" has proven a useful information exchange for a limited set of VAX users. We plan to establish a forum for ARPA users of the VAX UNIX system as soon as our NCP C/70 is firmly on the ARPANET.

An electronic mailbox "csvax.4bsd-bugs@berkeley", available via uucp as "ucbvax!4bsd-bugs", has been available for about 6 months, although only a few sites have been submitting trouble reports.* We hope to advertise this more widely, mentioning it in the questionnaires. Another mailbox "csvax.4bsd-ideas@berkeley" collects ideas for improvements to the system. Some of the proposals discussed in this report benefited from suggestions mailed to "4bsd-ideas."

9.3. Hardware support; new and dual processors

The VAX UNIX system supports all released DEC hardware for the VAX except the TU78 tape transport. We are working with the UNIX group within DEC to provide support for new DEC devices and VAX processors as they are released.

Bob Kridle, of the Systems Support Group at U.C. Berkeley, and Bill Joy have prepared a document giving hints to UNIX users on configuration of VAX systems. This document has helped many sites bring up economical VAX installations.

Recently, George Goble and Mike Marsh of the Electrical Engineering department at Purdue University have created a dual processor 11/780 UNIX system, by cabling an additional VAX processor to an 11/780 SBI.
While some minor problems remain with running compatibility mode on the slave processor, the system is functional. Since current VAX 11/780 systems are limited in growth largely by the available CPU power, this appears to be an attractive way to get nearly twice the CPU horsepower of a single processor system for much less additional cost. Addition of a CPU and a second memory controller to a single CPU VAX 11/780 system, and provision of 4 Megabytes for the second CPU should be possible for under $100,000. With some help from DEC it would be possible to run this configuration in shops where more processor power is needed at a lower cost than replication of entire systems.

* Official requests from ARPA contractors related to the VAX UNIX system should be mailed to "csrg@berkeley".

We have been working with George Goble and Mike Marsh to develop reasonable processor scheduling algorithms for the dual 11/780. We intend to encourage DEC to provide support for this option and assist us in fixing minor problems with this configuration.

9.4. Debuggers

The VAX UNIX system currently comes with two debuggers: adb and sdb. The adb debugger is oriented towards examination of memory and object code, and currently has no knowledge of source text. The sdb debugger knows about source code, but suffers from several minor bugs and lack of information in the symbol table needed to do proper handling of displayed variable values. Sdb also does not contain an expression parser powerful enough to accept source language expressions.

Robert Elz of the University of Melbourne has extended adb to provide some programmability. We have worked with Rob Gurwitz at BBN to provide adb with knowledge to interpret the VAX page tables and to make it more useful for debugging the UNIX kernel.
We have recently made some minor modifications to adb so that it records the source line information used by sdb when present in the object file, and hence can show source as well as object code. We intend to fix the display of local variables in adb and make this improved debugger available to other sites in the next release.

A source language symbolic debugger was written for the Pascal interpreter px on the PDP-11 by Mark Linton at Berkeley. This debugger is currently being moved to the VAX and made to work for C, Pascal and FORTRAN 77 code. We hope that this debugger will be part of a future release of the system.

9.5. Fortran 77

There are many sites that would like to use UNIX on their VAX systems but have need of a fast FORTRAN implementation. While the f77 compiler is a complete implementation of the language, the speed of compiled code produced by the compiler is noticeably less than that produced by the VMS FORTRAN compiler. This is not surprising. The f77 compiler is not an optimizing compiler, while the VMS FORTRAN compiler is.

Stuart Feldman, one of the authors of f77, visited Berkeley last academic year and formed a group to work on optimizations in f77. This group is now in the process of implementing the designed optimization pass of the compiler, and hopes to have a prototype of the new compiler running by the end of the year. We have funding to hire a programmer to work on f77 next year to finish this project.

We also hope to incorporate the improvements made to FORTRAN by Jim Kulp at IIASA in Austria. Kulp's group produced documentation designed to help users of FORTRAN on other machines learn to use f77. We hope to work with the Computer Center at Berkeley to make the documentation produced in Austria more widely available.

A group of students at Berkeley under the direction of Prof. Kahan are producing basic math library routines such as sin and sqrt to conform to the new IEEE standards.
These routines will be integrated into the standard VAX UNIX math library as they become available.

9.6. Detaching jobs

We realize the desirability of detaching jobs from one terminal and reattaching them to another terminal. This facility was considered for inclusion when the job control facilities were added to UNIX but rejected because of the difficulty of communicating the change of environment to the newly attached jobs. If conventions for use of the naming directory as a communications area are adequate, this problem can be solved. Jobs that are reattached could look in their naming directory to see the terminal type they are now attached to and discover other aspects of their environment. We plan to investigate the provision of attach and detach facilities in future releases of the system.

9.7. UNIX and VMS: performance and facilities

There has been a good deal of discussion of the relative performance of UNIX and VMS. Much of the available information is now out of date, and more will be outdated as the facilities described here are incorporated into UNIX and new versions of VMS become available.

Our recent measurements show that the differences in paging performance of UNIX and VMS reported by Kashtan at SRI are no longer significant. We measured the behavior of the 4.1bsd system running his benchmarks and got times not significantly different from the times he reported for VMS. When we used vadvise to tell the system that the sequential access jobs were, indeed, sequential, then the system outperformed VMS substantially.

In our experience the largest reason for research sites to use VMS is the quality of the FORTRAN on VMS. We hope that a future release of a better f77 compiler will make the FORTRAN issue moot, so that the choice need not be made for FORTRAN alone.
We expect to run a new set of UNIX and VMS benchmarks after the facilities described here are in place in the system, probably sometime early in 1982. The results of these benchmarks should prove valuable for further refinements to the systems.

Appendix

1. Index and summary of proposed system facilities

The following table summarizes the new system facilities proposed in this paper. The entries in the table are system calls (whose names are all in lower case), constants related to system calls (whose names are all upper case), and new types associated with the new facilities (which are given in italics). Each item is classified as relating to memory management facilities mman, IPC and networking ipc, the file system filsys, or general needs general. Other categories include changed for system calls whose interface is changed, or deleted for system calls to be deleted.

Name            Kind     See    Description
answer          ipc      2.5    Receive a call establishing virtual circuit
associate       ipc      2.8    Provides a server for a network address
asynchronous    general  2.10   Request interrupt notification about i/o
call            ipc      2.5    Place a call establishing virtual circuit
chnamdir        general  5.2    Change naming directory
closesend       ipc      2.9.5  Close transmit half of a circuit
control         general  7.8    Replacement for ioctl with cleaner interface
disassociate    ipc      2.8    Remove an association from associate
ENBLOCK         general  2.10   Error returned instead of blocking with nonblocking
fdset           general  2.3    Type representing set of file descriptors
fdwaterm        general  2.11   Type representing watermarks for i/o
in_addr         ipc      2.3    Internetwork address type
in_proto        ipc      2.3    Socket type, from SOCK_DG, SOCK_VC, SOCK_CALL
ioctl           deleted  7.8    To be replaced by control with cleaner interface
nonblocking     general  2.10   I/O requests return ENBLOCK instead of blocking
open            changed  5.3    New flags from UNIX 3.0 and for locking
portal          ipc      2.7    Create a server gateway in UNIX file system
portalkind      ipc      2.7    Portal types defining protocols
PORTAL_CALL     ipc      2.7    Portal type for simple circuit connections
PORTAL_FILE     ipc      2.7    Portal type for file emulation
PORTAL_DEV      ipc      2.7    Portal type for device emulation
PORTAL_DIR      ipc      2.7    Portal type for directory emulation
receive         ipc      2.4    Receive a datagram
recordbetween   ipc      2.9.1  Is a circuit between records?
recordmode      ipc      2.9.1  Place circuit in record mode
rename          general  5.6    Atomic rename primitive for file system
segadvise       mman     3.9    Give system advice about a segment
segalloc        mman     3.5    Allocate a segment in virtual memory
segchmod        mman     3.7    Change access protection of a segment
segfree         mman     3.8    Free a segment in virtual memory
SEG_SHARED      mman     3.5    Segment is to be shared (in segalloc)
SEG_PRIVATE     mman     3.5    Segment is private (in segalloc)
SEG_NA          mman     3.5    No access allowed in segment (in segchmod)
SEG_R           mman     3.7    Read access allowed in segment (in segchmod)
SEG_W           mman     3.7    Write access allowed in segment (in segchmod)
SEG_X           mman     3.7    Execute access allowed in segment (in segchmod)
select          general  2.6    Provides a synchronous i/o multiplexing facility
send            ipc      2.4    Send a datagram
SIGIO           general  2.10   Input/output possible signal (with asynchronous)
signal          changed  7.4    New signal facility to become standard
sigstack        general  3.13   Provide special stack for signal processing
SIGURG          ipc      2.9.2  Urgent data arrival signal
socket          ipc      2.4    Create a socket for IPC communications
socketstatus    ipc      2.11   Return internal state of a socket
SOCK_CALL       ipc      2.3    Call director socket for establishing circuits
SOCK_DG         ipc      2.3    Datagram socket type
SOCK_VC         ipc      2.3    Virtual circuit socket type
symlink         filsys   5.1    Create a symbolic link
symfollow       filsys   5.1    Enable/disable symbolic links
symunlink       filsys   5.1    Remove a symbolic link
syncfd          general  5.7    Flush buffering associated with file or device
truncate        filsys   5.5    Shorten the length of a file
urgentmode      ipc      2.9.2  Place circuit in urgent data mode
urgentnext      ipc      2.9.2  Is next data in circuit urgent?
urgentpending   ipc      2.9.2  Is there any upcoming urgent data?
urgentsockets   ipc      2.9.2  Return set of sockets with urgent data pending
vadvise         deleted  3.1    Replaced by segadvise facilities
vread           deleted  3.1    Replaced by segalloc facilities
vwrite          deleted  3.1    Replaced by segalloc facilities
watermarks      general  2.12   Set buffering watermarks for stream descriptor
[illegible]     general  5.2    Naming directory filename prefix character
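
As an illustration of the symlink entry in the table, and of the /local convention proposed in section 8.5, the following is a minimal sketch in Python rather than a definitive implementation. It uses os.symlink, which wraps the present-day POSIX symlink call, not the 1981 interface proposed here. The helper name relocate_to_local and the scratch directory layout are assumptions made for the example; the sketch runs inside a temporary directory and never touches the real /etc or /local.

```python
import os
import shutil
import tempfile


def relocate_to_local(old_path, local_dir):
    """Move old_path into local_dir and leave a symbolic link behind.

    Hypothetical helper: the proposal only states that databases such as
    /etc/passwd would be replaced by symbolic links to counterparts in /local.
    """
    os.makedirs(local_dir, exist_ok=True)
    new_path = os.path.join(local_dir, os.path.basename(old_path))
    shutil.move(old_path, new_path)  # the real file now lives in local_dir
    os.symlink(new_path, old_path)   # the old name still resolves to it
    return new_path


def demo():
    # A scratch directory stands in for the root of the file system.
    root = tempfile.mkdtemp()
    try:
        etc = os.path.join(root, "etc")
        local = os.path.join(root, "local")
        os.makedirs(etc)
        passwd = os.path.join(etc, "passwd")
        with open(passwd, "w") as f:
            f.write("root::0:0::/:\n")
        new_path = relocate_to_local(passwd, local)
        # Reading through the old name follows the link to the new file.
        with open(passwd) as f:
            content = f.read()
        is_link = os.path.islink(passwd)
        same_target = os.path.realpath(passwd) == os.path.realpath(new_path)
        return is_link, same_target, content
    finally:
        shutil.rmtree(root)
```

Programs that open the old path keep working unchanged, which is the property the /local proposal depends on: only the link target is site-specific.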