Mr. Elliston is working on a protocol for using SCSI devices to network Linux clusters in order to transfer data at high speeds.
by Ben Elliston
I was introduced to the UNIX operating system about seven years ago, and I soon became familiar with the networking companion to UNIX: TCP/IP (transmission control protocol/Internet protocol). As time progressed, I evolved from being a user of the UNIX command-line TCP/IP utilities, such as TELNET and FTP, to gaining an understanding of the internal workings of the protocols.
One point, reinforced by every book on TCP/IP I have read, is that IP was designed to be encapsulated in almost any of the available data link protocols. This design makes it an inter-networking protocol; it is inconsequential that a computer on the opposite side of the globe is connected to a token ring and your machine is connected to an Ethernet. I found this concept so impressive that I examined the various types of existing IP encapsulation. At the time, there was IP in IP, IP in IPX, IP in PPP, IP over Ethernet using NCR's WaveLAN spread-spectrum network adapters and others.
Before ATM (asynchronous transfer mode) and 100 Mbps Ethernet were readily available, I started thinking about what other bus networks existed for networking computers. There were ARCnet and token ring, but these media offered throughput capacities comparable to Ethernet. Moreover, I was interested in experimenting with coarse-grained parallel processing using a set of cheap PCs sitting in the same room, where applications not only had to dispatch jobs but also had to exchange large data sets in order to accomplish their tasks. In this situation, network latency was not a great concern, but throughput was.
Perhaps due to its analogous operation to Ethernet, the SCSI (small computer system interface) protocol popped into my head. It was very fast--SCSI-2 adapters were commonplace at the time. SCSI shares some attributes of Ethernet, making it suitable as a network data link: each ``station'' has an identifier, only one ``station'' can use the bus at any time, and each end of the bus must be terminated with a terminator of a characteristic impedance. SCSI provided a miniature Ethernet, only much faster.
I acquired the ANSI SCSI standards documentation and started doing the background research that would be necessary to undertake such a project. After a great deal of reading and advancing from SCSI-1 to SCSI-2, I started thinking about a design which could elegantly handle the encapsulation of IP version 4. I immediately recognized the forthcoming issues of IP version 6, but chose to ignore them, given that I wanted to get something running immediately. I also reasoned that I wouldn't be doing much harm by developing yet another protocol that would be disrupted by IP version 6.
My design led to the RFC (Request for Comments) draft document entitled ``IP Encapsulation over the Small Computer Systems Interface'', which can be found at ftp://ftp.internic.net/rfc/rfc2143.txt.
SCSI is a peripheral interconnection technology designed to offer hardware manufacturers a standardized protocol and hardware description in order to build peripherals and computers which can be interconnected. For example, the Apple Macintosh was an early mass-produced computer which allowed the connection of SCSI devices.
SCSI devices communicate with each other by sending data packets across a shared bus. A device uses a hardware handshake to acquire the bus--all other devices must remain silent while it uses the bus. Unlike Ethernet, the bus is not accessed using a collision detection mechanism. Instead, devices follow a stateful algorithm to acquire the bus. When idle, the bus is in a state, or phase, known as the bus-free phase. If a device wishes to access the bus, it enters an arbitration phase, but only if the bus was previously in the bus-free phase. Clearly, there exists the classic mutual exclusion problem in which two devices check the state of the bus, both find it in the bus-free phase, and both go into arbitration. In this situation, the device with the highest SCSI ID always wins. This could prove significant when designing a network of machines running IP over SCSI, since a host with a low SCSI ID may have to wait whenever a higher-numbered host contends for the bus.
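The effect of fixed-priority arbitration on a network of hosts can be illustrated with a small simulation. This is only a sketch under simplifying assumptions: IDs run from 0 to 7 as on an 8-bit bus, every host with queued packets contends in every round, and each win sends exactly one packet. The function names are invented for the illustration and do not come from any driver.

```python
def arbitrate(contending_ids):
    """Return the SCSI ID that wins arbitration, or None if nobody contends."""
    ids = set(contending_ids)
    for scsi_id in ids:
        if not 0 <= scsi_id <= 7:  # assumption: 8-bit bus, IDs 0..7
            raise ValueError(f"invalid SCSI ID: {scsi_id}")
    # The highest ID always wins; with no contenders the bus stays free.
    return max(ids) if ids else None


def drain_order(pending):
    """Given {scsi_id: queued_packets}, list the IDs in the order they win the bus."""
    pending = dict(pending)  # work on a copy
    order = []
    while any(pending.values()):
        winner = arbitrate(i for i, n in pending.items() if n > 0)
        order.append(winner)
        pending[winner] -= 1
    return order


if __name__ == "__main__":
    # Hosts 2, 5 and 7 all have traffic queued at the same moment.
    print(drain_order({2: 2, 5: 1, 7: 2}))
```

Under these assumptions the host with ID 7 empties its queue before host 2 sends anything, which suggests that busy hosts are better given low IDs or throttled in software.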
After arbitration, the device enters a selection phase in which the target SCSI device's ID is placed on the data bus. The command phase is used to transfer the command data to the target. The reselection phase is entered when the target device wishes to respond to the initiator. This allows the bus to be used by other devices while a device is performing its task. The data-in and data-out phases are used to actually transfer data between the initiator and the target. The message-in and message-out phases are available to transfer additional control information between the initiator and the target. The status phase is used to transfer a status byte from the target back to the initiator to indicate the result of the operation. For instance, a tape drive might return a status code to indicate that the media was not loaded.
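As a rough aid to reading the description above, the phases can be written down as an enumeration together with the order in which one simple, uninterrupted transfer would visit them. This is a simplified sketch: it assumes the target never disconnects (so reselection is never entered), omits any message-out phase, and ends with a message-in phase carrying the completion message. The authoritative sequencing rules are those of the SCSI-2 standard, not this code, and the function name is made up for the example.

```python
from enum import Enum


class Phase(Enum):
    BUS_FREE = "bus-free"
    ARBITRATION = "arbitration"
    SELECTION = "selection"
    RESELECTION = "reselection"
    COMMAND = "command"
    DATA_IN = "data-in"
    DATA_OUT = "data-out"
    STATUS = "status"
    MESSAGE_IN = "message-in"
    MESSAGE_OUT = "message-out"


def simple_transfer(direction):
    """Phases visited by one transfer with no disconnect; direction is 'in' or 'out'."""
    data_phase = {"in": Phase.DATA_IN, "out": Phase.DATA_OUT}[direction]
    return [
        Phase.BUS_FREE,     # bus is idle
        Phase.ARBITRATION,  # initiator competes for the bus
        Phase.SELECTION,    # initiator names the target ID
        Phase.COMMAND,      # command bytes go to the target
        data_phase,         # payload moves to or from the target
        Phase.STATUS,       # target reports the result
        Phase.MESSAGE_IN,   # target signals completion
        Phase.BUS_FREE,     # bus is released
    ]


if __name__ == "__main__":
    for phase in simple_transfer("out"):
        print(phase.value)
```

For IP over SCSI, sending a packet would presumably correspond to the data-out case, with the packet as the payload.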
At the beginning of the project, I specified some overall goals. These goals have had a major impact on the scope of my ``IP over SCSI'' project. Some people found items worthy of criticism--and on occasion, they were right. The main thing is to realize that some of the issues raised just didn't fit the scope of the current project. They will be addressed in a later implementation. The goals I set were:
Given these design goals, I developed a network driver which had the following attributes:
When I designed IP over SCSI, my intentions were to permit a number of closely situated machines running Linux to communicate using their existing base of software applications without modification, but at much higher speeds. This has minimal value, however, as networks such as Ethernet seem to serve most people's needs.
Other applications, which have not yet been fully exploited, could benefit a great deal from high-speed interconnectivity between hosts. I was recently a witness to a demonstration of the PVM (parallel virtual machine) manager running a massive computation on 31 Pentium-based Linux machines, and we observed that the bottleneck was the network used to transmit units of ``work'' and the subsequent results between the machines.
I, therefore, see that IP over SCSI has a number of immediate applications:
.----+---+---+---+---+---.
     |   |   |   |   |
     B   C   D   E   A
                     |
     F   G   H   I   |
     |   |   |   |   |
.----+---+---+---+---+---.

Here, hosts [B-E] can communicate with hosts [F-I], despite the fact that a SCSI-1 bus, for example, is unable to support a total of nine hosts.
Getting more creative:
A---B---C---D---E---.
|
F---G---H---I---J---.
|
K---L---M---N---O---.
|
P---Q---R---S---T---.
|
U---V---W---X---Y---.

This arrangement can naturally be extended to three dimensions by, at the bare minimum, adding a third SCSI interface to the gateway hosts {A,F,K,P,U}.
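To get a feel for how far apart two hosts are in the grid above, the topology can be modelled as a set of buses and searched breadth-first. The sketch below assumes my reading of the diagram: each row is one SCSI bus, and the gateway hosts {A,F,K,P,U} share one additional bus. The function name is hypothetical, and the model counts bus traversals only; it says nothing about actual IP routing configuration.

```python
from collections import deque
from string import ascii_uppercase

HOSTS = ascii_uppercase[:25]                              # 'A' .. 'Y'
ROW_BUSES = [set(HOSTS[i:i + 5]) for i in range(0, 25, 5)]
GATEWAY_BUS = set(HOSTS[0::5])                            # {A, F, K, P, U}
BUSES = ROW_BUSES + [GATEWAY_BUS]


def bus_hops(buses, src, dst):
    """Return the minimum number of bus traversals from src to dst, or None."""
    seen = {src}
    queue = deque([(src, 0)])
    while queue:
        host, hops = queue.popleft()
        if host == dst:
            return hops
        for bus in buses:
            if host in bus:
                # Every unvisited host on a shared bus is one traversal away.
                for peer in bus - seen:
                    seen.add(peer)
                    queue.append((peer, hops + 1))
    return None


if __name__ == "__main__":
    print(bus_hops(BUSES, "B", "E"))  # same row bus
    print(bus_hops(BUSES, "E", "Y"))  # row bus, gateway bus, row bus
```

In this model no pair of hosts is more than three traversals apart, and all inter-row traffic crosses the single gateway bus, which is where contention would be expected first.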
The encapsulation protocol for IP over SCSI has been documented and drafted a number of times. It has passed through the Internet Engineering Task Force and is now published as an RFC document (RFC 2143).
There has been a good deal of interest in this concept. Another Linux user and recent computer science graduate, Randy Scott, has implemented the IP over SCSI protocol with success. His project does not exactly meet the protocol given in the RFC, but it does prove that the concept works. Randy's work, however, illustrates that there is an issue of performance when it comes to IP networking in the Linux kernel, most of which was beyond his control. It is understood that there is some doubt as to whether a network interface could have a maximum transmission unit (MTU) of 64KB.
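The appeal of a very large MTU is easy to quantify. The sketch below counts how many IPv4 datagrams are needed to move a given payload, assuming a 20-byte IPv4 header without options and ignoring transport headers. Note that the IPv4 total-length field is 16 bits wide, so a datagram can be at most 65,535 bytes, one byte short of a full 64KB. The function name is invented for the example, and the calculation says nothing about how the Linux kernel behaves with such an interface.

```python
IPV4_HEADER = 20           # bytes, header without options
IPV4_MAX_DATAGRAM = 65535  # limit of the 16-bit total-length field


def datagrams_needed(payload_bytes, mtu):
    """Number of IPv4 datagrams needed to carry payload_bytes at the given MTU."""
    if not IPV4_HEADER < mtu <= IPV4_MAX_DATAGRAM:
        raise ValueError("MTU must exceed the header size and fit in 16 bits")
    per_datagram = mtu - IPV4_HEADER
    return -(-payload_bytes // per_datagram)  # ceiling division


if __name__ == "__main__":
    megabyte = 1_000_000
    print(datagrams_needed(megabyte, 1500))               # Ethernet-sized MTU
    print(datagrams_needed(megabyte, IPV4_MAX_DATAGRAM))  # largest IPv4 datagram
```

With these assumptions, one million bytes take 676 datagrams at an MTU of 1500 and 16 at the largest size, so per-packet overhead in the protocol stack is paid far less often.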
My own implementation has not been getting as much attention from me as I would like. Until recently, work was progressing well. I have a modular network interface which can be brought on-line using insmod and ifconfig, and IP packets can be sent onto the SCSI bus and the correct SCSI ID selected using my implementation of an address resolution protocol (ARP).
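The role of address resolution here can be pictured as a small table that maps an IP address to the SCSI ID to select on the bus. The following is a conceptual sketch in Python, not the kernel code and not the resolution procedure specified in the RFC. The class and method names are hypothetical, and IDs are assumed to lie in the range 0 to 7.

```python
import ipaddress


class ScsiNeighborCache:
    """Hypothetical cache of IPv4 address to SCSI ID mappings."""

    def __init__(self):
        self._table = {}

    def learn(self, ip, scsi_id):
        """Record that the host with this IP address answers at scsi_id."""
        if not 0 <= scsi_id <= 7:  # assumption: 8-bit bus
            raise ValueError(f"invalid SCSI ID: {scsi_id}")
        self._table[ipaddress.IPv4Address(ip)] = scsi_id

    def resolve(self, ip):
        """Return the SCSI ID for ip, or None if it must still be discovered."""
        return self._table.get(ipaddress.IPv4Address(ip))


if __name__ == "__main__":
    cache = ScsiNeighborCache()
    cache.learn("192.168.1.2", 3)
    print(cache.resolve("192.168.1.2"))
    print(cache.resolve("192.168.1.9"))
```

A lookup that returns nothing is the point at which a real implementation would have to query the other hosts on the bus before it could send the packet.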
The next step is to verify the modifications made to the device driver for initializing target mode, then receive data from the SCSI bus and pass it up the protocol stack. I would be grateful to receive any help in completing this project from interested individuals.
Ben Elliston is a software engineer currently working for Cygnus Solutions. His interest in computers just gets him into trouble, so in his spare time, he enjoys rock climbing, mountain biking, playing the guitar and spectating at rallies. He can be reached at bje@cygnus.com.