Dr. Dobb's is part of the Informa Tech Division of Informa PLC

This site is operated by a business or businesses owned by Informa PLC and all copyright resides with them. Informa PLC's registered office is 5 Howick Place, London SW1P 1WG. Registered in England and Wales. Number 8860726.


Channels ▼
RSS

Design

High-Speed Networking


AUG92: HIGH-SPEED NETWORKING

HIGH-SPEED NETWORKING

Header prediction and forward-error correction for very high-speed data transfer

William Frederick Jolitz

Bill is the developer of 386BSD, the Berkeley UNIX Research operating system for the 386/486 PC. He is currently writing a book on 386BSD internals. You can reach Bill on CompuServe, 76703,4266.


Networking hardware technologies of the future promise data-transfer rates of gigabits per second in place of today's megabits--yet many of the techniques used in our software and hardware architectures don't scale well to these high data rates. The basic problems we face are an order-of-magnitude increase in processor speed, coupled with a three order-of-magnitude increase in network bandwidth. So even if our network-protocol processing cost is reduced to 1 percent (with today's networks), and the processor speed improves as expected, we'll still reach the saturation level, with no room left to run an application!

Once we bring I/O and application demands into the picture, things become quite problematic. I/O is at the bottom of the computer system's "food chain" for processor cycles. Applications demand massive I/O and want to supervise the demand for the processor as well. The obvious loser in this competition is the operating system, which is being squeezed from both the top (by applications) and bottom (by device hardware).

In this article, I'll discuss problems involved with high-speed networking, focusing on certain software improvements in the Berkeley TCP/IP implementation that attempt to deal with some of the challenges. You can see this software in operation in any 386BSD system. Variations of this approach have already been used with FDDI (100 megabits/sec) and HiPPi (880 megabits/sec).

Network Profusion Magnifies the Problem

Ironically, the very success of computer networking has magnified the scale of these problems. While the number of host computers on networks is growing at a more-or-less "linear" rate, the geographic span of computer networks is growing much faster. This problem isn't restricted only to workstations; even in the PC arena, networks are a formidable consideration. (Novell estimates that 45 percent of all PCs sold today will be connected to a network.)

The transmission rates of new long-distance communications technologies cause the original ARPANET 56-kilobits/second network and the current T1 (1.544 megabit/sec) service common on the Internet today to pale in comparison. Among these technologies are T3 (44.736 megabits/second) and the Synchronous Optical Network (SONET), which is ordered in units of 51.84 megabits/second (OC-1), up to 2.488 gigabits/second (OC-48). (T1 has considerable historical significance, however, since it was created as the original digital "trunk" service between telephone exchanges--hence its name, T1.)

All these new services result in more bits on the wire per unit of time. By increasing the data rate by three orders of magnitude, we now have literally tens of megabytes of data in flight between any two urban centers. Since we can't do anything to reduce the delay on a long-distance connection (unless someone learns to travel faster than the speed of light), we have to look to other solutions.

For example, let's send a message from San Francisco to New York over different high-speed networks. Assuming a 70-millisecond "flight" time, we will find 12 Kbytes in flight for a T1 connection, 385 Kbytes for a T3, and almost 20 Mbytes for an OC-48 connection! If we discover an error in this message upon reception, we are forced to receive a barrage of useless data before the sender is informed of the problem and can respond. Worse yet, the sender would need to remember the previous "in flight" data to be able to respond. Clearly, high-speed networks require new approaches to leverage their advantages appropriately; otherwise, the problems begin to resemble those of the sorcerer's apprentice.

LANs Suffer Too, but Differently

Not only are long distance networks getting faster, so are local area networks (LANs). FDDI (made more cost effective by a twisted-pair 100 Base 2 version) holds the promise of 100 megabit/second technology for workstations and PCs. (Indeed, PC LANs may exceed the PC disk-drive transfer rates in use!) Many workstation vendors are exploring ATM (Asynchronous Transfer Mode) and SONET to provide scalable LAN bandwidth. One method is to downsize the telephone-company model of communications by using a simplified switch at the center of a star network radiating to workstations. The switch allows each "ray" of the star to be independently upgraded to faster bandwidths as they become available.

But the transition of LANs to very high-speed networks is not strongly affected by volume of traffic or delay; instead, high speed networks suffer from a different malady. A notable characteristic of very-high bandwidth LAN network transfers is for them to be "bursty"--usually, one transfer will command all the resources of the computer's processor and network I/O device for a brief period of time, to the exclusion of others. These "loudmouth" transfers (such as image downloads from a server or database group-commit operations) command the attention of client, server, and any gateways and/or network in between.

Another characteristic of LAN technologies is the ever-present cost factor. While 100+ MIPS on the desktop is within sight, 15-20 MIP PCs and workstations are far more realistic. If we could streamline network-protocol processing overhead, we might be able to leverage 0.1-gigabit/sec LAN technologies available on current workstations/PCs, and use the same trick for the 1-gigabit/100 MIP LAN/computer combinations as they become available.

Knowing this, network and operating-system designers can implement single-element caches at key points in the networking software implementation. This will "short-circuit" the lookup and evaluation of data structures in processing a protocol. This approach favors the most common case, in which the packet being processed is tied to the program associated with the previous packet processed. This will clear out much of the overhead in our increasingly complicated network protocols.

None of these problems are irreconcilable. In fact, there are many competing solutions we should be aware of as the world gradually migrates into gigabit networks. Both simple and complex solutions can be found by "thinking gigabit."

Example: Single-element Caches in 386BSD TCP

A simple example of thinking gigabit can be found within 386BSD's Transmission Control Protocol (TCP) implementation, the current version of the University of California's Internet networking implementation. This protocol corresponds to level 4 of the famous OSI seven-level model (see Table 1), which concerns itself with the speed, quality, and reliability of data transfer between systems. This is the correct layer to attempt to accelerate, since its sole concern is moving information at the fastest possible rate. (See the textbox "TCP/IP In Brief.")

Table 1: Seven layers of the ISO model.

  Layer         Description
  ----------------------------------------------------------------------

  Physical      Defines how data is electrically moved between two
                 physically connected machines.
  Datalink      Breaks data up into manageable packets and ensures that
                 they arrive intact and in order.
  Network       Determines route data will take to reach target machine.
  Transport     Maintains virtual connections between local and remote
                 machine.
  Session       Maintains communication sessions and may include ability
                 to reestablish broken connections automatically.
  Presentation  Specifies the format in which data will be
                 represented.  This is especially important when
                 communicating across different platforms.
  Application   The user application level which determines the contents
                 and meaning of data.

Associated with each TCP "session" is a pcb, or protocol control block. On receipt of a new TCP segment, we need to find the associated Internet pcb, or inpcb. We must search hundreds of possible connections to find the appropriate buffer; see inpcb() in Listing One, page 122. This costly search could be shortened by assuming bursty behavior, checking whether the last TCP segment came from the same source, and then using the previous search, which is still valid; see Listing Two, page 122. Figure 2 describes the TCP segment format.

You might be tempted to elaborate upon these simple kinds of improvements or to spread the theme throughout the networking implementation.

Header Prediction: Stretching the Present into the Future

There is a header (or trailer) associated with the data of a packet that contains the control information relevant to the protocol processing of this and related packets. Usually, this information is processed as it is received. If it is a connection-oriented or "virtual circuit" protocol (as opposed to a "connectionless," or datagram protocol), the state for this connection must be updated, and new packets might need to be scheduled for transfer. (Perhaps either an acknowledgment or more data needs to be sent.) This represents the computational overhead of the protocol, and in most cases it occupies the front end of the process of accepting new data on reception, in conjunction with receipt of a packet. This is normal when we don't know what the next packet might contain, or when it will arrive.

Header prediction is founded on the sensible expectation that the next packet will come along on the heels of the previous one, and that the next packet's header will likely be "guessable" by the software. These assumptions are true in cases associated with mass transfers in the first place; corrupted or lost packets are low probability events, and the more elegant features of a given transport-level protocol are not usually exercised by the applications programmer. So, when possible, header prediction uses a good heuristic to process the protocol with a short-circuit evaluation.

Essentially, the heuristic involves tracking expected state changes associated with each communications "stream." Thus, a model of each stream's next protocol header is prepared before the packet arrives. When it arrives, the incoming header must simply be matched to its intended form. If the match is correct, the precomputed "new" state is assigned, the packet's data is delivered, and packet processing is complete. This process may take very few instructions, thus reducing the critical path. However, if the match is incorrect, all the processing we attempted to avoid must now be done, and our speed advantage is lost. Fortunately, this occurs less than a fraction of a percent of the time, so it's of little consequence.

Listing Three (page 122) shows the code in 386BSD that attempts limited header prediction. It searches for two common cases: receipt of a TCP segment that can be transferred to the application immediately, without further processing; and response to a simple acknowledgment to the last outstanding segment transmitted. For these cases, processing the TCP segment can be collapsed down to as little as 40-50 instructions, instead of the several thousand generally involved.

More Details.

The incoming packet (ti) is compared with recorded information about this TCP session (tp) kept in a protocol control block. We check to see that it:

  • Uses no other features of the protocol (flags).
  • Is a consecutive, in-sequence order received packet (ti->ti_seq == tp->rcv_nxt).
  • Is not the subject of a retransmission (tp->snd_ nxt == tp->snd_max, that is, the outstanding transmission).
Also, we verify that the session is at steady state, with both sides proceeding in lockstep (ti->ti_win && ti->ti_win == tp->snd_wnd). If so, we can process the packet, which is stored in a buffer (m) overlaid by a data structure (ti) that exposes the details of the packet's TCP/IP headers.

The heuristic works best with large packets of data that arrive at a consistent pace, without variation, as occurs most with fast transfers. With header compression, a modern 100 MIP workstation might be able to handle multigigabit transfers with little hardware assist.

New Challenges in Very High-Speed Networking

Once we understand more fully the nature of very high-speed network traffic, we must find ways (such as new protocol processing techniques) to better use the present arrangement of computer hardware and software. After all, people are not going to just throw out all their equipment and start from scratch.

Demands for greater bandwidth, round-trip time exchanges, and bursty networks are just a few of the challenges faced by very high-speed networks. But they all illustrate this underlying problem: As you scale a base technology used to build a complex entity, risky side effects grow as a power function, or worse yet, as an exponential! Take, for example, the pragmatic redesign of the Internet in response to its growth (in terms of number of hosts and additional demand of service). Numerous times, linear scaling has provoked nonlinear effects, thus necessitating evolution of the network standards and redeployment of the network structure itself. (You could make a fair argument as to why OSI has not displaced TCP/IP simply on the basis of not keeping up with the required changes that have strained the Internet in this way.)

Forward-error Correction: When You Care Enough to Not Retransmit

At still higher data rates than header prediction can handle, we might need to abandon the current paradigms in protocols like TCP. One such paradigm is the network delay and processing overhead of handling retransmission of failed data. One simple method to avoid resending garbled data is to not resend it at all! Instead, additional information is sent in the form of an error-detection and correction-encoding scheme. Thus data integrity is preserved without the need for retransmissions. The cost of this technique is contained in the additional processing of the encoding/decoding and the overhead of transmitting redundant data-state information.

A serpentine, or "barbershop pole" algorithm is possible; see Figure 3. Since we won't know where the error might occur, we must transmit all the data at least twice, or transfer ways to reconstruct it. For Figure 3, we just redundantly transmit all information, spacing the data at the largest time offset allowed by the buffers allocated on the sending side. This resembles the original retransmission scheme, except that when we exceed the memory space of the transmitting computer system, we send the (delayed) packet. Thus the retransmission memory is a FIFO queue of packets.

A scheme like this relies on probabilities: It assumes that the error condition is transitory (soft) within the FIFO's lifetime. If it isn't, you can drop all the data in flight (!), abandon the connect (chances are it's dying anyway), and return to the original retransmission scheme, taking the performance hit.

Forward-error correction addresses network delay, but doesn't deal with increasing protocol-processing performance. Suppose you had a workstation that you wanted to receive different real-time video feeds, each to a separate window, all in software. In this case, the assumptions that allow header prediction are not present. Can we cope with this even greater demand for network bandwidth? Yes, we can; protocol engines are one solution.

Protocol Engines: Specialized Hardware to Bypass the Bottlenecks

At some point, processing the network protocols pushes the limits of the average workstation too far. We might consider factoring in increasing amounts of dedicated hardware to more lithely handle the necessary processing. This is warping the hardware to fit the situation; we can call this a "protocol engine."

A protocol engine departs from the traditional frame of computer system/network integration by substituting varying degrees of special-purpose hardware to remove bottlenecks that otherwise limit performance. In this case, both hardware and software are mutable to the degree that the desired performance objective needs to be achieved. If a protocol implementation's software consumes too many hardware instructions, we rebuild the hardware--we provide new facilities that make the software's use of hardware fall into line. Examples of this mountain-to-Mohammed strategy are providing separate data paths for the data and header of a packet or providing a separate functional unit to checksum a packet as it is processed. We might also wish to overlap routing lookup with ordinary packet processing, killing the processing if a route is found that exists to another network interface.

We could conceivably process every step of protocol implementation in hardware, potentially parallelizing every operation possible. Such an endeavor would be extraordinarily expensive--and not necessarily justified. As mentioned earlier, high-speed traffic is characterized by bursty behavior. An ultimate solution such as a fully hardware-implemented protocol stack would be used intensively only during those brief bursts. Given current expectations, this would be expensive overkill. In fact, the virtue of protocol engines lies in their potential to scale to vaster aggregate amounts of bandwidths by means of increasing application of VLSI hardware.

Protocol engines are a controversial topic; many prominent critics argue that they are either underjustified, or a continuation of the outmoded front-end processor arrangement popular in early networks. Front-end processors would entirely process the protocols on a separate microprocessor, usually on the same board as the network interface. This allowed a nonnetworked computer system to interface to a network by adding a device driver and some software "glue." The value of this approach was that it avoided changing the operating system to work with the network. A vague argument for offloading the protocol processing was frequently made. However, these units were usually slower than native-mode protocols because the main processor was considerably faster than the front end, and the cost of communication between the main processor and the front end was just as expensive as having the main processor run the protocols itself! A true protocol engine is a difficult and considerable achievement, since it must outperform main-processor technology without shifting the burden back to the main processor.

Conclusion

Testbed networks that employ the elements described here are currently in operation at various technology centers around the world. In the U.S. alone, interest in the competing projects for the National Research and Educational Network (NREN), as well as the interest in fiber-based networks like FDDI and HiPPi, have catalyzed this activity. ATM and SONET are watchwords for scalable network bandwidth to take us into the gigabit/sec world. Improvements like single-element caching and header compression are already in Berkeley networking implementations that you can obtain today (386BSD). In short, gigabit networking is far closer than you think.

TCP/IP In Brief

The Transmission Control Protocol (TCP) and its companion Internet Protocol (IP) form the core protocols upon which the Internet network is built. In the '70s, TCP/IP replaced Network Control Program (NCP) on the embryonic ARPANET. Since then, it has evolved from a university curiosity to the predominate open systems protocol of choice, forming the largest de facto network standard in the world today. It predates the now-standard OSI model for describing computer network structure; see Table 1.

IP provides the "legs" to move packet traffic around; it provides a minimal infrastructure to shoot "datagrams" across an arbitrarily connected data network. Datagrams are just "small" packages of data with source and destination addresses on the "catenet," or catenation of networks; see Figure 1. IP's only concern, being solely an OSI level-3 or network-layer protocol, is to provide a fabric for the simplest possible communication of datagrams between computers on a network. As such, datagrams going across the Internet travel without guarantees of any kind. They can arrive out of order or be lost, corrupted, or misaddressed.

As the "planner," or scheduler of traffic across the network that IP provides, TCP is concerned solely with transporting data contained within IP datagrams. The next level up in the OSI model (level 4--TCP) uses the IP datagram to send its data messages (called "segments"), taking care to account for the chaos possible when transiting the network via IP. As these segments arrive, they are buffered and handed off to an application program as a part of the stream of data between the two application programs. TCP uses a floating "window" of virtual buffer space to flexibly handle flow control, and to avoid the logjams that result when network delay becomes significant. While TCP is the predominate transport protocol used on the Internet, many others (UDP, TP4,...) are also in use, and similarly make use of IP.

Since TCP provides communication to application programs in the form of a bidirectional data stream, it requires a connection to engage the stream and a disconnection to terminate the stream. Thus it is a connection-oriented protocol (or virtual circuit), the opposite of a datagram service. Header prediction makes it possible to streamline connection-oriented protocols. This is because data streams commonly fall into predictable patterns when connections are in place and supply/consumption is at a steady state. In other words, the clever transport-protocol features that ensure well-behaved communication are infrequently used; usually, the "null" or pass-through case occurs.

--W.F.J.



_HIGH-SPEED NETWORKING_
by William F. Jolitz


[LISTING ONE]
<a name="01c3_0010">

From file /sys/netinet/in_pcb.c:

/* Find a internet protocol control block associated with a given session.
 * Session is determined as a match between foriegn (faddr, fport) and local
 * (laddr, lport) source/destionation identifiers. Allows wildcard matches
 * when some but not all of the parameters are known. */
struct inpcb *
in_pcblookup(head, faddr, fport, laddr, lport, flags)
    struct inpcb *head;
    struct in_addr faddr, laddr;
    u_short fport, lport;
    int flags;
{
    register struct inpcb *inp, *match = 0;
    int matchwild = 3, wildcard;

    for (inp = head->inp_next; inp != head; inp = inp->inp_next) {
        if (inp->inp_lport != lport)
            continue;
        wildcard = 0;
        if (inp->inp_laddr.s_addr != INADDR_ANY) {
            if (laddr.s_addr == INADDR_ANY)
                wildcard++;
            else if (inp->inp_laddr.s_addr != laddr.s_addr)
                continue;
        } else {
            if (laddr.s_addr != INADDR_ANY)
                wildcard++;
        }
        if (inp->inp_faddr.s_addr != INADDR_ANY) {
            if (faddr.s_addr == INADDR_ANY)
                wildcard++;
            else if (inp->inp_faddr.s_addr != faddr.s_addr ||
                inp->inp_fport != fport)
                continue;
        } else {
            if (faddr.s_addr != INADDR_ANY)
                wildcard++;
        }
        if (wildcard && (flags & INPLOOKUP_WILDCARD) == 0)
            continue;
        if (wildcard < matchwild) {
            match = inp;
            matchwild = wildcard;
            if (matchwild == 0)
                break;
        }
    }
    return (match);
}





<a name="01c3_0011">
<a name="01c3_0012">
[LISTING TWO]
<a name="01c3_0012">

From file:/sys/netinet/tcp_input.c

    ...
    /* * Locate pcb for segment. * */
findpcb:
    inp = tcp_last_inpcb;
    /* first, see if this segment looks familiar */
    if (inp->inp_lport != ti->ti_dport ||
        inp->inp_fport != ti->ti_sport ||
        inp->inp_faddr.s_addr != ti->ti_src.s_addr ||
        inp->inp_laddr.s_addr != ti->ti_dst.s_addr) {
        inp = in_pcblookup(&tcb, ti->ti_src, ti->ti_sport,
            ti->ti_dst, ti->ti_dport, INPLOOKUP_WILDCARD);
        if (inp)
            tcp_last_inpcb = inp;
        ++tcppcbcachemiss;
    }
    ...





<a name="01c3_0013">
<a name="01c3_0014">
[LISTING THREE]
<a name="01c3_0014">

From file:/sys/netinet/tcp_input.c:

    ...
/* Header prediction: check for the two common cases of a unidirectional
 * data xfer. If the packet has no control flags, is in-sequence, the window
 * didn't change and we're not retransmitting, it's a candidate. If the length
 * is zero and the ack moved forward, we're the sender side of the xfer. Just
 * free the data acked & wake any higher level process that was blocked
 * waiting for space. If length is non-zero and the ack didn't move, we're the
 * receiver side. If we get packets in-order (the reassembly queue is empty),
 * add data to the socket buffer and note that we need a delayed ack. */
    if (tp->t_state == TCPS_ESTABLISHED &&
        (tiflags & (TH_SYN|TH_FIN|TH_RST|TH_URG|TH_ACK)) == TH_ACK &&
        ti->ti_seq == tp->rcv_nxt &&
        ti->ti_win && ti->ti_win == tp->snd_wnd &&
        tp->snd_nxt == tp->snd_max) {
        if (ti->ti_len == 0) {
            if (SEQ_GT(ti->ti_ack, tp->snd_una) &&
                SEQ_LEQ(ti->ti_ack, tp->snd_max) &&
                tp->snd_cwnd >= tp->snd_wnd) {
                /* this is a pure ack for outstanding data. */
                ++tcppredack;
                /* need to update round trip timers */
                   if (tp->t_rtt && SEQ_GT(ti->ti_ack,tp->t_rtseq))
                    tcp_xmit_timer(tp);
                /* free sent data that was acknowledged */
                acked = ti->ti_ack - tp->snd_una;
                tcpstat.tcps_rcvackpack++;
                tcpstat.tcps_rcvackbyte += acked;
                sbdrop(&so->so_snd, acked);
                tp->snd_una = ti->ti_ack;
                /* free this packet */
                m_freem(m);
                /* If all outstanding data are acked, stop
                 * retransmit timer, otherwise restart timer
                 * using current (possibly backed-off) value.
                 * If process is waiting for space,
                 * wakeup/selwakeup/signal. If data are ready
                 * to send, let tcp_output decide between more
                 * output or persist. */
                if (tp->snd_una == tp->snd_max)
                    tp->t_timer[TCPT_REXMT] = 0;
                else if (tp->t_timer[TCPT_PERSIST] == 0)
                    tp->t_timer[TCPT_REXMT] = tp->t_rxtcur;
                if (so->so_snd.sb_flags & SB_NOTIFY)
                    sowwakeup(so);
                if (so->so_snd.sb_cc)
                    (void) tcp_output(tp);
                return;
            }
        } else
        /* This is a pure, in-sequence data packet with nothing on the
        * reassembly queue; we have enough buffer space to take it. */
            if (ti->ti_ack == tp->snd_una &&
            tp->seg_next == (struct tcpiphdr *)tp &&
            ti->ti_len <= sbspace(&so->so_rcv)) {
            /* update received state and statistics */
            tp->rcv_nxt += ti->ti_len;
            ++tcppreddat;
            tcpstat.tcps_rcvpack++;
            tcpstat.tcps_rcvbyte += ti->ti_len;
            /* Deliver data by dropping TCP and IP headers, then
                         * add data to application's socket buffer, and wakeup
             * application if necessary. */
            m->m_data += sizeof(struct tcpiphdr);
            m->m_len -= sizeof(struct tcpiphdr);
            sbappend(&so->so_rcv, m);
            sorwakeup(so);
            /* request that a acknowledgement be sent with next
             * data packet outbound -- we delay in hopes of
             * "piggy backing" on top of a data packet.
            tp->t_flags |= TF_DELACK;
            return;
        }
    }
    ...


Copyright © 1992, Dr. Dobb's Journal


Related Reading


More Insights






Currently we allow the following HTML tags in comments:

Single tags

These tags can be used alone and don't need an ending tag.

<br> Defines a single line break

<hr> Defines a horizontal line

Matching tags

These require an ending tag - e.g. <i>italic text</i>

<a> Defines an anchor

<b> Defines bold text

<big> Defines big text

<blockquote> Defines a long quotation

<caption> Defines a table caption

<cite> Defines a citation

<code> Defines computer code text

<em> Defines emphasized text

<fieldset> Defines a border around elements in a form

<h1> This is heading 1

<h2> This is heading 2

<h3> This is heading 3

<h4> This is heading 4

<h5> This is heading 5

<h6> This is heading 6

<i> Defines italic text

<p> Defines a paragraph

<pre> Defines preformatted text

<q> Defines a short quotation

<samp> Defines sample computer code text

<small> Defines small text

<span> Defines a section in a document

<s> Defines strikethrough text

<strike> Defines strikethrough text

<strong> Defines strong text

<sub> Defines subscripted text

<sup> Defines superscripted text

<u> Defines underlined text

Dr. Dobb's encourages readers to engage in spirited, healthy debate, including taking us to task. However, Dr. Dobb's moderates all comments posted to our site, and reserves the right to modify or remove any content that it determines to be derogatory, offensive, inflammatory, vulgar, irrelevant/off-topic, racist or obvious marketing or spam. Dr. Dobb's further reserves the right to disable the profile of any commenter participating in said activities.

 
Disqus Tips To upload an avatar photo, first complete your Disqus profile. | View the list of supported HTML tags you can use to style comments. | Please read our commenting policy.