Internet Telephony Protocols

By Linden DeCarmo, July 01, 1999

Linden examines the strengths and weaknesses of SIP and H.323, the two dominant "Voice over the Internet" protocols. He also takes a look at a new challenger -- the Media Gateway Control Protocol.


Linden is a senior software engineer at NetSpeak and the author of The Core Java Media Framework (Prentice Hall, 1999). He can be contacted at [email protected].


The H.323 protocol has so dominated Internet telephony that the two terms are considered synonymous. Despite H.323's entrenchment, the creators of the Session Initiation Protocol (SIP) believe they have a superior solution for controlling Internet phone calls. Are SIP's advantages enough to dethrone H.323? In this article, I'll discuss why this struggle is critical, examine the strengths and weaknesses of the H.323 and SIP protocols, and provide insights about which technology will succeed.

Before delving into the SIP versus H.323 debate, I'll clarify some common misconceptions about Internet telephony, or Voice over the Internet Protocol (VoIP). Most people assume that Internet telephony is limited to phone calls between two computers. In reality, VoIP refers to packetized audio streams transported over the Internet Protocol (IP) to endpoints where they are decoded. These endpoints can be computers or even ordinary analog phones (a gateway is required to convert the packetized audio to a format that the phone understands).

Another misperception about VoIP is that its primary benefit is free long-distance calls. While cheap long distance creates headlines, the reasons phone companies are attracted to Internet telephony are its ease of service creation and consolidation of networks.

The primary advantage of VoIP over the Public Switched Telephone Network (PSTN) is the ease of adding new features and functions. For instance, you can create a video conference by defining how the video stream is packetized and decoded. In contrast, retrofitting video conferencing support into the PSTN can be an arduous and expensive process. Similarly, since Internet telephony is software-centric, it is easier to add voice mail, call forwarding, and other services to it.

A second advantage of VoIP is the consolidation of networks. Currently, phone companies maintain duplicate networks -- one for voice, the other for packetized data. Converting to VoIP eliminates the separate voice network, and because voice networks are approximately four to five times more expensive than data networks, this dramatically reduces costs.

The final misconception about Internet telephony is that its transmission quality is inherently poor. If the network is heavily loaded, there can be sizable transmission delays, which cause poor audio quality. However, if VoIP is rolled out on a dedicated private network with quality of service guarantees, audio quality rises to the level consumers expect.

Signaling 101

To complete a phone call, a signaling protocol is required to locate the party (or parties) you wish to call and alert them to the incoming call. Once the callee has been located, the protocol must reserve resources and prepare to stream the multimedia content. Furthermore, it must offer call-control features such as hold, forward, and transfer.

As Figure 1 illustrates, most signaling protocols focus on connection and call control, and intentionally separate themselves from the actual transmission of multimedia content between endpoints. Consequently, the endpoints are able to stream multimedia content without the intervention of a third party. If the server transmitting the signaling protocol intervened in multimedia transmissions, the extra processing could increase the likelihood of transmission delays; see Figure 2.

Virtually all packet-based signaling protocols use the Real-time Transport Protocol (RTP) to transmit audio-visual data between endpoints over the User Datagram Protocol (UDP). RTP is built on UDP because speed is paramount and lost packets can be compensated for in the audio or video Compressor/Decompressor.
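To make the RTP-over-UDP idea concrete, here is a minimal Python sketch of the 12-byte fixed RTP header. It assumes no padding, no header extension, and no CSRC list; the function names are illustrative only, and the gap finder ignores sequence-number wraparound. The sequence number is what lets a receiver notice lost packets so the CODEC can compensate.

```python
import struct

RTP_VERSION = 2

def pack_rtp_header(payload_type, sequence, timestamp, ssrc, marker=False):
    """Pack the 12-byte fixed RTP header (assumes no padding, extension, or CSRCs)."""
    byte0 = RTP_VERSION << 6                      # version in the top two bits
    byte1 = (int(marker) << 7) | (payload_type & 0x7F)
    return struct.pack("!BBHII", byte0, byte1, sequence & 0xFFFF,
                       timestamp & 0xFFFFFFFF, ssrc & 0xFFFFFFFF)

def unpack_rtp_header(packet):
    """Decode the fixed header fields from the first 12 bytes of a packet."""
    byte0, byte1, sequence, timestamp, ssrc = struct.unpack("!BBHII", packet[:12])
    return {
        "version": byte0 >> 6,
        "marker": bool(byte1 >> 7),
        "payload_type": byte1 & 0x7F,
        "sequence": sequence,
        "timestamp": timestamp,
        "ssrc": ssrc,
    }

def find_gaps(sequence_numbers):
    """List sequence numbers missing from what was received (no wraparound handling)."""
    seen = set(sequence_numbers)
    return [n for n in range(min(seen), max(seen) + 1) if n not in seen]
```

In a real endpoint the packed header would be followed by the encoded audio and sent with an ordinary UDP socket; UDP itself never retransmits, so the receiver simply conceals whatever find_gaps reports.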

Besides controlling Internet-friendly endpoints, signaling protocols must be able to interact with and control PSTN endpoints. This control is enabled by gateways (or bridges). A gateway is responsible for converting IP-based commands and packetized audio streams to traditional protocols, such as Signaling System Seven (SS7) or ISDN, used by the current phone system.

During the initial phases of Internet telephony, each vendor designed its own proprietary signaling protocol. Unfortunately, these phones and PSTN gateways could not interoperate and the Internet telephony market was in chaos. Fortunately, several standards bodies recognized the gravity of this problem and proposed open standards that theoretically were to provide cross-vendor compatibility.

Swiss Army Knife

The first body to release a proposal was the International Telecommunication Union (ITU). The ITU's Packet Based Multimedia Communication Systems (H.323) recommendation is an all-encompassing definition of the protocols necessary to transmit multimedia streams. These protocols are described in the following ITU recommendations: H.225 for connection, H.245 for control, H.332 for large conferences, H.235 for security, H.246 for interoperability with the PSTN world, and H.450.1, H.450.2, and H.450.3 for supplemental services. All of these standards are part of the Series H recommendations (hence the "H." prefix) that cover audiovisual and multimedia systems.

Because H.323 is composed of so many subrecommendations, you must wade through more than 700 pages of heavy technical reading before grasping its features. As you read these documents, you'll be surprised to find out that, unlike most text-based Internet protocols, H.323 is a binary protocol based on Abstract Syntax Notation One (ASN.1) and the Packed Encoding Rules (PER).

One of the most important H.323 protocols is H.225, whose gatekeeper signaling is commonly referred to as RAS (Registration, Admission, and Status). RAS lets a client application access resources (such as routing and directory services) granted by a gatekeeper. Besides controlling local network resources, gatekeepers can interface with external gatekeepers to form Wide Area Networks; see Figure 3.

H.225 applications typically send a Gatekeeper request (GRQ) to discover their gatekeeper, a Registration request (RRQ) to register with it and request resources, and an Admission request (ARQ) to begin using the resources. When the call is complete, the application sends a Disengage request (DRQ) followed by an Unregistration request (URQ) to deallocate resources.
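The request ordering described above can be sketched as a small state machine. This Python sketch is a deliberate simplification: the state names are invented for illustration, and the gatekeeper's confirm and reject replies are omitted entirely.

```python
# Hypothetical state names; only the five request types come from the text.
TRANSITIONS = {
    ("undiscovered", "GRQ"): "discovered",   # find a gatekeeper
    ("discovered", "RRQ"): "registered",     # register with it
    ("registered", "ARQ"): "in_call",        # ask permission to place a call
    ("in_call", "DRQ"): "registered",        # call finished
    ("registered", "URQ"): "discovered",     # release the registration
}

def run_ras(messages, state="undiscovered"):
    """Apply each request in order; raise ValueError if one is out of sequence."""
    for message in messages:
        key = (state, message)
        if key not in TRANSITIONS:
            raise ValueError(f"{message} is not allowed in state {state}")
        state = TRANSITIONS[key]
    return state
```

Feeding it GRQ, RRQ, ARQ, DRQ, URQ walks the full life cycle, while an ARQ sent before an RRQ is rejected.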

H.323 Version 1 (H.323 V1) was plagued with poor performance and severe incompatibility problems between vendors. The first revision of a standard is often more concerned with architectural purity and functionality than performance, and H.323 V1 typified this phenomenon. For instance, to connect two parties, multiple roundtrips between endpoints were necessary, and this extended connection times from milliseconds to seconds (see Figure 4).

In addition, because H.323 V1 was not feature rich, virtually every vendor added proprietary extensions to the protocol to maintain functional equivalence with their prior proprietary protocols. Unfortunately, these proprietary extensions prevented interoperation with other vendors' H.323 products, thereby invalidating a major design goal of the standard.

The ITU recognized these problems and addressed them in H.323 V2, which provides the Fast Start feature that only requires one roundtrip to connect parties. Furthermore, features were added to the specification to reduce the necessity of adding proprietary extensions.

A Different Philosophy

The primary competitor of H.323 is SIP, which was conceived at Columbia University and later submitted for consideration to the Internet Engineering Task Force (IETF). Like other dominant Internet protocols (such as HTTP for web browsing, and SMTP for mail services), SIP is text based. Unlike H.323, which has gone through two official revisions, SIP is an Internet draft (or work in progress) in the late stages of development.

SIP focuses on signaling and does not try to define every aspect of multimedia communication, as H.323 does. Consequently, it can be documented in less than 130 pages of relatively light reading. SIP achieves this feat by reusing other protocols' concepts and leveraging other protocols for specific features.

For example, SIP reuses the same headers, errors, and encoding rules as HTTP, so if you're familiar with HTTP syntax, you'll learn SIP syntax quickly. Similarly, SIP uses the IETF's Session Description Protocol (SDP) to communicate attributes of the multimedia stream (CODECs used to compress audio and video, data rate of the stream, and so on).
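As an illustration of how little machinery SDP needs, the following Python sketch parses a session description into its type=value lines and pulls out the media streams. The sample addresses and the function names are assumptions made for the example, and only the "m=" and "a=rtpmap" lines are interpreted.

```python
SAMPLE_SDP = """v=0
o=alice 2890844526 2890844526 IN IP4 host.example.com
s=Call
c=IN IP4 192.0.2.10
t=0 0
m=audio 49170 RTP/AVP 0
a=rtpmap:0 PCMU/8000
"""

def parse_sdp(text):
    """Split an SDP body into (type, value) pairs, preserving order."""
    fields = []
    for line in text.strip().splitlines():
        line = line.strip()
        if line:
            key, _, value = line.partition("=")
            fields.append((key, value))
    return fields

def media_streams(fields):
    """Collect each m= line plus the rtpmap attributes that follow it."""
    streams = []
    for key, value in fields:
        if key == "m":
            media, port, transport, *formats = value.split()
            streams.append({"media": media, "port": int(port),
                            "transport": transport, "formats": formats,
                            "rtpmap": {}})
        elif key == "a" and value.startswith("rtpmap:") and streams:
            payload, encoding = value[len("rtpmap:"):].split(None, 1)
            streams[-1]["rtpmap"][payload] = encoding
    return streams
```

Here media_streams(parse_sdp(SAMPLE_SDP)) yields one audio stream on port 49170 whose payload type 0 maps to PCMU at 8000 Hz.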

Because SIP does not define every potential protocol used in multimedia communication, it must be combined with other IETF protocols to create a complete solution. For instance, you can use the playback and recording features of the Real-Time Streaming Protocol (RTSP) to provide voicemail functionality with SIP. Since RTSP and SIP are text based and use SDP, the same SDP parser can be used for both protocols.

Like HTTP, SIP is a client-server architecture. Clients invoke methods on servers, and servers respond with acknowledgments. There are three core methods in SIP:

INVITE requests that a party join a conference (or call).
BYE terminates a connection between two users in a conference (hanging up the phone).
ACK confirms that the caller has received the final response to an INVITE request.
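Because SIP requests are plain text, the three core methods can be produced with ordinary string handling. The Python sketch below builds an INVITE, an ACK, and a BYE that share one Call-ID; the addresses, the Call-ID value, and the build_request helper are hypothetical, and only a handful of headers are included.

```python
CRLF = "\r\n"

def build_request(method, uri, call_id, cseq, body=""):
    """Return a SIP request as text: request line, headers, blank line, body."""
    lines = [
        f"{method} {uri} SIP/2.0",
        "Via: SIP/2.0/UDP caller.example.com",   # assumed caller host
        "From: sip:alice@example.com",           # assumed caller address
        f"To: {uri}",
        f"Call-ID: {call_id}",
        f"CSeq: {cseq} {method}",
        f"Content-Length: {len(body.encode('utf-8'))}",
    ]
    return CRLF.join(lines) + CRLF + CRLF + body

call_id = "12345@caller.example.com"
callee = "sip:bob@example.com"
invite = build_request("INVITE", callee, call_id, 1)   # start the call
ack = build_request("ACK", callee, call_id, 1)         # complete the INVITE exchange
bye = build_request("BYE", callee, call_id, 2)         # hang up
```

A real INVITE would normally carry an SDP body describing the media; it is left empty here to keep the focus on the request structure.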

Figure 5 illustrates the message flow between an SIP server in proxy mode and a client machine. Each request or response is composed of headers (many of these headers are identical to their HTTP counterparts) and an optional message body. If an SIP parser encounters a header it doesn't understand, it skips it and concentrates on the headers it understands. The only exception to this rule is if the unknown header is listed in the Require header (see Examples 1 and 2). Everything specified by the Require header must be understood or an error is returned to the sender.
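The skip-what-you-don't-understand rule can be sketched in a few lines of Python. This is a simplification that follows the description above: the set of known names is arbitrary, folded (multi-line) headers are not handled, and the entries in Require are treated as plain names, whereas the SIP specification defines them as option tags that identify extensions.

```python
KNOWN_HEADERS = {"via", "from", "to", "call-id", "cseq",
                 "require", "content-length"}

def parse_headers(message, known=KNOWN_HEADERS):
    """Return (start_line, understood, skipped, unsupported) for one SIP message."""
    head = message.split("\r\n\r\n", 1)[0]
    lines = head.split("\r\n")
    start_line, header_lines = lines[0], lines[1:]
    understood, skipped = {}, []
    for line in header_lines:
        name, _, value = line.partition(":")
        name, value = name.strip().lower(), value.strip()
        if name in known:
            understood[name] = value
        else:
            skipped.append(name)          # unknown header: ignore and carry on
    required = [item.strip().lower()
                for item in understood.get("require", "").split(",")
                if item.strip()]
    # Anything demanded by Require that we do not understand must be rejected.
    unsupported = [item for item in required if item not in known]
    return start_line, understood, skipped, unsupported
```

A caller would answer with an error whenever the unsupported list is non-empty, and otherwise process the understood headers as usual.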

The Call-ID header is the most critical header and it can be found in virtually every request and response. It is a unique string that servers use to associate a message with the resources associated with a call.
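One way to picture the role of Call-ID is as the key of a lookup table. In this Python sketch, the CallTable class and its state names are hypothetical; a real server would track far more per call.

```python
class CallTable:
    """Maps each Call-ID to the resources (here, just a state) held for that call."""

    def __init__(self):
        self._calls = {}

    def handle(self, method, call_id):
        """Update the table for one request and return the call's record, if any."""
        if method == "INVITE":
            self._calls.setdefault(call_id, {"state": "inviting"})
        elif method == "ACK" and call_id in self._calls:
            self._calls[call_id]["state"] = "established"
        elif method == "BYE":
            self._calls.pop(call_id, None)    # release everything tied to the call
        return self._calls.get(call_id)
```

Every message carrying the same Call-ID lands on the same record, which is how a server keeps concurrent calls apart.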

SIP commands are processed by three different types of servers:

Proxy servers receive SIP messages and forward the message for additional processing.
Redirect servers, by contrast, do not forward messages. Instead, they inform SIP clients of the address that will service their messages (this address can be a proxy, redirect, or user-agent server). The client then uses the address provided by the redirect server to connect the call. See Table 1 for additional comparisons of SIP and H.323 message flow.
User-agent servers are the final type. They interface with end users and let them accept, reject, and forward calls.
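The difference between proxy and redirect behavior can be sketched as follows in Python. The location table, the addresses, and the forward callback are all assumptions for illustration; the status lines (302 Moved Temporarily, 404 Not Found) are standard SIP responses.

```python
# Hypothetical location database: public address -> where the user currently is.
LOCATIONS = {"sip:bob@example.com": "sip:bob@host7.example.com"}

def redirect_server(request_uri, locations):
    """Tell the client where to go; the client retries there itself."""
    target = locations.get(request_uri)
    if target is None:
        return "SIP/2.0 404 Not Found", None
    return "SIP/2.0 302 Moved Temporarily", target

def proxy_server(request_uri, locations, forward):
    """Forward the request on the client's behalf and relay the answer."""
    target = locations.get(request_uri)
    if target is None:
        return "SIP/2.0 404 Not Found"
    return forward(target)
```

With a redirect server the client does the second hop; with a proxy the server does, so the client sees only the final response.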

Ready to Rumble

Companies are divided into camps about which technology will succeed. SIP advocates point to three advantages over H.323: simplicity, ability to grow, and scalability. The primary advantage is simplicity. Because SIP is a text-based protocol with only 37 headers to parse, you can have a fully functional SIP parser running very quickly. In fact, it is possible to write a useful SIP parser by implementing only three request types (INVITE, BYE, and ACK). And because an SIP parser is lightweight, you can write it in quasi-interpreted languages (such as Perl or Java) that excel in text manipulation.

In contrast, H.323 requires you to parse hundreds of messages and complex data structures. Furthermore, because it is an intricate binary protocol, most vendors don't write an ASN.1 compiler or decompressor themselves. Rather, they purchase the ASN.1 compiler or decompressor from a third party and concentrate on receiving and processing messages.

H.323's complexity also complicates firewalls. To process H.323 traffic, a firewall must understand the intricate web of H.323 protocols and associated state information. An SIP firewall is only marginally different from an HTTP firewall: The firewall need only process a single SIP message and need not maintain state information.

H.323's binary format has one significant advantage over a text-based protocol -- speed. It is quicker to identify and process a binary value than a textual string. Unfortunately, other than call setup, speed isn't critical for most signaling messages, thereby virtually negating this advantage.

Because H.323 is a tangled web of interwoven protocols, it can be confusing to figure out the most efficient technique to accomplish a task. For instance, H.323 provides three alternatives to control a call (the original multiple-roundtrip option, tunneling the control protocol through the connection protocol, and the H.323 V2 Fast Start option). Even more daunting is that H.323 V2 applications must support all three options to maintain backward compatibility with H.323 V1.

Another side effect of H.323's numerous protocols is the duplication of functionality. For instance, both RTCP and H.245 provide similar stream feedback and conferencing controls.

The Agony of Backward Compatibility

The second advantage of SIP over H.323 is its ability to adjust to new requirements. Internet telephony is transitioning from the trial stage to production use. Before it can replace the PSTN, VoIP will have to increase reliability and quality, and dramatically increase the number of services. These changes require a signaling protocol that can easily incorporate new features. Simultaneously, as the changes are rolled out, backward compatibility must be maintained.

Both SIP and H.323 can be extended, but they use different techniques to maintain compatibility. SIP operates like HTTP and SMTP. If a parser encounters an unknown header, it skips over it. As a result, new features can be added to SIP without breaking older parsers.

SIP errors also aid in extensibility. Like HTTP errors, SIP errors are grouped into six logical types (or classes) of errors, as shown in Table 2. Because the class of an error is detected by the first digit, older applications can analyze and understand new errors by examining that first digit.
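The first-digit rule is easy to show in code. This Python sketch maps any three-digit SIP status code to its class; the class names follow the SIP specification, while the classify helper is illustrative.

```python
STATUS_CLASSES = {
    1: "Informational",
    2: "Success",
    3: "Redirection",
    4: "Client Error",
    5: "Server Error",
    6: "Global Failure",
}

def classify(status_code):
    """Return the class of a status code using only its first digit."""
    return STATUS_CLASSES.get(status_code // 100, "Unknown")
```

An older application that has never heard of a newly defined 4xx code can still treat it as a client error and react sensibly.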

H.323 gives programmers access to the nonstandardParam field in ASN.1 data structures to extend the protocol. Unfortunately, the nonstandardParam isn't in every data structure, and there's no way to determine what extensions a given implementation supports. Furthermore, because the protocol is binary, it is impossible to determine the purpose of these extensions without talking with the person who created them. Thus, the difficulty of cleanly extending H.323 is the reason why interoperability has been so difficult to achieve.

One unfair advantage SIP has over H.323 is that it is still an evolving draft, while H.323 has endured two revisions and has considerable backward compatibility baggage to maintain. As a result, SIP designers are able to point out the deficiencies in H.323 while constantly refining their protocol to address weaknesses they uncover. At some point, SIP will be officially released and the SIP designers will not be able to modify the specification as quickly.

Modularity Is the Key to Growth

The telephony industry is migrating away from proprietary protocols to open standards. In addition, customers want solutions that interoperate with each other. For instance, a typical network may contain H.323 endpoints and SIP servers.

The only way to ensure that a signaling protocol can interoperate with other signaling protocols is to create a platform that permits you to plug in new subfunctions (such as quality control, directory listings, content description, and conference control).

SIP's architecture is like a set of Lego blocks. The core Lego block defines basic call signaling, registration, and user location. SIP plugs in additional default protocols -- or Lego blocks -- to provide a richer set of subfunctions (see Table 3). However, you can enhance SIP with additional protocols to enable cross-signaling compatibility. For instance, you can plug the H.245 capability exchange feature into SIP without affecting the core SIP protocol.

Besides plugging in subprotocols (see Table 4), users can mix SIP and H.323 requests to get the best of both environments; see Figure 6. For instance, you can use SIP's superior directory features to find another user, and the SIP server can redirect you to an H.323 server to perform the actual call.

Unlike SIP, H.323's architecture is like a den of snakes. All the protocols are closely entwined and it's dangerous, if not impossible, to try to separate them.

Scaling Walls

Scalability is the ability of a protocol to adjust to large networks and heavy load conditions. Protocols that were intended for Local Area Network (LAN) use have serious problems when they are placed on large networks. For instance, H.323 was originally designed for use on a LAN, and it had poor directory capabilities and struggled with wide area addressing.

As Figure 7 shows, H.323 V2 attempts to address these limitations by defining "zones" (collections of gateways and endpoints managed by a single gatekeeper). Although zones fix the issue of user directories, they do not provide loop detection and prevention. SIP was designed to operate in Wide Area Networks and as a result, provides loop detection algorithms to prevent infinite loop conditions.
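One simple form of loop detection relies on the Via headers that each SIP server adds as it forwards a request. The Python sketch below is a minimal illustration under that assumption: the LoopDetected exception and the function name are invented here, and a production proxy would apply more checks than a bare address comparison.

```python
class LoopDetected(Exception):
    """Raised when a request arrives at a server it has already passed through."""

def forward_with_loop_check(via_path, own_address):
    """Return the Via path with this server prepended, or raise on a loop.

    via_path lists the servers the request has traversed, most recent first.
    """
    if own_address in via_path:
        raise LoopDetected(f"{own_address} already appears in the Via path")
    return [own_address] + via_path
```

A server that catches LoopDetected would reject the request (SIP defines a 482 Loop Detected response for this) instead of forwarding it again.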

Another aspect of scalability is the number of calls that can be handled by a server. SIP permits servers to be stateless (that is, they do not have to maintain call state throughout the life of the call). Therefore, they can process a request, forward it to its destination, and move on to the next request. On the other hand, H.323 requires gatekeepers to maintain state throughout the duration of a call. The lack of state and simpler protocol syntax means a server can process more SIP requests, and therefore potentially handle more calls than an equivalent H.323 server.

Battle of the Compressors

One of the vexing problems VoIP faces is voice quality. Although current audio Compressors/Decompressors (CODECs) are dramatically better than the initial algorithms, designers are constantly refining them to improve quality while simultaneously reducing processing requirements. These rapid improvements mandate that the signaling protocol be able to pick the optimal CODEC supported by the endpoints.

SIP uses SDP to inform call parties of the CODEC that will be used during the conference. Because SDP describes CODECs via descriptive strings, it can rapidly incorporate new CODECs. For example, if you created a CODEC called "KrystalClear," you would register it with the Internet Assigned Numbers Authority (IANA), pass KrystalClear in the SDP, and endpoints that supported this CODEC could instantly begin using it. Because the IANA is the same organization that registers MIME types, this is a relatively painless process.
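Selecting a CODEC from descriptive strings can be sketched in Python as below. The offer text, the dynamic payload number 96, the 8000 Hz clock rate, and both helper functions are assumptions for illustration; KrystalClear is the made-up CODEC from the example above.

```python
OFFER = """m=audio 49170 RTP/AVP 96 0
a=rtpmap:96 KrystalClear/8000
a=rtpmap:0 PCMU/8000
"""

def offered_codecs(sdp_text):
    """Return CODEC names from the a=rtpmap lines, in the order offered."""
    names = []
    for line in sdp_text.splitlines():
        line = line.strip()
        if line.startswith("a=rtpmap:"):
            _, encoding = line[len("a=rtpmap:"):].split(None, 1)
            names.append(encoding.split("/")[0])
    return names

def pick_codec(offered, supported):
    """Return the first offered CODEC that the local endpoint also supports."""
    supported_lower = {name.lower() for name in supported}
    for name in offered:
        if name.lower() in supported_lower:
            return name
    return None
```

An endpoint that has never heard of KrystalClear simply falls through to PCMU, while one that supports it picks it up with no protocol change.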

In contrast, H.323 uses a Big Brother-like approach to control CODEC usage. H.323 only supports CODECs that have been approved and registered with a central authority. Once approved, CODECs are given a binary identifier (or code point) that is used by endpoints to describe CODECs. Although all popular CODECs are supported by H.323, this bottleneck could stifle the rapid adoption of new CODECs.

Even though H.323 is not as flexible in accommodating new CODECs, it provides more robust capability exchanges between endpoints. For example, H.323 not only lets endpoints report the types of CODECs they support, but also numerous attributes that can be used by these CODECs. In contrast, SIP provides only rudimentary capability exchanges between endpoints.

Services Are the Difference

As mentioned earlier, the lure of easy service creation is what attracts companies to VoIP. In the PSTN world, creating a new service can cost millions of dollars and require years of development. An equivalent VoIP-based service can be developed and tested within days or weeks. Consequently, the VoIP signaling protocol that has the best service creation environment will ultimately have the most applications and inevitably the most success.

SIP was specifically designed for the smooth incorporation of new services. This is accomplished via service-specific request methods (such as INVITE, BYE, and OPTIONS) and headers. These methods and headers enable an SIP server to create and destroy calls on external servers. The external entities send status messages to the controlling SIP server on a periodic basis so that the originating SIP server can adjust in real time. Thus, your ability to add services to SIP is limited only by your imagination.

In contrast, the current revisions of H.323 permit you to write service extensions, but it is an arduous process. For example, the H.323 FACILITY command lets you transfer a call to a third party. However, more exotic services, such as controlling third-party calls, require you to have intimate knowledge of the protocol and a willingness to spend vast quantities of time figuring out how to properly implement them.

Who Will Win?

Although SIP and H.323 zealots believe their protocol will crush the competition, the VoIP signaling battle is not a zero-sum game. Both H.323 and SIP will have their share of successes and each excels in different areas. SIP uses a decentralized approach that reuses principles pioneered by HTTP. Because of this open architecture, SIP will be able to adjust and grow as the VoIP market expands. H.323 is a well-defined centralized architecture that is expandable, but isn't as extensible as SIP.

In the near term, H.323 will dominate. Longer term, it is impossible to predict whether H.323, SIP, or another protocol will ultimately replace the PSTN. Only two things can be predicted with certainty -- packet-based multimedia will eventually replace our phone networks, and a variety of signaling protocols will have to interoperate in order to meet customer requirements.

For More Information

Main SIP web site: http://www.cs.columbia.edu/~hgs/sip/sip.html

SIP FAQ: http://www.cs.columbia.edu/~hgs/sip/faq.html

SIP documents: http://www.cs.columbia.edu/~hgs/sip/related.html

SIP papers: http://www.cs.columbia.edu/~hgs/sip/papers.html

DDJ
