RTP was developed by the Audio-Video Transport Working Group of the Internet Engineering Task Force (IETF) and first published in 1996 as RFC 1889, which was then superseded by RFC 3550 in 2003.[2]
RTP is designed for end-to-end, real-time transfer of streaming media. The protocol provides facilities for jitter compensation and detection of packet loss and out-of-order delivery, which are common, especially during UDP transmissions on an IP network. RTP allows data transfer to multiple destinations through IP multicast.[3] RTP is regarded as the primary standard for audio/video transport in IP networks and is used with an associated profile and payload format.[4] The design of RTP is based on the architectural principle known as application-layer framing, where protocol functions are implemented in the application as opposed to the operating system's protocol stack.
Real-time multimedia streaming applications require timely delivery of information and often can tolerate some packet loss to achieve this goal. For example, loss of a packet in an audio application may result in loss of a fraction of a second of audio data, which can be made unnoticeable with suitable error concealment algorithms.[5] The Transmission Control Protocol (TCP), although standardized for RTP use,[6] is not normally used in RTP applications because TCP favors reliability over timeliness. Instead, the majority of RTP implementations are built on the User Datagram Protocol (UDP).[5] Other transport protocols specifically designed for multimedia sessions are SCTP[7] and DCCP,[8] although, as of 2012, they were not in widespread use.[9]
RTP was developed by the Audio/Video Transport working group of the IETF standards organization. RTP is used in conjunction with other protocols such as H.323 and RTSP.[4] The RTP specification describes two protocols: RTP and RTCP. RTP is used for the transfer of multimedia data, and RTCP is used to periodically send control information and QoS parameters.[10]
The data transfer protocol, RTP, carries real-time data. Information provided by this protocol includes timestamps (for synchronization), sequence numbers (for packet loss and reordering detection) and the payload format, which indicates the encoded format of the data.[11] The control protocol, RTCP, is used for quality of service (QoS) feedback and synchronization between the media streams. The bandwidth of RTCP traffic is small compared to that of RTP, typically around 5%.[11][12]
An RTP session is established for each multimedia stream. Audio and video streams may use separate RTP sessions, enabling a receiver to selectively receive components of a particular stream.[14] The RTP and RTCP design is independent of the transport protocol. Applications most typically use UDP with port numbers in the unprivileged range (1024 to 65535).[15] The Stream Control Transmission Protocol (SCTP) and the Datagram Congestion Control Protocol (DCCP) may be used when a reliable transport protocol is desired. The RTP specification recommends even port numbers for RTP and the use of the next odd port number for the associated RTCP session.[16]: 68  A single port can be used for RTP and RTCP in applications that multiplex the protocols.[17]
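The even/odd port-pairing convention can be sketched as follows. This is a minimal illustration only: the function name, the port range searched, and the wildcard bind address are assumptions for the example, not part of any specification.

```python
import socket

def open_rtp_rtcp_pair(start=5004, end=5060):
    """Bind an even UDP port for RTP and the next odd port for RTCP.

    The even/odd pairing follows the RTP specification's recommendation;
    the port range here is an arbitrary choice for illustration.
    """
    for rtp_port in range(start, end, 2):  # step 2: try even ports only
        rtp = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        rtcp = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        try:
            rtp.bind(("0.0.0.0", rtp_port))
            rtcp.bind(("0.0.0.0", rtp_port + 1))  # RTCP on the next odd port
            return rtp, rtcp
        except OSError:  # pair unavailable; try the next even port
            rtp.close()
            rtcp.close()
    raise RuntimeError("no free RTP/RTCP port pair in range")

rtp_sock, rtcp_sock = open_rtp_rtcp_pair()
```

Applications that multiplex RTP and RTCP on a single port would skip the second bind entirely.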
RTP is designed to carry a multitude of multimedia formats, which permits the development of new formats without revising the RTP standard. To this end, the information required by a specific application of the protocol is not included in the generic RTP header. For each class of application (e.g., audio, video), RTP defines a profile and associated payload formats.[10] Every instantiation of RTP in a particular application requires profile and payload format specifications.[18]: 71
The profile defines the codecs used to encode the payload data and their mapping to payload format codes in the protocol field Payload Type (PT) of the RTP header. Each profile is accompanied by several payload format specifications, each of which describes the transport of particular encoded data.[4] Examples of audio payload formats are G.711, G.723, G.726, G.729, GSM, QCELP, MP3, and DTMF, and examples of video payloads are H.261, H.263, H.264, H.265 and MPEG-1/MPEG-2.[19] The mapping of MPEG-4 audio/video streams to RTP packets is specified in RFC 3016, and H.263 video payloads are described in RFC 2429.[20]
Examples of RTP profiles include:
The RTP profile for audio and video conferences with minimal control (RFC 3551) defines a set of static payload type assignments, and a dynamic mechanism for mapping between a payload format and a PT value using the Session Description Protocol (SDP).
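A hypothetical SDP media description illustrates both mechanisms: PT 0 is statically assigned to PCMU (G.711 μ-law) in RFC 3551, while 97 is drawn from the dynamic range (96 to 127) and bound to a codec by an `a=rtpmap` line. The port number is arbitrary.

```
m=audio 49170 RTP/AVP 0 97
a=rtpmap:0 PCMU/8000
a=rtpmap:97 iLBC/8000
```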
RTP packets are created at the application layer and handed to the transport layer for delivery. Each unit of RTP media data created by an application begins with the RTP packet header.
The RTP header has a minimum size of 12 bytes. After the header, optional header extensions may be present. This is followed by the RTP payload, the format of which is determined by the particular class of application.[23] The fields in the header are as follows:
Version: 2 bits
Indicates the version of the protocol. The current version is 2.[24]
Padding (P): 1 bit
Used to indicate if there are extra padding bytes at the end of the RTP packet. Padding may be used to fill up a block of certain size, for example, as required by an encryption algorithm. The last byte of the padding contains the number of padding bytes that were added (including itself).[16]: 12 [24]
Extension (X): 1 bit
Indicates the presence of an extension header between the header and payload data. The extension header is application or profile specific.[24]
CSRC Count (CC): 4 bits
Contains the number of CSRC identifiers (defined below) that follow the SSRC (also defined below).[16]: 12
Marker (M): 1 bit
Signaling used at the application level in a profile-specific manner. If it is set, it means that the current data has some special relevance for the application.[16]: 13
Payload Type (PT): 7 bits
Indicates the format of the payload and thus determines its interpretation by the application. Values are profile specific and may be dynamically assigned.[25]
Sequence number: 16 bits
Incremented by one for each RTP data packet sent; used by the receiver to detect packet loss and to restore packet order.
Timestamp: 32 bits
Used by the receiver to play back the received samples at appropriate time and interval. When several media streams are present, the timestamps may be independent in each stream.[a] The granularity of the timing is application specific. For example, an audio application that samples data once every 125 μs (8 kHz, a common sample rate in digital telephony) would use that value as its clock resolution. Video streams typically use a 90 kHz clock. The clock granularity is one of the details that is specified in the RTP profile for an application.[26]
SSRC: 32 bits
Synchronization Source Identifier uniquely identifies the source of a stream. The synchronization sources within the same RTP session will be unique.[16]: 15
CSRC: Variable (CSRC Count × 32 bits)
Contributing Source IDs enumerate contributing sources to a stream that has been generated from multiple sources.[16]: 15
Header Extension: Variable; Exists when X=1
When the extension bit (X) is set, this optional field contains:
Profile-specific Extension Header ID: 16 bits
a profile-specific identifier
Extension Header Length: 16 bits
indicates the length of the extension in 32-bit units, excluding the 32 bits of the extension header.
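The field layout above can be exercised with a short parser. This is a minimal sketch, not a complete implementation: it decodes the fixed 12-byte header, any CSRC list, the optional extension, and the padding rule, but leaves payload format decoding to the application.

```python
import struct

def parse_rtp_header(packet: bytes) -> dict:
    """Parse an RTP packet per the fixed header layout (minimal sketch)."""
    if len(packet) < 12:
        raise ValueError("packet shorter than the fixed 12-byte RTP header")
    b0, b1, seq, ts, ssrc = struct.unpack("!BBHII", packet[:12])
    version = b0 >> 6          # Version: 2 bits (current version is 2)
    padding = (b0 >> 5) & 1    # P: padding bytes present at end of packet
    extension = (b0 >> 4) & 1  # X: extension header follows fixed header
    cc = b0 & 0x0F             # CC: number of CSRC identifiers
    marker = b1 >> 7           # M: profile-specific marker bit
    pt = b1 & 0x7F             # PT: payload type
    csrcs = [struct.unpack("!I", packet[12 + 4 * i:16 + 4 * i])[0]
             for i in range(cc)]
    offset = 12 + 4 * cc
    if extension:
        ext_id, ext_len = struct.unpack("!HH", packet[offset:offset + 4])
        # length is in 32-bit units, excluding the 4-byte extension header
        offset += 4 + 4 * ext_len
    payload = packet[offset:]
    if padding:
        # last byte holds the padding count, including itself
        payload = payload[:-payload[-1]]
    return {"version": version, "marker": marker, "payload_type": pt,
            "sequence": seq, "timestamp": ts, "ssrc": ssrc,
            "csrcs": csrcs, "payload": payload}
```

For example, parsing the bytes `b"\x80\x60" + struct.pack("!HII", 1234, 3000, 0xDEADBEEF) + b"audio"` yields version 2, payload type 96, sequence number 1234, timestamp 3000, and the payload `b"audio"`.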
A functional multimedia application requires other protocols and standards used in conjunction with RTP. Protocols such as SIP, Jingle, RTSP, H.225 and H.245 are used for session initiation, control and termination. Other standards, such as H.264, MPEG and H.263, are used for encoding the payload data as specified by the applicable RTP profile.[27]
An RTP sender captures the multimedia data, then encodes, frames and transmits it as RTP packets with appropriate timestamps and increasing sequence numbers. The sender sets the payload type field in accordance with connection negotiation and the RTP profile in use. The RTP receiver detects missing packets and may reorder packets. It decodes the media data in the packets according to the payload type and presents the stream to its user.[27]
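The sender's bookkeeping can be sketched as follows, assuming 20 ms frames of G.711 μ-law audio (PT 0) on the 8 kHz clock discussed above, so the timestamp advances by 8000 × 0.020 = 160 ticks per packet. The class name and frame size are assumptions for this illustration.

```python
import random
import struct

SAMPLE_RATE = 8000  # 8 kHz audio clock, as in digital telephony
FRAME_MS = 20       # one packet per 20 ms frame (assumed for this sketch)
TS_INCREMENT = SAMPLE_RATE * FRAME_MS // 1000  # 160 ticks per packet

class RtpSender:
    """Minimal RTP packetizer sketch: fixed 12-byte header only, no CSRCs."""

    def __init__(self, payload_type=0):
        self.payload_type = payload_type
        self.seq = random.randint(0, 0xFFFF)            # random initial values,
        self.timestamp = random.randint(0, 0xFFFFFFFF)  # as the spec suggests
        self.ssrc = random.randint(0, 0xFFFFFFFF)       # random source id

    def packetize(self, frame: bytes) -> bytes:
        header = struct.pack("!BBHII",
                             2 << 6,                   # V=2, P=0, X=0, CC=0
                             self.payload_type & 0x7F, # M=0, PT in low 7 bits
                             self.seq, self.timestamp, self.ssrc)
        self.seq = (self.seq + 1) & 0xFFFF                       # 16-bit wrap
        self.timestamp = (self.timestamp + TS_INCREMENT) & 0xFFFFFFFF
        return header + frame
```

A video sender would differ mainly in the clock: at 90 kHz and 30 frames per second, the timestamp would advance by 3000 ticks per frame.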