Introduction to WebRTC
Web Real-Time Communications (WebRTC) is a browser technology that enables the retrieval of media streams from physical media sources and the exchange of media streams, or any other data, in real time. It comprises three APIs: the MediaStream constructor, the RTCPeerConnection constructor, and the RTCDataChannel interface.
In short, MediaStream is used to retrieve the stream of a physical media source, RTCPeerConnection is used to exchange MediaStream among peers in real time, and finally, RTCDataChannel is used to exchange arbitrary data among peers.
Let's see how these APIs work.
MediaStream API
The two main components of the MediaStream API are the MediaStream constructor and the MediaStreamTrack interface.
A track represents the stream of a media source and implements the MediaStreamTrack interface. A track can be either an audio track or a video track; that is, a track attached to an audio source is an audio track, and a track attached to a video source is a video track. Multiple tracks can be attached to a particular media source. We can also attach constraints to a track. For example, a track attached to a webcam can have constraints such as minimum video resolution and FPS. Each track has its own constraints.
You can change the constraints of a track after it's created using the applyConstraints() method of the MediaStreamTrack interface. You can retrieve the constraints applied to a track at any time using the getConstraints() method, and the track's current settings using the getSettings() method. To detach a track from its media source, that is, to stop the track permanently, we can use the stop() method of the MediaStreamTrack interface. To pause a track, that is, to stop it temporarily, we can assign false to the enabled property of the MediaStreamTrack interface.
Note
Find out more about the MediaStreamTrack interface at https://developer.mozilla.org/en-US/docs/Web/API/MediaStreamTrack.
A track can be either a local track or a remote track. A local track represents the stream of a local media source, whereas a remote track represents the stream of a remote media source. You cannot apply constraints to a remote track. To find out whether a track is local or remote, we can use the remote property of the MediaStreamTrack interface.
Note
We will come across remote tracks while exchanging tracks between peers. When we send a local track to a peer, the other peer receives a remote version of the track.
A MediaStream holds multiple tracks together. Technically, it doesn't do anything itself; it just represents a group of tracks that should be played, stored, or transferred together in a synchronized manner.
Note
Find out more about the MediaStream constructor at https://developer.mozilla.org/en/docs/Web/API/MediaStream.
The getSources() method of the MediaStreamTrack object allows us to retrieve the IDs of all the media devices, such as speakers, microphones, webcams, and so on. We can use an ID to create a track if the ID represents a media input device. The following example demonstrates this:
MediaStreamTrack.getSources(function(sources) {
  for (var count = 0; count < sources.length; count++) {
    console.log("Source " + (count + 1) + " info:");
    console.log("ID is: " + sources[count].id);

    if (sources[count].label == "") {
      console.log("Name of the source is: unknown");
    } else {
      console.log("Name of the source is: " + sources[count].label);
    }

    console.log("Kind of source: " + sources[count].kind);

    if (sources[count].facing == "") {
      console.log("Source facing: unknown");
    } else {
      console.log("Source facing: " + sources[count].facing);
    }
  }
});
The output will vary for everyone. Here is the output I got:
Source 1 info:
ID is: 0c1cb4e9e97088d405bd65ea5a44a20dab2e9da0d298438f82bab57ff9787675
Name of the source is: unknown
Kind of source: audio
Source facing: unknown
Source 2 info:
ID is: 68fb69033c86a4baa4a03f60cac9ad1c29a70f208e392d3d445f3c2d6731f478
Name of the source is: unknown
Kind of source: audio
Source facing: unknown
Source 3 info:
ID is: c83fc025afe6c7841a1cbe9526a6a4cb61cdc7d211dd4c3f10405857af0776c5
Name of the source is: unknown
Kind of source: video
Source facing: unknown
navigator.getUserMedia
There are various APIs that return a MediaStream with tracks in it. One such method is navigator.getUserMedia(). Using navigator.getUserMedia(), we can retrieve a stream from media input sources, such as microphones, webcams, and so on. The following example demonstrates this:
navigator.getUserMedia = navigator.getUserMedia ||
                         navigator.webkitGetUserMedia ||
                         navigator.mozGetUserMedia;

var constraints = {
  audio: true,
  video: {
    mandatory: {
      minWidth: 640,
      minHeight: 360
    },
    optional: [{ minWidth: 1280 }, { minHeight: 720 }]
  }
};

var av_stream = null;

navigator.getUserMedia(constraints, function(mediastream) {
  av_stream = mediastream; // this is the MediaStream
}, function(err) {
  console.log("Failed to get MediaStream", err);
});
When you run the preceding code, the browser will display a popup seeking permission from the user. The user has to grant the code permission to access the media input devices.
By default, which media input devices the tracks are attached to while using getUserMedia() depends on the browser. Some browsers let the user choose the audio and video devices they want to use, while other browsers use the default audio and video devices listed in the operating system configuration.
We can also provide a sourceId property, assigned to the ID of a media input device, inside the mandatory property of the constraint object's audio or video property to make getUserMedia() attach tracks to that device. So, if there are multiple webcams and microphones, you can use MediaStreamTrack.getSources() to let the user choose a media input device and provide that device's ID to getUserMedia(), instead of relying on the browser, which doesn't guarantee that it will let the user choose a media input device.
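As a sketch of this, the constraint object below pins the tracks to specific devices. The two device IDs are hypothetical placeholders; in a real page, they would come from the id property of the sources passed to the MediaStreamTrack.getSources() callback shown earlier:

```javascript
// Hypothetical device IDs; in practice these come from
// MediaStreamTrack.getSources() (the "id" property of each source).
var audioSourceId = "audio-device-id";
var videoSourceId = "video-device-id";

// sourceId goes inside the "mandatory" property of the audio/video
// constraints, so getUserMedia() attaches tracks to these exact devices.
var constraints = {
  audio: { mandatory: { sourceId: audioSourceId } },
  video: { mandatory: { sourceId: videoSourceId } }
};

// In the browser, this object would then be passed to:
// navigator.getUserMedia(constraints, successCallback, errorCallback);
```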
The first parameter that getUserMedia() takes is a constraint object with audio and video track constraints. Mandatory constraints are those that must be applied. Optional constraints are not essential and can be omitted if it's not possible to apply them.
Some important constraints of an audio track are volume, sampleRate, sampleSize, and echoCancellation. Some important constraints of a video track are aspectRatio, facingMode, frameRate, height, and width. If a constraint is not provided, its default value is used.
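Here is a sketch of a fuller constraint object using the constraints named above, in the same mandatory/optional syntax as the earlier example. The values are illustrative only, and the exact constraint names supported vary between browsers:

```javascript
// Illustrative constraint object; values are examples, not recommendations.
var constraints = {
  audio: {
    mandatory: {
      echoCancellation: true   // must be applied
    },
    optional: [                // applied only if possible
      { sampleRate: 44100 },
      { sampleSize: 16 },
      { volume: 1.0 }
    ]
  },
  video: {
    mandatory: {
      minWidth: 1280,          // width constraint
      minHeight: 720,          // height constraint
      minFrameRate: 24         // frameRate constraint
    },
    optional: [
      { minAspectRatio: 1.777 },
      { facingMode: "user" }   // e.g. front-facing camera on mobile
    ]
  }
};
```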
You can simply set the audio or video property to false if you don't want to create an audio or video track, respectively.
We can retrieve the tracks of a MediaStream using the getTracks() method of MediaStream. Similarly, we can add or remove a track using the addTrack() and removeTrack() methods, respectively. Whenever a track is added, the onaddtrack event is triggered. Similarly, whenever a track is removed, the onremovetrack event is triggered.
If we already have some tracks, then we can directly use the MediaStream constructor to create a MediaStream with those tracks. The MediaStream constructor takes an array of tracks and returns a MediaStream with references to those tracks added to it.
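The track-management methods above can be combined into small helpers. The function below is a hypothetical sketch that swaps a stream's audio track for a new one; it relies only on the getTracks(), removeTrack(), and addTrack() methods described above, plus the kind property of a track, so in the browser you would pass it a real MediaStream and MediaStreamTrack:

```javascript
// Hypothetical helper: replace the audio track(s) of a stream with a
// new one. "stream" is a MediaStream, "newAudioTrack" an audio track.
function replaceAudioTrack(stream, newAudioTrack) {
  stream.getTracks().forEach(function(track) {
    if (track.kind === "audio") {
      stream.removeTrack(track); // fires the stream's onremovetrack
    }
  });
  stream.addTrack(newAudioTrack); // fires the stream's onaddtrack
}
```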
An API that reads data from the tracks of a MediaStream is called a MediaStream consumer. Some MediaStream consumers are the <audio> tag, the <video> tag, RTCPeerConnection, the MediaRecorder API, the Image Capture API, the Web Audio API, and so on.
Here is an example that demonstrates how to display the data of the tracks of a MediaStream in the <video> tag:
<!doctype html>
<html>
<body>
  <video id="myVideo"></video>
  <br>
  <input value="Pause" onclick="pause()" type="button" />
  <script type="text/javascript">
    navigator.getUserMedia = navigator.getUserMedia ||
                             navigator.webkitGetUserMedia ||
                             navigator.mozGetUserMedia;

    var constraints = {
      audio: true,
      video: true
    };

    var av_stream = null;

    navigator.getUserMedia(constraints, function(mediastream) {
      av_stream = mediastream;
      document.getElementById("myVideo").setAttribute("src", URL.createObjectURL(mediastream));
      document.getElementById("myVideo").play();
    }, function(err) {
      console.log("Failed to get MediaStream", err);
    });

    function pause() {
      av_stream.getTracks()[0].enabled = !av_stream.getTracks()[0].enabled;
      av_stream.getTracks()[1].enabled = !av_stream.getTracks()[1].enabled;
    }
  </script>
</body>
</html>
Here we have a <video> tag and a button to pause it. A <video> tag takes a URL and displays the resource.
Note
Before HTML5, HTML tags and CSS attributes could only read data from http:// and file:// URLs. However, in HTML5, they can read blob://, data://, mediastream://, and other such URLs.
To display the output of a MediaStream in the <video> tag, we need to use the URL.createObjectURL() method, which takes a blob, file object, or MediaStream and provides a URL to read its data. URL.createObjectURL() takes extra memory and CPU time to provide access to the value passed to it via a URL; therefore, it is wise to release the URL using URL.revokeObjectURL() when we don't need it anymore.
If there are multiple audio and video tracks in a MediaStream, then <video> reads the first audio and video tracks.
RTCPeerConnection API
RTCPeerConnection allows two browsers to exchange MediaStream in real time. An RTCPeerConnection is an instance of the RTCPeerConnection constructor.
Establishing peer-to-peer connection
For a peer-to-peer connection to be established, a signaling server is needed. Through the signaling server, the peers exchange the data required to establish a peer-to-peer connection. The actual data transfer takes place directly between the peers; the signaling server is just used to exchange the prerequisites for establishing the connection. Both peers can disconnect from the signaling server once the peer-to-peer connection has been established. The signaling server doesn't need to be a highly configured server, as the actual data is not transferred through it. The data transferred for a single peer-to-peer connection amounts to a few kilobytes, so a modest server can be used for signaling.
A signaling server usually uses a signaling protocol, but an HTTP server is also fine as long as it can pass messages between two peers. WebRTC doesn't force us to use any particular signaling protocol.
For example, say that there are two users, Alice and Bob, on two different browsers. If Alice wants to establish a peer-to-peer connection with Bob for chatting, then this is how a peer-to-peer connection would be established between them:
- They both will connect to a signaling server.
- Alice will then send a request to Bob via the signaling server, requesting to chat.
- The signaling server can optionally check whether Alice is allowed to chat with Bob, and also if Alice and Bob are logged in. If yes, then the signaling server passes the message to Bob.
- Bob receives the request and sends a message to Alice via the signaling server, confirming the establishment of a peer-to-peer connection.
- Now both of them need to exchange messages related to session control, network configuration, and media capabilities. All these messages are exchanged between them by RTCPeerConnection. So, they both need to create an RTCPeerConnection, initiate it, and attach an event handler to RTCPeerConnection that will be triggered by RTCPeerConnection when it wants to send a message via the signaling server. RTCPeerConnection passes the message to the event handler in the Session Description Protocol (SDP) format, and the messages for RTCPeerConnection received from the signaling server must be fed to RTCPeerConnection in the SDP format; that is, RTCPeerConnection only understands the SDP format. You need to use your own programming logic to split custom messages from messages meant for RTCPeerConnection.
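Splitting your own messages from the SDP messages meant for RTCPeerConnection is plain application logic. One common sketch is to wrap every signaling message in a JSON envelope with a type field; the envelope format and handler names here are assumptions, not part of WebRTC:

```javascript
// Hypothetical envelope format for the signaling channel: every message
// is JSON with a "type" field. "sdp" messages are fed to
// RTCPeerConnection; everything else is handled by the application.
function routeSignalingMessage(json, handlers) {
  var message = JSON.parse(json);
  if (message.type === "sdp") {
    handlers.onSdp(message.payload);   // e.g. pc.setRemoteDescription(...)
  } else if (message.type === "chat") {
    handlers.onChat(message.payload);  // custom application message
  } else {
    handlers.onUnknown(message);
  }
  return message.type;
}
```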
The preceding steps seem to have no problem; however, there are some major problems. The peers may be behind a NAT device or firewall, so finding their public IP addresses is a challenging task, and sometimes it is practically impossible. So, how does RTCPeerConnection find the IP address of a peer when it may be behind a NAT device or firewall?
RTCPeerConnection uses a technique called Interactive Connectivity Establishment (ICE) to resolve all these issues.
ICE involves Session Traversal Utilities for NAT (STUN) and Traversal Using Relays around NAT (TURN) servers to solve these problems. A STUN server is used to find the public IP address of a peer. In case the IP address of a peer cannot be found, or a peer-to-peer connection cannot be established for some other reason, a TURN server is used to relay the traffic, that is, both peers communicate via the TURN server.
We just need to provide the addresses of the STUN and TURN servers, and RTCPeerConnection handles the rest. Google provides a public STUN server, which is used by everyone. Running a TURN server requires a lot of resources, as the actual data flows through it. Therefore, WebRTC makes the use of a TURN server optional. If RTCPeerConnection fails to establish direct communication between two peers and a TURN server is not provided, there is no other way for the peers to communicate, and the peer-to-peer connection establishment fails.
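The STUN and TURN addresses are passed to the RTCPeerConnection constructor in its configuration object. The sketch below uses Google's public STUN server; the TURN URL and credentials are hypothetical placeholders you would replace with your own server's details:

```javascript
// ICE configuration: a public STUN server plus an optional TURN server.
// The TURN entry (URL, username, credential) is a hypothetical placeholder.
var iceConfiguration = {
  iceServers: [
    { urls: "stun:stun.l.google.com:19302" },
    {
      urls: "turn:turn.example.com:3478",
      username: "webrtc-user",
      credential: "secret"
    }
  ]
};

// In the browser:
// var pc = new RTCPeerConnection(iceConfiguration);
```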
Note
WebRTC doesn't provide any way to make signaling secure. It's your job to make the signaling secure.
Transferring MediaStream
We saw how RTCPeerConnection establishes a peer-to-peer connection. Now, to transfer a MediaStream, we just need to pass a reference to the MediaStream to RTCPeerConnection, and it will transfer the MediaStream to the connected peer.
Note
When we say that a MediaStream is transferred, we mean that the streams of its individual tracks are transferred.
The following are some of the things you need to know regarding the transfer of MediaStream:
- RTCPeerConnection uses SRTP as an application layer protocol and UDP as a transport layer protocol to transfer MediaStream. SRTP is designed for media stream transfer in real time.
- UDP doesn't guarantee the order of packets, but SRTP takes care of the order of the frames.
- The Datagram Transport Layer Security (DTLS) protocol is used to secure the MediaStream transfer. So, you don't have to worry about security while transferring MediaStream.
- The constraints of the tracks that the remote peer receives may be different from the constraints of the local tracks, as RTCPeerConnection modifies the stream automatically, depending on bandwidth and other network factors, to speed up the transfer and achieve real-time data transfer. For example, RTCPeerConnection may decrease the resolution and frame rate of a video stream while transferring it.
- If you add or remove a track from a MediaStream that is already being sent, then RTCPeerConnection updates the MediaStream of the other peer by communicating with it via the signaling server.
- If you pause a track that is being sent, then RTCPeerConnection pauses the transfer of the track.
- If you stop a track that is being sent, RTCPeerConnection stops the transfer of the track.
Note
You can send and receive multiple MediaStream instances via a single RTCPeerConnection; that is, you don't have to create multiple RTCPeerConnection instances to send and receive multiple MediaStream instances to and from a peer. Whenever you add a MediaStream to or remove one from RTCPeerConnection, the peers exchange the related information via the signaling server.
RTCDataChannel API
RTCDataChannel is used to transfer arbitrary data, other than MediaStream, between peers. The mechanism to establish a peer-to-peer connection to transfer arbitrary data is similar to the mechanism explained in the earlier section.
RTCDataChannel is an object that implements the RTCDataChannel interface.
The following are some of the things you need to know regarding RTCDataChannel:
- RTCDataChannel uses SCTP over UDP as a transport layer protocol to transfer data. It doesn't use the unlayered SCTP protocol, as the SCTP protocol is not supported by many operating systems.
- SCTP can be configured for reliability and delivery order, unlike UDP, which is unreliable and unordered.
- RTCDataChannel also uses DTLS to secure data transfer. So, you don't have to worry about security at all while transferring data via RTCDataChannel.
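The reliability and ordering configuration mentioned above is chosen per channel, via the options object passed to createDataChannel(). Here is a sketch; the channel label and the surrounding RTCPeerConnection are illustrative:

```javascript
// Options for an unreliable, unordered channel (UDP-like semantics):
// out-of-order delivery is allowed and lost messages are not retransmitted.
var lossyOptions = {
  ordered: false,     // don't enforce delivery order
  maxRetransmits: 0   // don't retransmit lost messages
};

// Omitting the options gives a reliable, ordered (TCP-like) channel.
// In the browser, on an existing RTCPeerConnection "pc":
// var channel = pc.createDataChannel("game-state", lossyOptions);
// channel.onmessage = function(event) { console.log(event.data); };
```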
Note
We can have multiple peer-to-peer connections open between browsers. For example, we can have three peer-to-peer connections: the first for webcam stream transfer, the second for text message transfer, and the third for file transfer.