Modern JavaScript Applications

Introduction to WebRTC

Web Real-Time Communications (WebRTC) is a browser technology that enables the retrieval of media streams from physical media sources and the exchange of media streams, or any other data, in real time. It comprises three APIs: the MediaStream constructor, the RTCPeerConnection constructor, and the RTCDataChannel interface.

In short, MediaStream is used to retrieve the stream of a physical media source, RTCPeerConnection is used to exchange a MediaStream between peers in real time, and finally, RTCDataChannel is used to exchange arbitrary data between peers.

Let's see how these APIs work.

MediaStream API

Two main components of MediaStream API are the MediaStream constructor and MediaStreamTrack interface.

A track represents the stream of a media source and implements the MediaStreamTrack interface. A track can be either an audio track or a video track; that is, a track attached to an audio source is an audio track, and a track attached to a video source is a video track. There can be multiple tracks attached to a particular media source. We can also attach constraints to a track. For example, a track attached to a webcam can have constraints such as a minimum video resolution and frame rate. Each track has its own constraints.

You can change the constraints of a track after it's created using the applyConstraints() method of the MediaStreamTrack interface, and you can retrieve the constraints applied to a track at any time using the getConstraints() method (the getSettings() method returns the values currently in effect). To detach a track from its media source, that is, to stop the track permanently, we can use the stop() method of the MediaStreamTrack interface. To pause a track, that is, to stop it temporarily, we can assign false to the enabled property of the MediaStreamTrack interface.
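
The following is a minimal sketch of these methods and properties, assuming track is a MediaStreamTrack obtained elsewhere (for example, from getUserMedia()); the constraint values are illustrative:

track.applyConstraints({frameRate: {max: 30}}).then(function(){
  console.log("Requested constraints: " + JSON.stringify(track.getConstraints()));
  console.log("Values currently in effect: " + JSON.stringify(track.getSettings()));
}).catch(function(err){
  console.log("Could not apply constraints", err);
});

track.enabled = false; // pause the track temporarily
track.enabled = true;  // resume it
track.stop();          // detach the track from its source permanently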

Note

Find out more about the MediaStreamTrack interface at https://developer.mozilla.org/en-US/docs/Web/API/MediaStreamTrack.

A track can be either a local or a remote track. A local track represents the stream of a local media source, whereas a remote track represents the stream of a remote media source. You cannot apply constraints to a remote track. To find out whether a track is local or remote, we can use the remote property of the MediaStreamTrack interface.

Note

We will come across the remote track while exchanging tracks between peers. When we send a local track to a peer, the other peer receives the remote version of the track.

A MediaStream holds multiple tracks together. Technically, it doesn't do anything. It just represents a group of tracks that should be played, stored, or transferred together in a synchronized manner.

Note

Find out more about the MediaStream constructor at https://developer.mozilla.org/en/docs/Web/API/MediaStream.

The getSources() method of the MediaStreamTrack object allows us to retrieve the IDs of all the media devices, such as speakers, microphones, webcams, and so on. We can use an ID to create a track if the ID represents a media input device. The following is an example that demonstrates this:

MediaStreamTrack.getSources(function(sources){
  for(var count = 0; count < sources.length; count++)
  {
    console.log("Source " + (count + 1) + " info:");
    console.log("ID is: " + sources[count].id);

    if(sources[count].label == "")
    {
      console.log("Name of the source is: unknown");
    }
    else
    {
      console.log("Name of the source is: " + sources[count].label);
    }
    
    console.log("Kind of source: " + sources[count].kind);

    if(sources[count].facing == "")
    {
      console.log("Source facing: unknown");
    }
    else
    {
      console.log("Source facing: " + sources[count].facing);
    }
  }
})

The output will vary for everyone. Here is the output I got:

Source 1 info:
ID is: 0c1cb4e9e97088d405bd65ea5a44a20dab2e9da0d298438f82bab57ff9787675
Name of the source is: unknown
Kind of source: audio
Source facing: unknown
Source 2 info:
ID is: 68fb69033c86a4baa4a03f60cac9ad1c29a70f208e392d3d445f3c2d6731f478
Name of the source is: unknown
Kind of source: audio
Source facing: unknown
Source 3 info:
ID is: c83fc025afe6c7841a1cbe9526a6a4cb61cdc7d211dd4c3f10405857af0776c5
Name of the source is: unknown
Kind of source: video
Source facing: unknown

navigator.getUserMedia

There are various APIs that return a MediaStream with tracks in it. One of them is the navigator.getUserMedia() method. Using navigator.getUserMedia(), we can retrieve a stream from media input sources, such as microphones, webcams, and so on. The following is an example that demonstrates this:

navigator.getUserMedia = navigator.getUserMedia || navigator.webkitGetUserMedia || navigator.mozGetUserMedia;

var constraints = {
  audio: true, 
  video: {
    mandatory: {
      minWidth: 640,
      minHeight: 360
    },
    optional: [{
      minWidth: 1280
    }, {
      minHeight: 720
    }]
  }
}

var av_stream = null;

navigator.getUserMedia(constraints, function(mediastream){
  av_stream = mediastream; //this is the MediaStream
}, function(err){
  console.log("Failed to get MediaStream", err);
});

When you run the preceding code, the browser will display a popup seeking permission from the user. The user has to grant the code permission to access the media input devices.

By default, which media input devices the tracks are attached to when using getUserMedia() depends on the browser. Some browsers let the user choose the audio and video device that they want to use, while other browsers use the default audio and video devices listed in the operating system configuration.

We can also provide the sourceId property, set to the ID of a media input device, inside the mandatory property of the constraint object's audio or video property to make getUserMedia() attach the tracks to that device. So, if there are multiple webcams and microphones, you can use MediaStreamTrack.getSources() to let the user choose a media input device and pass that device's ID to getUserMedia(), instead of relying on the browser, which may not let the user choose a media input device at all.
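
The following is a minimal sketch of this, reusing the legacy getSources() method and the mandatory constraint syntax shown earlier; the variable names are illustrative:

MediaStreamTrack.getSources(function(sources){
  var videoSourceId = null;

  // Pick the first video input source; a real application would let the user choose.
  for(var count = 0; count < sources.length; count++)
  {
    if(sources[count].kind == "video")
    {
      videoSourceId = sources[count].id;
      break;
    }
  }

  var constraints = {
    audio: true,
    video: {
      mandatory: {
        sourceId: videoSourceId
      }
    }
  };

  navigator.getUserMedia(constraints, function(mediastream){
    console.log("Got a stream from the selected webcam");
  }, function(err){
    console.log("Failed to get MediaStream", err);
  });
});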

The first parameter that getUserMedia() takes is a constraint object with audio and video track constraints. Mandatory constraints are constraints that must be applied, whereas optional constraints are not essential and can be omitted if it's not possible to apply them.

Some important constraints of an audio track are volume, sampleRate, sampleSize, and echoCancellation. Some important constraints of a video track are aspectRatio, facingMode, frameRate, height, and width. If a constraint is not provided, then its default value is used.
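
For example, using the standardized constraint syntax supported by newer browsers (older browsers may require the mandatory/optional form shown earlier), a constraint object with some of these named constraints could look like the following sketch; the values are illustrative:

var constraints = {
  audio: {
    echoCancellation: true,
    sampleRate: 44100
  },
  video: {
    width: {min: 640, ideal: 1280},
    height: {min: 360, ideal: 720},
    frameRate: {max: 30},
    facingMode: "user"
  }
};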

You can simply set the audio or video property to false if you don't want to create the audio or video tracks respectively.

We can retrieve the tracks of a MediaStream using its getTracks() method. Similarly, we can add or remove a track using the addTrack() and removeTrack() methods, respectively. Whenever the browser adds a track to a stream (for example, a remote stream), the onaddtrack event is triggered; similarly, whenever the browser removes a track, the onremovetrack event is triggered.
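
Here is a minimal sketch, assuming av_stream is the MediaStream obtained from getUserMedia() earlier and extra_track is a MediaStreamTrack obtained elsewhere (both names are illustrative):

av_stream.onaddtrack = function(event){
  console.log("A track was added: " + event.track.kind);
};

av_stream.onremovetrack = function(event){
  console.log("A track was removed: " + event.track.kind);
};

console.log("Number of tracks: " + av_stream.getTracks().length);

av_stream.addTrack(extra_track);    // add a track to the stream
av_stream.removeTrack(extra_track); // remove it again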

If we already have some tracks, then we can directly use the MediaStream constructor to create a MediaStream containing them. The MediaStream constructor takes an array of tracks and returns a MediaStream with references to those tracks added to it.
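
A minimal sketch, assuming audio_track and video_track are existing MediaStreamTrack objects (illustrative names):

var combined_stream = new MediaStream([audio_track, video_track]);
console.log("Number of tracks: " + combined_stream.getTracks().length); // 2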

An API that reads data from tracks of MediaStream is called a MediaStream consumer. Some of the MediaStream consumers are the <audio> tag, <video> tag, RTCPeerConnection, Media Recorder API, Image Capture API, Web Audio API, and so on.

Here is an example that demonstrates how to display data of tracks of MediaStream in the video tag:

<!doctype html>
<html>
  <body>

    <video id="myVideo"></video>
    <br>
    <input value="Pause" onclick="pause()" type="button" />

    <script type="text/javascript">

      navigator.getUserMedia = navigator.getUserMedia || navigator.webkitGetUserMedia || navigator.mozGetUserMedia;

      var constraints = {
        audio: true, 
        video: true
      }

      var av_stream = null;

      navigator.getUserMedia(constraints, function(mediastream){

        av_stream = mediastream;

        document.getElementById("myVideo").setAttribute("src", URL.createObjectURL(mediastream));
        document.getElementById("myVideo").play();
      }, function(err){
        console.log("Failed to get MediaStream", err);
      });

      function pause()
      {
        av_stream.getTracks()[0].enabled = !av_stream.getTracks()[0].enabled;
        av_stream.getTracks()[1].enabled = !av_stream.getTracks()[1].enabled;
      }

    </script>
  </body>
</html>

Here we have a <video> tag and a button to pause it. A video tag takes a URL and displays the resource.

Note

Before HTML5, HTML tags and CSS attributes could only read data from http:// and file:// URLs. In HTML5, however, they can also read blob:, data:, mediastream:, and other such URLs.

To display the output of a MediaStream in the <video> tag, we need to use the URL.createObjectURL() method, which takes a blob, file object, or MediaStream and provides a URL that can be used to read its data. URL.createObjectURL() consumes extra memory and CPU time to keep the passed value accessible via the URL; therefore, it is wise to release the URL using URL.revokeObjectURL() when we don't need it anymore.
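
A minimal sketch, assuming mediastream is the MediaStream obtained from getUserMedia():

var stream_url = URL.createObjectURL(mediastream);
document.getElementById("myVideo").setAttribute("src", stream_url);

// Later, when the stream no longer needs to be displayed:
URL.revokeObjectURL(stream_url);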

If there are multiple audio and video tracks in MediaStream, then <video> reads the first audio and video tracks.

RTCPeerConnection API

The RTCPeerConnection API allows two browsers to exchange MediaStream in real time. An RTCPeerConnection object is an instance of the RTCPeerConnection constructor.

Establishing peer-to-peer connection

For a peer-to-peer connection to be established, a signaling server is needed. Through the signaling server, the peers exchange the data required to establish a peer-to-peer connection. The actual data transfer takes place directly between the peers; the signaling server is only used to exchange the prerequisites for establishing a peer-to-peer connection. Both peers can disconnect from the signaling server once the peer-to-peer connection has been established. The signaling server doesn't need to be a highly configured server, as the actual data is not transferred through it. The data exchanged over the signaling server for a single peer-to-peer connection amounts to just a few kilobytes, so a modest server can be used for signaling.

A signaling server usually uses a signaling protocol, but it is also okay if it's an HTTP server as long as it can pass messages between two peers. WebRTC doesn't force us to use any particular signaling protocol.

For example, say that there are two users, Alice and Bob, on two different browsers. If Alice wants to establish a peer-to-peer connection with Bob for chatting, then this is how a peer-to-peer connection would be established between them:

  1. They both will connect to a signaling server.
  2. Alice will then send a request to Bob via the signaling server, requesting to chat.
  3. The signaling server can optionally check whether Alice is allowed to chat with Bob, and also if Alice and Bob are logged in. If yes, then the signaling server passes the message to Bob.
  4. Bob receives the request and sends a message to Alice via the signaling server, confirming to establish a peer-to-peer connection.
  5. Now both of them need to exchange messages related to session control, network configuration, and media capabilities. All of these messages are exchanged between them by RTCPeerConnection. So, each of them needs to create an RTCPeerConnection, initiate it, and attach an event handler that RTCPeerConnection will trigger when it wants to send a message via the signaling server. RTCPeerConnection passes messages to the event handler in the Session Description Protocol (SDP) format, and messages intended for RTCPeerConnection that are received from the signaling server must be fed back to it in the SDP format; that is, RTCPeerConnection only understands the SDP format. You need to use your own programming logic to separate custom messages from messages meant for RTCPeerConnection, as in the sketch that follows this list.
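
Here is a minimal sketch of Alice's side of this exchange. It uses the callback form of the API (newer browsers also offer a promise-based form), and sendToSignalingServer() is a hypothetical function representing your own signaling logic:

var pc = new RTCPeerConnection();

// Fired whenever RTCPeerConnection discovers a network candidate for this peer.
pc.onicecandidate = function(event){
  if(event.candidate)
  {
    sendToSignalingServer({candidate: event.candidate});
  }
};

// Create an offer describing the session, network configuration, and media
// capabilities in the SDP format, and send it to Bob via the signaling server.
pc.createOffer(function(offer){
  pc.setLocalDescription(offer, function(){
    sendToSignalingServer({sdp: offer});
  }, function(err){
    console.log("Failed to set local description", err);
  });
}, function(err){
  console.log("Failed to create offer", err);
});

// When Bob's answer arrives from the signaling server:
function onAnswerReceived(answer)
{
  pc.setRemoteDescription(new RTCSessionDescription(answer), function(){
    console.log("Remote description set");
  }, function(err){
    console.log("Failed to set remote description", err);
  });
}

// When one of Bob's candidates arrives from the signaling server:
function onCandidateReceived(candidate)
{
  pc.addIceCandidate(new RTCIceCandidate(candidate));
}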

The preceding steps seem to have no problems; however, there are some major ones. The peers may be behind a NAT device or firewall, so finding their public IP addresses is a challenging task, and sometimes it is practically impossible. So, how does RTCPeerConnection find the IP address of a peer that may be behind a NAT device or firewall?

RTCPeerConnection uses a technique called Interactive Connectivity Establishment (ICE) to resolve all these issues.

ICE involves Session Traversal Utilities for NAT (STUN) and Traversal Using Relays around NAT (TURN) servers to solve these problems. A STUN server is used to find the public IP address of a peer. If the IP address of a peer cannot be found, or a peer-to-peer connection cannot be established for some other reason, then a TURN server is used to relay the traffic; that is, both peers communicate via the TURN server.

We just need to provide the addresses of the STUN and TURN servers, and RTCPeerConnection handles the rest. Google provides a public STUN server, which is widely used. Running a TURN server requires a lot of resources, as the actual data flows through it; therefore, WebRTC makes the use of a TURN server optional. If RTCPeerConnection fails to establish direct communication between two peers and a TURN server is not provided, there is no other way for the peers to communicate, and the peer-to-peer connection establishment fails.
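
The STUN and TURN server addresses are provided through the configuration object passed to the RTCPeerConnection constructor. Here is a minimal sketch; the TURN server URL and credentials are placeholders, and older browsers used a url property instead of urls:

var configuration = {
  iceServers: [
    {urls: "stun:stun.l.google.com:19302"}, // Google's public STUN server mentioned above
    {
      urls: "turn:turn.example.org:3478",   // placeholder TURN server
      username: "placeholder-username",
      credential: "placeholder-password"
    }
  ]
};

var pc = new RTCPeerConnection(configuration);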

Note

WebRTC doesn't provide any way to make signaling secure. It's your job to make the signaling secure.

Transferring MediaStream

We saw how RTCPeerConnection establishes a peer-to-peer connection. Now, to transfer a MediaStream, we just need to pass a reference to the MediaStream to RTCPeerConnection, and it will transfer the MediaStream to the connected peer.
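
A minimal sketch, assuming pc is an established RTCPeerConnection and av_stream is a local MediaStream. The older addStream()/onaddstream API is shown; newer browsers use addTrack() and the ontrack event instead:

// Sender: attach the local stream so that its tracks are transferred to the peer.
pc.addStream(av_stream);

// Receiver: fired when the remote peer's stream arrives.
pc.onaddstream = function(event){
  document.getElementById("myVideo").setAttribute("src", URL.createObjectURL(event.stream));
  document.getElementById("myVideo").play();
};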

Note

When we say that MediaStream is transferred, we mean the stream of individual tracks is transferred.

The following are some of the things you need to know regarding the transfer of MediaStream:

  • RTCPeerConnection uses the Secure Real-time Transport Protocol (SRTP) as an application layer protocol and UDP as a transport layer protocol to transfer MediaStream. SRTP is designed for transferring media streams in real time.
  • UDP doesn't guarantee the order of packets, but SRTP takes care of the order of the frames.
  • The Datagram Transport Layer Security (DTLS) protocol is used to secure the MediaStream transfer. So, you don't have to worry about the security while transferring MediaStream.
  • Constraints of the tracks that the remote peer receives may differ from the constraints of the local tracks, as RTCPeerConnection automatically modifies the stream, depending on bandwidth and other network factors, to keep the transfer real time. For example, RTCPeerConnection may decrease the resolution and frame rate of a video stream while transferring it.
  • If you add or remove a track from MediaStream that is already being sent, then RTCPeerConnection updates MediaStream of the other peer by communicating to the other peer via the signaling server.
  • If you pause a track that is being sent, then RTCPeerConnection pauses transfer of the track.
  • If you stop a track that is being sent, RTCPeerConnection stops the transfer of the track.

Note

You can send and receive multiple MediaStream instances via a single RTCPeerConnection; that is, you don't have to create multiple RTCPeerConnection instances to send and receive multiple MediaStream instances to and from a peer. Whenever you add a MediaStream to or remove one from RTCPeerConnection, the peers exchange information related to this via the signaling server.

RTCDataChannel API

RTCDataChannel is used to transfer arbitrary data, other than MediaStream, between peers. The mechanism for establishing a peer-to-peer connection to transfer arbitrary data is similar to the mechanism explained in the earlier section.

RTCDataChannel is an object that implements the RTCDataChannel interface.

The following are some of the things you need to know regarding RTCDataChannel:

  • RTCDataChannel uses SCTP over UDP as a transport layer protocol to transfer data. It doesn't use unlayered SCTP, as the SCTP protocol is not supported by many operating systems.
  • SCTP can be configured for reliability and delivery order, unlike UDP, which is unreliable and unordered. A sketch of this configuration follows this list.
  • RTCDataChannel also uses DTLS to secure data transfer, so you don't have to worry about security at all while transferring data via RTCDataChannel.
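
Here is a minimal sketch of creating and using a data channel, assuming pc is an established RTCPeerConnection; the channel label and options are illustrative (ordered: false with maxRetransmits: 0 requests unordered, unreliable, UDP-like delivery):

// Peer that opens the channel.
var channel = pc.createDataChannel("chat", {
  ordered: false,    // delivery order is not guaranteed
  maxRetransmits: 0  // lost messages are not retransmitted
});

channel.onopen = function(){
  channel.send("Hello from RTCDataChannel!");
};

channel.onmessage = function(event){
  console.log("Received: " + event.data);
};

// The other peer is notified about the new channel via the ondatachannel event.
pc.ondatachannel = function(event){
  event.channel.onmessage = function(e){
    console.log("Received: " + e.data);
  };
};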

Note

We can have multiple peer-to-peer connections open between browsers. For example, we can have three peer-to-peer connections: the first for webcam stream transfer, the second for text message transfer, and the third for file transfer.