The road to the promised land.
For more than 6 years, we have been working on and looking forward to a simpler way to build RTC (Real Time Communications) applications on the web. In order for this technology to truly show its value, the major browser vendors needed to show up.
Mobile, mobile, mobile.
Now that Apple has joined the party in earnest, does the technology have the coverage required in order for developers to make good use of WebRTC on mobile devices? Let’s find out.
Until now, in order for WebRTC to work on iOS, we were relegated to wrapping WebRTC code in Objective-C and Swift, in our native iOS apps. Basically, we had to take the Chrome code and build an app that was sent to the app store for approval and wait in line, like all the other chumps (yours truly included). Conversely, on Android we could run much of that same code from our desktop Chrome apps, on the Android device as well, within reason of course.
Now that Safari and Chrome are shipping compatible WebRTC on mobile, we get to reuse the same code, right!? Well, mostly, they are different code bases, after all.
A word about hardware acceleration.
If ubiquitous mobile video is to take off, the battery life of the device has to last more than the length of the 10 minute video call (ok, I am exaggerating a bit, but I think you get the point) and the performance needs to be at least adequate enough to distinguish facial features. My bar is set a little higher, baby steps for now.
Without h/w acceleration the CPU is likely working too hard to encode the local video and decode the inbound video + service the other processes required at the same time. That really means there needs to be hardware onboard the device dedicated to video coding. That in turn means H.264, since there are very few vendors that offer VP8 or VP9 h/w acceleration.
Question: Does this mean that mobile apps written with VP8 will not be able to deliver decent mobile video conferencing?
Answer: No, not at all, but they will likely not be as performant as those taking advantage of hardware acceleration.
Suffice to say that SVC (Scalable Video Coding) usage would be another reason why we need h/w acceleration, but that’s for another day.
Who’s using what?
The majority of desktop and mobile WebRTC apps written today, are using VP8 for video.
Since Apple and Microsoft both use H.264 and Google uses VP8 and H.264 (recently shipped Open H.264 – on the desktop and mobile). Also, many of the Enterprise RTC developers are already on that H.264 bandwagon.
Question: If Apple and Microsoft devices ship with H.264, what is the case with Google Chrome on desktops and android, are they preferencing VP8?
Answer: Chrome for desktop and android now have H.264 native. Many of the Android devices that ship today all have H.264 hardware acceleration onboard. In order to understand which units have H.264 and hardware acceleration, you can run use the Android APIs to pull a list of available codecs, but in the case of WebRTC, you will only get H.264 in Android WebRTC if there is a h/w encoder on the device.
Is H.264 the answer for WebRTC video?
Here is a recent test:
Host 1 – (before joining):
macOS Sierra, Macbook, Safari (Technology Preview 32)
Host 2 (after joining):
Android 7, Samsung 7, Chrome 55
Host 1 (after joining):
According to the Chrome Status page, Chrome for Android should have H.264. So why is the session barfing when trying to set up video? The logs do not lie…
Safari – offer:
Chrome on android – answer:
Err, huh? No H.264 in reply?
So, I updated to latest Chrome on android (58) and tried again…
… et voilà!!
Next topic, paying the man!
Shipping your product with H.264 enabled, means you may potentially need to deal with the MPEG-LA royalty police for H.264 royalties, but there are some grey areas.
In the case of Apple and Microsoft, where H.264 royalties are already being paid for by the parent vendor, the WebRTC developer is riding on the coattails of papa bear, at least in theory.
Cisco’s generous OpenH.264 offer means that those using this binary module, can do so at potentially no cost:
We will not pass on our MPEG-LA licensing costs for this module, and based on the current licensing environment, this will effectively make H.264 free for use on supported platforms.
Q: If I use the source code in my product, and then distribute that product on my own, will Cisco cover the MPEG LA licensing fees which I’d otherwise have to pay?
A: No. Cisco is only covering the licensing fees for its own binary module, and products or projects that utilize it must download it at the time the product or project is installed on the user’s computer or device. Cisco will not be liable for any licensing fees incurred by other parties.
That seems to mean (I am no lawyer) every developer shipping WebRTC apps supporting Open H.264 binary module, get a free ride. Those using some other binary, or shipping the above source code for that module, could be on the hook for those royalties. That said, since there are royalties being paid by parent vendors where devices are shipping H.264 anyways, developers may not get hassled regardless.
So what did we learn here?
- Apple has joined the party, now we have a full complement of browser vendors!
- If you want to leverage WebRTC video to deliver a ubiquitous mobile and desktop experience for your users, you should likely consider including both H.264 and VP8.
- VP8 is (still) free and powers most of the WebRTC video out there today.
- You can make use of the Open H.264 project and get a free H.264 ride, albeit baseline AVC.
- WebRTC on Android does not support software encoding of H.264, so unless there is local hardware acceleration, H.264 will not be in the offer.
- H.264 is not fully enabled (or buggy) in Chrome 55 (I was using it on Samsung S7 Edge (Android 7), but it does work with Chrome 58.
- WebRTC is not DOA!
- SDP still sucks and ORTC can’t come soon enough!!
As a side note, it would be interesting to see something like this open sourced; VP8 / H.264 conversion without transcoding, if only to service the existing desktop apps currently running VP8 <-> mobile H.264. It would likely overwhelm the mobile device, but it would be cool if it worked!
Disclaimer: The views expressed by me are mine alone and do not necessarily represent the views or opinions of my employer.
The concept of video streaming seems extraordinarily simple. One side has a camera and the other side has a screen. All one has to do is move the video images from the camera to the screen and that’s it. But alas, it’s nowhere near that simple.
Cameras are input sources, but they have a variety of modes for which they can operate. Each camera has it’s own dimensions (width/height), aspect ratio (the ratio of width/height) and frame rate. Cameras are often capable of recording at selectable input formats, for example, SD, or HD formats, which dictate their pixel dimensions and aspect ratio sizes (e.g. 4:3 or 16:9). If a camera opens in one format and switches to another there can be a time penalty before the video starts streaming again from the camera (thus switching modes needs to be minimized or avoided entirely). On portable devices, the camera can be orientated in a variety of ways and dynamically change its pixel dimensions and aspect ratio on the fly as the device is physically rotated.
Some devices have multiple camera inputs (e.g. front camera or rear camera). Each source input need not be identical in dimensions nor capability and the user can choose the input on the fly. Further, there are even cameras that record multiple angles (e.g. 3D) simultaneously, but I’m not sure if that should be covered right now even though 3D TVs are all the rage (at least from Hollywood’s perspective).
If I could equate cameras to a famous movie quote: they are like a box of chocolates, you never know what you are going to get.
Cameras aren’t the only sources though. Pre-recorded video can be used as a source just as much as a camera. Pre-recorded video has a fixed width, height and aspect ratio, but it must be considered as a potential video source.
The side receiving video typically renders the video onto a display. These output displays are also known as a type of video sink. There are other types of video sinks though, such as a video recording sink or even videoconferencing sink. Each has it’s own unique attributes that vary and the output width and height of these video sinks vary greatly.
Some video recording sinks work best when they receive the maximum resolution possible. While others might desire a fixed width/height (as it’s intended for later viewing on a particular fixed size output device). When video is displayed in a webpage, it might be rendered to a fixed width/height or there might be flexibility in the size of the output. For example, the page can have a fixed width, but the video height could be adjustable (up to a maximum viewable area), or vice versa with the width being the adjustable axis. In some cases both dimensions can adjust automatically larger or smaller.
Some output areas are adjustable in size when manually manipulated by a user. In such cases the user can dynamically resize the output areas larger or smaller as desired (up to a maximum width and height). Other output screens are fixed in size entirely and can never be adjusted. Still other devices adjust their output dimensions based upon the physical rotation of the device.
The problem is how do you fit the source size into the video sink’s area? A camera can be completely different in dimensions and aspect ratio than the area for the video sink. The old adage “how do you fit a square peg in a round hole” seems to apply.
In the simplest case, the video source (camera) would be exactly the same size as the output area or the output area would be adjustable in the range to match the camera source. But what happens when they don’t match?
The good news is that video can be scaled up or down to fit. The bad news is that scaling has several problems. Making a small image into a big image makes the image appear pixelated and ugly. Making a bigger image smaller is better (except there are consequences for processing and bandwidth).
Aspect ratio is also a big problem. Anyone who’s watched a widescreen movie on a standard screen TV (or vice versa) will understand this problem. There are basically three solutions for this problem. One solution is shrinking the wide screen to fix into the narrow and put “black bars” above and below the image, known as letterboxing (or pillarboxing on the other axis). Another solution is to expand the image large enough while maintaining aspect ratio so there are no black bars (but with the side effect that some of the image is cropped because it’s too big to fit in the viewing area). Another method is to stretch the image making images look taller or fatter. Fortunately that technique is largely discredited, although still selectively used at times.
Some people might argue that displaying video using a letterboxing/pillarboxing technique is too undesirable to ever be used. They would prefer video was stretched to fit the display area and any superfluous video image edges are automatically cropped off. Videophiles might gasp at such a suggestion for the very idea that discarding part of an image is nearing sacrilege. In practical terms, it’s both user preference as well as context that determine which technique is best.
As an example of why context is important, consider video rendered to the entire view screen (i.e. full screen mode). In this context, letterboxing/pillarboxing might be perfectly acceptable, as those black bars become part of the background of the video terminal. In a different context, black bars in the middle of a beautifully formatted web page might be horrifically ugly and unacceptable under any circumstance.
The complexities for video are far from over. When users place video calls, the source and the video sink are often not physically located together. That means that the video has to go from the source to the video sink located on different machines/devices and across a network.
When video is transmitted across a network pipe, a few important considerations must be factored. A network pipe has a maximum bandwidth that fluctuates with usage and saturation. Attempt to send too large a video and the video will become choppy and glitch badly. Even where the network pipe is sufficiently large, bandwidth has a cost associated, thus it’s wasteful to send a super high quality image to a device that is incapable of rendering it to the original quality. To waste less bandwidth, a codec is used to compress images and thus preserve network bandwidth as much as possible (the cost being the bigger an image, the more CPU required to compress the image using the codec).
As a general rule…
- a source should never send video images to the remote video sink that ends up being discarded or at a higher quality than the receiver is capable rendering, as it’s a waste of bandwidth as well as CPU processing power. For example, do not send HD video to a device only capable of displaying SD quality.
Too bad this general rule above has to have an exception. There are cases where the video cannot be scaled down before sending, although rare. Nonetheless, this exception cannot be ignored. Some servers offer pre-recorded video and do not scale the video at the source because doing so would require expensive hardware processing power to transcode the recorded video. Likewise, a simple device might be too underpowered or hard wired to its output format to be capable of scaling the video appropriately for the remote video sink.
The question becomes which end (source or sink) manipulates the video? And then there are the questions of how and what does each side need to know to do the right thing to get the video in the correct format for the other side?
I can offer a few suggestions that will help. Again, as to the general rules..
- A source should always attempt to send what a video sink expects and nothing more
- A source should never attempt to stretch the source image larger than the original source image’s dimensions.
- If the source is incapable of adjusting the dimensions to the video sink completely, it does so as much as it is capable and then the video sink must finish the job of adjusting the image before final rendering.
- The source must understand the video sink can change dimensions and aspect ratio anytime with a moment’s notice. As such, there must be a set of active “current” video properties the source must be aware of at all times with regard to the video sink.
- The “current” properties include the active width and height of the video sink (or maximum width or height should the area be automatically adjustable). The area needs to be flagged as safe for letterboxing/pillarboxing or not. If the area is unable to accept letterbox or pillarbox then the image must ultimately be adjusted to fill the rendered output area. Under such a situation the source could and should pre-crop the image before sending knowing the final dimensions used.
- The source needs to know the maximum possible resolution the output video sink is capable of producing to not waste its own CPU opening a camera at a higher resolution than will ever be possible to render (e.g. an iPad sending to an iPhone device). Unfortunately, this needs to be a list of maximum rendered output dimensions as a device might have multiple combinations (such as an iPhone device suddenly turned on its side).
I’m skeptical if a reciprocal minimum resolution is ever needed (or even possible). For example, an area may be deemed letterbox/pillarbox unsafe and the image is just too small to fit a minimum dimension (and thus would have to be stretched upon rendering). In the TV world, an image is simply stretched to fit upon output (typically while maintaining aspect ratio). Yes, a stretched image can become pixilated and that sucks, but there are smoothing algorithms that do a reasonable job within reasonable limitations. People playing DVDs on Blu-ray players with HD TVs are familiar with such processes, which magically outputs the DVD video image to the HD TV output size. Perhaps a “one pixel by one pixel” source connected to an HD (1920×1080) output would be the extreme case of unacceptable, but what would anyone expect in such a circumstance? That’s like hooking up an Atari 2600 to an HD TV. There’s only so much that can be done to smooth out the image, as the source image quality just isn’t available. But that doesn’t mean the image shouldn’t be displayed at all!
Another special case happens when a source cannot be scaled down for whatever reason before transmission and the receiving video sink is incapable of scaling it down further to display (due to bandwidth or CPU limitations on the device). The CPU limitation might be pre-known, but the bandwidth might not. In theory the sink could report failures to the source and cause a scale back in frame rate (i.e. cause the sender to send fewer images rather than smaller images). If CPU and bandwidth conditions are pre-known, then a maximum acceptable dimension and bandwidth could be elected by the video sink thus such a non dimension adjusting source must be incapable of connecting.
Aside from the difficulties in building good RTC video technology, those involved in RTCWEB / WebRTC have yet to agree on which codecs are Mandatory to Implement (MTI), which isn’t helping things at all. Since MTI Video is on the agenda for IETF 86 in Orlando maybe we will see it happen soon. If there is a decision (that’s a big IF), what is likely to happen is that there will be two or more MTI video codecs, which means we will need to support codec swapping and all the heavy lifting related thereto.
It looks like the first victim in the Microsoft acquisition of Skype is Digium and the open source PBX – Asterisk. The following is an email sent to existing Skype for Asterisk users…
Skype for Asterisk will not be available for sale or activation after July 26, 2011.
Skype for Asterisk was developed by Digium in cooperation with Skype. It includes proprietary software from Skype that allows Asterisk to join the Skype network as a native client. Skype has decided not to renew the agreement that permits us to package this proprietary software. Therefore Skype for Asterisk sales and activations will cease on July 26, 2011.
This change should not affect any existing users of Skype for Asterisk. Representatives of Skype have assured us that they will continue to support and maintain the Skype for Asterisk software for a period of two years thereafter, as specified in the agreement with Digium. We expect that users of Skype for Asterisk will be able to continue using their Asterisk systems on the Skype network until at least July 26, 2013. Skype may extend this at their discretion.
Skype for Asterisk remains for sale and activation until July 26, 2011. Please complete any purchases and activations before that date.
Thank you for your business.
Digium Product Management
One has to wonder what will become of Skype Connect, Skype’s answer to SIP Trunking. Will Microsoft shut off the Skype Connect vendors (Cisco, Avaya, Grandstream, etc.) as well?
Original forum post here.
Well, I am a little sad that I have to turn ON international mobile roaming with Bell in order to get my mobile phone working here, which it is still not, but all is not lost. I have been using FaceTime over the WifI on my MacBook Air and iPhone 4 to call my business partner and my wife back home. Kinda fitting actually, FaceTime is standards based and is all about SIP and RTP etc. Now if we can just get them to open up that API…
Update: Yes, it was indeed launched and it’s called the iPhone 3G S but no video calling as yet.
There are rumors abound regarding the next release of the iPhone, every tech blog known to man is all over this like a fat kid on a smarty.
The iPhone 3.0 SDK has pretty much been proven to support video so a iPhone Video product seems to make sense. What kind of video? Recording full frame video is one thing but transporting that over 3G is quite another. My guess is it will not support real-time streaming or video calling on 3G, the question is will it deliver the goods on WiFi.
It will be interesting to see what happens at WWDC (running from the 8th to the 12th), the new iPhone is sure to launch at this event.
Sony Electronics and GlowPoint, Inc. today announced the availability of Sony’s IVE or “Instant Video Everywhere” service powered by GlowPoint, the first comprehensive IP-based video communications solution that can be used anyplace where there is broadband access.
Sony’s IVE service, powered by GlowPoint, enables users to conduct “face-to-face” video conversations with colleagues, friends and family as easily as using a telephone. The IVE (pronounced “ivy”) service portfolio has been developed to extend across multiple Sony product lines, moving beyond the conference room and typical videoconferencing equipment to desktop or laptop computers and other video enabled devices.