SDP in WebRTC? Who cares…

I would like to believe that I’m not hopelessly confused and outdated with regard to what is going on with RTCWEB. Last I checked, my head is not stuck in the sand, nor have I been buried under a rock for the last several years. I recently watched the February 7th, 2013 netcast about the data channel and the questions about how it relates to SDP and the SDP ‘application m-line’.
For the love of all that is human, why is SDP part of RTCWEB efforts at all?
To be clear, I’m talking about a few specific aspects of SDP: the format, the exchange of SDP between browsers and media negotiations via the offer/answer model (and all that it implies regarding the negotiation of media streams). Come to think of it, all that makes SDP, well… SDP. I know what some will say: We need to exchange some kind of blob-like information between browsers so they can talk, that’s why SDP is used. And I would respond “of course”! Beyond arguing how arcane SDP is as a format, RTCWEB was specifically designed not to do signaling stuff at all. That part was purposefully (and wisely IMHO) left out of the specification so that the future was wide open for whatever it might hold in creative solutions.
So why was it so wise to take out call flow signaling but then decide to keep media signaling, specifically SDP and the offer/answer model? To be clear, I’m not suggesting we dump SDP in favour of something else, like JSON blobs or JavaScript structured data with unspecified exchange formats. Nor am I proposing a stateless exchange of SDP (breaking offer/answer). I’m saying that it’s operating entirely at too high a level.
What we really need in order to do future stuff in the browsers (yet remain compatible with the past) is a good API for a lower-level media engine that can create, destroy, control and manage media streams. That’s it. Write an engine that doesn’t take SDP but manages much lower-level streams, and let the programmer dictate how they are plumbed together and which are active or inactive, with events for the streams as they progress.
There are sources for audio/video such as microphones, cameras or pre-recorded sources. There are destinations, like the speaker and the video monitor, or perhaps even a recording destination. Then there’s the wiring and controlling of the pipelines between mixers, codec engines and finally the RTP streams that can be opened given the proper information (basically a set of ICE candidates that can optionally be updated, e.g. trickle ICE). Media engines are well-understood things with many reference implementations to draw upon and abstract for the sake of JavaScript.
That’s the API I want. There’s no SDP offer/answer needed. There’s no shortage of really smart people out there who would know how to produce a great API proposal.
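To make that concrete, here is a rough sketch of what such a surface could look like. Every name below is invented for illustration; this is not a proposal from any working group, just the shape of the thing I’m describing: sources and sinks, explicit RTP send/receive streams, and an ICE transport that accepts candidates at any time.

```typescript
// Hypothetical low-level media-engine surface. Every name here is invented for
// illustration; it is not a real or proposed browser API.
interface IceCandidate { ip: string; port: number; protocol: "udp" | "tcp"; }

interface IceTransport {
  addRemoteCandidate(c: IceCandidate): void;          // trickle-friendly: add candidates any time
  onLocalCandidate(cb: (c: IceCandidate) => void): void;
  onConnected(cb: () => void): void;
}

interface EngineSource { kind: "audio" | "video"; }   // microphone, camera, file, mixer output...
interface EngineSink   { kind: "audio" | "video"; }   // speaker, display surface, recorder...

interface RtpSendStream {
  setCodec(codec: { name: string; clockRate: number; payloadType: number }): void;
  attachSource(source: EngineSource): void;
  start(): void;
  stop(): void;
}

interface RtpReceiveStream {
  attachSink(sink: EngineSink): void;
  onStats(cb: (stats: { fractionLost: number }) => void): void;
}

interface MediaEngine {
  createIceTransport(): IceTransport;
  createSendStream(transport: IceTransport): RtpSendStream;
  createReceiveStream(transport: IceTransport, ssrc: number): RtpReceiveStream;
}
```

With primitives like these, offer/answer becomes something a JavaScript library can implement on top for whoever needs it, instead of something baked into the browser.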
Some will argue that this is way too complex for JavaScript programmers. Nonsense! The stuff these JavaScript guys can do is mind-blowing. Plus, it’s vital and necessary to allow for incredible flexibility and power for future protocols and future services (including in-browser peer-to-peer conferencing or multiparty call control), while maintaining compatibility with legacy protocols. Yes, there are some JavaScript programmers who would be too intimidated to touch such an API. That’s completely fine. The ambitious smart guys will write the wrappers for those who want a simple “give me a blob to give to the remote side”.
There’s no security threat introduced by managing streams with a solid API. As long as ICE is used to verify that there are two endpoints who agree to talk and the user has granted permission for media capture, why shouldn’t the rest be under the control of JavaScript?
Likewise, this will not create any more silos between the browsers than those that already exist, because both sides need to have the same signaling regardless. Is it really a big stretch that both sides would have the same JavaScript to run the media engines too?
Such an API would lower the bar for browsers being able to interoperate at the media level. It removes the concerns about SDP compatibility (including the untold extensions that will appear to handle more powerful features, and the complex behaviours associated with SDP offer/answer, including rollback and ‘m=’ line stability). If the browsers support RTP, ICE and codecs, and can stream, then they are pretty much compatible even if their individual API sets aren’t up to par with their counterparts.
This also solves an issue regarding the data channel. There is no need for the data channel to be tied to an offer/answer exchange in the media at all. They are separate things entirely (as well they should be). For example, in Open Peer’s case the data channel gets formed in advance to maintain our document subscription model between peers, and media channels are opened and closed as required.
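As a hedged sketch (names invented, not Open Peer’s actual API), this is roughly the relationship I mean: the data channel rides on the transport from the start, and media streams come and go without touching it.

```typescript
// Hypothetical sketch only: a data channel that never touches media offer/answer.
// "Engine", "Transport" and friends are invented names, not Open Peer's real API.
interface DataChannel { send(msg: string): void; onMessage(cb: (msg: string) => void): void; }
interface Transport { onConnected(cb: () => void): void; }   // ICE/DTLS/SCTP plumbing, set up out of band

interface Engine {
  createDataChannel(t: Transport, opts: { label: string }): DataChannel;
  createAudioStream(t: Transport): { start(): void; stop(): void };
}

function bootstrap(engine: Engine, transport: Transport) {
  // The document-subscription channel comes up as soon as the transport connects...
  const docs = engine.createDataChannel(transport, { label: "documents" });
  docs.onMessage(json => console.log("subscription update", json));

  // ...and media streams are created and torn down per call, without renegotiating the channel.
  return { startCall: () => engine.createAudioStream(transport).start() };
}
```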
Those who still want to do full on SDP can do SDP. Those who want stateless SDP-like exchanges can do exactly that. Those who want to negotiate media once and leave the streams alone can do so.
Perhaps there are those who would argue that the JavaScript APIs build the SDP hidden behind the scenes, that the SDP can be treated as an opaque type, and thus that the appropriate low-level API already exists. But they are missing the point. The moment SDP is required, the future is tied to offer/answer.
As an example, let’s examine Open Peer’s use case. Open Peer does not have, nor does it need or want, a stateful offer/answer model. It also doesn’t support or require media renegotiation. Open Peer offers the ports and codecs (including offering the same port sets to multiple parties) and establishes the desired media. Call state and media state are completely separated. From then on, if alternative media is needed, a new media dialog is created to replace the existing one, a ‘quick swap’ happens, and the media streams are rewired to the correct inputs and outputs without renegotiation, at least not a renegotiation in the offer/answer sense. Further, Open Peer allows either side to change its media without waiting for the other party’s answer.
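A tiny sketch of the ‘quick swap’ idea against hypothetical stream primitives (none of these names are real APIs): the replacement wiring is brought up first and the old one torn down afterwards, with no offer/answer round trip in between.

```typescript
// Hypothetical "quick swap": the replacement dialog is wired up first, then swapped in,
// and the old streams are stopped afterwards. No offer/answer round trip is involved.
interface StreamPrimitive { start(): void; stop(): void; }
interface MediaDialog { streams: StreamPrimitive[]; }

function quickSwap(current: MediaDialog, replacement: MediaDialog): MediaDialog {
  replacement.streams.forEach(s => s.start());   // bring the new wiring up first...
  current.streams.forEach(s => s.stop());        // ...then tear the old wiring down
  return replacement;                            // the replacement is now the active dialog
}
```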
Media is complicated for good reason as there are many use cases. The entire IETF/W3C discussion around video constraints illustrates some of the complexities and competing desires for just one single media type. If we tie ourselves to SDP we are limiting ourselves big time, and some of the cool future stuff will be horribly hampered by it.
Let’s face it, browsers are moving toward becoming sandboxed operating systems. So why not give them the appropriately low-level API they deserve, one that allows for flexible, futuristic application writing? Complicated and powerful HTML5 APIs are being well received, so why can’t the same be true for lower-level RTCWEB APIs?
I know Microsoft has argued the API is too high level, and they’ve even gone to the trouble of submitting their own specification with CU-RTC Web, splintering and fragmenting efforts. I don’t presume to speak for their stance regarding SDP, nor will I go into the merits of their offering, but I think they are right in principle. And for saying so, I’ve got my rotten tomato and egg shield in position.
Robin, I agree that adopting SDP as part of the browser’s media interface is somewhat limiting, but I don’t think that getting rid of offer/answer is the way to go. While you might eventually get common codec implementations in browsers, I believe that one of the big use cases of WebRTC will be to gateway to the legacy world of SIP, with all that that implies.
That’s my point, in a way. It’s too focused on legacy. If you had a lower-level API you could absolutely support legacy and still allow the future to happen. The lower-level API would need a JS library to handle the media streams, but that’s just fine. The future, in our opinion, is not SIP.
If you as an application developer are truly concerned about this, you need to express your concerns on the rtcweb mailing list, ideally with examples of things you *can’t* do with the existing API.
Thanks Justin, I’ll do exactly that. I have serious reservations about exposing SDP and I’d prefer it if SDP disappeared from the specification entirely.
Do you object to the syntax, or to the overall use of session descriptions as a control surface? That is, would a JSONized representation of the same concepts SDP describes be more palatable?
If I had to live with SDP, I would prefer it be JSONified. It’s not a very palatable format and it’s extremely legacy-oriented. But that’s not at all my only objection.
I’ll try to break down my objections.
First, the obvious – the format. It’s archaic. It’s legacy. JSONifying would be better but it’s not the only issue. SDP opens up a can of worms.
Is the SDP blob meant to be transported in full, “as is”, to the destination, or is it okay to mess with the SDP? This is important. SDP is not just a format but an extensible specification. This means it will be extended in unknown ways, and those extensions can affect behavior and add, change or remove functionality.
The protocol we have written won’t use SDP (unless forced). We’d prefer to transport the exchanged information ourselves in a more palatable and forward-thinking way than SDP allows (in fact, we use JSON). This means we’d likely parse and tear apart the information contained inside the SDP and then reassemble a new SDP on the remote side. But that would be very BAD to do in practice, and would likely force us to use the SDP format forever and deliver this SDP blob inside our JSON format.
The reason why it’s bad to disassemble/reassemble the SDP is that it can be arbitrarily extended at will, without our knowing what is going on internally. New features could be added to the SDP without being understood. This might sound like a beneficial feature, but it’s actually dangerous.
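As a toy illustration (not our actual code), a naive disassemble step that only keeps the attributes it recognizes will silently lose whatever an extension added, which is exactly the danger.

```typescript
// Toy example of the round-trip problem. A parser that only understands a fixed set of
// attributes silently drops anything an SDP extension added (the extension name is made up).
const KNOWN_ATTRIBUTES = ["rtpmap", "rtcp-mux", "candidate", "ice-ufrag", "ice-pwd"];

function disassemble(sdp: string): { kept: string[]; dropped: string[] } {
  const kept: string[] = [];
  const dropped: string[] = [];
  for (const line of sdp.split(/\r?\n/)) {
    if (!line.startsWith("a=")) { kept.push(line); continue; }
    const attr = line.slice(2).split(":")[0];
    (KNOWN_ATTRIBUTES.includes(attr) ? kept : dropped).push(line);
  }
  return { kept, dropped };
}

const result = disassemble([
  "m=audio 49170 RTP/AVP 0",
  "a=rtpmap:0 PCMU/8000",
  "a=some-new-extension:feature-x",   // added by a future extension neither side's JS knows
].join("\r\n"));

console.log(result.dropped);          // ["a=some-new-extension:feature-x"] — lost on reassembly
```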
With an API, it’s a contract. You don’t change the contract arbitrarily, because doing so has implications. You know what you are getting and thus you can predict behavior. SDP is no such thing. It can be changed arbitrarily and there’s no guarantee the two sides will match. If you allow third parties like me to modify it, we’ll lose the additional “features” in the name of transforming SDP into something palatable. So we can’t transform it, which means SDP ends up imposed as a signaling protocol.
Further, if you allow modifications, SDP is a compatibility nightmare. I was the original author of the popular softphone client X-Lite and I understand the compatibility issues that happen with SDP. It was modified in many different ways and extended over and over. People couldn’t even get basic things like ICE right, let alone all the crazy things they did to SDP. It was a mess. Everyone extending it every which way created a nightmare of issues, because implementations couldn’t talk to each other.
If you use SDP, you inherit this mess and make it explode. And compatibility issues won’t be limited to browser SDP. There is a huge swath of legacy systems that will deliver a mess of SDP to the browsers, and SBCs that will manipulate the SDP in untold ways. This will reintroduce the problems of SIP into the world of WebRTC. You will be locked to those legacy systems; they will drag the browser back like a ball and chain and limit your ability to innovate.
SDP is not required for compatibility. Simple stream primitives to control ICE, the codecs and the RTP/RTCP streams are all that is needed to create compatibility. Anyone wanting to do SDP/SIP can write a JavaScript layer to do it. I think the SIP vendors believe that if the browsers use SDP they will gain lots more compatibility. They won’t. The reality is that a JS library that understands SIP must still be used or they won’t get any compatibility, so having SDP doesn’t give them immediate compatibility; in fact, without SIP signaling too, it gives them nothing. Worse, the SDP that WebRTC is likely to offer won’t be compatible with many (perhaps most) legacy systems out there anyway. Many of these systems still don’t support the latest ICE specifications, and ICE has been a standard for a long time. Imagine being tied down by these systems later.
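To show how thin that JavaScript layer could be, here is a hedged sketch of a shim that emits a legacy-facing SDP audio section from primitive stream information; the types are invented, only the SDP line syntax is real.

```typescript
// Hedged sketch of a legacy-facing shim: the browser only exposes stream primitives,
// and a small JS layer produces SDP for the SIP side. Types here are invented; the
// emitted "m=" and "a=rtpmap" lines follow standard SDP syntax.
interface AudioStreamInfo { port: number; payloadType: number; codec: string; clockRate: number; }

function toSdpAudioSection(s: AudioStreamInfo): string {
  return [
    `m=audio ${s.port} RTP/AVP ${s.payloadType}`,
    `a=rtpmap:${s.payloadType} ${s.codec}/${s.clockRate}`,
  ].join("\r\n");
}

console.log(toSdpAudioSection({ port: 49170, payloadType: 0, codec: "PCMU", clockRate: 8000 }));
// m=audio 49170 RTP/AVP 0
// a=rtpmap:0 PCMU/8000
```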
SDP offer/answer is horrible. I would agree with Microsoft’s argument that it’s brittle. Negotiation forces you to take what’s going on locally and mix and match that state with the remote side. If you get into multiple-party handshakes, it can be hell. In our case, our protocol with Open Peer is stateless. This was a conscious decision on my part, based on my history of working with SIP. Each side presents its current offer and can update its offer at any time. With SDP offer/answer we have to build this awful state machine just to support offer/answer, and it forever hampers our ability to dynamically change offers until a fully accepted round trip happens, with unknown changes arriving on the return trip. Is this required to do media properly? Absolutely not. RTP can change codecs on the fly or dynamically adapt video options. We supported trickle-like ICE by allowing new candidates to be re-offered at any time. We control the stream primitives, so we can dictate when changes happen in a way that doesn’t violate the stream primitives’ expectations while still allowing the stateless offers that simplify the process. Worse, our model does NOT renegotiate. We replace previous offers using the same ports and do a quick swap, which greatly simplifies the entire calling process. Offer/answer will prevent this and add untold headaches, but I don’t want to complain just because it’s harder for us. It’s not just programming difficulties; it breaks our concept of what a proper signaling protocol should be.
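For what it’s worth, here is a minimal sketch of what that stateless exchange amounts to, with invented names; each side simply republishes its current description whenever something changes, and applies the remote description whenever it arrives.

```typescript
// Minimal sketch of a stateless exchange (invented names). Each side republishes its
// current description whenever it changes; the remote side applies it on arrival.
interface StreamDescription {
  candidates: { ip: string; port: number }[];          // can grow over time, trickle-style
  codecs: { payloadType: number; name: string }[];
}

class StatelessEndpoint {
  private current: StreamDescription = { candidates: [], codecs: [] };

  constructor(private publish: (d: StreamDescription) => void) {}

  // Change and republish at any time; there is no offer/answer state machine to wait on.
  update(change: Partial<StreamDescription>): void {
    this.current = { ...this.current, ...change };
    this.publish(this.current);
  }

  // The remote description is simply applied whenever it arrives.
  applyRemote(remote: StreamDescription): void {
    void remote;   // here the candidates would feed ICE and the codecs the RTP streams
  }
}
```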
SDP binds things that don’t need to be bound. We can negotiate all the streams independently. Maybe we want 6 video streams and then suddenly want to change that to zero. Why do we need to preserve six dead video ‘m=’ lines? The media streams can then be matched to mixer inputs and outputs as we, as developers, choose. With SDP, they are all forced into this one SDP bundle that has to be negotiated together. It’s wholly unneeded and doesn’t allow flexibility for the streams, especially since all streams get put into this outstanding “offer pending” state (or whatever you call it) instead of being treated as independent things.
There is no need to deliver packaged “blob” negotiation data. These blobs mean a lot of “magic” will happen. Instead of controlling the stream primitives, the browsers will magically do a bunch of stuff based on the package and nobody will truly know what is going on. It’s not a contract like an API.
Can I work around these issues? Sure, just like web developers patched around IE 6.0. But I can assure you that a more media-stream-centric API, without SDP offer/answer, will simplify your releases, increase compatibility, allow for newer and stronger protocols and, more importantly, enable future capabilities that others can’t yet imagine, provided you don’t attach the boat anchor that is SDP.
Microsoft feels so strongly that SDP is bad that they decided to go the way of CU-RTC Web. I can imagine why they feel this way, feeling boxed in by SDP.
Feedback like this from app developers is important, but we get it rather infrequently.
WebRTC faces an important decision on how we are going to represent multiple media streams in the API; one proposal currently being discussed is https://datatracker.ietf.org/doc/draft-jennings-rtcweb-plan/. Your feedback on this proposal, from an app developer’s perspective, would be useful.
I understand you likely rarely get feedback. To be honest, following the standards tracks requires a huge time commitment which few of us have, especially when few of us have the historical experience to recognize the problems. Keeping track of even the technology changes happening in the industry is hard enough. For that, I apologize for not bringing this stuff up sooner.
As for this proposal… Whoa… That is exactly what I mean. The extensions and complexities get nutty. SDP becomes a language all its own to describe media, behaviour, associations and negotiations, and it introduces untold complexities. And I have to ask, for whose benefit? This will make the browsers more and more complex and huge over time, with increasing incompatibility and divergence not only from other browsers but from external applications, which must all understand this stuff too and implement it correctly.
Instead of essentially making a language, expose the raw primitives. Synchronization? Flag which streams go together via the API (there are plenty of suitable API approaches for doing so). FEC? Ask for FEC splitting of streams, and then on the other end identify which streams are the FEC-redundant ones.
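For instance, those two knobs could be expressed as hypothetical primitives along these lines (every name here is made up to illustrate the shape, nothing more):

```typescript
// Hypothetical primitives for the two examples above (names invented for illustration).
interface StreamHandle { id: string; }

interface StreamControl {
  // Synchronization: tell the engine which streams share a playout clock.
  synchronize(streams: StreamHandle[]): void;

  // FEC: ask the sender for a repair stream, and tell the receiver which incoming
  // stream repairs which source stream.
  enableFec(source: StreamHandle): StreamHandle;
  declareFecPair(source: StreamHandle, repair: StreamHandle): void;
}
```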
As an API, you get a contract, and you get exactly the features/elements you’ve asked for. As a developer, you get it because you’ve asked for it. It’s a contract between developer and API. With SDP? Nobody knows what is going to spew out or arrive from a remote party.
For example, will a browser that opens streams simultaneously presume they must therefore be synchronized and throw in those extra attributes for you in an attempt to make things better? Little things like that will start to happen, and then what was a contract and a clear understanding of what was being received is broken, as new stuff gets sneaked into the baseline SDP.
The worst part is that you’ll likely have to request this stuff from an API on the generating side anyway, but then the receiving side gets all sorts of implied logic and behaviour by the nature of the SDP. Some might say “wonderful”, as it makes it all free. But as the saying goes, free has a cost. You lose control over which behaviours you want. The remote browser has to support these features, and who is to say it’s going to be a browser on the remote end anyway? You’ll still get versioning issues, and then it’s all tied to offer/answer, which throws in a fixed, unneeded state machine.
Oh, but wait, since it’s all under my JS control, I won’t get wonky unsupported/unrequested features? Not so. People will write bridges that take the raw SDP from devices and cram it into other packages that end up going to and from browsers and untold devices. What was once a controlled contract from JS will become a compatibility mess. It will happen, because it’s the path of least resistance for the bridge writers to get themselves into various protocols. I dread the support nightmare if Open Peer becomes popular and people write bridges into SIP and we get complaints about incompatibilities between a peer device talking to a SIP device, all because junk was crammed into the SDP from these devices: junk that isn’t mutually supported, isn’t in our spec, is buggy, and was never asked for by us from the JS layer. It won’t be our fault, but it won’t matter. We’ll get the blame and have to spend hours and devote precious resources to proving who the culprit is, and still not be able to solve the issue for the end user, who just wants to talk and thinks they are getting an Open Peer experience but is in fact getting something else.
Likewise, it will be tempting for websites to sneak in additional features our protocol doesn’t want or isn’t compatible with, by turning on features locally for their website, where they get thrown into the SDP and crammed onto a remote party that doesn’t support them. That local website might be very happy with their new feature, but they’ll break our federation model to other sites and systems when they do this little sneak addition, thinking they are being clever by piggybacking on SDP and hiding it from the protocol that transports it.
With the primitives only, I can build whatever state machines are needed for the particular features I need (from none, when I’m just changing codecs, to complex, if I want to do dynamic re-pinning of inputs/outputs to untold mixers). Throw SDP in and now we have to baseline all of that into a common understanding of what it means, disallow me from doing it because it’s unsupported in the SDP, or force me into a hybrid where I transport this SDP alongside all my additional information and have to coordinate my state machine with the browser’s offer/answer state machine. Yuck.
What if you want to add a feature to the browser? Great, add it, but hold on, there’s another problem: every new feature must be expressed in SDP and must be standards-tracked. You can’t just add it.
I have to say, the ideas presented are very good. I appreciate FEC, and synchronizing streams is good. But SDP isn’t needed to do it. Let me, as the programmer, worry about how to manage the streams, the features on the streams and the associations between the streams.
I can understand your reservations about SDP. In our SIP-to-WebRTC gateway, we have to sanitize the SDP going in both directions. We’ve found that Chrome will throw exceptions when it gets legal SDP that isn’t to its liking, and as for the stuff going the other way to legacy SIP systems and clients… enough said.
There are three places where you can clean up the SDP:
1. In the gateway, before you send it out to the client.
2. In the browser JavaScript SIP libraries, prior to handing the SDP to the media API.
3. In the browser itself.
We’ve chosen the first approach, since we obviously can’t control what happens under the hood in the browser (although we do submit bug reports, and things do get fixed), and we wouldn’t want to burden the library users/developers with the task.
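A simplified sketch of what that line-level scrubbing amounts to; the blocked attribute names here are illustrative only, not an actual gateway ruleset.

```typescript
// Simplified sketch of gateway-side SDP scrubbing: drop attribute lines the receiving
// end is known to choke on. The blocked names below are purely illustrative.
const BLOCKED_ATTRIBUTES = new Set(["crypto", "silenceSupp", "X-example-extension"]);

function scrubSdp(sdp: string): string {
  return sdp
    .split(/\r?\n/)
    .filter(line => {
      if (!line.startsWith("a=")) return true;                 // only attribute lines are filtered
      const attr = line.slice(2).split(/[:\s]/)[0];
      return !BLOCKED_ATTRIBUTES.has(attr);
    })
    .join("\r\n");
}
```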
If the intention of using SDP was to facilitate interoperability with legacy systems using SIP or Jingle for signaling, then it has not been totally successful. (Ironically, that benefits us SBC vendors, as fixing interop issues is our stock in trade.)
Thanks Robert. I never thought of SDP in terms of providing job security! Maybe I need to rethink my stance. In all seriousness, I’m glad I’m not the only one who knows the pain of SDP, and I appreciate that you can further illustrate that it doesn’t actually provide interoperability despite its lofty goals.
Having said all of that, there are some benefits to using SDP in the media API:
1. Developers of JavaScript SIP libraries can simply hand a blob of SDP to the API, and not have to bother with what’s in it, simplifying the development process.
2. Vendors of SBCs and other SIP-based systems can simply add WebSockets as another transport, alongside UDP, TCP and TLS (and maybe SCTP).
I imagine the same is true of XMPP-based systems using Jingle to negotiate media.
This is of course something of an oversimplification, given all the SDP manipulation and media massaging that is also required when gatewaying to non-WebRTC systems.
You are correct Robert that it is simpler as an API. No question about it. But I think the cost of simpler is way too high. I’ve just put up a new posting to this tune. I appreciate very much your feedback!
Oh, and I hope you weren’t asking me about the specifics of doing FEC or synchronizations otherwise I missed the mark in my reply! I see SDP and my stomach gets very upset quickly…
Points 4, 5 and 6 all have to do with the complexities of having to describe the intentions of mixing in SDP. So no comment beyond “don’t use SDP”.
As for 7.1 – “this is because the sender choses the SSRC” – that is only true because we are forced to use SDP and the assumption is that it’s SIP. We could have the receiver dictate what the sender should use in advance of any media. In our case, we establish in advance what we want from both parties before even “ringing” the other party. We do not have SSRC collisions, because we reversed the scenario and allow the receiver to pick the SSRC. Coordinating the streams is a problem for SIP because of how they do forking/conferencing, not for us. We do not fork like they do. We negotiate each location independently and statelessly. Yet this proposal forces their issue onto us. If they have problems with streams arriving early to their stateful offer/answer, then let them worry about “how” they intend to match the streams at a higher layer. Their proposal seems reasonable for their pains in SIP, but it’s way too SIP-centric for general purpose.
What I need in the API is an ability to dictate the SSRC when I open a stream for sending (should I care to do that).
7.2 Multiple render
Again, an issue of SIP/SDP. We can control the SSRCs and split them out to allow multiplexing easily on the same RTP ports with multiple parties/sources. If they have primitives to control the streams just like we do, they can work out how to negotiate around their own problems.
7.2.1
Ugh. I’m feeling the pain. How about just giving me an API where I can indicate which streams are FEC-associated?
7.3
Give me an API to hand crypto keys to the RTP layer. Let me handle the fingerprint and security myself beyond that.
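Pulling those wishes together, a per-stream control surface might look something like this hypothetical sketch (all names invented): dictate the SSRC on send, route by SSRC on receive, mark FEC associations, and hand keys straight to the SRTP layer.

```typescript
// Hypothetical per-stream control surface (all names invented): SSRC choice on send,
// routing by SSRC on receive, FEC association, and handing keys straight to SRTP.
interface SendStreamOptions {
  ssrc?: number;        // let the application dictate the SSRC if it cares to
  fecForSsrc?: number;  // mark this stream as the FEC repair stream for the given SSRC
}

interface RtpPort {
  openSendStream(opts: SendStreamOptions): { ssrc: number };
  onIncomingSsrc(cb: (ssrc: number) => void): void;   // one port, many parties/sources
  setSrtpKeys(keys: { send: Uint8Array; receive: Uint8Array }): void;
}
```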
8.
/me shivers nervously in corner
Again, a perfect illustration why I don’t want SDP.
Could you post these observations to the rtcweb mailing list?
Sure, the w3c or the IETF?
IETF/rtcweb first, since there is a meeting next week on this topic.