Server-side NAT traversal with Kamailio: the definitive guide

If you are a retail-type SIP service provider — that is, you sell SIP service to SMB end-users rather than wholesale customers or enterprises — and your product does not include bundled private line or VPN connectivity to your customers, the vast majority of your customer endpoints will be NAT’d.

If you’re using Kamailio as a customer-facing “SBC lite” to front-end your service delivery platform, this article is for you.

There’s a lot of confusion around best practices for NAT handling with Kamailio, given (1) that there are multiple approaches to handling NAT in the industry and also (2) that Kamailio idioms and conventions for this have evolved over time. I hope this article helps to address these issues comprehensively and puts lingering questions to rest.

Before we delve into that, let’s lay down some important background…

Why is NAT such a problem with SIP?

There are a few reasons:

First — for VoIP telephony purposes, at least — SIP primarily provides a channel in which to have a conversation about the establishment of RTP flows on dynamically allocated ports. This puts it in league with other protocols such as FTP, which also do not multiplex data and “metadata” over the same connection, and instead create ephemeral connections on unpredictable dynamic ports. This is different to eminently “NATtable” protocols like HTTP, where all data is simply sent back down the same client-initiated connection.

Second, VoIP by nature requires persistent state and reachability. Clients not only have to make outbound calls, but also to receive inbound calls, possibly after the “connection” (construed broadly) has been inactive for quite some time. This differs from a more or less ephemeral interaction like that of HTTP (though this claim ignores complications of the modern web such as long polling and WebSockets).

Third, most SIP in the world still runs over UDP, which, in its character as a “connection-less” transport that provides “fire and forget” datagram delivery, is less “NATtable” than TCP. Although UDP is connection-less, NAT routers must identify and associate UDP messages into stateful “flows” as part and parcel of the “connection tracking” that makes NAT work. However, on average, their memory for idle UDP flows is shorter and less reliable than for TCP connections; in some cases it is egregiously worse, no more than a minute or so. That calls for more vigorous keepalive methods. Combined with the grim reality of increasing message size and the resulting UDP fragmentation, it’s also an excellent argument for using TCP at the customer access edge of your SIP network, but to be sure, that’s a decision that comes with its own trade-offs, and in any case TCP is not a panacea for all SIP NAT problems.

Finally: despite the nominally “logical” character of SIP URIs, SIP endpoints have come to put network and transport-layer reachability information (read: IP addresses and ports) directly into SIP messaging. No clean and universal logical-to-NLRI translation layer exists, such as DNS or ARP. A SIP endpoint literally tells the other end what IP address and port to reach it on, and default endpoint behaviour on the other side is to follow that literally. That’s a problem if that SIP endpoint’s awareness is limited to its own network interfaces (more on that in the next section).

SIP wasn’t designed for NAT. Search RFC 3261 for the word “NAT”; you’ll find nothing, because it presumes end-to-end reachability that today’s IPv4 Internet does not provide.

Client vs. Server-side NAT traversal and ALGs

Broadly speaking, there are two philosophies on NAT traversal: client-side NAT traversal and server-side NAT traversal.

Client-side NAT traversal takes the view that clients are responsible for identifying their WAN NLRI themselves and making correct outward representations about it. This is the view taken by the WebRTC and ICE scene. This is also the central idea of STUN and some firewalls’ SIP ALGs (Application Layer Gateways).

Server-side NAT traversal takes the opposite view; the client needs to know nothing, and it’s up to the SIP server to discover the client’s WAN addressing characteristics and how to reach it. In broad terms, this means the server must tendentiously disbelieve the addresses and ports that appear in the NAT’d endpoint’s SIP packets, encapsulated SDP body, etc., and must instead look to the source address and port of the packets as they actually arrive.

Server-side NAT traversal is the vantage point of major SBC vendors, and is also the most universal solution because it does not require any special accommodation by the client. Server-side is what this article is all about.

One last note on the dichotomy: client-side and server-side approaches don’t play well together much of the time. Most server-side implementations detect NAT’d clients by identifying disparities between the addresses/ports represented in SIP packets and the actual source IP and port, and take appropriate countermeasures. While it is theoretically unproblematic to give an “effectively public” (that is to say, non-NAT’d) endpoint NAT treatment anyway, this is only true if every part of the client message containing addressing is appropriately mangled at every step.

ALGs (Application Layer Gateways), a type of client-side traversal solution embedded in the NAT router itself, are especially notorious for foiling this by substituting in correct public IP/port information. However, in my experience, and that of our service provider customers, they only correct some parts of the SIP message and not others (e.g. they will fix the Via but not the Contact address, and perhaps not touch the SDP at all, and even if they do, they don’t open the right RTP ports). This way lies madness, and that’s why we hate ALGs so much, but the same caveats can sometimes apply to STUN-based approaches.

“Great, the ALG fixed all the problems!” said no one, ever. Not that I know of, anyway. Some NAT gateways allow one to disable the SIP ALG, and if you are using a server-side NAT traversal approach, you should do this. However, other consumer-grade and SMB NAT gateways do not allow you to do this, and dealing with them can be nigh impossible. The best solution is to replace the NAT gateway with a better one. If that’s not possible, sometimes they can be bypassed by using a non-standard SIP port (not 5060) on either the client or the server side, or both. However, some of them actually fingerprint the message as SIP based on its content, regardless of source or destination port. They’re pretty much intractable.

In short, if you’re going to do server-side NAT traversal, make every effort to turn off any client-side NAT traversal measures, including STUN and ALGs. The “stupidity” of the client about its wider area networking is not a bug in this scenario, but a feature.

NAT and RTP

A server-side NAT traversal strategy typically requires solutions for RTP, not just SIP.

Even if you get SIP back to the right place across a NAT’d connection, that doesn’t solve two-way media. The NAT’d endpoint will send media from the port declared in its SDP stanza (assuming symmetric RTP, which is pretty much universal), but this will be remapped to a different source port by the NAT gateway.

This requires a more intelligent form of media handling, commonly referred to as “RTP latching” and by various other terms. This is where the RTP counterparty listens for at least one RTP frame arriving at the destination port it advertised, and harvests the source IP and port from that packet and uses that for the return RTP path.

If you have a publicly reachable RTP endpoint on the other side of Kamailio which can behave that way, such as Asterisk (with nat=yes in older chan_sip versions, nat=force_rport,comedia in later ones, or rtp_symmetric=yes in chan_pjsip), you don’t need an intermediate RTP relay. However, not all endpoints will do that. For example, if you are in the “minutes” business and have wholesale carriers behind Kamailio, their gateways will most likely not be configured for this behaviour, more as a matter of policy than technology.

There are other scenarios where intermediate RTP relay may not be necessary. For example, if you are providing SIP trunking to NAT’d PBXs, rather than hosted PBX to phones (Class 4 rather than Class 5 service, in the parlance of the North American Bell system), you may be able to get away with DNAT-forwarding a range of RTP ports on the NAT gateway into a single LAN endpoint. This works because the LAN destination is single and static. A number of our customers use this strategy to great effect. Another reason you may need an intermediate RTP relay is simply to bridge topology; if your ultimate media destinations are on a private network, as for example in my network diagram below, you’ll need to forward RTP between them.

These are important issues to consider because if your entire customer base is NAT’d, being in the RTP path will greatly change the hardware and bandwidth economics of your business. Nevertheless, assuming you’ve determined that you do need to handle RTP for your customers, convention has settled around Sipwise’s RTPEngine. RTPEngine is an extremely versatile RTP relay which performs forwarding in kernel space, achieving close to wire speed. Installation and setup of RTPEngine is outside the scope of this tutorial, but the documentation on the linked GitHub page is sufficient.

As with all other RTP relays supported by Kamailio, RTPEngine is an external process controlled by Kamailio via a UDP control socket. When Kamailio receives an SDP offer or answer, it forwards it to RTPEngine via the rtpengine control module, and RTPEngine opens a pair of RTP/RTCP ports to receive traffic from the given endpoint. The same happens in the other direction, upon handling the SDP offer/answer of the other party. These new endpoints are then substituted into the SDP prior to relay, with the result that RTPEngine is now involved in both streams.
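As a minimal sketch of the Kamailio side of that control channel: the fragment below assumes RTPEngine runs on the same host and listens for control commands on UDP 127.0.0.1:2223. That address is an assumption for illustration; use whatever your RTPEngine instance is actually configured with.

```kamailio
loadmodule "rtpengine.so"

# Control socket over which Kamailio hands SDP offers and answers to RTPEngine.
# The address and port are assumptions for this sketch.
modparam("rtpengine", "rtpengine_sock", "udp:127.0.0.1:2223")
```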

What is to be done?

To provide server-side NAT traversal, then, the following things must be done within the overall logic of Kamailio route script.

  1. Ensure that transactional replies return to real source port – When an endpoint sends a request to your SIP server, normal behaviour is to forward replies to the transport protocol, address and port indicated in the topmost Via header of the request. In a NAT’d setting, this needs to be ignored and the reply must instead be returned to the real outside source address and port of the request. This is provided for by the rport parameter, as elaborated upon in RFC 3581. The trouble is, not all NAT’d endpoints include the ;rport parameter in their Via. Fortunately, there is a core Kamailio function, force_rport(), which tells Kamailio to treat the request as if ;rport were present.
  2. Stay in the messaging path for the entire dialog life cycle – If Kamailio is providing far-end NAT traversal functionality for a call, it must continue to do so for the entire life cycle of the call, not just the initial INVITE transaction. To tell the endpoints to shunt their in-dialog requests through Kamailio, a Record-Route header must be added; this is accomplished by calling record_route() (rr module) for initial INVITE requests.
  3. Fix Contact URI to be NAT-safe – This applies to requests and replies alike, and to INVITE and REGISTER transactions alike. This will be discussed further below.
  4. Engage RTPEngine – (if necessary)
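
Item 2 is the only one that needs nothing NAT-specific, so here is a rough sketch of it on its own. It assumes the rr, siputils and textops modules are loaded (for record_route(), has_totag() and is_method() respectively), and the surrounding route logic is elided.

```kamailio
request_route {
   ...

   # Initial INVITE only: in-dialog requests already carry a Route set.
   if(is_method("INVITE") && !has_totag()) {
      # Adds a Record-Route header so that subsequent in-dialog requests
      # (re-INVITE, BYE, etc.) are sent back through this proxy.
      record_route();
   }

   ...
}
```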

It’s really as simple as that.

We will discuss how to achieve these things below, but first…

Testing topology

For purposes of example in this article, I will be using my home Polycom VVX 411, on LAN subnet 172.30.105.0/24. It talks to a Kamailio server, 70.1.2.1, which also acts as a registrar, and front-ends an elastic group of media servers which are located on a private subnet, 192.168.2.0/24. This also means that the Kamailio server bridges SIP (and as we shall see, RTP, by way of RTPEngine) between two different network interfaces. This is perhaps more complex than the topology needs to be by way of example, but also illuminates a fuller range of possibilities.

A diagram may help:

[Figure: NAT traversal test topology]

The nathelper module

The nathelper module is Kamailio’s one-stop shop for NAT traversal-related functionality. Its parameters and functions encapsulate three main functional areas:

  • Manipulation of SIP message attributes to add outside-network awareness;
  • Detection of NAT’d endpoints;
  • Keepalive pinging of NAT’d endpoints.

There is a subtle link between this module and the registrar module, in that the received_avp parameter is shared among them—if you choose to take that approach to dealing with registrations.
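
A minimal sketch of that shared setting follows. The AVP name RECEIVED is an arbitrary choice for illustration; what matters is that both modules are given the same one.

```kamailio
# Both modules must point at the same AVP, so that what nathelper stores
# at registration time is what registrar reads back at lookup time.
modparam("nathelper", "received_avp", "$avp(RECEIVED)")
modparam("registrar", "received_avp", "$avp(RECEIVED)")
```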

The nat_uac_test() function performs a user-defined combination of tests to decide if an endpoint is NAT’d. The argument is a bitmask; if you’re not familiar with the concept from software engineering, it means that a combination of flags can be specified by adding them together. For example, to apply both flag 1 and flag 2, use an argument of “3”.

Here is a REGISTER request from my NAT’d endpoint:

2018/05/07 06:53:26.402531 47.39.154.156:5060 -> 192.168.2.220:5060
REGISTER sip:sip.evaristesys.com SIP/2.0
Via: SIP/2.0/UDP 172.30.105.251:5060;branch=z9hG4bKffe427d2756F1643
From: "alex-balashov" <sip:alex-balashov@sip.evaristesys.com>;tag=B84E1216-803F7CD7
To: <sip:alex-balashov@sip.evaristesys.com>
CSeq: 3561 REGISTER
Call-ID: 4ae7899d1cc396640e440df7c72662d3
Contact: <sip:alex-balashov@172.30.105.251:5060>;methods="INVITE, ACK, BYE, CANCEL, OPTIONS, INFO, MESSAGE, SUBSCRIBE, NOTIFY, PRACK, UPDATE, REFER"
User-Agent: PolycomVVX-VVX_411-UA/5.6.0.17325
Accept-Language: en
Authorization: [omitted]
Max-Forwards: 70
Expires: 300
Content-Length: 0

The Via header specifies where responses to this transaction should be sent. It can be clearly seen that although the Via header contains a private IP of 172.30.105.251:5060, the actual source of the request is 47.39.154.156:5060 (and, it should be noted, the fact that the internal port 5060 maps to an external port of 5060 is merely a coincidence from how this particular NAT gateway works; it is more typical for it to be mapped to an arbitrary and different external port). Therefore, in this case, test flags 2 and 16 to nat_uac_test() would detect this anomaly.
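
To make the bitmask arithmetic concrete, here is a small sketch combining those two tests. The log line is purely illustrative and assumes the xlog module is loaded.

```kamailio
#  2 = address in the topmost Via differs from the packet's source address
# 16 = port in the topmost Via differs from the packet's source port
#  2 + 16 = 18; the function returns true if any selected test matches.
if(nat_uac_test("18")) {
   xlog("L_INFO", "NAT suspected: $rm from $si:$sp\n");
}
```

For the REGISTER above, the address test (2) is the one that fires, since 172.30.105.251 differs from 47.39.154.156 while the ports happen to match.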

There is some debate as to whether the various tests for RFC 1918/RFC 6598 (private) addresses have merit. It’s tempting to think that one can reveal NAT straightforwardly by checking for private addresses, e.g. 192.168.0.0/16, 172.16.0.0/12, 10.0.0.0/8, in the Via or Contact headers. However, to return to the network diagram above, Kamailio is multihomed on a private as well as a public network. Although symmetric SIP signalling can be taken for granted from almost any SIP endpoint nowadays, it is nevertheless poor form to give NAT treatment to an endpoint that is directly routable. Give some thought to whether the central theme of your NAT detection approach should be in looking for private addresses, or looking for discrepancies between the represented address/port and the actual source address/port. I personally favour the latter approach.

The “old books” of nathelper vs. the new

Traditional OpenSER-era and early Kamailio folklore prescribes the use of fix_nated_contact() and fix_nated_register() functions. One can still find these in a lot of books and documentation:

fix_nated_contact() rewrites the domain portion of the Contact URI to contain the source IP and port of the request or reply.

fix_nated_register() is intended for REGISTER requests, so is only relevant if you are using Kamailio as a registrar or forwarding registrations onward (i.e. using Path). It takes a more delicate approach, storing the real source IP and port in the received_avp, where it can be retrieved by registrar lookups and set as the destination URI ($du), Kamailio’s term for the next-hop forwarding destination (overriding request URI domain and port).

fix_nated_register() is generally unproblematic, though it does require a shared AVP with the registrar module. From a semantic point of view, however, fix_nated_contact() is deeply problematic, in that it modifies the Contact URI and therefore causes the construction of a Request URI, in requests incoming to the NAT’d client, which is not equivalent to the Contact URI populated there by the client. RFC 3261 says thou shalt not do that.
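
For recognition rather than recommendation, the legacy REGISTER idiom looks roughly like the sketch below. It assumes Kamailio is the registrar, that nathelper and registrar have been given the same received_avp, and that the usrloc table is named "location".

```kamailio
if(is_method("REGISTER")) {
   if(nat_uac_test("18")) {
      force_rport();
      # Stores the real source address and port in received_avp;
      # the Contact URI itself is left untouched.
      fix_nated_register();
   }

   # save() persists the binding together with the received information.
   if(!save("location"))
      sl_reply_error();

   exit;
}
```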

The nathelper module offers better idioms for dealing with this mangling nowadays: handle_ruri_alias() and set_contact_alias()/add_contact_alias(). Using these functions, this:

Contact: <sip:alex-balashov@172.30.105.251:5060>

is turned into:

Contact: <sip:alex-balashov@172.30.105.251:5060;alias=47.39.154.156~5060~1>

and stored (if REGISTER) or forwarded (anything else). When handle_ruri_alias() is called, the ;alias parameter is stripped off, and its contents populated into the destination URI. The beautiful thing about handle_ruri_alias() is that if the ;alias parameter is not present, it silently returns without any errors. This simplifies the code by removing the need for explicit checking for this parameter.
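
A sketch of what that looks like for a request heading toward the NAT’d phone, using the Contact from the example above. The xlog line is only there to make the effect visible and assumes the xlog module is loaded.

```kamailio
# Incoming R-URI (built from the stored or previously forwarded Contact):
#    sip:alex-balashov@172.30.105.251:5060;alias=47.39.154.156~5060~1
# The alias fields are address~port~protocol; protocol 1 is UDP.

handle_ruri_alias();

# The R-URI is now back to what the client originally put in its Contact,
#    sip:alex-balashov@172.30.105.251:5060
# while $du (the destination URI) points at 47.39.154.156, port 5060, UDP.
xlog("L_INFO", "R-URI is $ru, next hop is $du\n");
```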

For the sake of simplicity and minimum intrusiveness, I strongly recommend using these functions in place of the old fix_*() functions.

Implementation

Near the top of the main request_route, you’ll probably want to have a global subroutine that checks for NAT. At this point, the logic will not be specialised based on the request method or whether the request contains an encapsulated SDP body. Critically, ensure that this happens prior to any authentication/AAA checks, as 401/407 challenges, along with all other replies, need to be routed to the correct place based on force_rport():

   
   if(nat_uac_test("18")) {
      force_rport();

      if(is_method("INVITE|REGISTER|SUBSCRIBE"))
         set_contact_alias();
   }

Later, in the loose_route() section that deals with handling re-invites and other in-dialog requests, you’ll need to engage RTPEngine and handle any present ;alias in the Request URI:

   if(has_totag()) {
      if(loose_route()) {
         if(is_method("INVITE|UPDATE") && sdp_content() && nat_uac_test("18"))
             rtpengine_manage("replace-origin replace-session-connection ICE=remove");

         ...

         handle_ruri_alias();

         t_on_reply("MAIN_REPLY");

         if(!t_relay())
            sl_reply_error();

         exit;
      }
   }

Initial INVITE handling is similar:

request_route {
   ...

   if(has_totag()) {
      ...
   }

   ...

   t_check_trans();

   if(is_method("INVITE")) {
      if(nat_uac_test("18") && sdp_content()) 
         rtpengine_manage("replace-origin replace-session-connection ICE=remove");

      t_on_reply("MAIN_REPLY");

      if(!t_relay())
         sl_reply_error();

      exit;
   }

   ...
}

To accommodate the case that requests are inbound to the NAT’d endpoint or the case that NAT’d endpoints are calling each other directly, an onreply_route will need to be armed for any transaction involving a NAT’d party. Its logic should be similar:

onreply_route[MAIN_REPLY] {
   if(nat_uac_test("18")) {
      force_rport();
      set_contact_alias();

      if(sdp_content()) 
         rtpengine_manage("replace-origin replace-session-connection ICE=remove");
    }

}

For serial forking across multiple potential gateways, it is strongly recommended that you put initial invocations to RTPEngine into a branch_route, so that RTPEngine can receive the most up-to-date branch data and potentially make branch-level decisions.
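
A rough sketch of that arrangement is below. The route name RTP_BRANCH is an arbitrary choice, and the rest of the INVITE handling is elided.

```kamailio
request_route {
   ...

   if(is_method("INVITE")) {
      # Arm the branch route instead of calling RTPEngine here.
      t_on_branch("RTP_BRANCH");
      t_on_reply("MAIN_REPLY");

      if(!t_relay())
         sl_reply_error();

      exit;
   }
}

branch_route[RTP_BRANCH] {
   # Runs once per outgoing branch, just before the request is sent.
   if(nat_uac_test("18") && sdp_content())
      rtpengine_manage("replace-origin replace-session-connection ICE=remove");
}
```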

Registration requests are already handled by the general NAT detection stanza above. However, registration _lookups_ require an additional nuance:

route[REGISTRAR_LOOKUP] {
   ...

   if(!lookup("location")) {
      sl_send_reply("404", "Not Found");
      exit;
   }

   handle_ruri_alias();

   if(!t_relay())
      sl_reply_error();

   exit;
}
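
For completeness, the storing side might look like the sketch below. The route name REGISTRAR_SAVE is hypothetical, and it assumes authentication has already succeeded and that the general NAT detection stanza ran earlier in request_route.

```kamailio
route[REGISTRAR_SAVE] {
   # set_contact_alias() has already been applied to NAT'd clients by the
   # general stanza, so the Contact stored here carries the ;alias parameter.
   if(!save("location"))
      sl_reply_error();

   exit;
}
```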

That’s really it!

What about NAT’d servers?

In cloud and VPS environments, it is getting quite common to have a private IP address natively homed on the host with an external public IP provided via 1-to-1 NAT.

Kamailio’s core listen directive has a parameter to assist with just this:

listen=udp:192.168.2.119:5060 advertise 70.1.2.1:5060

This will ensure that the Via and Record-Route headers reference the public IP address rather than the private one. It has no impact on RTP.

Topology bridging with RTPEngine + NAT

The discerning observer will note that the foregoing invocations of rtpengine_manage() did not address a key requirement of the network topology outlined in the diagram, the need to bridge two disparate network topologies.

This requires two different RTPEngine forwarding interfaces, one of which has a public IP via 1-to-1 NAT. The latter would seem to require something like an advertise directive, but for RTP. Fortunately, RTPEngine has such an option, applied with the ! delimiter:

OPTIONS="-i internal/192.168.2.220 -i external/192.168.2.119!70.1.2.1"

The direction attribute to rtpengine_offer() (or, equivalently, the initial call to rtpengine_manage()) allows one to specify the ingress and egress interfaces respectively:

rtpengine_manage("replace-origin replace-session-connection ICE=remove direction=internal direction=external");

Subsequent calls to rtpengine_manage(), including calls in onreply_route, will appropriately take into account this state and reverse the interface order for the return stream as needed.

Keepalives and timeouts

The most common challenge with NAT’d SIP endpoints is that they need to remain reachable in a persistent way; they can receive inbound calls or other messages at any moment in the future.

Recall that NAT gateways add mappings for the connections or connection-like flows that they detect (remember that for NAT purposes UDP isn’t truly “connection-less”), e.g. from 192.168.0.102:5060 to $WAN_IP:43928. For as long as that mapping exists, any UDP sent to $WAN_IP:43928 will be forwarded to 192.168.0.102:5060.

The problem is that this mapping is removed after relatively short periods of inactivity. In principle this is a good thing; you wouldn’t want your NAT gateway’s memory filled up with ephemeral “connections” that have long ceased to be relevant. However, while, in our experience, most timeouts for UDP flows are in the range of a few minutes, there are some routers whose “memory” for UDP flows can be exceptionally poor — one minute or less. The same thing holds true for TCP, but UDP tends to be affected more egregiously.

When the connection tracking “mapping” goes away, the NAT gateway drops incoming packets to the old $WAN_IP:43928 destination on the floor. Consider this example:

[Screenshot: packet capture of requests toward the phone going unanswered]

In this test topology, 10.150.21.6 is a Freeswitch PBX on a private network (10.150.21.0/24) that receives registrations relayed from Kamailio (with help from the Path header). Kamailio is multi-homed on a private (10.150.20.2) and public (209.51.167.66) interface, the latter of which is presented to outside phones.

A registration which occurred about 15 minutes prior had established a contact binding of 47.39.154.156:5060 for my AOR (Address of Record). However, as no activity had occurred in this flow for as long, the NAT router “forgot” about it, and you can see that efforts to reach the phone go nowhere. An ICMP type 3 (port unreachable) message (not shown) is sent back to Kamailio and that’s the end of it.

So, to keep NAT “pinholes” — as they’re often called — open, some means of generating frequent activity on the mapped flow is required.

The easiest and most low-hanging solution is to lower the re-registration interval of every NAT’d device to something like 60 or 120 seconds; this will generate a bidirectional message exchange (REGISTER, 401 challenge, 200 OK) which will “renew” the pinhole. This is effective in many cases. But there are two problems:

  1. Interval can’t be too low – Many devices or SIP registrars will not support a re-registration interval of less than 60 seconds, and believe it or not, that’s not low enough for some of the most egregious violators among the NAT gateways out there.
  2. Performance issues for the service provider – In a sympathetic moment, consider things from your SIP service provider’s perspective: tens of thousands (or more) of devices are banging on an SBC or an edge proxy — and with registrations no less, which are rather expensive operations that typically have some kind of database involvement for both authentication and persistent storage. That can greatly change the operational economics. So, as a matter of policy, allowing or encouraging such low re-registration intervals may not be desirable.
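
If you do lean on re-registration, and Kamailio is your registrar, the registrar module lets you bound the interval that clients are granted. The values below are illustrative assumptions, not recommendations.

```kamailio
# Upper bound: a client asking for more than this is granted this instead,
# which forces a re-REGISTER at least every two minutes.
modparam("registrar", "max_expires", 120)

# Lower bound: protects the registrar from clients asking for very short
# intervals.
modparam("registrar", "min_expires", 60)
```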

Enter the “keepalive”, a message sent by either server or client that garners some kind of response from the other party. Keepalives are an improvement over registrations in that they are not resource-intensive, since they invite only a superficial response from a SIP stack.

There are two types of keepalives commonly used in the SIP world: (1) a basic CRLF (carriage return line feed) message, short and sweet, and (2) a SIP OPTIONS request. While OPTIONS ostensibly has a different formal purpose, to query a SIP party for its capabilities, it’s frequently employed as a keepalive or dead peer detection (DPD) message.

Many end-user devices can send these keepalives, and if your end-user device environment is sufficiently homogeneous and you exert high provisioning control over it, you may wish to configure it that way and simply have Kamailio respond to them. In the case of OPTIONS pings, you will want to configure Kamailio to respond to them with an affirmative 200 OK:

    if(is_method("OPTIONS")) {
        options_reply();
        exit;
    }

That goes in the initial request-handling section, toward the bottom of the main request route.

Pro-tip: Most end-user devices will send an OPTIONS message with a Request URI that has a user part, i.e.

OPTIONS sip:test@server.ip:5060 SIP/2.0

There is a valid debate to be had as to whether this is appropriate, since, strictly speaking, it implies that the OPTIONS message is destined for a particular “resource” (e.g. Address of Record / other user) on that server, rather than the server itself. Nevertheless, this is how a lot of OPTIONS messages are constructed. The Kamailio siputils module, which provides the options_reply() function, takes a fundamentalist interpretation in this debate: it will not answer an OPTIONS request whose Request URI contains a user part, so many such keepalives would go unanswered.

A slightly unorthodox but effective workaround, since keepalive applications of the OPTIONS message seldom care about the actual content of the response:

    if(is_method("OPTIONS")) {
       sl_send_reply("200", "OK");
       exit;
    }

You may find more profit in server-initiated keepalive pinging, however. The Kamailio nathelper module provides extensive options for that as well. Start with the NAT pinging section of the nathelper module documentation.
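
As a starting point, a server-initiated OPTIONS ping setup might look like the sketch below. It assumes Kamailio is the registrar (pings are sent to contacts in usrloc); the branch flag numbers 6 and 7, the 30-second interval and the From URI are all arbitrary choices for illustration.

```kamailio
modparam("usrloc", "nat_bflag", 6)               # marks a contact as NAT'd
modparam("nathelper", "natping_interval", 30)    # seconds between pings
modparam("nathelper", "ping_nated_only", 1)      # skip contacts without nat_bflag
modparam("nathelper", "sipping_bflag", 7)        # send a SIP request, not a bare UDP packet
modparam("nathelper", "sipping_method", "OPTIONS")
modparam("nathelper", "sipping_from", "sip:pinger@sip.example.com")

request_route {
   ...

   # Flags set before save() are stored with the contact binding.
   if(is_method("REGISTER") && nat_uac_test("18")) {
      setbflag(6);
      setbflag(7);
   }

   ...
}
```

A SIP ping has the advantage over a bare UDP packet that the phone answers it, so traffic crosses the NAT gateway in both directions.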

UDP fragmentation

The tendency over time is for the median size of SIP messages to creep up: SDP stanzas get bigger as more codecs are on offer, new SIP headers and attributes enter into use, etc.

When the payload size of a UDP message gets to within a small margin of the MTU (typically 1500 bytes), it gets fragmented. UDP does not provide transport-level reassembly as TCP does. Because only the first fragment will contain the UDP header, it takes considerable cleverness to reassemble the message. Kamailio’s SIP stack can, of course, do this, as can many others in the mainstream FOSS world. However, many user agents cannot.

More damningly, there’s virtually a zero-percent chance that a NAT gateway will handle UDP fragmentation correctly. So, as a rule of thumb, it is eminently safe to assume that a NAT’d endpoint will not receive a fragmented SIP message.

Strategies for dealing with this phenomenon are detailed in a separate post all about UDP fragmentation on this blog, but the short answer is: use TCP. It’s what RFC 3261 says to do.
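
On the Kamailio side, offering TCP at the access edge is mostly a matter of adding a listener. The sketch below reuses the addresses from the UDP listener example earlier; the lifetime value is an assumption and should sit a little above your longest registration interval, so that an idle connection to a NAT’d client is not closed between re-registrations.

```kamailio
# TCP listener alongside the UDP one, advertising the same public address.
listen=tcp:192.168.2.119:5060 advertise 70.1.2.1:5060

# Seconds an idle TCP connection is kept open (illustrative value).
tcp_connection_lifetime=3605
```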

What about SIP Outbound?

RFC 5626, known as “SIP Outbound”, is the latest opus of the IETF’s copious intellectual output on these topics. As is true of many such complicated ventures, Kamailio has supported it for a long time but SIP UAs in the wild seldom do.

In brief, SIP Outbound proposes the establishment of multiple concurrent connection flows by the client for redundancy. A basic tenet of this arrangement is that all responsibility for establishing connections through NAT, as well as for their maintenance and upkeep, lies with the client. There are a lot of other details involved, mainly to do with the registrar only using one of the “flows” at a time to reach a client with multiple registrations, so that multiple registrations established for redundancy do not lead to multiple forked INVITEs to the client. Some new parameters are involved in this new layer of bureaucracy for the registrar: instance-id and reg-id.

A full exposition of how it all works is certainly beyond the scope of this article, but RFC 5626 is captivating bedtime reading. However, until and unless widespread UA support for it appears, this author cannot be moved to say, “Use SIP Outbound, it’ll solve your NAT traversal problems!”

Written By Alex Balashov