VoIP service provider anti-patterns

In the software business realm, the term anti-pattern refers to dysfunctional but commonplace solutions to technical and business problems. Anti-patterns occur widely enough that one can reasonably generalise about them — that’s why they’re “patterns”.

There are many acknowledged technical anti-patterns in the software engineering world, such as database-as-an-IPC, or, my personal favourite (hat tip to The Daily WTF), the Inner-Platform Effect. The latter will be easily recognisable to a programmer who has been asked to write an application with business rules and data objects that can be extensively customised by non-programmers; invariably, the customisability demands made upon such systems approach infinity, ensuring that, over time, working with the system comes to resemble programming (if not necessarily “coding”) in its cognitive and technical dimensions, and therefore to require a skill set that approximates that of a programmer. Not only does this fail to address the original demand of the businesspeople–reduce dependence on programmers–but now there is a poorly performing, half-baked system-within-a-system. Such a system has all of the downsides without any of the benefits. Yet, it happens all the time where people who don’t really understand how software works are in charge. If you work with many organisations, you’ll come to encounter some manifestation of it over and over. That’s what makes it a prime example of an anti-pattern.

It’s hard to meaningfully identify anti-patterns in new industries or fields of commercial endeavour. The VoIP ITSP is a relatively recent development, all things considered. Companies in immature industries whose business models and equilibria are still unsettled tend to try lots of different ways to make money, as well as to package and productise what they do in different shapes and sizes. Failed experiments–even repeated failures–in new growth markets aren’t necessarily anti-patterns. A lot of praxis, industry consolidation, and market development has to happen before something can truly be deemed an anti-pattern.

I would go further than most to say that anti-patterns aren’t just ideas that have repeatedly seen failure over a significant stretch of time, but in fact are consistently bad ideas or misconceptions that, for one reason or another, retain a stubborn hold on the imagination of businesspeople and managers even in the face of accumulated common knowledge that they are bad. My preferred term for such ill-fated ventures is “worst practices”. By now, the VoIP ITSP (Internet Telephony Service Provider) industry has been around long enough, and I have worked with so many ITSPs that I can safely venture to typify some of these “worst practices” from my 12 years of consulting experience, growing up with this industry.

Not all of them can be identified and avoided, of course, and I certainly don’t plan to try to survey them all in one meagre article. But some of the more conspicuous ones are worth stepping through in the hope that it will help someone starved of quality advisors avoid bad technical and business decisions. I must also be candid about the limitations of my perspective; my experience is primarily (though not exclusively) with US-based small to medium ITSPs and telcos, and heavily weighted toward open source VoIP solutions, so some of what I say may be unavoidably tendentious from a strictly enterprise or profoundly foreign-market perspective. Caveat emptor.

#1: SBC metaphysics

The Session Border Controller (SBC) industry has come to have an indelible hold on the conceptual vocabulary in which VoIP-related plans are laid. I have spoken to this in a previous article on the suitability of Kamailio as an SBC and elaborated upon the problems it poses in my Kamailio World 2019 talk in Berlin (“Kamailio as an SBC: definitive answers”), so I won’t belabour it here very much.

What bears remark here is that there are a lot of ways to engineer the core, the customer access layer, and the intra-industrial carrier interface of VoIP networks, and conventional SBCs from the big brands are only one of several possible avenues, each with their own trade-offs deserving of thoughtful consideration. Yet somehow, there is a pervasive meme out there that SBCs are the essential building block of VoIP service delivery.

SBCs consist of a SIP back-to-back user agent (B2BUA) and (typically) integrated media relay combined with a routing policy engine/business layer, plus some other commonplace features (server-side NAT traversal, DoS protection/security, etc.) packaged into a particular kind of appliance. Talking about VoIP networks strictly in terms of SBCs is like being fixated on Coca-Cola in a conversation that is ostensibly about beverages as a genus.

Overlooking the mountains of money so often shoveled into an open pit by buying, licencing and supporting SBCs where they are profoundly unnecessary (and only very occasionally, overlooking them where they are in fact necessary), this is an anti-pattern because of the sheer amount of communication and time spent when people talk past each other, typically where one side lacks the imagination, expertise, or agility to tune out a certain amount of brainwashing by the SBC industry and think in more fundamental SIP architecture concepts. A lot of cognitive bandwidth is sucked up in meetings and on conference calls trying to unpack business requirements that are slathered in the vernacular of SBC features and related marketing gibberish when all one really needs to talk about is SIP endpoints, and perhaps proxies.

By no means am I saying that SBCs are useless or inappropriate. Indeed, they are properly applicable to a variety of scenarios. Nor am I saying that the names the SBC industry has given to things are ipso facto bad; in some cases, they are a fine flag of convenience. However, a shockingly large amount of valuable engineering time is spent driving the conversation with business stakeholders to a point outside of a maddeningly confined SBC “Matrix” where one can explore options comprehensively. It’s a ubiquitous tax on getting real work done.

#2: Rejected open-source platform transplants

The acquisition of ITSPs with custom platforms built on open-source seems to be cyclical. There are periods where larger telecoms buy open-source service providers with voracious appetite, and there are periods where the discourse is all “if it’s not Broadsoft or Sonus, we don’t understand it”.

Everyone understands a customer/revenue grab; buy the subscribers, get a few decent acqui-hirees out of it, transition the subs to Sonus or Metaswitch or whatever, kill the scrappy legacy open-source platform (at which point the acqui-hirees may leave, whatever), end of story. Because small ITSPs tend to cultivate smaller customers and often to stake out local-colour or vertical-specific niches, two icebergs of some specificity lie there (both avoidable by seasoned management): the cost scaling and profitability implications of delivering smaller transactions and booking much lower ARPUs than the larger entity may be set up for, and the possibility of churning away not-so-sticky customers who preferred the old platform or the old crew over your cookie-cutter POTS/Key System replacement. Setting aside the more general and universal problems of any acquisition, such as integration of IT systems and billing, management of support workflow and process, etc., this is fundamentally doable.

But it’s far easier to buy customers on a mainstream big box platform that is already more consumable by the enterprise. As far as I can tell, the motives for buying an open-source ITSP with a custom platform are usually strategic; it’s a technology buy, with eyes on the intellectual property as a vehicle for saving money on big brand licencing costs, offering new products or entering new kinds of specialty markets, or just adding a down-scale switch to be able to deliver smaller transactions for smaller customers with more economies.

Strategic technology buys of open-source custom platforms are a fertile field for serious problems. The problem I see so often–so very often that it pleads for a star on the anti-pattern walk of fame–is that the acquirer is a sales-heavy organisation with the wrong kind of “corporate DNA”. A typical master agent/channel partner/managed services provider is sales-driven and very light on engineering, and can’t properly metabolise the huge core competency commitment that open-source platforms demand. It often comes as something of a shock to them that open-source is not, in fact, free, and virtually always requires an engineering-led corporate culture. The latter is a critical factor to attracting and retaining the kind of engineering talent that is needed to run, maintain, and above all else, extract value from such a platform, and sales shops don’t have that sauce. They’re often blindsided. Even if they take a write-down, they’re stuck with uneconomical legacy commitments to customers they don’t really want and can’t quite unload without churning them straight out the door.

Beyond that, I’ve seen enough FreeSWITCH or Asterisk potpourri slapped together under the heading of some kind of next-generation “cloud platform” acquired for comically large multi-million dollar sums to deduce that as often as not, the typical acquirers of these things do not really understand what they’re getting, and are susceptible to exuberant valuation voodoo. One tends to impute magic to what one does not understand, especially if it comes without overt enterprise-sized licencing and support costs. A lot of engineering effort may have gone into these platforms, but they are seldom “turn-key” as sales shops ordinarily understand that word.

Even for acquirers with some non-trivial commitment to engineering and a strong internal customer support organisation, it’s important to understand that, depending exactly on what you’re buying and its exact contours, you probably can’t take your support staff and just train them up on this new open-source stuff real quick, as one might with an Avaya, 3CX or Broadsoft grab. Open-source isn’t free; the costs are usually paid in operational expertise and integration costs rather than port licences.

I’m not saying that buying open-source platform companies for the platform is inherently a bad idea. But do your technical due-diligence, or hire someone who really gets open-source IP telecommunications to do it for you. More impactfully: take a serious, cold-eyed and blunt look at your human resources, with a special eye to whether your organisation has the in-house engineering apparatus for the care and feeding of your new creature. It’s natural to ask these questions before investing in a big, expensive commercial platform, but for some reason the critical faculties often seem to be offline when buying a free-range, grass-fed, organic open-source voice medley.

It takes a certain kind of company to competently appropriate and fruitfully squeeze value out of an open-source platform, just as an organ can only go into a certain kind of body and blood type. Put it in the wrong body, and the immune system will reject it right out. To the dismay and bewilderment of many sales folks moving seats and trunks through the channel, a lot of these genetic factors relate to people and culture. Your culture may not be wrong and it may not be bad for running a hosted PBX sales machine, but the fact is, you may not have the kind of place where open source-savvy engineering talent lingers, nor the business processes, workflows and institutional memory to embrace open-source.

Even if you’re lucky to get a crack development team as part of the acquisition, plan for them putting in their notice at 23:59 on the day their earn-out contract ends or their options vest or whatever, and figure out what you’re going to do without them. There’s really no large-scale track record of FOSS developers being happily absorbed into some sort of Borg cube, un-learning cherished customs and habits, and embracing things like C# and change control. I’ve heard of too many executives flabbergasted that the acqui-hires leave; “but we pay them so well!” Well, they built a highly scalable open-source platform (right? See the caveat about ensuring you get what you think you’re buying); they’ve got options.

I’m an open-source VoIP consultant. My colleagues and I would love to take your money under the theory that we’re going to rescue you, and it happens often enough that it’s a whole anti-pattern. But we also don’t want to be a begrudgingly necessary cost centre, and we know this isn’t what you had in mind with a strategic investment in a custom platform.

#3: Cutting out the switch resale platform

This one is a more generic variation of the previous theme, and I sense an uptick in recent years. The archetypal actor is, again, the sales-heavy managed service provider or agency without much engineering in the tent. Indulging the all-American zeal for “cutting out the middleman” and the broader mercantile passion for increasing gross margins, the notion possesses them to get off Coredial, someone else’s Broadsoft partition or what have you, and buy their own Class 4/5 switch platform.

The problem is, even the most artisanally packaged enterprise switch solutions require the operator to take fairly deep technical ownership. Operationalising a switch into the business, to say nothing of the migration process or the necessary back-office integrations and process development, demands a technical core competency commitment a sales-focused shop may not understand or be prepared to make. More damningly, they may not know how to so much as try; executives without competent technical advisors don’t know how to hire or nurture next-level technicians.

The real self-styled trail-blazers in this group are excellent candidates for the disaster outlined in anti-pattern #2, as they see in an open-source platform buy a seductive opportunity to kill three birds with one stone:

  • Avoid the mega CAPEX and OPEX of big-brand commercial solutions;
  • Stop their switch platform provider relieving them of a sizable chunk of their subscriber revenue;
  • Grab intellectual property/technology capital with valuation multiplier effect on a future acquisition.

If it were so easy, everyone would do it. There’s a lot of reflexive dismissal of the value-add of perceived middlemen among this crowd. The value-add is usually invisible until it’s gone. Selling PBX and trunking isn’t the same as running the PBX and trunking, and there are a lot of sales-focused MSPs out there who would make a lot more money if they just stayed in their lane and didn’t try to run switches. I say that as someone eager to sell you a Class 4 trunking platform of your very own.

(I’d be remiss not to give an honourable mention to the small, but not wholly invisible subset of these companies who get the idea to build their own softswitch and/or SIP stack, though their efforts are mostly abortive. They are typified by a swashbuckling frontiersman type who is only emboldened by others’ dismissal of this “impractical” or “quixotic” venture, believing himself to have struck gold if it’s got the naysayers exercised. Go forth, pioneer, and blaze the path.)

#4: Mindless stampede into The Cloud

It’s not really news that a lot of open source-centric ITSPs have jumped onto the bandwagon of a build-out onto Amazon Web Services (AWS), or one of the other cloud majors. I covered some of the common misconceptions around this in some detail in my Kamailio World 2018 talk – “Kamailio in the ITSP: The Changing Winds”.

The operative fantasy here among most executives is that infrastructure can be someone else’s problem and one can fire those sysadmins, NOC techs, and gophers who are sent to the data centre at 3 AM to swap out blown power supplies. That’s an understandable aspiration, but one which does not in any way require nor specifically point to an AWS, Azure or Google cloud deployment.

Ultimately, this all stems from a deliberately engineered conflation between “cloud” (as the foregoing vendors implement and define it), and “running things on someone else’s computers”. It just so happens that Amazon and friends have captured and packaged this burning desire to run things on someone else’s computers in a way that is, from a marketing point of view, digestible to the business class at large, and have accordingly been granted something of a monopoly on the concept of farming out the infrastructure problem in general.

Running your communications platform on infrastructure cared for by a third party has been possible for a very long time in the form of leased dedicated servers and leased virtual machines. Indeed, a great deal of clustering and automatic service discovery on such a layout is made possible by modern tooling. Many providers offer measured hourly billing and straightforward APIs to automatically provision, turn up and spin down servers “elastically” in response to shifting demand throughout the course of a business day. Running an ITSP without owning or maintaining a single physical server has been possible since at least the mid-2000s.
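To make the point concrete, the “elastic” part is little more than a sizing decision fed into whatever provisioning API the hosting provider exposes. What follows is a minimal Python sketch, not any provider’s actual API: the function names, the per-server call capacity and the 25% headroom are all hypothetical values chosen for illustration.

```python
import math


def servers_needed(concurrent_calls: int, calls_per_server: int,
                   headroom: float = 0.25) -> int:
    """Servers required to carry the load with spare headroom (never fewer than one).

    calls_per_server and headroom are assumed figures; real values come from
    load testing your own platform.
    """
    if calls_per_server <= 0:
        raise ValueError("calls_per_server must be positive")
    target = concurrent_calls * (1 + headroom)
    return max(1, math.ceil(target / calls_per_server))


def scaling_action(running: int, concurrent_calls: int,
                   calls_per_server: int) -> int:
    """Positive: servers to provision. Negative: servers to drain and release."""
    return servers_needed(concurrent_calls, calls_per_server) - running
```

With an assumed capacity of 500 calls per server, 1,000 concurrent calls gives a target of three servers; the difference from what is currently running is what would be handed to the provider’s create or destroy call.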

AWS, for example, offer a particular paradigm for elastic, on-demand computing that, if used as they intend, makes heavy use of (e.g. Amazon’s) proprietary tooling and infrastructure helpers. It also requires extensive familiarity with the AWS Way of Doing Things, from a nuanced understanding of the limitations of various instance sizes, to their software-defined networking and security concepts, to various complementary products such as dynamic storage (EBS, S3, etc.). Moreover, AWS was built to meet the needs of web application and web service delivery; special considerations are required to run real-time, delay-sensitive media-involved communications on that type of system.

Thus, there are two distinct but related misconceptions proffered in the exuberance over “cloud”–in any form, really–which lead to the diagnosis of anti-pattern:

  • “Cloud” infrastructure magically runs itself and requires little or no headcount to support it;
  • No idiosyncratic knowledge is required to competently leverage an esoteric platform such as AWS.

Neither is true. What is true is that the nature of the required skill set changes, often with significant consequences; infrastructure consisting of your own server hardware can be supported by more or less entry-level IT staff with something like an ‘A+ Certification’ and a basic command of Linux, while any cloud venture, whether it’s of the esoteric AWS-style flavour or more generic, is going to involve DevOps-heavy concepts such as configuration management, orchestration, service discovery, etc.

Note especially the “Dev” part of “DevOps”; a lot of cloud architecture management relies on semi-programmatic tooling that draws upon skill sets higher up the technical value chain, and, accordingly, pay scale. It’s probably true that you can reduce operations headcount with cloud, but you most certainly cannot eliminate it, and what headcount you do have will probably be more costly because of the higher skill requirements of managing it.

Anyway, it does not seem that either fact is particularly well-known, if we are to judge by the number of scenarios in which folks uncritically ploughed straight into an AWS or similar deployment without much forethought. Businesspeople who bought into the marketing around the concept of shedding operations baggage and throwing it all into “the cloud” are often surprised that managing cloud costs major money, and requires additional “elastic” resources and services they did not plan for–together with people who know how to use them, and especially to massage them in ways that meet the needs of IP telecoms.

This doesn’t mean that running IP telecom systems in AWS or similar is impossible or ill-advised. In fact, some of our largest and most successful customers do exactly that. However, it should not be confused with hosted infrastructure; if used as intended, AWS entails a lot of learning, and it is certainly Amazon’s aim to foster dependence on their cloud tooling with a view to vendor lock-in. If you don’t use AWS or its ilk as intended, there aren’t necessarily a lot of benefits to using it, and potentially plenty of downsides.

To properly realise value from AWS or its cousins, you have to really understand how to do cloud architecture right within their paradigm, and take full advantage of the various auto-scaling and self-assembly mechanisms on offer. If you’re not doing that, there’s no rationality in using the major cloud platforms, and depending on your service delivery architecture, there may not be much point in a full-bodied commitment to this kind of cloud approach. Either way, the devil is entirely in the details, and the decision must be carefully weighed against other cloud alternatives or more traditional infrastructure–which, as mentioned above, can still be made quite “elastic”. For many ITSPs, the latter is, in fact, from a holistic business point of view, the most sensible choice.

As a colleague pointed out to me, the biggest victims with the most to lose from not properly grasping the costs and benefits of cloud in detail are established companies who are already heavily invested in their own facilities. When the siren song of “cloud” and the chilly wind of FOMO (Fear of Missing Out) blow through the country club during tee time, their executives end up with huge OPEX for their colocation facilities, and more huge OPEX for their cloud build-out, which now, without a trace of irony, runs things they could be running in their existing facilities, along with an over-taxed operations team burdened with maintaining two infrastructures that demand not-especially-overlapping ways of doing things. Alternately, they might hire an additional and massively expensive DevOps team specifically for the cloud operation. The vanishing horizon of aspirational, never-completed migrations to the cloud platform is a frequent theme in such cases.

I’m not against IP telephony in the cloud, and don’t consider it an anti-pattern. The anti-pattern is blind, pollyannaish, didn’t-see-it-coming, who-knew-this-shit-is-complicated, I-thought-there’d-be-savings marriage to the most iconic cloud platforms without a diligent and qualified analysis of the true, fully burdened costs and the human capital shifts required.

#5: B2BUAs and heavyweight network elements as quick-fix band-aids

This is a narrowly technical one, but we have seen it a lot in our Kamailio consulting work.

The typical case is that of an ITSP that has built out a Kamailio-centric routing platform but has hit a knowledge limit in what they can do with Kamailio while facing an immediate ask from the business side. In a rush to fulfill, they fall back to using the tools they know best–FreeSWITCH, Asterisk, etc.

It is not widely appreciated that RTPEngine can do transcoding and call recording, and that Kamailio can:

  • rewrite ANI/Caller ID in a SIP-compliant manner;
  • speak to SIP-over-WebSocket (WebRTC) endpoints;
  • flexibly rate-limit SIP requests;
  • count and limit concurrent calls;
  • make arbitrary database queries;
  • manipulate codecs in SDP;
  • and even asynchronously query HTTP APIs and parse their JSON output.

Yet, we often see a proliferation of a miscellany of FreeSWITCH or Asterisk servers to do these kinds of things, often for simple lack of awareness that they can be done any other way.
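Of the capabilities just mentioned, codec manipulation in SDP shows most plainly why no media server is needed: when the goal is merely to withdraw a codec from an offer, the job is text rewriting on the message body, with no RTP touched. The Python sketch below illustrates the idea only; it is not how Kamailio implements it, the function name is hypothetical, and it assumes a single audio stream and a small table of static payload types.

```python
import re

# A few static RTP payload types that may be offered without an a=rtpmap line.
STATIC_PAYLOADS = {"PCMU": "0", "GSM": "3", "PCMA": "8", "G722": "9", "G729": "18"}


def strip_codec(sdp: str, codec: str) -> str:
    """Remove one codec, by name, from an SDP body (hypothetical helper).

    Simplification: payload numbers are treated as global, so this assumes a
    single media stream. If every format is removed the m= line is left with
    no payload types, which a real deployment would have to reject.
    """
    lines = sdp.splitlines()
    codec = codec.upper()

    # Pass 1: work out which payload type numbers belong to the codec.
    doomed = set()
    if codec in STATIC_PAYLOADS:
        doomed.add(STATIC_PAYLOADS[codec])
    for line in lines:
        m = re.match(r"a=rtpmap:(\d+) ([^/\s]+)", line)
        if m and m.group(2).upper() == codec:
            doomed.add(m.group(1))

    # Pass 2: drop those numbers from m= lines and delete their attributes.
    out = []
    for line in lines:
        if line.startswith("m="):
            parts = line.split(" ")  # m=<media> <port> <proto> <fmt> ...
            kept = [pt for pt in parts[3:] if pt not in doomed]
            line = " ".join(parts[:3] + kept)
        else:
            m = re.match(r"a=(?:rtpmap|fmtp|rtcp-fb):(\d+)\b", line)
            if m and m.group(1) in doomed:
                continue
        out.append(line)
    return "\r\n".join(out) + "\r\n"
```

Given an offer listing PCMU, PCMA, G729 and telephone-event, calling `strip_codec(offer, "G729")` leaves the other three in place and removes payload 18 along with its `rtpmap` and `fmtp` attributes.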

The result is redundancy and SPOF (Single Point of Failure) concerns, Rube-Goldbergian call flows, and a morass of burdensome infrastructure commitments and associated costs, the business risk of pertinent knowledge walking out the door, etc.

This is not to say that there is no legitimate use-case for a B2BUA in the call path of an otherwise proxy-heavy platform. For example, although interoperability is, overall, a declining problem relative to a decade ago, there certainly remain cases where a B2BUA is the best vehicle to mediate between two subtly different flavours of SIP; a B2BUA can be liberal in what it accepts, and conservative in what it emits. And of course, B2BUAs continue to serve a number of topology hiding needs helpful to certain business models or security, or both.

Still, one should consider whether, in such a case, to deploy a lightweight, signalling-only B2BUA without an attached media gateway apparatus, as opposed to a full rig suitable for PBX or application server duty. Not merely once or twice, I’ve walked into a deployment with an otherwise powerful and logic-ridden Kamailio load balancer spreading calls across 15 Asterisk or FS servers for the sole purpose of doing something like ANI/Caller ID manipulation. This is irrational and wasteful, but is sufficiently widespread to earn mention as an anti-pattern.

#6: Back-end development done by front-end developers

Real-time communication systems have exacting timing and performance requirements that take real back-end programming experience and expertise to meet.

I don’t mean to make systems programming sound like rocket science; it’s not. However, it is a different problem space than front-end application development or dealing with HTTP workloads, and requires a deep understanding of parallelism and concurrency among other topics. The average web developer, which is what the term “developer” has come to presume in the eyes of many businesspeople nowadays, does not have the expertise to build services for high-performance call processing–at least, not without some help. All developers have their specialties.

The rise of NodeJS and isomorphic front/back-end JavaScript has done much to muddy the waters by giving currency to the idea that JavaScript web developers can write back-end services. One back-end service is not the same as another. Yet somehow, the idea has become widespread that “developer” means “JavaScript developer” and that “development” is fungible. The ludicrous and facile meme of “full-stack developer” bandied about in this context is still more misleading; the “full stack” of a web application is not the “full stack” of the rest of networked computing.

This problem pre-dates server-side JavaScript, though. In the mid-late 2000s, I was involved in rescue efforts that seemed to have been made necessary by a conversation like this:

A: “We need to build a PHP front-end for our contact centre product.”
B: “Okay, I’ve hired some PHP developers and they’ve built the front-end.”
A: “Oh. Now we have to build the actual call processing logic, I guess.”
B: “We need to hire developers for that.”
A: “But we’ve already hired developers.”
B: “Yes, they’re PHP developers.”
A: “Right, developers, so let’s have them develop the back-end.”
B: “…”

A few synchronous, blocking and poorly-performing, database-bound PHP-AGI scripts later, and the telephony backend was born.

While it seems doubtful, for ecosystem reasons alone, that anyone would non-ironically hire PHP developers in 2019, things aren’t too different a decade later. It’s just that now, a commonly preferred way to shoot oneself in the foot is with a blunderbuss that says something like, “drive SIP call routing with a single-threaded Node API service. Everything’s asynchronous, right?” The choice of technology is hardly the point here, and I’m not knocking Node–it’s perfectly good for what it is. There are simply a lot of considerations that go into a technical decision like that, not the least of which is how the call-processing SIP element can consume such an API without compromising its throughput, how much work Node can really do in one thread, etc. Replace “Node” with “Java servlets” or “Python Flask web services” or whatever and the same basic idea applies; high-level web technology is not systems programming.
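The “everything’s asynchronous” assumption is easy to put to the test. The sketch below uses Python’s asyncio, which, like Node, runs its handlers on a single-threaded event loop; it is an analogy under that assumption rather than a Node benchmark, and the handler names and the 200 ms delay are invented for illustration. A heartbeat task records the longest gap between its own turns on the loop while one routing “lookup” runs.

```python
import asyncio
import time


async def heartbeat(stop: asyncio.Event) -> float:
    """Tick every 10 ms and return the longest gap seen between ticks."""
    worst = 0.0
    last = time.monotonic()
    while not stop.is_set():
        await asyncio.sleep(0.01)
        now = time.monotonic()
        worst = max(worst, now - last)
        last = now
    return worst


async def blocking_lookup() -> None:
    # Stands in for a synchronous database call or CPU-bound work.
    time.sleep(0.2)


async def cooperative_lookup() -> None:
    # Same wait, but it yields the thread back to the event loop.
    await asyncio.sleep(0.2)


async def worst_stall(handler) -> float:
    stop = asyncio.Event()
    monitor = asyncio.create_task(heartbeat(stop))
    await asyncio.sleep(0.05)  # let the heartbeat start ticking
    await handler()
    stop.set()
    return await monitor


def measure(handler) -> float:
    """Run one handler alongside the heartbeat; return the worst stall in seconds."""
    return asyncio.run(worst_stall(handler))


if __name__ == "__main__":
    print(f"blocking handler stalled the loop for {measure(blocking_lookup):.3f}s")
    print(f"cooperative handler stalled it for    {measure(cooperative_lookup):.3f}s")
```

On an unloaded machine one would expect the blocking handler to stall the heartbeat for roughly the full 200 ms, while the cooperative one leaves it ticking at close to its 10 ms interval. Every other call waiting on that thread suffers the same stall.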

A blithe indifference to the pernicious consequences of combining the folk traditions of the latest web development fashion cycle with the rather ironclad requirements of real-time multimedia communications is sufficiently widespread, made more so by the stereotypical categories into which the concept of “software development” has fallen in the popular imagination, to warrant identification of an anti-pattern.

A decade ago, the response to naive implementations was in itself an anti-pattern: throw more hardware at the problem. This begat a kind of Jevons paradox, a vicious cycle of rewarding bad software engineering with more resources for it to consume. Today, that tends to be addressed with “horizontal scaling”; if you just throw five more m4.xlarge instances at it, inefficiency is no big deal–until you glance at your cloud provider bill.

We–and they–will happily take your money. 🙂

Thanks to Ryan Delgrosso for his valuable feedback and suggestions on drafts of this article.