Dynamic Asterisk Scalability with Amazon EC2
by Nir Simionovich of Greenfield Technologies Ltd at AMOOCON 2009
Abstract
Amazon EC2 has been slowly gaining ground with developers worldwide for building web applications and Web 2.0 mashups.
Asterisk, a somewhat resource-hungry application, is considered a poor fit for Amazon EC2. This talk will discuss the various issues related to creating dynamically extending platforms using Asterisk, Amazon EC2, Amazon S3 and some web mashups.
Language: English
Transcript
Nir Simionovich: My name is Nir. I’m located in Israel. I have a company called Greenfield Tech. We deal mainly with Asterisk development. I’m not talking about custom development as in actual C code, but more platform development for customers such as telcos, service providers and so on. Over the past six to nine months we’ve been heavily involved in developing highly scalable systems with Asterisk and Amazon EC2, which is “Elastic Compute Cloud.” It’s a totally different methodology and way of thinking about IT problems, and about Asterisk in particular.
How many are familiar with Amazon EC2? One person, two people. Two people are familiar with Amazon EC2. Just a bit.
Audience Member: [off-mic question]
Nir: Actually Amazon EC2 is based on XEN. It’s based on XEN. But there is a difference. XEN is purely a virtualization, or paravirtualization, technology with which you can build your own virtualization environment. Amazon EC2 is similar in that you do get a virtual host. True, you do get a virtual server. But there are issues, beyond the fact that it’s virtualized, that prevent it from being the ideal platform for Voice over IP. We’re going to talk about those and how you solve them.
I see you’re bearing the Vicidial logo. Actually, I’m going to talk about a dialer that is capable of doing 680 concurrent dials per second. That’s what we’re going to talk about. This is a case study we’re going to show.
This is the unofficial geeky title of the presentation. When I say unlimited capacity I mean unlimited in terms of how much you can get from EC2 and not how much you can get from your carrier. If your carrier can only give you 120 channels, I’m sorry, that is what you’re going to get. So, bear that in mind.
Just a bit about myself. I’m founder and owner of Greenfield Technologies in Israel. I founded the Israeli Asterisk user group sometime around 2002, late 2002. I’m the author of two Asterisk books that were published by Packt publishing. There will be another one hopefully by next year. I’m working on a fourth one too, which is completely different.
When most system administrators look at Asterisk, they see one thing. In the spirit of Swine Flu, this is what they see. Because Asterisk is a resource hog. Whatever we do, it consumes memory, it consumes file handles, and it consumes resources.
If you take the simplest Asterisk machine and you run 60 calls on it, you’ll see it will reach a load average of about 0.5, 0.6 more or less. Even if you do a netstat on a system that’s running 60 calls, you’ll see the netstat output goes on and on and on. Especially if it’s a dialer.
Immediately, if you’re thinking that Asterisk is a resource hog, and we believe virtual server containers such as XEN or VMware actually provide a lower-capacity host for us, we reach a very simple conclusion: Asterisk in a virtualized computing environment is low performance, if it works at all.
That conclusion is wrong. Actually Asterisk can provide much more capacity in a virtualized environment than on a normal host. You get the aggregated capacity of the host. We’ll talk about that shortly.
A little primer on EC2. Amazon EC2 is part of the Amazon Web Services family of services. It includes EC2, which is “Elastic Compute Cloud” – the idea that you can get instant computing power on demand, billed according to the time that you are using it. S3, which is a storage environment. They now have their own [inaudible 4:46] CloudFront, which is a CDN. It’s very similar to Akamai, if you’re familiar with it from the world of the web.
Sorry.
Audience Member: [off-mic question]
Nir: It’s not caching. No, it’s not caching. Caching is something different. You’re welcome to visit the aws.amazon.com website to learn more about EC2. You have to remember, for the sake of this presentation and this subject, to regard EC2 as a XEN-enabled container. That means you’re dealing with a very large XEN. If you’re familiar with XEN, then the XEN Dom0 is like this huge box, which is completely virtualized. You have no idea where it is. The end result is that you can run as many hosts as you want on it.
The problem with EC2 is the fact that it utilizes multiple data centers across the world. If you go to the EC2 website and you try to initiate a host, it will tell you, “OK, where would you like it to be located? Would you like it to be US-1, 2, or 3?”
Actually US-1 is not a single data center. There are four. They’re located in different areas of the same state. Sometimes the same city, but not entirely in the same place.
When an instance, the host, is launched inside Amazon, we have no idea where it’s actually going to be located. Even if the IP addresses that we get are consecutive, it is not necessarily true that these two virtual instances are located on the same physical box at Amazon.
There’s a very big chance that one is located in San Diego and the other one is located in New York. If you initiate a third one, it may be Chicago for example.
We can’t rely on the fact that these machines are anywhere close to one another. This is something we need to take into account. Why? Because as Asterisk developers we tend to always rely on the fact that the database is right next to us.
For example, we’ll take Vicidial, and tell me if I’m wrong. I’ve tried installing it a few times. I never succeeded.
Audience Member: [off-mic question]
Nir: Yeah, I downloaded the CD. That one works. I tried installing from scratch. Never made it.
Audience Member: [off-mic question]
Nir: I’ll try that too. I have to admit that Ubuntu is not my favorite. But, I’ll try that. Vicidial, for example, has a lot of dependency between the actual application and the database. Right? It requires the database to be right next to it, because it uses it extensively all the time. As far as I can recall, Vicidial has what’s called a distributed installation. You can have four or five different dialers running off of a single database.
Audience Member: [off-mic question]
Nir: The end thing is, at the end of the day your capacity is now limited by the MySQL. That’s one. You’re relying on the fact that the open wizard is located right next to the MySQL server because you need the fast connection.
Audience Member: [off-mic question]
Nir: I’m not familiar with that installation, sir. I’ll be happy to learn.
Audience Member: [off-mic question]
Nir: Yeah. I’ll be happy to learn. In any case, let’s look at most Asterisk applications at providers. Vicidial apparently isn’t such a good example because it does have some form of distributed environment. Most providers look more or less like this. We’ll have some form of connection to providers – IP providers, PSTN providers, whatever. We’ll have a bunch of Asterisk servers running in the front. We’ll have a back-end of a database and a back-end of storage. This more or less encompasses about 95 percent of the Asterisk installations at carriers in the world.
This is what it looks like. Here and there you’ll find a session border controller, maybe OpenSER, maybe a RADIUS server. It doesn’t really matter. The overall design is virtually the same.
The problem with this design is the system has a distinct bottleneck with the database. The minute the database is loaded completely, we cannot grow anymore. It’s no longer a telecom problem; it’s a data warehousing problem.
Problem number two. Storage resources are consumed across the entire network, both for Asterisk and for our database. Again, that can be solved with storage systems, SANs, NAS, whatever you want, but again it is a problem.
A direct connection from the Asterisk application to the database server is usually required. Take most of the people using, let’s say, an application like A2Billing, which is a really horrid example of an application. They need a direct connection to the database all the time.
The N-tier approach here simply doesn’t cut it. Why doesn’t it cut it? Because in a cloud environment, there is no direct connection between the Asterisk servers and the database server. We have no idea where they’re located. We can’t rely on Amazon to provide us a MySQL server that’s located in San Diego and another virtual host located god knows where, connect these together, and hope that maybe it will work. Maybe the TTL, maybe the ping time from one server to the other, is reliable enough, and then it will work. We can’t rely on that.
There is no direct connection again from Asterisk to the storage servers and the database servers. Again the same problem.
No commitment to the geographic location of each server. This is something that even Amazon will tell you immediately when you initiate a host. You can select the data center, but there is no real commitment to that.
So, how the hell are we going to do this? How is it possible to virtualize a platform into Amazon EC2 and still be able to do it? So, we need a new approach. It is fairly clear that we need a completely new methodology of distributing our load, of building these applications. We can’t rely on proximity of anything in the system. We can’t rely on… We can’t rely basically on anything. We can’t even rely on the high powered servers. That means the entire application needs to be rewritten at the Asterisk side to be very very lightweight. It needs to be different.
The idea is to use something that I call decoupled N-tier architecture. Why do I call it decoupled? Because what I’ve actually done is that I’ve taken anything that has to do with Asterisk and I’ve completely separated it from the actual platform. The idea here is fairly similar to what we’ve seen before, but with a twist. We still have the database cluster. Yes. We still have the storage server. Yes. But, instead of having Asterisk connect directly into these, we have something called the web-based application logic. Please note this is not FastAGI, absolutely not FastAGI. Absolutely not.
And we have something called a storage manager and distributor. Now, the storage manager and distributor is not only the distributor of the storage and information, but it is also a control point. It serves as our interface back into the EC2 cloud to be able to manage how many hosts I’m initiating at any given time. Think of it as a very sophisticated control center, which is fully automatic. It is capable of knowing at any given time how many calls it’s supposed to send outbound, how many calls are currently active, how many hosts are active, what load is on each machine. It is fully capable of doing that.
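To make the control-point idea concrete, here is a minimal sketch of the kind of decision such a component might make. It is not the system described in the talk: the names `InstanceStatus`, `instances_needed` and `scaling_decision` are hypothetical, and the calls-per-instance figure is simply a parameter you would measure for your own instances.

```python
from dataclasses import dataclass


@dataclass
class InstanceStatus:
    """Snapshot of one Asterisk instance as reported to the control point."""
    instance_id: str
    active_calls: int


def instances_needed(target_calls: int, calls_per_instance: int) -> int:
    """Smallest number of instances able to carry target_calls."""
    if calls_per_instance <= 0:
        raise ValueError("calls_per_instance must be positive")
    # Integer ceiling division, so no floating point is involved.
    return (target_calls + calls_per_instance - 1) // calls_per_instance


def scaling_decision(fleet, target_calls, calls_per_instance):
    """Return (delta, idle_ids).

    delta > 0 means launch that many instances; delta < 0 means that many
    could be stopped. idle_ids lists instances with no active calls, which
    are the safe candidates to stop.
    """
    wanted = instances_needed(target_calls, calls_per_instance)
    delta = wanted - len(fleet)
    idle_ids = [s.instance_id for s in fleet if s.active_calls == 0]
    return delta, idle_ids


if __name__ == "__main__":
    fleet = [
        InstanceStatus("ast-1", 95),
        InstanceStatus("ast-2", 0),
        InstanceStatus("ast-3", 40),
    ]
    # Hypothetical campaign target of 450 concurrent calls.
    print(scaling_decision(fleet, target_calls=450, calls_per_instance=100))
```

A real control point would act on the result through the EC2 management interface and would also look at per-machine load, which this sketch leaves out.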
The communication from the actual Asterisk servers back into the web-based application logic is based on XML-RPC. There is actually an endless-loop state engine implemented within the Asterisk dial plan running on each of the cloud instances, which then communicates back into these application servers. So, what guidelines do we have here?
Each Asterisk server holds its own application logic. That means that the actual servers are completely independent. They do not rely on each other. They don’t rely on proximity. They do not rely on anything. Each one is a single entity. It can be activated. It can be deactivated. The cluster doesn’t even care.
Retrieval of information from the database servers, and of course registration of information back into the database servers is always performed through XML-RPC. Always. Now, you can say well I prefer to use some other REST method. Use whatever you want. As long as you use some form of web service to make sure that your request happens correctly and it is not a direct connection to the database, use whatever you want. I’m using XML-RPC because it’s the simplest method to actually define a method and variables to a remote procedure call.
Storage and audio content is managed via storage manager and distributor. The idea is that the distributor system is actually uploading content back into the Asterisk servers as required. So, each one contains the information and the audio files or whatever I need in there, if it’s video files for 3G in the future, or 4G or whatever needs to be in there.
A web-based application logic implements the XML-RPC server side of the platform. So now, your entire application development is no longer the Asterisk side. It is fully on the web application logic.
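The guidelines above can be sketched with Python’s standard `xmlrpc` modules. This is an illustration under stated assumptions, not the code used in the project: the method name `next_step`, the `FLOW` table and the state names are all hypothetical, and the client loop only stands in for the state engine that the talk places in the Asterisk dial plan.

```python
import threading
from xmlrpc.client import ServerProxy
from xmlrpc.server import SimpleXMLRPCServer

# Hypothetical call flow: state -> (action, argument, next state).
FLOW = {
    "start": ("playback", "welcome", "menu"),
    "menu": ("read_digit", "main-menu", "goodbye"),
    "goodbye": ("playback", "thank-you", "done"),
}


def next_step(call_id, state, last_input):
    """Server side: decide what the Asterisk instance should do next."""
    if state not in FLOW:
        return {"action": "hangup", "arg": "", "next_state": "done"}
    action, arg, next_state = FLOW[state]
    return {"action": action, "arg": arg, "next_state": next_state}


def start_server():
    """Start the XML-RPC server on a free local port; return (server, url)."""
    server = SimpleXMLRPCServer(("127.0.0.1", 0), logRequests=False)
    server.register_function(next_step, "next_step")
    threading.Thread(target=server.serve_forever, daemon=True).start()
    host, port = server.server_address
    return server, f"http://{host}:{port}/"


def run_call(url, call_id):
    """Client side: the loop one instance would run for a single call."""
    proxy = ServerProxy(url)
    state, last_input, performed = "start", "", []
    while state != "done":
        step = proxy.next_step(call_id, state, last_input)
        # A real instance would execute the action here (play a file,
        # collect DTMF); this sketch only records it.
        performed.append((step["action"], step["arg"]))
        last_input = "1" if step["action"] == "read_digit" else ""
        state = step["next_state"]
    return performed


if __name__ == "__main__":
    server, url = start_server()
    try:
        print(run_call(url, "call-0001"))
    finally:
        server.shutdown()
        server.server_close()
```

The point of the design is that the instance never touches the database: all it knows is a URL and a method name, so it can be started or stopped without the rest of the cluster caring.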
What does the Asterisk web logic look like? Essentially it’s just a server. There’s nothing special there. You can use whatever HTTP server you want. Be it Apache, [inaudible 16:54]. You can write your own in Java, .NET or whatever. You can do whatever you want in there. The application logic can be in any of the normal scripting or compiled languages you want. There’s nothing special there.
This is a very important component – Memcached. Memcached is a memory caching environment or memory caching daemon that is capable of sitting between you and the actual SQL handler.
The idea is that if you cache your requests, the responses your information system sends back to the Asterisk servers are a lot faster. You will get a lot more capacity. Proper use of Memcached in your decoupled architecture can result in a 40 percent improvement in response time.
In this presentation, the word AMI does not refer to the Asterisk Manager Interface. AMI happens to be the same initials Amazon uses for its machine images. It’s actually Amazon Machine Image.
Over the past six to nine months a lot of work has been done on EC2 to create proper AMIs. AMI template images for Asterisk.
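The caching idea can be illustrated with a cache-aside lookup. To keep the sketch runnable without a Memcached daemon, `InMemoryCache` is a small stand-in written for this example that only mimics get/set with an expiry; it is not a real Memcached client, and `CampaignStore` and `get_prompt` are likewise hypothetical names.

```python
import time


class InMemoryCache:
    """Stand-in for a Memcached client: get/set with an expiry in seconds."""

    def __init__(self):
        self._items = {}

    def get(self, key):
        entry = self._items.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if time.monotonic() >= expires_at:
            # Entry is stale: drop it and report a miss.
            del self._items[key]
            return None
        return value

    def set(self, key, value, ttl):
        self._items[key] = (value, time.monotonic() + ttl)


class CampaignStore:
    """Pretend database layer that counts how often it is queried."""

    def __init__(self):
        self.queries = 0

    def load_prompt(self, campaign_id):
        self.queries += 1
        return f"prompt-for-{campaign_id}"


def get_prompt(cache, store, campaign_id, ttl=60):
    """Cache-aside: try the cache first, fall back to the store on a miss."""
    key = f"prompt:{campaign_id}"
    cached = cache.get(key)
    if cached is not None:
        return cached
    value = store.load_prompt(campaign_id)
    cache.set(key, value, ttl)
    return value


if __name__ == "__main__":
    cache, store = InMemoryCache(), CampaignStore()
    for _ in range(3):
        get_prompt(cache, store, "elections")
    print(store.queries)  # the store is hit only on the first lookup
```

In the architecture described here the cache would sit inside the web application logic, so repeated XML-RPC requests from many Asterisk instances do not each turn into a database query.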
How many people aren’t familiar with XEN? One, two, got three. XEN as you may know is a paravirtualization environment. Actually, it doesn’t have a real clock in it. It doesn’t have a timing source. Asterisk relies a lot on this timing source, especially for something like IAX, which requires a timing source. If you need MeetMe you need that virtual timer inside Asterisk.
The problem with the normal AMI instances that Amazon provided, the Fedora-type instances, was that they didn’t have a proper clock. It wasn’t a static one. A lot of work has been done in order to create an instance, an AMI image, that has a static clock.
The introduction of the RPM repositories at Digium made installing new packages, and maintaining the Asterisk side of your application, really orderly. We are still using CentOS as the image.
The person who is actually in charge of the current Asterisk AMI image is Eric Chamberlain. This is the information of the actual AMI. Please note that it’s an i386-based AMI. It’s not 64-bit, it’s 32-bit. Not that you really should care, but it will come with everything on it.
That means that the minute you initiate it you will get on the box both Linux and Asterisk. I believe FreePBX is also installed on there. MySQL, everything is installed, fully working, and fully compatible. Everything has been tested.
Up until now, we’ve said OK, I’ve been giving the buzz word talk. That means I’m trying to get you on Amazon EC2, but I don’t have the Amazon logo written here. That means I’m not selling you something.
Can we really use EC2 to build something? There’s a company I work with in Israel. I believe, in German, this means something different, right? It’s a type of cake, right?
As a piece of information, in Israel the @ sign, Shift+2, is called the strudel because it reminds us of… if you cut a strudel, it has the same shape. I have no idea what that part is still.
During the Israeli elections held in February 2009, Strudel actually utilized EC2-based Asterisk dialers and IVR engines to perform a lot of the automatic dialing and IVR campaigns that were required by the Israeli parties.
The servers utilized were completely on Amazon. What happened was – this is a very interesting story – at peak times, we had over 3500 concurrent calls running on Amazon EC2.
If you’re talking about IVR systems, that is a staggering number. There are not that many systems around the world capable of handling 3500 concurrent calls.
Audience Member: [off-mic question]
Nir: Which latency? Which one?
Audience Member: [off-mic question]
Nir: No, I’m not going to [inaudible 22:29], never. We didn’t use it for Voice over IP phones. We connected it directly to carriers. Imagine this going over to carriers. Just for the sake of our specific project…
Audience Member: [off-mic question]
Nir: No, we connected through Voice over IP carriers. VoIP carriers. For the sake of that specific project, this was located at 60 Hudson Street. Who’s familiar with 60 Hudson in New York? Just one! A big no-no for me. 60 Hudson is the world’s biggest tele-house. That is like a co-location facility specifically for carriers. If you’re talking about Global Crossing, Level 3, XO Communications, Cable & Wireless, they’re all located there.
Think of it as the world’s biggest trunk, where you’re talking about gigabits upon gigabits upon gigabits of bandwidth.
Audience Member: [off-mic question]
Nir: How did we get that number? No, it’s a cluster. Yes, I can give you the exact number. We actually used 34 servers. Thirty-four instances. Each instance did about 100 calls. What we did is we were using what’s called the medium instance on EC2, which costs about 20 cents per hour. We said we’re building it as a building block of 100. A hundred to 120 and that’s it. We were very happy with it. We could have taken bigger instances and then, let’s say, gained 40 percent more capacity, but we would have ended up paying twice as much. Because in EC2, every time you step up, you just pay twice. It didn’t make any commercial or economical sense to do that.
The architecture was the same one we showed before. The actual Asterisk development inside the Asterisk server was performed by our company, while the application logic was developed using .NET technologies. On the .NET side, what they did is they actually built an MS-SQL cluster with four web servers in the front using NLBS. Everything was load balanced and we never had any problems with that.
This is the beauty. The entire election time took about 50 hours. Over that time, we required that infrastructure to be active. We consumed the entire platform as a whole. Thirty-four servers hosted at Amazon operating Voice over IP, running media traffic all the time. For fifty hours, we ended up paying no more than $900 for the infrastructure. I’m not talking about phone costs. I’m talking only about infrastructure.
If we had wanted to build that on our own, we would have paid $65,000. This is the beauty of it. This is the Holy Grail. How do we end up with unlimited capacity when we need it, as much as we need, immediately, and not pay much? Amazon EC2 provides that. This is a shameless plug to the previous presentation.
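As a rough check on the per-hour pricing model, the instance-hour part of that bill can be worked out from the figures quoted in the talk: 34 instances, about 20 cents per instance-hour, about 50 hours. The function name below is hypothetical, and the sketch covers instance-hours only. The quoted total of no more than $900 is higher than this product, and the talk does not break down the remainder, so other usage charges are deliberately not modeled.

```python
def instance_cost_cents(instances, hours, cents_per_instance_hour):
    """Instance-hours multiplied by the hourly rate, in whole cents."""
    return instances * hours * cents_per_instance_hour


if __name__ == "__main__":
    # Figures as quoted in the talk; amounts are kept in cents to avoid
    # floating-point rounding.
    cents = instance_cost_cents(instances=34, hours=50, cents_per_instance_hour=20)
    print(f"${cents / 100:.2f}")  # prints $340.00
```

Because the cost scales with instances times hours, shutting the fleet down at the end of the campaign is what keeps the bill small.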
That one had the shameless plug and this one has too.
Audience Member: [off-mic question]
Nir: No, no, no. Voice over IP traffic is different. I was talking only about infrastructure. We would have paid $65,000 just for servers and hosting. The technology that was used on the Asterisk servers themselves is called Atomic AGI in terms of development. The previous talk I gave today was about Atomic AGI. Please note that the web logic is implementing the web service and not FastAGI. There was no FastAGI involved in this project at all.
Atomic AGI can be implemented with simple scripting languages. Even bash. If you really want to, you can, I have no idea why you want to, but you can. For more information about Atomic AGI please refer to my previous presentation or you can buy my book – but that’s another shameless plug.
A few gotchas. All fine and dandy, we’ve said a lot of good things about Amazon EC2. We need to understand there are a few gotchas in here.
The main gotcha is the fact that with Amazon EC2, the entire network is behind a NAT environment. That means that our servers need to handle media.
If we were thinking of doing re-invites, forget it. If you think about doing some kind of media pass through, forget it. The firewalls over there and the security structures will simply not allow it. Whatever you do, you will have to handle media on your own. Bear that in mind.
All instances are located behind NAT. Asterisk is well known for its ability to play back and pass through media. However, transcoding on EC2 is fairly out of the question. Forget it. Don’t even think about transcoding.
I had a customer call me up saying, “I want to build a transcoding cluster over EC2 that is dynamic, so I can have more transcoding capacity as I go along.” I said, “OK, from what codec to what codec?” He goes, “G.729 to G.711 and back.” I said, “OK, which license are we going to use? Because we can’t put the license on every instance and duplicate it.” He goes, “Ahhh, what, we can’t? I thought of using just the license for 30 and duplicating it.” I said, “No, it doesn’t work like that.”
There are several unofficial G.729 and G.723 codecs for Asterisk out there. I’ve tried using those and I’ve tried using the Digium G.729. It works, but it consumes too many resources from the environment. I wouldn’t suggest doing that.
A couple more. The management of the instances being initiated on Amazon is far from being trivial. If you’re doing the management by hand there is a web interface and there is a Firefox plug-in. Again, it’s far from being trivial.
Cloud computing requires new management and economic structures to be utilized. Whatever you know or use for calculating your costs is completely different. Your costs are no longer per year; they’re per hour. Whatever you know has to be changed. It’s a completely new mathematical structure.
New development techniques need to be utilized. As I said in my previous presentation, most developers are lazy. That means they actually rely on the fact that the machine has 16 GB of RAM and eight cores running at 3 GHz. The end result is we get a platform which is really big. Really heavy. Sure it works, but if I want to move to something smaller, I can’t.
In Amazon EC2, you have to optimize – I always say optimize for Commodore 64 in order to execute like a mainframe. So think about that.
Cloud computing isn’t an all magical solution. If you think you can migrate your entire system to the cloud, forget it. It won’t happen. You have to plan it in a really meticulous way. It has to be well planned, well figured. Sometimes it will require you to change your programming. It will require you to change your development structure. Be sure you check that.
For more information about EC2 utilization with Asterisk, Asterisk in general, or Atomic AGI, you’re welcome to contact me either through my blog or my email. This is no joke.
I’m here, questions?
Audience Member: [off-mic question]
Nir: Two. That’s like five questions. What am I, Superman? OK, let’s do three questions and then we’ll go out for drinks. You know something, let’s go out to have drinks and I’ll answer the questions over there. That’s a better idea.
[applause]
Nir: Thanks.