Transloadit scales to 1500 machines for faster encoding
Encoding is hard work for servers. When we started this business, we knew it would be much easier for customers to send us jobs than it would be for us to process them. Given highly variable bursts in encoding jobs and a finite budget, queues are inherent to this business.
Anybody who tells you that is not the case is either not operating at scale or losing money.
Spikes & queues
Some people integrate with Transloadit's uploading capabilities so that their end-users can upload avatars to their websites. Obviously, it is not acceptable for those end-users to have to wait four hours for their avatar to be resized, just because we are optimizing someone else's library of 500 videos for display on the iPad.
For that reason, queues, and their impact on various use cases, need to be minimized:
To address the impact, we wrote algorithms that distinguish jobs that should feel real-time, such as the uploading and resizing of avatars, from jobs that are allowed to take a little longer, such as converting large batches of existing media.
Waiting an extra four hours is disastrous to the three-second use case, whereas waiting an extra three seconds is fine for the four-hour use case. Priorities matter.
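To make the idea concrete, here is a toy sketch of such a two-class queue in Python. It is not Transloadit's actual scheduler; the names (EncodingQueue, PRIORITY_REALTIME, PRIORITY_BATCH) are made up for illustration. The point is simply that real-time jobs always drain ahead of batch jobs, while jobs within the same class keep their arrival order.

```python
import heapq
import itertools

# Hypothetical priority classes; lower numbers are drained first.
PRIORITY_REALTIME = 0   # e.g. avatar upload + resize
PRIORITY_BATCH = 1      # e.g. bulk conversion of an existing library

class EncodingQueue:
    """Toy job queue: real-time jobs jump ahead of batch jobs,
    while jobs of equal priority keep FIFO order."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # tie-breaker preserves arrival order

    def push(self, job, priority):
        heapq.heappush(self._heap, (priority, next(self._counter), job))

    def pop(self):
        _, _, job = heapq.heappop(self._heap)
        return job

queue = EncodingQueue()
queue.push("optimize video 1/500 for iPad", PRIORITY_BATCH)
queue.push("resize avatar for end-user", PRIORITY_REALTIME)
print(queue.pop())  # -> "resize avatar for end-user"
```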
Of course, it would be even better if those four hours could be reduced to ten minutes.
To address queues themselves, we knew we needed scaling. As mentioned, encoding traffic can be highly irregular. In order to drain long queues swiftly, we would need a very large base capacity.
Unfortunately, we would need that capacity just to handle the peaks. The other 90% of the time, it would be rotting away in data centers, vaporizing its value. It would be extremely hard to turn that model into a profitable business.
Along came Amazon.
Amazon cloud
Amazon had a similar problem, with extremely large traffic spikes during Christmas sales. Obviously, they could not turn anybody down and had to invest in as many servers as were necessary to handle these spikes.
But for the rest of the year, this expensive equipment was not making them any money.
Around this time, Amazon's IT department was looking into ways to utilize virtualization in order to make their platform more maintainable. Just being able to reinstall most of the servers without having to go to the datacenter was already a big win. Jeff Bezos had issued a memo earlier stating that any service built inside Amazon should be designed in such a way that it would be easy for other companies to use it as well.
They subsequently opened up their virtualization administration tools to the world, and were able to rent out their overcapacity, allowing them to make some money on these idling servers outside of Christmas time.
And that is how the cloud was born. Or at least, this is my simplified take on it. If you are reading this thinking I'm full of it, let me know on Twitter and I will refine this glorious tale.
Transloadit ❤️ Amazon
As mentioned, we would not be able to afford the kind of capacity needed to handle spikes. We might have been able to convince an investor to cough it up, but again: it is extremely hard to make a profit when the machines are doing nothing most of the time.
When we came up with the idea for Transloadit, however, Amazon had just dropped the beta label on their cloud offering, AWS, and we were able to rent server capacity by the hour. And thanks to their API, we could write software to do this automatically as traffic increased.
As you see, we owe our existence to them.
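For a feel of what "write software to do this automatically" looks like, below is a minimal sketch of such a scaling loop using boto3, today's Python SDK for AWS (our early code predates it). The tuning knobs, the AMI ID, the instance type, the "role: encoder" tag, and the current_queue_depth() helper are all hypothetical, not our production values.

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Hypothetical tuning knobs -- not Transloadit's real numbers.
JOBS_PER_MACHINE = 50      # how much queued work one encoder can chew through
INSTANCE_CEILING = 500     # the per-region limit granted at the time

def current_queue_depth():
    """Stub: return the number of queued encoding jobs (e.g. from a DB or SQS)."""
    raise NotImplementedError

def running_encoder_count():
    """Count pending/running instances tagged as encoders (tag is hypothetical)."""
    pages = ec2.get_paginator("describe_instances").paginate(
        Filters=[
            {"Name": "tag:role", "Values": ["encoder"]},
            {"Name": "instance-state-name", "Values": ["pending", "running"]},
        ]
    )
    return sum(
        len(reservation["Instances"])
        for page in pages
        for reservation in page["Reservations"]
    )

def scale_up_if_needed():
    wanted = min(current_queue_depth() // JOBS_PER_MACHINE, INSTANCE_CEILING)
    missing = wanted - running_encoder_count()
    if missing > 0:
        ec2.run_instances(
            ImageId="ami-0123456789abcdef0",   # placeholder encoder AMI
            InstanceType="c5.2xlarge",         # placeholder instance type
            MinCount=1,
            MaxCount=missing,
            TagSpecifications=[{
                "ResourceType": "instance",
                "Tags": [{"Key": "role", "Value": "encoder"}],
            }],
        )
```

A real version would also scale idle machines back down and retry on API errors, but the core loop stays the same: measure the queue, compare it with running capacity, and ask EC2 for the difference.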
Glass ceilings
With the promise of Amazon's cloud, we thought the sky was the limit. "Unlimited capacity!" "Only when we actually need it!" And I remember how happy we were, running on our first beefy machine:
Now running on a sweet 8-core machine!
— Transloadit (@transloadit) July 5, 2010
But there were some glass ceilings. When we tried to launch six machines, we ran into errors. As it turned out:
When you create your AWS account, AWS sets limits for instances on a per-region basis.
Our limit was set to five machines, and a queue we had at the time took forever to drain.
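If you are curious what your own ceiling is, the limit can be queried per region. Here is a minimal sketch using boto3, a modern convenience that did not exist back then; note that newer AWS accounts express the same limit as vCPU-based quotas rather than an instance count:

```python
import boto3

# Per-region client: limits differ between regions.
ec2 = boto3.client("ec2", region_name="us-east-1")

# "max-instances" is the classic per-region account attribute.
resp = ec2.describe_account_attributes(AttributeNames=["max-instances"])
for attr in resp["AccountAttributes"]:
    for value in attr["AttributeValues"]:
        print(f'{attr["AttributeName"]} = {value["AttributeValue"]}')
```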
We have always had good encounters with Amazon; they even helped us get some exposure in the early days. We were quick to reach out, and they were quick to grant us a new "instance ceiling" of 20 machines.
We were thrilled, as it validated what we were trying to do, and we had just broken through a glass ceiling.
Since then, we have had to reach out to Amazon a few more times to increase our instance ceiling. The last time was November 5, 2013, when Amazon granted us a 500-machine limit in our primary datacenter.
Scaling patterns
Recently, we have been seeing traffic pick up again following this pattern:
The graph shows that machines are scaled up as encoding jobs are thrown into the queue, but once we hit 500 machines, the line flattens out and the 5 TB queue is processed more slowly than we would like.
Normally, we would shoot a quick email to Amazon, but it turns out that:
For a limit increase of this size, I will need to collaborate with our Service Team to get approval. This is to ensure that we can meet your needs while keeping existing infrastructure safe.
I know that Amazon is estimated to have around 450,000 (hardware) servers, so on their scale we are still small fish. Nevertheless, the fact that they had to do some extra resource planning for this request was exciting.
You might understand that today, I'm all the more excited that we just received an email saying:
I'm happy to inform you that we've approved and processed your EC2 Instances limit increase request for the EU (Ireland) and US East (Northern Virginia) regions. It can sometimes take up to 15 minutes for this to propagate and become available for use.
Thanks to this new instance ceiling, we can now scale up a fleet of 1000 machines in the US and 500 in the EU, meaning we will have more than doubled our capacity:
This isn't to say that we are already running on 1500 machines, but having that kind of headroom as of today is a breakthrough for us, and being able to utilize it will be a big help in swiftly draining the multi-terabyte encoding queues that we have been seeing more often lately.
For you as a customer, this means faster encoding times, and that we can deal with massive HD video imports as if they were just a little avatar!