Stability & performance boosts with enhanced scaling
We are excited to announce a few updates that we have been working on over the past few weeks. These changes will hopefully serve to further stabilize the platform and make it even faster.
Let's dive in!
Autoscaling uploaders is a bit harder than autoscaling encoders, because once a big wave of uploads
hits us, those uploads are bound to the machines that received them. This means that any scaling we
do in response has no effect on the uploads already in progress. Because of this, we have
occasionally seen high I/O demands when large waves of uploads came in, after which many encoding
drones would scale up and exchange input and output with only a few uploaders.
We have now replaced 2 c1.medium uploaders with 3 m1.xlarge ones. Right off the bat, this gives us 6
times the I/O throughput we had before, and that has worked really well to accommodate these upload
spikes without any problems. We are also considering other strategies to further reduce the I/O on
our uploader machines, such as immediately sending all incoming files to Amazon S3 and letting
drones download them from there instead. We are also looking into ways to distribute and autoscale
uploader responsibilities. And we are keeping a close eye on those SSD machines as well. We will
keep you posted on these!
We have rewritten our Autoscaler. This new and improved version is now capable of analyzing a queue and deciding how many machines it needs to launch (in parallel) in order to stay on top of it. It also scales much more aggressively now. The result is a drastic decrease in queue times. For example, during the past one and a half weeks, we did not experience an /image/resize queue time greater than 5 minutes or a /video/encode queue time greater than 18 minutes. And this instance of an 18-minute queue time only happened because two of our biggest customers did their video batch imports at the same time, which resulted in a 130 GB video queue. During this period, we jumped from 2 to 39 octa-core encoding machines in under 20 minutes.
But that isn't all! You are now also able to check the current queue times for image resizing and video encoding on our status page.
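To give a rough idea of what "analyzing a queue and deciding how many machines to launch" can look like, here is a minimal sketch in Python. It is an illustration only, not the actual Autoscaler code: the function name, its parameters, the per-machine throughput figure, and the machine cap are all assumptions. The idea is simply to size the fleet so that the current queue would drain within a target time, and to request all missing machines at once instead of one by one.

```python
import math


def machines_to_launch(queue_mb, running, mb_per_machine_per_min,
                       target_minutes, max_machines):
    """Return how many extra machines to start in parallel.

    All parameters are hypothetical:
    queue_mb               -- size of the pending queue in MB
    running                -- machines already working on the queue
    mb_per_machine_per_min -- assumed throughput of one machine
    target_minutes         -- how quickly we want the queue drained
    max_machines           -- safety cap on the total fleet size
    """
    if mb_per_machine_per_min <= 0 or target_minutes <= 0:
        raise ValueError("throughput and target time must be positive")
    if queue_mb <= 0:
        return 0

    # Machines required to drain the whole queue within the target time.
    needed = math.ceil(queue_mb / (mb_per_machine_per_min * target_minutes))

    # Never exceed the cap, and never return a negative number
    # (scaling down is a separate decision).
    needed = min(needed, max_machines)
    return max(0, needed - running)


# A 130 GB queue, 2 machines running, and an assumed 200 MB/min per machine.
print(machines_to_launch(130_000, 2, 200, 18, 100))  # 35
```

The throughput number above is made up; in practice it would have to be measured per job type, since image resizing and video encoding behave very differently.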
We have also rewritten the code that handles the Assembly list/search on the website, which results in a major increase in performance. It also means that timeouts should no longer result in empty pages, a problem that some of our customers had reported. Lastly, we now run heavy (but less critical) queries against read-only slaves, so that production is not affected by heavy searches, analyses or reports.
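The routing of heavy queries to read-only slaves can be pictured with a small sketch like the one below. This is an assumption-laden illustration, not the code running on the website: the `Query` and `QueryRouter` names, the `heavy` flag, and the simple `SELECT` check are all hypothetical. Heavy read-only queries are spread round-robin over the slaves, and everything else (including every write) goes to the primary database.

```python
import itertools
from dataclasses import dataclass


@dataclass(frozen=True)
class Query:
    sql: str
    heavy: bool = False  # hypothetical flag set by the caller for searches, reports, etc.


class QueryRouter:
    """Pick a database target for each query (illustrative only)."""

    def __init__(self, primary, replicas):
        self.primary = primary
        # Cycle through the read-only slaves; None means there are none configured.
        self._replicas = itertools.cycle(replicas) if replicas else None

    def target_for(self, query):
        # Naive read detection, good enough for a sketch.
        is_read = query.sql.lstrip().upper().startswith("SELECT")
        if query.heavy and is_read and self._replicas is not None:
            return next(self._replicas)
        # Writes and light reads stay on the primary.
        return self.primary


router = QueryRouter("primary", ["slave-1", "slave-2"])
print(router.target_for(Query("SELECT * FROM assemblies WHERE ...", heavy=True)))  # slave-1
print(router.target_for(Query("UPDATE assemblies SET ...")))  # primary
```

One trade-off to keep in mind with this kind of setup is replication lag: results read from a slave can be slightly behind the primary, which is usually acceptable for searches and reports.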
Please let us know what you think of all these changes. We are thrilled to present you with a product that is ever more stable and performant. Rest assured that we have a lot more nice improvements in store, including some new features! 😄