I'd like to announce a few updates we worked on over the past few weeks to further stabilize the platform and make it even faster:
Autoscaling uploaders is a bit harder than autoscaling encoders, because once a big chunk of uploads hits us, those uploads are bound to the machines that received them. If we scale up, that has no effect on the uploads already in progress.
Because of this, we occasionally saw high I/O demand when big waves of uploads came in: many encoding drones would scale up and exchange input & output with only a few uploaders.
We replaced 2 c1.medium uploaders with 3 m1.xlarge ones. This gives us 6x the I/O throughput we had before and has worked really well to accommodate these waves without problems. We are also considering other strategies to further reduce the I/O on our uploader machines, like immediately sending all incoming files to Amazon S3 and letting drones download them from there instead. We are also looking into how we could distribute & autoscale uploader responsibilities, and we are keeping a close eye on those SSD machines. We will keep you posted on this one.
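The S3 offload idea could look roughly like this. This is a minimal sketch, not our production code: the bucket name, key layout, and function names are made up for illustration, and it assumes the boto3 library with AWS credentials configured.

```python
def s3_key_for(assembly_id, filename):
    """Derive the S3 object key for an incoming upload (hypothetical layout)."""
    return f"uploads/{assembly_id}/{filename}"

def offload_to_s3(local_path, assembly_id, filename, bucket="incoming-uploads"):
    """Send an incoming file straight to S3 so encoding drones can fetch it
    from there, instead of pulling it from the uploader's local disk."""
    import boto3  # imported lazily; needs AWS credentials at runtime
    key = s3_key_for(assembly_id, filename)
    boto3.client("s3").upload_file(local_path, bucket, key)
    return f"s3://{bucket}/{key}"
```

With this scheme the uploader's disk is touched only once per file, and every drone downloads from S3 in parallel rather than competing for the uploader's I/O.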
We have rewritten our Autoscaler. It is now capable of analyzing a queue and deciding how many machines it needs to launch (in parallel) in order to stay on top of it. It also scales much more aggressively now. The result is a drastic decrease in queue times. For example, in the past one and a half weeks, we did not experience an /image/resize queue time greater than 5 minutes or a /video/encode queue time greater than 18 minutes. And the 18 minutes only happened because two of our biggest customers did video batch imports at the same time, resulting in a 130 GB video queue. During this period we jumped from 2 to 39 octa-core encoding machines in under 20 minutes.
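The core of that decision can be modeled as simple arithmetic: given the queue backlog and the throughput of one machine, how many machines must run so the backlog drains within a target time? This is a toy model with illustrative names and numbers; the real Autoscaler looks at more signals than this.

```python
import math

def machines_to_launch(queue_bytes, bytes_per_machine_per_min,
                       target_minutes, current_machines):
    """Return how many extra machines to launch (in parallel) so the
    current backlog drains within the target time."""
    # Throughput required to clear the backlog in time, in bytes/minute.
    required_rate = queue_bytes / target_minutes
    # Total machines needed to sustain that rate, rounded up.
    total_needed = math.ceil(required_rate / bytes_per_machine_per_min)
    # Only ever launch; scaling down is a separate decision.
    return max(total_needed - current_machines, 0)
```

For example, a backlog of 1000 units with machines that each process 10 units per minute and a 10-minute target needs 100 units/minute, i.e. 10 machines; with 2 already running, 8 more get launched at once instead of one at a time.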
Furthermore, you are now able to check the current queue times for image resizing and video encoding on our status page.
We have also rewritten the code that handles the Assembly list/search on the website. The result is a major increase in performance. Also, the timeouts resulting in empty pages, which we had received some reports about, should not happen again. On top of this, we now target heavy (but less critical) queries at read-only slaves, so that production is not affected by heavy searches, analyses, and reports.
Please let us know what you think of this. We are thrilled to present to you a product that is ever more stable and performant, and we sure have a lot more nice improvements to come, including some new features.