Choosing launch time: overcoming challenges at Transloadit
A few weeks ago, the team gathered in one of those tiny Skype conference windows to discuss an important question: should we launch Transloadit in its current version?
Ordered by importance, the problems we faced at this point were as follows:
- We were stuck on node v0.1.28
- We had very few unit tests
- The system had some weak spots in the design
However, at the same time we knew the system was capable of amazing things. Some of our alpha testers are using it to upload large videos and produce up to 32 result files from each of them (watermarks and differently sized thumbnails), which are then uploaded to S3. That is more than 100 internal jobs that we spawn and manage in harmony, all thanks to the awesomeness of Node.
With that in mind, we decided to go ahead and launch at the upcoming JSConf. Up until last week we were working furiously to deliver on that goal. We integrated credit card payments into the website, came up with a great plan and pricing model, and put tons of hours into squeezing the last bugs out of the system.
And then... bytes hit the fan.
We had noticed our Node server dying every week or two, but we attributed that to something we had screwed up ourselves, and we certainly didn't think it would become a huge issue. Well, eventually it did. While testing some of the more intense Assemblies mentioned above, we noticed the server dying more frequently. It turns out that the old Node version we are on has some serious issues in its networking code. When such an issue occurs, the last thing you see is:
(evcom) recv() Success
And then the server dies. Without segfaults or details.
Ouch! One year of hard after-hours work suddenly went up in flames. Now, we could try to put some massive effort into tracking down this bug, maybe even fixing it. The fact is, however, that we would then be fixing the wrong problem.
The real problem is the fact that we are on an old Node version. And the reason for that is that we didn't go by the words of Uncle Bob:
"Professionalism - did the doctor wash their hands, did you write your tests?"
If we had good test coverage for our code, upgrading Node would not be a big problem. Refactoring away some of the design issues within the application would not be a big problem either.
So now we are doing what we should have done from the beginning: rewriting our old version into a new, fully tested one, built against the upcoming node 0.2.0. This will take some time, but fortunately we can reuse a lot of what we learned through hard trial and error on version 1.
We would like to thank everybody who has tested version 1, and we are looking forward to keeping you updated on version 2.
P.S. It turns out I couldn't make it to JSConf after all due to the volcano. We'll take that as a sign. 😄