A few weeks ago, the team gathered into one of those tiny Skype conference windows to discuss an important question: Should we launch the current Transloadit version?
The problems we faced at that point were, in order of importance:
- We were stuck on node v0.1.28
- We had very few unit tests
- The system had some weak spots in the design
At the same time, however, we knew the system was capable of amazing things. Some of our alpha testers are using it to upload large videos and produce up to 32 result files from them (watermarks and different thumb sizes), which are then uploaded to S3. That's 100+ internal jobs we're spawning and managing in harmony, all thanks to the awesomeness of node.
So we decided to go ahead and launch at the upcoming JSConf. Up until last week we worked furiously to deliver on that goal. We integrated credit card payments into the website, came up with a great plan & pricing model, and put tons of hours into squeezing the last bugs from the system.
And then the bytes hit the fan.
We had noticed our node server dying every 1-2 weeks or so, but we attributed that to something we had screwed up and didn't think it would become a huge issue. Well, eventually it did. While testing some of the more intense Assemblies mentioned before, we noticed the server dying more frequently. It turns out the old version we are on has some bad issues in the networking code, and the last thing one sees is:
(evcom) recv() Success
And then the server dies. No segfault, no details.
Ouch. One year of hard after-hours work going up in flames. Now, we could try to put some massive effort into tracking this bug down, maybe even fixing it. But the fact is that this would be fixing the wrong problem.
The real problem is that we are on an old node version. And the reason for that is that we didn't go by the words of Uncle Bob:
"Professionalism - did the doctor wash their hands, did you write your tests?"
If we had good test coverage for our code, upgrading node would not be a big problem. Refactoring some of the design issues within the application would not be a big problem either.
So we're doing what we should have done from the beginning: rewriting our old version into a new, fully tested one built against the upcoming node 0.2.0. This will take some time, but we can reuse a lot of what we learned through hard trial and error on version 1.
We thank everybody who has tested version 1, and we're looking forward to keeping you updated on version 2.
PS: It turns out I couldn't make it to JSConf after all due to the volcano. We'll take that as a sign.