July 30, 2019

Enhanced security: fixing ImageMagick vulnerability

Kevin van Zonneveld

Co-founder · Amsterdam, The Netherlands

On Thursday, June 20, 2019, our support captain raised a red alert when a vulnerability was reported that allowed a Transloadit server to be rooted. Root is the highest level of privilege possible on a machine, and this is a security engineer’s and technical startup founder’s worst nightmare.

In this blog post, we will disclose how the hack happened, what impact it had, and what we have done and are still planning to do to prevent this from happening again.

Background

Transloadit uses many different encoding tools to manipulate and convert media. For images, we use ImageMagick and we have been sponsoring this project for many years. To this day, we owe a great deal to ImageMagick.

Transloadit customers build their businesses on top of ours and expect this foundation to be solid. This makes it difficult for us to change or upgrade software, as doing so causes behaviors to change in subtle and not-so-subtle ways. Hence so far, we have been launching new stack versions, while making their use optional. Customers can test the new stacks at their own convenience, while the old versions continue to provide the service they expect and rely on.

The oldest version, known internally as imagemagick_stack v1.0.0, has been supported by us for ten years. We have been meaning to deprecate it, but this needs to be done gently. If we pull an essential brick from our customers’ foundation, their businesses could potentially fall over. As with any software, bugs and vulnerabilities are found that need to be patched. ImageMagick is no exception. Over the years, a few severe vulnerabilities have been uncovered in imagemagick_stack v1.0.0. One of these made it possible to execute system commands that were hidden in specially-crafted SVG images.

This puts Transloadit between a rock and a hard place. On the one hand, we need to provide this robust foundation that never changes, while on the other hand, we need to upgrade vulnerable software to prevent putting our customers in harm’s way. We have approached this conundrum by taking a third option, where we deprecate vulnerable software in a graceful way. We help our customers to slowly move away from the old, while buying time to do so by containing the flawed software.

Our containment consists of:

Limiting what our encoding machines have access to. In this case, that is only hashed temporary files, S3, and the queue they take jobs from.
Scanning SVG (and similar) images for system commands and rejecting any that contain them before ImageMagick operates on them.
Running our processes as a non-privileged user that does not have access to any secrets, other than what has been injected into its process memory by the root user.

What happened

On June 20, Jeremy Matos, Senior Security Engineer at GitLab, reported that a hacker had acquired root access to one of our servers. GitLab had been running a HackerOne campaign in which they invite hackers to expose vulnerabilities and offer rewards for any successful hack attempt that is responsibly disclosed to them. GitLab acquired Gitter.im in 2017, and Gitter (think Slack, but for open source projects) had been using our services since 2014. When you upload a picture along with your chat messages in Gitter, the uploading and resizing process could be handled by Transloadit’s platform.

Security researcher Sergey Kashatov entered this programme and tried to compromise GitLab’s Gitter.im by uploading an image with a malicious payload. He thought he had found a GitLab vulnerability, but unknown to him at that point, the image ended up being processed on our servers, thereby compromising Transloadit instead. Jeremy quickly picked up on this and relayed the conversation with Sergey until we were able to approach Sergey directly and learn how he had gained root access.

How it happened: the “root cause” analysis

Richard I. Cook explains in How Complex Systems Fail that

“Post-accident attribution [of the] accident to a ‘root cause’ is fundamentally wrong. Because overt failure requires multiple faults, there is no isolated ‘cause’ of an accident. There are multiple contributors to accidents.”

While perhaps there are gradations and sometimes it is sufficient to point out the primary contributing anomaly, in this case, we feel it is helpful to outline more than one.

As indicated earlier, we were aware of vulnerabilities with ImageMagick stack v1.0.0 and contained potential harm by:

Running our processes as a non-privileged user

This failsafe gave way when we introduced a new process runner earlier this year. In an attempt to mitigate slow meta reading, we wanted to better utilize the multi-core capabilities of our encoding machines and parallelize this work by starting several instances of the same process. In order to do this, we adopted an existing orchestrator that had already seen a great deal of usage in production.

However, when migrating to the new supervisor, we failed to make sure that it was still starting its processes under the user that has no privileges. Instead, it was starting the meta scanning processes as the ubuntu user, which has the privilege to access more files and even to become root. This went unnoticed for two reasons: A) we did not have monitoring in production that made sure our processes are still running as the limited user and B) in development and testing, we run everything under the same user for convenience, so this situation would look like business as usual to most developers.
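The missing production check described above could look roughly like the following. This is a minimal sketch and not our actual monitoring: the process name `meta-scan` and the user name `encoder` are hypothetical placeholders, and it assumes a Linux-style `ps` that accepts `-eo pid=,user=,args=`.

```python
import subprocess
from typing import NamedTuple


class Proc(NamedTuple):
    pid: int
    user: str
    command: str


def parse_ps(output: str) -> list[Proc]:
    """Parse the output of `ps -eo pid=,user=,args=` into Proc records."""
    procs = []
    for line in output.splitlines():
        parts = line.split(None, 2)
        if len(parts) < 3:
            continue  # skip blank or malformed lines
        pid, user, command = parts
        procs.append(Proc(int(pid), user, command))
    return procs


def find_violations(procs, watched_substring, expected_user):
    """Return watched processes that are NOT running as expected_user."""
    return [
        p for p in procs
        if watched_substring in p.command and p.user != expected_user
    ]


def check_live(watched_substring, expected_user):
    """Inspect the running system; a non-empty result should page someone."""
    out = subprocess.run(
        ["ps", "-eo", "pid=,user=,args="],
        capture_output=True, text=True, check=True,
    ).stdout
    return find_violations(parse_ps(out), watched_substring, expected_user)
```

Calling `check_live("meta-scan", "encoder")` on a schedule and alerting on any returned process would catch a supervisor that silently starts workers under a more privileged account. Running the same check in development would also surface the discrepancy before it reaches production.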

Scanning SVG (and similar) images

Our filtering, which rejects malicious payloads, scanned the uploaded file for attack vectors but failed to do so recursively for includes. SVG files can refer to other files with the intention of inlining them. SVG files are much like HTML files in that sense: a grab bag of XML tags outlining how things should be rendered, potentially even including other images as part of it. While other hackers chose to deliver malicious payloads directly, Sergey decided to upload a valid SVG that referred to another file containing the malicious system commands. He masked this include by giving it a .jpeg extension, further tricking our system into believing it was safe to pass on to ImageMagick.
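To illustrate the gap, here is a minimal sketch of how the references inside an SVG can be listed so that each one can be rejected or scanned in turn. It is an illustration under assumptions rather than our actual filter: the function names are hypothetical, it only looks at `href` and `xlink:href` attributes, and a production filter should use a hardened XML parser, since the standard library parser is not designed for hostile input.

```python
import xml.etree.ElementTree as ET

XLINK_HREF = "{http://www.w3.org/1999/xlink}href"
JPEG_MAGIC = b"\xff\xd8\xff"  # every JPEG file starts with these bytes


def external_references(svg_bytes: bytes) -> list[str]:
    """Return every href/xlink:href that points outside the document."""
    root = ET.fromstring(svg_bytes)
    refs = []
    for element in root.iter():
        for attr in ("href", XLINK_HREF):
            target = element.get(attr)
            if target is None:
                continue
            target = target.strip()
            # Same-document fragments (#id) stay inside the file itself.
            if target.startswith("#"):
                continue
            refs.append(target)
    return refs


def looks_like_jpeg(data: bytes) -> bool:
    """Check the content signature instead of trusting a .jpeg extension."""
    return data.startswith(JPEG_MAGIC)
```

A filter built on this would either refuse any SVG for which `external_references` returns something, or fetch each target and run the same payload scan on it, using a content check such as `looks_like_jpeg` rather than the file extension to decide what the include really is.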

Impact

Sergey responsibly disclosed this issue and did not steal or delete any data. It is, of course, possible that another hacker exploited this vulnerability before Sergey found out about it, and did steal or delete data. Our investigation reveals zero evidence in that direction, but it cannot be ruled out completely.

If a malicious hacker did use this exploit and successfully obtained root access to an encoding machine, they would have been able to:

Inspect temporary files. These files consist of either the input or output of a Robot and exist briefly on the encoding machine. Temporary files are named by UUIDv4 without dashes, however, and cannot be traced back to the original user or customer.
Access Transloadit’s S3 buckets
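The naming scheme for temporary files mentioned in the list above can be illustrated in a few lines. This is only a sketch of the idea, with a hypothetical function name: the file name is random and carries no information about the user or customer.

```python
import uuid


def temp_file_name() -> str:
    """Return a random name: a UUIDv4 rendered as 32 hex characters, no dashes."""
    return uuid.uuid4().hex
```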

A small positive during this startling discovery is that a few years earlier, we had already limited what our encoding machines have access to. So, even with root access, machines would not be able to get access to AWS resources (besides S3), our database, or our customers’ secrets.

Correcting measures

Naturally, we immediately:

Improved our filtering to account for externally referenced malicious payloads, and we no longer allow this rare type of image on a vulnerable stack. The Assembly now ends in an error and shows a warning asking you to upgrade to stack v2.0.3.
Removed the new supervisor process and restored running our process as a non-privileged user, adding monitoring to production so that we get paged should this ever change again.
Created a C library that prevents ImageMagick from making any network request on its own.
Rotated all of our keys (e.g., to our own S3 buckets) and limited IAM access for the encoding machines even further to now only have write access to the few buckets that it needs (e.g., tmp.transloadit.com). We also made key rotation much easier, so we can do this in a heartbeat as a general precaution.
Compensated Sergey for his work, given that GitLab obviously has the policy that they can’t hand out rewards for vulnerabilities uncovered on systems that they themselves do not manage.
Upgraded all of our machines to Ubuntu’s latest LTS: bionic, which ensures four more years of patches and offers additional tools to further isolate our encoding processes.
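As an illustration of the narrowed IAM access from the list above, a write-only policy for a single bucket could be generated like this. It is a sketch under assumptions, not our actual policy: the helper name is hypothetical, and `s3:PutObject` is assumed to be the only action the encoding machines need.

```python
import json


def write_only_policy(bucket: str) -> dict:
    """Build an IAM policy document that only allows writing objects to one bucket."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["s3:PutObject"],  # assumption: no read, list, or delete
                "Resource": f"arn:aws:s3:::{bucket}/*",
            }
        ],
    }


print(json.dumps(write_only_policy("tmp.transloadit.com"), indent=2))
```

With a policy of this shape, credentials taken from a compromised machine could add objects to the bucket but could not read or delete what is already there.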

This should solve the immediate problems. We have asked Sergey to confirm that Transloadit is no longer vulnerable and he acknowledges this. This does not mean that we are done, though. Still on our list are:

Using the non-privileged user in development and testing as well. This may slow development down a bit, but having this discrepancy between production and development allowed this vulnerability to go unnoticed.
Utilizing our new OS tools to further isolate our encoding processes and only give them access to the temporary file they are working on.
Starting the deprecation procedure to remove imagemagick_stack v1.0.0. We will be mapping who is still using it and showing warnings that remind them to test a newer stack. We will also email customers and offer any assistance to help them move smoothly to more modern stacks. When the last customer has stopped using it, we will remove v1.0.0 completely.
Implementing an allowlist of environment variables for our encoding tools. Even if our encoding machines only have write access to our S3 buckets, there really is no need for ImageMagick to know about it.
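The environment variable allowlist from the list above can be sketched as follows. This is a minimal illustration, not our implementation: the variable names in `ALLOWED_ENV` and the helper names are assumptions chosen for the example.

```python
import os
import subprocess
import sys

# Hypothetical allowlist: only what an encoding tool plausibly needs.
ALLOWED_ENV = ("PATH", "LANG", "TMPDIR")


def filtered_env(source=None, allowed=ALLOWED_ENV):
    """Copy only allowlisted variables; everything else is dropped."""
    source = os.environ if source is None else source
    return {key: source[key] for key in allowed if key in source}


def run_tool(argv, source_env=None):
    """Run an external tool so that it sees only the filtered environment."""
    return subprocess.run(
        argv,
        env=filtered_env(source_env),
        capture_output=True,
        text=True,
        check=True,
    )
```

Because `subprocess.run` replaces the child's environment with the `env` mapping instead of merging it, a variable such as `AWS_SECRET_ACCESS_KEY` never reaches the tool unless it is explicitly added to the allowlist.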

Recommendations

We don’t believe any password rotation is necessary on your part, since there was no access to passwords or similar. We have no indication that this attack was successfully performed before Sergey. While we consider unauthorized access to temporary encoding files to be the worst possible scenario, it's not all doom and gloom, fortunately. These temporary files are anonymized, often not complete (as they are in the process of being downloaded), and removed as soon as the encoding has taken place and the upload is complete.

Still, we cannot rule out the possibility that a hacker has obtained temporary files and, of course, the media itself could contain identifying information (for instance, a photo of someone with a name tag or a street sign exposing their location, and if passed to Google or Facebook, those corporations may identify who is portrayed). For this reason, we recommend you disclose this to your customers, as we are disclosing it to you. It is also a good moment to remind you of our Best Practices, specifically Don't give us the keys to your everything.

Thankfully, no credentials were leaked as a result of this vulnerability, but that is no reason to be complacent. While we work every day to obtain the highest grades of security, 100% security will always be a myth. Therefore, it is better to only give us write access to a single bucket or folder even, than to give us root access to everything — just so that we can write to that single bucket or folder.

And if you haven't already, it is also a good idea to upgrade your /image/resize Robot to "imagemagick_stack": "v2.0.3".
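As a rough illustration, the Assembly Instructions for such a Step could be built like this. The Step name `resized` and the dimensions are placeholders, and the other parameters are assumptions about a typical resize Step. Only the Robot name and the `imagemagick_stack` value come from this post.

```python
import json


def resize_step(width: int, height: int) -> dict:
    """Build an /image/resize Step that pins the newer ImageMagick stack."""
    return {
        "robot": "/image/resize",
        "use": ":original",  # assumption: resize the uploaded file directly
        "width": width,
        "height": height,
        "imagemagick_stack": "v2.0.3",
    }


steps = {"steps": {"resized": resize_step(320, 240)}}
print(json.dumps(steps, indent=2))
```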

Conclusion

We have been deeply concerned about this. Keeping our customers safe at all costs is vital to Transloadit's continued existence and our company means everything to us. Over the past weeks, we have worked around the clock to roll out upgrades so that something like this can never happen again. We are also working with Sergey and other vulnerability researchers like him to ensure this. More upgrades are still coming, but we already felt the need to disclose this security issue. Hopefully, this gives you as a customer enough time to handle this in a way that is appropriate for your business.

I am deeply sorry for the mistakes on our end that led to having this vulnerability in production. If you have questions or concerns after reading this, please don't hesitate to reach out — I am, of course, more than willing to provide any further clarification.

#post-mortem #imagemagick #security #image-resize-robot