"A picture is worth a thousand words" is an adage in many languages – and for good reason. Our visual cortex is arguably the most powerful part of the brain, and leveraging that to convey meaning is equally powerful. With the introduction of our /image/describe Robot, we allow programmers to unlock the visual cortex of AI models trained by prominent cloud vendors. Today, we are taking this a step further by not only recognizing objects in images, but also reading any words present.

Let's say you had a picture of a traffic sign or a menu. Almost any human would be able to make sense of these objects, but their meaning remained opaque to machines until the introduction of powerful new OCR models that can read text from what is, to a computer, just a collection of pixels.

Introducing the /image/ocr Robot, the newest member of our AI Robot family. It can use either AWS or GCP in the backend, and you can swap between the two with the provider parameter. This means that if one provider produces unfavorable results, you can switch to the other with no further configuration or pricing changes. Furthermore, as each provider's backend model advances, our service will automatically benefit from the improved recognition.

Extracting text from images

To demonstrate how simple it is to use this new Robot, we will walk through the Template below:

{
  "steps": {
    ":original": {
      "robot": "/upload/handle"
    },
    "image-ocr": {
      "use": ":original",
      "robot": "/image/ocr",
      "format": "text",
      "provider": "aws"
    },
    "exported": {
      "use": ["image-ocr"],
      "robot": "/s3/store",
      "credentials": "YOUR_AWS_CREDENTIALS"
    }
  }
}

There are many options for getting files to Transloadit, but we will be using the /upload/handle Robot in our first Step, :original.

Our text recognition occurs in the following Step, image-ocr. Here, we pass :original to our /image/ocr Robot and set its format parameter to text, so that the result is returned as a text file. If format is not set, the Robot will output JSON by default. Additionally, we have specified that we want to use aws as the backend AI provider. Writing gcp here instead would leverage the Google Cloud Platform, with no changes to the interface or structure of the data returned. Only the data itself (i.e., the text read) may vary, as each AI platform was trained independently and continues to evolve independently.
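To make the provider swap concrete, here is a minimal Python sketch that builds the first two Steps of the Template as a plain dictionary. The helper name build_ocr_steps and its arguments are our own invention for illustration and are not part of any Transloadit SDK. The only values it relies on are the ones mentioned in this post: aws and gcp for provider, and text for format.

```python
import json

# Providers named in this post; this sketch rejects anything else.
SUPPORTED_PROVIDERS = {"aws", "gcp"}


def build_ocr_steps(provider="aws", output_format="text"):
    """Return the upload and OCR Steps as a dict (hypothetical helper)."""
    if provider not in SUPPORTED_PROVIDERS:
        raise ValueError(f"unknown provider: {provider!r}")

    ocr_step = {
        "use": ":original",
        "robot": "/image/ocr",
        "provider": provider,
    }
    # Leaving "format" out lets the Robot fall back to its JSON default.
    if output_format is not None:
        ocr_step["format"] = output_format

    return {
        "steps": {
            ":original": {"robot": "/upload/handle"},
            "image-ocr": ocr_step,
        }
    }


if __name__ == "__main__":
    # Same Template, different backend: only the "provider" value changes.
    print(json.dumps(build_ocr_steps(provider="gcp"), indent=2))
```

Because the rest of the Template stays identical, comparing both providers on the same image only requires generating the Steps twice with a different provider value.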

After the text has been extracted and returned, we export the results to our S3 Bucket. YOUR_AWS_CREDENTIALS refers to the Template Credentials that can be set up in the Credentials tab of your Transloadit Console.

If you have not already done so, you will need to set up your own Amazon S3 Bucket first.


Now, let's test our Template with the image below:

input file

Once our Assembly has finished encoding, we will be left with the following result:



As you can see, the Assembly was a success. Hopefully, in this short introduction, we have shown that extracting text from an image is not difficult or time-consuming when using our /image/ocr Robot.

What's more, thanks to Transloadit's composability, all of our 60 wildly different features can be strung together to create workflows unique to your use case. In other words, the OCR Step could be just one of the many cogs that make up your intricate machine. As shown, it only takes a declaratively written JSON recipe to set this up, making it a fool- and bullet-proof method of adding great value to your business.
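Since Steps are strung together through their use parameter, a quick local check can catch a misspelled Step name before a Template is saved. The sketch below is only an illustration: find_dangling_uses is a hypothetical helper of our own, it handles just the two shapes of use seen in this post (a single Step name or a list of names), and the use list of the exported Step is an assumption.

```python
def find_dangling_uses(steps):
    """Return (step_name, missing_input) pairs for unresolved "use" references."""
    dangling = []
    for name, step in steps.items():
        used = step.get("use", [])
        # This sketch only handles the two shapes seen in this post:
        # a single Step name or a list of Step names.
        if isinstance(used, str):
            used = [used]
        for source in used:
            if source not in steps:
                dangling.append((name, source))
    return dangling


# Illustrative wiring; the "use" list of "exported" is an assumption.
steps = {
    ":original": {"robot": "/upload/handle"},
    "image-ocr": {"use": ":original", "robot": "/image/ocr", "provider": "aws"},
    "exported": {"use": ["image-ocr"], "robot": "/s3/store"},
}

if __name__ == "__main__":
    # An empty list means every "use" points at a Step that exists.
    print(find_dangling_uses(steps))
```

Any additional Step you add to the chain goes through the same check, so the recipe stays declarative while still being verifiable.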

Because this Robot is a member of our AI family, it is only available to our paying customers. If you are interested in using this feature, please consider upgrading to a premium plan, the first of which costs $49/mo and includes 10GB of encoding data.