Transcribe speech in audio or video files
🤖/speech/transcribe transcribes speech in audio or video files.

You can use the returned text in your application, or pass it to other Robots, for example to filter audio or video files that contain (or do not contain) certain content, or to burn the text into images or video.
Other common use cases are automatically subtitling videos and making audio searchable.
Set speaker_labels to true when you want JSON or meta transcription output to distinguish
recurring speakers:
{
  "steps": {
    "transcribed": {
      "use": ":original",
      "robot": "/speech/transcribe",
      "provider": "aws",
      "format": "json",
      "speaker_labels": true,
      "max_speakers": 3
    }
  }
}
Speaker labels are currently supported by the aws and gcp providers. If you enable
speaker_labels without setting provider, Transloadit uses aws for that Step. Labels
are normalized as speaker_1, speaker_2, and so on:
{
  "text": "Hello there. Hi!",
  "words": [
    { "text": "Hello", "startTime": 0, "endTime": 0.5, "speaker": "speaker_1" },
    { "text": "there", "startTime": 0.6, "endTime": 1, "speaker": "speaker_1" },
    { "text": "Hi!", "startTime": 1.2, "endTime": 1.8, "speaker": "speaker_2" }
  ],
  "segments": [
    { "text": "Hello there", "startTime": 0, "endTime": 1, "speaker": "speaker_1" },
    { "text": "Hi!", "startTime": 1.2, "endTime": 1.8, "speaker": "speaker_2" }
  ]
}
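As a rough illustration of how the word-level speaker labels relate to the grouped segments, the sketch below merges consecutive words from the same speaker into turns. It runs client-side on the sample output shown above; the function name group_words_by_speaker is hypothetical and this is not Transloadit's own grouping logic.

```python
from itertools import groupby

# Word list copied from the sample JSON output above.
words = [
    {"text": "Hello", "startTime": 0, "endTime": 0.5, "speaker": "speaker_1"},
    {"text": "there", "startTime": 0.6, "endTime": 1, "speaker": "speaker_1"},
    {"text": "Hi!", "startTime": 1.2, "endTime": 1.8, "speaker": "speaker_2"},
]


def group_words_by_speaker(words):
    """Merge runs of consecutive words that share a speaker label (hypothetical helper)."""
    turns = []
    for speaker, run in groupby(words, key=lambda w: w.get("speaker")):
        run = list(run)
        turns.append({
            "speaker": speaker,
            "text": " ".join(w["text"] for w in run),
            "startTime": run[0]["startTime"],  # start of the first word in the run
            "endTime": run[-1]["endTime"],     # end of the last word in the run
        })
    return turns


turns = group_words_by_speaker(words)
for turn in turns:
    print(f'{turn["speaker"]} [{turn["startTime"]}-{turn["endTime"]}]: {turn["text"]}')
```

For this sample, the resulting turns carry the same text, times, and speakers as the segments array in the output above.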
Transcribe speech in French from uploaded audio or video, and save it to a text file:
{
  "steps": {
    "transcribed": {
      "robot": "/speech/transcribe",
      "use": ":original",
      "provider": "replicate",
      "source_language": "fr-FR",
      "format": "text"
    }
  }
}
interpolate: boolean | Record<string, boolean>
Controls whether Assembly Variables are interpolated for individual instruction fields.
By default, most Robot instruction fields interpolate Assembly Variables. Set this to false to treat every instruction field as literal text, or set an individual field path to false to treat only that field as literal text. For Robot-specific fields that are literal by default, set this to true or set that field path to true to opt back into interpolation.
Use field names such as path, or dotted paths such as ffmpeg.vf for nested objects.
output_meta: Record<string, boolean> | boolean | Array<string>
Allows you to specify a set of metadata that is more expensive on CPU power to calculate, and thus is disabled by default to keep your Assemblies processing fast.
For images, you can add "has_transparency": true in this object to extract if the image contains transparent parts and "dominant_colors": true to extract an array of hexadecimal color codes from the image.
For images, you can also add "blurhash": true to extract a BlurHash string — a compact representation of a placeholder for the image, useful for showing a blurred preview while the full image loads.
For videos, you can add "colorspace": true to extract the colorspace of the output video.
For videos, you can also add "interlaced": true to detect whether the video is interlaced. This combines the cheap ffprobe field_order flag with a bounded idet sampling pass over the first frames of the source, exposing interlaced, field_order, and a diagnostic interlace_detection object under file.meta. This is computationally expensive and billed accordingly.
For audio, you can add "mean_volume": true to get a single value representing the mean average volume of the audio file.
You can also set this to false to skip metadata extraction and speed up transcoding.
result: boolean (default: false)
Whether the results of this Step should be present in the Assembly Status JSON.
queue: batch
Setting the queue to "batch" manually downgrades the priority of jobs for this Step, so that jobs that don't need zero queue waiting times do not consume Priority job slots.
force_accept: boolean (default: false)
Force a Robot to accept a file type it would have ignored.
By default, Robots ignore files they are not familiar with. 🤖/video/encode, for example, will happily ignore input images.
With the force_accept parameter set to true, you can force Robots to accept all files thrown at them.
This will typically lead to errors and should only be used for debugging or combatting edge cases.
ignore_errors: boolean | Array<meta | execute> (default: [])
Ignore errors during specific phases of processing.
Setting this to ["meta"] will cause the Robot to ignore errors during metadata extraction.
Setting this to ["execute"] will cause the Robot to ignore errors during the main execution phase.
Setting this to true is equivalent to ["meta", "execute"] and will ignore errors in both phases.
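The mapping from an ignore_errors value to the phases it covers can be sketched as follows. This is only an illustration of the rules stated above, not Transloadit's implementation; the helper name ignored_phases is hypothetical, and treating false like the default [] is an assumption.

```python
ALL_PHASES = {"meta", "execute"}


def ignored_phases(ignore_errors=()):
    """Return the set of phases whose errors would be ignored (hypothetical helper)."""
    if ignore_errors is True:
        # true is documented as equivalent to ["meta", "execute"].
        return set(ALL_PHASES)
    if ignore_errors is False:
        # Assumption: false behaves like the default [], ignoring nothing.
        return set()
    phases = set(ignore_errors)
    unknown = phases - ALL_PHASES
    if unknown:
        raise ValueError(f"unknown phase(s): {sorted(unknown)}")
    return phases


print(ignored_phases(["meta"]))
print(ignored_phases(True))
```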
use: string | Array<string> | Array<object> | object
Specifies which Step(s) to use as input.
":original" is reserved for user uploads handled by Transloadit. Several Steps can be given as input with an array:
{
  "use": [
    ":original",
    "encoded",
    "resized"
  ]
}
Set "as" on an input to pass semantic intent to Robots:
{
  "use": [
    {
      "name": ":original",
      "as": "image"
    },
    {
      "name": ":original",
      "as": "mask"
    }
  ]
}
That's likely all you need to know about use, but you can view Advanced use cases.
provider: aws | gcp | replicate
Which AI provider to leverage.
Defaults to "replicate", which currently uses our highest-quality deployed transcription path while ElevenLabs Scribe support is being prepared. When speaker_labels is true and provider is omitted, Transloadit defaults to "aws", because speaker labels are currently supported by the aws and gcp providers.
Transloadit abstracts the interface, so you can expect the same data structures from every provider, although latency and the information returned can differ. Different cloud vendors shine in different areas, and we recommend trying each one to see which yields the best results for your use case.
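The provider default described above can be sketched as a small lookup. This only mirrors the documented rule for illustration; the function name effective_provider is hypothetical and the code is not how Transloadit resolves the provider internally.

```python
def effective_provider(step):
    """Return the provider a Step would end up with under the documented defaults."""
    provider = step.get("provider")
    if provider is not None:
        # An explicitly set provider is used as given.
        return provider
    if step.get("speaker_labels") is True:
        # Speaker labels are documented as supported by aws and gcp only,
        # so the default switches to aws when they are requested.
        return "aws"
    return "replicate"


print(effective_provider({"robot": "/speech/transcribe"}))
print(effective_provider({"robot": "/speech/transcribe", "speaker_labels": True}))
```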
granularity: full | list (default: "full")
Whether to return a full response ("full"), or a flat list of descriptions ("list").
format: json | meta | srt | text | webvtt (default: "json")
Output format for the transcription.
"text" outputs a plain text file that you can store and process.
"json" outputs a JSON file containing timestamped words. When speaker_labels is enabled, words can include speaker labels and the JSON can also include grouped segments by speaker.
"srt" and "webvtt" output subtitle files of those respective file types, which can be stored separately or used in other encoding Steps.
"meta" does not return a file, but stores the data inside Transloadit's file object (under ${file.meta.transcription.text}, ${file.meta.transcription.words}, and, when speaker labels are available, ${file.meta.transcription.segments}) that's passed around between encoding Steps, so that you can use the values to burn the data into videos, filter on them, etc.
speaker_labels: boolean (default: false)
When enabled, Transloadit asks the transcription provider to distinguish different speakers. JSON and meta output can then include speaker labels such as "speaker_1" on individual words, plus grouped segments by speaker. Text, SRT, and WebVTT output behavior is unchanged.
Speaker labels identify recurring voices, not real person names. Accuracy depends on audio quality, background noise, overlapping speech, and the number of speakers.
max_speakers: string | number (default: 10)
The maximum number of speakers to detect when speaker_labels is enabled.
source_language: string (default: "en-US")
The spoken language of the audio or video. This will also be the language of the transcribed text.
The language should be specified in the BCP-47 format, such as "en-GB", "de-DE" or "fr-FR". Please also consult the list of supported languages for the gcp provider and the aws provider.
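A quick client-side sanity check for the "language-REGION" shape used in the examples above could look like the sketch below. The pattern is a deliberate simplification and an assumption of this example: full BCP-47 also allows scripts, variants, and other subtags, and a tag that passes here may still be unsupported by a given provider. The helper name is hypothetical.

```python
import re

# Simplified shape only: 2-3 lowercase letters, a hyphen, 2 uppercase letters.
# This is NOT the full BCP-47 grammar.
LANGUAGE_REGION = re.compile(r"[a-z]{2,3}-[A-Z]{2}")


def looks_like_language_region(tag):
    """Return True if tag has the simple language-REGION shape, e.g. "fr-FR"."""
    return LANGUAGE_REGION.fullmatch(tag) is not None


for candidate in ["en-GB", "de-DE", "fr-FR", "french", "en_US"]:
    print(candidate, looks_like_language_region(candidate))
```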
target_language: string (default: "en-US")
The language of the written text.
The language should be specified in the [BCP-47](https://www.rfc-editor.org/rfc/bcp/bcp47.txt) format, such as `"en-GB"`, `"de-DE"` or `"fr-FR"`. Please consult the list of supported languages and voices.