Last updated: February 4, 2025

Automatic spoken language detection with cURL & open source

Tim Koschützki

Co-founder · Berlin, Germany

In this post, we explore how to automatically detect the spoken language in an audio file. We transcribe the audio with Vosk, an open-source speech recognition toolkit, and then run the langdetect library on the resulting text. The approach is aimed at developers who need to process multilingual audio content programmatically.

System requirements

Before we begin, ensure your system meets these requirements:

RAM: At least 300MB for small models
CPU: Any modern processor (i3/i5/i7 or AMD equivalent)
Disk Space: ~50MB for small models
Python 3.7 or newer

Installation

First, set up your environment with the necessary packages:

# Create a virtual environment
python3 -m venv venv
source venv/bin/activate

# Install required packages
pip install vosk vosk-server langdetect

# Download a language model
wget https://alphacephei.com/vosk/models/vosk-model-small-en-us-0.15.zip
unzip vosk-model-small-en-us-0.15.zip

Setting up the transcription server

Vosk provides an HTTP server that we can interact with using cURL. Start the server:

vosk-server-http --model vosk-model-small-en-us-0.15

The server will start on port 2700 by default.

Transcribing audio with cURL

With the server running, you can transcribe audio files using cURL:

curl -X POST http://localhost:2700/asr \
  --data-binary @audio.wav \
  -H "Content-Type: audio/wav" \
  -o transcript.json

The server returns a JSON response containing the transcription:

{
  "result": [
    {
      "conf": 0.96,
      "end": 1.02,
      "start": 0.0,
      "word": "hello"
    },
    {
      "conf": 0.89,
      "end": 1.68,
      "start": 1.02,
      "word": "world"
    }
  ],
  "text": "hello world"
}
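
The per-word conf values in this response can help judge how trustworthy a transcript is before passing it on to language detection. The following is a minimal Python sketch, assuming the response has the shape shown above; the helper name summarize_transcript and the 0.5 threshold are illustrative choices, not part of Vosk.

```python
import json

def summarize_transcript(data, min_conf=0.5):
    """Return the text, the mean word confidence, and any low-confidence words."""
    words = data.get("result", [])
    confs = [w["conf"] for w in words]
    # Avoid dividing by zero when the response contains no words
    mean_conf = sum(confs) / len(confs) if confs else 0.0
    low = [w["word"] for w in words if w["conf"] < min_conf]
    return {
        "text": data.get("text", ""),
        "mean_conf": mean_conf,
        "low_conf_words": low,
    }

# Sample data copied from the response shown above
sample = json.loads("""
{
  "result": [
    {"conf": 0.96, "end": 1.02, "start": 0.0, "word": "hello"},
    {"conf": 0.89, "end": 1.68, "start": 1.02, "word": "world"}
  ],
  "text": "hello world"
}
""")

summary = summarize_transcript(sample)
print(summary["text"], round(summary["mean_conf"], 3), summary["low_conf_words"])
```

A transcript with a low mean confidence is a hint that the audio quality is poor or that the model does not match the spoken language.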

Language detection

Detect the language of the transcript text using the langdetect library:

from langdetect import detect, DetectorFactory
import json

# Set seed for consistent results
DetectorFactory.seed = 0

def detect_language(transcript_file):
    try:
        with open(transcript_file, 'r') as f:
            data = json.load(f)

        if 'text' not in data:
            raise ValueError("No transcript text found in JSON")

        text = data['text']
        if not text.strip():
            raise ValueError("Empty transcript")

        language = detect(text)
        return language

    except Exception as e:
        print(f"Error detecting language: {str(e)}")
        return None

# Usage
language = detect_language('transcript.json')
if language:
    print(f"Detected language: {language}")
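
Short transcripts can make language detection unstable, so it may help to look at the probabilities langdetect reports instead of only its top guess. The sketch below assumes langdetect's detect_langs function, which returns candidates with lang and prob attributes. The helper names pick_language and detect_with_confidence and the 0.9 threshold are hypothetical and should be tuned to your data.

```python
def pick_language(candidates, min_prob=0.9):
    """Return the most probable language code, or None if it is below min_prob.

    candidates is a list of (language_code, probability) pairs.
    """
    if not candidates:
        return None
    lang, prob = max(candidates, key=lambda pair: pair[1])
    return lang if prob >= min_prob else None


def detect_with_confidence(text, min_prob=0.9):
    """Run langdetect on text and keep the result only if it is confident."""
    # Imported lazily so pick_language works without langdetect installed
    from langdetect import detect_langs

    candidates = [(c.lang, c.prob) for c in detect_langs(text)]
    return pick_language(candidates, min_prob)
```

Returning None for uncertain cases lets the caller decide what to do next, for example flagging the file for manual review.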

Performance optimization

To optimize your speech recognition workflow:

Use small models for quick processing or edge devices
Process audio in chunks for long files
Convert audio to 16kHz mono WAV format for best results
Consider batch processing for multiple files

# Convert audio to optimal format using FFmpeg
ffmpeg -i input.mp3 -ar 16000 -ac 1 output.wav
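
For batch processing, the same FFmpeg conversion can be applied to a whole folder. This is a rough sketch, assuming ffmpeg is available on the PATH; the function names and the directory layout are made up for illustration.

```python
import subprocess
from pathlib import Path

def build_ffmpeg_command(src, dst):
    """Build the ffmpeg call that converts src to 16 kHz mono WAV at dst."""
    # -y overwrites an existing output file without prompting
    return ["ffmpeg", "-y", "-i", str(src), "-ar", "16000", "-ac", "1", str(dst)]

def convert_folder(in_dir, out_dir, pattern="*.mp3"):
    """Convert every file in in_dir matching pattern; return the output paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    converted = []
    for src in sorted(Path(in_dir).glob(pattern)):
        dst = out_dir / (src.stem + ".wav")
        # check=True raises CalledProcessError if ffmpeg exits with an error
        subprocess.run(build_ffmpeg_command(src, dst), check=True)
        converted.append(dst)
    return converted
```

Keeping the command construction in its own function makes it easy to inspect or log the exact call before anything is executed.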

Error handling

Implement robust error handling for production use:

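import json
import os
import sys
import wave

from vosk import Model, KaldiRecognizer

def process_audio(audio_file, model_path):
    try:
        if not os.path.exists(model_path):
            raise FileNotFoundError(f"Model not found at {model_path}")

        model = Model(model_path)

        with wave.open(audio_file, "rb") as wf:
            # Vosk expects mono, 16-bit PCM audio
            if wf.getnchannels() != 1:
                raise ValueError("Audio must be mono")
            if wf.getsampwidth() != 2 or wf.getcomptype() != "NONE":
                raise ValueError("Audio must be 16-bit PCM")

            rec = KaldiRecognizer(model, wf.getframerate())
            segments = []

            while True:
                data = wf.readframes(4000)
                if len(data) == 0:
                    break
                # AcceptWaveform returns True at the end of an utterance;
                # collect that segment before the recognizer moves on
                if rec.AcceptWaveform(data):
                    segments.append(json.loads(rec.Result()).get("text", ""))

            # FinalResult flushes whatever audio is still buffered
            segments.append(json.loads(rec.FinalResult()).get("text", ""))

            # Return JSON with the same "text" key as the server response
            return json.dumps({"text": " ".join(s for s in segments if s)})

    except Exception as e:
        print(f"Error processing audio: {e}")
        sys.exit(1)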

Tips and best practices

Validate audio format before processing
Monitor server memory usage with large models
Implement rate limiting for production deployments
Cache frequently used language models
Test with various audio qualities and accents
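
As an example of the first tip, a WAV file can be checked with Python's standard wave module before it is sent for transcription. This is a minimal sketch; the function name validate_wav and the 16 kHz default are illustrative and follow the format recommended earlier in this post.

```python
import wave

def validate_wav(path, expected_rate=16000):
    """Return a list of format problems; an empty list means the file looks fine."""
    problems = []
    try:
        with wave.open(path, "rb") as wf:
            if wf.getnchannels() != 1:
                problems.append(f"expected mono, got {wf.getnchannels()} channels")
            if wf.getsampwidth() != 2:
                problems.append(f"expected 16-bit samples, got {8 * wf.getsampwidth()}-bit")
            if wf.getframerate() != expected_rate:
                problems.append(f"expected {expected_rate} Hz, got {wf.getframerate()} Hz")
            if wf.getnframes() == 0:
                problems.append("file contains no audio frames")
    except (wave.Error, EOFError, OSError) as e:
        # Covers missing files as well as files that are not valid WAV
        problems.append(f"not a readable WAV file: {e}")
    return problems
```

Collecting all problems instead of stopping at the first one gives a complete picture of what needs to be fixed in a single pass.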

As a final note, if you need a production-ready solution with advanced features and reliable performance, check out our Speech Robot for seamless speech transcription capabilities.

Happy coding!
