In this post, we explore how to automatically detect the spoken language in an audio file by transcribing it with Vosk, an open-source speech recognition toolkit, and then running text-based language detection on the transcript. This gives developers a practical way to process multilingual audio content programmatically.

System requirements

Before we begin, ensure your system meets these requirements:

  • RAM: At least 300MB for small models
  • CPU: Any modern processor (i3/i5/i7 or AMD equivalent)
  • Disk Space: ~50MB for small models
  • Python 3.7 or newer

Installation

First, set up your environment with the necessary packages:

# Create a virtual environment
python3 -m venv venv
source venv/bin/activate

# Install required packages
pip install vosk vosk-server langdetect

# Download a language model
wget https://alphacephei.com/vosk/models/vosk-model-small-en-us-0.15.zip
unzip vosk-model-small-en-us-0.15.zip

Setting up the transcription server

Vosk provides an HTTP server that we can interact with using cURL. Start the server:

vosk-server-http --model vosk-model-small-en-us-0.15

The server will start on port 2700 by default.

Transcribing audio with cURL

With the server running, you can transcribe audio files using cURL:

curl -X POST http://localhost:2700/asr \
  --data-binary @audio.wav \
  -H "Content-Type: audio/wav" \
  -o transcript.json
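
If you would rather stay in Python than shell out to cURL, the same request can be sketched with the standard library. This assumes the URL, port, and content type from the cURL command above; `build_asr_request` and `transcribe` are hypothetical helper names for illustration, not part of Vosk.

```python
import json
import urllib.request


def build_asr_request(audio_bytes, url="http://localhost:2700/asr"):
    """Build a POST request that mirrors the cURL command (URL taken from the post)."""
    return urllib.request.Request(
        url,
        data=audio_bytes,
        headers={"Content-Type": "audio/wav"},
        method="POST",
    )


def transcribe(audio_path, url="http://localhost:2700/asr", timeout=60):
    """Send a WAV file to the server and return the decoded JSON response."""
    with open(audio_path, "rb") as f:
        request = build_asr_request(f.read(), url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Splitting request construction from sending keeps the network call in one place, which makes it easier to add retries or timeouts later.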

The server returns a JSON response containing the transcription:

{
  "result": [
    {
      "conf": 0.96,
      "end": 1.02,
      "start": 0.0,
      "word": "hello"
    },
    {
      "conf": 0.89,
      "end": 1.68,
      "start": 1.02,
      "word": "world"
    }
  ],
  "text": "hello world"
}
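
The per-word `conf` values are useful for judging how much to trust a transcript before passing it on. Here is a minimal sketch, assuming the response has the shape shown above; the function names and the 0.9 threshold are illustrative choices, not Vosk defaults.

```python
import json


def low_confidence_words(response, threshold=0.9):
    """Return (word, conf) pairs whose confidence falls below the threshold."""
    flagged = []
    for entry in response.get("result", []):
        conf = entry.get("conf", 0.0)  # treat a missing score as zero confidence
        if conf < threshold:
            flagged.append((entry.get("word", ""), conf))
    return flagged


def mean_confidence(response):
    """Average word confidence, or None when the response has no scored words."""
    scores = [e["conf"] for e in response.get("result", []) if "conf" in e]
    return sum(scores) / len(scores) if scores else None


sample = json.loads("""
{
  "result": [
    {"conf": 0.96, "end": 1.02, "start": 0.0, "word": "hello"},
    {"conf": 0.89, "end": 1.68, "start": 1.02, "word": "world"}
  ],
  "text": "hello world"
}
""")

print(low_confidence_words(sample))  # → [('world', 0.89)]
```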

Language detection

Detect the language of the transcript with the langdetect library. Detection runs on the transcribed text, so very short transcripts give less stable results:

from langdetect import detect, DetectorFactory
import json

# Set seed for consistent results
DetectorFactory.seed = 0

def detect_language(transcript_file):
    try:
        with open(transcript_file, 'r') as f:
            data = json.load(f)

        if 'text' not in data:
            raise ValueError("No transcript text found in JSON")

        text = data['text']
        if not text.strip():
            raise ValueError("Empty transcript")

        language = detect(text)
        return language

    except Exception as e:
        print(f"Error detecting language: {str(e)}")
        return None

# Usage
language = detect_language('transcript.json')
if language:
    print(f"Detected language: {language}")
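
When a single guess is not enough, langdetect also exposes `detect_langs`, which returns candidate languages with probabilities. The sketch below uses it to reject weak guesses; the `min_words` and `min_prob` thresholds are assumptions you should tune, and the `detect_langs` parameter exists only so the function can be exercised without langdetect installed.

```python
def detect_with_confidence(text, min_words=3, min_prob=0.8, detect_langs=None):
    """Return (language, probability), or None when the guess looks unreliable."""
    if len(text.split()) < min_words:
        return None  # too little text for a meaningful guess

    if detect_langs is None:
        # Imported lazily; langdetect's detect_langs returns objects
        # with .lang and .prob attributes
        from langdetect import detect_langs

    candidates = detect_langs(text)
    if not candidates:
        return None

    best = max(candidates, key=lambda c: c.prob)
    if best.prob < min_prob:
        return None
    return best.lang, best.prob
```

Returning None for uncertain input lets the caller decide whether to skip the file, retry with more audio, or flag it for review.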

Performance optimization

To optimize your speech recognition workflow:

  • Use small models for quick processing or edge devices
  • Process audio in chunks for long files
  • Convert audio to 16kHz mono WAV format for best results
  • Consider batch processing for multiple files

# Convert audio to optimal format using FFmpeg
ffmpeg -i input.mp3 -ar 16000 -ac 1 output.wav
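
After conversion, it is worth confirming that a file really is in the expected format before sending it for recognition. This is a minimal sketch using Python's standard `wave` module; `check_wav_format` is a hypothetical helper, and the 16-bit sample check is an added assumption that goes beyond the "16kHz mono" recommendation above.

```python
import wave


def check_wav_format(path, expected_rate=16000):
    """Return a list of problems; an empty list means the file looks suitable."""
    problems = []
    with wave.open(path, "rb") as wf:
        if wf.getnchannels() != 1:
            problems.append(f"expected mono, found {wf.getnchannels()} channels")
        if wf.getsampwidth() != 2:  # assumption: 16-bit PCM samples
            problems.append(f"expected 16-bit samples, found {8 * wf.getsampwidth()}-bit")
        if wf.getframerate() != expected_rate:
            problems.append(f"expected {expected_rate} Hz, found {wf.getframerate()} Hz")
    return problems
```

A file that fails the check can be routed back through the FFmpeg command rather than producing a poor transcript.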

Error handling

Implement robust error handling for production use:

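from vosk import Model, KaldiRecognizer
import json
import os
import sys
import wave

def process_audio(audio_file, model_path):
    """Transcribe a WAV file and return a JSON string with a "text" field."""
    try:
        if not os.path.exists(model_path):
            raise FileNotFoundError(f"Model not found at {model_path}")

        model = Model(model_path)

        with wave.open(audio_file, "rb") as wf:
            # Vosk expects mono 16-bit PCM audio
            if wf.getnchannels() != 1:
                raise ValueError("Audio must be mono")
            if wf.getsampwidth() != 2:
                raise ValueError("Audio must be 16-bit PCM")

            rec = KaldiRecognizer(model, wf.getframerate())
            segments = []

            while True:
                data = wf.readframes(4000)
                if len(data) == 0:
                    break
                # AcceptWaveform returns True at the end of an utterance;
                # collect that segment before feeding more audio
                if rec.AcceptWaveform(data):
                    segments.append(json.loads(rec.Result()).get("text", ""))

            # FinalResult holds whatever remains after the last utterance
            segments.append(json.loads(rec.FinalResult()).get("text", ""))
            return json.dumps({"text": " ".join(s for s in segments if s)})

    except Exception as e:
        print(f"Error processing audio: {e}")
        sys.exit(1)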
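
Since `process_audio` returns a JSON string, its output can be fed straight into language detection without going through a file. The sketch below shows one way to do that; `text_from_result` and `detect_from_result` are hypothetical helpers, and the `detect` parameter is there only so the logic can be tested without langdetect installed.

```python
import json


def text_from_result(result_json):
    """Extract the transcript text from a Vosk-style JSON result string."""
    try:
        data = json.loads(result_json)
    except (TypeError, json.JSONDecodeError):
        return ""
    if not isinstance(data, dict):
        return ""
    return str(data.get("text", "")).strip()


def detect_from_result(result_json, detect=None):
    """Return the detected language code, or None for an empty transcript."""
    text = text_from_result(result_json)
    if not text:
        return None
    if detect is None:
        from langdetect import detect  # imported lazily
    return detect(text)
```

For example, `detect_from_result(process_audio("output.wav", "vosk-model-small-en-us-0.15"))` would chain the two steps, assuming the file and model paths used earlier in this post.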

Tips and best practices

  • Validate audio format before processing
  • Monitor server memory usage with large models
  • Implement rate limiting for production deployments
  • Cache frequently used language models
  • Test with various audio qualities and accents
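
As an illustration of the model caching tip, here is a minimal in-memory cache that loads each model once per process and reuses it afterwards. `ModelCache` is a hypothetical class written for this post, not something Vosk ships. It is not thread-safe as written, so add a lock if several threads share it.

```python
class ModelCache:
    """Keep loaded models in memory, keyed by path, so each one loads only once."""

    def __init__(self, loader=None):
        # loader is any callable that takes a model path; defaults to vosk.Model
        self._loader = loader
        self._models = {}

    def get(self, model_path):
        if model_path not in self._models:
            loader = self._loader
            if loader is None:
                from vosk import Model  # imported lazily, only when first needed
                loader = Model
            self._models[model_path] = loader(model_path)
        return self._models[model_path]


# Usage (assumes the model directory downloaded earlier in this post):
# cache = ModelCache()
# model = cache.get("vosk-model-small-en-us-0.15")
```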

As a final note, if you need a production-ready solution with advanced features and reliable performance, check out our Speech Robot for seamless speech transcription capabilities.

Happy coding!