Automatic spoken language detection with cURL & open source

In this post, we explore how to automatically detect the spoken language in an audio file using Vosk, a robust open-source speech recognition toolkit, combined with language detection capabilities. This solution provides a practical approach for developers who need to process multilingual audio content programmatically.
System requirements
Before we begin, ensure your system meets these requirements:
- RAM: At least 300MB for small models
- CPU: Any modern processor (i3/i5/i7 or AMD equivalent)
- Disk Space: ~50MB for small models
- Python 3.7 or newer
Installation
First, set up your environment with the necessary packages:
# Create a virtual environment
python3 -m venv venv
source venv/bin/activate
# Install required packages
pip install vosk vosk-server langdetect
# Download a language model
wget https://alphacephei.com/vosk/models/vosk-model-small-en-us-0.15.zip
unzip vosk-model-small-en-us-0.15.zip
Setting up the transcription server
Vosk provides an HTTP server that we can interact with using cURL. Start the server:
vosk-server-http --model vosk-model-small-en-us-0.15
The server will start on port 2700 by default.
Transcribing audio with cURL
With the server running, you can transcribe audio files using cURL:
curl -X POST http://localhost:2700/asr \
--data-binary @audio.wav \
-H "Content-Type: audio/wav" \
-o transcript.json
The server returns a JSON response containing the transcription:
{
  "result": [
    {
      "conf": 0.96,
      "end": 1.02,
      "start": 0.0,
      "word": "hello"
    },
    {
      "conf": 0.89,
      "end": 1.68,
      "start": 1.02,
      "word": "world"
    }
  ],
  "text": "hello world"
}
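The per-word `conf` scores in the response are worth inspecting before you trust a transcript. As a small sketch (the helper name `low_confidence_words` and the 0.9 threshold are our own choices, not part of Vosk), you can flag words the recognizer was unsure about:

```python
import json

def low_confidence_words(transcript, threshold=0.9):
    """Return words whose recognition confidence falls below the threshold."""
    return [w["word"] for w in transcript.get("result", []) if w["conf"] < threshold]

# Example using the response shown above (normally: json.load(open("transcript.json")))
transcript = {
    "result": [
        {"conf": 0.96, "end": 1.02, "start": 0.0, "word": "hello"},
        {"conf": 0.89, "end": 1.68, "start": 1.02, "word": "world"},
    ],
    "text": "hello world",
}
print(low_confidence_words(transcript))  # → ['world']
```

Flagged words are good candidates for manual review, or for excluding from the text you feed into language detection below.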
Language detection
Implement language detection on the transcript text using the langdetect library (fixing the seed makes its results deterministic):
from langdetect import detect, DetectorFactory
import json

# Set seed for consistent results
DetectorFactory.seed = 0

def detect_language(transcript_file):
    try:
        with open(transcript_file, 'r') as f:
            data = json.load(f)
        if 'text' not in data:
            raise ValueError("No transcript text found in JSON")
        text = data['text']
        if not text.strip():
            raise ValueError("Empty transcript")
        return detect(text)
    except Exception as e:
        print(f"Error detecting language: {str(e)}")
        return None

# Usage
language = detect_language('transcript.json')
if language:
    print(f"Detected language: {language}")
Performance optimization
To optimize your speech recognition workflow:
- Use small models for quick processing or edge devices
- Process audio in chunks for long files
- Convert audio to 16kHz mono WAV format for best results
- Consider batch processing for multiple files
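For the chunking suggestion above, Python's standard `wave` module is enough to split a long recording into fixed-length pieces. This is a minimal sketch (the function name `split_wav` and the 30-second chunk length are our assumptions); each chunk can then be written out as its own WAV file, copying the original parameters via `wf.getparams()`, and sent to the server separately:

```python
import wave

def split_wav(path, chunk_seconds=30):
    """Yield (chunk_index, raw_frames) pairs from a WAV file in fixed-length chunks."""
    with wave.open(path, "rb") as wf:
        frames_per_chunk = wf.getframerate() * chunk_seconds
        index = 0
        while True:
            frames = wf.readframes(frames_per_chunk)
            if not frames:
                break
            yield index, frames
            index += 1
```

Keeping chunks at a sentence-friendly length (tens of seconds) avoids blowing up server memory while still giving langdetect enough text per chunk to work with.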
# Convert audio to optimal format using FFmpeg
ffmpeg -i input.mp3 -ar 16000 -ac 1 output.wav
Error handling
Implement robust error handling for production use:
from vosk import Model, KaldiRecognizer
import os
import sys
import wave

def process_audio(audio_file, model_path):
    try:
        if not os.path.exists(model_path):
            raise FileNotFoundError(f"Model not found at {model_path}")
        model = Model(model_path)
        with wave.open(audio_file, "rb") as wf:
            if wf.getnchannels() != 1:
                raise ValueError("Audio must be mono")
            rec = KaldiRecognizer(model, wf.getframerate())
            # Feed the audio in small blocks of frames
            while True:
                data = wf.readframes(4000)
                if len(data) == 0:
                    break
                rec.AcceptWaveform(data)
            return rec.FinalResult()
    except Exception as e:
        print(f"Error processing audio: {str(e)}")
        sys.exit(1)
Tips and best practices
- Validate audio format before processing
- Monitor server memory usage with large models
- Implement rate limiting for production deployments
- Cache frequently used language models
- Test with various audio qualities and accents
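The first tip above, validating the audio format, can be done with the standard `wave` module before any file reaches the recognizer. A small sketch (the function name `validate_wav` is ours; the checks mirror the 16 kHz mono, 16-bit PCM format recommended earlier):

```python
import wave

def validate_wav(path, expected_rate=16000):
    """Check that a WAV file is 16-bit mono PCM at the expected sample rate.

    Returns a list of problems; an empty list means the file is ready for Vosk.
    """
    problems = []
    with wave.open(path, "rb") as wf:
        if wf.getnchannels() != 1:
            problems.append(f"expected mono, got {wf.getnchannels()} channels")
        if wf.getsampwidth() != 2:
            problems.append(f"expected 16-bit samples, got {wf.getsampwidth() * 8}-bit")
        if wf.getframerate() != expected_rate:
            problems.append(f"expected {expected_rate} Hz, got {wf.getframerate()} Hz")
    return problems
```

Files that fail validation can be routed through the FFmpeg conversion shown earlier instead of being rejected outright.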
As a final note, if you need a production-ready solution with advanced features and reliable performance, check out our Speech Robot for seamless speech transcription capabilities.
Happy coding!