Integrating OCR in the browser with tesseract.js
Optical Character Recognition (OCR) has traditionally been a server-side task, requiring users to upload documents to a server for processing. However, with advancements in web technologies, it's now possible to perform text recognition directly in the browser. This shift towards browser-based OCR offers immediate feedback, enhanced privacy, and reduced server load. In this article, we'll explore how to integrate OCR into your web applications using the open-source Tesseract.js library, enabling instant text recognition without leaving the browser.
Why browser-based OCR?
Performing OCR in the browser offers several benefits:
- Immediate Feedback: Recognition starts as soon as a file is selected, with no upload or server round trip.
- Enhanced Privacy: Sensitive documents never leave the user's device, addressing privacy concerns.
- Reduced Server Load: Offloading processing to the client reduces server costs and resource usage.
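- Offline Capabilities: Once the library, its WebAssembly core, and the language data are cached or self-hosted, users can perform OCR without an internet connection. By default these assets are downloaded on first use.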
Introducing tesseract.js: a powerful open-source OCR library
Tesseract.js is an open-source JavaScript library that brings the Tesseract OCR engine, long sponsored by Google, to web applications. It runs a WebAssembly build of the engine inside a Web Worker, so developers can extract text from images entirely in the browser, without server-side processing.
Getting started with tesseract.js in your web application
Before diving into code, ensure your development environment is set up properly.
Setting up your development environment
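First, add Tesseract.js to your project using npm or include it directly via a CDN. The examples in this article use the Tesseract.js v4 API (loadLanguage and initialize), which later major versions changed, so install a matching 4.x release.
Using npm
npm install tesseract.js@4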
Using CDN
<script src="https://unpkg.com/tesseract.js@v4.0.2/dist/tesseract.min.js"></script>
Basic example: recognizing text from an image
Here's a simple example of how to implement OCR in your web application:
<input type="file" id="imageInput" accept="image/*" />
<div id="result"></div>
<script>
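  // In Tesseract.js v4, createWorker() returns a Promise that resolves to a
  // worker that is already loaded, so no separate worker.load() call is needed.
  // Create and initialize the worker once, then reuse it for every image.
  const workerReady = (async () => {
    const worker = await Tesseract.createWorker({
      logger: (m) => console.log(m),
    })
    await worker.loadLanguage('eng')
    await worker.initialize('eng')
    return worker
  })()
  async function doOCR(file) {
    const worker = await workerReady
    const {
      data: { text },
    } = await worker.recognize(file)
    return text
  }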
  document.getElementById('imageInput').addEventListener('change', async (e) => {
    const file = e.target.files[0]
    // files[0] is undefined when the selection was cleared
    if (!file) return
    const resultElement = document.getElementById('result')
    resultElement.textContent = 'Processing...'
    try {
      const text = await doOCR(file)
      resultElement.textContent = text
    } catch (error) {
      console.error('OCR Error:', error)
      resultElement.textContent = `Error: ${error.message}`
    }
  })
</script>
This code allows users to select an image file and displays the extracted text in the browser.
Handling different languages
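Tesseract.js supports many languages, identified by codes such as eng (English) or deu (German). Specify the code when loading and initializing the worker; the matching trained-data file is downloaded the first time a language is used.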
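async function doOCR(file, lang = 'eng') {
  // A dedicated worker per call keeps this example self-contained; reuse a
  // single worker when processing many images in the same language.
  // In v4, createWorker() resolves to a worker that is already loaded.
  const worker = await Tesseract.createWorker()
  try {
    await worker.loadLanguage(lang)
    await worker.initialize(lang)
    const {
      data: { text },
    } = await worker.recognize(file)
    return text
  } finally {
    // Release the worker even if recognition fails
    await worker.terminate()
  }
}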
To recognize text in multiple languages simultaneously:
await worker.loadLanguage('eng+deu') // English and German
await worker.initialize('eng+deu')
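When the set of languages comes from user input or configuration, it can help to build the combined string in one place. The following is a minimal sketch; buildLangString is a hypothetical helper, not part of Tesseract.js, and it only assembles the plus-separated string shown above.

```javascript
// Hypothetical helper (not part of Tesseract.js): builds the "eng+deu" style
// string that loadLanguage() and initialize() accept.
function buildLangString(codes) {
  // Trim whitespace, drop empty entries, and remove duplicates while
  // preserving the original order.
  const unique = [...new Set(codes.map((c) => c.trim()).filter(Boolean))]
  if (unique.length === 0) {
    throw new Error('At least one language code is required.')
  }
  return unique.join('+')
}

// Usage inside an async function, assuming `worker` is an existing worker:
// const langs = buildLangString(['eng', 'deu'])
// await worker.loadLanguage(langs)
// await worker.initialize(langs)
```

Keep the list short where possible: each additional language typically means another trained-data download and more work per recognition.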
Advanced use case: extracting text from PDFs in the browser
Tesseract.js does not read PDFs directly, but you can render each page to a canvas with PDF.js and run OCR on the result.
async function extractTextFromPDF(file) {
  const pdfjsLib = window['pdfjs-dist/build/pdf']
  // Read the file into memory instead of creating an object URL that would
  // have to be revoked later
  const data = await file.arrayBuffer()
  const pdf = await pdfjsLib.getDocument({ data }).promise
  let fullText = ''
  // Loop through each page
  for (let pageNum = 1; pageNum <= pdf.numPages; pageNum++) {
    const page = await pdf.getPage(pageNum)
    // Render at 2x; scale 1.0 is often too low-resolution for reliable OCR
    const viewport = page.getViewport({ scale: 2.0 })
    const canvas = document.createElement('canvas')
    const context = canvas.getContext('2d')
    canvas.height = viewport.height
    canvas.width = viewport.width
    await page.render({ canvasContext: context, viewport }).promise
    // Perform OCR on the rendered page
    const text = await doOCR(canvas)
    fullText += text + '\n'
  }
  return fullText
}
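Note: You'll need to include PDF.js in your project to handle PDF rendering and point pdfjsLib.GlobalWorkerOptions.workerSrc at its worker script. The window['pdfjs-dist/build/pdf'] lookup matches older script-tag builds of PDF.js; adjust it to however your build exposes the library.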
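The viewport scale controls how many pixels PDF.js renders per PDF point (1/72 inch), so a scale of 1.0 corresponds to roughly 72 DPI, while Tesseract generally performs better on input near 300 DPI. The sketch below shows one way to derive the scale from a target DPI. The helper name scaleForDpi and the 4096-pixel width cap are assumptions for illustration, not part of PDF.js or Tesseract.js.

```javascript
// PDF user space uses 72 points per inch.
const PDF_POINTS_PER_INCH = 72

// Hypothetical helper: converts a target DPI into a PDF.js viewport scale.
// The result is capped so that very wide pages do not produce canvases wider
// than maxCanvasWidth pixels (an assumed limit; tune it for your devices).
function scaleForDpi(targetDpi, pageWidthPts, maxCanvasWidth = 4096) {
  const scale = targetDpi / PDF_POINTS_PER_INCH
  const maxScale = maxCanvasWidth / pageWidthPts
  return Math.min(scale, maxScale)
}

// Usage with a PDF.js page object:
// const widthPts = page.getViewport({ scale: 1 }).width
// const viewport = page.getViewport({ scale: scaleForDpi(300, widthPts) })
```

Higher scales increase memory use and recognition time, so it is worth measuring accuracy at a few values before settling on one.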
Optimizing performance
OCR can be computationally intensive. Here are some strategies to improve performance.
Image preprocessing
Preprocessing images can improve OCR accuracy and speed.
function preprocessImage(image) {
  const canvas = document.createElement('canvas')
  const ctx = canvas.getContext('2d')
  // Cap the width to limit processing time; shrinking small text too far
  // hurts accuracy
  const maxWidth = 1000
  const scale = image.width > maxWidth ? maxWidth / image.width : 1
  canvas.width = Math.round(image.width * scale)
  canvas.height = Math.round(image.height * scale)
  // Set the filter before drawing so it is applied to the image
  // (ctx.filter is not supported in every browser)
  ctx.filter = 'grayscale(100%) contrast(200%)'
  ctx.drawImage(image, 0, 0, canvas.width, canvas.height)
  return canvas
}
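Pass the canvas returned by preprocessImage to the doOCR function in place of the original file. Note that preprocessImage expects a decoded image, such as an HTMLImageElement, rather than a File.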
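Where canvas filters are unavailable, the same kind of cleanup can be done directly on pixel data. The following is a minimal sketch of fixed-threshold binarization; the function name binarize and the default threshold of 128 are assumptions, and because Tesseract applies its own binarization internally, this step may help or hurt depending on the image.

```javascript
// Converts RGBA pixel data to pure black and white, in place.
// `data` is laid out as [r, g, b, a, r, g, b, a, ...], the format of
// ctx.getImageData(...).data. The threshold (0-255) is an assumed default.
function binarize(data, threshold = 128) {
  for (let i = 0; i < data.length; i += 4) {
    // Perceptual brightness using Rec. 601 luma weights
    const luma = 0.299 * data[i] + 0.587 * data[i + 1] + 0.114 * data[i + 2]
    const value = luma >= threshold ? 255 : 0
    data[i] = value
    data[i + 1] = value
    data[i + 2] = value
    // data[i + 3] (alpha) is left unchanged
  }
  return data
}

// Usage with a canvas 2D context `ctx`:
// const imageData = ctx.getImageData(0, 0, canvas.width, canvas.height)
// binarize(imageData.data)
// ctx.putImageData(imageData, 0, 0)
```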
Progress tracking
Provide users with feedback during the OCR process. The logger callback receives messages with a status string and a progress value between 0 and 1.
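// Run inside an async function or a module script:
// createWorker() returns a Promise in Tesseract.js v4
const worker = await Tesseract.createWorker({
  logger: (m) => {
    if (m.status === 'recognizing text') {
      const progress = Math.round(m.progress * 100)
      console.log(`OCR Progress: ${progress}%`)
    }
  },
})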
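To show progress in the page rather than the console, the logger message can be mapped to display text first. This is a small sketch under the assumption that messages carry the status and progress fields used above; describeProgress and the label wording are hypothetical.

```javascript
// Hypothetical helper: turns a logger message ({ status, progress }) into
// display text. Missing or malformed fields fall back to safe defaults.
function describeProgress(m) {
  const fraction = Number.isFinite(m.progress) ? m.progress : 0
  // Clamp to [0, 1] before converting to a whole percentage
  const percent = Math.round(Math.min(Math.max(fraction, 0), 1) * 100)
  if (m.status === 'recognizing text') {
    return `Recognizing text: ${percent}%`
  }
  // Earlier phases (loading the core, language data, and so on)
  return m.status ? `Preparing (${m.status})...` : 'Preparing...'
}

// Usage as a logger, assuming an element with id "result" exists:
// logger: (m) => {
//   document.getElementById('result').textContent = describeProgress(m)
// }
```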
Error handling and performance optimization tips
Validate input before starting recognition so that unsupported or oversized files fail fast with a clear message, and release the worker when you are done with it.
Error handling
async function doOCR(file) {
  try {
    // ... setup code ...
    if (!file.type.startsWith('image/')) {
      throw new Error('Please select an image file.')
    }
    if (file.size > 5 * 1024 * 1024) {
      throw new Error('Image size should be less than 5MB.')
    }
    // ... OCR code ...
  } catch (error) {
    console.error('OCR Error:', error)
    throw error
  }
}
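The same checks can be factored into a function that returns a message instead of throwing, which makes them easy to run before any OCR work starts. A minimal sketch, assuming the same 5 MB limit; validateImageFile is a hypothetical name.

```javascript
const MAX_FILE_SIZE_BYTES = 5 * 1024 * 1024 // same 5 MB limit as above

// Returns an error message, or null when the file looks acceptable.
// Works on any object with `type` and `size` properties, such as a File.
function validateImageFile(file) {
  if (!file) {
    return 'No file selected.'
  }
  if (typeof file.type !== 'string' || !file.type.startsWith('image/')) {
    return 'Please select an image file.'
  }
  if (file.size > MAX_FILE_SIZE_BYTES) {
    return 'Image size should be less than 5MB.'
  }
  return null
}

// Usage in the change handler:
// const problem = validateImageFile(file)
// if (problem) { resultElement.textContent = problem; return }
```

Note that the MIME type reported by the browser is derived from the file extension, so a decode failure during recognition still needs to be handled.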
Memory management
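Terminate the worker when you no longer need it to free the memory held by the OCR engine and its language data. This matters most in long-lived pages such as single-page applications.
async function cleanup() {
  // `worker` is the Tesseract.js worker instance created earlier
  await worker.terminate()
  console.log('Worker terminated successfully.')
}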
// Call cleanup when appropriate, e.g., before unloading the page
window.addEventListener('beforeunload', cleanup)
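For one-off jobs, termination can be tied to the job itself so it cannot be forgotten. The sketch below wraps a task in try/finally; withWorker is a hypothetical helper, and the worker factory is passed in as a parameter so the pattern does not depend on a particular Tesseract.js version.

```javascript
// Hypothetical helper: runs `task` with a freshly created worker and always
// terminates the worker afterwards, even when `task` throws.
// `createWorker` is any function that returns (a Promise of) a worker with a
// terminate() method, for example () => Tesseract.createWorker().
async function withWorker(createWorker, task) {
  const worker = await createWorker()
  try {
    return await task(worker)
  } finally {
    await worker.terminate()
  }
}

// Usage inside an async function (v4 API, as in the examples above):
// const text = await withWorker(() => Tesseract.createWorker(), async (worker) => {
//   await worker.loadLanguage('eng')
//   await worker.initialize('eng')
//   const { data } = await worker.recognize(file)
//   return data.text
// })
```

Creating a worker is relatively expensive, so this pattern suits occasional jobs; keep a shared worker alive when many images are processed in sequence.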
Security implications and best practices
Running OCR in the browser reduces the need to transmit sensitive documents over the network. However, consider the following:
- User Consent: Inform users that processing happens locally.
- Resource Usage: Be mindful of the user's device capabilities to prevent performance issues.
- Error Handling: Validate user inputs and handle errors gracefully to prevent crashes.
Conclusion
By integrating Tesseract.js into your web application, you can provide users with powerful OCR capabilities directly in the browser. This approach enhances privacy, reduces server costs, and improves user experience with immediate results. Whether you're working on a document scanner, translation app, or any application requiring text recognition, Tesseract.js offers a robust, open-source solution.
If you need a more advanced OCR solution with server-side processing and support for various document formats, consider checking out Transloadit's Document OCR service.