Building a document OCR tool using GCP OCR and Node.js
Optical Character Recognition (OCR) unlocks text content within images and PDFs, enabling features like searchable documents, automated data entry, and content analysis. In this DevTip, we'll build a document OCR tool using GCP OCR and Node.js to efficiently extract text from images and PDFs in your applications.
Introduction
GCP OCR, powered by the Google Cloud Vision API, provides robust image analysis capabilities, including OCR for text extraction. Integrating this service into your Node.js application allows you to process images and PDFs programmatically and extract text data efficiently.
This guide walks you through setting up the Google Cloud Vision API, authenticating your application, and writing Node.js code to perform OCR on images and PDFs.
Prerequisites
Ensure you have the following:
- A Google Cloud Platform (GCP) account
- Node.js installed on your machine
- Basic knowledge of JavaScript and Node.js
Setting up the Google Cloud Vision API
1. Create a GCP project
- Go to the Google Cloud Console.
- Click on the project dropdown and select New Project.
- Enter a project name and click Create.
2. Enable the Vision API
- In the Cloud Console, navigate to APIs & Services > Library.
- Search for Cloud Vision API.
- Click on Cloud Vision API and then click Enable.
3. Set up authentication
- Navigate to APIs & Services > Credentials.
- Click Create Credentials and select Service Account.
- Enter a service account name and click Create and Continue.
- For Service Account Permissions, select Basic > Viewer or customize permissions as needed.
- Click Done.
- In the Service Accounts list, find your new account, click the Actions menu (three dots), and select Manage Keys.
- Click Add Key > Create New Key.
- Select JSON and click Create to download your private key file. Save this file securely as it contains sensitive information.
Installing the Google Cloud Vision client library
Initialize a new Node.js project and install the necessary library:
mkdir ocr-project
cd ocr-project
npm init -y
npm install @google-cloud/vision
Writing the Node.js code
Create an index.js file in your project directory:
// index.js
const vision = require('@google-cloud/vision');
const path = require('path');
const fs = require('fs');

// Creates a client
const client = new vision.ImageAnnotatorClient({
  keyFilename: path.join(__dirname, 'path/to/your/service-account-file.json'),
});

async function extractTextFromImage(imagePath) {
  // Performs text detection on the local file
  const [result] = await client.textDetection(imagePath);
  const detections = result.textAnnotations || [];
  console.log('Text detections:');
  detections.forEach(text => console.log(text.description));
}

// Example usage
extractTextFromImage('images/sample.jpg').catch(console.error);
Explanation
- Imports: Import the required modules, including the Google Cloud Vision client library.
- Client Initialization: Initialize the ImageAnnotatorClient with your service account key file.
- Function extractTextFromImage: An asynchronous function that takes an image path, performs text detection, and logs the detected text.
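One detail worth knowing about the result shape: textAnnotations[0] holds the entire detected text block, while the remaining entries are the individual words. A small helper (the function name here is our own) makes that explicit and handles the case where nothing was detected:

```javascript
// The Vision API returns the full detected text as the first annotation,
// followed by one annotation per detected word. This helper safely pulls
// out the full text, returning '' when no text was found.
function fullTextFromAnnotations(annotations) {
  if (!annotations || annotations.length === 0) {
    return '';
  }
  return annotations[0].description;
}
```

With this in place, `extractTextFromImage` could log `fullTextFromAnnotations(result.textAnnotations)` instead of iterating over every word.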
Processing PDF files
To extract text from PDFs, use the asynchronous batch annotation feature. This requires setting up a Google Cloud Storage (GCS) bucket.
Setting up Google Cloud Storage
- Navigate to Storage > Browser in the Google Cloud Console.
- Click Create Bucket.
- Enter a unique name for your bucket and select a location.
- Choose default settings and click Create.
Granting permissions
Ensure your service account has access to the GCS bucket:
- Go to Storage > Browser.
- Select your bucket and click Permissions.
- Click Add.
- Enter your service account email.
- Assign the role Storage Admin.
- Click Save.
Updating the code
Modify your index.js to include PDF processing:
async function extractTextFromPDF(pdfPath) {
  const inputConfig = {
    mimeType: 'application/pdf',
    content: fs.readFileSync(pdfPath).toString('base64'),
  };
  const outputConfig = {
    gcsDestination: {
      uri: 'gs://your-bucket-name/output/',
    },
  };
  const features = [{ type: 'DOCUMENT_TEXT_DETECTION' }];
  const request = {
    requests: [
      {
        inputConfig: inputConfig,
        features: features,
        outputConfig: outputConfig,
      },
    ],
  };

  const [operation] = await client.asyncBatchAnnotateFiles(request);
  console.log('Waiting for operation to complete...');
  const [filesResponse] = await operation.promise();
  const destinationUri = filesResponse.responses[0].outputConfig.gcsDestination.uri;
  console.log(`JSON output file stored at: ${destinationUri}`);
}

// Example usage
extractTextFromPDF('documents/sample.pdf').catch(console.error);
Explanation
- Input Configuration: Specify the PDF file and its MIME type.
- Output Configuration: Set the GCS bucket destination for the OCR results.
- Features: Use DOCUMENT_TEXT_DETECTION for PDFs.
- Request: Assemble the OCR request with input and output configurations.
- Operation: Start the asynchronous batch annotation and wait for completion.
- Results: After processing, access the output stored in GCS.
Note: You'll need to download and parse the JSON output files from your GCS bucket to access the extracted text.
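As a sketch of that parsing step (downloading the file itself, for example with @google-cloud/storage, is omitted here), each output file follows the AnnotateFileResponse JSON shape, with one entry per page. A minimal parser might look like this:

```javascript
// Parses one Vision API output JSON file, as written to GCS by
// asyncBatchAnnotateFiles: { responses: [{ fullTextAnnotation: { text } }] },
// and joins the text of all pages into a single string.
function textFromOutputJson(jsonString) {
  const output = JSON.parse(jsonString);
  return (output.responses || [])
    .map(r => (r.fullTextAnnotation ? r.fullTextAnnotation.text : ''))
    .join('\n');
}
```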
Integrating OCR functionality into an application
You can integrate OCR functionality into an Express.js application to handle file uploads and text extraction.
Install dependencies
npm install express multer
Update index.js
const express = require('express');
const multer = require('multer');

const upload = multer({ dest: 'uploads/' });
const app = express();

app.post('/upload', upload.single('file'), async (req, res) => {
  try {
    const filePath = req.file.path;
    const fileType = req.file.mimetype;
    let extractedText = '';

    if (fileType === 'application/pdf') {
      await extractTextFromPDF(filePath);
      // Assume a function to read the output JSON from GCS
      extractedText = await readTextFromGCS();
    } else {
      const [result] = await client.textDetection(filePath);
      const detections = result.textAnnotations || [];
      extractedText = detections.map(d => d.description).join(' ');
    }

    res.json({ text: extractedText });
  } catch (error) {
    console.error('Error during file upload:', error);
    res.status(500).send('An error occurred during processing.');
  }
});

app.listen(3000, () => {
  console.log('Server started on port 3000');
});
Explanation
- Multer Middleware: Use Multer to handle file uploads.
- Upload Endpoint: The /upload route accepts a file upload and performs text detection based on file type.
- File Type Handling: Check the MIME type to determine whether the file is an image or a PDF.
- PDF Processing: For PDFs, call extractTextFromPDF and read the results from GCS.
- Response: Send the extracted text as a JSON response.
Best practices
- Security: Keep your service account key file secure and avoid committing it to version control systems.
- Error Handling: Implement comprehensive error handling to manage API errors, file I/O errors, and other exceptions.
- Performance: For processing large volumes of files, implement batching and concurrency controls.
- Resource Management: Clean up uploaded files and temporary data after processing to conserve disk space.
Conclusion
Integrating GCP OCR into your Node.js applications enables powerful OCR capabilities. By automating text extraction from images and PDFs, you can enhance your application's functionality, streamline workflows, and provide more value to your users.
Transloadit also offers a Document OCR robot as part of our Artificial Intelligence service for seamless and scalable OCR processing.