Tesseract OCR with Java with Examples

In this article, we will learn how to work with Tesseract OCR in Java using Tess4J, a Java wrapper for the Tesseract API.

What is Tesseract OCR?
Tesseract OCR is an optical character recognition engine developed at HP Laboratories beginning in 1985 and open-sourced in 2005. Since 2006, its development has been sponsored by Google. Tesseract supports Unicode (UTF-8) and can recognize more than 100 languages “out of the box”, so it can also be used to build scanning software for different languages. Tesseract 4 added a new neural-network (LSTM) based OCR engine that focuses on line recognition, while still supporting the legacy Tesseract OCR engine, which works by recognizing character patterns.

How does OCR work?

Generally OCR works as follows:

  1. Pre-process the image data, for example: convert to grayscale, smooth, de-skew, filter.
  2. Detect lines, words and characters.
  3. Produce a ranked list of candidate characters based on the trained data set. (In Tess4J, the setDatapath() method sets the path to the trained data.)
  4. Post-process the recognized characters: choose the best characters based on the confidence from the previous step and on language data. Language data includes a dictionary, grammar rules, etc.
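
Step 1 above can be illustrated with plain java.awt classes. The following is a minimal sketch, not Tesseract's own pre-processing: it converts each pixel to a luminance value and applies a fixed threshold to separate dark text from a light background. The class name PreprocessSketch, the method names, and the threshold of 128 are assumptions chosen for illustration.

```java
import java.awt.image.BufferedImage;

/** Sketch of OCR pre-processing: grayscale conversion plus fixed-threshold binarization. */
final class PreprocessSketch {

    /** Converts a packed RGB pixel to a 0-255 luminance using the Rec. 601 weights. */
    static int luminance(int rgb) {
        int r = (rgb >> 16) & 0xFF;
        int g = (rgb >> 8) & 0xFF;
        int b = rgb & 0xFF;
        return (int) Math.round(0.299 * r + 0.587 * g + 0.114 * b);
    }

    /** Returns a black-and-white copy: pixels darker than the threshold become black. */
    static BufferedImage binarize(BufferedImage src, int threshold) {
        BufferedImage out = new BufferedImage(
                src.getWidth(), src.getHeight(), BufferedImage.TYPE_BYTE_BINARY);
        for (int y = 0; y < src.getHeight(); y++) {
            for (int x = 0; x < src.getWidth(); x++) {
                int lum = luminance(src.getRGB(x, y));
                out.setRGB(x, y, lum < threshold ? 0xFF000000 : 0xFFFFFFFF);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Two-pixel test image: one dark "ink" pixel and one light "paper" pixel.
        BufferedImage img = new BufferedImage(2, 1, BufferedImage.TYPE_INT_RGB);
        img.setRGB(0, 0, 0x202020);
        img.setRGB(1, 0, 0xE0E0E0);
        BufferedImage bw = binarize(img, 128);
        System.out.printf("%08X %08X%n", bw.getRGB(0, 0), bw.getRGB(1, 0));
    }
}
```

A fixed threshold is the simplest choice; unevenly lit scans usually need an adaptive method instead.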

Advantages

OCR has many advantages, most notably:

  • It increases the efficiency and effectiveness of office work.
  • The ability to instantly search through content is immensely useful, especially in an office setting that has to deal with high-volume scanning or a high document inflow.
  • OCR is quick, so the document’s content remains intact while time is saved as well.
  • Productivity increases, since employees no longer have to waste time on manual labour and can work more quickly and efficiently.

Disadvantages

  • OCR is limited to recognizing the languages for which trained data is available.
  • A lot of effort is required to create trained data for different languages and to integrate it.
  • Extra work is also needed on image processing, which is the part that matters most for OCR performance.
  • Even after all this work, no OCR engine can offer 100% accuracy; after OCR, unrecognized characters still have to be determined from their neighbours using machine learning methods, or corrected manually.
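
As a rough illustration of the correction step in the last point, a recognized word can be compared against a list of words expected in the document and replaced by the closest one. This sketch uses plain edit distance (Levenshtein) rather than a machine learning method; the class OcrPostCorrection, its method names, and any vocabulary passed to it are hypothetical.

```java
import java.util.List;

/** Sketch of dictionary-based OCR post-correction using Levenshtein edit distance. */
final class OcrPostCorrection {

    /** Number of single-character insertions, deletions or substitutions turning a into b. */
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /** Returns the closest vocabulary word within maxDistance edits, or the word unchanged. */
    static String correct(String word, List<String> vocabulary, int maxDistance) {
        String best = word;
        int bestDistance = maxDistance + 1;
        for (String candidate : vocabulary) {
            int distance = editDistance(word, candidate);
            if (distance < bestDistance) {
                best = candidate;
                bestDistance = distance;
            }
        }
        return best;
    }
}
```

For example, with a vocabulary containing "invoice", the misread word "lnvoice" is one substitution away and would be corrected, while a word with no close match is left as it is.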

How to use Tesseract OCR

  1. Download the Tess4J API from the link.
  2. Extract the files from the downloaded archive.
  3. Open your IDE and create a new project.
  4. Link the jar file with your project. Refer to this link.
  5. To find the jar file, navigate to the path “..\Tess4J-3.4.8-src\Tess4J\dist”.

Now that the jar is linked to your project, you are ready to use the Tesseract engine.

Performing OCR on clear images

Now that you have linked the jar file, we can get started with the coding part. The following code reads an image file, performs OCR on it, and displays the text on the console.

Include the latest version of Tess4J in your pom file:

<!-- https://mvnrepository.com/artifact/net.sourceforge.tess4j/tess4j -->
<dependency>
    <groupId>net.sourceforge.tess4j</groupId>
    <artifactId>tess4j</artifactId>
    <version>4.4.1</version>
</dependency>
Then run OCR on the image:

import java.io.File;

import net.sourceforge.tess4j.Tesseract;
import net.sourceforge.tess4j.TesseractException;

public class Test {
    public static void main(String[] args)
    {
        Tesseract tesseract = new Tesseract();
        try {
            // the path of your tessdata folder
            // inside the extracted file
            tesseract.setDatapath("D:/Tess4J/tessdata");

            // the path of your image file
            String text
                = tesseract.doOCR(new File("image.jpg"));

            // print the recognized text
            System.out.print(text);
        }
        catch (TesseractException e) {
            e.printStackTrace();
        }
    }
}

Input:

Output:

05221859
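
If the program above fails to start OCR, one thing worth checking is the data path. The sketch below uses only the standard library to check that a folder contains the trained data file for a language before the path is handed to setDatapath(). It assumes the usual <language>.traineddata file naming (for example eng.traineddata for English); the class and method names are hypothetical.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

/** Sketch: verify that a tessdata folder holds the trained data for a given language. */
final class TessdataCheck {

    /** True if tessdataDir contains a regular file named <language>.traineddata. */
    static boolean hasLanguageData(Path tessdataDir, String language) {
        return Files.isRegularFile(tessdataDir.resolve(language + ".traineddata"));
    }

    public static void main(String[] args) {
        // Example path taken from the article; pass your own folder as the first argument.
        Path dir = Paths.get(args.length > 0 ? args[0] : "D:/Tess4J/tessdata");
        if (hasLanguageData(dir, "eng")) {
            System.out.println("Found eng.traineddata in " + dir);
        } else {
            System.out.println("No eng.traineddata in " + dir
                    + "; check the path passed to setDatapath()");
        }
    }
}
```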

Performing OCR on unclear images

Note that the image selected above is very clear and already in grayscale, but this is not the case most of the time. Usually we get a noisy image and thus a very noisy output. To deal with this, we need to process the image first, a step called image processing.

Tesseract works best when the foreground text is cleanly segmented from the background. In practice, it can be extremely challenging to guarantee good segmentation. There are a variety of reasons you might not get good-quality output from Tesseract, for example when the image has noise in the background. Noise removal is one part of image processing, so we need to know how an image should be processed.
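
One common way to remove isolated specks of noise is a median filter. The following is a minimal sketch using only java.awt, not something Tess4J provides: each output pixel becomes the median gray value of its 3 x 3 neighbourhood, so a single stray dark pixel on a light background disappears. The class and method names are assumptions for illustration.

```java
import java.awt.image.BufferedImage;
import java.util.Arrays;

/** Sketch of noise removal with a 3x3 median filter on gray values. */
final class MedianFilterSketch {

    /** Average of the red, green and blue channels of a packed RGB pixel. */
    static int gray(int rgb) {
        int r = (rgb >> 16) & 0xFF;
        int g = (rgb >> 8) & 0xFF;
        int b = rgb & 0xFF;
        return (r + g + b) / 3;
    }

    /** Returns a gray copy in which each pixel is the median of its 3x3 neighbourhood. */
    static BufferedImage medianFilter(BufferedImage src) {
        int w = src.getWidth();
        int h = src.getHeight();
        BufferedImage out = new BufferedImage(w, h, BufferedImage.TYPE_INT_RGB);
        int[] window = new int[9];
        for (int y = 0; y < h; y++) {
            for (int x = 0; x < w; x++) {
                int n = 0;
                for (int dy = -1; dy <= 1; dy++) {
                    for (int dx = -1; dx <= 1; dx++) {
                        // Clamp coordinates so border pixels reuse their nearest neighbours.
                        int sx = Math.min(w - 1, Math.max(0, x + dx));
                        int sy = Math.min(h - 1, Math.max(0, y + dy));
                        window[n++] = gray(src.getRGB(sx, sy));
                    }
                }
                Arrays.sort(window);
                int m = window[4]; // middle of 9 sorted values
                out.setRGB(x, y, (m << 16) | (m << 8) | m);
            }
        }
        return out;
    }
}
```

A median filter also thins very fine strokes, so it is best applied before zooming out and only when the noise is clearly speckle-like.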

You can refer to this article for a detailed understanding of how you can improve the accuracy. To implement the same in Java, we will write a small heuristic that reads the RGB content of the image, rescales its brightness and contrast accordingly, and also zooms the image.
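
The zooming step can also be done without distorting the text. The sketch below is one possible variant, assuming a uniform zoom factor is acceptable: it scales width and height by the same factor, so the aspect ratio is preserved, and asks for bicubic interpolation. The class ZoomSketch and the method zoom are hypothetical names.

```java
import java.awt.Graphics2D;
import java.awt.RenderingHints;
import java.awt.image.BufferedImage;

/** Sketch: enlarge an image by a uniform factor, keeping its aspect ratio. */
final class ZoomSketch {

    /** Returns a copy of src scaled by factor in both dimensions. */
    static BufferedImage zoom(BufferedImage src, double factor) {
        int w = Math.max(1, (int) Math.round(src.getWidth() * factor));
        int h = Math.max(1, (int) Math.round(src.getHeight() * factor));
        BufferedImage out = new BufferedImage(w, h, BufferedImage.TYPE_INT_RGB);
        Graphics2D g = out.createGraphics();
        // Bicubic interpolation gives smoother edges than the default when enlarging.
        g.setRenderingHint(RenderingHints.KEY_INTERPOLATION,
                RenderingHints.VALUE_INTERPOLATION_BICUBIC);
        g.drawImage(src, 0, 0, w, h, null);
        g.dispose();
        return out;
    }
}
```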

The example below shows how the image can be rescaled based on its RGB content: if an image is very dark, it is made brighter and clearer, and if it is whitish, it is given a slightly darker contrast so that the text is visible.

import java.awt.Graphics2D; 
import net.sourceforge.tess4j.*; 
import java.awt.Image; 
import java.awt.image.*; 
import java.io.*; 
  
import javax.imageio.ImageIO; 
  
public class ScanedImage { 
  
    public static void
    processImg(BufferedImage ipimage, 
               float scaleFactor, 
               float offset) 
        throws IOException, TesseractException 
    { 
        // Making an empty image buffer 
        // to store image later 
        // ipimage is an image buffer 
        // of input image 
        BufferedImage opimage 
            = new BufferedImage(1050, 
                                1024, 
                                ipimage.getType()); 
  
        // creating a 2D platform 
        // on the buffer image 
        // for drawing the new image 
        Graphics2D graphic 
            = opimage.createGraphics(); 
  
        // drawing new image starting from 0 0 
        // of size 1050 x 1024 (zoomed images) 
        // null means no ImageObserver is used
        graphic.drawImage(ipimage, 0, 0, 
                          1050, 1024, null); 
        graphic.dispose(); 
  
        // RescaleOp multiplies every colour sample
        // by scaleFactor and adds offset
        // (a brightness/contrast adjustment)
        RescaleOp rescale
            = new RescaleOp(scaleFactor, offset, null);

        // performing scaling
        // and writing on a .png file
        BufferedImage fopimage
            = rescale.filter(opimage, null);
        ImageIO
            .write(fopimage,
                   "png",
                   new File("D:\\Tess4J\\Testing and learning\\output.png"));
  
        // Instantiating the Tesseract class 
        // which is used to perform OCR 
        Tesseract it = new Tesseract(); 
  
        it.setDatapath("D:\\Program Files\\Workspace\\Tess4J"); 
  
        // doing OCR on the image 
        // and storing result in string str 
        String str = it.doOCR(fopimage); 
        System.out.println(str); 
    } 
  
    public static void main(String args[]) throws Exception 
    { 
        File f 
            = new File( 
                "D:\\Tess4J\\Testing and learning\\Final Learning Results\\input.jpg"); 
  
        BufferedImage ipimage = ImageIO.read(f); 
  
        // getting the packed RGB value of the pixel
        // at the centre of the image
        double d
            = ipimage
                  .getRGB(ipimage.getWidth() / 2,
                          ipimage.getHeight() / 2);
  
        // comparing the values 
        // and setting new scaling values 
        // that are later on used by RescaleOP 
        if (d >= -1.4211511E7 && d < -7254228) { 
            processImg(ipimage, 3f, -10f); 
        } 
        else if (d >= -7254228 && d < -2171170) { 
            processImg(ipimage, 1.455f, -47f); 
        } 
        else if (d >= -2171170 && d < -1907998) { 
            processImg(ipimage, 1.35f, -10f); 
        } 
        else if (d >= -1907998 && d < -257) { 
            processImg(ipimage, 1.19f, 0.5f); 
        } 
        else if (d >= -257 && d < -1) { 
            processImg(ipimage, 1f, 0.5f); 
        } 
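        else if (d >= -1 && d < 2) {
            processImg(ipimage, 1f, 0.35f);
        }
        else {
            // assumed fallback: pixel values outside all
            // ranges above (e.g. near-black) are passed
            // through without rescaling
            processImg(ipimage, 1f, 0f);
        }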
    } 
} 
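
To see what the scaleFactor and offset arguments passed to processImg do, it helps to apply RescaleOp to a single pixel. RescaleOp computes sample * scaleFactor + offset for each colour sample and clamps the result to the valid range. The following sketch checks this on a 1 x 1 image; the class RescaleDemo and the method rescaledRed are hypothetical names used only for this illustration.

```java
import java.awt.image.BufferedImage;
import java.awt.image.RescaleOp;

/** Sketch: apply RescaleOp to one pixel to show the scale-and-offset arithmetic. */
final class RescaleDemo {

    /** Returns the red sample after RescaleOp(scaleFactor, offset) is applied to it. */
    static int rescaledRed(int red, float scaleFactor, float offset) {
        BufferedImage img = new BufferedImage(1, 1, BufferedImage.TYPE_INT_RGB);
        img.setRGB(0, 0, red << 16);
        BufferedImage out = new RescaleOp(scaleFactor, offset, null).filter(img, null);
        return (out.getRGB(0, 0) >> 16) & 0xFF;
    }

    public static void main(String[] args) {
        // A dark sample is brightened; a bright sample saturates at the maximum.
        System.out.println(rescaledRed(50, 2f, -10f));
        System.out.println(rescaledRed(200, 3f, -10f));
    }
}
```

A scale factor above 1 stretches contrast, while the offset shifts overall brightness, which is why the darkest range in main uses the largest factor.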

Input:

input.png

Output:

output.png
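
The large negative numbers compared in main are packed ARGB pixel values as returned by getRGB(). As a sketch of how to read them, the code below unpacks such a value into its alpha, red, green and blue channels. For opaque pixels the numeric order of the packed value is decided first by the red channel, then green, then blue, so the ranges in main are mainly ranges of red intensity. The class ArgbDecode and its method are hypothetical helper names.

```java
/** Sketch: unpack a packed ARGB int, as returned by BufferedImage.getRGB(), into channels. */
final class ArgbDecode {

    /** Returns {alpha, red, green, blue}, each in the range 0-255. */
    static int[] channels(int argb) {
        return new int[] {
            (argb >>> 24) & 0xFF,
            (argb >> 16) & 0xFF,
            (argb >> 8) & 0xFF,
            argb & 0xFF
        };
    }

    public static void main(String[] args) {
        // The threshold values used in the article's main method.
        int[] thresholds = {-14211511, -7254228, -2171170, -1907998, -257, -1};
        for (int t : thresholds) {
            int[] c = channels(t);
            System.out.printf("%d -> A=%d R=%d G=%d B=%d%n", t, c[0], c[1], c[2], c[3]);
        }
    }
}
```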

Open Source Tools You Can Use To Improve OCR Accuracy

So what are your options when you want to programmatically increase the quality of your source images? At Docparser, we recommend the following open source tools for image preprocessing to improve OCR accuracy:

  • Leptonica – A general purpose image processing and image analysis library and command line tool. Leptonica is also the library used by Tesseract OCR to binarize images.
  • OpenCV – An open source image processing library with bindings for C++, C, Python and Java. OpenCV was designed for computational efficiency and with a strong focus on real-time applications.
  • ImageMagick – A general purpose image processing library and command line tool. A long list of command line options is available for any kind of image processing job.
  • unpaper – The name says it all. Unpaper is a post-processing tool specifically built for eliminating all “paper”-related issues from a scanned document. If you want decent results without any tweaking, look no further and use Unpaper.
  • Gimp – A powerful open source image editor which you can use to manually improve the quality of individual images.
