#27 - OCR, or image to text | Areishi | あれ石

Optical Character Recognition, or OCR, converts printed characters from scans or photos into digital text, so that you can save it as a .docx file, or just paste it into Google Translate. But with Asian languages it's a bit different: because there are thousands of characters, there's a really big chance of a mistake.

Ultimately, there's no way to get rid of mistakes completely, but there is a way to reduce them.

You'll need:

gImageReader (Click on the green box)
Tesseract (Select "Windows Installer" and "Japanese language data" for 3.02)

Part 0

Before we start, I want you to know that you can use this method for many languages (you can find the list by clicking on the "Tesseract" link), not only for Japanese. A good alternative is ABBYY FineReader, but it doesn't support Asian languages.

Part 1

Install both programs
Launch gImageReader from your "Start" menu
Enter the directory where you installed Tesseract (it's usually either C:\Program Files\Tesseract-OCR or C:\Program Files (x86)\Tesseract-OCR)
Now, in the "Directory, containing Tesseract languages" box, enter the same address, but add \tessdata at the end.
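
If you're not sure which of the two folders is the right one, a short script can check for you. This is only a sketch: the helper name `find_tessdata` is made up for illustration, and it assumes the Japanese language data is stored in a file called `jpn.traineddata` (`jpn` is Tesseract's code for Japanese).

```python
from pathlib import Path

# The two usual install locations mentioned in the steps above.
CANDIDATES = [
    r"C:\Program Files\Tesseract-OCR",
    r"C:\Program Files (x86)\Tesseract-OCR",
]


def find_tessdata(candidates=CANDIDATES, lang="jpn"):
    """Return (install_dir, tessdata_dir) for the first candidate whose
    tessdata folder contains <lang>.traineddata, or None if none does."""
    for install_dir in candidates:
        tessdata = Path(install_dir) / "tessdata"
        if (tessdata / f"{lang}.traineddata").is_file():
            return Path(install_dir), tessdata
    return None


if __name__ == "__main__":
    found = find_tessdata()
    if found is None:
        print("Japanese language data not found in the usual folders")
    else:
        print("Tesseract directory:", found[0])
        print("Language directory: ", found[1])
```

The first path it prints goes into the Tesseract directory box, the second one into the languages box.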

Part 2

A test

Click "Open" and select a file
Now change the language from English to Japanese/日本語 and select ja_JP
Hit the "Recognize all" button, or just select the area you need and click "Recognize selection"
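
gImageReader is a front end for Tesseract, so the same recognition can also be run without the GUI. Here is a minimal sketch, assuming `tesseract` is on your PATH and the Japanese data is installed under the language code `jpn`. The names `build_command`, `recognize`, `page.png` and `page_text` are placeholders of mine, not part of either program.

```python
import subprocess


def build_command(image, output_base, lang="jpn"):
    # Command shape: tesseract <image> <output base> -l <language>
    # Tesseract appends ".txt" to the output base by itself.
    return ["tesseract", image, output_base, "-l", lang]


def recognize(image, output_base, lang="jpn"):
    # Run Tesseract on the whole image and return the recognized text.
    subprocess.run(build_command(image, output_base, lang), check=True)
    with open(output_base + ".txt", encoding="utf-8") as f:
        return f.read()


# Example call (needs Tesseract installed and a real image file):
# text = recognize("page.png", "page_text")
```

This recognizes the whole image, like "Recognize all". For a selection you would have to crop the image first.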

Example

It seems to have detected all the selected characters correctly, except this one:

And that's OK. Just zoom in on the image first, then select the character manually. I have actually never seen software that can work with furigana.

Then click on "Save as" and that's it.

Have a good day :)

Areishi | あれ石 | 알에잇히

Sunday, November 24, 2013
