Sunday, November 24, 2013

#27 - OCR, or image to text

Optical Character Recognition (OCR) converts printed text from scans or photos into digital text, so that you can save it as a .docx file or just paste it into Google Translate. With Asian languages it's a bit trickier: with thousands of possible characters, there's a much bigger chance of a mistake.

In the end, there's no way to get rid of mistakes completely, but there is a way to reduce them.

You'll need:

  • gImageReader (Click on the green box)
  • Tesseract (Select "Windows Installer" and "Japanese language data" for 3.02)
Part 0
Before we start, I want you to know that this method works for many languages, not only Japanese (you can find the list by clicking on the "Tesseract" link). A good alternative is ABBYY FineReader, but as far as I know it doesn't support Asian languages.


Part 1
  1. Install both programs
  2. Launch gImageReader from your "Start" menu
  3. Enter the path of the directory where you installed Tesseract (it's usually either C:\Program Files\Tesseract-OCR or C:\Program Files (x86)\Tesseract-OCR)
  4. In the "Directory, containing Tesseract languages" box, enter the same path, but add \tessdata at the end.
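
If gImageReader doesn't offer Japanese later on, it may help to check that the language data really ended up in the tessdata folder. Here is a minimal Python sketch of that check. It is not part of gImageReader or Tesseract; the function names are made up for illustration, and it assumes the Japanese data file is called jpn.traineddata ("jpn" is the code Tesseract uses for Japanese).

```python
from pathlib import Path


def tessdata_dir(install_dir):
    """Return the language-data folder for a Tesseract install directory."""
    return Path(install_dir) / "tessdata"


def has_language(install_dir, lang="jpn"):
    """True if <install_dir>/tessdata/<lang>.traineddata exists.

    Assumption: language data is stored as '<code>.traineddata',
    e.g. 'jpn.traineddata' for Japanese.
    """
    return (tessdata_dir(install_dir) / f"{lang}.traineddata").is_file()


if __name__ == "__main__":
    # The two usual install locations mentioned in step 3.
    for candidate in (r"C:\Program Files\Tesseract-OCR",
                      r"C:\Program Files (x86)\Tesseract-OCR"):
        print(candidate, "->", has_language(candidate, "jpn"))
```

If both lines print False, the Japanese language data is probably missing or was installed somewhere else.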
Part 2
A test
  1. Click "Open" and select a file
  2. Now change the language from English to Japanese/日本語 and select ja_JP
  3. Hit the "Recognize all" button, or just select the area you need and click "Recognize selection"
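
The same test can also be run without the GUI, since gImageReader is only a front end for Tesseract. Below is a rough Python sketch that calls the tesseract command-line program. It assumes tesseract is on your PATH and that the Japanese data is installed; the function names and the file names scan.png and result are placeholders I made up, not something from either program.

```python
import subprocess
from pathlib import Path


def build_tesseract_command(image_path, output_base, lang="jpn"):
    """Build the command line: tesseract <image> <outputbase> -l <lang>."""
    return ["tesseract", str(image_path), str(output_base), "-l", lang]


def recognize(image_path, output_base, lang="jpn"):
    """Run Tesseract and return the recognized text.

    Tesseract writes its result to '<output_base>.txt'.
    Assumes the 'tesseract' executable can be found on PATH.
    """
    subprocess.run(build_tesseract_command(image_path, output_base, lang),
                   check=True)
    return Path(str(output_base) + ".txt").read_text(encoding="utf-8")


# Hypothetical usage (file names are placeholders):
# text = recognize("scan.png", "result")
# print(text)
```

This recognizes the whole image, like "Recognize all". To get the effect of "Recognize selection", you would have to crop the image to the area you need first.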
Example

It seems to have detected all the selected characters correctly, except this one:


And that's OK. Just zoom in on the image first, then select that character manually. I have actually never seen software that can work with furigana.

Then click "Save as", and that's it.

Have a good day :)

