I am using Tesseract to do OCR on some screenshots. The characters in the screenshots are in raster fonts, but Tesseract requires a TrueType font file for training. I can find many TrueType font files in the Windows/Fonts folder. I am wondering if there's one for raster fonts?
"Raster fonts" aren't a real thing, though: OpenType fonts (of which TrueType is one of the two internal outline encodings) are true fonts, conforming to a highly detailed, authoritative specification, whereas raster fonts pretty much amount to "there is no single spec, you can invent whatever you want, as long as your program knows how to unpack the thing you made". There are a whole bunch of different ways to define a raster/bitmap font, and all of them are basically of the form "bitmap image + header that says which letter maps to which x/y/w/h rectangle in the image".
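To make the "bitmap image + header" idea concrete, here is a minimal sketch in Python. The atlas layout, the `GlyphRect` type, and the `glyph` helper are all invented for illustration; they do not correspond to any real bitmap font format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GlyphRect:
    """Hypothetical header entry: where one letter lives in the atlas."""
    x: int
    y: int
    w: int
    h: int


# Made-up atlas "image": '#' is ink, '.' is background.
# Two 3x3 glyphs sit side by side, separated by one blank column.
ATLAS = [
    "#.#.###",
    "#.#.#.#",
    ".#..###",
]

# Made-up header: letter -> rectangle inside the atlas.
HEADER = {
    "V": GlyphRect(x=0, y=0, w=3, h=3),
    "O": GlyphRect(x=4, y=0, w=3, h=3),
}


def glyph(ch: str) -> list[str]:
    """Cut the pixel rows for one letter out of the atlas."""
    r = HEADER[ch]
    return [row[r.x:r.x + r.w] for row in ATLAS[r.y:r.y + r.h]]


if __name__ == "__main__":
    for letter in "VO":
        print(letter)
        print("\n".join(glyph(letter)))
```

Note that nothing in this structure says how the glyph should look at any other size; the pixels are all there is.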
OCR won't want to work with them. The simplest reason is "there is no official bitmap font spec", but even if there were one, bitmap fonts cannot be scaled: if you're trying to match a bitmap font to an OCR result, then the entire page being even 1 pixel off in width or height with respect to what your bitmap font needs can lead to no text being matchable at all. Bitmap fonts are encoded at fixed font sizes (usually only one, sometimes more than one, but still rigidly fixed), so if the scanned document isn't exactly the right size, none of the pixels will perfectly overlap. That leads to ridiculous things like a V matching the O template just as reliably as the V template (and vice versa), because a tiny vertical pixel shift can make V and O overlap with the same number of error pixels.
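The pixel-shift problem can be illustrated with a toy comparison. This is only a sketch of naive pixel-exact template matching, assuming made-up 5x5 glyphs; it is not how Tesseract works internally, and the names `mismatches` and `shift_down` are hypothetical helpers.

```python
# Made-up 5x5 bitmap templates: '#' is ink, '.' is background.
V = [
    "#...#",
    "#...#",
    ".#.#.",
    ".#.#.",
    "..#..",
]
O = [
    ".###.",
    "#...#",
    "#...#",
    "#...#",
    ".###.",
]


def mismatches(a: list[str], b: list[str]) -> int:
    """Count pixels that differ between two equally sized bitmaps."""
    return sum(pa != pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))


def shift_down(bitmap: list[str]) -> list[str]:
    """Simulate a scan that is one pixel too low: add a blank top row, drop the bottom row."""
    blank = "." * len(bitmap[0])
    return [blank] + bitmap[:-1]


if __name__ == "__main__":
    scanned = shift_down(V)  # a "V" that is one pixel off vertically
    print("aligned V vs V template:", mismatches(V, V))
    print("aligned V vs O template:", mismatches(V, O))
    print("shifted V vs V template:", mismatches(scanned, V))
    print("shifted V vs O template:", mismatches(scanned, O))
```

With these toy glyphs, the perfectly aligned V has no differing pixels against its own template, but after a one-pixel shift it differs from the V template in 9 pixels and from the O template in only 8, so a naive matcher would pick the wrong letter.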
OpenType fonts, on the other hand, use vector outlines, and can be scaled to best-match with a variety of extremely successful algorithms. Unless the document you scanned in is "drastically too small", vector transforms will yield 90-100% matching without any problems.
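As a rough sketch of why outlines scale where bitmaps cannot: outline points are stored in abstract font units, and a renderer multiplies them by the target pixel size divided by the font's units-per-em value. The outline points and the `scale_outline` helper below are invented for illustration; a real font also has curves and hinting, which this ignores.

```python
Point = tuple[float, float]


def scale_outline(points: list[Point], units_per_em: int, ppem: int) -> list[Point]:
    """Map outline points from font units to pixels for a given pixels-per-em size."""
    factor = ppem / units_per_em
    return [(x * factor, y * factor) for x, y in points]


if __name__ == "__main__":
    # Hypothetical straight-line "V" outline in a 1000 units-per-em design space.
    v_outline = [(0, 1000), (500, 0), (1000, 1000)]
    for size in (12, 16, 40):
        print(size, scale_outline(v_outline, units_per_em=1000, ppem=size))
```

The same three points produce a usable shape at any requested size, which is what lets a matcher fit the font to the scan instead of requiring the scan to fit the font.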
What you want to do instead is hit up something like MyFonts.com's WhatTheFont and drop in a crop of your scanned document with a sentence, maybe two, then have it tell you which font is the closest match for it, and then simply use that font for your OCR training. Super effective!