Problem training Tesseract-OCR 4: empty shape table

Adrien

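I am trying to train Tesseract 4 on a specific kind of image (to read the seven-segment display of a multimeter).

Note that I am aware of the trained data provided by Arthur Augusto at https://github.com/arturaugusto/display_ocr, but I need to train Tesseract on my own data.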

To train Tesseract, I followed several tutorials (such as https://robipritrznik.medium.com/recognizing-vehicle-license-plates-on-images-using-tesseract-4-ocr-with-custom-trained-models-4ba9861595e7 and https://pretius.com/how-to-prepare-training-files-for-tesseract-ocr-and-improve-characters-recognition/).

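But I always run into a problem when I run the shapeclustering command on my own data.

(With the sample data from https://github.com/tesseract-ocr/tesseract/issues/1174#issuecomment-338448972, everything works fine.)

Indeed, when I run the shapeclustering command on my data, I get the following output (screenshot). My shape_table then ends up empty and the training is ineffective...

With the sample data it works, and the shape_table is populated properly.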

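I suspect the problem is in how I generate the box files. Here is my process for creating them:

I use the

tesseract imageFileName.tif imageFileName batch.nochop makebox

command to generate the box file, then edit it with jTessBoxEditor.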

So I cannot see what is wrong with my .box/.tif data pairs.
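One way to rule out the box files themselves is to check them mechanically. The following is only a minimal sketch, assuming the usual Tesseract box layout of one symbol per line (`<symbol> <left> <bottom> <right> <top> <page>`, origin at the bottom-left of the image). The function name `check_box_lines` is hypothetical, and the image width and height must be supplied by the caller.

```python
from typing import List, Tuple


def check_box_lines(lines: List[str], width: int, height: int) -> List[Tuple[int, str]]:
    """Return (line_number, problem) pairs for suspicious box-file entries.

    Assumed layout, one symbol per line:
    <symbol> <left> <bottom> <right> <top> <page>
    """
    problems = []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # blank lines are skipped
        fields = line.split()
        if len(fields) != 6:
            problems.append((number, "expected 6 fields"))
            continue
        try:
            left, bottom, right, top, _page = (int(f) for f in fields[1:])
        except ValueError:
            problems.append((number, "non-integer coordinate"))
            continue
        if right <= left or top <= bottom:
            problems.append((number, "empty or inverted box"))
        elif left < 0 or bottom < 0 or right > width or top > height:
            problems.append((number, "box outside image"))
    return problems


# Illustrative data only: one valid box, one zero-width box, one malformed line.
sample = ["8 10 5 40 60 0", "1 45 5 45 60 0", "broken line"]
print(check_box_lines(sample, width=200, height=80))
```

For a real file, the lines could be read with `open("sev7.exp0.box", encoding="utf-8").read().splitlines()`. A clean result only shows that the boxes are well formed; it says nothing about whether Tesseract segments the image the same way during training.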

Have a nice day, and thank you for your help.
Adrien

Here is my complete batch script, which I use for training after generating and editing the box file.

set name=sev7.exp0
set shortName=sev7

echo Run Tesseract for Training.. 
tesseract.exe %name%.tif %name% nobatch box.train 
 
echo Compute the Character Set.. 
unicharset_extractor.exe %name%.box 

shapeclustering -F font_properties -U unicharset -O %shortName%.unicharset %name%.tr
mftraining -F font_properties -U unicharset -O %shortName%.unicharset %name%.tr
echo Clustering.. 
cntraining.exe %name%.tr
echo Rename Files.. 
rename normproto %shortName%.normproto 
rename inttemp %shortName%.inttemp 
rename pffmtable %shortName%.pffmtable 
rename shapetable %shortName%.shapetable
echo Create Tessdata.. 
combine_tessdata.exe %shortName%.
echo. & pause
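
Because the script above keeps going even when an intermediate step produces nothing useful, a coarse check between steps can help locate where things go wrong. This is a sketch under assumptions, not part of the original script: the helper name `empty_or_missing` is hypothetical, and the file names are those the script writes before the rename step. A zero-byte test is only a rough signal, since a shapetable that holds no shapes may still be a non-empty file.

```python
import os
from typing import Iterable, List


def empty_or_missing(directory: str, names: Iterable[str]) -> List[str]:
    """Return the names that are absent or zero bytes long in `directory`."""
    bad = []
    for name in names:
        path = os.path.join(directory, name)
        if not os.path.isfile(path) or os.path.getsize(path) == 0:
            bad.append(name)
    return bad


# Files the batch script is expected to leave behind before renaming
# (names taken from the script; the .tr name follows from name=sev7.exp0).
TRAINING_OUTPUTS = [
    "sev7.exp0.tr",
    "unicharset",
    "shapetable",
    "inttemp",
    "pffmtable",
    "normproto",
]
```

Calling `empty_or_missing(".", TRAINING_OUTPUTS)` after the clustering steps returns the subset of files that deserve a closer look; an empty list means every file exists and has some content.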

Adrien

OK, I finally managed to train Tesseract.

The solution was to add the --psm parameter to the command

tesseract.exe %name%.tif %name% nobatch box.train

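so that it becomes (with %typeFile% set to the image extension and %psm% to the page segmentation mode):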

tesseract.exe %name%.%typeFile% %name%  --psm %psm% nobatch box.train

For reference, the possible psm values are:

REM pagesegmode values are:

REM   0 = Orientation and script detection (OSD) only.
REM   1 = Automatic page segmentation with OSD.
REM   2 = Automatic page segmentation, but no OSD, or OCR
REM   3 = Fully automatic page segmentation, but no OSD. (Default)
REM   4 = Assume a single column of text of variable sizes.
REM   5 = Assume a single uniform block of vertically aligned text.
REM   6 = Assume a single uniform block of text.
REM   7 = Treat the image as a single text line.
REM   8 = Treat the image as a single word.
REM   9 = Treat the image as a single word in a circle.
REM   10 = Treat the image as a single character.
REM   11 = Sparse text. Find as much text as possible in no particular order.
REM   12 = Sparse text with OSD.
REM   13 = Raw line. Treat the image as a single text line, bypassing hacks that are Tesseract-specific.
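
For anyone driving the training step from a script rather than a batch file, the corrected command can be assembled with the mode checked against the list above. This is a minimal sketch: the function name `box_train_command` and its defaults are assumptions for illustration, and the argument order simply mirrors the command shown earlier.

```python
from typing import List

# Page segmentation modes listed above run from 0 through 13.
VALID_PSM = range(14)


def box_train_command(name: str, psm: int, type_file: str = "tif",
                      exe: str = "tesseract.exe") -> List[str]:
    """Build the argument list for the box.train step with an explicit --psm."""
    if psm not in VALID_PSM:
        raise ValueError(f"psm must be between 0 and 13, got {psm}")
    return [exe, f"{name}.{type_file}", name,
            "--psm", str(psm), "nobatch", "box.train"]
```

The resulting list can be passed to `subprocess.run(box_train_command("sev7.exp0", 7), check=True)`. The value 7 here is purely illustrative; which mode suits a given image depends on its layout, so it is worth trying the single-line, single-word and single-character modes in turn.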

Found at https://github.com/tesseract-ocr/tesseract/issues/434
