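I am trying to train Tesseract 4 on specific images (to read the 7-segment display of a multimeter).
Note that I am aware of the already-trained data that Arthur Augusto provides at https://github.com/arturaugusto/display_ocr, but I need to train Tesseract on my own data.
To train Tesseract, I followed several tutorials (such as https://robipritrznik.medium.com/recognizing-vehicle-license-plates-on-images-using-tesseract-4-ocr-with-custom-trained-models-4ba9861595e7 or https://pretius.com/how-to-prepare-training-files-for-tesseract-ocr-and-improve-characters-recognition/),
but I always run into a problem when I run the shapeclustering command on my own data.
(With the sample data from https://github.com/tesseract-ocr/tesseract/issues/1174#issuecomment-338448972, everything works fine.)
Indeed, when I run the shapeclustering command on my data, it prints the following output: [screenshot]. After that my shapetable is empty and the training is ineffective.
With the sample data it works, and the shapetable is populated properly.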
I suspect the problem lies in how I generate the box file. Here is my process:
I generate the box file with the command
tesseract imageFileName.tif imageFileName batch.nochop makebox
and then edit it with jTessBoxEditor.
So I cannot see what is wrong with my .box / .tif pair.
Have a nice day, and thank you for your help.
Adrien
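Since the question is whether the .box file itself is at fault, a quick structural check can rule out the most obvious problems. The sketch below is not part of the original post. It assumes the usual box file layout of one glyph per line, `<glyph> <left> <bottom> <right> <top> <page>`, with the origin at the bottom-left corner of the image. The names `Box`, `parse_box_line` and `check_boxes` are hypothetical helpers, and the image size is passed in by the caller rather than read from the .tif.

```python
from typing import List, NamedTuple


class Box(NamedTuple):
    glyph: str
    left: int
    bottom: int
    right: int
    top: int
    page: int


def parse_box_line(line: str) -> Box:
    """Parse one box file line: '<glyph> <left> <bottom> <right> <top> <page>'."""
    parts = line.rstrip("\r\n").rsplit(" ", 5)
    if len(parts) != 6:
        raise ValueError(f"expected 6 fields, got {len(parts)}: {line!r}")
    glyph, *numbers = parts
    # int() raises ValueError for non-numeric coordinates.
    left, bottom, right, top, page = (int(n) for n in numbers)
    return Box(glyph, left, bottom, right, top, page)


def check_boxes(lines: List[str], image_width: int, image_height: int) -> List[str]:
    """Return readable problems found in the box lines (empty list: none found)."""
    problems = []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # ignore blank lines
        try:
            box = parse_box_line(line)
        except ValueError as error:
            problems.append(f"line {number}: {error}")
            continue
        if not box.glyph:
            problems.append(f"line {number}: empty glyph")
        if box.left >= box.right or box.bottom >= box.top:
            problems.append(f"line {number}: zero or negative box size")
        if (box.left < 0 or box.bottom < 0
                or box.right > image_width or box.top > image_height):
            problems.append(f"line {number}: box outside the image")
    return problems
```

A clean report only means the file is well formed. It does not show that the boxes line up with the digits in the image, which still has to be verified visually in the box editor.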
Here is my complete batch script, which runs the training once the box file has been generated and edited.
set name=sev7.exp0
set shortName=sev7
echo Run Tesseract for Training..
tesseract.exe %name%.tif %name% nobatch box.train
echo Compute the Character Set..
unicharset_extractor.exe %name%.box
shapeclustering -F font_properties -U unicharset -O %shortName%.unicharset %name%.tr
mftraining -F font_properties -U unicharset -O %shortName%.unicharset %name%.tr
echo Clustering..
cntraining.exe %name%.tr
echo Rename Files..
rename normproto %shortName%.normproto
rename inttemp %shortName%.inttemp
rename pffmtable %shortName%.pffmtable
rename shapetable %shortName%.shapetable
echo Create Tessdata..
combine_tessdata.exe %shortName%.
echo. & pause
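Because the batch script runs every step regardless of whether the previous one produced anything, a failed step can go unnoticed until the end. The following sketch, which is not from the original post, lists the files named in the script (the .tr file plus the renamed outputs) and reports any that are missing or zero bytes long. `artifacts_from_script` and `find_missing_or_empty` are hypothetical helper names. A zero-byte check only catches the most obvious failures: a shapetable that contains no shapes is not necessarily a zero-byte file.

```python
import os
from typing import Dict, List


def artifacts_from_script(name: str, short_name: str) -> List[str]:
    """File names taken from the batch script: the .tr file plus the renamed outputs."""
    return [f"{name}.tr"] + [
        f"{short_name}.{suffix}"
        for suffix in ("unicharset", "normproto", "inttemp", "pffmtable", "shapetable")
    ]


def find_missing_or_empty(directory: str, filenames: List[str]) -> Dict[str, str]:
    """Map each problematic file name to 'missing' or 'empty' (zero bytes)."""
    problems = {}
    for filename in filenames:
        path = os.path.join(directory, filename)
        if not os.path.isfile(path):
            problems[filename] = "missing"
        elif os.path.getsize(path) == 0:
            problems[filename] = "empty"
    return problems
```

Run from the training directory just before the combine_tessdata step, for example `find_missing_or_empty(".", artifacts_from_script("sev7.exp0", "sev7"))`, an empty result means every expected file exists and has some content.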
OK, I finally managed to train Tesseract.
The solution was to add the --psm argument
when running the command
tesseract.exe %name%.tif %name% nobatch box.train
as in
tesseract.exe %name%.%typeFile% %name% --psm %psm% nobatch box.train
For reference, the possible psm values are:
REM pagesegmode values are:
REM 0 = Orientation and script detection (OSD) only.
REM 1 = Automatic page segmentation with OSD.
REM 2 = Automatic page segmentation, but no OSD, or OCR
REM 3 = Fully automatic page segmentation, but no OSD. (Default)
REM 4 = Assume a single column of text of variable sizes.
REM 5 = Assume a single uniform block of vertically aligned text.
REM 6 = Assume a single uniform block of text.
REM 7 = Treat the image as a single text line.
REM 8 = Treat the image as a single word.
REM 9 = Treat the image as a single word in a circle.
REM 10 = Treat the image as a single character.
REM 11 = Sparse text. Find as much text as possible in no particular order.
REM 12 = Sparse text with OSD.
REM 13 = Raw line. Treat the image as a single text line, bypassing hacks that are Tesseract-specific.
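For anyone scripting the box.train step outside a batch file, the fix can be sketched as a small command builder that always passes an explicit --psm value. This is an illustration, not part of the original answer: `box_train_command` is a hypothetical helper, and the value 7 used in the usage note is an arbitrary example, since the answer does not say which mode worked for the 7-segment images.

```python
from typing import List

# Range of page segmentation modes listed above.
PSM_MIN, PSM_MAX = 0, 13


def box_train_command(image_path: str, output_base: str, psm: int) -> List[str]:
    """Build the argument list for the box.train step with an explicit --psm value."""
    if not PSM_MIN <= psm <= PSM_MAX:
        raise ValueError(f"psm must be between {PSM_MIN} and {PSM_MAX}, got {psm}")
    return [
        "tesseract", image_path, output_base,
        "--psm", str(psm),
        "nobatch", "box.train",
    ]
```

Calling `box_train_command("sev7.exp0.tif", "sev7.exp0", 7)` returns a list that can be handed to `subprocess.run(..., check=True)`. Trying a few modes and checking which one yields a usable .tr file is one way to find the right value for a given set of images.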