How do I train tesseract to ignore the wavy lines added from spelling and grammar error detection?
I am using tesseract to detect text in a variety of image types, including screenshots. It gets confused by the wavy red and blue underlines for spelling and grammar warnings, like the example below, and I end up getting either no text or a garbled mess.
I have looked at ways to eliminate these lines in ImageMagick pre-processing with some success, but these methods wipe out any text which is red or blue, which is undesirable. They also take a long time to run, and I need to process over 100k images per day. I am thinking that maybe there is a way to train tesseract to recognize and discard these lines, but I'm not sure how that would work.
I have seen tutorials on how to train tesseract to recognize text, but I haven't seen anything on how to train it to recognize something that isn't text. Is there a way I can train tesseract, or do something with the Leptonica setup it uses, to ignore these lines?
If anyone has successfully dealt with this, please let me know; otherwise, what would the recommended approach be?
imagemagick tesseract-ocr
asked Jan 13 '17 at 9:45
GdD
1 Answer
I am currently trying to learn how to train tesseract (I'm stuck on how to create LSTM files for training), but I know that you can fine-tune existing trained data. I use jTessBoxEditor for correcting the mistakes tesseract makes during OCR. I just haven't found a way to feed those corrections back in as training, but I think that tool is just what you need.
Using jTessBoxEditor you can see how the OCR was done on your picture, and you can also edit the result. I'm still stuck on how to implement the training (still waiting on a response on the forum and also here), so I can't really help more, because that's as far as I got. I also wouldn't expect anyone else to answer your question, as it is 2 years old, so your setup is probably already outdated. I'm trying tesseract-ocr 4.*, where training has changed a lot and the tools have evolved too. Your problem is doable with jTessBoxEditor, but I don't know how to implement it, so this is only a partial answer.
I hope this helps, even if just a little bit.
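If training turns out to be a dead end, the "wipes out any red or blue text" problem from the question could in principle be sidestepped by filtering on shape as well as color. The following is only a minimal sketch under stated assumptions, not a tested production method: it assumes the squiggles are strongly red or blue, only a few pixels tall, and much wider than they are tall, while colored text glyphs are taller. The function names, the color margin, and the size thresholds are all hypothetical and would need tuning on real screenshots. It works on plain lists of RGB tuples so it has no dependencies; at 100k images per day a vectorized image library would be needed instead of pure Python.

```python
from collections import deque


def is_marker_color(rgb, margin=80):
    """Heuristic (assumption): a pixel is 'squiggle colored' if it is strongly red or blue."""
    r, g, b = rgb
    return (r - max(g, b) >= margin) or (b - max(r, g) >= margin)


def remove_squiggles(pixels, max_height=4, min_width=8, background=(255, 255, 255)):
    """Return a copy of `pixels` (rows of RGB tuples) with short, wide colored
    components painted over. Taller colored components (likely text) are kept.
    The thresholds are illustrative guesses, not measured values."""
    height = len(pixels)
    width = len(pixels[0]) if height else 0
    out = [list(row) for row in pixels]
    seen = [[False] * width for _ in range(height)]

    for y0 in range(height):
        for x0 in range(width):
            if seen[y0][x0] or not is_marker_color(pixels[y0][x0]):
                continue
            # Flood-fill one 8-connected component of marker-colored pixels.
            component = []
            queue = deque([(y0, x0)])
            seen[y0][x0] = True
            while queue:
                y, x = queue.popleft()
                component.append((y, x))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < height and 0 <= nx < width
                                and not seen[ny][nx]
                                and is_marker_color(pixels[ny][nx])):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            # Decide from the bounding box whether this looks like an underline.
            ys = [p[0] for p in component]
            xs = [p[1] for p in component]
            comp_h = max(ys) - min(ys) + 1
            comp_w = max(xs) - min(xs) + 1
            if comp_h <= max_height and comp_w >= min_width:
                for y, x in component:
                    out[y][x] = background
    return out
```

Known limitations of this sketch: a colored hyphen or underscore that is wide and short would be removed too, and a squiggle that touches a colored glyph would merge into one tall component and be kept. The cleaned pixels would then be written back to an image and passed to tesseract as usual.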
Sorry about that. The link for that tool goes to the exact page that describes how to set the boxes tesseract recognises.
– Kristóf Horváth
Jan 30 at 9:50
The link also includes pictures and download instructions.
– Kristóf Horváth
Jan 30 at 10:03
edited Jan 30 at 9:55
answered Jan 30 at 8:35
Kristóf Horváth