3

Usually, when you OCR a table of contents, the columns are separated by a large space, so the output is not properly ordered. For example, for a table like this:

The output would be:

The Rank Function
Permutations of Atoms
Pure Set Theory and Axiom System ZF
3.5
3.6
3.7

I'd like it to be:

3.5 The Rank Function\112
3.6 Permutations of Atoms\116
3.7 Pure Set Theory and Axiom System ZF\118

But different TOCs have different output patterns, so there is no way to build a regex script that automatically fixes every book. The best approach is to fix it in the first place. But how?

asked May 20, 2018 at 6:08 by Ooker
  • Which OCR tool do you prefer using? (Commented Jul 31, 2024 at 2:33)
  • I haven't explored that tool much, so I have no idea. (Commented Aug 1, 2024 at 17:49)

2 Answers

2

First, define what you mean by "fix it in the first place".

If you want to fix wrong output after the OCR analysis, you will never find a simple solution that works on an unbounded set of TOCs, because you can never cover all variations. You would have to create a machine learning algorithm that analyzes each TOC variant.

Or, for a simple TOC, count the substrings that share the same characteristics:

Chapter numberChapter numberChapter numberChapter numberChapter number...

= 5

Chapter titleChapter titleChapter titleChapter titleChapter title...

= 5
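The counting idea can be sketched in Python. This is only a minimal illustration under assumptions that are not part of the answer: the OCR output arrives one item per line, the section numbers look like `3.5`, and the names `SECTION_NUMBER` and `reorder_toc` are hypothetical. Page numbers, if the OCR emits them as a third column, would need their own pattern.

```python
import re

# Hypothetical pattern for section numbers such as "3.5"; adjust per book.
SECTION_NUMBER = re.compile(r"^\d+(\.\d+)*$")


def reorder_toc(lines):
    """Pair section numbers with titles when OCR emitted them column by column."""
    items = [line.strip() for line in lines if line.strip()]
    numbers = [s for s in items if SECTION_NUMBER.match(s)]
    titles = [s for s in items if not SECTION_NUMBER.match(s)]
    # The counts must agree, otherwise the pairing is ambiguous.
    if len(numbers) != len(titles):
        raise ValueError(f"{len(numbers)} numbers vs {len(titles)} titles")
    return [f"{n} {t}" for n, t in zip(numbers, titles)]


ocr_lines = [
    "The Rank Function",
    "Permutations of Atoms",
    "Pure Set Theory and Axiom System ZF",
    "3.5",
    "3.6",
    "3.7",
]
for entry in reorder_toc(ocr_lines):
    print(entry)
```

If the two counts differ, the function refuses to guess, which is the point of counting first: a mismatch signals a TOC that needs manual attention.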

If you want to fix the OCR analysis itself, it would help to answer: what OCR tool do you use?

For example, in Tesseract you can set the text to be processed by rows instead of columns.

[Image: schematic illustration of the page analysis setting]
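On the command line, the closest Tesseract setting I know of is the page segmentation mode, `--psm`. The sketch below builds such a call from Python. It assumes Tesseract 4 or later is installed and on the PATH, the file name `toc_page.png` is hypothetical, and whether mode 6 ("assume a single uniform block of text") keeps the rows together for a given TOC is something to test rather than take for granted; mode 4 is another candidate.

```python
import subprocess


def tesseract_command(image_path, psm=6):
    """Build a Tesseract CLI call that prints the recognized text to stdout."""
    # --psm 6 treats the page as one uniform block of text, which tends to
    # keep each printed row together instead of splitting it into columns.
    return ["tesseract", image_path, "stdout", "--psm", str(psm)]


def ocr_toc_page(image_path, psm=6):
    """Run Tesseract (must be installed and on PATH) and return the text."""
    result = subprocess.run(
        tesseract_command(image_path, psm),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


# Only show the command here; call ocr_toc_page() on a real scan to run it.
print(tesseract_command("toc_page.png"))
```

Because this needs no GUI, the same call can be looped over many scanned TOC pages in a batch.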

answered Jul 10, 2018 at 0:38 by Stilgar Dragonclaw
  • This sounds promising. What GUI did you use to make that screenshot? I've checked the 3rd party projects but don't know which one. (Commented Jul 10, 2018 at 5:21)
  • I've opened a question in Software Recommendations: What OCR tools can recognize text by rows instead of columns? (Commented Jul 10, 2018 at 6:00)
  • @Ooker: I think most OCR tools let you set how the page should be analyzed. The picture is my schematic illustration, but FineReader and gImageReader (GitHub), the advanced GUI for Tesseract, also show such borders. For automatic analysis, though, you do not need a GUI unless you want to edit something manually, IMHO, because you cannot batch process a large number of TOCs automatically if you want to check and edit each one. (Commented Jul 10, 2018 at 18:12)
  • @Ooker your question on OCR tool recommendations has been removed. Did you move it somewhere else? (Commented Aug 2, 2024 at 4:12)
  • @raf it was deleted by the roomba for being inactive. I reasked it. Hope it lives longer. (Commented Aug 2, 2024 at 14:35)
-1

This doesn't really answer the question, but some books in Google Books have a TOC:

answered May 21, 2018 at 13:57 by Ooker
