Krasjet/pdf.tocgen


A CLI toolset to generate table of contents for PDF files automatically.


pdf.tocgen

                          in.pdf
                            |
                            |
     +----------------------+--------------------+
     |                      |                    |
     V                      V                    V
+----------+          +-----------+         +----------+
|          |  recipe  |           |   ToC   |          |
| pdfxmeta +--------->| pdftocgen +-------->| pdftocio +---> out.pdf
|          |          |           |         |          |
+----------+          +-----------+         +----------+

pdf.tocgen is a set of command-line tools for automatically extracting and generating the table of contents (ToC) of a PDF file. It uses the embedded font attributes and position of headings to deduce the basic outline of a PDF file.

It works best for PDF files produced from a TeX document using pdftex (and its friends pdflatex, pdfxetex, etc.), but it's designed to work with any software-generated PDF files (i.e. you shouldn't expect it to work with scanned PDFs). Some examples include troff/groff, Adobe InDesign, Microsoft Word, and probably more.

Please see the homepage for a detailed introduction.

Installation

pdf.tocgen is written in Python 3. It is known to work with Python 3.7 to 3.11 on Linux, Windows, and macOS (on BSDs, you probably need to build PyMuPDF yourself). Use

$ pip install -U pdf.tocgen

to install the latest version systemwide. Alternatively, use pipx or

$ pip install -U --user pdf.tocgen

to install it for the current user. I would recommend the latter approach to avoid messing up the package manager on your system.

If you are using an Arch-based Linux distro, the package is also available on AUR. It can be installed using any AUR helper, for example yay:

$ yay -S pdf.tocgen

Workflow

The design of pdf.tocgen is influenced by the Unix philosophy. I intentionally separated pdf.tocgen into 3 separate programs. They work together, but each of them is useful on its own.

pdfxmeta: extract the metadata (font attributes, positions) of headings to build a recipe file.
pdftocgen: generate a table of contents from the recipe.
pdftocio: import the table of contents to the PDF document.

You should read the example on the homepage for a proper introduction, but the basic workflow goes like this.

First, use pdfxmeta to search for the metadata of headings, and generate heading filters using the automatic setting:

$ pdfxmeta -p page -a 1 in.pdf "Section" >> recipe.toml
$ pdfxmeta -p page -a 2 in.pdf "Subsection" >> recipe.toml

Note that page needs to be replaced by the page number of the search keyword.

The output recipe.toml file would contain several heading filters, each of which specifies the attributes that a heading at a particular level should have.

An example recipe file would look like this:

[[heading]]
level = 1
greedy = true
font.name = "Times-Bold"
font.size = 19.92530059814453

[[heading]]
level = 2
greedy = true
font.name = "Times-Bold"
font.size = 11.9552001953125

Then pass the recipe to pdftocgen to generate a table of contents,

$ pdftocgen in.pdf < recipe.toml
"Preface" 5
    "Bottom-up Design" 5
    "Plan of the Book" 7
    "Examples" 9
    "Acknowledgements" 9
"Contents" 11
"The Extensible Language" 14
    "1.1 Design by Evolution" 14
    "1.2 Programming Bottom-Up" 16
    "1.3 Extensible Software" 18
    "1.4 Extending Lisp" 19
    "1.5 Why Lisp (or When)" 21
"Functions" 22
    "2.1 Functions as Data" 22
    "2.2 Defining Functions" 23
    "2.3 Functional Arguments" 26
    "2.4 Functions as Properties" 28
    "2.5 Scope" 29
    "2.6 Closures" 30
    "2.7 Local Functions" 34
    "2.8 Tail-Recursion" 35
    "2.9 Compilation" 37
    "2.10 Functions from Lists" 40
"Functional Programming" 41
    "3.1 Functional Design" 41
    "3.2 Imperative Outside-In" 46
    "3.3 Functional Interfaces" 48
    "3.4 Interactive Programming" 50
[--snip--]

which can be directly imported to the PDF file using pdftocio,

$ pdftocgen in.pdf < recipe.toml | pdftocio -o out.pdf in.pdf

Or if you want to edit the table of contents before importing it,

$ pdftocgen in.pdf < recipe.toml > toc
$ vim toc # edit
$ pdftocio in.pdf < toc
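Because the toc file is plain text, it can also be inspected or edited by a script instead of by hand. The following is a rough sketch of a parser, assuming the layout seen in the sample output: one entry per line, four spaces of indentation per level, a double-quoted title, then a page number. The names Entry and parse_toc are hypothetical, and the sketch does not handle the extra field added by the -v flag or any escaping inside titles.

```python
import re
from typing import List, NamedTuple


class Entry(NamedTuple):
    level: int
    title: str
    page: int


# Optional indentation, a double-quoted title, then a page number.
LINE = re.compile(r'^(?P<indent> *)"(?P<title>.*)" (?P<page>\d+)\s*$')


def parse_toc(text: str, indent_width: int = 4) -> List[Entry]:
    """Parse ToC text into (level, title, page) entries."""
    entries = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        m = LINE.match(line)
        if m is None:
            raise ValueError(f"unrecognized ToC line: {line!r}")
        # Top-level entries have no indentation and get level 1.
        level = len(m["indent"]) // indent_width + 1
        entries.append(Entry(level, m["title"], int(m["page"])))
    return entries


SAMPLE = '''"Preface" 5
    "Bottom-up Design" 5
    "Plan of the Book" 7
"Contents" 11
'''

toc = parse_toc(SAMPLE)
for entry in toc:
    print(entry)
```

Raising on an unrecognized line, rather than skipping it, makes a typo introduced during manual editing visible before the file is passed back to pdftocio.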

Each of the three programs has some extra functionality. Use the -h option to see all the options you could pass in.

Command examples

Because of the modularity of the design, each program is useful on its own, despite being part of the pipeline. This section provides some more examples of how you could use them. Feel free to come up with more.

`pdftocio`

pdftocio best demonstrates this point: this program can do a lot on its own.

To display the existing table of contents in a PDF to stdout:

$ pdftocio doc.pdf
"Level 1 heading 1" 1
    "Level 2 heading 1" 1
        "Level 3 heading 1" 2
        "Level 3 heading 2" 3
    "Level 2 heading 2" 4
"Level 1 heading 2" 5

To write the existing table of contents in a PDF to a file named toc:

$ pdftocio doc.pdf > toc

To write a toc file back to doc.pdf:

$ pdftocio doc.pdf < toc

To specify the name of the output PDF:

$ pdftocio -o out.pdf doc.pdf < toc

To copy the table of contents from doc1.pdf to doc2.pdf:

$ pdftocio -v doc1.pdf | pdftocio doc2.pdf

Note that the -v flag helps preserve the vertical positions of headings during the copy.

To print the table of contents for reading:

$ pdftocio -H doc.pdf
Level 1 heading 1 ··· 1
    Level 2 heading 1 ··· 1
        Level 3 heading 1 ··· 2
        Level 3 heading 2 ··· 3
    Level 2 heading 2 ··· 4
Level 1 heading 2 ··· 5

`pdftocgen`

If you have obtained an existing recipe rcp.toml for doc.pdf, you could apply it and print the outline to stdout by

$ pdftocgen doc.pdf < rcp.toml
"Level 1 heading 1" 1
    "Level 2 heading 1" 1
        "Level 3 heading 1" 2
        "Level 3 heading 2" 3
    "Level 2 heading 2" 4
"Level 1 heading 2" 5

To output the table of contents to a file called toc:

$ pdftocgen doc.pdf < rcp.toml > toc

To import the generated table of contents to the PDF file, and output to doc_out.pdf:

$ pdftocgen doc.pdf < rcp.toml | pdftocio -o doc_out.pdf doc.pdf

To print the generated table of contents for reading:

$ pdftocgen -H doc.pdf < rcp.toml
Level 1 heading 1 ··· 1
    Level 2 heading 1 ··· 1
        Level 3 heading 1 ··· 2
        Level 3 heading 2 ··· 3
    Level 2 heading 2 ··· 4
Level 1 heading 2 ··· 5

If you want to include the vertical position in a page for each heading, use the -v flag:

$ pdftocgen -v doc.pdf < rcp.toml
"Level 1 heading 1" 1 306.947998046875
    "Level 2 heading 1" 1 586.3488159179688
        "Level 3 heading 1" 2 586.5888061523438
        "Level 3 heading 2" 3 155.66879272460938
    "Level 2 heading 2" 4 435.8687744140625
"Level 1 heading 2" 5 380.78875732421875

pdftocio can understand the vertical position in the output to generate table of contents entries that link to the exact position of the heading, instead of the top of the page.

$ pdftocgen -v doc.pdf < rcp.toml | pdftocio doc.pdf

Note that the default output of pdftocio here is doc_out.pdf.
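With -v, each line carries a third field after the page number. As a small illustration of that layout, the sketch below splits a single entry into title, page, and an optional vertical position using shlex from the standard library. The helper split_entry is hypothetical, and shlex follows shell quoting rules, which may not match the tools' own handling of unusual titles.

```python
import shlex
from typing import Optional, Tuple


def split_entry(line: str) -> Tuple[str, int, Optional[float]]:
    """Split one ToC line into (title, page, vertical position or None)."""
    # shlex.split drops the leading indentation and strips the quotes.
    fields = shlex.split(line)
    title, page = fields[0], int(fields[1])
    vpos = float(fields[2]) if len(fields) > 2 else None
    return title, page, vpos


print(split_entry('    "Level 2 heading 1" 1 586.3488159179688'))
print(split_entry('"Level 1 heading 2" 5'))
```

Lines without the third field yield None for the position, which corresponds to an entry that links to the top of the page.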

`pdfxmeta`

To search for Anaphoric in the entire PDF:

$ pdfxmeta onlisp.pdf "Anaphoric"
14. Anaphoric Macros:
    font.name = "Times-Bold"
    font.size = 9.962599754333496
    font.color = 0x000000
    font.superscript = false
    font.italic = false
    font.serif = true
    font.monospace = false
    font.bold = true
    bbox.left = 308.6400146484375
    bbox.top = 307.1490478515625
    bbox.right = 404.33282470703125
    bbox.bottom = 320.9472351074219
[--snip--]

To output the result as a heading filter with the automatic settings,

$ pdfxmeta -a 1 onlisp.pdf "Anaphoric"
[[heading]]
# 14. Anaphoric Macros
level = 1
greedy = true
font.name = "Times-Bold"
font.size = 9.962599754333496
# font.size_tolerance = 1e-5
# font.color = 0x000000
# font.superscript = false
# font.italic = false
# font.serif = true
# font.monospace = false
# font.bold = true
# bbox.left = 308.6400146484375
# bbox.top = 307.1490478515625
# bbox.right = 404.33282470703125
# bbox.bottom = 320.9472351074219
# bbox.tolerance = 1e-5
[--snip--]

which can be written directly to a recipe file:

$ pdfxmeta -a 1 onlisp.pdf "Anaphoric" >> recipe.toml
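The generated filter lists font.size_tolerance and bbox.tolerance as commented-out options. Font sizes are reported as floating-point values such as 9.962599754333496, so an exact equality test against a hand-typed number is fragile. The sketch below shows one way a tolerance comparison could work, assuming the tolerance is an absolute difference. The function size_matches is hypothetical; how pdftocgen actually applies the tolerance is not described in this README.

```python
def size_matches(span_size: float, filter_size: float,
                 tolerance: float = 1e-5) -> bool:
    """Return True if two font sizes differ by at most `tolerance`.

    Assumption: the tolerance is an absolute difference. This is an
    illustration only, not pdftocgen's own comparison code.
    """
    return abs(span_size - filter_size) <= tolerance


# A rounded value still matches the size reported by pdfxmeta...
print(size_matches(9.962599754333496, 9.9626))
# ...while a different heading size does not.
print(size_matches(9.962599754333496, 11.9552001953125))
```

Under this assumption, a larger tolerance would let one filter match headings whose sizes vary slightly across a document.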

To do a case-insensitive search for Anaphoric in the entire PDF:

$ pdfxmeta -i onlisp.pdf "Anaphoric"
to compile-time. Chapter 14 introduces anaphoric macros, which allow you to:
    font.name = "Times-Roman"
    font.size = 9.962599754333496
    font.color = 0x000000
    font.superscript = false
    font.italic = false
    font.serif = true
    font.monospace = false
    font.bold = false
    bbox.left = 138.60000610351562
    bbox.top = 295.6583557128906
    bbox.right = 459.0260009765625
    bbox.bottom = 308.948486328125
[--snip--]

Use a regular expression to do a case-insensitive search for Anaphoric in the entire PDF:

$ pdfxmeta onlisp.pdf "[Aa]naphoric"
to compile-time. Chapter 14 introduces anaphoric macros, which allow you to:
    font.name = "Times-Roman"
    font.size = 9.962599754333496
    font.color = 0x000000
    font.superscript = false
    font.italic = false
    font.serif = true
    font.monospace = false
    font.bold = false
    bbox.left = 138.60000610351562
    bbox.top = 295.6583557128906
    bbox.right = 459.0260009765625
    bbox.bottom = 308.948486328125
[--snip--]
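The two approaches are close but not identical: a character class such as [Aa] only relaxes the case of the first letter. The sketch below compares the patterns using Python's re module, on the assumption that pdfxmeta patterns follow Python regular expression syntax; re.IGNORECASE stands in for the -i flag here, and the sample text is made up.

```python
import re

text = "Chapter 14 introduces anaphoric macros. 14. Anaphoric Macros"

# Plain pattern: case-sensitive, matches only the capitalized heading.
exact = re.findall(r"Anaphoric", text)

# Character class: matches either case of the first letter only.
first_letter = re.findall(r"[Aa]naphoric", text)

# Full case-insensitive matching (the analogue of -i in this sketch).
ignore_case = re.findall(r"anaphoric", text, flags=re.IGNORECASE)

print(exact)         # ['Anaphoric']
print(first_letter)  # ['anaphoric', 'Anaphoric']
print(ignore_case)   # ['anaphoric', 'Anaphoric']

# The character class does not match an all-caps spelling.
print(re.findall(r"[Aa]naphoric", "ANAPHORIC"))  # []
```

For headings set in all capitals, full case-insensitive matching is therefore the safer choice.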

To search only on page 203:

$ pdfxmeta -p 203 onlisp.pdf "anaphoric"
anaphoric if, called:
    font.name = "Times-Roman"
    font.size = 9.962599754333496
    font.color = 0x000000
    font.superscript = false
    font.italic = false
    font.serif = true
    font.monospace = false
    font.bold = false
    bbox.left = 138.60000610351562
    bbox.top = 283.17822265625
    bbox.right = 214.81094360351562
    bbox.bottom = 296.4683532714844
[--snip--]

To dump all of page 203:

$ pdfxmeta -p 203 onlisp.pdf
190:
    font.name = "Times-Roman"
    font.size = 9.962599754333496
    font.color = 0x000000
    font.superscript = false
    font.italic = false
    font.serif = true
    font.monospace = false
    font.bold = false
    bbox.left = 138.60000610351562
    bbox.top = 126.09941101074219
    bbox.right = 153.54388427734375
    bbox.bottom = 139.38951110839844
[--snip--]

To dump the entire PDF document:

$ pdfxmeta onlisp.pdf
i:
    font.name = "Times-Roman"
    font.size = 9.962599754333496
    font.color = 0x000000
    font.superscript = false
    font.italic = false
    font.serif = true
    font.monospace = false
    font.bold = false
    bbox.left = 458.0400085449219
    bbox.top = 126.09941101074219
    bbox.right = 460.8096008300781
    bbox.bottom = 139.38951110839844
[--snip--]

Development

If you want to modify the source code or contribute anything, first install poetry, which is a dependency and package manager for Python used by pdf.tocgen. Then run

$ poetry install

in the root directory of this repository to set up development dependencies.

If you want to test the development version of pdf.tocgen, use the poetry run command:

$ poetry run pdfxmeta in.pdf "pattern"

Alternatively, you could also use the

$ poetry shell

command to open up a virtual environment and run the development version directly:

(pdf.tocgen) $ pdfxmeta in.pdf "pattern"

Before you send a patch or pull request, make sure the unit tests pass by running:

$ make test

GUI front end

If you are an Emacs user, you could install Daniel Nicolai's toc-mode package as a GUI front end for pdf.tocgen, though it offers many more functionalities, such as extracting the (printed) table of contents from a PDF file. Note that it uses pdf.tocgen under the hood, so you still need to install pdf.tocgen before using toc-mode as a front end for it.

License

pdf.tocgen itself is free software. The source code of pdf.tocgen is licensed under the GNU GPLv3 license. However, the recipes in the recipes directory are separately licensed under the CC BY-NC-SA 4.0 License to prevent any commercial usage, and are thus not included in the distribution.

pdf.tocgen is based on PyMuPDF, licensed under the GNU GPLv3 license, which is in turn based on MuPDF, licensed under the GNU AGPLv3 license. A copy of the AGPLv3 license is included in the repository.

If you want to make any derivatives based on this project, please follow the terms of the GNU GPLv3 license.

About

Homepage: krasjet.com/voice/pdf.tocgen/

Topics: cli, pdf, table-of-contents, scraping, toc-generator, pdf-files, pdf-document, pymupdf

Latest release: 1.3.4 (Nov 26, 2023)
