raul23/pyebooktools

Command-line program for organizing and managing ebook collections. It is a Python port of the original ebook-tools shell scripts.

License

GPL-3.0 license


🚧 Work-In-Progress

This project (version 0.1.0a3) is a Python port of ebook-tools, which is written in Shell by na--. The Python script ebooktools.py is a collection of tools for automated organization and management of large ebook collections.

Check also my other project search-ebooks, which is based on pyebooktools, for searching through the content and metadata of ebooks.

⚠️

Check organize-ebooks, which is the Python port of organize-ebooks.sh and includes a Docker image for easy installation of all needed dependencies and the Python package.

About

The ebooktools.py script is a Python port of the shell scripts from ebook-tools and makes use of the following modules:

edit_config.py edits a configuration file, which can be either the main config file that contains all the options defined below or the logging config file whose default values are defined in default_logging.py. The edit subcommand of the ebooktools.py script uses this module.
convert_to_txt.py converts the supplied file to a text file. It can optionally also use OCR for .pdf, .djvu and image files. The convert subcommand of the ebooktools.py script uses this module.
find_isbns.py tries to find valid ISBNs inside a file, or in a string if no file was specified. Searching for ISBNs in files uses progressively more resource-intensive methods until some ISBNs are found. For more details see
- the documentation for ebook-tools (shell scripts) or
- search_file_for_isbns() from lib.py (the Python function where the ISBN search in files is implemented).
The find subcommand of the ebooktools.py script uses this module.
organize_ebooks.py is used to automatically organize folders with potentially huge amounts of unorganized ebooks. This is done by renaming the files with proper names and moving them to other folders:
- By default it searches the supplied ebook files for ISBNs, downloads the book metadata (author, title, series, publication date, etc.) from online sources like Goodreads, Amazon and Google Books and renames the files according to a specified template.
- If no ISBN is found, the script can optionally search for the ebooks online by their title and author, which are extracted from the filename or file metadata.
- Optionally an additional file that contains all the gathered ebook metadata can be saved together with the renamed book so it can later be used for additional verification, indexing or processing.
- Most ebook types are supported: .epub, .mobi, .azw, .pdf, .djvu, .chm, .cbr, .cbz, .txt, .lit, .rtf, .doc, .docx, .pdb, .html, .fb2, .lrf, .odt, .prc and potentially others. Even compressed ebooks in arbitrary archive files are supported. For example, a .zip, .rar or other archive file that contains the .pdf or .html chapters of an ebook can be organized without a problem.
- Optical character recognition (OCR) can be automatically used for .pdf, .djvu and image files when no ISBNs were found in them by the fast and straightforward conversion to .txt. This is very useful for scanned ebooks that only contain images or were badly OCR-ed in the first place.
- Files are checked for corruption (zero-filled files, broken pdfs, corrupt archives, etc.) and corrupt files can optionally be moved to another folder.
- Non-ebook documents, pamphlets and pamphlet-like documents like saved webpages, short pdfs, etc. can also be detected and optionally moved to another folder.
Ref.: [ORG]
The organize subcommand of the ebooktools.py script uses this module.
rename_calibre_library.py traverses a calibre library folder and renames all the book files in it by reading their metadata from calibre's metadata.opf files. Then the book files are either moved or symlinked to an output folder along with their corresponding metadata files. The rename subcommand of the ebooktools.py script uses this module.
split_into_folders.py splits the supplied ebook files (and the accompanying metadata files if present) into folders with consecutive names that each contain the specified number of files. The split subcommand of the ebooktools.py script uses this module.

Thus, you have access to various subcommands from within the ebooktools.py script.
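To make the description of split_into_folders.py more concrete, here is a minimal sketch of how files could be grouped into consecutively named folders. It only computes a plan and touches no files. The function name plan_split, its parameters and the zero-padded folder naming are assumptions made for illustration; they are not taken from the pyebooktools source.

```python
from pathlib import Path
from typing import Dict, List, Sequence


def plan_split(files: Sequence[Path], output_dir: Path, files_per_folder: int,
               start_number: int = 0,
               folder_pattern: str = "{:05d}") -> Dict[Path, List[Path]]:
    """Map each consecutively named folder to the files it would receive.

    Hypothetical helper: the name, the parameters and the folder pattern
    are illustrative assumptions, not the project's actual API.
    """
    if files_per_folder < 1:
        raise ValueError("files_per_folder must be at least 1")
    plan: Dict[Path, List[Path]] = {}
    for offset in range(0, len(files), files_per_folder):
        # Folder numbers grow by one for every full group of files.
        number = start_number + offset // files_per_folder
        folder = output_dir / folder_pattern.format(number)
        plan[folder] = list(files[offset:offset + files_per_folder])
    return plan


books = [Path(f"book{i}.pdf") for i in range(5)]
for folder, group in plan_split(books, Path("out"), files_per_folder=2).items():
    print(folder, [p.name for p in group])
```

Computing the plan separately from executing it is one simple way to support a dry run: the plan can be printed and inspected before any file is moved.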

⭐

ebook-tools is the original Shell project I ported to Python. I used the same names for the script options (short and long versions) so that if you used the shell scripts, you will easily know how to run the corresponding subcommand with the given options.
ebooktools.py is the name of the Python script, which will always be referred to that way in this document (i.e. no hyphen and ending with .py) to distinguish it from the original Shell project ebook-tools.
pyebooktools is the name of the Python package that you need to install to have access to the ebooktools.py script.

Installation and dependencies

To install the script ebooktools.py, follow these steps:

Install the dependencies below.
Install the pyebooktools package below.

Python dependencies

Platforms: macOS [soon linux]
Python: >= 3.6
lxml >= 4.4 for parsing calibre's metadata.opf files.
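As a rough illustration of what parsing a metadata.opf file involves, the sketch below reads the title and the authors from a small OPF document. It deliberately uses the standard library's xml.etree.ElementTree instead of lxml so that it runs without extra packages. The sample document and the function name read_opf_metadata are assumptions for illustration and do not come from the pyebooktools source.

```python
import xml.etree.ElementTree as ET

# Dublin Core namespace used by OPF metadata elements such as dc:title.
DC_NS = {"dc": "http://purl.org/dc/elements/1.1/"}

# Hand-written sample; a real calibre metadata.opf has many more fields.
SAMPLE_OPF = """<package xmlns="http://www.idpf.org/2007/opf" version="2.0">
  <metadata xmlns:dc="http://purl.org/dc/elements/1.1/">
    <dc:title>An Example Title</dc:title>
    <dc:creator>Jane Doe</dc:creator>
    <dc:creator>John Roe</dc:creator>
  </metadata>
</package>
"""


def read_opf_metadata(opf_text):
    """Return the title and author list found in an OPF document.

    Hypothetical helper written for this example only.
    """
    root = ET.fromstring(opf_text)
    title = root.find(".//dc:title", DC_NS)
    creators = root.findall(".//dc:creator", DC_NS)
    return {
        "title": title.text if title is not None else None,
        "authors": [c.text for c in creators if c.text],
    }


print(read_opf_metadata(SAMPLE_OPF))
```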

ℹ️

When installing the pyebooktools package below, the lxml library is automatically installed if it is not found, or upgraded to the correct supported version.

Other dependencies

As explained in the documentation for ebook-tools, you need recent versions of:

calibre for fetching metadata from online sources, conversion to txt (for ISBN searching) and ebook metadata extraction. Versions 2.84 and above are preferred because they let you specify manually from which online source to fetch metadata. For earlier versions you have to set isbn_metadata_fetch_order and organize_without_isbn_sources to empty strings.
p7zip for ISBN searching in ebooks that are in archives.
Tesseract for running OCR on books - version 4 gives better results even though it's still in alpha. OCR is disabled by default and another engine can be configured if preferred.
Optionally poppler, catdoc and DjVuLibre can be installed to convert .pdf, .doc and .djvu files, respectively, to .txt faster than calibre does.
⚠️
On macOS, you don't need catdoc since it has the built-in textutil command-line tool that converts any txt, html, rtf, rtfd, doc, docx, wordml, odt, or webarchive file.
Optionally the Goodreads and WorldCat xISBN calibre plugins can be installed for better metadata fetching.

⭐

If you only install calibre among these dependencies, you can still have a functioning program that will organize and manage your ebook collections:
fetching metadata from online sources will work: by default calibre comes with Amazon and Google sources among others
conversion to txt will work: calibre's own ebook-convert tool will be used
All subcommands should work, but accuracy and performance will be affected as explained in the list of dependencies above.

Install `pyebooktools`

First install the Python dependencies and the other tools.
It is highly recommended to install the pyebooktools package in a virtual environment using for example venv or conda.
Make sure to update pip:
```
$ pip install --upgrade pip
```

Install the pyebooktools package (bleeding-edge version) with pip:

$ pip install git+https://github.com/raul23/pyebooktools#egg=pyebooktools

⚠️

Make sure that pip is working with the correct Python version. It might be the case that pip is using Python 2.x. You can find which Python version pip uses with the following:
$ pip -V
If pip is working with the wrong Python version, then try to use pip3, which works with Python 3.x.

Test installation

Test your installation by importing pyebooktools and printing its version:
```
$ python -c "import pyebooktools; print(pyebooktools.__version__)"
```
You can also test that you have access to the ebooktools.py script by showing the program's version:
```
$ ebooktools --version
```

Usage, options and configuration

All of the options documented below can be passed to the ebooktools.py script either via command-line arguments or via the configuration file config.py, which is created along with the logging config file logging.py when the ebooktools.py script is run the first time with any of the subcommands defined below. The default values for these config files are taken from default_config.py and default_logging.py, respectively.

In order to use the parameters found in the configuration file config.py, use the --use-config flag. With this flag, you don't need to type a long command line in the terminal. See the edit subcommand to know how to edit this configuration file.

Most arguments are not required and if nothing is specified, the default values defined in the default config file default_config.py will be used.

The ebooktools.py script consists of various subcommands for the organization and management of ebook collections. The usage pattern for running one of the subcommands is as follows:

ebooktools {edit,convert,find,organize,rename,split} [OPTIONS]

where [OPTIONS] includes general options (as defined in the General options section) and options specific to the subcommand (as defined in the Script usage, subcommands and options section).

⚠️

In order to avoid data loss, use the --dry-run or --symlink-only option when running some of the subcommands (e.g. rename and split) to make sure that they would do what you expect them to do, as explained in the Security and safety section.

General options

Most of these options are part of the common library lib.py and may affect some or all of the subcommands.

General control flags

-h, --help; no config variable; default value False
Show the help message and exit.
-v, --version; no config variable; default value False
Show program's version number and exit.

-q, --quiet; config variable quiet; default value False
Enable quiet mode, i.e. nothing will be printed.

--verbose; config variable verbose; default value False
Print various debugging information, e.g. print traceback when there is an exception.

-u, --use-config; no config variable; default value False
If this is enabled, the parameters found in the main config file config.py will be used instead of the command-line arguments.
ℹ️
Note that any other command-line argument that you use in the terminal with the --use-config flag is ignored, i.e. only the parameters defined in the main config file config.py will be used.

-d, --dry-run; config variable dry_run; default value False
If this is enabled, no file rename/move/symlink/etc. operations will actually be executed.

--sl, --symlink-only; config variable symlink_only; default value False
Instead of moving the ebook files, create symbolic links to them.

--km, --keep-metadata; config variable keep_metadata; default value False
Do not delete the gathered metadata for the organized ebooks, instead save it in an accompanying file together with each renamed book. It is very useful for additional verification, indexing or processing at a later date. [KM]
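The rule that --use-config makes the script ignore every other command-line argument can be sketched as follows. This is only an illustration with argparse: the parser, the CONFIG dictionary and the function resolve_options are assumptions written for this example, and the real script may resolve its options differently. Only the flag names and config variable names come from the list above.

```python
import argparse

# Stand-in for the values stored in config.py (illustrative values only).
CONFIG = {"quiet": False, "verbose": True, "dry_run": True,
          "symlink_only": False, "keep_metadata": False}


def build_parser():
    """Build a small parser exposing the general control flags."""
    parser = argparse.ArgumentParser(prog="ebooktools-sketch")
    parser.add_argument("-q", "--quiet", action="store_true")
    parser.add_argument("--verbose", action="store_true")
    parser.add_argument("-u", "--use-config", action="store_true")
    parser.add_argument("-d", "--dry-run", action="store_true")
    # dest is set explicitly so the result matches the config variable name.
    parser.add_argument("--sl", "--symlink-only", dest="symlink_only",
                        action="store_true")
    parser.add_argument("--km", "--keep-metadata", dest="keep_metadata",
                        action="store_true")
    return parser


def resolve_options(argv, config):
    """Return the effective options for the given command-line arguments."""
    args = vars(build_parser().parse_args(argv))
    use_config = args.pop("use_config")
    if use_config:
        # Every other command-line argument is ignored in this mode.
        return dict(config)
    return args


# --quiet is ignored here because --use-config is present.
print(resolve_options(["--use-config", "--quiet"], CONFIG))
print(resolve_options(["-d", "--sl"], CONFIG))
```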

Script usage, subcommands and options

The usage pattern for running a given subcommand is the following:

ebooktools {edit,convert,find,organize,rename,split} [OPTIONS]

where [OPTIONS] includes general options and options specific to the subcommand as shown below.

ℹ️

Don't forget the name of the Python script ebooktools before the subcommand.

All subcommands are affected by the following global options:

The -h, --help option can be applied specifically to each subcommand or to the ebooktools.py script (when called without the subcommand). Thus, when you want the help message for a specific subcommand, you do:

ebooktools {edit,convert,find,organize,rename,split} -h

which will show you the options that affect the chosen subcommand.

And if you want the help message for the whole ebooktools.py script:

ebooktools -h

which will show you the project description and the description of each subcommand without showing the subcommand options.
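The two help levels described above map naturally onto argparse subparsers. The following sketch shows one way to get that behaviour. It is not the project's actual parser: the one-line summaries are paraphrased from this README, and --example-option is a made-up option that exists only to show where subcommand options appear.

```python
import argparse

# One-line summaries paraphrased from the module descriptions in this README.
SUBCOMMANDS = {
    "edit": "Edit the main or logging configuration file.",
    "convert": "Convert the supplied file to a text file.",
    "find": "Find valid ISBNs inside a file or a string.",
    "organize": "Organize folders of unorganized ebooks.",
    "rename": "Rename the book files of a calibre library.",
    "split": "Split ebook files into consecutively named folders.",
}


def build_parser():
    """Return the top-level parser and a dict of its subparsers."""
    parser = argparse.ArgumentParser(
        prog="ebooktools",
        description="Sketch of a CLI with one subparser per subcommand.")
    subparsers = parser.add_subparsers(dest="subcommand", title="subcommands")
    sub = {}
    for name, summary in SUBCOMMANDS.items():
        sub[name] = subparsers.add_parser(name, help=summary,
                                          description=summary)
    # Made-up option: it is only listed in the help of the 'split' subparser.
    sub["split"].add_argument("--example-option", type=int, default=1)
    return parser, sub


parser, sub = build_parser()
print(parser.format_help())        # lists the subcommands, not their options
print(sub["split"].format_help())  # lists the options of 'split' only
```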

Examples

More examples can be found at examples.rst.

Example 1: convert a pdf file to text with OCR

To convert a pdf file to text with OCR:

$ ebooktools convert --ocr always -o converted.txt pdf_to_convert.pdf

By setting --ocr to always, the pdf file will be first OCRed before trying the simple conversion tools (pdftotext or calibre's ebook-convert if the former command is not found).

Running pyebooktools v0.1.0a3
Verbose option disabled
OCR=always, first try OCR then conversion
Will run OCR on file 'pdf_to_convert.pdf' with 1 page...
OCR successful!

Example 2: find ISBNs in a pdf file

Find ISBNs in a pdf file:

$ ebooktools find pdf_file.pdf

Output:

Running pyebooktools v0.1.0a3
Verbose option disabled
Searching file 'pdf_file.pdf' for ISBN numbers...
Extracted ISBNs:
9789580158448
1000100111

The search for ISBNs starts in the first pages of the document to increase the likelihood that the first extracted ISBN is the correct one. Then the last pages are analyzed in reverse. Finally, the rest of the pages are searched.

Thus, in this example, the first extracted ISBN is the correct one associated with the book since it was found in the first page.
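The search order described above (first pages, then last pages in reverse, then the rest) can be sketched as a simple reordering of page indices. The function name page_search_order and the head and tail sizes are hypothetical; the number of leading and trailing pages actually used by the script is not stated in this document.

```python
def page_search_order(num_pages, head=7, tail=3):
    """Return 0-based page indices in the order they would be searched.

    Hypothetical helper: 'head' and 'tail' are illustrative defaults,
    not values taken from the pyebooktools source.
    """
    head_end = min(head, num_pages)
    # The tail never overlaps the pages already covered by the head.
    tail_start = max(head_end, num_pages - tail)
    first_pages = list(range(head_end))
    last_pages_reversed = list(range(num_pages - 1, tail_start - 1, -1))
    remaining = list(range(head_end, tail_start))
    return first_pages + last_pages_reversed + remaining


print(page_search_order(10, head=3, tail=2))  # [0, 1, 2, 9, 8, 3, 4, 5, 6, 7]
```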

The last sequence, 1000100111, was found in the middle of the document. It is not the book's ISBN: it is technically valid as an ISBN, but it is the wrong one, and the regular expression isbn_blacklist_regex didn't catch it. Maybe it is a binary sequence that is part of a problem in a book about digital systems.
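To see why a sequence such as 1000100111 counts as technically valid, it helps to look at the standard ISBN check-digit rules. The sketch below implements the usual ISBN-10 and ISBN-13 checksums. The function names are illustrative and this is not the validation code from lib.py.

```python
DIGITS = "0123456789"


def is_valid_isbn10(isbn):
    """ISBN-10: digits weighted 10 down to 1 must sum to a multiple of 11."""
    if len(isbn) != 10:
        return False
    total = 0
    for position, char in enumerate(isbn):
        if char in DIGITS:
            value = int(char)
        elif char in "Xx" and position == 9:
            value = 10  # 'X' stands for 10 and is only allowed as check digit
        else:
            return False
        total += (10 - position) * value
    return total % 11 == 0


def is_valid_isbn13(isbn):
    """ISBN-13: digits weighted 1, 3, 1, 3, ... must sum to a multiple of 10."""
    if len(isbn) != 13 or any(char not in DIGITS for char in isbn):
        return False
    total = sum(int(char) * (1 if index % 2 == 0 else 3)
                for index, char in enumerate(isbn))
    return total % 10 == 0


def is_valid_isbn(candidate):
    """Check a candidate after removing hyphens and spaces."""
    digits = candidate.replace("-", "").replace(" ", "")
    return is_valid_isbn10(digits) or is_valid_isbn13(digits)


print(is_valid_isbn("9789580158448"))  # True
print(is_valid_isbn("1000100111"))     # True
```

For 1000100111 the weighted sum is 10 + 6 + 3 + 2 + 1 = 22, a multiple of 11, so the checksum alone cannot reject it. That is why a blacklist of implausible sequences is needed on top of the checksum.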

Uninstall

To uninstall the pyebooktools package:

$ pip uninstall pyebooktools

ℹ️

When uninstalling the pyebooktools package, you might be informed that the configuration files logging.py and config.py won't be removed by pip. You can remove those files manually by noting their paths returned by pip. Or you can leave them so your saved settings can be re-used the next time you re-install the package.

Example: uninstall the package and remove the config files

$ pip uninstall pyebooktools
Found existing installation: pyebooktools 0.1.0a3
Uninstalling pyebooktools-0.1.0a3:
  Would remove:
    /Users/test/miniconda3/envs/ebooktools_py37/bin/ebooktools
    /Users/test/miniconda3/envs/ebooktools_py37/lib/python3.7/site-packages/pyebooktools-0.1.0a3.dist-info/*
    /Users/test/miniconda3/envs/ebooktools_py37/lib/python3.7/site-packages/pyebooktools/*
  Would not remove (might be manually added):
    /Users/test/miniconda3/envs/ebooktools_py37/lib/python3.7/site-packages/pyebooktools/configs/config.py
    /Users/test/miniconda3/envs/ebooktools_py37/lib/python3.7/site-packages/pyebooktools/configs/logging.py
Proceed (y/n)? y
  Successfully uninstalled pyebooktools-0.1.0a3
$ rm -r /Users/test/miniconda3/envs/ebooktools_py37/lib/python3.7/site-packages/pyebooktools/

Limitations

The same limitations as for ebook-tools apply to this project:

Automatic organization can be slow - all the scripts are synchronous and single-threaded and metadata lookup by ISBN is not done concurrently. This is intentional so that the execution can be easily traced and so that the online services are not hammered by requests. If you want to optimize the performance, run multiple copies of the script on different folders.
The default setting for isbn_metadata_fetch_order includes two non-standard metadata sources: Goodreads and WorldCat xISBN. For best results, install the plugins (1, 2) for them in calibre and fine-tune the settings for metadata sources in the calibre GUI.

Security and safety

Important security and safety tips from the ebook-tools documentation:

Please keep in mind that this is beta-quality software. To avoid data loss, make sure that you have a backup of any files you want to organize. You may also want to run the scripts with the --dry-run or --symlink-only option the first time to make sure that they would do what you expect them to do.
Also keep in mind that these shell scripts parse and extract complex arbitrary media and archive files and pass them to other external programs written in memory-unsafe languages. This is not very safe and specially-crafted malicious ebook files can probably compromise your system when you use these scripts. If you are cautious and want to organize untrusted or unknown ebook files, use something like QubesOS or at least do it in a separate VM/jail/container/etc.

NOTE: --dry-run and --symlink-only can be applied to the following subcommands:

organize
rename
split: only --dry-run is applicable
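The difference between moving a file, symlinking it and doing a dry run can be sketched with a small helper. The function place_file is hypothetical and written only for this illustration; in particular, the choice to let dry_run take precedence over symlink_only is an assumption based on the description of --dry-run (no file operation is executed).

```python
import os
import shutil
from pathlib import Path


def place_file(src: Path, dst: Path, dry_run: bool = False,
               symlink_only: bool = False) -> str:
    """Move src to dst, symlink it, or only report what would be done.

    Hypothetical helper, not part of the pyebooktools API.
    """
    action = "symlink" if symlink_only else "move"
    if dry_run:
        # Nothing touches the filesystem in a dry run.
        return f"(dry run) would {action} {src} -> {dst}"
    dst.parent.mkdir(parents=True, exist_ok=True)
    if symlink_only:
        # The original file stays in place; dst only points to it.
        os.symlink(src.resolve(), dst)
    else:
        shutil.move(str(src), str(dst))
    return f"{action} {src} -> {dst}"


print(place_file(Path("in/book.pdf"), Path("out/book.pdf"), dry_run=True))
```

Running once with dry_run=True and reading the reported actions before a real run follows the safety advice above.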

Roadmap

Starting from first priority tasks

Short-term

Port all ebook-tools shell scripts into Python
- ~~organize-ebooks.sh~~: done, see organize_ebooks.py
- interactive-organizer.sh
- ~~find-isbns.sh~~: done, see find_isbns.py
- ~~convert-to-txt.sh~~: done, see convert_to_txt.py
- ~~rename-calibre-library.sh~~: done, see rename_calibre_library.py
- ~~split-into-folders.sh~~: done, see split_into_folders.py
Status: only interactive-organizer.sh remaining, will port later
Add cache support when converting files to txt
Status: working on it since it is also needed for my other project search-ebooks, which makes heavy use of pyebooktools
Test on linux
Create a docker image for this project

Medium-term

Add tests on Travis CI
Eventually add documentation on Read the Docs
Add a fix subcommand that will try to fix corrupted PDF files based on one of the following utilities:
- ~~gs: Ghostscript~~; done, see fix_file_for_corruption()
- pdftocairo: from Poppler
- mutool: it does not "print" the PDF file
- cpdf
It will also check PDF files based on one of the following utilities:
- pdfinfo
- pdftotext
- qpdf
- jhove
Add a remove subcommand that can remove annotations (incl. highlights, comments, notes, arrows), bookmarks, attachments and metadata from PDF files based on the cpdf utility
NOTE: pdftk can also remove annotations

Credits

Special thanks to na--, the developer of ebook-tools, for having made these very useful tools. I learned a lot (especially bash) while porting them to Python.
Thanks to all the developers of the different programs used by this project such as calibre, Tesseract, text converters (djvutxt and pdftotext) and many other utilities!