Command-line program for organizing and managing ebook collections. It is a Python port from the original shell scripts ebook-tools
This project (version 0.1.0a3) is a Python port of ebook-tools, which is written in Shell by na--. The Python script ebooktools.py is a collection of tools for automated organization and management of large ebook collections.
Check also my other project search-ebooks, which is based on pyebooktools, for searching through the content and metadata of ebooks.
Check organize-ebooks, which is the Python port of organize-ebooks.sh and includes a Docker image for easy installation of all needed dependencies and the Python package.
The ebooktools.py script is a Python port of the shell scripts from ebook-tools and makes use of the following modules:
- edit_config.py edits a configuration file, which can be either the main config file that contains all the options defined below or the logging config file whose default values are defined in default_logging.py. The edit subcommand from the ebooktools.py script uses this module.
- convert_to_txt.py converts the supplied file to a text file. It can optionally also use OCR for .pdf, .djvu and image files. The convert subcommand from the ebooktools.py script uses this module.
- find_isbns.py tries to find valid ISBNs inside a file, or in a string if no file was specified. Searching for ISBNs in files uses progressively more resource-intensive methods until some ISBNs are found. For more details see:
  - the documentation for ebook-tools (shell scripts), or
  - search_file_for_isbns() from lib.py (the Python function where the ISBN search in files is implemented).
  The find subcommand from the ebooktools.py script uses this module.
- organize_ebooks.py is used to automatically organize folders with potentially huge amounts of unorganized ebooks. This is done by renaming the files with proper names and moving them to other folders:
  - By default it searches the supplied ebook files for ISBNs, downloads the book metadata (author, title, series, publication date, etc.) from online sources like Goodreads, Amazon and Google Books and renames the files according to a specified template.
  - If no ISBN is found, the script can optionally search for the ebooks online by their title and author, which are extracted from the filename or file metadata.
  - Optionally an additional file that contains all the gathered ebook metadata can be saved together with the renamed book so it can later be used for additional verification, indexing or processing.
  - Most ebook types are supported: .epub, .mobi, .azw, .pdf, .djvu, .chm, .cbr, .cbz, .txt, .lit, .rtf, .doc, .docx, .pdb, .html, .fb2, .lrf, .odt, .prc and potentially others. Even compressed ebooks in arbitrary archive files are supported. For example a .zip, .rar or other archive file that contains the .pdf or .html chapters of an ebook can be organized without a problem.
  - Optical character recognition (OCR [Wikipedia]) can be automatically used for .pdf, .djvu and image files when no ISBNs were found in them by the fast and straightforward conversion to .txt. This is very useful for scanned ebooks that only contain images or were badly OCR-ed in the first place.
  - Files are checked for corruption (zero-filled files, broken pdfs, corrupt archives, etc.) and corrupt files can optionally be moved to another folder.
  - Non-ebook documents, pamphlets and pamphlet-like documents like saved webpages, short pdfs, etc. can also be detected and optionally moved to another folder.
  Ref.: [ORG]
  The organize subcommand from the ebooktools.py script uses this module.
- rename_calibre_library.py traverses a calibre library folder and renames all the book files in it by reading their metadata from calibre's metadata.opf files. Then the book files are either moved or symlinked to an output folder along with their corresponding metadata files. The rename subcommand from the ebooktools.py script uses this module.
- split_into_folders.py splits the supplied ebook files (and the accompanying metadata files if present) into folders with consecutive names that each contain the specified number of files. The split subcommand from the ebooktools.py script uses this module.
Thus, you have access to various subcommands from within the ebooktools.py script.
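To make the split_into_folders.py description more concrete, here is a minimal sketch of grouping files into consecutively named folders. It is not the package's implementation: the function name plan_split, the start_number and pad parameters, and the zero-padded folder names are assumptions made for this example, and the function only computes a plan without touching the filesystem.

```python
from pathlib import Path


def plan_split(files, files_per_folder, start_number=0, pad=6):
    """Return a {folder_name: [file_names]} plan; nothing is moved on disk.

    Hypothetical helper: the zero-padded folder naming is an assumption
    for illustration, not necessarily what pyebooktools does.
    """
    if files_per_folder < 1:
        raise ValueError("files_per_folder must be >= 1")
    # Sort by file name so the plan is deterministic.
    ordered = sorted(Path(f).name for f in files)
    plan = {}
    for i in range(0, len(ordered), files_per_folder):
        # Consecutive folder names: 000000, 000001, ...
        folder = str(start_number + i // files_per_folder).zfill(pad)
        plan[folder] = ordered[i:i + files_per_folder]
    return plan


if __name__ == "__main__":
    books = ["c.epub", "a.pdf", "b.mobi", "e.djvu", "d.txt"]
    for folder, names in plan_split(books, files_per_folder=2).items():
        print(folder, names)
```

Computing a plan first and applying it separately is one way to support a dry run: the plan can be printed and reviewed before any file is moved.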
⭐
- ebook-tools is the original Shell project I ported to Python. I used the same names for the script options (short and long versions) so that if you used the shell scripts, you will easily know how to run the corresponding subcommand with the given options.
- ebooktools.py is the name of the Python script, which is always referred to that way in this document (i.e. no hyphen and ending with .py) to distinguish it from the original Shell project ebook-tools.
- pyebooktools is the name of the Python package that you need to install to have access to the ebooktools.py script.
To install the ebooktools.py script, follow these steps:
- Platforms: macOS [soon linux]
- Python: >= 3.6
- lxml >= 4.4 for parsing calibre's metadata.opf files.
ℹ️
When installing the pyebooktools package below, the lxml library is automatically installed if it is not found, or upgraded to the correct supported version.
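For readers unfamiliar with calibre's metadata.opf files, the following sketch shows the kind of parsing involved. It is only an illustration: it uses the standard library's xml.etree.ElementTree instead of lxml so that it runs without extra dependencies, and the read_opf function and the sample document are made up for this example rather than taken from pyebooktools.

```python
import xml.etree.ElementTree as ET

# Namespaces used by OPF metadata files (OPF package + Dublin Core).
NS = {
    "opf": "http://www.idpf.org/2007/opf",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Made-up sample; a real metadata.opf from calibre contains more fields.
SAMPLE_OPF = """<package xmlns="http://www.idpf.org/2007/opf" version="2.0">
  <metadata xmlns:dc="http://purl.org/dc/elements/1.1/">
    <dc:title>Example Title</dc:title>
    <dc:creator>Jane Doe</dc:creator>
    <dc:creator>John Roe</dc:creator>
  </metadata>
</package>"""


def read_opf(xml_text):
    """Extract the title and the list of authors from OPF XML text."""
    root = ET.fromstring(xml_text)
    metadata = root.find("opf:metadata", NS)
    if metadata is None:
        return {"title": None, "authors": []}
    title = metadata.findtext("dc:title", default=None, namespaces=NS)
    authors = [el.text for el in metadata.findall("dc:creator", NS) if el.text]
    return {"title": title, "authors": authors}


if __name__ == "__main__":
    print(read_opf(SAMPLE_OPF))
```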
As explained in the documentation for ebook-tools, you need recent versions of:
- calibre for fetching metadata from online sources, conversion to txt (for ISBN searching) and ebook metadata extraction. Versions 2.84 and above are preferred because of their ability to manually specify from which specific online source we want to fetch metadata. For earlier versions you have to set isbn_metadata_fetch_order and organize_without_isbn_sources to empty strings.
- p7zip for ISBN searching in ebooks that are in archives.
- Tesseract for running OCR on books. Version 4 gives better results even though it's still in alpha. OCR is disabled by default and another engine can be configured if preferred.
- Optionally poppler, catdoc and DjVuLibre can be installed for faster than calibre's conversion of .pdf, .doc and .djvu files respectively to .txt.
  ⚠️ On macOS, you don't need catdoc since it has the built-in textutil command-line tool that converts any txt, html, rtf, rtfd, doc, docx, wordml, odt, or webarchive file.
- Optionally the Goodreads and WorldCat xISBN calibre plugins can be installed for better metadata fetching.
⭐
If you only install calibre among these dependencies, you can still have a functioning program that will organize and manage your ebook collections:
- fetching metadata from online sources will work: by default calibre comes with Amazon and Google sources among others
- conversion to txt will work: calibre's own ebook-convert tool will be used
All subcommands should work but accuracy and performance will be affected as explained in the list of dependencies above.
First install the Python dependencies and other tools.
It is highly recommended to install the pyebooktools package in a virtual environment using for example venv or conda.
Make sure to update pip:
$ pip install --upgrade pip
Install the pyebooktools package (bleeding-edge version) with pip:
$ pip install git+https://github.com/raul23/pyebooktools#egg=pyebooktools
Make sure that pip is working with the correct Python version. It might be the case that pip is using Python 2.x. You can find what Python version pip uses with the following:
$ pip -V
If pip is working with the wrong Python version, then try to use pip3, which works with Python 3.x.
Test installation
Test your installation by importing pyebooktools and printing its version:
$ python -c "import pyebooktools; print(pyebooktools.__version__)"
You can also test that you have access to the ebooktools.py script by showing the program's version:
$ ebooktools --version
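The two checks above can also be done from Python without running the program. The sketch below is an illustration only: check_install is a hypothetical helper name, and it merely reports whether a package can be located for import and whether a script is found on PATH.

```python
import importlib.util
import shutil


def check_install(package="pyebooktools", script="ebooktools"):
    """Report whether the package is importable and the script is on PATH."""
    return {
        # find_spec returns None when a top-level package cannot be located.
        "package_found": importlib.util.find_spec(package) is not None,
        # shutil.which returns the script's full path, or None if not found.
        "script_path": shutil.which(script),
    }


if __name__ == "__main__":
    print(check_install())
```

If package_found is True but script_path is None, the package was probably installed in an environment whose bin directory is not on PATH.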
All of the options documented below can either be passed to the ebooktools.py script via command-line arguments or via the configuration file config.py, which is created along with the logging config file logging.py when the ebooktools.py script is run the first time with any of the subcommands defined below. The default values for these config files are taken from default_config.py and default_logging.py, respectively.
In order to use the parameters found in the configuration file config.py, use the --use-config flag. With this flag you don't need to type a long command line in the terminal. See the edit subcommand to know how to edit this configuration file.
Most arguments are not required and if nothing is specified, the default values defined in the default config file default_config.py will be used.
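The precedence described above (defaults, then command-line arguments, or the config file alone when --use-config is given) can be sketched as follows. This is a simplified model under stated assumptions, not the package's code: resolve_options is a hypothetical function, and options are represented as plain dictionaries where None means "not given on the command line".

```python
def resolve_options(defaults, config_file, cli_args, use_config=False):
    """Merge option sources following the precedence described in the text.

    Assumption for this sketch: all three sources are plain dicts, and a
    value of None in cli_args means the option was not given.
    """
    options = dict(defaults)
    if use_config:
        # With --use-config, command-line arguments are ignored and only
        # the values from config.py override the defaults.
        options.update(config_file)
    else:
        # Otherwise, only the arguments actually given override the defaults.
        options.update({k: v for k, v in cli_args.items() if v is not None})
    return options


if __name__ == "__main__":
    defaults = {"dry_run": False, "quiet": False}
    config_file = {"dry_run": True}
    cli_args = {"quiet": True, "dry_run": None}
    print(resolve_options(defaults, config_file, cli_args))
    print(resolve_options(defaults, config_file, cli_args, use_config=True))
```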
The ebooktools.py script consists of various subcommands for the organization and management of ebook collections. The usage pattern for running one of the subcommands is as follows:
ebooktools {edit,convert,find,organize,rename,split} [OPTIONS]
where [OPTIONS] includes general options (as defined in the General options section) and options specific to the subcommand (as defined in the Script usage, subcommands and options section).
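A command line of this shape, one program with several subcommands sharing some general options, is commonly built with argparse subparsers. The sketch below shows that technique only; it is not a copy of the ebooktools.py parser. The subcommand names and the -q, --verbose and -d flags come from this document, while build_parser and the way the shared options are attached are assumptions.

```python
import argparse

SUBCOMMANDS = ("edit", "convert", "find", "organize", "rename", "split")


def build_parser():
    """Build a parser with one subparser per subcommand (illustrative only)."""
    parser = argparse.ArgumentParser(prog="ebooktools")
    # A few of the general options, shared by every subcommand through
    # a parent parser.
    common = argparse.ArgumentParser(add_help=False)
    common.add_argument("-q", "--quiet", action="store_true")
    common.add_argument("--verbose", action="store_true")
    common.add_argument("-d", "--dry-run", action="store_true")
    subparsers = parser.add_subparsers(dest="subcommand")
    for name in SUBCOMMANDS:
        # Each subparser gets its own -h/--help automatically.
        subparsers.add_parser(name, parents=[common])
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args(["split", "--dry-run"])
    print(args.subcommand, args.dry_run)
```

With this layout, the top-level -h lists the subcommands while each subcommand's -h lists only its own options.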
In order to avoid data loss, use the --dry-run or --symlink-only option when running some of the subcommands (e.g. rename and split) to make sure that they would do what you expect them to do, as explained in the Security and safety section.
Most of these options are part of the common library lib.py and may affect some or all of the subcommands.
-h, --help; no config variable; default value False
Show the help message and exit.
-v, --version; no config variable; default value False
Show program's version number and exit.
-q, --quiet; config variable quiet; default value False
Enable quiet mode, i.e. nothing will be printed.
--verbose; config variable verbose; default value False
Print various debugging information, e.g. print traceback when there is an exception.
-u, --use-config; no config variable; default value False
If this is enabled, the parameters found in the main config file config.py will be used instead of the command-line arguments.
ℹ️
Note that any other command-line argument that you use in the terminal with the --use-config flag is ignored, i.e. only the parameters defined in the main config file config.py will be used.
-d, --dry-run; config variable dry_run; default value False
If this is enabled, no file rename/move/symlink/etc. operations will actually be executed.
--sl, --symlink-only; config variable symlink_only; default value False
Instead of moving the ebook files, create symbolic links to them.
--km, --keep-metadata; config variable keep_metadata; default value False
Do not delete the gathered metadata for the organized ebooks, instead save it in an accompanying file together with each renamed book. It is very useful for additional verification, indexing or processing at a later date. [KM]
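To illustrate how the --dry-run and --symlink-only options change a file operation, here is a minimal sketch. The move_or_link function and its return strings are hypothetical and written for this example; the actual logic in lib.py may differ. The sketch assumes a platform where creating symbolic links is permitted.

```python
import os
import shutil
import tempfile


def move_or_link(src, dst, dry_run=False, symlink_only=False):
    """Move src to dst, honoring dry-run and symlink-only modes.

    Returns a short description of what was done (or would be done).
    """
    if dry_run:
        # Nothing is touched on disk in dry-run mode.
        action = "symlink" if symlink_only else "move"
        return "(dry run) would {} {} -> {}".format(action, src, dst)
    os.makedirs(os.path.dirname(dst) or ".", exist_ok=True)
    if symlink_only:
        # The original file stays in place; only a link is created.
        os.symlink(os.path.abspath(src), dst)
        return "symlinked"
    shutil.move(src, dst)
    return "moved"


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "book.pdf")
        open(src, "w").close()
        dst = os.path.join(tmp, "organized", "book.pdf")
        print(move_or_link(src, dst, dry_run=True))
        print(move_or_link(src, dst, symlink_only=True))
```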
The usage pattern for running a given subcommand is the following:
ebooktools {edit,convert,find,organize,rename,split} [OPTIONS]
where [OPTIONS] includes general options and options specific to the subcommand as shown below.
ℹ️
Don't forget the name of the Python script ebooktools before the subcommand.
All subcommands are affected by the following global options:
The -h, --help option can be applied specifically to each subcommand or to the ebooktools.py script (when called without the subcommand). Thus when you want the help message for a specific subcommand, you do:
ebooktools {edit,convert,find,split} -h
which will show you the options that affect the chosen subcommand.
And if you want the help message for the whole ebooktools.py script:
ebooktools -h
which will show you the project description and the description of each subcommand without showing the subcommand options.
More examples can be found at examples.rst.
To convert a pdf file to text with OCR:
$ ebooktools convert --ocr always -o converted.txt pdf_to_convert.pdf
By setting --ocr to always, the pdf file will be first OCRed before trying the simple conversion tools (pdftotext or calibre's ebook-convert if the former command is not found).
Output:
Running pyebooktools v0.1.0a3
Verbose option disabled
OCR=always, first try OCR then conversion
Will run OCR on file 'pdf_to_convert.pdf' with 1 page...
OCR successful!
Find ISBNs in a pdf file:
$ ebooktools find pdf_file.pdf
Output:
Running pyebooktools v0.1.0a3
Verbose option disabled
Searching file 'pdf_file.pdf' for ISBN numbers...
Extracted ISBNs:
9789580158448
1000100111
The search for ISBNs starts in the first pages of the document to increase the likelihood that the first extracted ISBN is the correct one. Then the last pages are analyzed in reverse. Finally, the rest of the pages are searched.
Thus, in this example, the first extracted ISBN is the correct one associated with the book since it was found in the first page.
The last sequence, 1000100111, was found in the middle of the document and is not the book's ISBN, even though it is a technically valid ISBN that the regular expression isbn_blacklist_regex didn't catch. Maybe it is a binary sequence that is part of a problem in a book about digital systems.
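Two ideas from this example can be sketched in a few lines: the page search order (first pages, then last pages in reverse, then the rest) and why 1000100111 counts as "technically valid". The code below is an illustration under assumptions, not the implementation of search_file_for_isbns(): the function names and the default numbers of first and last pages are made up, while the checksum rules are the standard ISBN-10 and ISBN-13 ones.

```python
DIGITS = "0123456789"


def page_search_order(num_pages, first=5, last=3):
    """Return 0-based page indices: first pages, last pages reversed, rest.

    The defaults for `first` and `last` are arbitrary for this sketch.
    """
    first = min(first, num_pages)
    last = min(last, num_pages - first)
    head = list(range(first))
    tail = list(range(num_pages - 1, num_pages - 1 - last, -1))
    middle = list(range(first, num_pages - last))
    return head + tail + middle


def is_valid_isbn10(s):
    """ISBN-10 check: weighted sum (weights 10..1) divisible by 11."""
    if len(s) != 10 or any(c not in DIGITS for c in s[:9]):
        return False
    if s[9] not in DIGITS + "Xx":
        return False
    values = [int(c) for c in s[:9]] + [10 if s[9] in "Xx" else int(s[9])]
    return sum(w * v for w, v in zip(range(10, 0, -1), values)) % 11 == 0


def is_valid_isbn13(s):
    """ISBN-13 check: alternating weights 1 and 3, sum divisible by 10."""
    if len(s) != 13 or any(c not in DIGITS for c in s):
        return False
    return sum(int(c) * (3 if i % 2 else 1) for i, c in enumerate(s)) % 10 == 0


if __name__ == "__main__":
    print(page_search_order(10, first=3, last=2))
    print(is_valid_isbn13("9789580158448"), is_valid_isbn10("1000100111"))
```

Both sequences from the output above pass their checksum, which is why a checksum alone cannot reject 1000100111 and why the order in which pages are searched matters for picking the right ISBN.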
To uninstall the pyebooktools package:
$ pip uninstall pyebooktools
ℹ️
When uninstalling the pyebooktools package, you might be informed that the configuration files logging.py and config.py won't be removed by pip. You can remove those files manually by noting their paths returned by pip. Or you can leave them so your saved settings can be re-used the next time you re-install the package.
Example: uninstall the package and remove the config files
$ pip uninstall pyebooktools
Found existing installation: pyebooktools 0.1.0a3
Uninstalling pyebooktools-0.1.0a3:
  Would remove:
    /Users/test/miniconda3/envs/ebooktools_py37/bin/ebooktools
    /Users/test/miniconda3/envs/ebooktools_py37/lib/python3.7/site-packages/pyebooktools-0.1.0a3.dist-info/*
    /Users/test/miniconda3/envs/ebooktools_py37/lib/python3.7/site-packages/pyebooktools/*
  Would not remove (might be manually added):
    /Users/test/miniconda3/envs/ebooktools_py37/lib/python3.7/site-packages/pyebooktools/configs/config.py
    /Users/test/miniconda3/envs/ebooktools_py37/lib/python3.7/site-packages/pyebooktools/configs/logging.py
Proceed (y/n)? y
  Successfully uninstalled pyebooktools-0.1.0a3
$ rm -r /Users/test/miniconda3/envs/ebooktools_py37/lib/python3.7/site-packages/pyebooktools/
The same limitations as for ebook-tools apply to this project too:
- Automatic organization can be slow: all the scripts are synchronous and single-threaded and metadata lookup by ISBN is not done concurrently. This is intentional so that the execution can be easily traced and so that the online services are not hammered by requests. If you want to optimize the performance, run multiple copies of the script on different folders.
- The default setting for isbn_metadata_fetch_order includes two non-standard metadata sources: Goodreads and WorldCat xISBN. For best results, install the plugins (1, 2) for them in calibre and fine-tune the settings for metadata sources in the calibre GUI.
Important security and safety tips from the ebook-tools documentation:
Please keep in mind that this is beta-quality software. To avoid data loss, make sure that you have a backup of any files you want to organize. You may also want to run the scripts with the --dry-run or --symlink-only option the first time to make sure that they would do what you expect them to do.
Also keep in mind that these shell scripts parse and extract complex arbitrary media and archive files and pass them to other external programs written in memory-unsafe languages. This is not very safe and specially-crafted malicious ebook files can probably compromise your system when you use these scripts. If you are cautious and want to organize untrusted or unknown ebook files, use something like QubesOS or at least do it in a separate VM/jail/container/etc.
NOTE: --dry-run and --symlink-only can be applied to the following subcommands:
Starting from first priority tasks
- Port all ebook-tools shell scripts into Python
  - organize-ebooks.sh: done, see organize_ebooks.py
  - interactive-organizer.sh
  - find-isbns.sh: done, see find_isbns.py
  - convert-to-txt.sh: done, see convert_to_txt.py
  - rename-calibre-library.sh: done, see rename_calibre_library.py
  - split-into-folders.sh: done, see split_into_folders.py
  Status: only interactive-organizer.sh remaining, will port later
- Add cache support when converting files to txt
  Status: working on it since it is also needed for my other project search-ebooks, which makes heavy use of pyebooktools
- Test on linux
- Create a docker image for this project
- Add tests on Travis CI
- Eventually add documentation on Read the Docs
- Add a fix subcommand that will try to fix corrupted PDF files based on one of the following utilities (done, see fix_file_for_corruption()):
  - gs: Ghostscript
  - pdftocairo: from Poppler
  - mutool: it does not "print" the PDF file
  - cpdf
  It will also check PDF files based on one of the following utilities:
  - pdfinfo
  - pdftotext
  - qpdf
  - jhove
- Add a remove subcommand that can remove annotations (incl. highlights, comments, notes, arrows), bookmarks, attachments and metadata from PDF files based on the cpdf utility
  NOTE: pdftk can also remove annotations
- Special thanks to na--, the developer of ebook-tools, for having made these very useful tools. I learned a lot (especially bash) while porting them to Python.
- Thanks to all the developers of the different programs used by this project such as calibre, Tesseract, text converters (djvutxt and pdftotext) and many other utilities!
This program is licensed under the GNU General Public License v3.0. For more details see the LICENSE file in the repository.
[IBR] | https://github.com/na--/ebook-tools#options-related-to-extracting-isbns-from-files-and-finding-metadata-by-isbn |
[IDGF] | https://github.com/na--/ebook-tools#options-related-to-extracting-isbns-from-files-and-finding-metadata-by-isbn |
[IIF] | https://github.com/na--/ebook-tools#options-related-to-extracting-isbns-from-files-and-finding-metadata-by-isbn |
[KM] | https://github.com/na--/ebook-tools#general-control-flags |
[MFO] | https://github.com/na--/ebook-tools#options-related-to-extracting-isbns-from-files-and-finding-metadata-by-isbn |
[OCR] | https://github.com/na--/ebook-tools#options-for-ocr |
[OCRC] | https://github.com/na--/ebook-tools#options-for-ocr |
[OCROP] | https://github.com/na--/ebook-tools#options-for-ocr |
[OFT] | https://github.com/na--/ebook-tools#options-related-to-the-input-and-output-files |
[OME] | https://github.com/na--/ebook-tools#options-related-to-the-input-and-output-files |
[ORG] | https://github.com/na--/ebook-tools#ebook-tools |
[ORGD] | https://github.com/na--/ebook-tools#description |
[OWI] | https://github.com/na--/ebook-tools#specific-options-for-organizing-files |
[OWIS] | https://github.com/na--/ebook-tools#options-related-to-extracting-and-searching-for-non-isbn-metadata |
[PIF] | https://github.com/na--/ebook-tools#specific-options-for-organizing-files |
[RCL] | https://bit.ly/3sPJ9kT |
[RFFG] | https://github.com/na--/ebook-tools#options-related-to-extracting-isbns-from-files-and-finding-metadata-by-isbn |
[SM] | https://bit.ly/3sPJ9kT |
[TI] | https://github.com/na--/ebook-tools#options-related-to-extracting-and-searching-for-non-isbn-metadata |
[TML] | https://github.com/na--/ebook-tools#options-related-to-extracting-and-searching-for-non-isbn-metadata |
[WII] | https://github.com/na--/ebook-tools#specific-options-for-organizing-files |