Download your daily free Packt Publishing eBook: https://www.packtpub.com/packt/offers/free-learning

License

MIT license

packtpub-crawler

Download a FREE eBook every day from www.packtpub.com

This crawler automates the following steps:

access your private account
claim the daily free eBook and the weekly Newsletter eBook
parse title, description and useful information
download your favorite format: .pdf, .epub or .mobi
download source code and book cover
upload files to Google Drive, OneDrive or via scp
store data on Firebase
notify via email, IFTTT or Join (on success and errors)
schedule a daily job on Heroku or with Docker

Default command

# upload pdf to googledrive, store data and notify via email
python script/spider.py -c config/prod.cfg -u googledrive -s firebase -n gmail

Other options

# download all formats
python script/spider.py --config config/prod.cfg --all
# download only one format: pdf|epub|mobi
python script/spider.py --config config/prod.cfg --type pdf
# also download additional material: source code (if it exists) and book cover
python script/spider.py --config config/prod.cfg -t pdf --extras
# equivalent (default is pdf)
python script/spider.py -c config/prod.cfg -e
# download and then upload to Google Drive (given the download url anyone can download it)
python script/spider.py -c config/prod.cfg -t epub --upload googledrive
python script/spider.py --config config/prod.cfg --all --extras --upload googledrive
# download and then upload to OneDrive (given the download url anyone can download it)
python script/spider.py -c config/prod.cfg -t epub --upload onedrive
python script/spider.py --config config/prod.cfg --all --extras --upload onedrive
# download and notify: gmail|ifttt|join
python script/spider.py -c config/prod.cfg --notify gmail
# only claim book (no downloads):
python script/spider.py -c config/prod.cfg --notify gmail --claimOnly
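
The flags used in the commands above could be declared roughly as follows. This is only a sketch written for Python 3 with `argparse`; it is not taken from `script/spider.py`, and the help texts, defaults other than `pdf`, and the `build_parser` name are assumptions made for illustration.

```python
import argparse


def build_parser():
    """Declare the flags that appear in the example commands (illustrative only)."""
    parser = argparse.ArgumentParser(prog="spider.py")
    parser.add_argument("-c", "--config", required=True, help="path to the .cfg file")
    parser.add_argument("--all", action="store_true", help="download every format")
    parser.add_argument("-t", "--type", choices=["pdf", "epub", "mobi"], default="pdf")
    parser.add_argument("-e", "--extras", action="store_true",
                        help="also download source code and book cover")
    parser.add_argument("-u", "--upload", choices=["googledrive", "onedrive", "scp"])
    parser.add_argument("-s", "--store", choices=["firebase"])
    parser.add_argument("-n", "--notify", choices=["gmail", "ifttt", "join"])
    parser.add_argument("--claimOnly", action="store_true",
                        help="claim the book without downloading it")
    return parser


# Parse one of the command lines from the list above
args = build_parser().parse_args(
    ["-c", "config/prod.cfg", "-t", "epub", "--upload", "googledrive"]
)
print(args.config, args.type, args.upload)
```

Restricting `--type`, `--upload` and `--notify` with `choices` makes a typo fail immediately instead of halfway through a run.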

Basic setup

Before you start you should:

Verify that your currently installed version of Python is 2.x with python --version
Clone the repository: git clone https://github.com/niqdev/packtpub-crawler.git
Install all the dependencies (you might need sudo privilege): pip install -r requirements.txt
Create a config file: cp config/prod_example.cfg config/prod.cfg
Change your Packtpub credentials in the config file

[credential]
credential.email=PACKTPUB_EMAIL
credential.password=PACKTPUB_PASSWORD
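
To see how a section like this can be read back, here is a minimal sketch using Python 3's `configparser`. It is not the crawler's own loader: the `read_credentials` helper is hypothetical, and under the project's Python 2.x runtime the module is called `ConfigParser` and lacks `read_string`.

```python
import configparser

SAMPLE = """
[credential]
credential.email=PACKTPUB_EMAIL
credential.password=PACKTPUB_PASSWORD
"""


def read_credentials(text):
    """Return (email, password) from the [credential] section of `text`."""
    # interpolation=None keeps a literal "%" in a password from being
    # interpreted as a substitution marker
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(text)
    section = parser["credential"]
    return section["credential.email"], section["credential.password"]


email, password = read_credentials(SAMPLE)
print(email)
```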

Now you should be able to claim and download your first eBook

python script/spider.py --config config/prod.cfg

Google Drive

From the documentation, Google Drive API requires OAuth2.0 for authentication, so to upload files you should:

Go to Google APIs Console and create a new Google Drive project named PacktpubDrive
On API manager > Overview menu
- Enable Google Drive API
On API manager > Credentials menu
- In OAuth consent screen tab set PacktpubDrive as the product name shown to users
- In Credentials tab create credentials of type OAuth client ID and choose Application type Other named PacktpubDriveCredentials
Click Download JSON and save the file as config/client_secrets.json
Change your Google Drive credentials in the config file

[googledrive]
...
googledrive.client_secrets=config/client_secrets.json
googledrive.gmail=GOOGLE_DRIVE@gmail.com

Now you should be able to upload your eBook to Google Drive

python script/spider.py --config config/prod.cfg --upload googledrive

The first time only, you will be prompted to log in through a browser with JavaScript enabled (a text-based browser will not work) in order to generate config/auth_token.json. You should also copy the FOLDER_ID into the config, otherwise a new folder with the same name will be created on every run.

[googledrive]
...
googledrive.default_folder=packtpub
googledrive.upload_folder=FOLDER_ID

Documentation: OAuth, Quickstart, example and permissions

OneDrive

From the documentation, OneDrive API requires OAuth2.0 for authentication, so to upload files you should:

Go to the Microsoft Application Registration Portal.
When prompted, sign in with your Microsoft account credentials.
Find My applications and click Add an app.
Enter PacktpubDrive as the app's name and click Create application.
Scroll to the bottom of the page and check the Live SDK support box.
Change your OneDrive credentials in the config file
- Copy your Application Id into the config file to onedrive.client_id
- Click Generate New Password and copy the password shown into the config file to onedrive.client_secret
- Click Add Platform and select Web
- Enter http://localhost:8080/ as the Redirect URL
- Click Save at the bottom of the page

[onedrive]
...
onedrive.client_id=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
onedrive.client_secret=XxXxXxXxXxXxXxXxXxXxXxX

Now you should be able to upload your eBook to OneDrive

python script/spider.py --config config/prod.cfg --upload onedrive

The first time only, you will be prompted to log in through a browser with JavaScript enabled (a text-based browser will not work) in order to generate config/session.onedrive.pickle.

[onedrive]
...
onedrive.folder=packtpub

Documentation: Registration, Python API

Scp

To upload your eBook via scp to a remote server, update the configs

[scp]
scp.host=SCP_HOST
scp.user=SCP_USER
scp.password=SCP_PASSWORD
scp.path=SCP_UPLOAD_PATH

Now you should be able to upload your eBook

python script/spider.py --config config/prod.cfg --upload scp

Note:

the destination folder scp.path on the remote server must exist in advance
the option --upload scp is incompatible with --store and --notify
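
The second note can be expressed as an early check before any download starts. The helper below is a hypothetical sketch for Python 3 and is not part of the crawler; it only illustrates rejecting the unsupported combination up front.

```python
def check_options(upload=None, store=None, notify=None):
    """Raise ValueError for the combination that the note above rules out."""
    if upload == "scp" and (store or notify):
        raise ValueError("--upload scp cannot be combined with --store or --notify")
    return True


check_options(upload="scp")  # allowed
check_options(upload="googledrive", store="firebase", notify="gmail")  # allowed
```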

Firebase

Create a new Firebase project, copy the database secret from your settings

https://console.firebase.google.com/project/PROJECT_NAME/settings/database

and update the configs

[firebase]
firebase.database_secret=DATABASE_SECRET
firebase.url=https://PROJECT_NAME.firebaseio.com

Now you should be able to store your eBook details on Firebase

python script/spider.py --config config/prod.cfg --upload googledrive --store firebase
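
For orientation, the two config values map onto the Firebase Realtime Database REST interface, where a node is addressed as `<url>/<node>.json` and a database secret can be passed in the `auth` query parameter. The sketch below (Python 3, standard library only) prepares such a request without sending it. The `books` node name, the payload fields and the `build_firebase_request` helper are assumptions, not the crawler's actual storage layout.

```python
import json
import urllib.request
from urllib.parse import urlencode


def build_firebase_request(base_url, database_secret, node, payload):
    """Prepare (but do not send) a POST that appends `payload` under `node`."""
    query = urlencode({"auth": database_secret})
    url = "{}/{}.json?{}".format(base_url.rstrip("/"), node, query)
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url, data=body, method="POST",
        headers={"Content-Type": "application/json"},
    )


request = build_firebase_request(
    "https://PROJECT_NAME.firebaseio.com",  # firebase.url
    "DATABASE_SECRET",                      # firebase.database_secret
    "books",                                # hypothetical node name
    {"title": "Example title", "description": "Example description"},
)
# urllib.request.urlopen(request) would perform the actual call
print(request.full_url)
```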

Gmail notification

To send a notification via email using Gmail you should:

Allow "less secure apps" and "DisplayUnlockCaptcha" on your account
Troubleshoot sign-in problems and examples
Change your Gmail credentials in the config file

[gmail]
...
gmail.username=EMAIL_USERNAME@gmail.com
gmail.password=EMAIL_PASSWORD
gmail.from=FROM_EMAIL@gmail.com
gmail.to=TO_EMAIL_1@gmail.com,TO_EMAIL_2@gmail.com

Now you should be able to notify your accounts

python script/spider.py --config config/prod.cfg --notify gmail
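
As a rough illustration of what such a notification involves, the following Python 3 sketch builds a message from the `gmail.from` and comma-separated `gmail.to` values and shows how it could be sent through Gmail's SMTP server. It is not the crawler's own notifier: the subject format and both function names are assumptions.

```python
import smtplib
from email.message import EmailMessage


def build_message(sender, recipients_csv, title, description):
    """Build the notification mail; `recipients_csv` mirrors gmail.to."""
    recipients = [r.strip() for r in recipients_csv.split(",") if r.strip()]
    message = EmailMessage()
    message["Subject"] = "[packtpub-crawler] " + title  # hypothetical subject format
    message["From"] = sender
    message["To"] = ", ".join(recipients)
    message.set_content(description)
    return message


def send(message, username, password):
    """Send `message` via Gmail using STARTTLS on port 587."""
    with smtplib.SMTP("smtp.gmail.com", 587) as server:
        server.starttls()
        server.login(username, password)
        server.send_message(message)


message = build_message(
    "FROM_EMAIL@gmail.com",
    "TO_EMAIL_1@gmail.com,TO_EMAIL_2@gmail.com",
    "Example title",
    "Example description",
)
# send(message, "EMAIL_USERNAME@gmail.com", "EMAIL_PASSWORD") would deliver it
```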

IFTTT notification

Get an account on IFTTT
Go to your Maker settings and activate the channel
Create a new applet using the Maker service with the trigger "Receive a web request" and the event name "packtpub-crawler"
Change your IFTTT key in the config file

[ifttt]
ifttt.event_name=packtpub-crawler
ifttt.key=IFTTT_MAKER_KEY

Now you should be able to trigger the applet

python script/spider.py --config config/prod.cfg --notify ifttt

Value mappings:

value1: title
value2: description
value3: image URL
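
The value mappings above correspond to the JSON body that the Maker (Webhooks) service accepts at `https://maker.ifttt.com/trigger/<event>/with/key/<key>`. Here is a minimal Python 3 sketch that prepares such a request without sending it. The `build_ifttt_request` helper and the sample values are assumptions, not code from the crawler.

```python
import json
import urllib.request


def build_ifttt_request(event_name, key, title, description, image_url):
    """Prepare (but do not send) the web request that triggers the applet."""
    url = "https://maker.ifttt.com/trigger/{}/with/key/{}".format(event_name, key)
    body = json.dumps({
        "value1": title,        # title
        "value2": description,  # description
        "value3": image_url,    # image URL
    })
    return urllib.request.Request(
        url, data=body.encode("utf-8"), method="POST",
        headers={"Content-Type": "application/json"},
    )


ifttt_request = build_ifttt_request(
    "packtpub-crawler", "IFTTT_MAKER_KEY",
    "Example title", "Example description", "https://example.com/cover.png",
)
# urllib.request.urlopen(ifttt_request) would trigger the applet
print(ifttt_request.full_url)
```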

Join notification

Get the Join Chrome extension and/or App
You can find your device ids here
(Optional) You can use multiple devices or groups (group.all, group.android, group.chrome, group.windows10, group.phone, group.tablet, group.pc) separated by comma
Change your Join credentials in the config file

[join]
join.device_ids=DEVICE_IDS_COMMA_SEPARATED_OR_GROUP_NAME
join.api_key=API_KEY

Now you should be able to trigger the event

python script/spider.py --config config/prod.cfg --notify join

Heroku

Create a new branch

git checkout -b heroku-scheduler

Update the .gitignore and commit your changes

# remove
config/prod.cfg
config/client_secrets.json
config/auth_token.json
# add
dev/
config/dev.cfg
config/prod_example.cfg

Create, config and deploy the scheduler

heroku login
# create a new app
heroku create APP_NAME
# or if you already have an existing app
heroku git:remote -a APP_NAME
# deploy your app
git push -u heroku heroku-scheduler:master
heroku ps:scale clock=1
# useful commands
heroku ps
heroku logs --ps clock.1
heroku logs --tail
heroku run bash

Update script/scheduler.py with your own preferences.

More info about Heroku Scheduler, Clock Processes, Add-on and APScheduler
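
The contents of script/scheduler.py are not reproduced here. As a library-free illustration of the "once a day at a fixed time" logic that a clock process needs, the Python 3 sketch below computes the next run time; the `next_run` helper and the 09:00 default are assumptions, and a real deployment would more likely rely on APScheduler as linked above.

```python
from datetime import datetime, time, timedelta


def next_run(now, at=time(9, 0)):
    """Return the first datetime strictly after `now` whose clock time is `at`."""
    candidate = datetime.combine(now.date(), at)
    if candidate <= now:
        # Today's slot has already passed, so schedule for tomorrow
        candidate += timedelta(days=1)
    return candidate


now = datetime.now()
wait_seconds = (next_run(now) - now).total_seconds()
print("next run in", int(wait_seconds), "seconds")
```

A clock process would sleep for `wait_seconds`, run the crawler, and repeat.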

Docker

Build your image

docker build -t niqdev/packtpub-crawler:2.3.0 .

Run manually

docker run \
  --rm \
  --name my-packtpub-crawler \
  niqdev/packtpub-crawler:2.3.0 \
  python script/spider.py --config config/prod.cfg

Run scheduled crawler in background

docker run \
  --detach \
  --name my-packtpub-crawler \
  niqdev/packtpub-crawler:2.3.0
# useful commands
docker exec -i -t my-packtpub-crawler bash
docker logs -f my-packtpub-crawler

Alternatively you can pull this fork from Docker Hub

docker pull kuchy/packtpub-crawler

Cron job

Add this to your crontab to run the job daily at 9 AM:

crontab -e
00 09 * * * cd PATH_TO_PROJECT/packtpub-crawler && /usr/bin/python script/spider.py --config config/prod.cfg >> /tmp/packtpub.log 2>&1

Systemd service

Create two files in /etc/systemd/system:

packtpub-crawler.service

[Unit]
Description=run packtpub-crawler
[Service]
User=USER_THAT_SHOULD_RUN_THE_SCRIPT
ExecStart=/usr/bin/python2.7 PATH_TO_PROJECT/packtpub-crawler/script/spider.py -c config/prod.cfg
[Install]
WantedBy=multi-user.target

packtpub-crawler.timer

[Unit]
Description=Runs packtpub-crawler every day at 7
[Timer]
OnBootSec=10min
OnActiveSec=1s
OnCalendar=*-*-* 07:00:00
Unit=packtpub-crawler.service
Persistent=true
[Install]
WantedBy=multi-user.target

Enable the script with sudo systemctl enable packtpub-crawler.timer. You can test the service with sudo systemctl start packtpub-crawler.timer and see the output with sudo journalctl -u packtpub-crawler.service -f. The unit names used in these commands and in the timer's Unit= entry must match the file names created above.

Newsletter

The script also downloads the free eBooks from the weekly Packtpub newsletter. The URL is generated by a Google Apps Script which parses all the mails. You can get the code here; if you want to see the actual script, please clone the spreadsheet and go to Tools > Script editor....

To use your own source, change this entry in the config

url.bookFromNewsletter=https://goo.gl/kUciut

The URL should point to a file containing only the URL (no semicolons, HTML, JSON, etc).
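
The "only the URL" requirement can be checked before the value is used. The Python 3 sketch below is a hypothetical validator, not the crawler's own parsing: it accepts a single bare http(s) URL and rejects anything wrapped in HTML or JSON or followed by a semicolon.

```python
from urllib.parse import urlparse


def parse_book_url(file_content):
    """Return the URL if the content is one bare http(s) URL, else raise ValueError."""
    text = file_content.strip()
    parsed = urlparse(text)
    is_single_token = len(text.split()) == 1
    is_http_url = parsed.scheme in ("http", "https") and bool(parsed.netloc)
    if not is_single_token or not is_http_url or text.endswith(";"):
        raise ValueError("expected a file containing only a URL")
    return text


print(parse_book_url("https://www.packtpub.com/packt/offers/free-learning\n"))
```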

You can also clone the spreadsheet to use your own Gmail account. Subscribe to the newsletter (at the bottom of the page) and create a filter to tag your mails accordingly.

Troubleshooting

ImportError: No module named paramiko

Install paramiko with sudo -H pip install paramiko --ignore-installed

Failed building wheel for cryptography

Install missing dependencies as described here

Development (only for spidering)

Run a simple static server with

node dev/server.js

and test the crawler with

python script/spider.py --dev --config config/dev.cfg --all

Disclaimer

This project is just a Proof of Concept and not intended for any illegal usage. I'm not responsible for any damage or abuse, use it at your own risk.
