Download FREE eBook every day from www.packtpub.com
This crawler automates the following steps:
- access your private account
- claim the daily free eBook and weekly Newsletter
- parse title, description and useful information
- download your favorite format: .pdf .epub .mobi
- download source code and book cover
- upload files to Google Drive, OneDrive or via scp
- store data on Firebase
- notify via Gmail, IFTTT, Join or Pushover (on success and errors)
- schedule daily job on Heroku or with Docker
# upload pdf to googledrive, store data and notify via email
python script/spider.py -c config/prod.cfg -u googledrive -s firebase -n gmail
# download all format
python script/spider.py --config config/prod.cfg --all

# download only one format: pdf|epub|mobi
python script/spider.py --config config/prod.cfg --type pdf

# download also additional material: source code (if exists) and book cover
python script/spider.py --config config/prod.cfg -t pdf --extras
# equivalent (default is pdf)
python script/spider.py -c config/prod.cfg -e

# download and then upload to Google Drive (given the download url anyone can download it)
python script/spider.py -c config/prod.cfg -t epub --upload googledrive
python script/spider.py --config config/prod.cfg --all --extras --upload googledrive

# download and then upload to OneDrive (given the download url anyone can download it)
python script/spider.py -c config/prod.cfg -t epub --upload onedrive
python script/spider.py --config config/prod.cfg --all --extras --upload onedrive

# download and notify: gmail|ifttt|join|pushover
python script/spider.py -c config/prod.cfg --notify gmail

# only claim book (no downloads)
python script/spider.py -c config/prod.cfg --notify gmail --claimOnly
Before you start, you should:
- Verify that your currently installed version of Python is 2.x with
python --version
- Clone the repository
git clone https://github.com/niqdev/packtpub-crawler.git
- Install all the dependencies
pip install -r requirements.txt
(see also virtualenv)
- Create a config file
cp config/prod_example.cfg config/prod.cfg
- Change your Packtpub credentials in the config file
[credential]
credential.email=PACKTPUB_EMAIL
credential.password=PACKTPUB_PASSWORD
Now you should be able to claim and download your first eBook
python script/spider.py --config config/prod.cfg
As described in the documentation, the Google Drive API requires OAuth 2.0 for authentication, so to upload files you should:
- Go to Google APIs Console and create a new Google Drive project named PacktpubDrive
- On API manager > Overview menu
- Enable Google Drive API
- On API manager > Credentials menu
- In OAuth consent screen tab set PacktpubDrive as the product name shown to users
- In Credentials tab create credentials of type OAuth client ID, choose Application type Other and name it PacktpubDriveCredentials
- Click Download JSON and save the file as config/client_secrets.json
- Change your Google Drive credentials in the config file
[googledrive]
...
googledrive.client_secrets=config/client_secrets.json
googledrive.gmail=GOOGLE_DRIVE@gmail.com
Now you should be able to upload your eBook to Google Drive
python script/spider.py --config config/prod.cfg --upload googledrive
Only the first time will you be prompted to log in in a browser with JavaScript enabled (no text-based browser) to generate config/auth_token.json. You should also copy the FOLDER_ID into the config, otherwise a new folder with the same name will be created every time.
[googledrive]
...
googledrive.default_folder=packtpub
googledrive.upload_folder=FOLDER_ID
Documentation: OAuth, Quickstart, example and permissions
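For reference, the upload boils down to an authenticated Drive client creating a file in the configured folder. Below is a minimal sketch using PyDrive; this is an assumption for illustration only (spider.py may use a different client library), and the function name is made up.

```python
# Minimal sketch of a Google Drive upload with PyDrive (assumed library).
# It reads the client_secrets.json created above; the first run opens a
# browser for the OAuth consent, later runs reuse the cached token.
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive

def upload_googledrive(local_file, folder_id):
    gauth = GoogleAuth()
    gauth.LoadClientConfigFile('config/client_secrets.json')
    gauth.LocalWebserverAuth()  # browser login on first run only
    drive = GoogleDrive(gauth)

    # create the file inside the configured upload folder
    book = drive.CreateFile({'title': local_file.split('/')[-1],
                             'parents': [{'id': folder_id}]})
    book.SetContentFile(local_file)
    book.Upload()
    return book['id']
```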
As described in the documentation, the OneDrive API requires OAuth 2.0 for authentication, so to upload files you should:
- Go to the Microsoft Application Registration Portal.
- When prompted, sign in with your Microsoft account credentials.
- Find My applications and click Add an app.
- Enter PacktpubDrive as the app's name and click Create application.
- Scroll to the bottom of the page and check the Live SDK support box.
- Change your OneDrive credentials in the config file
- Copy your Application Id into the config file as onedrive.client_id
- Click Generate New Password and copy the password shown into the config file as onedrive.client_secret
- Click Add Platform and select Web
- Enter http://localhost:8080/ as the Redirect URL
- Click Save at the bottom of the page
[onedrive]
...
onedrive.client_id=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
onedrive.client_secret=XxXxXxXxXxXxXxXxXxXxXxX
Now you should be able to upload your eBook to OneDrive
python script/spider.py --config config/prod.cfg --upload onedrive
Only the first time will you be prompted to log in in a browser with JavaScript enabled (no text-based browser) to generate config/session.onedrive.pickle.
[onedrive]
...
onedrive.folder=packtpub
Documentation: Registration, Python API
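For reference, the upload can be done with the onedrivesdk package linked above. The sketch below is an assumption, not the crawler's exact code: the calls follow the onedrivesdk README, the function name is illustrative, and the real script caches the session in the pickle file mentioned below instead of asking for the code every run.

```python
# Minimal sketch of a OneDrive upload with onedrivesdk (assumed usage).
import onedrivesdk

REDIRECT_URI = 'http://localhost:8080/'  # the Redirect URL registered above

def upload_onedrive(client_id, client_secret, local_file, remote_name):
    client = onedrivesdk.get_default_client(
        client_id=client_id,
        scopes=['wl.signin', 'wl.offline_access', 'onedrive.readwrite'])

    # first run only: open the URL, log in, then paste back the "code" parameter
    auth_url = client.auth_provider.get_auth_url(REDIRECT_URI)
    code = raw_input('Visit {} and paste the code here: '.format(auth_url))
    client.auth_provider.authenticate(code, REDIRECT_URI, client_secret)

    # upload the file into the root of the drive
    client.item(drive='me', id='root').children[remote_name].upload(local_file)
```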
To upload your eBook via scp to a remote server, update the configs
[scp]
scp.host=SCP_HOST
scp.user=SCP_USER
scp.password=SCP_PASSWORD
scp.path=SCP_UPLOAD_PATH
Now you should be able to upload your eBook
python script/spider.py --config config/prod.cfg --upload scp
Note:
- the destination folder scp.path on the remote server must exist in advance
- the option --upload scp is incompatible with --store and --notify
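For reference, this kind of upload is a plain SFTP transfer. A minimal sketch with paramiko (the dependency mentioned in the troubleshooting section below), using the [scp] values above; the function name is illustrative and the real script may differ.

```python
# Minimal sketch of the scp upload via paramiko (SFTP under the hood).
import os
import paramiko

def upload_scp(host, user, password, remote_path, local_file):
    client = paramiko.SSHClient()
    client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    client.connect(host, username=user, password=password)

    sftp = client.open_sftp()
    # the remote folder must already exist (see the note above)
    remote_file = remote_path.rstrip('/') + '/' + os.path.basename(local_file)
    sftp.put(local_file, remote_file)
    sftp.close()
    client.close()
```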
Create a new Firebase project, copy the database secret from your settings
https://console.firebase.google.com/project/PROJECT_NAME/settings/database
and update the configs
[firebase]
firebase.database_secret=DATABASE_SECRET
firebase.url=https://PROJECT_NAME.firebaseio.com
Now you should be able to store your eBook details on Firebase
python script/spider.py --config config/prod.cfg --upload googledrive --store firebase
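Under the hood this is a write to the Firebase Realtime Database REST API, authenticated with the database secret. A minimal sketch assuming the requests library; the books node and the function name are illustrative, not necessarily what spider.py uses.

```python
# Minimal sketch of storing book details on Firebase via its REST API.
import json
import requests

def store_firebase(url, database_secret, book_info):
    # e.g. book_info = {'title': '...', 'description': '...', 'url': '...'}
    endpoint = '{}/books.json'.format(url.rstrip('/'))
    response = requests.post(endpoint,
                             params={'auth': database_secret},
                             data=json.dumps(book_info))
    response.raise_for_status()
```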
To send a notification via email using Gmail you should:
- Allow "less secure apps" and "DisplayUnlockCaptcha" on your account
- Troubleshoot sign-in problems and examples
- Change your Gmail credentials in the config file
[gmail]
...
gmail.username=EMAIL_USERNAME@gmail.com
gmail.password=EMAIL_PASSWORD
gmail.from=FROM_EMAIL@gmail.com
gmail.to=TO_EMAIL_1@gmail.com,TO_EMAIL_2@gmail.com
Now you should be able to notify your accounts
python script/spider.py --config config/prod.cfg --notify gmail
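The notification itself is a plain SMTP mail sent through smtp.gmail.com. A minimal sketch with the standard library, using the [gmail] values above; the function and argument names are illustrative.

```python
# Minimal sketch of the Gmail notification over SMTP with STARTTLS.
import smtplib
from email.mime.text import MIMEText

def notify_gmail(username, password, sender, recipients, title, description):
    msg = MIMEText(description)
    msg['Subject'] = title
    msg['From'] = sender
    msg['To'] = ', '.join(recipients)

    server = smtplib.SMTP('smtp.gmail.com', 587)
    server.starttls()
    server.login(username, password)
    server.sendmail(sender, recipients, msg.as_string())
    server.quit()
```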
- Get an account on IFTTT
- Go to your Maker settings and activate the channel
- Create a new applet using the Maker service with the trigger "Receive a web request" and the event name "packtpub-crawler"
- Change your IFTTT key in the config file
[ifttt]
ifttt.event_name=packtpub-crawler
ifttt.key=IFTTT_MAKER_KEY
Now you should be able to trigger the applet
python script/spider.py --config config/prod.cfg --notify ifttt
Value mappings:
- value1: title
- value2: description
- value3: landing page URL
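Triggering the applet is a single POST to the Maker webhook with the three values above. A minimal sketch assuming the requests library; the function name is illustrative.

```python
# Minimal sketch of the IFTTT Maker webhook call.
import requests

def notify_ifttt(event_name, key, title, description, landing_url):
    endpoint = 'https://maker.ifttt.com/trigger/{}/with/key/{}'.format(event_name, key)
    payload = {'value1': title, 'value2': description, 'value3': landing_url}
    requests.post(endpoint, json=payload).raise_for_status()
```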
- Get the Join Chrome extension and/or App
- You can find your device ids here
- (Optional) You can use multiple devices or groups (group.all, group.android, group.chrome, group.windows10, group.phone, group.tablet, group.pc) separated by commas
- Change your Join credentials in the config file
[join]
join.device_ids=DEVICE_IDS_COMMA_SEPARATED_OR_GROUP_NAME
join.api_key=API_KEY
Now you should be able to trigger the event
python script/spider.py --config config/prod.cfg --notify join
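For reference, a Join push is a single request to the public sendPush endpoint. The sketch below assumes the requests library and the endpoint documented by joaoapps; treat the URL and parameter names as assumptions rather than the crawler's exact code.

```python
# Minimal sketch of a Join push notification (assumed endpoint).
import requests

JOIN_PUSH_URL = 'https://joinjoaoapps.appspot.com/_ah/api/messaging/v1/sendPush'

def notify_join(api_key, device_ids, title, text):
    params = {
        'apikey': api_key,
        'deviceIds': device_ids,  # comma-separated ids or a group.* name
        'title': title,
        'text': text,
    }
    requests.get(JOIN_PUSH_URL, params=params).raise_for_status()
```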
- Get your USER_KEY
- Create a new application
- (Optional) Add an icon
- Change your Pushover credentials in the config file
[pushover]
pushover.user_key=PUSHOVER_USER_KEY
pushover.api_key=PUSHOVER_API_KEY
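A Pushover notification is a single POST to the public messages API with the two keys above. A minimal sketch assuming the requests library; the function name is illustrative.

```python
# Minimal sketch of a Pushover notification.
import requests

def notify_pushover(user_key, api_key, title, message):
    payload = {
        'token': api_key,
        'user': user_key,
        'title': title,
        'message': message,
    }
    requests.post('https://api.pushover.net/1/messages.json',
                  data=payload).raise_for_status()
```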
Create a new branch
git checkout -b heroku-scheduler
Update the .gitignore and commit your changes
# remove
config/prod.cfg
config/client_secrets.json
config/auth_token.json

# add
dev/
config/dev.cfg
config/prod_example.cfg
Create, configure and deploy the scheduler
heroku login

# create a new app
heroku create APP_NAME --region eu
# or if you already have an existing app
heroku git:remote -a APP_NAME

# deploy your app
git push -u heroku heroku-scheduler:master
heroku ps:scale clock=1

# useful commands
heroku ps
heroku logs --ps clock.1
heroku logs --tail
heroku run bash
Update script/scheduler.py with your own preferences.
More info about Heroku Scheduler, Clock Processes, Add-on and APScheduler
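For reference, a clock process like script/scheduler.py can be as small as an APScheduler job that shells out to the spider once a day. The sketch below is illustrative, assuming APScheduler and a 9 AM schedule; the actual scheduler.py may look different.

```python
# Minimal sketch of a Heroku clock process with APScheduler.
import subprocess
from apscheduler.schedulers.blocking import BlockingScheduler

scheduler = BlockingScheduler()

@scheduler.scheduled_job('cron', hour=9, minute=0)
def claim_daily_ebook():
    # run the crawler with the production config
    subprocess.call(['python', 'script/spider.py', '--config', 'config/prod.cfg'])

scheduler.start()
```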
Build your image
docker build -t niqdev/packtpub-crawler:2.4.0 .
Run manually
docker run \
  --rm \
  --name my-packtpub-crawler \
  niqdev/packtpub-crawler:2.4.0 \
  python script/spider.py --config config/prod.cfg
Run scheduled crawler in background
docker run \
  --detach \
  --name my-packtpub-crawler \
  niqdev/packtpub-crawler:2.4.0

# useful commands
docker exec -i -t my-packtpub-crawler bash
docker logs -f my-packtpub-crawler
Alternatively you can pull this fork from Docker Hub
docker pull kuchy/packtpub-crawler
Add this to your crontab to run the job daily at 9 AM:
crontab -e

00 09 * * * cd PATH_TO_PROJECT/packtpub-crawler && /usr/bin/python script/spider.py --config config/prod.cfg >> /tmp/packtpub.log 2>&1
Create two files in /etc/systemd/system:
- packtpub-crawler.service
[Unit]
Description=run packtpub-crawler

[Service]
User=USER_THAT_SHOULD_RUN_THE_SCRIPT
ExecStart=/usr/bin/python2.7 PATH_TO_PROJECT/packtpub-crawler/script/spider.py -c config/prod.cfg

[Install]
WantedBy=multi-user.target
- packtpub-crawler.timer
[Unit]
Description=Runs packtpub-crawler every day at 7

[Timer]
OnBootSec=10min
OnActiveSec=1s
OnCalendar=*-*-* 07:00:00
Unit=packtpub-crawler.service
Persistent=true

[Install]
WantedBy=multi-user.target
Enable the timer with sudo systemctl enable packtpub-crawler.timer. You can test the service with sudo systemctl start packtpub-crawler.timer and see the output with sudo journalctl -u packtpub-crawler.service -f.
The script also downloads the free eBooks from the weekly Packtpub newsletter. The URL is generated by a Google Apps Script which parses all the mails. You can get the code here; if you want to see the actual script, please clone the spreadsheet and go to Tools > Script editor....
To use your own source, modify the config
url.bookFromNewsletter=https://goo.gl/kUciut
The URL should point to a file containing only the URL (no semicolons, HTML, JSON, etc).
You can also clone the spreadsheet to use your own Gmail account. Subscribe to the newsletter (at the bottom of the page) and create a filter to tag your mails accordingly.
- ImportError: No module named paramiko
Install paramiko with sudo -H pip install paramiko --ignore-installed
- Failed building wheel for cryptography
Install missing dependencies as described here
# install pip + setuptools
curl https://bootstrap.pypa.io/get-pip.py | python -

# upgrade pip
pip install -U pip

# install virtualenv globally
sudo pip install virtualenv

# create virtualenv
virtualenv env

# activate virtualenv
source env/bin/activate

# verify virtualenv
which python
python --version

# deactivate virtualenv
deactivate
Run a simple static server with
node dev/server.js
and test the crawler with
python script/spider.py --dev --config config/dev.cfg --all
This project is just a Proof of Concept and is not intended for any illegal usage. I'm not responsible for any damage or abuse; use it at your own risk.