DataCrawl-AI/datacrawl
A simple and efficient web crawler for Python.
- Crawl web pages and extract links starting from a root URL recursively
- Concurrent workers and custom delay
- Handle relative and absolute URLs
- Designed with simplicity in mind, making it easy to use and extend for various web crawling tasks
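As an illustration of the relative and absolute URL handling listed above, here is a minimal sketch that uses only the Python standard library. The helper name `resolve_link` is hypothetical and is not part of this package's API; it only shows one common way to turn links found on a page into absolute URLs.

```python
from urllib.parse import urljoin, urlparse


def resolve_link(page_url, href):
    """Return an absolute http(s) URL for a link found on page_url, or None.

    Hypothetical helper for illustration only; not taken from this package.
    """
    # urljoin leaves absolute URLs untouched and resolves relative ones
    # against the URL of the page they were found on.
    absolute = urljoin(page_url, href)
    # Skip schemes a web crawler cannot fetch, such as mailto: or javascript:.
    if urlparse(absolute).scheme not in ("http", "https"):
        return None
    return absolute


page = "https://github.com/solutions/"
print(resolve_link(page, "ci-cd"))                        # relative path
print(resolve_link(page, "/pricing"))                     # root-relative path
print(resolve_link(page, "https://githubuniverse.com/"))  # already absolute
print(resolve_link(page, "mailto:someone@example.com"))   # not crawlable
```

Resolving every link against the page it came from keeps the crawl frontier in a single, comparable form, which makes it easier to detect pages that were already visited.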
Install using pip:
pip install tiny-web-crawler
    from tiny_web_crawler import Spider
    from tiny_web_crawler import SpiderSettings

    settings = SpiderSettings(
        root_url='http://github.com',
        max_links=2
    )

    spider = Spider(settings)
    spider.start()

    # Set workers and delay (default: delay is 0.5 sec and verbose is True)
    # If you do not want delay, set delay=0

    settings = SpiderSettings(
        root_url='https://github.com',
        max_links=5,
        max_workers=5,
        delay=1,
        verbose=False
    )

    spider = Spider(settings)
    spider.start()
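To give a rough idea of what settings such as `max_links`, `max_workers` and `delay` control, here is a self-contained sketch of a breadth-first crawl over a small in-memory link graph. It is not this package's implementation: `FAKE_SITE`, `fetch_links` and `crawl` are hypothetical names, no network requests are made, and the delay defaults to 0 here so the example runs instantly.

```python
import time
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory "site" so the sketch runs without network access.
FAKE_SITE = {
    "https://example.com/": ["https://example.com/a", "https://example.com/b"],
    "https://example.com/a": ["https://example.com/b"],
    "https://example.com/b": [],
}


def fetch_links(url, delay):
    """Stand-in for an HTTP fetch: wait `delay` seconds, then return the page's links."""
    time.sleep(delay)
    return FAKE_SITE.get(url, [])


def crawl(root_url, max_links=5, max_workers=5, delay=0.0):
    """Crawl level by level until `max_links` pages have been visited."""
    results = {}
    frontier = [root_url]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while frontier and len(results) < max_links:
            # Pick unvisited pages, never exceeding the max_links budget.
            batch = []
            for url in frontier:
                if url in results or url in batch:
                    continue
                if len(results) + len(batch) >= max_links:
                    break
                batch.append(url)
            if not batch:
                break
            # Fetch the whole batch concurrently; map() preserves input order.
            link_lists = list(pool.map(lambda u: fetch_links(u, delay), batch))
            frontier = []
            for url, links in zip(batch, link_lists):
                results[url] = {"urls": links}
                frontier.extend(links)
    return results


crawl_result = crawl("https://example.com/", max_links=2)
print(list(crawl_result))
```

In this sketch the worker count bounds how many pages are fetched at once, while the delay is applied per request, so a larger delay slows each worker rather than the pool as a whole.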
Crawled output sample for https://github.com:
    {
        "http://github.com": {
            "urls": [
                "http://github.com/",
                "https://githubuniverse.com/",
                "..."
            ],
            "https://github.com/solutions/ci-cd": {
                "urls": [
                    "https://github.com/solutions/ci-cd/",
                    "https://githubuniverse.com/",
                    "..."
                ]
            }
        }
    }
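A result shaped like the sample above can be post-processed with the standard library. The sketch below assumes the output is available as a JSON string and that each page entry carries a `urls` list; the helper `iter_page_links` and the trimmed sample (with the `"..."` placeholders dropped) are illustrative and not part of the package.

```python
import json

# Trimmed copy of the sample output above, used as illustrative input.
sample_output = """
{
  "http://github.com": {
    "urls": ["http://github.com/", "https://githubuniverse.com/"],
    "https://github.com/solutions/ci-cd": {
      "urls": ["https://github.com/solutions/ci-cd/", "https://githubuniverse.com/"]
    }
  }
}
"""


def iter_page_links(node):
    """Yield (page_url, linked_url) pairs, descending into nested page entries."""
    for page_url, value in node.items():
        if not isinstance(value, dict):
            continue
        for link in value.get("urls", []):
            yield page_url, link
        # Any key other than "urls" is treated as a nested page entry.
        nested = {k: v for k, v in value.items() if k != "urls"}
        yield from iter_page_links(nested)


pairs = list(iter_page_links(json.loads(sample_output)))
pages = sorted({page for page, _ in pairs})
unique_links = sorted({link for _, link in pairs})

print(pages)
print(unique_links)
```

Because the helper walks the structure recursively, it works whether page entries sit side by side at the top level or are nested as in the sample.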
Thank you for considering contributing.
- If you are a first-time contributor, you can pick a good-first-issue and get started.
- Please feel free to ask questions.
- Before starting work on an issue, please get it assigned to you so that we can avoid multiple people working on the same issue.
- We are working on our first major release. Please check this issue and see if anything interests you.
- Install poetry on your system: pipx install poetry
- Clone the repo you forked
- Create a venv or use poetry shell
- Run poetry install --with dev
- Run pre-commit install
- Run pre-commit install --hook-type pre-push
- An issue exists or is created that the PR addresses
- Tests are written for the changes
- All lint checks and tests pass