courlan 1.3.2
pip install courlan
Clean, filter and sample URLs to optimize data collection – includes spam, content type and language filters.
coURLan: Clean, filter, normalize, and sample URLs
Why coURLan?
"It is important for the crawler to visit 'important' pages first,so that the fraction of the Web that is visited (and kept up to date)is more meaningful." (Cho et al. 1998)
"Given that the bandwidth for conducting crawls is neither infinitenor free, it is becoming essential to crawl the Web in not only ascalable, but efficient way, if some reasonable measure of quality orfreshness is to be maintained." (Edwards et al. 2001)
This library provides an additional "brain" for web crawling, scraping and document management. It facilitates web navigation through a set of filters, enhancing the quality of resulting document collections:
- Save bandwidth and processing time by steering clear of pages deemed low-value
- Identify specific pages based on language or text content
- Pinpoint pages relevant for efficient link gathering
Additional utilities needed include URL storage, filtering, and deduplication.
Features
Separate the wheat from the chaff and optimize document discovery and retrieval:
- URL handling
- Validation
- Normalization
- Sampling
- Heuristics for link filtering
- Spam, trackers, and content-types
- Locales and internationalization
- Web crawling (frontier, scheduling)
- Data store specifically designed for URLs
- Usable with Python or on the command-line
Let the coURLan fish up juicy bits for you!
Here is a courlan (source: Limpkin at Harn's Marsh by Russ, CC BY 2.0).
Installation
This package is compatible with all common versions of Python; it is tested on Linux, macOS and Windows systems.
Courlan is available on the package repository PyPI and can notably be installed with the Python package manager pip:

```
$ pip install courlan # pip3 install on systems where both Python 2 and 3 are installed
$ pip install --upgrade courlan # to make sure you have the latest version
$ pip install git+https://github.com/adbar/courlan.git # latest available code (see build status above)
```

The last version to support Python 3.6 and 3.7 is courlan==1.2.0.
Python
Most filters revolve around the strict and language arguments.
check_url()
All useful operations chained in check_url(url):

```python
>>> from courlan import check_url

# return url and domain name
>>> check_url('https://github.com/adbar/courlan')
('https://github.com/adbar/courlan', 'github.com')

# filter out bogus domains
>>> check_url('http://666.0.0.1/')
>>>

# tracker removal
>>> check_url('http://test.net/foo.html?utm_source=twitter#gclid=123')
('http://test.net/foo.html', 'test.net')

# use strict for further trimming
>>> my_url = 'https://httpbin.org/redirect-to?url=http%3A%2F%2Fexample.org'
>>> check_url(my_url, strict=True)
('https://httpbin.org/redirect-to', 'httpbin.org')

# check for redirects (HEAD request)
>>> url, domain_name = check_url(my_url, with_redirects=True)

# include navigation pages instead of discarding them
>>> check_url('http://www.example.org/page/10/', with_nav=True)

# remove trailing slash
>>> check_url('https://github.com/adbar/courlan/', trailing_slash=False)
```

Language-aware heuristics, notably internationalization in URLs, are available in lang_filter(url, language):
```python
# optional language argument
>>> url = 'https://www.un.org/en/about-us'

# success: returns clean URL and domain name
>>> check_url(url, language='en')
('https://www.un.org/en/about-us', 'un.org')

# failure: doesn't return anything
>>> check_url(url, language='de')
>>>

# optional argument: strict
>>> url = 'https://en.wikipedia.org/'
>>> check_url(url, language='de', strict=False)
('https://en.wikipedia.org', 'wikipedia.org')
>>> check_url(url, language='de', strict=True)
>>>
```

Define stricter restrictions on the expected content type with strict=True. This also blocks certain platforms and page types where machines get lost.
```python
# strict filtering: blocked as it is a major platform
>>> check_url('https://www.twitch.com/', strict=True)
>>>
```

Sampling by domain name
```python
>>> from courlan import sample_urls
>>> my_urls = ['https://example.org/' + str(x) for x in range(100)]
>>> my_sample = sample_urls(my_urls, 10)
# optional: exclude_min=None, exclude_max=None, strict=False, verbose=False
```

Web crawling and URL handling
Link extraction and preprocessing:
```python
>>> from courlan import extract_links
>>> doc = '<html><body><a href="test/link.html">Link</a></body></html>'
>>> url = "https://example.org"
>>> extract_links(doc, url)
{'https://example.org/test/link.html'}
# other options: external_bool, no_filter, language, strict, redirects, ...
```

The filter_links() function provides additional filters for crawling purposes: use of robots.txt rules and link prioritization. See courlan.core for details.
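As a sketch of how this might be wired into a crawl, assuming filter_links() takes the HTML string plus its URL and returns regular and priority links; the exact signature and return value should be checked in courlan.core:

```python
>>> from courlan.core import filter_links

>>> doc = '<html><body><a href="test/link.html">Link</a></body></html>'
# assumption: returns a tuple of (links, priority links) for crawl scheduling
>>> links, links_priority = filter_links(doc, 'https://example.org')
```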
Determine if a link leads to another host:
```python
>>> from courlan import is_external
>>> is_external('https://github.com/', 'https://www.microsoft.com/')
True
# default
>>> is_external('https://google.com/', 'https://www.google.co.uk/', ignore_suffix=True)
False
# taking suffixes into account
>>> is_external('https://google.com/', 'https://www.google.co.uk/', ignore_suffix=False)
True
```

Other useful functions dedicated to URL handling:
- extract_domain(url, fast=True): find domain and subdomain, or just the domain with fast=False
- get_base_url(url): strip the URL of some of its parts
- get_host_and_path(url): decompose URLs in two parts: protocol + host/domain, and path
- get_hostinfo(url): extract domain and host info (protocol + host/domain)
- fix_relative_urls(baseurl, url): prepend necessary information to relative links
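For instance, extract_domain() can be used on its own; the output below mirrors the domain name returned by check_url() for the same host earlier:

```python
>>> from courlan import extract_domain
>>> extract_domain('https://httpbin.org/redirect-to')
'httpbin.org'
```

The remaining helpers in action: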
```python
>>> from courlan import *
>>> url = 'https://www.un.org/en/about-us'
>>> get_base_url(url)
'https://www.un.org'
>>> get_host_and_path(url)
('https://www.un.org', '/en/about-us')
>>> get_hostinfo(url)
('un.org', 'https://www.un.org')
>>> fix_relative_urls('https://www.un.org', 'en/about-us')
'https://www.un.org/en/about-us'
```

Other filters dedicated to crawl frontier management:
- is_not_crawlable(url): check for deep web or pages generally not usable in a crawling context
- is_navigation_page(url): check for navigation and overview pages
```python
>>> from courlan import is_navigation_page, is_not_crawlable
>>> is_navigation_page('https://www.randomblog.net/category/myposts')
True
>>> is_not_crawlable('https://www.randomblog.net/login')
True
```

See also the URL management page of the Trafilatura documentation.
Python helpers
Helper function, scrub and normalize:
```python
>>> from courlan import clean_url
>>> clean_url('HTTPS://WWW.DWDS.DE:80/')
'https://www.dwds.de'
```

Basic scrubbing only:
```python
>>> from courlan import scrub_url
```
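A minimal sketch of what scrubbing alone could look like, assuming it trims stray whitespace around the input; an illustrative case, not an exhaustive account of the function:

```python
>>> scrub_url('  https://www.dwds.de  ')  # assumed behavior: strip surrounding whitespace
'https://www.dwds.de'
```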
Basic canonicalization/normalization only, i.e. modifying and standardizing URLs in a consistent manner:

```python
>>> from urllib.parse import urlparse
>>> from courlan import normalize_url
>>> my_url = normalize_url(urlparse(my_url))
# passing URL strings directly also works
>>> my_url = normalize_url(my_url)
# remove unnecessary components and re-order query elements
>>> normalize_url('http://test.net/foo.html?utm_source=twitter&post=abc&page=2#fragment', strict=True)
'http://test.net/foo.html?page=2&post=abc'
```

Basic URL validation only:
```python
>>> from courlan import validate_url
>>> validate_url('http://1234')
(False, None)
>>> validate_url('http://www.example.org/')
(True, ParseResult(scheme='http', netloc='www.example.org', path='/', params='', query='', fragment=''))
```

Troubleshooting
Courlan uses an internal cache to speed up URL parsing. It can be reset as follows:
```python
>>> from courlan.meta import clear_caches
>>> clear_caches()
```

UrlStore class
The UrlStore class allows for storing and retrieving domain-classified URLs: a URL like https://example.org/path/testpage is stored as the path /path/testpage within the domain https://example.org. It features the following methods; short usage sketches follow the listings below.
URL management
- add_urls(urls=[], appendleft=None, visited=False): Add a list of URLs to the (possibly) existing one. Optional: append certain URLs to the left, specify if the URLs have already been visited.
- add_from_html(htmlstring, url, external=False, lang=None, with_nav=True): Extract and filter links in an HTML string.
- discard(domains): Declare domains void and prune the store.
- dump_urls(): Return a list of all known URLs.
- print_urls(): Print all URLs in store (URL + TAB + visited or not).
- print_unvisited_urls(): Print all unvisited URLs in store.
- get_all_counts(): Return all download counts for the hosts in store.
- get_known_domains(): Return all known domains as a list.
- get_unvisited_domains(): Find all domains for which there are unvisited URLs.
- total_url_number(): Find the number of all URLs in store.
- is_known(url): Check if the given URL has already been stored.
- has_been_visited(url): Check if the given URL has already been visited.
- filter_unknown_urls(urls): Take a list of URLs and return the currently unknown ones.
- filter_unvisited_urls(urls): Take a list of URLs and return the currently unvisited ones.
- find_known_urls(domain): Get all already known URLs for the given domain (e.g. https://example.org).
- find_unvisited_urls(domain): Get all unvisited URLs for the given domain.
- reset(): Re-initialize the URL store.
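A minimal usage sketch based on the methods listed above; outputs are indicative, as ordering and normalization may differ:

```python
>>> from courlan import UrlStore

>>> url_store = UrlStore()
>>> url_store.add_urls(['https://example.org/path/testpage', 'https://example.org/?p=100'])
>>> url_store.is_known('https://example.org/path/testpage')
True
>>> url_store.total_url_number()
2
>>> url_store.get_known_domains()
['https://example.org']
```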
Crawling and downloads
- get_url(domain): Retrieve a single URL and consider it to be visited (with corresponding timestamp).
- get_rules(domain): Return the stored crawling rules for the given website.
- store_rules(website, rules=None): Store crawling rules for a given website.
- get_crawl_delay(): Return the delay as extracted from robots.txt, or a given default.
- get_download_urls(max_urls=100, time_limit=10): Get a list of immediately downloadable URLs according to the given time limit per domain.
- establish_download_schedule(max_urls=100, time_limit=10): Get up to the specified number of URLs along with a suitable backoff schedule (in seconds).
- download_threshold_reached(threshold): Find out if the download limit (in seconds) has been reached for one of the websites in store.
- unvisited_websites_number(): Return the number of websites for which there are still URLs to visit.
- is_exhausted_domain(domain): Tell if all known URLs for the website have been visited.
Persistence
- write(filename): Save the store to disk.
- load_store(filename): Read a UrlStore from disk (separate function, not a class method).
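A sketch of a save/restore round trip; the import path of load_store is an assumption to verify against the package:

```python
>>> url_store.write('urls.backup')         # save the current store to disk
>>> from courlan import load_store         # assumption: exposed at the top level
>>> url_store = load_store('urls.backup')  # rebuild the store from the file
```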
Optional settings:
- compressed=True: activate compression of URLs and rules
- language=XX: focus on a particular target language (two-letter code)
- strict=True: stricter URL filtering
- verbose=True: dump URLs if interrupted (requires use of signal)
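Finally, a sketch combining these settings with the crawling helpers listed earlier; the returned URL and its ordering are illustrative assumptions:

```python
>>> from courlan import UrlStore

>>> url_store = UrlStore(compressed=True, language='en', strict=True)
>>> url_store.add_urls(['https://example.org/en/page1', 'https://example.org/en/page2'])

# retrieve one URL for the domain and mark it as visited
>>> url_store.get_url('https://example.org')
'https://example.org/en/page1'

# another URL remains unvisited, so the domain is not exhausted yet
>>> url_store.is_exhausted_domain('https://example.org')
False
```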
Command-line
The main functions are also available through a command-line utility:
```
$ courlan --inputfile url-list.txt --outputfile cleaned-urls.txt
$ courlan --help
usage: courlan [-h] -i INPUTFILE -o OUTPUTFILE [-d DISCARDEDFILE] [-v]
               [-p PARALLEL] [--strict] [-l LANGUAGE] [-r] [--sample SAMPLE]
               [--exclude-max EXCLUDE_MAX] [--exclude-min EXCLUDE_MIN]

Command-line interface for Courlan

options:
  -h, --help            show this help message and exit

I/O:
  Manage input and output

  -i INPUTFILE, --inputfile INPUTFILE
                        name of input file (required)
  -o OUTPUTFILE, --outputfile OUTPUTFILE
                        name of output file (required)
  -d DISCARDEDFILE, --discardedfile DISCARDEDFILE
                        name of file to store discarded URLs (optional)
  -v, --verbose         increase output verbosity
  -p PARALLEL, --parallel PARALLEL
                        number of parallel processes (not used for sampling)

Filtering:
  Configure URL filters

  --strict              perform more restrictive tests
  -l LANGUAGE, --language LANGUAGE
                        use language filter (ISO 639-1 code)
  -r, --redirects       check redirects

Sampling:
  Use sampling by host, configure sample size

  --sample SAMPLE       size of sample per domain
  --exclude-max EXCLUDE_MAX
                        exclude domains with more than n URLs
  --exclude-min EXCLUDE_MIN
                        exclude domains with less than n URLs
```

License
coURLan is distributed under the Apache 2.0 license.
Versions prior to v1 were under GPLv3+ license.
Settings
courlan is optimized for English and German but its generic approach is also usable in other contexts.
Details of strict URL filtering can be reviewed and changed in the file settings.py. To override the default settings, clone the repository and re-install the package locally.
Contributing
Contributions are welcome!
Feel free to file issues on the dedicated page.
Author
Developed with practical applications of academic research in mind, this software is part of a broader effort to derive information from web documents. Extracting and pre-processing web texts to the exacting standards of scientific research presents a substantial challenge. This software package simplifies text data collection and enhances corpus quality; it is currently used to build text databases for research.
- Barbaresi, A. "Trafilatura: A Web Scraping Library and Command-Line Tool for Text Discovery and Extraction." Proceedings of ACL/IJCNLP 2021: System Demonstrations, 2021, pp. 122-131.
Contact: see homepage.
Software ecosystem: see this graphic.
Similar work
These Python libraries perform similar handling and normalization tasks but do not entail language or content filters. They also do not primarily focus on crawl optimization.
References
- Cho, J., Garcia-Molina, H., & Page, L. (1998). Efficient crawling through URL ordering. Computer Networks and ISDN Systems, 30(1-7), 161–172.
- Edwards, J., McCurley, K. S., & Tomlin, J. A. (2001). An adaptive model for optimizing performance of an incremental web crawler. In Proceedings of the 10th International Conference on World Wide Web (WWW '01), pp. 106–113.