World’s single largest Internet domains dataset
tb0hdan/domains
Domains Project: Processing petabytes of data so you don't have to
Phish is a new monitoring service for battling phishing attacks.
LLMC is a text-based URL classifier that uses Ollama under the hood.
This public dataset contains a freely available, sorted list of Internet domains.
You can support this project by doing any combination of the following:
- Posting a link on your website to DomainsProject
- Opening an issue and attaching other domain datasets that are not here yet (be sure to scroll through this README first)
- Publishing research work and linking to DomainsProject
- Sponsoring this project. See Subscriptions
- 10 Million
- 100 Million
- 1 Billion
- 1.7 Billion (GitHub)
- 2.1 Billion (Patreon only)
- 2.3 Billion (Patreon only)
- 2.4 Billion (Patreon only)
- 500TB
- 925TB
- 1PB
- 1.3PB
- 1.5PB
- 5.7PB
- 8.1PB
- More than 1TB of Internet traffic is just 3 Mbytes of compressed data
- 1 million domains is just 5 Mbytes compressed
- More than 5.7PB of Internet traffic is necessary to crawl 1.7 billion domains (3.4TB / 1 million).
- Only 4.6 GB of disk space is required to store 1.7 billion domains in compressed form
- A fully saturated 1 Gbit link is good for about 2 million new domains every day
- A machine with 8 cores/16 threads and 64 Gbytes of RAM is good for about 2 million new domains every day
- 2 ISC Bind9 instances (>400 Mbytes RSS each) are required to get 2 million new domains every day
- After reaching 9 million domains, the repository was switched to compressed files. Please use the freely available XZ to unpack files.
- After reaching 30 million records, files were moved to `/data` so the repository doesn't have its README at the very bottom.
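
The traffic figure in the list above can be cross-checked with a little arithmetic. The sketch below uses only the numbers quoted there (3.4 TB per million domains, 1.7 billion domains, about 2 million new domains per day) and assumes decimal units (1 PB = 1000 TB). The function names are illustrative, and the day count is a hypothetical extrapolation that assumes a constant rate on a single machine.

```python
# Back-of-the-envelope check of the figures quoted in the list above.
TB_PER_MILLION_DOMAINS = 3.4        # "3.4TB / 1 million"
TOTAL_DOMAINS_MILLIONS = 1700       # 1.7 billion domains
NEW_DOMAINS_PER_DAY_MILLIONS = 2    # "about 2 million new domains every day"


def crawl_traffic_pb(domains_millions, tb_per_million=TB_PER_MILLION_DOMAINS):
    """Estimated crawl traffic in PB, assuming 1 PB = 1000 TB."""
    return domains_millions * tb_per_million / 1000


def days_to_collect(domains_millions, per_day_millions=NEW_DOMAINS_PER_DAY_MILLIONS):
    """Days needed at a constant daily rate (an idealized assumption)."""
    return domains_millions / per_day_millions


print(f"{crawl_traffic_pb(TOTAL_DOMAINS_MILLIONS):.2f} PB")   # 5.78 PB
print(f"{days_to_collect(TOTAL_DOMAINS_MILLIONS):.0f} days")  # 850 days
```

The 5.78 PB result agrees with the "more than 5.7PB" quoted in the list.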
This repository employs Git LFS technology, therefore the user has to use both `git lfs` and `xz` to retrieve data. The cloning procedure is as follows:

    git clone https://github.com/tb0hdan/domains.git
    cd domains
    git lfs install
    ./unpack.sh
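
If unpacking everything with `./unpack.sh` is not desirable, a single compressed list can also be streamed directly. Here is a minimal sketch using Python's standard `lzma` module. It assumes the LFS objects have already been fetched and that each list is XZ-compressed text with one domain per line. The helper name and the demo file name are made up for illustration, and the demo writes its own tiny archive so it runs without the dataset.

```python
import lzma
import tempfile
from pathlib import Path
from typing import Iterator


def iter_domains_xz(path) -> Iterator[str]:
    """Yield domains from an XZ-compressed, one-domain-per-line text file."""
    with lzma.open(path, "rt", encoding="utf-8") as fh:
        for line in fh:
            domain = line.strip()
            if domain:  # skip blank lines
                yield domain


# Self-contained demo: write a tiny archive, then stream it back.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo-domains.txt.xz"  # hypothetical file name
    with lzma.open(demo_path, "wt", encoding="utf-8") as fh:
        fh.write("1tv.af\n8am.af\n\nacb.af\n")
    demo_domains = list(iter_domains_xz(demo_path))

print(demo_domains)  # ['1tv.af', '8am.af', 'acb.af']
```

Because the function is a generator, memory use stays flat regardless of how large the list is.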
Subscribers have access to raw data, which is available at https://dataset.domainsproject.org
Some other available features:
- TLD only
- Websocket for new domains
- DNS JSON (with historical data)
wget -m https://dataset.domainsproject.org
After unpacking, domain lists are just text files (~49 GB at 1.7 billion) with one domain per line. Sample for `data/afghanistan/domain2multi-af.txt`:

    1tv.af
    1tvnews.af
    3rdeye.af
    8am.af
    aan.af
    acaa.gov.af
    acb.af
    acbr.gov.af
    acci.org.af
    ach.af
    acku.edu.af
    acsf.af
    adras.af
    aeiti.af
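
Since each list is plain text with one domain per line, it can be processed as a stream without loading the whole file. As a rough illustration, the sketch below groups the sample above by everything after the first label. `count_by_suffix` is a hypothetical helper, not part of this repository, and the grouping is purely textual (it does not consult the Public Suffix List).

```python
from collections import Counter
from typing import Iterable


def count_by_suffix(lines: Iterable[str]) -> Counter:
    """Count domains by the part after their first label."""
    counts = Counter()
    for line in lines:
        domain = line.strip().lower()
        if not domain:
            continue  # ignore blank lines
        _, _, suffix = domain.partition(".")
        counts[suffix or domain] += 1  # single-label names count as themselves
    return counts


# The sample lines shown above; any open text file works the same way.
sample = [
    "1tv.af", "1tvnews.af", "3rdeye.af", "8am.af", "aan.af",
    "acaa.gov.af", "acb.af", "acbr.gov.af", "acci.org.af", "ach.af",
    "acku.edu.af", "acsf.af", "adras.af", "aeiti.af",
]
suffix_counts = count_by_suffix(sample)
for suffix, n in suffix_counts.most_common():
    print(suffix, n)
```

For a real file, passing the open file object (for example `count_by_suffix(open(path, encoding="utf-8"))`) keeps memory use proportional to the number of distinct suffixes rather than the file size.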
Domains Project uses a crawler and DNS checks to get new domains.
The DNS checks client is in its early stages and is used by a select few. It is called Freya, and I'm working on making it stable and good enough for the general public.
The HTTP crawler is being rewritten as well. It is called Idun.
Typical user agent for Domains Project bot looks like this:
Mozilla/5.0 (compatible; Domains Project/1.0.8; +https://domainsproject.org)
Some older versions have it set to the GitHub repo:
Mozilla/5.0 (compatible; Domains Project/1.0.4; +https://github.com/tb0hdan/domains)
All data in this dataset is gathered using the Scrapy and Colly frameworks.
Starting with version 1.0.7, the crawler has partial `robots.txt` support and rate limiting. Please open an issue if you experience any problems. Don't forget to include your domain.
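
For site operators going through access logs, the user agent strings shown earlier and the 1.0.7 cutoff above can be combined into a quick check. The following is a sketch only: the regular expression is inferred from the two example strings in this README, and the helper names are hypothetical.

```python
import re
from typing import Optional, Tuple

# Matches the "Domains Project/<version>" token seen in the example user agents.
UA_PATTERN = re.compile(r"Domains Project/(\d+(?:\.\d+)*)")
ROBOTS_TXT_SINCE = (1, 0, 7)  # partial robots.txt support starts here, per this README


def bot_version(user_agent: str) -> Optional[Tuple[int, ...]]:
    """Return the bot version as a tuple of ints, or None if it is not this bot."""
    match = UA_PATTERN.search(user_agent)
    if match is None:
        return None
    return tuple(int(part) for part in match.group(1).split("."))


def has_robots_txt_support(user_agent: str) -> Optional[bool]:
    """True/False for Domains Project bots, None for any other user agent."""
    version = bot_version(user_agent)
    if version is None:
        return None
    return version >= ROBOTS_TXT_SINCE


new_ua = "Mozilla/5.0 (compatible; Domains Project/1.0.8; +https://domainsproject.org)"
old_ua = "Mozilla/5.0 (compatible; Domains Project/1.0.4; +https://github.com/tb0hdan/domains)"
print(bot_version(new_ua), has_robots_txt_support(new_ua))  # (1, 0, 8) True
print(bot_version(old_ua), has_robots_txt_support(old_ua))  # (1, 0, 4) False
```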
Add this to your `robots.txt`:

    User-agent: domainsproject.org
    Disallow: /

or this:

    User-agent: Domains Project
    Disallow: /

The bot checks for both.
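
Before deploying, the rules above can be sanity-checked locally. The sketch below uses Python's standard `urllib.robotparser`. Note that it reflects how Python's parser matches agent names, which is an assumption about, not a description of, how the Domains Project bot evaluates the file. `is_blocked` and the example URL are illustrative.

```python
from urllib.robotparser import RobotFileParser

# Both rule groups from above, combined into one robots.txt.
ROBOTS_TXT = """\
User-agent: domainsproject.org
Disallow: /

User-agent: Domains Project
Disallow: /
"""


def is_blocked(robots_txt: str, agent: str, url: str = "https://example.com/") -> bool:
    """Return True if `agent` may not fetch `url` under the given robots.txt."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(agent, url)


for agent in ("domainsproject.org", "Domains Project", "SomeOtherBot"):
    print(agent, "blocked" if is_blocked(ROBOTS_TXT, agent) else "allowed")
```

Other crawlers stay unaffected because the file contains no `User-agent: *` group.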
Yacy is a great open-source search engine. Here's my post on the Yacy forum: https://searchlab.eu/t/domain-list-for-easier-search-bootstrapping/231
Rapid7 Sonar FDNS - no longer open
List of .FR domains from AfNIC.fr
bigdatanews extract from Common Crawl (circa 2012)
Common Crawl - March/April 2020
The CAIDA UCSD IPv4 Routed /24 DNS Names Dataset - January/July 2019
This dataset can be used for research. There are papers that cover different topics. I'm just going to leave links to them here for reference.
Understanding and Characterizing the Adoption of Internationalized Domain Names in Practice
Phishing Protection SPF, DKIM, DMARC
Email address analysis (Czech)
Proteus: A Self-Designing Range Filter
Large Scale String Analytics in Arkouda
Fake Phishing: Setup, detection, and take-down
Cloudy with a Chance of Cyberattacks: Dangling Resources Abuse on Cloud Platforms
Drupal and the Open Web in the Australian Government - 2022 edition
The Internet of Names: A DNS Big Dataset
Enabling Network Security Through Active DNS Datasets
Analysis of the Internet Domain Names Re-registration Market
Detection of malicious domains through lexical analysis
Malicious Domain Names Detection Algorithm Based on Lexical Analysis and Feature Quantification