
Domains Project: Processing petabytes of data so you don't have to


Products and services

Phish is a new monitoring service for battling phishing attacks.

LLMC is a text-based URL classifier that uses Ollama under the hood.

World's single largest Internet domains dataset

This public dataset contains a freely available, sorted list of Internet domains.

Dataset statistics

Project news

Support needed!

You can support this project by doing any combination of the following:

  • Posting a link on your website to DomainsProject
  • Opening an issue and attaching other domain datasets that are not here yet (be sure to scroll through this README first)
  • Publishing research work and linking to DomainsProject
  • Sponsoring this project. See Subscriptions

Milestones:

Domains

(Wasted) Internet traffic:

  • 500TB
  • 925TB
  • 1PB
  • 1.3PB
  • 1.5PB
  • 5.7PB
  • 8.1PB

Random facts:

  • More than 1TB of Internet traffic is just 3 Mbytes of compressed data
  • 1 million domains is just 5 Mbytes compressed
  • More than 5.7PB of Internet traffic is necessary to crawl 1.7 billion domains (3.4TB / 1 million; see the sanity check after this list)
  • Only 4.6GB of disk space is required to store 1.7 billion domains in compressed form
  • A fully saturated 1Gbit link is good for about 2 million new domains every day
  • An 8c/16t machine with 64 Gbytes of RAM is good for about 2 million new domains every day
  • 2 ISC Bind9 instances (>400 Mbytes RSS each) are required to get 2 million new domains every day
  • After reaching 9 million domains, the repository was switched to compressed files. Please use the freely available XZ to unpack them.
  • After reaching 30 million records, files were moved to /data so the repository doesn't have its README at the very bottom.
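
The 5.7PB figure follows directly from the per-million cost; a quick sanity check in Python (decimal units assumed):

    # Sanity check for the traffic figures above (decimal units assumed).
    TB = 10**12  # bytes
    PB = 10**15  # bytes

    traffic_per_million = 3.4 * TB       # ~3.4TB of crawl traffic per 1M domains
    domains = 1_700_000_000              # 1.7 billion domains

    per_domain = traffic_per_million / 1_000_000
    total = domains / 1_000_000 * traffic_per_million

    print(f"per domain: {per_domain / 10**6:.1f} MB")   # ~3.4 MB
    print(f"total:      {total / PB:.2f} PB")           # ~5.78 PB, i.e. "more than 5.7PB"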

Used by

CloudSEK

Using dataset

This repository employs Git LFS technology, therefore the user has to use both git lfs and xz to retrieve data. The cloning procedure is as follows:

git clone https://github.com/tb0hdan/domains.git
cd domains
git lfs install
./unpack.sh

Getting unfiltered dataset

Subscribers have access to the raw data, which is available at https://dataset.domainsproject.org

Some other available features:

  • TLD only
  • Websocket for new domains
  • DNS JSON (with historical data)
wget -m https://dataset.domainsproject.org
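
For the websocket feed, a subscriber client could look roughly like the sketch below. The endpoint path and message format are assumptions for illustration only; the actual subscriber endpoint is not documented in this README.

    import asyncio
    import websockets  # third-party: pip install websockets

    # Hypothetical endpoint -- the real URL comes with the subscription.
    FEED_URL = "wss://dataset.domainsproject.org/ws"

    async def follow_new_domains():
        # Assumes the feed pushes one new domain per text message.
        async with websockets.connect(FEED_URL) as ws:
            async for message in ws:
                print(message)

    asyncio.run(follow_new_domains())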

Data format

After unpacking, domain lists are just text files (~49GB at 1.7 billion domains) with one domain per line. Sample for data/afghanistan/domain2multi-af.txt:

1tv.af
1tvnews.af
3rdeye.af
8am.af
aan.af
acaa.gov.af
acb.af
acbr.gov.af
acci.org.af
ach.af
acku.edu.af
acsf.af
adras.af
aeiti.af
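
Since the files are plain one-domain-per-line text, consuming them needs nothing special; a minimal Python sketch, assuming a cloned and unpacked repository:

    # Stream a country file line by line (files can be large, so avoid
    # reading them into memory all at once).
    path = "data/afghanistan/domain2multi-af.txt"

    with open(path, encoding="utf-8") as fh:
        for line in fh:
            domain = line.strip()
            if domain:
                print(domain)

    # The still-compressed files can be read directly, e.g.:
    #   import lzma
    #   with lzma.open(path + ".xz", "rt") as fh: ...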

Search engines and crawlers

Crawlers

Domains Project bot

Domains Project uses a crawler and DNS checks to get new domains.

The DNS checks client is in its early stages and is used by a select few. It is called Freya, and I'm working on making it stable and good enough for the general public.

The HTTP crawler is being rewritten as well. It is called Idun.

Typical user agent for Domains Project bot looks like this:

Mozilla/5.0 (compatible; Domains Project/1.0.8; +https://domainsproject.org)

Some older versions have it set to the GitHub repo:

Mozilla/5.0 (compatible; Domains Project/1.0.4; +https://github.com/tb0hdan/domains)
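
To identify the bot in access logs, matching on the stable part of the user agent covers both variants above; a small Python sketch (the regex is illustrative, not something the project publishes):

    import re

    # Matches both the domainsproject.org and GitHub-repo variants.
    BOT_RE = re.compile(r"Domains Project/(\d+\.\d+\.\d+)")

    ua = "Mozilla/5.0 (compatible; Domains Project/1.0.8; +https://domainsproject.org)"
    match = BOT_RE.search(ua)
    if match:
        print("Domains Project bot, version", match.group(1))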

All data in this dataset is gathered using the Scrapy and Colly frameworks.

Starting with version 1.0.7, the crawler has partial robots.txt support and rate limiting. Please open an issue if you experience any problems; don't forget to include your domain.

Disabling Domains Project bot access to your website

Add this to your robots.txt:

User-agent: domainsproject.org
Disallow: /

or this:

User-agent: Domains Project
Disallow: /

The bot checks for both.
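
A site owner can verify their rules with the Python standard library, which performs the same kind of check the bot does; a minimal sketch, assuming the two user-agent tokens above:

    from urllib.robotparser import RobotFileParser

    # Point this at your own site's robots.txt.
    rp = RobotFileParser("https://example.com/robots.txt")
    rp.read()

    for agent in ("domainsproject.org", "Domains Project"):
        print(agent, "allowed:", rp.can_fetch(agent, "https://example.com/"))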

Others

Yacy

Yacy is a great open-source search engine. Here's my post on the Yacy forum: https://searchlab.eu/t/domain-list-for-easier-search-bootstrapping/231

Additional sources

Rapid7 Sonar FDNS - no longer open

List of .FR domains from AfNIC.fr

Majestic Million

Internetstiftelsen Zone Data

DNS Census 2013

bigdatanews extract from Common Crawl (circa 2012)

Common Crawl - March/April 2020

The CAIDA UCSD IPv4 Routed /24 DNS Names Dataset - January/July 2019

GSA Data

OpenPageRank 10m hosts

Switch.ch Open Data

Slovak domains - Open Data

Research

This dataset can be used for research. There are papers that cover different topics. I'm just going to leave links to them here for reference.

Published works based on this dataset

Understanding and Characterizing the Adoption of Internationalized Domain Names in Practice

Phishing Protection SPF, DKIM, DMARC

Email address analysis (Czech)

Proteus: A Self-Designing Range Filter

Large Scale String Analytics in Arkouda

Fake Phishing: Setup, detection, and take-down

Cloudy with a Chance of Cyberattacks: Dangling Resources Abuse on Cloud Platforms

Data bouncing - thecontractor

Data bouncing - exampleone

GlyphNet: Homoglyph domains dataset and detection using attention-based Convolutional Neural Networks

Drupal and the Open Web in the Australian Government - 2022 edition

Useful resources

The Internet of Names: A DNS Big Dataset

Enabling Network Security Through Active DNS Datasets

Analysis of the Internet Domain Names Re-registration Market

Detection of malicious domains through lexical analysis

Malicious Domain Names Detection Algorithm Based on Lexical Analysis and Feature Quantification

Detecting Malicious URLs Using Lexical Analysis

