Kanji usage frequency data collected from various sources
scriptin/kanji-frequency
Datasets built from various Japanese language corpora
https://scriptin.github.io/kanji-frequency/ - see this website for the dataset description. This readme describes only technical aspects.
You can download the datasets here: https://github.com/scriptin/kanji-frequency/tree/master/data
You'll need Node.js 18 or later.
See the scripts section in package.json.
Aozora:
- aozora:download - use crawler/scraper to collect the data
- aozora:gaiji:extract - extract gaiji notation data from scraped pages. Gaiji refers to kanji characters which are replaced with images in the documents, because Shift-JIS encoding cannot represent them
- aozora:gaiji:replacements - build the gaiji replacements file. This produces only partial results, which may need to be completed manually
- aozora:clean - clean the scraped pages (apply gaiji replacements)
- aozora:count - create the dataset
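
As a rough illustration of the cleaning step, the sketch below swaps gaiji annotations for real characters and collects the ones that have no replacement yet, which is why a partially built replacements file still needs manual completion. It is not the repository's implementation: the annotation shape (`※［＃…］`), the `applyGaijiReplacements` function, and the table entries are all assumptions made up for this example.

```javascript
// Assumed shape of a gaiji annotation: a reference mark followed by a
// bracketed note, e.g. "※［＃…］". The real notation may have more variants.
const GAIJI_PATTERN = /※［＃([^］]+)］/g;

// Hypothetical replacements table: note text -> the character it stands for.
// The note strings here are invented placeholders, not real annotations.
const replacements = new Map([
  ['「made-up note A」', '鷗'],
]);

// Replace every known annotation; leave unknown ones in place and report them.
function applyGaijiReplacements(text, table) {
  const unresolved = new Set();
  const cleaned = text.replace(GAIJI_PATTERN, (match, note) => {
    if (table.has(note)) return table.get(note);
    unresolved.add(note);
    return match; // unknown annotation stays untouched
  });
  return { cleaned, unresolved: [...unresolved] };
}

const sample = '森※［＃「made-up note A」］外と※［＃「made-up note B」］';
const result = applyGaijiReplacements(sample, replacements);
console.log(result.cleaned);
console.log(result.unresolved);
```

Returning the unresolved notes alongside the cleaned text gives a simple work list of entries still to be added by hand.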
Wikipedia:
- wikipedia:fetch - fetch random pages using MediaWiki API
- wikipedia:count - create the dataset
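
For orientation, fetching random pages through the MediaWiki API can look roughly like the sketch below. It uses the API's `list=random` query module with the `rnnamespace` and `rnlimit` parameters and the global `fetch` available in Node.js 18. The function names, the host, and the page limit are assumptions for this example; the repository's own script may query the API differently.

```javascript
// Build a MediaWiki API URL that asks for `limit` random pages
// from the main (article) namespace.
function buildRandomPagesUrl(host, limit) {
  const url = new URL(`https://${host}/w/api.php`);
  url.search = new URLSearchParams({
    action: 'query',
    format: 'json',
    list: 'random',
    rnnamespace: '0', // 0 = main namespace
    rnlimit: String(limit),
  }).toString();
  return url.toString();
}

// Fetch the titles of random pages (performs a network request).
async function fetchRandomTitles(host, limit) {
  const response = await fetch(buildRandomPagesUrl(host, limit));
  if (!response.ok) throw new Error(`HTTP ${response.status}`);
  const body = await response.json();
  return body.query.random.map((page) => page.title);
}

console.log(buildRandomPagesUrl('ja.wikipedia.org', 5));
// Example call, left commented out so the sketch runs offline:
// fetchRandomTitles('ja.wikipedia.org', 5).then(console.log);
```

The same request shape works for any MediaWiki site, so only the host would change for Wikinews.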
News:
- news:wikinews:fetch - fetch random pages from Wikinews using MediaWiki API
- news:count - create the dataset
- news:dates - create additional file with dates of articles
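
The count steps listed above all come down to tallying kanji across a set of texts. A minimal sketch of that idea follows, assuming that "kanji" means any character with the Unicode script property Han and that the output is a list of character/count pairs sorted by frequency. Both choices, and the `countKanji` name, are assumptions for illustration; the published datasets may filter characters and lay out their output differently.

```javascript
// Matches a single character whose Unicode script is Han.
const HAN = /\p{Script=Han}/u;

// Count Han characters across all texts and return [character, count]
// pairs, most frequent first (ties broken by code point).
function countKanji(texts) {
  const counts = new Map();
  for (const text of texts) {
    // for...of iterates by code point, so characters outside the BMP stay whole
    for (const char of text) {
      if (HAN.test(char)) counts.set(char, (counts.get(char) ?? 0) + 1);
    }
  }
  return [...counts.entries()].sort(
    (a, b) => b[1] - a[1] || a[0].codePointAt(0) - b[0].codePointAt(0),
  );
}

console.log(countKanji(['日本語の本', '日曜日']));
```

Kana, punctuation, and Latin text are skipped because they do not carry the Han script property.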
See Astro docs and the scripts section in package.json.