Kanji usage frequency data collected from various sources


scriptin/kanji-frequency

Datasets built from various Japanese language corpora

https://scriptin.github.io/kanji-frequency/ - see this website for the dataset description. This readme describes only technical aspects.

You can download the datasets here: https://github.com/scriptin/kanji-frequency/tree/master/data

Building the datasets

You'll need Node.js 18 or later.

See the scripts section in package.json.

Aozora:

  • aozora:download - use crawler/scraper to collect the data
  • aozora:gaiji:extract - extract gaiji notation data from the scraped pages. Gaiji are kanji characters that are replaced with images in the documents because the Shift-JIS encoding cannot represent them
  • aozora:gaiji:replacements - build the gaiji replacements file. This produces only partial results, which may need to be completed manually
  • aozora:clean - clean the scraped pages (apply gaiji replacements)
  • aozora:count - create the dataset
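
To make the clean and count steps above more concrete, here is a minimal sketch of the general idea, not the repository's actual code. The function names and the `{gaiji:1}` placeholder are hypothetical, and the real gaiji notation and dataset format may differ. Kanji are detected here with the Unicode `Script=Han` property, which is an assumption about what counts as a kanji and also matches a few Han-script marks such as 々.

```javascript
// Hypothetical helper: replace each gaiji placeholder with the character it stands for.
// The placeholder strings are illustrative, not the repository's actual notation.
function applyGaijiReplacements(text, replacements) {
  let result = text;
  for (const [placeholder, kanji] of replacements) {
    result = result.split(placeholder).join(kanji);
  }
  return result;
}

// Count every Han-script character in the text.
// Returns [character, count] pairs, most frequent first.
function countKanji(text) {
  const counts = new Map();
  for (const ch of text.match(/\p{Script=Han}/gu) ?? []) {
    counts.set(ch, (counts.get(ch) ?? 0) + 1);
  }
  return [...counts.entries()].sort((a, b) => b[1] - a[1]);
}

// Made-up sample input with one hypothetical placeholder.
const sampleReplacements = new Map([['{gaiji:1}', '噓']]);
const cleanedSample = applyGaijiReplacements(
  '彼は{gaiji:1}をついた。彼は泣いた。',
  sampleReplacements,
);
console.log(countKanji(cleanedSample));
```

Hiragana, katakana, and punctuation are ignored because they fall outside the Han script. Any placeholder without an entry in the replacements map is left in the text untouched, which mirrors the note that the generated replacements file may need manual completion.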

Wikipedia:

  • wikipedia:fetch - fetch random pages using MediaWiki API
  • wikipedia:count - create the dataset
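
As an illustration of fetching random pages through the MediaWiki API, the sketch below builds a `list=random` query and reads the page titles from the response. It is not the repository's implementation: the function names, the `ja.wikipedia.org` host, and the batch size are assumptions made for this example. It relies on the global `fetch` available in Node.js 18 and later.

```javascript
// Build a MediaWiki API URL that asks for `limit` random pages
// from the main (article) namespace.
function buildRandomPagesUrl(host, limit) {
  const url = new URL(`https://${host}/w/api.php`);
  url.search = new URLSearchParams({
    action: 'query',
    list: 'random',
    rnnamespace: '0', // namespace 0 = articles
    rnlimit: String(limit),
    format: 'json',
  }).toString();
  return url;
}

// Fetch one batch of random page titles (performs a network request).
async function fetchRandomTitles(host, limit) {
  const response = await fetch(buildRandomPagesUrl(host, limit));
  if (!response.ok) {
    throw new Error(`MediaWiki API request failed: HTTP ${response.status}`);
  }
  const data = await response.json();
  return data.query.random.map((page) => page.title);
}

// Print the request URL without touching the network.
console.log(buildRandomPagesUrl('ja.wikipedia.org', 10).href);
```

Calling `fetchRandomTitles('ja.wikipedia.org', 10)` would return an array of titles. Fetching the text of each page is a separate request and is not shown here.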

News:

  • news:wikinews:fetch - fetch random pages from Wikinews using MediaWiki API
  • news:count - create the dataset
  • news:dates - create additional file with dates of articles

Building the website

See the Astro docs and the scripts section in package.json.

