Topologically ordered lists of kanji for effective learning

scriptin/topokanji


30-second explanation for people who want to learn kanji:

It is best to learn kanji starting from simple characters and then learning complex ones as compositions of "parts", which are called "radicals" or "components". For example:

  • 一 → 二 → 三
  • 丨 → 凵 → 山 → 出
  • 言 → 五 → 口 → 語

It is also smart to learn more common kanji first.

This project is based on those two ideas and provides properly ordered lists of kanji to make your learning process as fast, simple, and effective as possible.

Motivation for this project initially came from reading this article: The 5 Biggest Mistakes People Make When Learning Kanji.

First 100 kanji from lists/aozora.txt (formatted for convenience):

人一丨口日目儿見凵山出十八木未丶来大亅了子心土冂田思二丁彳行寸寺時卜上丿刀分厶禾私中彐尹事可亻何自乂又皮彼亠方生月門間扌手言女本乙气気干年三耂者刂前勹勿豕冖宀家今下白勺的云牛物立小文矢知入乍作聿書学合

These lists can be found in the lists directory. They differ only in the order of kanji. Each file contains a list of kanji, ordered as described in the following sections. There are a few options (see Used data for details):

  • aozora.(json|txt) - ordered by kanji frequency in Japanese fiction and non-fiction books; I recommend this list if you're starting to learn kanji
  • news.(json|txt) - ordered by kanji frequency in online news
  • twitter.(json|txt) - ordered by kanji frequency in Twitter messages
  • wikipedia.(json|txt) - ordered by kanji frequency in Wikipedia articles
  • all.(json|txt) - combined "average" version of all previous; this one is experimental, I don't recommend using it

You can use these lists to build an Anki deck or just as guidance. If you're looking for "names" or meanings of kanji, you might want to check my kanji-keys project.

What is a properly ordered list of kanji?

If you look at a kanji like 語, you can see it consists of at least three distinct parts: 言, 五, 口. Those are kanji by themselves too. The idea behind this project is to find an order of about 2000-2500 common kanji in which no kanji appears before its parts, so you only learn a new kanji when you already know its components.
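
The "no kanji before its parts" rule can be checked mechanically. The following is a minimal sketch, not code from this repository: the function name findOrderViolation and the shape of the components map (kanji mapped to an array of its parts) are assumptions made for illustration.

```javascript
// Hypothetical helper: returns the first kanji that appears before one of
// its components, or null if the list is properly ordered.
// `components` maps a kanji to an array of its parts (assumed shape).
function findOrderViolation(list, components) {
  const seen = new Set();
  for (const kanji of list) {
    for (const part of components[kanji] || []) {
      if (!seen.has(part)) return { kanji, missing: part };
    }
    seen.add(kanji);
  }
  return null;
}

const sampleComponents = { '語': ['言', '五', '口'] };

// Properly ordered: all three parts come before 語.
console.log(findOrderViolation(['言', '五', '口', '語'], sampleComponents));

// Not properly ordered: 語 appears before 五 and 口.
console.log(findOrderViolation(['言', '語', '五', '口'], sampleComponents));
```

The first call returns null; the second reports that 語 was reached while 五 had not been seen yet.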

Properties of properly ordered lists

  1. No kanji appears before its parts (components). In fact, if you treat kanji as nodes in a graph structure and connect them with directed edges, where each edge means "kanji A includes kanji B as a component", the result is a directed acyclic graph (DAG). For any DAG, it is possible to build a topological order, which is basically what "no kanji appears before its parts" means.
  2. More common kanji come first. That way you learn useful characters as soon as possible.

Algorithm

Topological sorting is done using a modified version of Kahn's (1962) algorithm, with an intermediate sorting step that deals with the second property above. This intermediate sorting uses the "weight" of each character: common kanji (lighter) tend to appear before rare kanji (heavier). See source code for details.
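
To make the idea concrete, here is a minimal sketch of Kahn's algorithm in which the set of "ready" kanji is re-sorted by weight before each pick. It is an illustration only and is not the code in build.js: the function name topoSortByWeight, the input shapes, and the sample weights are all assumptions, and the project's actual modification may differ.

```javascript
// deps:   { kanji: [components] }           (assumed shape)
// weight: { kanji: number }, lower = more common (hypothetical values)
function topoSortByWeight(deps, weight) {
  const nodes = new Set(Object.keys(deps));
  for (const parts of Object.values(deps)) parts.forEach((p) => nodes.add(p));

  const remaining = new Map();  // kanji -> number of components not yet emitted
  const dependents = new Map(); // component -> kanji that include it
  for (const n of nodes) {
    remaining.set(n, 0);
    dependents.set(n, []);
  }
  for (const [kanji, parts] of Object.entries(deps)) {
    for (const p of new Set(parts)) {
      remaining.set(kanji, remaining.get(kanji) + 1);
      dependents.get(p).push(kanji);
    }
  }

  // Kanji without a known weight are treated as very rare.
  const w = (k) => weight[k] ?? Number.MAX_SAFE_INTEGER;
  const ready = [...nodes].filter((n) => remaining.get(n) === 0);
  const result = [];
  while (ready.length > 0) {
    ready.sort((a, b) => w(a) - w(b)); // intermediate sorting step
    const next = ready.shift();        // lightest kanji whose parts are all known
    result.push(next);
    for (const d of dependents.get(next)) {
      remaining.set(d, remaining.get(d) - 1);
      if (remaining.get(d) === 0) ready.push(d);
    }
  }
  if (result.length !== nodes.size) throw new Error('cycle detected: not a DAG');
  return result;
}

const sampleDeps = { '語': ['言', '五', '口'], '山': ['凵'], '出': ['山'] };
const sampleWeight = { '口': 1, '山': 2, '出': 3, '言': 4, '語': 5, '五': 6, '凵': 7 };
console.log(topoSortByWeight(sampleDeps, sampleWeight).join(''));
// prints "口言五語凵山出"
```

Note that in this simple greedy variant 山 is common (weight 2) but still has to wait for its rare component 凵, which is exactly the tension between the two properties that the weight-based step tries to balance.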

Used data

The initial unsorted list contains only kanji which are present in the KanjiVG project, so for each character there is data on its shape and stroke order.

Characters are split into components using the CJK Decompositions Data project, along with "fixes" to simplify the final lists and avoid characters which are not present in the initial list.

Statistical data on kanji usage frequencies was collected by processing raw textual data from various sources. See the kanji-frequency repository for details.

Which kanji are (not) included?

The kanji list covers about 95-99% of kanji found in various Japanese texts. Generally, the goal is to provide something similar to Jōyō kanji, but based on actual data. Radicals are also included, but only those which are parts of some kanji in the list.

A kanji/radical must NOT appear in this list if it is:

  • not included in KanjiVG character set
  • primarily used in names (people, places, etc.) or in some specific terms (religion, mythology, etc.)
  • mostly used because of its shape, e.g. as a part of text emoticons/kaomoji like ( ^ω^)个
  • a part of a currently popular meme, manga/anime/dorama/movie title, #hashtag, etc., and is otherwise not commonly used

Files and formats

lists directory

Files in the lists directory are final lists.

  • *.txt files contain lists as plain text, one character per line; those files can be interpreted as CSV/TSV files with a single column
  • *.json files contain lists as JSON arrays

All files are encoded in UTF-8, without byte order mark (BOM), and have unix-style line endings, LF.
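
As a rough illustration of the two formats, the sketch below parses both from in-memory strings and shows that they carry the same data. The function names are hypothetical, and the sample content is just the first five characters of the aozora list quoted earlier; in practice the strings would be read from the files in the lists directory.

```javascript
// Hypothetical helpers for the two list formats described above.
function parseTxtList(text) {
  // One character per line, LF line endings; ignore the trailing empty line.
  return text.split('\n').filter((line) => line.length > 0);
}

function parseJsonList(text) {
  // A plain JSON array of single-character strings.
  return JSON.parse(text);
}

const txtSample = '人\n一\n丨\n口\n日\n';
const jsonSample = '["人","一","丨","口","日"]';

console.log(parseTxtList(txtSample));
console.log(parseTxtList(txtSample).join('') === parseJsonList(jsonSample).join('')); // true
```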

dependencies directory

Files in the dependencies directory are "flat" equivalents of CJK-decompositions (see below). "Dependency" here roughly means "a component of the visual decomposition" for kanji.

  • 1-to-1.txt has a format compatible with the tsort command line utility; the first character in each line is the "target" kanji, the second character is the target's dependency or 0
  • 1-to-1.json contains a JSON array with the same data as in 1-to-1.txt
  • 1-to-N.txt is similar, but lists all "dependencies" at once
  • 1-to-N.json contains a JSON object with the same data as in 1-to-N.txt

All files are encoded in UTF-8, without byte order mark (BOM), and have unix-style line endings, LF.
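
The relationship between the 1-to-1 and 1-to-N forms can be sketched as a simple grouping step. This is an illustration under assumptions, not the project's code: it assumes the two characters on each 1-to-1.txt line are separated by whitespace (as tsort expects), that 0 means "no dependency", and that the 1-to-N object maps each kanji to an array of its dependencies. The function name and sample lines are hypothetical.

```javascript
// Hypothetical helper: group "target dependency" pairs into { target: [deps] }.
function groupDependencies(oneToOneText) {
  const result = {};
  for (const line of oneToOneText.split('\n')) {
    const [target, dep] = line.trim().split(/\s+/);
    if (!target || !dep) continue;              // skip blank or malformed lines
    if (!(target in result)) result[target] = [];
    if (dep !== '0') result[target].push(dep);  // "0" is assumed to mean "no dependency"
  }
  return result;
}

// Illustrative input, not copied from the repository.
const pairs = '一 0\n語 言\n語 五\n語 口\n';
console.log(JSON.stringify(groupDependencies(pairs)));
// prints {"一":[],"語":["言","五","口"]}
```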

data directory

  • kanji.json - data for kanji included in final ordered lists, including radicals
  • kanjivg.txt - list of kanji from KanjiVG
  • cjk-decomp-{VERSION}.txt - data from CJK Decompositions Data, without any modifications
  • cjk-decomp-override.txt - data to override some of CJK's decompositions
  • kanji-frequency/*.json - kanji frequency tables

All files are encoded in UTF-8, without byte order mark (BOM). All files, except for cjk-decomp-{VERSION}.txt, have unix-style line endings, LF.

data/kanji.json

Contains a table with data for kanji, including radicals. Columns are:

  1. Character itself
  2. Stroke count
  3. Frequency flag:
    • true if it is a common kanji
    • false if it is primarily used as a radical/component and unlikely to be seen within top 3000 in kanji usage frequency tables. In this case character is only listed because it's useful for decomposition, not as a standalone kanji

Restrictions:

  • No duplicates
  • Each character must be listed in kanjivg.txt
  • Each character must be listed on the left hand side in exactly one line in cjk-decomp-{VERSION}.txt
  • Each character may be listed on the left hand side in at most one line in cjk-decomp-override.txt
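
The first two restrictions, plus basic column checks, could be verified along these lines. This is only a sketch: it assumes each table row is an array of the form [character, strokeCount, frequencyFlag], which matches the column description above but is not confirmed here, and the function name and sample rows are made up for illustration.

```javascript
// Hypothetical validator for rows shaped like [char, strokeCount, isCommon].
function validateKanjiTable(rows, kanjivgChars) {
  const errors = [];
  const seen = new Set();
  const known = new Set(kanjivgChars);
  for (const [char, strokes, isCommon] of rows) {
    if (seen.has(char)) errors.push(`duplicate: ${char}`);
    seen.add(char);
    if (!known.has(char)) errors.push(`not in kanjivg.txt: ${char}`);
    if (!Number.isInteger(strokes) || strokes < 1) errors.push(`bad stroke count: ${char}`);
    if (typeof isCommon !== 'boolean') errors.push(`bad frequency flag: ${char}`);
  }
  return errors;
}

// Illustrative rows: 人 is a common kanji, 亻 is listed only as a component.
const sampleRows = [['人', 2, true], ['亻', 2, false], ['人', 2, true]];
console.log(validateKanjiTable(sampleRows, ['人', '亻']));
// prints [ 'duplicate: 人' ]
```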

data/kanjivg.txt

A simple list of characters which are present in the KanjiVG project. Those are taken from the list of *.svg files in KanjiVG's GitHub repository.

data/cjk-decomp-{VERSION}.txt

Data file from the CJK Decompositions Data project; see the description of its format.

data/cjk-decomp-override.txt

Same format as cjk-decomp-{VERSION}.txt, except:

  • comments starting with # are allowed
  • the purpose of each record in this file is to override the one from cjk-decomp-{VERSION}.txt
  • the type of decomposition is always fix, which just means "fix a record for the same character from the original file"

The special character 0 is used to distinguish invalid decompositions (which lead to characters with no graphical representation) from those which just can't be decomposed further into something meaningful. For example, 一:fix(0) means that this kanji can't be further decomposed, since it's just a single stroke.
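
A record such as 一:fix(0) can be read with a small parser. The sketch below is an assumption-laden illustration rather than the project's parser: it assumes components inside the parentheses are comma-separated (as in the CJK Decompositions Data format), that a # starts a comment running to the end of the line, and that 0 should be mapped to an empty component list. The function name and the 語 record are hypothetical.

```javascript
// Hypothetical parser for override records like "一:fix(0)".
function parseOverrides(text) {
  const overrides = {};
  for (const rawLine of text.split('\n')) {
    const line = rawLine.replace(/#.*$/, '').trim(); // drop comments (assumed syntax)
    if (line === '') continue;
    const match = /^(.+?):fix\((.*)\)$/u.exec(line);
    if (!match) throw new Error(`unrecognized record: ${rawLine}`);
    // "0" means "cannot be decomposed further", so it yields no components.
    const parts = match[2]
      .split(',')
      .map((p) => p.trim())
      .filter((p) => p !== '' && p !== '0');
    overrides[match[1]] = parts;
  }
  return overrides;
}

// Illustrative input; the second record is made up for this example.
const overrideSample = '# single stroke\n一:fix(0)\n語:fix(言,五,口)\n';
console.log(JSON.stringify(parseOverrides(overrideSample)));
// prints {"一":[],"語":["言","五","口"]}
```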

NOTE: Strictly speaking, records in this file are not always "visual decompositions" (but most of them are). Instead, it's just an attempt to provide meaningful recommendations of kanji learning order.

data/kanji-frequency/*.json

See the kanji-frequency repository for details.

Usage

You must have Node.js and Git installed:

  1. git clone https://github.com/THIS/REPO.git
  2. npm install
  3. node build.js + commands and arguments described below

Command-line commands and arguments

  • show - only display sorted list without writing into files
    • (optional) --per-line=NUM - explicitly tell how many characters per line to display. 50 by default. Applicable only to (no arguments)
    • (optional) --freq-table=TABLE_NAME - use only one frequency table. Table names are file names from the data/kanji-frequency directory, without the .json extension, e.g. all ("combined" list), aozora, etc. When omitted, all frequency tables are used
  • coverage - show table coverage, i.e. which fraction of characters from each frequency table is included in the kanji list
  • suggest-add - suggest kanji to add to a list, based on coverage within kanji usage frequency tables
    • (required) --num=NUM - how many kanji to suggest
    • (optional) --mean-type=MEAN_TYPE - same as previous, sort by given mean type: arithmetic (most "extreme"), geometric, harmonic (default, most "conservative"). See Pythagorean means for details
  • suggest-remove - suggest kanji to remove from a list, reverse of suggest-add
    • (required)--num=NUM - see above
    • (optional)--mean-type=MEAN_TYPE - see above
  • save - update files with final lists
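
The three values accepted by --mean-type correspond to the classical Pythagorean means. The sketch below shows how they differ on one set of numbers; it does not reproduce what build.js actually averages, and the sample values (imagined per-table usage fractions for a single kanji) are hypothetical.

```javascript
// Pythagorean means of an array of positive numbers.
const arithmeticMean = (xs) => xs.reduce((a, b) => a + b, 0) / xs.length;
const geometricMean = (xs) =>
  Math.exp(xs.reduce((a, b) => a + Math.log(b), 0) / xs.length);
const harmonicMean = (xs) => xs.length / xs.reduce((a, b) => a + 1 / b, 0);

// Hypothetical usage fractions of one kanji in four frequency tables:
// rare in the first table, much more common in the last.
const fractions = [0.001, 0.004, 0.016, 0.064];

console.log('arithmetic:', arithmeticMean(fractions)); // about 0.02125
console.log('geometric: ', geometricMean(fractions));  // about 0.008
console.log('harmonic:  ', harmonicMean(fractions));   // about 0.00301
```

For positive inputs the harmonic mean is never larger than the geometric mean, which is never larger than the arithmetic mean. The harmonic mean is pulled toward the smallest value, so a kanji scores well only if it is reasonably frequent in every table, which is one way to read "most conservative"; the arithmetic mean lets a single high value dominate.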

License

This is a multi-license project. Choose any license from this list:
