hackerb9/ugrep

☙ ugrep ❧

Find unicode characters based on their names

ugrep is essentially grep for the Unicode table. It prints out the resulting Unicode characters literally, so you can easily cut-and-paste. Ugrep is useful for looking up Emojis 😤, finding obscure symbols ⚸⅗ℏ℞☧☭, or beautiful glyphs to decorate your text. 🙶❡✯🟔❢🙷

You can also use it for the reverse operation: looking up a single character (or a string of them) you've pasted into the terminal.

As a bonus, it can list which installed fonts contain a particular Unicode character and, through the magic of sixels, will show a rendering in each font.

Installation

It's just a Python 3 script. Download it to /usr/local/bin or ~/bin and make it executable.

    cd /usr/local/bin
    wget https://github.com/hackerb9/ugrep/raw/master/ugrep
    chmod +x ugrep

Usage

  • Search by name: ugrep [-w] regex

    Look up a character name, where regex is a regular expression. If you don't know regular expressions, don't worry. Just use plain strings and you'll rarely be wrong.

      ugrep runic

    If you find ugrep returning too many hits because the phrase you used is found in other terms, e.g., them found in mathematical, use the -w option to limit the search to complete words.

  • Search by number: ugrep codepoint[..codepoint[..increment]]

    Look up a character (or a range of them) using Unicode code points in hexadecimal. For example,

      ugrep 03c0
      ugrep 23b0..f
      ugrep 0..10ffff..1000

  • Search by character: ugrep [-c] character string

    Look up each character in a string. Note that if the string is a single character, e.g., ugrep X, then -c is implied and need not be specified.

      ugrep -c "(゚∀゚)"
  • List fonts for a character: ugrep [-l] character

    After showing the usual character information, list installed fonts that contain that character and show an example in each:

      ugrep -l mho

    When you are logged in to another machine over ssh, ugrep shows the fonts installed on the remote machine.

  • List fonts, scaled larger: ugrep [-Lscale] character

    Same as -l, but scale up the example rendering in each font to be easier to read:

      ugrep -L2 -w om

    Useful scale values range from 2 to 8.

Examples

Note: output from all examples has been excerpted. (You'd be amazed how many heart emojis Unicode has. 😜)

Fun things to try:

To see some useful and lovely glyphs, try this:

    ugrep face
    ugrep alchemical
    ugrep ornament
    ugrep bullet
    ugrep '(vine|bud)'
    ugrep vai
    ugrep heavy
    ugrep drawing
    ugrep combining

Plain text search is simple:

    $ ugrep heart
    ☙   U+2619  REVERSED ROTATED FLORAL HEART BULLET
    ❣   U+2763  HEAVY HEART EXCLAMATION MARK ORNAMENT
    ❤   U+2764  HEAVY BLACK HEART
    ⋮[ ... truncated for brevity ... ]
    💞   U+1F49E REVOLVING HEARTS
    💟   U+1F49F HEART DECORATION
    😍   U+1F60D SMILING FACE WITH HEART-SHAPED EYES
    😻   U+1F63B SMILING CAT FACE WITH HEART-SHAPED EYES

Paste in a single character to look up its code point:

    $ ugrep ☺
    ☺   U+263A  WHITE SMILING FACE

Arguments on the command line have an implicit wildcard between them:

    $ ugrep right.*gle
    $ ugrep right gle       # Equivalent
    »   U+00BB  RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK
    ’   U+2019  RIGHT SINGLE QUOTATION MARK
    ∟   U+221F  RIGHT ANGLE
    ⊿   U+22BF  RIGHT TRIANGLE

You can use regular expressions for fancier searches:

    $ ugrep -w '(wo|hu)?m(a|e)ns?'
    ᛗ   U+16D7  RUNIC LETTER MANNAZ MAN M
    ⛀   U+26C0  WHITE DRAUGHTS MAN
    ⛂   U+26C2  BLACK DRAUGHTS MAN
    ⼈   U+2F08  KANGXI RADICAL MAN
    ⼥   U+2F25  KANGXI RADICAL WOMAN
    𝌂   U+1D302 DIGRAM FOR HUMAN EARTH
    𝌄   U+1D304 DIGRAM FOR EARTHLY HUMAN
    🕴   U+1F574 MAN IN BUSINESS SUIT LEVITATING
    🕺   U+1F57A MAN DANCING
    🚹   U+1F6B9 MENS SYMBOL
    🚺   U+1F6BA WOMENS SYMBOL
    🤰   U+1F930 PREGNANT WOMAN
    🤵   U+1F935 MAN IN TUXEDO

    $ ugrep ^x    #  Regex anchors ^ and $ work
    ⊻   U+22BB  XOR
    ⌧   U+2327  X IN A RECTANGLE BOX (clear key)

Use the -w flag to search only for complete words:

    $ ugrep -w R        # The letter R used as a word
    $ ugrep "\bR\b"     # (regex equivalent)
    R   U+0052  LATIN CAPITAL LETTER R
    Ŗ   U+0156  LATIN CAPITAL LETTER R WITH CEDILLA
    ℛ   U+211B  SCRIPT CAPITAL R (Script r)
    ℜ   U+211C  BLACK-LETTER CAPITAL R (Black-letter r)
    ℝ   U+211D  DOUBLE-STRUCK CAPITAL R (Double-struck r)
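The name search described above can be approximated in a few lines of Python. This is only a sketch, not ugrep's actual code: the function name `search_names`, the case-insensitive matching, and the way `-w` is emulated by wrapping the pattern in `\b` are all assumptions. The sample data uses the UnicodeData.txt layout (semicolon-separated, code point first, name second), trimmed to those two fields.

```python
import re

# A few entries in UnicodeData.txt layout, trimmed to "code point;name".
SAMPLE = """\
221F;RIGHT ANGLE
22BF;RIGHT TRIANGLE
26C0;WHITE DRAUGHTS MAN
2F25;KANGXI RADICAL WOMAN
1D400;MATHEMATICAL BOLD CAPITAL A
"""

def search_names(args, whole_word=False, data=SAMPLE):
    """Return (char, 'U+XXXX', name) for every name matching the arguments."""
    pattern = ".*".join(args)              # implicit wildcard between arguments
    if whole_word:
        pattern = rf"\b(?:{pattern})\b"    # assumed stand-in for the -w flag
    regex = re.compile(pattern, re.IGNORECASE)
    hits = []
    for line in data.splitlines():
        code, name = line.split(";")[:2]
        if regex.search(name):
            hits.append((chr(int(code, 16)), f"U+{code}", name))
    return hits

for hit in search_names(["right", "gle"]):
    print(*hit, sep="\t")
```

With this sample, `search_names(["man"])` also returns KANGXI RADICAL WOMAN, while `search_names(["man"], whole_word=True)` keeps only WHITE DRAUGHTS MAN.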

Use -c to display info for each character in a string:

    $ ugrep -c "ᕕ( ᐛ )ᕗ"
    ᕕ   U+1555  CANADIAN SYLLABICS FI
    (   U+0028  LEFT PARENTHESIS (opening parenthesis)
        U+0020  SPACE
    ᐛ   U+141B  CANADIAN SYLLABICS NASKAPI WAA
        U+0020  SPACE
    )   U+0029  RIGHT PARENTHESIS (closing parenthesis)
    ᕗ   U+1557  CANADIAN SYLLABICS FO
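For comparison, a per-character lookup like this can be sketched with Python's standard `unicodedata` module. This is not how ugrep works (it reads UnicodeData.txt directly and also prints aliases); the helper name `describe` is hypothetical.

```python
import unicodedata

def describe(text):
    """Return one (char, 'U+XXXX', name) tuple per character in text."""
    rows = []
    for ch in text:
        # unicodedata.name() raises ValueError for unnamed characters
        # unless a default is supplied.
        name = unicodedata.name(ch, "<no name>")
        rows.append((ch, f"U+{ord(ch):04X}", name))
    return rows

for ch, code, name in describe("(☺)"):
    print(f"{ch}\t{code}\t{name}")
```

The standard module returns only the primary name, so aliases such as "(opening parenthesis)" would still have to come from another source.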

Aliases (alternate names) are also searched:

    $ ugrep backslash
    \   U+005C  REVERSE SOLIDUS (backslash)

Use .. to browse through a range of Unicode characters:

    $ ugrep 26b3..b
    ⚳   U+26B3  CERES
    ⚴   U+26B4  PALLAS
    ⚵   U+26B5  JUNO
    ⚶   U+26B6  VESTA
    ⚷   U+26B7  CHIRON
    ⚸   U+26B8  BLACK MOON LILITH
    ⚹   U+26B9  SEXTILE
    ⚺   U+26BA  SEMISEXTILE
    ⚻   U+26BB  QUINCUNX

    $ ugrep 1f470..ff  |  less
    👰   U+1F470 BRIDE WITH VEIL
    👱   U+1F471 PERSON WITH BLOND HAIR
    👲   U+1F472 MAN WITH GUA PI MAO
    👳   U+1F473 MAN WITH TURBAN
    👴   U+1F474 OLDER MAN
    👵   U+1F475 OLDER WOMAN
    👶   U+1F476 BABY
    👷   U+1F477 CONSTRUCTION WORKER
    👸   U+1F478 PRINCESS
    👹   U+1F479 JAPANESE OGRE
    👺   U+1F47A JAPANESE GOBLIN
    👻   U+1F47B GHOST
    👼   U+1F47C BABY ANGEL
    👽   U+1F47D EXTRATERRESTRIAL ALIEN
    ⋮[ ... truncated for brevity ... ]
    📼   U+1F4FC VIDEOCASSETTE
    📽   U+1F4FD FILM PROJECTOR
    📾   U+1F4FE PORTABLE STEREO
    📿   U+1F4FF PRAYER BEADS

Sometimes it's useful (or just fun) to page through the Unicode table and see what characters are defined in a region (`ugrep 2700..ff`). Ranges are convenient, but very slow. Use regular expressions if you want speed (`ugrep U+27..`).

Ranges can have an optional increment:

    $ ugrep 0..ffff..1000
    �    U+0000  <control> (null)
    က    U+1000  MYANMAR LETTER KA
    [ ]  U+2000  EN QUAD
    [　]  U+3000  IDEOGRAPHIC SPACE
    䀀   U+4000  cups; small cups ( M: fàn, C: fan3 fan4 fan6 )
    倀   U+5000  bewildered; rash, wildly ( M: chāng, C: caang1 caang4 coeng1 zaang1, J: KURUU TAORERU, K: CHANG, V: trành )
    怀   U+6000  bosom, breast; carry in bosom ( M: huái, C: waai4 )
    瀀   U+7000  [CJK Unified Ideographs] ( M: yōu, J: ATSUI )
    耀   U+8000  shine, sparkle, dazzle; glory ( M: yào, C: jiu6, J: KAGAYAKU, K: YO )
    退   U+9000  step back, retreat, withdraw ( M: tuì, C: teoi3, J: SHIRIZOKU SHIRIZOKERU, K: THOY, V: thoái )
    ꀀ   U+A000  YI SYLLABLE IT
    뀀   U+B000  Block: [Hangul Syllables]
    쀀   U+C000  Block: [Hangul Syllables]
    퀀   U+D000  Block: [Hangul Syllables]
    �    U+E000  <Private Use, First>
         U+F000  Block: [Private Use Area]

  • Tip: pipe long output to less and search for a code point by pressing /U\+A60F.
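The range syntax `codepoint[..codepoint[..increment]]` can be sketched in Python as follows. The rules here are inferred from the examples above (an inclusive end, and a short end such as the `b` in `26b3..b` replacing only the trailing digits of the start); the function name `expand_range` is hypothetical and this is not ugrep's actual parser.

```python
def expand_range(spec):
    """Expand 'start[..end[..increment]]' (all hexadecimal) into code points."""
    parts = spec.split("..")
    if not 1 <= len(parts) <= 3:
        raise ValueError(f"bad range: {spec!r}")
    start_hex = parts[0]
    end_hex = parts[1] if len(parts) > 1 else start_hex
    step = int(parts[2], 16) if len(parts) > 2 else 1
    # Assumed rule: a short end replaces only the trailing digits of the
    # start, so '26b3..b' means 26B3..26BB.
    if len(end_hex) < len(start_hex):
        end_hex = start_hex[: len(start_hex) - len(end_hex)] + end_hex
    start, end = int(start_hex, 16), int(end_hex, 16)
    return list(range(start, end + 1, step))   # the end point is included

print([f"U+{cp:04X}" for cp in expand_range("26b3..b")])
```

Under these assumptions, `expand_range("0..ffff..1000")` yields the sixteen code points U+0000, U+1000, ... U+F000 shown in the increment example.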

Use -l to list which installed fonts contain a certain glyph:

  ugrep -l swash amp
  • Requires FontConfig. (Most GNU/Linux systems already have it installed.)

  • The requested character may also be displayed in each of the listed typefaces, but only if your terminal supports sixel graphics (e.g., xterm -ti vt340) and you have ImageMagick installed.
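One way to ask FontConfig which fonts cover a character is the `fc-list` tool with a `:charset=` pattern. The sketch below wraps that call from Python; whether ugrep queries FontConfig this way is an assumption, and the helper names `fc_list_command` and `fonts_with` are hypothetical.

```python
import shutil
import subprocess

def fc_list_command(char):
    """Build an fc-list command selecting fonts whose charset includes char."""
    return ["fc-list", f":charset={ord(char):x}", "family"]

def fonts_with(char):
    """Return sorted font family names, or [] if fc-list is unavailable."""
    if shutil.which("fc-list") is None:
        return []
    result = subprocess.run(
        fc_list_command(char), capture_output=True, text=True, check=False
    )
    return sorted({line.strip() for line in result.stdout.splitlines() if line.strip()})

# U+2127 INVERTED OHM SIGN, the character found by "ugrep -l mho"
print(fonts_with("\u2127"))
```

Because the query runs on whichever machine executes it, this also matches the note above that fonts are listed for the remote machine when working over ssh.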

Use -L to scale up the font examples when listing fonts:

    $ ugrep -L4 fdfd
    ﷽    U+FDFD  ARABIC LIGATURE BISMILLAH AR-RAHMAN AR-RAHEEM
                  Aldhabi
                  Trutypewriter PolyglOTT
                  Unifont

  • Note that increasing the glyph size also increases the text size. Not all terminals are capable of "double height" text. If yours shows two lines of the same text in the usual size, try using --never-double-text.

Copy whitespace from the terminal

    $ ugrep -w space
    [ ]   U+0020  SPACE (SP)
    [ ]   U+00A0  NO-BREAK SPACE (non-breaking space) (NBSP)
    [ ]   U+1680  OGHAM SPACE MARK
    [ ]   U+2002  EN SPACE
    [ ]   U+2003  EM SPACE
    [ ]   U+2004  THREE-PER-EM SPACE
    [ ]   U+2005  FOUR-PER-EM SPACE
    [ ]   U+2006  SIX-PER-EM SPACE
    [ ]   U+2007  FIGURE SPACE
    [ ]   U+2008  PUNCTUATION SPACE
    [ ]   U+2009  THIN SPACE
    [ ]   U+200A  HAIR SPACE

Whitespace characters are printed with square brackets around them to make it easy to highlight and copy them from the terminal. They will also be shown with a yellow background, if the terminal allows.

Determine if an alias is actually a correction

Ugrep shows the character name in all caps, while aliases are usually shown in lowercase in parentheses. Some aliases are treated differently: for aesthetic reasons, abbreviations are also shown in uppercase. For example:

� U+FEFF ZERO WIDTH NO-BREAK SPACE (byte order mark) (BOM) (ZWNBSP)

There are 31 characters in Unicode which have the wrong name in the UnicodeData.txt database. Unicode includes the correct name as an alias in NameAliases.txt. If that file exists on your system, then ugrep will show the correction in Title Case Letters and, if the terminal supports color text, in red.

︘ U+FE18 PRESENTATION FORM FOR VERTICAL RIGHT WHITE LENTICULAR BRAKCET (Presentation Form For Vertical Right White Lenticular Bracket)
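The alias styling described above can be sketched from the NameAliases.txt layout, where each line is `code point;alias;type`. This is an illustrative reading of that file, not ugrep's code; the function name `format_aliases` is hypothetical and the red coloring is left out.

```python
# Sample lines in NameAliases.txt layout: code point;alias;type
SAMPLE = """\
# code point;alias;type
FE18;PRESENTATION FORM FOR VERTICAL RIGHT WHITE LENTICULAR BRACKET;correction
FEFF;BYTE ORDER MARK;alternate
FEFF;BOM;abbreviation
FEFF;ZWNBSP;abbreviation
"""

def format_aliases(text):
    """Map code point -> list of parenthesized aliases, styled by type."""
    shown = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()   # drop comments and blank lines
        if not line:
            continue
        code, alias, kind = line.split(";")
        if kind == "correction":
            styled = alias.title()      # corrections in Title Case
        elif kind == "abbreviation":
            styled = alias              # abbreviations stay uppercase
        else:
            styled = alias.lower()      # ordinary aliases in lowercase
        shown.setdefault(int(code, 16), []).append(f"({styled})")
    return shown

for code, aliases in format_aliases(SAMPLE).items():
    print(f"U+{code:04X}", *aliases)
```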

View CJK (Chinese-Japanese-Korean) characters

Unicode does not actually define most CJK characters, except indirectly via Unihan, which maps certain blocks of characters to other standards.

  • Ugrep allows one to specify the code point or paste in an example character to look up.

      $ ugrep 𰻞
      𰻞   U+30EDE biangbiang noodles ( M: biáng )
      $ ugrep 8000
      耀   U+8000  shine, sparkle, dazzle; glory ( M: yào, C: jiu6, J: KAGAYAKU, K: YO )

View all characters defined by Unicode:

    $ ugrep .?  |  less
    ⋮[ ... over 30,000 glyphs elided for brevity ... ]

  • Want just Unicode glyphs without the description? Please use fonttable. It shows all defined Unicode characters by default.

Show all possible code points, even the ones not defined in Unicode:

    $ ugrep 0..10FFFF | less
    ⋮[ ... over a million lines elided for brevity ... ]

☝ This is currently very slow due to the way ugrep is implemented. You likely want to use fonttable -u instead.

Prerequisite: UnicodeData.txt

Ugrep requires the Unicode data file UnicodeData.txt, which can be installed on your system, in your home directory, or in the current directory.

Easiest: On Ubuntu and Debian GNU/Linux, simply apt install unicode-data.

Still easy: Or, you can download it by hand from unicode.org and place it in ~/.local/share/unicode/UnicodeData.txt.

Not hard: Or, if you wish the file to be accessible to all users on your machine, place it in /usr/local/share/unicode/UnicodeData.txt.
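A lookup over those locations might look like the sketch below. The search order, the `/usr/share/unicode` entry (assumed to be where the distribution package installs the file), and the `extra_dirs` parameter are all assumptions made for illustration, not ugrep's documented behavior.

```python
from pathlib import Path

def find_data_file(name="UnicodeData.txt", extra_dirs=()):
    """Return the first existing copy of name, or None if none is found."""
    candidates = [
        *map(Path, extra_dirs),                 # hypothetical override hook
        Path.cwd(),                             # current directory
        Path.home() / ".local/share/unicode",   # per-user install
        Path("/usr/local/share/unicode"),       # machine-wide install
        Path("/usr/share/unicode"),             # assumed package location
    ]
    for directory in candidates:
        path = directory / name
        if path.is_file():
            return path
    return None

print(find_data_file() or "UnicodeData.txt not found")
```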

Unihan CJK Support

If the file Unihan_Readings.txt exists, then ugrep will automatically use it to show an English gloss describing a character in the CJK (Chinese-Japanese-Korean) Ideographs region.

Your OS may make it easy to install (e.g., apt install unicode-data). On other systems, you can do this:

    mkdir -p ~/.local/share/unicode
    cd ~/.local/share/unicode
    wget ftp://ftp.unicode.org/Public/UNIDATA/Unihan.zip
    unzip Unihan.zip

CJK example

Example 1: Unicode code point

    $ ugrep 8000
    耀   U+8000  shine, sparkle, dazzle; glory ( M: yào, C: jiu6, J: KAGAYAKU, K: YO )

The parenthesized text at the end shows the romanized pronunciation of the character in Mandarin (pinyin), Cantonese (jyutping), Japanese (Hepburn), and Korean (Yale).

Example 2: Using -c to see characters in a string

    $ ugrep -c 「⿺辶⿳穴⿰月⿰⿲⿱幺長⿱言馬⿱幺長刂心」
    「   U+300C  LEFT CORNER BRACKET (opening corner bracket)
    ⿺   U+2FFA  IDEOGRAPHIC DESCRIPTION CHARACTER SURROUND FROM LOWER LEFT
    辶   U+8FB6  walk; walking; KangXi radical 162 ( M: chuò, J: SHINNYOU )
    ⿳   U+2FF3  IDEOGRAPHIC DESCRIPTION CHARACTER ABOVE TO MIDDLE AND BELOW
    穴   U+7A74  cave, den, hole; KangXi radical 116 ( M: xué, C: jyut6, J: ANA, K: HYEL, V: huyệt )
    ⿰   U+2FF0  IDEOGRAPHIC DESCRIPTION CHARACTER LEFT TO RIGHT
    月   U+6708  moon; month; KangXi radical 74 ( M: yuè, C: jyut6, J: TSUKI, K: WEL, V: nguyệt )
    ⿰   U+2FF0  IDEOGRAPHIC DESCRIPTION CHARACTER LEFT TO RIGHT
    ⿲   U+2FF2  IDEOGRAPHIC DESCRIPTION CHARACTER LEFT TO MIDDLE AND RIGHT
    ⿱   U+2FF1  IDEOGRAPHIC DESCRIPTION CHARACTER ABOVE TO BELOW
    幺   U+5E7A  one; tiny, small ( M: yāo, C: jiu1, J: CHIISAI, K: YO )
    長   U+9577  long; length; excel in; leader ( M: zhǎng, C: coeng4 zoeng2, J: NAGAI TAKERU OSA, K: CANG, V: trường )
    ⿱   U+2FF1  IDEOGRAPHIC DESCRIPTION CHARACTER ABOVE TO BELOW
    言   U+8A00  words, speech; speak, say ( M: yán, C: jin4, J: KOTO IU KOTOBA, K: EN UN, V: ngôn )
    馬   U+99AC  horse; surname; KangXi radical 187 ( M: mǎ, C: maa5, J: UMA, K: MA, V: mã )
    ⿱   U+2FF1  IDEOGRAPHIC DESCRIPTION CHARACTER ABOVE TO BELOW
    幺   U+5E7A  one; tiny, small ( M: yāo, C: jiu1, J: CHIISAI, K: YO )
    長   U+9577  long; length; excel in; leader ( M: zhǎng, C: coeng4 zoeng2, J: NAGAI TAKERU OSA, K: CANG, V: trường )
    刂   U+5202  knife; radical number 18 ( M: dāo, C: dou1, J: RITSUTOU, K: TO )
    心   U+5FC3  heart; mind, intelligence; soul ( M: xīn, C: sam1, J: KOKORO, K: SIM, V: tâm )
    」   U+300D  RIGHT CORNER BRACKET (closing corner bracket)

Note 1: A "definition" is not a translation

Unihan calls the English gloss the character's "definition", but that is meant in a very loose sense. CJK characters change meaning based upon the context they are used in. For example, most Chinese words are made of two characters, such as "蜂鳥", which means "hummingbird", but ugrep would show it as:

    $ ugrep -c 蜂鳥
    蜂   U+8702  bee, wasp, hornet ( M: fēng, C: fung1, J: HACHI, K: PONG, V: ong )
    鳥   U+9CE5  bird; KangXi radical 196 ( M: niǎo, C: niu5, J: TORI, K: CO, V: điểu )

Note 2: Not all characters have readings

Unihan refers to this supplemental information (both the English gloss and the romanizations) as "readings". Readings are meant to be helpful, but are not normative and are only available for some characters.

                          Count     Percent
    All CJK Characters    93,858    100%
    Have any reading      47,429     51%
    Mandarin Pinyin       41,378     44%
    Cantonese Jyutping    23,112     25%
    English definition    21,076     23%
    Japanese Hepburn      11,293     12%
    Korean Yale            9,051     10%
    Vietnamese             8,301      9%

Example of CJK with no Mandarin

    $ ugrep 2bac3
    𫫃   U+2BAC3 (Cant.) sarcastic interrogative ( C: e1 )

Example of CJK with no pronunciation

    $ ugrep 20015
    𠀕   U+20015 Variant of U+4E99 亙

Example of CJK with no English definition

    $ ugrep 20016
    𠀖   U+20016 [CJK Unified Ideographs Extension B] ( V: khạng )

Example of CJK with no readings whatsoever

    $ ugrep 2abcd
    𪯍   U+2ABCD [CJK Unified Ideographs Extension C]

Note that ugrep currently prints just the name of the block the character is in [within square brackets] if it has no better way to identify the character.


Boring Implementation notes

This is a rewrite of b9's AWK ugrep into Python. While AWK makes more sense for what this program does (comparing fields based on regexps), a rewrite was necessary because GNU awk, while plenty powerful, uses \y for word edges instead of the standard \b. Gawk does this for backwards compatibility with historic AWK, but lacks a way to disable it for new scripts.

Switching to Python did have the benefit of allowing more powerful Perlesque regexes (not that anyone has requested that).

Why not use the unicodedata module?

I do not use Python's unicodedata module because it is woefully insufficient. It allows one to search by character name only by specifying it fully and exactly: unicodedata.lookup("ROTATED HEAVY BLACK HEART BULLET").
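The limitation is easy to demonstrate. In this small sketch (the wrapper name `exact_lookup` is hypothetical), the full name resolves to a character, while a fragment of the same name raises `KeyError` inside `unicodedata.lookup` and therefore finds nothing:

```python
import unicodedata

def exact_lookup(name):
    """Return the character with exactly this name, or None if there is none."""
    try:
        return unicodedata.lookup(name)
    except KeyError:        # lookup() only accepts complete, exact names
        return None

print(exact_lookup("ROTATED HEAVY BLACK HEART BULLET"))   # the full name works
print(exact_lookup("HEART BULLET"))                       # a fragment does not
```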

Future Work

Rename this project

Although I believe this ugrep existed first, there is now another ugrep which is quite widely known (with good reason, as it looks pretty nifty) and which has nothing to do with looking up Unicode characters. The 'U' appears to stand for Ultra-fast, as it is a very speedy grep with lots of bells and whistles.

What shall this project's new name be? ug is also taken by the other ugrep. How about ugre? It's an ugly, ogreish name, but it's probably a safe bet nobody is going to use that name for something else.

Maybe use Unihan_Readings.txt for grepping

Currently, if Unihan_Readings.txt is installed (which is the default if the user has done apt install unicode-data) and the user requests a character that is not in UnicodeData.txt, then the Readings data is used to show information about the character. However, Unihan_Readings could be used in the future for searching for characters to show.

Example data from Unihan_Readings for U+9B44 (魄):

    U+9B44  kCantonese      bok3 paak3 tok3
    U+9B44  kDefinition     vigor; body; dark part of moon
    U+9B44  kHangul         백:0N
    U+9B44  kHanyuPinlu     pò(11)
    U+9B44  kHanyuPinyin    74431.090:pò,bó,tuò
    U+9B44  kJapaneseKun    TAMASHII
    U+9B44  kJapaneseOn     HAKU BAKU
    U+9B44  kKorean         PAYK
    U+9B44  kMandarin       pò
    U+9B44  kTGHZ2013       287.140:pò
    U+9B44  kTang           *pæk
    U+9B44  kVietnamese     phách
    U+9B44  kXHC1983        0084.110:bó 0887.020:pò 1175.020:tuò

See UAX #38: Unicode Han Database.
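Since each line of Unihan_Readings.txt is three tab-separated fields (code point, field name, value), loading it and searching the definitions takes little code. The following is a rough sketch of that future feature, not existing ugrep behavior; `parse_readings` and `search_definitions` are hypothetical names, and the sample holds three of the lines shown above.

```python
# Three of the sample lines above, with the real tab separators.
SAMPLE = (
    "# sample lines in Unihan_Readings.txt layout\n"
    "U+9B44\tkCantonese\tbok3 paak3 tok3\n"
    "U+9B44\tkDefinition\tvigor; body; dark part of moon\n"
    "U+9B44\tkMandarin\tpò\n"
)

def parse_readings(text):
    """Map code point -> {field name: value}."""
    readings = {}
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue
        code, field, value = line.split("\t", 2)
        readings.setdefault(int(code[2:], 16), {})[field] = value
    return readings

def search_definitions(readings, word):
    """Return characters whose kDefinition contains word (case-insensitive)."""
    word = word.lower()
    return [
        chr(cp)
        for cp, fields in readings.items()
        if word in fields.get("kDefinition", "").lower()
    ]

readings = parse_readings(SAMPLE)
print(search_definitions(readings, "moon"), readings[0x9B44]["kMandarin"])
```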

Two levels of Unihan support:

  1. Show kDefinition if block name is CJK Ideographs
  2. Search Unihan_Readings when searching for a word. Possible example:

       $ ugrep mononoke
       魅   U+9B45  MONONOKE BAKEMONO SUDAMA (kind of forest demon, elf)

Number 1 is finished and working, but number 2 may require a command line switch or some other way of enabling/disabling it, as searching through the Readings file may be slow or cause other problems.

Maybe use NamesList.txt

It looks like NamesList.txt might be useful to also parse as it allows multiple aliases for a character. For example (from grep -B1 [=%] NamesList.txt):

    0023    NUMBER SIGN
            = pound sign, hash, crosshatch, octothorpe
    002E    FULL STOP
            = period, dot, decimal point
    --
    002F    SOLIDUS
            = slash, virgule
    1F70A   ALCHEMICAL SYMBOL FOR VINEGAR
            = crucible; acid; distill; atrament; vitriol; red
              sulfur; borax; wine; alkali salt; mercurius vivus,
              quick silver

I'm not sure how useful this will be (who is going to look up the number sign by searching on "octothorpe"?), but it'd be nice to be able to at least show them as aliases.

Also, NamesList.txt has a fascinating "cross reference" feature:

    0021    EXCLAMATION MARK
            = factorial
            = bang
            x (inverted exclamation mark - 00A1)
            x (latin letter retroflex click - 01C3)
            x (double exclamation mark - 203C)
            x (interrobang - 203D)
            x (heavy exclamation mark ornament - 2762)

How would one find the interrobang (‽) without such a cross reference?

Note that the NamesList.txt file actually starts with a warning not to parse it, as it says it is generated mechanically from UnicodeData.txt plus "manually created annotations". However, those annotations are what is interesting about the file (the aliases and cross references) and there appears to be no other official source of that data.
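A deliberately simplified reader for those annotations could look like this. It assumes tab-indented `=` alias lines and `x (name - XXXX)` cross-reference lines as in the excerpts above, splits aliases on commas only, and ignores continuation lines and the other annotation types; `parse_names_list` is a hypothetical name.

```python
import re

SAMPLE = (
    "0021\tEXCLAMATION MARK\n"
    "\t= factorial\n"
    "\t= bang\n"
    "\tx (inverted exclamation mark - 00A1)\n"
    "\tx (interrobang - 203D)\n"
    "0023\tNUMBER SIGN\n"
    "\t= pound sign, hash, crosshatch, octothorpe\n"
)

ENTRY = re.compile(r"^([0-9A-F]{4,6})\t(.+)$")
XREF = re.compile(r"^\tx \(.* - ([0-9A-F]{4,6})\)$")

def parse_names_list(text):
    """Map code point -> {'name', 'aliases', 'xrefs'} (simplified)."""
    entries = {}
    current = None
    for line in text.splitlines():
        entry = ENTRY.match(line)
        if entry:
            current = {"name": entry.group(2), "aliases": [], "xrefs": []}
            entries[int(entry.group(1), 16)] = current
            continue
        if current is None:
            continue                      # skip header lines before any entry
        if line.startswith("\t= "):
            current["aliases"] += [a.strip() for a in line[3:].split(",")]
        else:
            xref = XREF.match(line)
            if xref:
                current["xrefs"].append(int(xref.group(1), 16))
    return entries

entries = parse_names_list(SAMPLE)
print(entries[0x21]["aliases"], [f"U+{x:04X}" for x in entries[0x21]["xrefs"]])
```

With an index like this, a search for "interrobang" could follow the cross reference from U+0021 to U+203D.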

Bugs, Misfeatures, and Workarounds

  • ugrep 3400 shows the text defined in UnicodeData.txt, which states that it is "<CJK Ideograph Extension A, First>". Now that ugrep can show ideograph definitions using Unihan_Readings.txt, we should (probably) replace any string in angle brackets with more useful info.

  • Brace expansion is confusing because of needing to be quoted from the shell. It is supported for ranges (not sequences), but is not currently documented because usage is tricky and the functionality is not actually that helpful. For example, the following works:

    ugrep {0..F}{0,4,8,C}00

    but is easier to understand using range expansion:

    ugrep 0..FFFF..400

  • Range expansion and a seemingly equivalent regular expression search will give different results.

    $ ugrep 0..FFFF..400 | wc -l
    64
    $ ugrep U+[0-9A-F][048C]00 | wc -l
    22

    This is because regexes currently only return valid code points from the UnicodeData.txt file, whereas range expansions can generate code points which are in regions not directly defined by Unicode. For example, the range from U+4E00 to U+9FEF is a block of CJK Ideographs. Both are useful: regexes are blazingly fast, while range expansions have more functionality.

  • [Note: The following is not a problem for people who are willing to use vector fonts (truetype, opentype, postscript) that may be antialiased. Xterm uses fontconfig just fine.]

    For bitmap fonts, Xterm (as of version 369) seems to be able to only use one font at a time, which means a single font must have all the glyphs you want shown. (Yes, you can have a second bitmap font for "wide" CJK, but that's still not enough.)

    The author (hackerb9) currently prefers using the Neep bitmap font like so in ~/.Xresources:

    ! Neep looks nice, has good unicode coverage. Requires xfonts-jmk.
    xterm*vt100.font        :       *neep-medium-r-normal--20*10646*
    ! Neep lacks Asian characters
    xterm*vt100.wideFont    :       *fixed-medium-r-normal-ja-18*10646*

    Neep has two major downsides. 1. It is a bitmap font with only one size well implemented, so you can't zoom in or out. 2. It is limited to 65536 characters, which means it cannot show characters outside of Unicode's Basic Multilingual Plane, such as new emojis. Neep can be installed on Debian GNU/Linux systems with apt install xfonts-jmk.

  • Mlterm appears to have the same single font limitation as Xterm. Also, it right aligns text that has even a single character in a right-to-left alphabet, such as Arabic, so the output from ugrep will look a little funny.

  • Gnome-terminal uses font-config, so it has very nice Unicode support and can easily zoom in with Ctrl-+⃣ and Ctrl--⃣. Older versions had a bug where combining characters were combined with the following character instead of the previous, but this is now fixed.

    It does not support sixel graphics, so the -l option cannot show examples of the character in different fonts.
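Returning to the brace expansion item in the list above: the claim that `{0..F}{0,4,8,C}00` and `0..FFFF..400` name the same code points can be checked with a short Python sketch. The two helper names are hypothetical, and the code only enumerates code points; it does not call ugrep.

```python
from itertools import product

def brace_form():
    """Code points named by {0..F}{0,4,8,C}00."""
    return [int(f"{a}{b}00", 16) for a, b in product("0123456789ABCDEF", "048C")]

def range_form():
    """Code points named by 0..FFFF..400, end point included."""
    return list(range(0x0000, 0xFFFF + 1, 0x400))

print(len(brace_form()), brace_form() == range_form())   # prints "64 True"
```

Both forms produce 64 code points, matching the `wc -l` count for the range expansion; the regular expression search returns fewer because it only sees code points listed in UnicodeData.txt.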

