
A powerful text cleaner for Japanese web texts


ku-nlp/text-cleaning


Description

This project cleans noisy Japanese text, such as web text containing many emoji and kaomoji, using a whitelist method.
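As a rough illustration of what a whitelist method means here, the sketch below keeps only characters from an explicitly allowed set and drops everything else. The character ranges and the name `keep_whitelisted` are assumptions made for this example; they are not taken from the project's source, and the real whitelist may differ. A character-level filter like this does not by itself remove kaomoji built from allowed characters (for example `o` or `T`), so the actual cleaner has to do more than this.

```python
import re

# Hypothetical whitelist (an assumption, not the project's actual list):
# anything NOT in this character class is removed.
NOT_ALLOWED = re.compile(
    r"[^"
    r"\u3041-\u3096"        # hiragana
    r"\u30A1-\u30FA\u30FC"  # katakana and the prolonged sound mark
    r"\u4E00-\u9FFF"        # CJK unified ideographs (kanji)
    r"0-9A-Za-z"            # ASCII letters and digits
    r"、。!?！？「」"        # a few punctuation marks
    r"]+"
)


def keep_whitelisted(text: str) -> str:
    """Drop every character that is not in the whitelist."""
    return NOT_ALLOWED.sub("", text)


print(keep_whitelisted("一緒に応援してるよ。ありがとう😃"))
# 一緒に応援してるよ。ありがとう
```

The appeal of a whitelist over a blacklist is that new emoji or decorative symbols are removed without having to be enumerated first.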

Cleaning Example

INPUT: これはサンプルです(≧∇≦*)!見てみて→http://a.bc/defGHIjkl
OUTPUT: これはサンプルです!見てみて。

INPUT: 一緒に応援してるよ(o^^o)。ありがとう😃
OUTPUT: 一緒に応援してるよ。ありがとう。

INPUT: いいぞ〜⸜(* ॑꒳ ॑*  )⸝⋆*
OUTPUT: いいぞ。

INPUT: えっ((((;゚Д゚)))))))
OUTPUT: えっ。

INPUT: 確かに「嘘でしょww」って笑ってたね
OUTPUT: 確かに「嘘でしょ。」って笑ってたね。

INPUT: おはようございますヽ(*´∀`)ノ。。今日は雨ですね・・・・・(T_T)
OUTPUT: おはようございます。今日は雨ですね。

INPUT: (灬º﹃º灬)おいしそうです♡
OUTPUT: おいしそうです。

INPUT: 今日の夜、友達とラーメン行くよ(((o(*゚▽゚*)o)))
OUTPUT: 今日の夜、友達とラーメン行くよ。

# When using the twitter option.
INPUT: @abcde0123 おっとっとwwそうでした✋!!よろしくお願いします♪‼ #挨拶
OUTPUT: おっとっと。そうでした!よろしくお願いします。
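In the Twitter example above, the leading `@abcde0123` mention and the trailing `#挨拶` hashtag are both absent from the output. A minimal sketch of that one step might look like the following. The function name `strip_twitter_markup` and both regular expressions are assumptions for illustration; the project's own Twitter handling is not shown in this README and may work differently.

```python
import re

# Assumed patterns, not the project's own:
# a mention is "@" followed by ASCII letters, digits, or underscores;
# a hashtag is "#" (half- or full-width) followed by non-space characters.
MENTION = re.compile(r"@[A-Za-z0-9_]+")
HASHTAG = re.compile(r"[#＃]\S+")


def strip_twitter_markup(text: str) -> str:
    """Remove @mentions and #hashtags, then trim surrounding whitespace."""
    text = MENTION.sub("", text)
    text = HASHTAG.sub("", text)
    return text.strip()


print(strip_twitter_markup("@abcde0123 おっとっと #挨拶"))
# おっとっと
```

Restricting the mention pattern to ASCII avoids swallowing Japanese text that directly follows a handle without a space.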

Requirements

  • Python 3.7+
  • mojimoji
  • neologdn
  • joblib

How to Run

Using python script directly

cat input.txt | python src/text_cleaning/main.py <options> > output.txt
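The command above treats the script as a filter: text arrives on standard input and cleaned text leaves on standard output. The sketch below shows that general pattern in Python so the same idea can be reused in tests or other pipelines. `filter_stream` and `main` are hypothetical names, and the identity function stands in for a real cleaner; nothing here reflects how `main.py` is actually organized.

```python
import sys
from typing import Callable, TextIO


def filter_stream(clean: Callable[[str], str], src: TextIO, dst: TextIO) -> None:
    """Read src line by line, apply clean() to each line, write the result to dst."""
    for line in src:
        dst.write(clean(line.rstrip("\n")) + "\n")


def main() -> None:
    # Identity "cleaner" as a placeholder; swap in a real cleaning function.
    filter_stream(lambda s: s, sys.stdin, sys.stdout)
```

Calling `main()` from a script would give the same `cat input.txt | python script.py > output.txt` shape as the command above.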

Using makefile

When input files are arranged in a directory hierarchy, you can clean them while preserving the directory structure by using the Makefile. If the inputs are compressed files, the Makefile detects their format from the file suffix and writes the cleaned files in the same format.

make INPUT_DIR=/somewhere/in OUTPUT_DIR=/somewhere/out PYTHON=/somewhere/.venv/bin/python

Options:

  • FILE_FORMAT=txt: Format of input file (txt or csv or tsv)
  • NUM_JOBS_PER_MACHINE=10: The maximum number of concurrently running jobs per machine
  • TWITTER=1: Perform twitter specific cleaning
  • PYTHON: Path to python interpreter of virtual environment
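The Makefile is described as choosing the compression format from each file's suffix and writing output in the same format. The README does not list which formats are supported, so the suffix table below is an assumption, and `open_by_suffix` is a hypothetical helper rather than part of the project. It is only meant to show the suffix-dispatch idea using Python's standard library.

```python
import bz2
import gzip
import lzma
from pathlib import Path

# Assumed suffix-to-opener table; the formats the Makefile really handles
# are not listed in this README.
OPENERS = {".gz": gzip.open, ".bz2": bz2.open, ".xz": lzma.open}


def open_by_suffix(path, mode="rt", encoding="utf-8"):
    """Open a file as text, picking the (de)compressor from its final suffix.

    Files with an unrecognized suffix are opened as plain text.
    """
    opener = OPENERS.get(Path(path).suffix, open)
    return opener(path, mode, encoding=encoding)
```

Using the same helper for reading (`"rt"`) and writing (`"wt"`) keeps the output in the input's format, as long as the output path carries the same suffix.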
