letuananh/chirptext

License

MIT license

ChirpText is a collection of text processing tools for Python 3.

It is not meant to be a powerful tank like the popular NLTK, but a small package that you can pip-install anywhere and use to process textual data in a few lines of code.

Main features

Simple file data manipulation using an enhanced open() function (txt, gz, binary, etc.)
CSV helper functions
Parse Japanese text with the mecab library (does not require the mecab-python3 package even on Windows; only a binary release, i.e. mecab.exe, is required)
Built-in "lite" text annotation formats (texttaglib TTL/CSV and TTL/JSON)
Helper functions and useful data for processing English, Japanese, Chinese and Vietnamese
Application configuration file management that can make an educated guess about where config files are located
Quick text-based report generation
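
The "educated guess" about config file locations can be pictured as a search over a few likely folders. The sketch below uses only the standard library and is not chirptext's implementation: the function names find_config and default_search_dirs, the search order, and the list of extensions are all assumptions made for illustration.

```python
from pathlib import Path


def find_config(name, search_dirs, extensions=(".json", ".ini", ".cfg")):
    """Return the first existing config file for `name`, or None.

    Candidates are tried in order: each directory, then each extension.
    """
    for folder in search_dirs:
        for ext in extensions:
            candidate = Path(folder).expanduser() / f"{name}{ext}"
            if candidate.is_file():
                return candidate
    return None


def default_search_dirs(app_name):
    """Hypothetical search order: current directory, then per-user folders."""
    home = Path.home()
    return [Path.cwd(), home / f".{app_name}", home / ".config" / app_name]


# Prints the path of the first match, or None when no config file exists
print(find_config("myapp", default_search_dirs("myapp")))
```

Checking the current directory first lets a project-local file override a per-user one, which is a common convention but only one of several reasonable orderings.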

Installation

chirptext is available on PyPI and can be installed using pip:

pip install chirptext

Parsing Japanese text

chirptext supports parsing Japanese text using different parsers (mecab, Janome, and igo-python)

    >>> from chirptext import deko
    >>> sent = deko.parse('猫が好きです。')
    >>> sent.tokens
    ['`猫`<0:1>', '`が`<1:2>', '`好き`<2:4>', '`です`<4:6>', '`。`<6:7>']
    >>> sent.tokens.values()
    ['猫', 'が', '好き', 'です', '。']
    >>> sent[0]
    `猫`<0:1>
    >>> sent[0].pos
    '名詞'
    >>> sent[1].lemma
    'が'
    >>> sent[2].reading
    'スキ'
    # tokenize
    >>> deko.tokenize('猫が好きです。')
    ['猫', 'が', '好き', 'です', '。']
    # split sentences
    >>> deko.tokenize_sent("猫が好きです。\n犬も好きです。")
    ['猫が好きです。', '犬も好きです。']
    # parse a document (i.e. multiple sentences)
    >>> doc = deko.parse_doc("猫が好きです。\n犬も好きです。")
    >>> for sent in doc:
    ...     print(sent, sent.tokens.values())
    ...
    猫が好きです。 ['猫', 'が', '好き', 'です', '。']
    犬も好きです。 ['犬', 'も', '好き', 'です', '。']
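
The `<0:1>` suffix in the token output above reads as a pair of character offsets into the sentence. The following sketch recomputes such offsets with plain string search. It is an illustration under that assumption, not the way chirptext derives them, and token_spans is a hypothetical helper name.

```python
def token_spans(text, tokens):
    """Locate each token in `text`, scanning left to right.

    Returns (token, start, end) triples where text[start:end] == token.
    """
    spans = []
    cursor = 0
    for token in tokens:
        start = text.index(token, cursor)  # raises ValueError if a token is missing
        end = start + len(token)
        spans.append((token, start, end))
        cursor = end  # continue after this token so repeated tokens resolve in order
    return spans


text = '猫が好きです。'
for token, start, end in token_spans(text, ['猫', 'が', '好き', 'です', '。']):
    print(f"`{token}`<{start}:{end}>")
```

For this sentence the computed offsets agree with the ones shown in the parser output, e.g. 好き covers characters 2 to 4.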

Notes: At least one of the following tools must be installed to use chirptext Japanese parsing:

mecab: http://taku910.github.io/mecab/#download
Janome: available on PyPI, install with pip install Janome
igo-python: available on PyPI, install with pip install igo-python
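
To see which of these tools are present on a machine before parsing, a quick standard-library check along these lines may help. It is a sketch, not part of chirptext: the function name is hypothetical, and the import names janome and igo are assumptions about how the two PyPI packages expose their modules.

```python
import importlib.util
import shutil


def available_japanese_backends():
    """Report which of the three tools appear to be installed locally."""
    found = []
    if shutil.which("mecab"):  # looks for a mecab executable on PATH
        found.append("mecab")
    if importlib.util.find_spec("janome") is not None:  # assumed import name
        found.append("Janome")
    if importlib.util.find_spec("igo") is not None:  # assumed import name
        found.append("igo-python")
    return found


print(available_japanese_backends())
```

An empty list suggests that none of the three is installed, or that mecab is installed somewhere outside PATH.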

Convenient IO APIs

    >>> from chirptext import chio
    >>> chio.write_tsv('data/test.tsv', [['a', 'b'], ['c', 'd']])
    >>> chio.read_tsv('data/test.tsv')
    [['a', 'b'], ['c', 'd']]
    >>> chio.write_file('data/content.tar.gz', 'Support writing to .tar.gz file')
    >>> chio.read_file('data/content.tar.gz')
    'Support writing to .tar.gz file'
    >>> for row in chio.read_tsv_iter('data/test.tsv'):
    ...     print(row)
    ...
    ['a', 'b']
    ['c', 'd']
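
For comparison, here is roughly what such helpers involve when written against the standard library alone. This is a minimal sketch and not chio's implementation: open_text, write_tsv and read_tsv are hypothetical names, and the sketch only switches on a .gz suffix (it does not handle tar archives).

```python
import csv
import gzip
import os
import tempfile


def open_text(path, mode="r"):
    """Open plain or gzip-compressed text, chosen by file extension."""
    opener = gzip.open if str(path).endswith(".gz") else open
    # newline="" lets the csv module control line endings itself
    return opener(path, mode + "t", encoding="utf-8", newline="")


def write_tsv(path, rows):
    """Write rows (lists of strings) as tab-separated values."""
    with open_text(path, "w") as stream:
        csv.writer(stream, delimiter="\t").writerows(rows)


def read_tsv(path):
    """Read a tab-separated file back into a list of rows."""
    with open_text(path) as stream:
        return list(csv.reader(stream, delimiter="\t"))


with tempfile.TemporaryDirectory() as folder:
    for name in ("test.tsv", "test.tsv.gz"):
        path = os.path.join(folder, name)
        write_tsv(path, [["a", "b"], ["c", "d"]])
        print(name, read_tsv(path))
```

Routing every call through one opener keeps the TSV functions unaware of compression, which is the convenience an enhanced open() provides.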

Sample TextReport

    # Note: this snippet assumes TextReport has been imported from chirptext, and that
    # LOREM_IPSUM (the sample text) and ct (a counter object providing summarise())
    # are defined beforehand.

    # a string report
    rp = TextReport()  # by default, TextReport will write to standard output, i.e. terminal
    rp = TextReport(TextReport.STDOUT)  # same as above
    rp = TextReport('~/tmp/my-report.txt')  # output to a file
    rp = TextReport.null()  # output to /dev/null, i.e. nowhere
    rp = TextReport.string()  # output to a string. Call rp.content() to get the string
    rp = TextReport(TextReport.STRINGIO)  # same as above

    # TextReport will close the output stream automatically by using the with statement
    with TextReport.string() as rp:
        rp.header("Lorem Ipsum Analysis", level="h0")
        rp.header("Raw", level="h1")
        rp.print(LOREM_IPSUM)
        rp.header("Top 5 most common letters")
        ct.summarise(report=rp, limit=5)
        print(rp.content())

Output

    +----------------------------------------------------------------------------------
    | Lorem Ipsum Analysis
    +----------------------------------------------------------------------------------

    Raw
    ------------------------------------------------------------
    Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.

    Top 5 most common letters
    ------------------------------------------------------------
    i: 42
    e: 37
    t: 32
    o: 29
    a: 29
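
The last section of this output is a frequency table. A standard-library approximation of that step is sketched below, using collections.Counter and io.StringIO in place of chirptext's counter and TextReport. The function name letter_report is hypothetical, and lowercasing the text and skipping non-letters are assumptions, so its counts need not match the figures above.

```python
import io
from collections import Counter


def letter_report(text, limit=5, width=60):
    """Build a small text report of the most common letters in `text`."""
    counts = Counter(ch for ch in text.lower() if ch.isalpha())
    out = io.StringIO()
    print(f"Top {limit} most common letters", file=out)
    print("-" * width, file=out)
    for letter, count in counts.most_common(limit):
        print(f"{letter}: {count}", file=out)
    return out.getvalue()


print(letter_report("banana bandana", limit=2))
```

For the sample string, the two report rows are a: 6 and n: 4.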

Useful links

Documentation: https://chirptext.readthedocs.io
Source code: https://github.com/letuananh/chirptext/
PyPI: https://pypi.org/project/chirptext/

Releases

Latest release: chirptext version 0.1.2 maintenance release (May 20, 2021), one of 13 releases.
