An open-source NLP library: fast text cleaning and preprocessing
iaramer/dobbi
Takes care of all of this boring NLP stuff
This library provides quick, ready-to-use preprocessing tools for text cleaning and normalization. You can easily remove hashtags, nicknames, emoji, URLs, punctuation, whitespace, and more.
To download dobbi, either fork this GitHub repo or install it from PyPI via pip:

    $ pip install dobbi

Import the library:

    import dobbi
The library uses method chaining in order to simplify text processing:
    import pandas as pd

    d = {'text': ['#fun #lol Why @Alex33 is so funny here: https://some-url.com',
                  '#looool =) 😍 such lovely!?*!!!%&']}
    df = pd.DataFrame(d)

    # Remove hashtags, nicknames and URLs
    cln_func = dobbi.clean() \
        .hashtag() \
        .nickname() \
        .url() \
        .function()
    df['text'] = df['text'].map(cln_func)

    # Replace emoji, emoticons and punctuation
    repl_func = dobbi.replace() \
        .emoji() \
        .emoticon() \
        .punctuation() \
        .function()
    df['text'] = df['text'].map(repl_func)
Result:
    print(df['text'][0])
    # 'Why is so funny here'
    print(df['text'][1])
    # 'TOKEN_EMOTICON_HAPPY_FACE_OR_SMILEY TOKEN_EMOJI_SMILING_FACE_WITH_HEART_EYES such lovely'
The process consists of three stages:
- Initialization methods: initialize a dobbi Work object
- Intermediate methods: chain patterns in the needed order
- Terminal methods: choose if you need a function or a result
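
The three stages can be pictured with a small stand-in class built on the standard `re` module. This is only a sketch of the chaining pattern, not dobbi's actual implementation: the class name `MiniChain` is hypothetical, and the regular expressions are the ones used in the regexp example in this README.

```python
import re


class MiniChain:
    """Toy illustration of a chainable cleaner; NOT dobbi's real code."""

    def __init__(self):
        # Stage 1 (initialization): start with an empty list of patterns.
        self._patterns = []

    def regexp(self, pattern):
        # Stage 2 (intermediate): record a pattern and return self,
        # which is what allows the next call to be chained.
        self._patterns.append(re.compile(pattern))
        return self

    def hashtag(self):
        return self.regexp(r'#\w+')

    def nickname(self):
        return self.regexp(r'@\w+')

    def execute(self, text):
        # Stage 3 (terminal): apply the patterns in the order they were chained.
        for pattern in self._patterns:
            text = pattern.sub('', text)
        # Collapse the gaps left behind by the removed matches.
        return ' '.join(text.split())


result = MiniChain().hashtag().nickname().execute('#fun Why @Alex33 is so funny')
print(result)  # Why is so funny
```

Returning `self` from every intermediate method is the design choice that makes the fluent style possible; the terminal method is the only one that returns something else.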
Initialization functions:
    dobbi.clean()
    dobbi.collect()
    dobbi.replace()
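
The README does not describe the three modes in detail. As a rough illustration, assuming that clean removes matches, replace swaps them for a token, and collect gathers them (the last is a guess based on the name alone), the modes could be mimicked with the standard `re` module. The function names and the `TOKEN_HASHTAG` default are hypothetical and are not taken from dobbi.

```python
import re

# Hashtag pattern borrowed from the regexp example in this README.
HASHTAG = re.compile(r'#\w+')


def clean_hashtags(text):
    # "clean" mode (assumed): drop every match.
    return HASHTAG.sub('', text)


def replace_hashtags(text, token='TOKEN_HASHTAG'):
    # "replace" mode (assumed): substitute each match with a token.
    return HASHTAG.sub(token, text)


def collect_hashtags(text):
    # "collect" mode (assumed): return the matches instead of the text.
    return HASHTAG.findall(text)


sample = '#fun day #lol'
print(clean_hashtags(sample).strip())  # day
print(replace_hashtags(sample))        # TOKEN_HASHTAG day TOKEN_HASHTAG
print(collect_hashtags(sample))        # ['#fun', '#lol']
```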
Intermediate methods (pattern processing choice):
- regexp() - custom regular expressions
- url() - URLs
- html() - HTML and "<...>"-style markup
- punctuation() - punctuation
- hashtag() - hashtags
- emoji() - emoji
- emoticon() - emoticons
- whitespace() - any kind of whitespace
- nickname() - nicknames starting with @
Terminal methods:
- execute(str) - executes the chosen methods on the provided string.
- function() - returns a function that combines the chosen methods.
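
The difference between the two terminal methods is essentially "run now" versus "build a callable for later". A minimal sketch of the second idea, using only the standard library, is shown here. The helper `make_cleaner` is hypothetical and only imitates what a `function()`-style terminal might return; it is not dobbi's code.

```python
import re
from functools import reduce


def make_cleaner(*patterns):
    """Combine several regex removals into one reusable function."""
    compiled = [re.compile(p) for p in patterns]

    def cleaner(text):
        # Apply each removal in turn, left to right.
        stripped = reduce(lambda acc, rx: rx.sub('', acc), compiled, text)
        # Collapse the gaps left behind by the removed matches.
        return ' '.join(stripped.split())

    return cleaner


clean_tags = make_cleaner(r'#\w+', r'@\w+')
texts = ['#fun Why @Alex33 is so funny', 'no tags here']
cleaned = list(map(clean_tags, texts))
print(cleaned)  # ['Why is so funny', 'no tags here']
```

A callable built once like this can be passed to `map`, which is the same shape of usage as `df['text'].map(...)` in the pandas example.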
    dobbi.clean() \
        .hashtag() \
        .nickname() \
        .url() \
        .execute('#fun #lol Why @Alex33 is so funny? Check here: https://some-url.com')
Result:
    'Why is so funny? Check here:'

- Replace matches with default or custom tokens

    dobbi.replace() \
        .hashtag('') \
        .nickname() \
        .url('__CUSTOM_URL_TOKEN__') \
        .execute('#fun #lol Why @Alex33 is so funny? Check here: https://some-url.com')
Result:
    'Why TOKEN_NICKNAME is so funny? Check here: __CUSTOM_URL_TOKEN__'

- Build a reusable function

    func = dobbi.clean() \
        .url() \
        .hashtag() \
        .punctuation() \
        .whitespace() \
        .html() \
        .function()
    func('\t #fun #lol Why @Alex33 is so... funny? <tag>\nCheck\there: https://some-url.com')
Result:
    'Why Alex33 is so funny Check here'

- Chain regexp methods
    # Raw strings keep backslashes such as \w and \S intact
    dobbi.clean() \
        .regexp(r'#\w+') \
        .regexp(r'@\w+') \
        .regexp(r'https?://\S+') \
        .execute('#fun #lol Why @Alex33 is so funny? Check here: https://some-url.com')
Result:
    'Why is so funny? Check here:'

- Remove emoji and emoticons
    em_func = dobbi.clean() \
        .emoji() \
        .emoticon() \
        .punctuation() \
        .function()
    em_func('Great! =) :D 😍 😋such lovely!?*!!!%&')
Result:
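    'Great such lovely'

Note that the functions are applied in the order in which you chain them, so it is best to place .punctuation() near the end of the chain. Removing punctuation earlier can strip the characters (such as # or @, or the symbols that make up an emoticon) that later patterns need in order to match.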
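
The effect of ordering can be reproduced with plain regular expressions. The sketch below does not use dobbi: the helper `apply_in_order` and the punctuation pattern are assumptions chosen for illustration, and the hashtag pattern is the one from the regexp example.

```python
import re

# Assumed stand-in for punctuation: anything that is neither a word
# character nor whitespace.
PUNCTUATION = re.compile(r'[^\w\s]')
HASHTAG = re.compile(r'#\w+')


def apply_in_order(text, patterns):
    # Remove each pattern in the given order, then tidy up the spacing.
    for pattern in patterns:
        text = pattern.sub('', text)
    return ' '.join(text.split())


text = '#fun Why so funny?'

# Hashtags first: '#fun' is still intact when the hashtag pattern runs.
print(apply_in_order(text, [HASHTAG, PUNCTUATION]))  # Why so funny

# Punctuation first: '#' is stripped, so 'fun' no longer looks like a hashtag.
print(apply_in_order(text, [PUNCTUATION, HASHTAG]))  # fun Why so funny
```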
If you enjoyed the project, I would be grateful if you supported it :)
Here are the kinds of contributions I would welcome:
- Finding bugs
- Making code optimizations
- Writing tests
- Helping to develop new features