Essential NLP & ML, short & fast pure Python code
textgain/grasp
Grasp is a lightweight AI toolkit for Python, with tools for data mining, natural language processing (NLP), machine learning (ML) and network analysis. It has 300+ fast and essential algorithms, with ~25 lines of code per function, self-explanatory function names, no dependencies, bundled into one well-documented file: grasp.py (250KB). Or install with pip, including language models (50MB):
$ pip install git+https://github.com/textgain/grasp
Download stuff with download(url) (or dl), with built-in caching and logging:
src = dl('https://www.textgain.com', cached=True)
Parse HTML with dom(html) into an Element tree and search it with CSS Selectors:
for e in dom(src)('a[href^="http"]'):  # external links
    print(e.href)
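To make the selector a[href^="http"] concrete, here is a minimal sketch that collects the same kind of links using only the standard library. It does not use Grasp and is not how dom() works internally; LinkCollector and external_links are hypothetical names chosen for this illustration.

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collects href values of <a> tags whose href starts with 'http'."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != 'a':
            return
        for name, value in attrs:
            # value can be None for attributes without a value.
            if name == 'href' and value and value.startswith('http'):
                self.links.append(value)

def external_links(html):
    # Hypothetical helper: feed the HTML string and return the matches.
    parser = LinkCollector()
    parser.feed(html)
    return parser.links

html = '<p><a href="https://www.textgain.com">home</a> <a href="/about">about</a></p>'
print(external_links(html))
```

Relative links such as /about are skipped, which is what the ^= (starts with) operator expresses in the selector.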
Strip HTML with plain(Element) to get a plain text string:
for word, count in wc(plain(dom(src))).items():
    print(word, count)
Find articles with wikipedia(str), in HTML:
for e in dom(wikipedia('cat', language='en'))('p'):
    print(plain(e))
Find opinions with bluesky(str):
for post in first(10, bluesky('cats')):  # latest 10
    print(post.id, post.text, post.date)
Deploy APIs with App. Works with WSGI and Nginx:
app = App()
@app.route('/')
def index(*path, **query):
    return 'Hi! %s %s' % (path, query)
app.run('127.0.0.1', 8080, debug=True)
Once this app is up, go check http://127.0.0.1:8080/app?q=cat.
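For readers new to WSGI, the sketch below shows roughly what a handler that receives path segments and query parameters looks like as a bare WSGI callable, using only the standard library. This is an illustration under assumptions, not Grasp's App class; the names application and serve are hypothetical.

```python
from urllib.parse import parse_qs

def application(environ, start_response):
    """A bare WSGI callable that echoes the URL path segments and query."""
    # '/app' becomes ('app',); empty segments are dropped.
    path = tuple(p for p in environ.get('PATH_INFO', '/').split('/') if p)
    # parse_qs returns lists of values; keep the first value per key.
    query = {k: v[0] for k, v in parse_qs(environ.get('QUERY_STRING', '')).items()}
    body = ('Hi! %s %s' % (path, query)).encode('utf-8')
    start_response('200 OK', [('Content-Type', 'text/plain; charset=utf-8'),
                              ('Content-Length', str(len(body)))])
    return [body]

def serve(host='127.0.0.1', port=8080):
    # Blocking call; run it only when you want a local test server.
    from wsgiref.simple_server import make_server
    with make_server(host, port, application) as httpd:
        httpd.serve_forever()
```

Because application is a plain function, it can be exercised without a server by passing a small environ dict and a stub for start_response.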
Get language with lang(str) for 40+ languages, with ~92.5% accuracy:
print(lang('The cat sat on the mat.'))  # {'en': 0.99}
Get locations with loc(str) for 25K+ EU cities:
print(loc('The cat lives in Catena.'))  # {('Catena', 'IT', 43.8, 11.0): 1}
Get words & sentences with tok(str) (tokenize) at ~125K words/sec:
print(tok("Mr. etc. aren't sentence breaks! ;) This is:.", language='en'))
Get word polarity with pov(str) (point-of-view). Is it a positive or negative opinion?
print(pov(tok('Nice!', language='en')))  # +0.6
print(pov(tok('Dumb.', language='en')))  # -0.4
- For de, en, es, fr, nl, with ~75% accuracy.
- You'll need the language models in grasp/lm.
Tag word types with tag(str) in 10+ languages using robust ML models from UD:
for word, pos in tag(tok('The cat sat on the mat.'), language='en'):
    print(word, pos)
- Parts-of-speech include NOUN, VERB, ADJ, ADV, DET, PRON, PREP, ...
- For ar, da, de, en, es, fr, it, nl, no, pl, pt, ru, sv, tr, with ~95% accuracy.
- You'll need the language models in grasp/lm.
Tag keywords with trie, a compiled dict that scans ~250K words/sec:
t = trie({'cat*': 1, 'mat': 2})
for i, j, k, v in t.search('Cats love catnip.', etc='*'):
    print(i, j, k, v)
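The idea of a keyword trie with a trailing * wildcard can be sketched as follows. This is a simplified illustration, not Grasp's trie: build_trie and search are hypothetical names, matching is case-insensitive and word-by-word by assumption, and the (start, end, keyword, value) tuples are this sketch's own convention rather than a statement about what t.search() returns.

```python
import re

def build_trie(keywords):
    """Builds a nested-dict trie; a trailing '*' marks a prefix match."""
    root = {}
    for key, value in keywords.items():
        node = root
        for ch in key.rstrip('*').lower():
            node = node.setdefault(ch, {})
        # None is a sentinel key holding (is_prefix, keyword, value).
        node[None] = (key.endswith('*'), key, value)
    return root

def search(root, text):
    """Yields (start, end, keyword, value) for each matching word in text."""
    for m in re.finditer(r'\w+', text):
        word, node = m.group().lower(), root
        for n, ch in enumerate(word, 1):
            node = node.get(ch)
            if node is None:
                break  # no keyword continues with this character
            hit = node.get(None)
            # Exact keys must consume the whole word; prefix keys may stop early.
            if hit and (hit[0] or n == len(word)):
                yield m.start(), m.end(), hit[1], hit[2]
                break  # report the shortest matching keyword only

t = build_trie({'cat*': 1, 'mat': 2})
print(list(search(t, 'Cats love catnip.')))
# [(0, 4, 'cat*', 1), (10, 16, 'cat*', 1)]
```

Each word is checked in time proportional to its length, regardless of how many keywords the trie holds, which is why this structure suits large keyword lists.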
Get answers with gpt(). You'll need an OpenAI API key.
print(gpt("Why do cats sit on mats? (you're a psychologist)", key='...'))
Machine Learning (ML) algorithms learn by example. If you show them 10K spam and 10K real emails (i.e., train a model), they can predict whether other emails are also spam or not.
Each training example is a {feature: weight} dict with a label. For text, the features could be words, the weights could be word count, and the label might be real or spam.
Quantify text with vec(str) (vectorize) into a {feature: weight} dict:
v1 = vec('I love cats! 😀', features=('c3', 'w1'))
v2 = vec('I hate cats! 😡', features=('c3', 'w1'))
- c1, c2, c3 count consecutive characters. For c2, cats → 1x ca, 1x at, 1x ts.
- w1, w2, w3 count consecutive words.
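The counting described above can be sketched in a few lines of plain Python. This is only an illustration of character and word n-grams, not Grasp's vec(); char_ngrams and word_ngrams are hypothetical names, and splitting words on whitespace is a simplifying assumption.

```python
from collections import Counter

def char_ngrams(text, n):
    """Counts every run of n consecutive characters in text."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def word_ngrams(text, n):
    """Counts every run of n consecutive whitespace-separated words."""
    words = text.split()
    return Counter(' '.join(words[i:i + n]) for i in range(len(words) - n + 1))

print(dict(char_ngrams('cats', 2)))         # {'ca': 1, 'at': 1, 'ts': 1}
print(dict(word_ngrams('I love cats', 2)))  # {'I love': 1, 'love cats': 1}
```

Character n-grams tolerate spelling variation (cat, cats and catnip share ca and at), while word n-grams keep more of the meaning.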
Train models with fit(examples), save as JSON, predict labels:
m = fit([(v1, '+'), (v2, '-')], model=Perceptron)  # DecisionTree, KNN, ...
m.save('opinion.json')
m = fit(open('opinion.json'))
print(m.predict(vec('She hates dogs.')))  # {'+': 0.4, '-': 0.6}
Once trained, Model.predict(vector) returns a dict with label probabilities (0.0–1.0).
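To show how a model can learn from labeled {feature: weight} dicts, here is a minimal multiclass perceptron sketch. It is not Grasp's Perceptron: train, score and predict are hypothetical names, the word features are made up for the example, and the softmax at the end is just one assumed way to turn raw scores into values between 0.0 and 1.0 (they are not calibrated probabilities).

```python
import math
from collections import defaultdict

def score(weights, v):
    """Dot product of a label's weights with a sparse vector."""
    return sum(weights.get(f, 0.0) * x for f, x in v.items())

def train(examples, epochs=10):
    """Multiclass perceptron; returns {label: {feature: weight}}."""
    labels = sorted({label for _, label in examples})
    w = {label: defaultdict(float) for label in labels}
    for _ in range(epochs):
        for v, label in examples:
            guess = max(labels, key=lambda l: score(w[l], v))
            if guess != label:
                # Move weights toward the true label, away from the wrong guess.
                for f, x in v.items():
                    w[label][f] += x
                    w[guess][f] -= x
    return w

def predict(w, v):
    """Softmax over raw scores, so the returned values sum to 1.0."""
    exp = {label: math.exp(score(weights, v)) for label, weights in w.items()}
    total = sum(exp.values())
    return {label: e / total for label, e in exp.items()}

examples = [({'love': 1, 'cats': 1}, '+'), ({'hate': 1, 'cats': 1}, '-')]
w = train(examples)
p = predict(w, {'hate': 1, 'dogs': 1})
print(p)
```

After training, the shared feature cats ends up with weight 0 for both labels, while love and hate carry the decision; the unseen feature dogs contributes nothing.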
Map networks with Graph, a {node1: {node2: weight}} dict subclass:
g = Graph(directed=True)
g.add('a', 'b')  # a → b
g.add('b', 'c')  # b → c
g.add('b', 'd')  # b → d
g.add('c', 'd')  # c → d
print(g.sp('a', 'd'))  # shortest path: a → b → d
print(top(pagerank(g)))  # strongest node: d, 0.8
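The same small graph can be explored with a plain {node1: {node2: weight}} dict. The sketch below uses breadth-first search for the shortest path and a textbook power-iteration PageRank; it is an illustration under assumptions, not Grasp's code. Edge weights are ignored, shortest_path and this pagerank are hypothetical stand-ins, and the scores are scaled to sum to 1.0, so they will not match the 0.8 shown above.

```python
from collections import deque

def shortest_path(g, start, goal):
    """Breadth-first search: fewest hops, ignoring edge weights."""
    queue, seen = deque([[start]]), {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for n in g.get(path[-1], {}):
            if n not in seen:
                seen.add(n)
                queue.append(path + [n])
    return None  # goal is not reachable from start

def pagerank(g, damping=0.85, iterations=50):
    """Power iteration; rank held by dead-end nodes is spread evenly."""
    nodes = set(g) | {n for edges in g.values() for n in edges}
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        dangling = sum(rank[n] for n in nodes if not g.get(n))
        new = {n: (1 - damping + damping * dangling) / len(nodes) for n in nodes}
        for n, edges in g.items():
            for m in edges:
                new[m] += damping * rank[n] / len(edges)
        rank = new
    return rank

g = {'a': {'b': 1}, 'b': {'c': 1, 'd': 1}, 'c': {'d': 1}}
print(shortest_path(g, 'a', 'd'))  # ['a', 'b', 'd']
r = pagerank(g)
print(max(r, key=r.get))           # d
```

Node d ranks highest because it receives links from both b and c, while a receives none.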
See networks with viz(graph):
with open('g.html', 'w') as f:
    f.write(viz(g, src='graph.js'))
You'll need to set src to the grasp/graph.js lib.
Easy date handling with date(v), where v is an int, a str, or another date:
print(date('Mon Jan 31 10:00:00 +0000 2000', format='%Y-%m-%d'))
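For comparison, the same conversion with the standard library requires spelling out the input format. This sketch assumes the timestamp layout shown above and an English locale for the day and month names; reformat_date is a hypothetical helper, not part of Grasp.

```python
from datetime import datetime

def reformat_date(s, input_format='%a %b %d %H:%M:%S %z %Y', output_format='%Y-%m-%d'):
    """Parses a timestamp string and renders it in another format."""
    return datetime.strptime(s, input_format).strftime(output_format)

print(reformat_date('Mon Jan 31 10:00:00 +0000 2000'))  # 2000-01-31
```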
Easy path handling with cd(...), which always points to the script's folder:
print(cd('kb', 'en-loc.csv'))
Easy CSV handling with csv([path]), a list of lists of values:
for code, country, _, _, _, _, _ in csv(cd('kb', 'en-loc.csv')):
    print(code, country)
data = csv()
data.append(('cat', 'Kitty'))
data.append(('cat', 'Simba'))
data.save(cd('cats.csv'))
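As a point of comparison, the append-and-save round trip looks like this with Python's built-in csv module. This is a sketch only: the temporary folder is an assumption made so the example leaves no files behind, and it says nothing about how Grasp's csv() stores data.

```python
import csv
import os
import tempfile

rows = [('cat', 'Kitty'), ('cat', 'Simba')]
path = os.path.join(tempfile.mkdtemp(), 'cats.csv')

# newline='' lets the csv module control line endings itself.
with open(path, 'w', newline='', encoding='utf-8') as f:
    csv.writer(f).writerows(rows)

# Read the file back; csv.reader yields each row as a list of strings.
with open(path, newline='', encoding='utf-8') as f:
    loaded = [tuple(row) for row in csv.reader(f)]

print(loaded)  # [('cat', 'Kitty'), ('cat', 'Simba')]
```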
A challenge in AI is bias introduced by human trainers. Remember the Model trained earlier? Grasp has tools to explain how & why it makes decisions:
print(explain(vec('She hates dogs.'), m))  # why so negative?
In the returned dict, the model's explanation is: “you wrote hat + ate (hate)”.