📐 Hidden alignment conditional random field for classifying string pairs.
dedupeio/pyhacrf
Hidden alignment conditional random field for classifying string pairs: a learnable edit distance.
Part of the Dedupe.io cloud service and open source toolset for de-duplicating and finding fuzzy matches in your data: https://dedupe.io
This package aims to implement the HACRF machine learning model with an sklearn-like interface. It includes ways to fit a model to training examples and to score new examples.
The model takes string pairs as input and classifies them into any number of classes. In McCallum's original paper the model was applied to the database deduplication problem. Each database entry was paired with every other entry, and the model then classified whether the pair was a 'match' or a 'mismatch' based on training examples of matches and mismatches.
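The pairing step described above can be sketched with the standard library. This is a minimal illustration, not part of pyhacrf: the `records` list and the `candidate_pairs` helper are hypothetical names introduced here, and the sketch assumes unordered pairs (each pair of entries is compared once).

```python
from itertools import combinations

# Hypothetical toy records; in a real deduplication task these would be database entries.
records = ["Jon Smith", "John Smith", "Jane Doe", "J. Doe"]


def candidate_pairs(entries):
    """Return every unordered pair of distinct entries as a list of 2-tuples."""
    return list(combinations(entries, 2))


pairs = candidate_pairs(records)

# n entries yield n * (n - 1) / 2 pairs, so 4 entries yield 6 pairs.
print(len(pairs))
for left, right in pairs:
    print(left, "|", right)
```

Because the number of pairs grows quadratically with the number of entries, larger datasets usually need some way of limiting which pairs are compared before classification.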
I also tried to use it as a learnable string edit distance for normalizing noisy text. See "A Conditional Random Field for Discriminatively-trained Finite-state String Edit Distance" by McCallum, Bellare, and Pereira, and the report "Conditional Random Fields for Noisy text normalisation" by Dirko Coetsee.
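For contrast with a learnable edit distance, here is a minimal sketch of the classic Levenshtein distance, where every insertion, deletion, and substitution has a fixed cost of 1. This is not how pyhacrf computes anything; it only illustrates the fixed-cost baseline that a learned model replaces with weights fitted from training pairs. The function name `edit_distance` is an assumption made for this example.

```python
def edit_distance(source, target):
    """Classic Levenshtein distance with unit costs for insert, delete, and substitute."""
    # previous[j] holds the distance between the processed prefix of source and target[:j].
    previous = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        current = [i]  # Turning source[:i] into the empty string takes i deletions.
        for j, t_char in enumerate(target, start=1):
            substitution = previous[j - 1] + (s_char != t_char)
            insertion = current[j - 1] + 1
            deletion = previous[j] + 1
            current.append(min(substitution, insertion, deletion))
        previous = current
    return previous[-1]


print(edit_distance("helloooo", "hello"))  # 3 (three deletions)
print(edit_distance("h0me", "home"))       # 1 (one substitution)
```

With fixed costs, substituting '0' for 'o' costs as much as any other substitution. A learned model can instead assign such common noisy-text edits a lower effective cost.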
    from pyhacrf import StringPairFeatureExtractor, Hacrf

    training_X = [('helloooo', 'hello'),  # Matching examples
                  ('h0me', 'home'),
                  ('krazii', 'crazy'),
                  ('non matching string example', 'no really'),  # Non-matching examples
                  ('and another one', 'yep')]
    training_y = ['match',
                  'match',
                  'match',
                  'non-match',
                  'non-match']

    # Extract features
    feature_extractor = StringPairFeatureExtractor(match=True, numeric=True)
    training_X_extracted = feature_extractor.fit_transform(training_X)

    # Train model
    model = Hacrf(l2_regularization=1.0)
    model.fit(training_X_extracted, training_y)

    # Evaluate
    from sklearn.metrics import confusion_matrix
    predictions = model.predict(training_X_extracted)

    print(confusion_matrix(training_y, predictions))
    # > [[0 3]
    # >  [2 0]]

    print(model.predict_proba(training_X_extracted))
    # > [[0.94914812 0.05085188]
    # >  [0.92397711 0.07602289]
    # >  [0.86756034 0.13243966]
    # >  [0.05438812 0.94561188]
    # >  [0.02641275 0.97358725]]
This package depends on numpy. The LBFGS optimizer in pylbfgs is used, but alternative optimizers can be passed.
Install by running:
python setup.py install
or from pypi:
pip install pyhacrf
Clone the repository, then run:
pip install -r requirements.txt
cython pyhacrf/*.pyx
python setup.py install
To deploy to PyPI, make sure you have compiled the *.pyx files to *.c.