📐 Hidden alignment conditional random field for classifying string pairs.


dedupeio/pyhacrf

 
 


Hidden alignment conditional random field (HACRF) for classifying string pairs: a learnable edit distance.

Part of the Dedupe.io cloud service and open source toolset for de-duplicating and finding fuzzy matches in your data: https://dedupe.io

This package aims to implement the HACRF machine learning model with a sklearn-like interface. It includes ways to fit a model to training examples and to score new examples.

The model takes string pairs as input and classifies them into any number of classes. In McCallum's original paper the model was applied to the database deduplication problem. Each database entry was paired with every other entry, and the model then classified each pair as a 'match' or a 'mismatch' based on training examples of matches and mismatches.
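The pairing step described above can be sketched with the standard library. This is a minimal illustration only, not part of pyhacrf: the `candidate_pairs` helper and the sample records are hypothetical, and it assumes pairs are unordered, so each pair appears once.

```python
from itertools import combinations


def candidate_pairs(records):
    """Return every unordered pair of distinct records as (left, right) tuples.

    Hypothetical helper for illustration; pyhacrf does not provide it.
    """
    return list(combinations(records, 2))


# Made-up database entries, used only to show the shape of the output.
records = ["John Smith", "Jon Smith", "Jane Doe", "J. Smith"]
pairs = candidate_pairs(records)

# n records give n * (n - 1) / 2 pairs, so 4 records give 6 pairs.
print(len(pairs))
print(pairs[0])
```

Each resulting tuple has the same `(string, string)` shape as the entries of `training_X` in the example below. Because the number of pairs grows quadratically with the number of records, exhaustive pairing like this is only practical for small datasets.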

I also tried to use it as a learnable string edit distance for normalizing noisy text. See "A Conditional Random Field for Discriminatively-trained Finite-state String Edit Distance" by McCallum, Bellare, and Pereira, and the report "Conditional Random Fields for Noisy text normalisation" by Dirko Coetsee.
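For contrast with a learnable edit distance, here is a sketch of the classic fixed-cost Levenshtein distance, where every insertion, deletion, and substitution costs exactly 1. It is not the HACRF algorithm and not code from pyhacrf; the `levenshtein` function is written here only to show what "edit distance" means before the costs are learned from data.

```python
def levenshtein(source, target):
    """Classic edit distance with unit cost for each edit operation."""
    # previous[j] holds the distance between the processed prefix of
    # `source` and the first j characters of `target`.
    previous = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        current = [i]  # turning i source characters into "" takes i deletions
        for j, t_char in enumerate(target, start=1):
            substitution = previous[j - 1] + (s_char != t_char)
            insertion = current[j - 1] + 1
            deletion = previous[j] + 1
            current.append(min(substitution, insertion, deletion))
        previous = current
    return previous[-1]


print(levenshtein("helloooo", "hello"))  # 3 (delete three trailing 'o's)
print(levenshtein("h0me", "home"))       # 1 (substitute '0' with 'o')
print(levenshtein("krazii", "crazy"))    # 3
```

With fixed costs, swapping '0' for 'o' is penalised exactly as much as any other substitution. A model trained on labelled pairs can instead weight edits according to how strongly they indicate a match.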

Example

    from pyhacrf import StringPairFeatureExtractor, Hacrf

    training_X = [('helloooo', 'hello'),  # Matching examples
                  ('h0me', 'home'),
                  ('krazii', 'crazy'),
                  ('non matching string example', 'no really'),  # Non-matching examples
                  ('and another one', 'yep')]
    training_y = ['match', 'match', 'match', 'non-match', 'non-match']

    # Extract features
    feature_extractor = StringPairFeatureExtractor(match=True, numeric=True)
    training_X_extracted = feature_extractor.fit_transform(training_X)

    # Train model
    model = Hacrf(l2_regularization=1.0)
    model.fit(training_X_extracted, training_y)

    # Evaluate
    from sklearn.metrics import confusion_matrix
    predictions = model.predict(training_X_extracted)

    print(confusion_matrix(training_y, predictions))
    # > [[0 3]
    # >  [2 0]]

    print(model.predict_proba(training_X_extracted))
    # > [[ 0.94914812  0.05085188]
    # >  [ 0.92397711  0.07602289]
    # >  [ 0.86756034  0.13243966]
    # >  [ 0.05438812  0.94561188]
    # >  [ 0.02641275  0.97358725]]

Dependencies

This package depends on numpy. The LBFGS optimizer in pylbfgs is used, but alternative optimizers can be passed.

Install

Install by running:

python setup.py install

or from PyPI:

pip install pyhacrf

Developing

Clone from repository, then

    pip install -r requirements.txt
    cython pyhacrf/*.pyx
    python setup.py install

To deploy to PyPI, make sure you have compiled the *.pyx files to *.c first.
