Welcome to the Splink Workflow Guide!

This repository provides a structured approach to entity resolution using Splink, enabling efficient record linkage and deduplication at scale.

📌 Overview

Splink is a Python package for probabilistic record linkage (entity resolution) that allows you to deduplicate and link records from datasets without unique identifiers.
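Under the hood, Splink follows the Fellegi-Sunter model: each field comparison contributes evidence based on its m-probability (the chance the field agrees given a true match) and u-probability (the chance it agrees by coincidence), and the match weight is the log2 Bayes factor. A toy calculation with purely illustrative numbers:

```python
import math

# Illustrative m/u probabilities for a single field, e.g. surname
m = 0.9   # P(surnames agree | records are a true match)
u = 0.01  # P(surnames agree | records are not a match)

match_weight = math.log2(m / u)  # Splink-style match weight (a log2 Bayes factor)
print(f"{match_weight:.2f}")     # ~6.49: agreement on this field is strong evidence
```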

This workflow demonstrates best practices to:

  • Clean and preprocess data for entity resolution.
  • Configure Splink for optimal matching.
  • Perform scalable and efficient record linkage.
  • Evaluate and fine-tune results for accuracy.

🏗️ Installation

To get started, install the required dependencies with the backend chosen for this guide, Spark:

```bash
pip install 'splink[spark]'
```
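Note that the quickstart below actually runs on DuckDB; as far as I'm aware, the DuckDB backend ships with the base package, so the Spark extra is only needed when you run on Spark:

```bash
pip install splink   # base install; includes the DuckDB backend used in the quickstart
```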

🚦 Getting Started

Follow these steps to use this workflow:

  1. Prepare your data – Ensure datasets are cleaned and formatted properly (see the preprocessing sketch after this list).
  2. Define linkage rules – Configure comparison and scoring functions.
  3. Run Splink – Execute entity resolution with your chosen backend (Spark, DuckDB, etc.).
  4. Evaluate results – Analyze and refine matches for accuracy.
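A minimal preprocessing sketch for step 1, assuming a hypothetical `data/customers.csv` with the same columns as the quickstart (Splink expects a unique ID column, named `unique_id` by default):

```python
import pandas as pd

# Hypothetical input file with the quickstart's columns
df = pd.read_csv("data/customers.csv")

# Normalise text fields so exact and fuzzy comparisons behave consistently
for col in ["first_name", "surname", "city", "email"]:
    df[col] = df[col].str.strip().str.lower()

# Standardise dates to ISO strings (the quickstart passes dob as a string)
df["dob"] = pd.to_datetime(df["dob"], errors="coerce").dt.strftime("%Y-%m-%d")

# Splink needs a unique identifier per record
df["unique_id"] = range(len(df))
```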

🚀 Quickstart

To get a basic Splink model up and running, use the following code. It demonstrates how to:

  • Estimate the parameters of a deduplication model
  • Use the parameter estimates to identify duplicate records
  • Use clustering to generate an estimated unique person ID

```python
import splink.comparison_library as cl
from splink import DuckDBAPI, Linker, SettingsCreator, block_on, splink_datasets

db_api = DuckDBAPI()

df = splink_datasets.fake_1000

# Define how records are compared and which pairs are generated for scoring
settings = SettingsCreator(
    link_type="dedupe_only",
    comparisons=[
        cl.NameComparison("first_name"),
        cl.JaroAtThresholds("surname"),
        cl.DateOfBirthComparison(
            "dob",
            input_is_string=True,
        ),
        cl.ExactMatch("city").configure(term_frequency_adjustments=True),
        cl.EmailComparison("email"),
    ],
    blocking_rules_to_generate_predictions=[
        block_on("first_name", "dob"),
        block_on("surname"),
    ],
)

linker = Linker(df, settings, db_api)

# Estimate model parameters
linker.training.estimate_probability_two_random_records_match(
    [block_on("first_name", "surname")],
    recall=0.7,
)

linker.training.estimate_u_using_random_sampling(max_pairs=1e6)

linker.training.estimate_parameters_using_expectation_maximisation(
    block_on("first_name", "surname")
)

linker.training.estimate_parameters_using_expectation_maximisation(block_on("email"))

# Score candidate pairs, then cluster them into estimated unique persons
pairwise_predictions = linker.inference.predict(threshold_match_weight=-5)

clusters = linker.clustering.cluster_pairwise_predictions_at_threshold(
    pairwise_predictions, 0.95
)

df_clusters = clusters.as_pandas_dataframe(limit=10)
```
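To sanity-check the output, the scored pairs can be pulled into pandas as well (this continues from the objects created above):

```python
# Each predicted pair carries a match_weight and a match_probability
df_predictions = pairwise_predictions.as_pandas_dataframe(limit=10)
print(df_predictions[["match_weight", "match_probability"]])
```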

🔧 Configuration

You can tweak the following settings for better results (an illustrative sketch follows the list):

  • Blocking rules: Define rules to limit the number of comparisons.
  • Scoring functions: Adjust parameters for similarity calculations.
  • Threshold tuning: Optimize match acceptance criteria.
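As a sketch only (the thresholds and the rule below are placeholders, not tuned recommendations), all three knobs live in `SettingsCreator` and `predict`:

```python
import splink.comparison_library as cl
from splink import SettingsCreator, block_on

settings = SettingsCreator(
    link_type="dedupe_only",
    comparisons=[
        # Scoring functions: explicit similarity thresholds instead of the defaults
        cl.LevenshteinAtThresholds("first_name", [1, 2]),
        cl.JaroWinklerAtThresholds("surname", [0.9, 0.7]),
    ],
    blocking_rules_to_generate_predictions=[
        # Blocking rule: only compare records sharing a first initial and surname
        block_on("substr(first_name, 1, 1)", "surname"),
    ],
)

# Threshold tuning: keep only pairs above a chosen match probability
# pairwise_predictions = linker.inference.predict(threshold_match_probability=0.9)
```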

📂 Project Structure

```
📁 splink/
├── 📜 README.md
├── 📦 requirements.txt
├── 📁 data/           # Sample datasets
├── 📁 notebooks/      # Jupyter notebooks
│   ├── 📁 tutorials/  # Step-by-step guides and explanations
│   └── 📁 examples/   # Practical use cases and sample workflows
├── 📁 scripts/        # Python scripts for automation
└── 📁 results/        # Output match results
```

🎯 Use Cases

This workflow is useful for:

  • Customer data deduplication
  • Fraud detection through identity matching
  • Merging datasets across different sources
  • Any other record-linkage task: imagination is the limit

🤝 Contributing

Contributions are welcome! Feel free to submit issues, suggestions, or pull requests.


Happy Linking! 🚀
