Welcome to the Splink Workflow Guide!
This repository provides a structured approach to entity resolution using Splink, enabling efficient record linkage and deduplication at scale.
Splink is a Python package for probabilistic record linkage (entity resolution) that allows you to deduplicate and link records from datasets without unique identifiers.
This workflow demonstrates best practices to:
- Clean and preprocess data for entity resolution.
- Configure Splink for optimal matching.
- Perform scalable and efficient record linkage.
- Evaluate and fine-tune results for accuracy.
To get started, install the required dependencies for your chosen backend (Spark in this example):
```
pip install 'splink[spark]'
```
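To confirm the installation succeeded, you can check the installed version. This is a minimal sketch using only the Python standard library, not a Splink-specific API:

```python
# Verify that Splink installed correctly by reading its package metadata
import importlib.metadata

print(importlib.metadata.version("splink"))
```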
Follow these steps to use this workflow:
- Prepare your data – Ensure datasets are cleaned and formatted properly.
- Define linkage rules – Configure comparison and scoring functions.
- Run Splink – Execute entity resolution with your chosen backend (Spark, DuckDB, etc.).
- Evaluate results – Analyze and refine matches for accuracy.
To get a basic Splink model up and running, use the following code (this quickstart uses the DuckDB backend). It demonstrates how to:
- Estimate the parameters of a deduplication model
- Use the parameter estimates to identify duplicate records
- Use clustering to generate an estimated unique person ID
```python
import splink.comparison_library as cl
from splink import DuckDBAPI, Linker, SettingsCreator, block_on, splink_datasets

db_api = DuckDBAPI()

# Sample dataset of 1,000 fake person records bundled with Splink
df = splink_datasets.fake_1000

# Define how records are compared and which candidate pairs are generated
settings = SettingsCreator(
    link_type="dedupe_only",
    comparisons=[
        cl.NameComparison("first_name"),
        cl.JaroAtThresholds("surname"),
        cl.DateOfBirthComparison(
            "dob",
            input_is_string=True,
        ),
        cl.ExactMatch("city").configure(term_frequency_adjustments=True),
        cl.EmailComparison("email"),
    ],
    blocking_rules_to_generate_predictions=[
        block_on("first_name", "dob"),
        block_on("surname"),
    ],
)

linker = Linker(df, settings, db_api)

# Estimate the model parameters
linker.training.estimate_probability_two_random_records_match(
    [block_on("first_name", "surname")],
    recall=0.7,
)
linker.training.estimate_u_using_random_sampling(max_pairs=1e6)
linker.training.estimate_parameters_using_expectation_maximisation(
    block_on("first_name", "surname")
)
linker.training.estimate_parameters_using_expectation_maximisation(block_on("email"))

# Score candidate pairs, then cluster them into estimated unique persons
pairwise_predictions = linker.inference.predict(threshold_match_weight=-5)
clusters = linker.clustering.cluster_pairwise_predictions_at_threshold(
    pairwise_predictions, 0.95
)

df_clusters = clusters.as_pandas_dataframe(limit=10)
```
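As a quick sanity check after the quickstart, you can inspect which records ended up in the same cluster. The snippet below is a plain-pandas sketch that continues from the code above; it assumes the clustered output exposes a `cluster_id` column alongside the original columns of the fake_1000 dataset (first_name, surname, dob, city, email).

```python
# Sketch: continues from the quickstart above; assumes a `cluster_id` column
# is present in the clustered output
df_all = clusters.as_pandas_dataframe()  # full output, not just the first 10 rows

# Records sharing a cluster_id are estimated to refer to the same person
cluster_sizes = df_all.groupby("cluster_id").size().sort_values(ascending=False)
print(cluster_sizes.head(10))

# Inspect the members of the largest estimated cluster
largest = cluster_sizes.index[0]
print(df_all.loc[df_all["cluster_id"] == largest,
                 ["first_name", "surname", "dob", "city", "email"]])
```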
You can tweak the following settings for better results (an illustrative sketch follows this list):
- Blocking rules: Define rules to limit the number of comparisons.
- Scoring functions: Adjust parameters for similarity calculations.
- Threshold tuning: Optimize match acceptance criteria.
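For instance, a variation on the quickstart's settings might look like the sketch below. It is purely illustrative: the column names follow the fake_1000 example above, and the right comparisons, blocking rules, and thresholds depend on your data.

```python
# Illustrative sketch only - reuses column names from the quickstart above
import splink.comparison_library as cl
from splink import SettingsCreator, block_on

settings = SettingsCreator(
    link_type="dedupe_only",
    comparisons=[
        # Scoring functions: choose or reconfigure a comparison per column
        cl.NameComparison("first_name"),
        cl.JaroAtThresholds("surname"),
        cl.ExactMatch("city").configure(term_frequency_adjustments=True),
        cl.EmailComparison("email"),
    ],
    # Blocking rules: only pairs agreeing on at least one rule are scored,
    # so these directly control how many comparisons are generated
    blocking_rules_to_generate_predictions=[
        block_on("first_name", "dob"),
        block_on("email"),
    ],
)

# Threshold tuning: a higher clustering threshold favours precision,
# a lower one favours recall (0.95 was used in the quickstart)
# clusters = linker.clustering.cluster_pairwise_predictions_at_threshold(
#     pairwise_predictions, 0.99
# )
```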
```
📁 splink/
├── 📜 README.md
├── 📦 requirements.txt
├── 📁 data/              # Sample datasets
├── 📁 notebooks/         # Jupyter notebooks
│   ├── 📁 tutorials/     # Step-by-step guides and explanations
│   └── 📁 examples/      # Practical use cases and sample workflows
├── 📁 scripts/           # Python scripts for automation
└── 📁 results/           # Output match results
```
This workflow is useful for:
- Customer data deduplication
- Fraud detection through identity matching
- Merging datasets across different sources
- And more: imagination is the limit
Contributions are welcome! Feel free to submit issues, suggestions, or pull requests.
Happy Linking! 🚀