Both the hosted OpenSanctions API and the yente open source application provide a selection of different mechanisms that can be used to score result matches.
When you send a match query to the API, it is processed in two stages. First, a search index is used to locate possible candidate results; this stage is meant to optimise for recall, i.e. to find a broad selection of result candidates. In the second stage, these candidates are scored against the query provided by the API consumer. The following URL query parameters are used to configure that process:
The algorithm query parameter is used to select a scoring algorithm. The different algorithms are described below, but you can also retrieve metadata about them programmatically using the /algorithms endpoint. Use the value best to select the highest-quality algorithm available at any time. This will produce good results, but may mean that specific scores for matched entities change significantly over time.
threshold is defined as the numeric score limit above which a result should be considered a match. This parameter may need to be adapted in conjunction with the algorithm to avoid producing too many false positive matches.
cutoff describes the lower bound of the result score that should still be returned by the API. Lower this parameter to see more candidates that have been down-ranked by the scoring system.
limit gives the maximum number of matches returned. The OpenSanctions dataset is de-duplicated, so there can usually only be one matching record for each query. Returning a large number of results therefore does not make sense in the way it would in a full-text search.
fuzzy is a boolean flag that can be used to disable fuzzy matching in the candidate finding stage. Fuzzy candidate finding has proven to be largely ineffective compared to other techniques (e.g. the search for phonetic and normalised forms of the names). We recommend disabling it.
Recommended default: ?algorithm=best&fuzzy=false
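As a rough sketch of how these parameters fit together, the snippet below assembles a match URL and then mimics, on the client side, how threshold, cutoff and limit partition a list of scored results. The host name https://api.opensanctions.org, the helper names, and the id/score/match result fields are assumptions made for illustration. The filtering function only demonstrates the semantics described above; it is not how the API is implemented.

```python
from urllib.parse import urlencode

# Assumed base URL of the hosted API; point this at your own yente instance if needed.
API_BASE = "https://api.opensanctions.org"


def build_match_url(scope="default", algorithm="best", fuzzy=False,
                    threshold=None, cutoff=None, limit=None):
    """Assemble a /match URL from the query parameters described above."""
    params = {"algorithm": algorithm, "fuzzy": "true" if fuzzy else "false"}
    optional = {"threshold": threshold, "cutoff": cutoff, "limit": limit}
    # Only send parameters the caller set explicitly, so server defaults apply otherwise.
    params.update({key: value for key, value in optional.items() if value is not None})
    return f"{API_BASE}/match/{scope}?{urlencode(params)}"


def apply_score_limits(results, threshold, cutoff, limit):
    """Illustrate the parameter semantics on dicts with "id" and "score" keys (hypothetical shape)."""
    # cutoff: drop anything scored below the lower bound.
    kept = [r for r in results if r["score"] >= cutoff]
    # limit: keep at most `limit` of the best-scoring candidates.
    kept = sorted(kept, key=lambda r: r["score"], reverse=True)[:limit]
    # threshold: flag results scoring above it as matches.
    return [{**r, "match": r["score"] > threshold} for r in kept]


if __name__ == "__main__":
    print(build_match_url())
    scored = [{"id": "a", "score": 0.91}, {"id": "b", "score": 0.62}, {"id": "c", "score": 0.30}]
    print(apply_score_limits(scored, threshold=0.7, cutoff=0.5, limit=5))
```

Called without arguments, build_match_url produces the recommended default query string; lowering cutoff in the second helper lets more down-ranked candidates through without changing which ones are flagged as matches.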
The API supports several scoring mechanisms ("algorithms") that can be used to compute and rank the results of a match query. Below is a narrative overview of the supported algorithms; please also refer to the technical documentation.
logic-v1 (currently also: best) implements a large number of deterministic rules to generate a match result suitable for screening systems. The rules include phonetic and fuzzy name matching, rules regarding the use of IMO, ISIN, LEI, OGRN, INN and other entity identifiers, and rules that reduce the quality of matches in which supporting information (such as countries, DOB, gender, and address) diverges between the query and the matching candidate. This model is calibrated to be used with the default threshold parameter value (0.7).
name-based and name-qualified are name-only scoring systems that combine the Jaro-Winkler and Soundex name comparison techniques to aggressively match entities by name. The algorithms attempt to loosely emulate the OFAC Sanctions Search web tool. This can be useful for regulatory purposes, or when you only know the names of the entities you need to screen. name-qualified provides a marginal improvement over name-based by computing the same score and then penalizing matches where the birth date or nationality is different for people, or where different registration numbers/tax identifiers are used for companies.
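To make the two comparison techniques named above more concrete, here is a toy illustration using textbook versions of American Soundex and Jaro-Winkler. This is not yente's implementation: the function names are hypothetical, and how the real matchers normalise names or combine the two signals into one score is not shown here.

```python
# Letter-to-digit table for American Soundex.
SOUNDEX_CODES = {}
for _letters, _digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                         ("L", "4"), ("MN", "5"), ("R", "6")):
    for _letter in _letters:
        SOUNDEX_CODES[_letter] = _digit


def soundex(name):
    """Return the four-character Soundex code of a single name token."""
    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""
    code = letters[0]
    prev = SOUNDEX_CODES.get(letters[0], "")
    for ch in letters[1:]:
        digit = SOUNDEX_CODES.get(ch, "")
        if digit and digit != prev:
            code += digit
        if ch not in "HW":  # H and W do not separate letters with the same code
            prev = digit
    return (code + "000")[:4]


def jaro(s1, s2):
    """Jaro similarity in [0, 1]."""
    if s1 == s2:
        return 1.0
    if not s1 or not s2:
        return 0.0
    window = max(max(len(s1), len(s2)) // 2 - 1, 0)
    used1, used2 = [False] * len(s1), [False] * len(s2)
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len(s2), i + window + 1)):
            if not used2[j] and s2[j] == ch:
                used1[i] = used2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    k = out_of_order = 0
    for i in range(len(s1)):
        if used1[i]:
            while not used2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    t = out_of_order // 2
    return (matches / len(s1) + matches / len(s2) + (matches - t) / matches) / 3


def jaro_winkler(s1, s2, prefix_scale=0.1):
    """Jaro similarity boosted for a shared prefix of up to four characters."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return base + prefix * prefix_scale * (1.0 - base)


if __name__ == "__main__":
    query, candidate = "Barack Ohbama", "Barack Obama"
    print(jaro_winkler(query.lower(), candidate.lower()))
    print([soundex(t) for t in query.split()] == [soundex(t) for t in candidate.split()])
```

A misspelling such as "Ohbama" keeps the same Soundex code as "Obama" and a high Jaro-Winkler similarity, which is why name-only matchers of this kind match aggressively.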
regression-v1 and regression-v2 are scoring systems that apply logistic regression to a wide set of features. They provide good results in particular if you can include multiple attributes to describe the entities you are screening for: dates of birth, nationalities, addresses, tax identifiers. Both models will produce high match scores only for multi-attribute matches, e.g. when a query shares the name and birth date or identification number of an entry in the database.
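The multi-attribute behaviour follows from how logistic regression works: each feature adds to a weighted sum, and only a sufficiently large sum is mapped to a high score. The sketch below shows the idea with three feature names and coefficients that are invented for this example; they are not taken from the regression-v1 or regression-v2 models.

```python
import math

# Hypothetical coefficients, invented for this sketch; the real models use many more features.
COEFFICIENTS = {"name_match": 2.0, "birth_date_match": 3.0, "identifier_match": 4.0}
INTERCEPT = -4.0


def logistic_score(features):
    """Map feature values in [0, 1] to a score in (0, 1) using the logistic function."""
    z = INTERCEPT + sum(COEFFICIENTS[name] * value for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))


if __name__ == "__main__":
    # A name-only agreement stays well below 0.5 with these made-up coefficients...
    print(logistic_score({"name_match": 1.0}))
    # ...while agreement on name and birth date pushes the score above it.
    print(logistic_score({"name_match": 1.0, "birth_date_match": 1.0}))
```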
regression-v2 produces significantly lower score values than regression-v1. You may want to set the threshold parameter for matches to 0.5 when using it.
The logic-v1, name-based and name-qualified matchers support the fine tuning of feature weights for custom scoring. For example, an API client may want to give more weight to a phonetic matching algorithm, or fully disable one of the existing mechanisms. Feature weights range from 0.0 to 1.0 and can be applied to any of the documented features by including a weights section in the body of the /match API request:
{
  "weights": {
    "name_literal_match": 0.0,
    "name_soundex_match": 1.0
  },
  "queries": {
    "q1": {
      "schema": "Person",
      "properties": {
        "name": ["Barack Ohbama"]
      }
    }
  }
}
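A minimal sketch of how a client might assemble such a request body in Python and check the weights against the 0.0 to 1.0 range before sending. The helper names are hypothetical, and both the host name https://api.opensanctions.org and the "ApiKey" authorization scheme are assumptions that should be checked against your deployment. The request object is only constructed here, not sent.

```python
import json
from urllib.request import Request


def build_weighted_body(queries, weights):
    """Combine queries and custom feature weights into a /match request body."""
    for feature, weight in weights.items():
        if not 0.0 <= weight <= 1.0:
            raise ValueError(f"weight for {feature!r} must be between 0.0 and 1.0, got {weight}")
    return {"weights": dict(weights), "queries": queries}


def build_match_request(url, body, api_key=None):
    """Prepare (but do not send) a POST request carrying the JSON body."""
    headers = {"Content-Type": "application/json"}
    if api_key is not None:
        # Assumed header format for the hosted API.
        headers["Authorization"] = f"ApiKey {api_key}"
    return Request(url, data=json.dumps(body).encode("utf-8"), headers=headers, method="POST")


if __name__ == "__main__":
    body = build_weighted_body(
        queries={"q1": {"schema": "Person", "properties": {"name": ["Barack Ohbama"]}}},
        weights={"name_literal_match": 0.0, "name_soundex_match": 1.0},
    )
    request = build_match_request("https://api.opensanctions.org/match/default?algorithm=logic-v1", body)
    print(request.get_method(), request.full_url)
```

Passing the prepared request to urllib.request.urlopen would submit it; validating the weights locally simply surfaces out-of-range values before a network round trip.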
The logic-v1 matcher includes some features that are weighted at 0.0 by default. These are meant to be enabled using custom weights if desired by the API user. Features that have a 0.0 weight are not computed by default, which has a positive impact on system performance.
Matching entities from multiple databases is a complex problem. The matchers included in yente provide solutions to this problem, but they have several known limitations. These limitations are most visible in scenarios where the query data provided by the API consumer is extremely limited (e.g. name-only matching).
Some known limitations:
[…] Lymited vs. Limited).
[…] Alexander and Sasha). Such dictionaries are not currently included in the OpenSanctions matching system.
Several vendors of advanced entity matching technology have integrated OpenSanctions data into their solutions. We're happy to put you in touch with those vendors.
In order to set up a matching solution with low error rates (both false positives and false negatives), it may be helpful to reflect on what input data you can provide to allow precise decision-making. Consider the following questions:
[…] schema in your matching query to Person or Organization will increase precision.
[…] firstName, lastName properties)?
[…] birthDate)?
[…] registrationNumber) or tax identifiers (taxNumber)?
[…] country)?
/match/default will search sanctions lists, the PEP database and a broad set of other risk-adjacent entities. For a simple sanctions screening system, consider using /match/sanctions instead: this will only produce matches sourced from sanctions lists.
OpenSanctions is free for non-commercial users. Businesses must acquire a data license to use the dataset.
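The input-data questions above translate into how much detail each query carries. As a minimal sketch, the helpers below build Person and Organization queries that include only the properties you actually have, using the property names mentioned above (firstName, lastName, birthDate, registrationNumber, taxNumber, country). The helper names, the sample values, and the choice to send a combined name alongside the name parts are assumptions for illustration.

```python
import json


def _with_values(properties):
    """Drop properties without a value and wrap the rest in lists, as in the request example."""
    return {key: [value] for key, value in properties.items() if value}


def build_person_query(first_name, last_name, birth_date=None, country=None):
    """Person query; optional attributes are only included when known."""
    return {
        "schema": "Person",
        "properties": _with_values({
            "name": f"{first_name} {last_name}",
            "firstName": first_name,
            "lastName": last_name,
            "birthDate": birth_date,
            "country": country,
        }),
    }


def build_organization_query(name, registration_number=None, tax_number=None, country=None):
    """Organization query with optional identifiers."""
    return {
        "schema": "Organization",
        "properties": _with_values({
            "name": name,
            "registrationNumber": registration_number,
            "taxNumber": tax_number,
            "country": country,
        }),
    }


if __name__ == "__main__":
    # Fictional sample data; send the body to /match/sanctions for sanctions-only screening.
    body = {"queries": {
        "q1": build_person_query("Jane", "Doe", birth_date="1980-01-31", country="us"),
        "q2": build_organization_query("Example Trading Ltd", registration_number="12345678"),
    }}
    print(json.dumps(body, indent=2))
```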
The data is licensed under the terms of Creative Commons 4.0 Attribution NonCommercial