Configuring the scoring system

Both the hosted OpenSanctions API and the yente open source application provide a selection of different mechanisms that can be used to score result matches.

Introduction

When you send a match query to the API, it is processed in two stages. First, a search index is used to locate possible candidate results; this stage is meant to optimise for recall, i.e. to find a broad selection of result candidates. In the second stage, these candidates are scored against the query provided by the API consumer. The following URL query parameters are used to configure that process:

algorithm is the query parameter used to select a scoring algorithm. The different algorithms are described below, but you can also retrieve metadata about them programmatically using the /algorithms endpoint.
- Set this to best to use the highest-quality algorithm available at any time. This will produce good results, but may mean that specific scores for matched entities change significantly over time.
threshold is the numeric score limit above which a result should be considered a match. This parameter may need to be adapted in conjunction with the algorithm to avoid producing too many false positive matches.
cutoff is the lower bound of the result score that should still be returned by the API. Lower this parameter to see more candidates that have been down-ranked by the scoring system.
limit gives the maximum number of matches returned. The OpenSanctions dataset is de-duplicated, so there is usually at most one matching record for each query. Returning a large number of results is therefore less useful than it would be in a full-text search.

Recommended default: ?algorithm=best
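
As a minimal sketch, the parameters above could be assembled into a request URL as follows. The base URL and the function name are assumptions made for illustration (a self-hosted yente instance would use its own address); only the parameter names and the /match/default path come from this page.

```python
from urllib.parse import urlencode

# Assumed base URL for illustration; replace with your own yente address.
BASE_URL = "https://api.opensanctions.org"


def build_match_url(dataset="default", algorithm="best", threshold=None,
                    cutoff=None, limit=None):
    """Build a /match URL, including only the options that were set."""
    params = {"algorithm": algorithm}
    optional = (("threshold", threshold), ("cutoff", cutoff), ("limit", limit))
    for name, value in optional:
        if value is not None:
            params[name] = value
    return f"{BASE_URL}/match/{dataset}?{urlencode(params)}"


# Recommended default: only the algorithm is specified.
print(build_match_url())
# Pinning a specific algorithm together with a threshold and a result limit.
print(build_match_url(algorithm="logic-v1", threshold=0.7, limit=5))
```

Leaving threshold, cutoff and limit unset in the sketch means the server-side defaults apply.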

Supported scoring mechanisms

The API supports several scoring mechanisms ("algorithms") that can be used to compute and rank the results of a match query. Below is a narrative overview of the supported algorithms; please also refer to the technical documentation.

logic-v1 (currently also: best) implements a large number of deterministic rules to generate a match result suitable for screening systems. The rules include phonetic and fuzzy name matching, rules regarding the use of IMO, ISIN, LEI, OGRN, INN and other entity identifiers, and rules that reduce the quality of matches in which supporting information (such as countries, DOB, gender, and address) diverges between the query and the matching candidate. This model is calibrated to be used with the default threshold parameter value (0.7).
name-based and name-qualified are name-only scoring systems that combine the Jaro-Winkler and Soundex name comparison techniques to aggressively match entities by name. They attempt to loosely emulate the OFAC Sanctions Search web tool. This can be useful for regulatory purposes, or when you only know the names of the entities you need to screen. name-qualified provides a marginal improvement over name-based by computing the same score and then penalizing matches where the birth date or nationality is different for people, or where different registration numbers/tax identifiers are used for companies.
regression-v1 and regression-v2 are scoring systems that apply logistic regression to a wide set of features. They provide good results in particular if you can include multiple attributes to describe the entities you are screening for: dates of birth, nationalities, addresses, tax identifiers. Both models will produce high match scores only for multi-attribute matches, e.g. when a query shares the name and birth date or identification number of an entry in the database.
- Please note: regression-v2 produces significantly lower score values than regression-v1. You may want to set the threshold parameter for matches to 0.5 when using it.
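
Because the suitable threshold depends on the algorithm, a client may want to keep the two together. The following is a rough sketch: the values 0.7 (logic-v1) and 0.5 (regression-v2) come from the descriptions above, while the fallback for other algorithms, the function names, the shape of the result dictionaries and the inclusive comparison are assumptions made for illustration.

```python
# Thresholds mentioned in the text above.
SUGGESTED_THRESHOLDS = {"logic-v1": 0.7, "regression-v2": 0.5}
# Assumption: fall back to 0.7 for algorithms without a documented suggestion.
FALLBACK_THRESHOLD = 0.7


def threshold_for(algorithm):
    """Return the threshold to use with the given algorithm name."""
    return SUGGESTED_THRESHOLDS.get(algorithm, FALLBACK_THRESHOLD)


def split_results(results, algorithm):
    """Split scored candidates into matches and lower-scoring candidates.

    `results` is a list of dicts with hypothetical "id" and "score" keys.
    """
    threshold = threshold_for(algorithm)
    matches = [r for r in results if r["score"] >= threshold]
    others = [r for r in results if r["score"] < threshold]
    return matches, others


sample = [
    {"id": "a", "score": 0.82},
    {"id": "b", "score": 0.61},
    {"id": "c", "score": 0.30},
]
for algorithm in ("logic-v1", "regression-v2"):
    matches, others = split_results(sample, algorithm)
    print(algorithm, [r["id"] for r in matches], [r["id"] for r in others])
```

The same scores lead to different decisions depending on the threshold, which is why the threshold should be reviewed whenever the algorithm is changed.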

Fine-tuning the score weights

The logic-v1, name-based and name-qualified matchers support the fine-tuning of feature weights for custom scoring. For example, an API client may want to give more weight to a phonetic matching algorithm, or fully disable one of the existing mechanisms. Feature weights range from 0.0 to 1.0 and can be applied to any of the documented features by including a weights section in the body of the /match API request:

{"weights":{"name_literal_match":0.0,"name_soundex_match":1.0},"queries":{"q1":{"schema":"Person","properties":{"name":["Barack Ohbama"]}}}}
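
A request body like the one above could be built and checked on the client side before it is sent. This is only a sketch: the helper name is hypothetical, and the range check simply mirrors the 0.0 to 1.0 range stated above rather than any validation performed by the API itself.

```python
import json


def build_match_body(queries, weights=None):
    """Assemble a /match request body with optional custom feature weights."""
    weights = dict(weights or {})
    for feature, weight in weights.items():
        # Weights are documented as lying between 0.0 and 1.0.
        if not 0.0 <= weight <= 1.0:
            raise ValueError(f"weight for {feature!r} must be in [0.0, 1.0]")
    body = {"queries": queries}
    if weights:
        body["weights"] = weights
    return body


body = build_match_body(
    queries={
        "q1": {"schema": "Person", "properties": {"name": ["Barack Ohbama"]}}
    },
    # Disable literal name matching and rely fully on Soundex.
    weights={"name_literal_match": 0.0, "name_soundex_match": 1.0},
)
print(json.dumps(body, indent=2))
```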

The logic-v1 matcher includes some features that are weighted at 0.0 by default. These are meant to be enabled using custom weights if desired by the API user. Features that have a 0.0 weight are not computed by default, which has a positive impact on system performance.

Limitations of the matcher system

Matching entities from multiple databases is a complex problem. The matchers included in yente provide solutions to this problem that have several known limitations. These limitations are most visible in scenarios where the query data provided by the API consumer is extremely limited (e.g. name-only matching).

Some known limitations:

Name matching is less precise when used in conjunction with writing systems that are not Western-style alphabets. In particular, the fuzzy comparison between different writing systems will produce increased error rates. This affects writing systems including Arabic/Farsi, Burmese, the systems used in China, Japan, Korea and many Indian languages.
Phonetic matching (Soundex, Metaphone) does not support any non-alphabet writing systems.
The company name matching mechanism is particularly vulnerable to mis-spellings in the legal type parts of company names (e.g. Lymited vs. Limited).
Some name comparisons require dictionary alias approaches (e.g. matching Alexander and Sasha). Such dictionaries are not currently included in the OpenSanctions matching system.
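
One way a client could anticipate the script-related limitations is to flag query names that contain non-Latin letters before sending them, for example to route them to additional review. The sketch below is a coarse heuristic based on Unicode character names; the function name is hypothetical, and the check is not part of yente or the OpenSanctions API.

```python
import unicodedata


def is_latin_only(name):
    """Return True if every letter in `name` belongs to the Latin script."""
    for char in name:
        # Non-letters (spaces, digits, punctuation) are ignored.
        if char.isalpha() and "LATIN" not in unicodedata.name(char, ""):
            return False
    return True


for name in ["Barack Obama", "José Martí", "Владимир Путин", "习近平"]:
    print(name, is_latin_only(name))
```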

Several vendors of advanced entity matching technology have integrated OpenSanctions data into their solutions. We're happy to put you in touch with those vendors.

Selecting your input data

To set up a matching solution with low error rates (both false positives and false negatives), it may be helpful to reflect on what input data you can provide to allow precise decision-making. Consider the following questions:

Do you know if a record in your screening set refers to a person or an organization? Setting the schema in your matching query to Person or Organization will increase precision.
Can you provide multiple name aliases? For persons, are you able to include the first and last name separately (in the firstName, lastName properties)?
The following can be useful qualifiers to include in your query in order to reduce false positives from name-only matches:
- Can you provide a birth date or year of birth for individuals (birthDate)?
- For companies, do you know any registration numbers (registrationNumber) or tax identifiers (taxNumber)?
- Do you know the nationality of a person, or the country in which a company was registered (country)?
Finally, consider reducing the scope of your query. Using /match/default will search sanctions lists, the PEP database and a broad set of other risk-adjacent entities. For a simple sanctions screening system, consider using /match/sanctions instead: this will only produce matches sourced from sanctions lists.
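
The questions above can be turned into a small helper that builds a query from whatever data is available. This is a sketch under stated assumptions: the property names are the ones mentioned in this section, the function names, sample values and the lower-case country code format are illustrative, and wrapping each value in a list follows the request body shown earlier on this page.

```python
# Qualifier properties named in this section.
QUALIFIERS = {
    "firstName", "lastName", "birthDate",
    "registrationNumber", "taxNumber", "country",
}


def build_query(schema, names, **qualifiers):
    """Build one match query from a schema, name aliases and qualifiers."""
    unknown = set(qualifiers) - QUALIFIERS
    if unknown:
        raise ValueError(f"unsupported properties: {sorted(unknown)}")
    properties = {"name": list(names)}
    for prop, value in qualifiers.items():
        if value:  # skip qualifiers that are missing in the input record
            properties[prop] = [value]
    return {"schema": schema, "properties": properties}


def match_path(sanctions_only=False):
    """Choose the narrower sanctions scope or the broad default scope."""
    return "/match/sanctions" if sanctions_only else "/match/default"


query = build_query(
    "Person",
    ["Barack Obama", "Barack Hussein Obama"],
    firstName="Barack",
    lastName="Obama",
    birthDate="1961",
    country="us",  # assumed country code format
)
print(match_path(sanctions_only=True), query)
```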
