Information extraction from English and German texts based on predicate logic

msg-systems/holmes-extractor

Author: Richard Paul Hudson, Explosion AI

1. Introduction

1.1 The basic idea

Holmes is a Python 3 library (v3.6-v3.10) running on top of spaCy (v3.1-v3.3) that supports a number of use cases involving information extraction from English and German texts. In all use cases, the information extraction is based on analysing the semantic relationships expressed by the component parts of each sentence:

  • In the chatbot use case, the system is configured using one or more search phrases. Holmes then looks for structures whose meanings correspond to those of these search phrases within a searched document, which in this case corresponds to an individual snippet of text or speech entered by the end user. Within a match, each word with its own meaning (i.e. that does not merely fulfil a grammatical function) in the search phrase corresponds to one or more such words in the document. Both the fact that a search phrase was matched and any structured information the search phrase extracts can be used to drive the chatbot.

  • The structural extraction use case uses exactly the same structural matching technology as the chatbot use case, but searching takes place with respect to a pre-existing document or documents that are typically much longer than the snippets analysed in the chatbot use case, and the aim is to extract and store structured information. For example, a set of business articles could be searched to find all the places where one company is said to be planning to take over a second company. The identities of the companies concerned could then be stored in a database.

  • The topic matching use case aims to find passages in a document or documents whose meaning is close to that of another document, which takes on the role of the query document, or to that of a query phrase entered ad-hoc by the user. Holmes extracts a number of small phraselets from the query phrase or query document, matches the documents being searched against each phraselet, and conflates the results to find the most relevant passages within the documents. Because there is no strict requirement that every word with its own meaning in the query document match a specific word or words in the searched documents, more matches are found than in the structural extraction use case, but the matches do not contain structured information that can be used in subsequent processing. The topic matching use case is demonstrated by a website allowing searches within six Charles Dickens novels (for English) and around 350 traditional stories (for German).

  • The supervised document classification use case uses training data to learn a classifier that assigns one or more classification labels to new documents based on what they are about. It classifies a new document by matching it against phraselets that were extracted from the training documents in the same way that phraselets are extracted from the query document in the topic matching use case. The technique is inspired by bag-of-words-based classification algorithms that use n-grams, but aims to derive n-grams whose component words are related semantically rather than ones that just happen to be neighbours in the surface representation of a language.

In all four use cases, the individual words are matched using a number of strategies. To work out whether two grammatical structures that contain individually matching words correspond logically and constitute a match, Holmes transforms the syntactic parse information provided by the spaCy library into semantic structures that allow texts to be compared using predicate logic. As a user of Holmes, you do not need to understand the intricacies of how this works, although there are some important tips around writing effective search phrases for the chatbot and structural extraction use cases that you should try to take on board.

Holmes aims to offer generalist solutions that can be used more or less out of the box with relatively little tuning, tweaking or training and that are rapidly applicable to a wide range of use cases. At its core lies a logical, programmed, rule-based system that describes how syntactic representations in each language express semantic relationships. Although the supervised document classification use case does incorporate a neural network and although the spaCy library upon which Holmes builds has itself been pre-trained using machine learning, the essentially rule-based nature of Holmes means that the chatbot, structural extraction and topic matching use cases can be put to use out of the box without any training and that the supervised document classification use case typically requires relatively little training data, which is a great advantage because pre-labelled training data is not available for many real-world problems.

Holmes has a long and complex history and we are now able to publish it under the MIT license thanks to the goodwill and openness of several companies. I, Richard Hudson, wrote the versions up to 3.0.0 while working at msg systems, a large international software consultancy based near Munich. In late 2021, I changed employers and now work for Explosion, the creators of spaCy and Prodigy. Elements of the Holmes library are covered by a US patent that I myself wrote in the early 2000s while working at a startup called Definiens that has since been acquired by AstraZeneca. With the kind permission of both AstraZeneca and msg systems, I am now maintaining Holmes at Explosion and can offer it for the first time under a permissive license: anyone can now use Holmes under the terms of the MIT license without having to worry about the patent.

The library was originally developed at msg systems, but is now being maintained at Explosion AI. Please direct any new issues or discussions to the Explosion repository.

1.2 Installation

1.2.1 Prerequisites

If you do not already have Python 3 and pip on your machine, you will need to install them before installing Holmes.

1.2.2 Library installation

Install Holmes using the following commands:

Linux:

pip3 install holmes-extractor

Windows:

pip install holmes-extractor

To upgrade from a previous Holmes version, issue the following commands and then reissue the commands to download the spaCy and coreferee models to ensure you have the correct versions of them:

Linux:

pip3 install --upgrade holmes-extractor

Windows:

pip install --upgrade holmes-extractor

If you wish to use the examples and tests, clone the source code using

git clone https://github.com/explosion/holmes-extractor

If you wish to experiment with changing the source code, you can override the installed code by starting Python (type python3 (Linux) or python (Windows)) in the parent directory of the directory where your altered holmes_extractor module code is. If you have checked Holmes out of Git, this will be the holmes-extractor directory.

If you wish to uninstall Holmes again, this is achieved by deleting the installed file(s) directly from the file system. These can be found by issuing the following from the Python command prompt started from any directory other than the parent directory of holmes_extractor:

import holmes_extractor
print(holmes_extractor.__file__)

1.2.3 Installing the spaCy and coreferee models

The spaCy and coreferee libraries that Holmes builds upon require language-specific models that have to be downloaded separately before Holmes can be used:

Linux/English:

python3 -m spacy download en_core_web_trf
python3 -m spacy download en_core_web_lg
python3 -m coreferee install en

Linux/German:

pip3 install spacy-lookups-data # (from spaCy 3.3 onwards)
python3 -m spacy download de_core_news_lg
python3 -m coreferee install de

Windows/English:

python -m spacy download en_core_web_trf
python -m spacy download en_core_web_lg
python -m coreferee install en

Windows/German:

pip install spacy-lookups-data # (from spaCy 3.3 onwards)
python -m spacy download de_core_news_lg
python -m coreferee install de

and if you plan to run the regression tests:

Linux:

python3 -m spacy download en_core_web_sm

Windows:

python -m spacy download en_core_web_sm

You specify a spaCy model for Holmes to use when you instantiate the Manager facade class. en_core_web_trf and de_core_news_lg are the models that have been found to yield the best results for English and German respectively. Because en_core_web_trf does not have its own word vectors, but Holmes requires word vectors for embedding-based matching, the en_core_web_lg model is loaded as a vector source whenever en_core_web_trf is specified to the Manager class as the main model.

The en_core_web_trf model requires significantly more resources than the other models; in a situation where resources are scarce, it may be a sensible compromise to use en_core_web_lg as the main model instead.

1.2.4 Comments about deploying Holmes in an enterprise environment

The best way of integrating Holmes into a non-Python environment is to wrap it as a RESTful HTTP service and to deploy it as a microservice. See here for an example.

1.2.5 Resource requirements

Because Holmes performs complex, intelligent analysis, it is inevitable that it requires more hardware resources than more traditional search frameworks. The use cases that involve loading documents (structural extraction and topic matching) are most immediately applicable to large but not massive corpora (e.g. all the documents belonging to a certain organisation, all the patents on a certain topic, all the books by a certain author). For cost reasons, Holmes would not be an appropriate tool with which to analyse the content of the entire internet!

That said, Holmes is both vertically and horizontally scalable. With sufficient hardware, both these use cases can be applied to an essentially unlimited number of documents by running Holmes on multiple machines, processing a different set of documents on each one and conflating the results. Note that this strategy is already employed to distribute matching amongst multiple cores on a single machine: the Manager class starts a number of worker processes and distributes registered documents between them.

Holmes holds loaded documents in memory, which ties in with its intended use with large but not massive corpora. The performance of document loading, structural extraction and topic matching degrades heavily if the operating system has to swap memory pages to secondary storage, because Holmes can require memory from a variety of pages to be addressed when processing a single sentence. This means it is important to supply enough RAM on each machine to hold all loaded documents.

Please note the above comments about the relative resource requirements of the different models.

1.3 Getting started

The easiest use case with which to get a quick basic idea of how Holmes works is the chatbot use case.

Here one or more search phrases are defined to Holmes in advance, and the searched documents are short sentences or paragraphs typed in interactively by an end user. In a real-life setting, the extracted information would be used to determine the flow of interaction with the end user. For testing and demonstration purposes, there is a console that displays its matched findings interactively. It can be easily and quickly started from the Python command line (which is itself started from the operating system prompt by typing python3 (Linux) or python (Windows)) or from within a Jupyter notebook.

The following code snippet can be entered line for line into the Python command line, into a Jupyter notebook or into an IDE. It registers the fact that you are interested in sentences about big dogs chasing cats and starts a demonstration chatbot console:

English:

import holmes_extractor as holmes
holmes_manager = holmes.Manager(model='en_core_web_lg', number_of_workers=1)
holmes_manager.register_search_phrase('A big dog chases a cat')
holmes_manager.start_chatbot_mode_console()

German:

import holmes_extractor as holmes
holmes_manager = holmes.Manager(model='de_core_news_lg', number_of_workers=1)
holmes_manager.register_search_phrase('Ein großer Hund jagt eine Katze')
holmes_manager.start_chatbot_mode_console()

If you now enter a sentence that corresponds to the search phrase, the console will display a match:

English:

Ready for input
A big dog chased a cat
Matched search phrase with text 'A big dog chases a cat':
'big'->'big' (Matches BIG directly); 'A big dog'->'dog' (Matches DOG directly); 'chased'->'chase' (Matches CHASE directly); 'a cat'->'cat' (Matches CAT directly)

German:

Ready for input
Ein großer Hund jagte eine Katze
Matched search phrase 'Ein großer Hund jagt eine Katze':
'großer'->'groß' (Matches GROSS directly); 'Ein großer Hund'->'hund' (Matches HUND directly); 'jagte'->'jagen' (Matches JAGEN directly); 'eine Katze'->'katze' (Matches KATZE directly)

This could easily have been achieved with a simple matching algorithm, so type in a few more complex sentences to convince yourself that Holmes is really grasping them and that matches are still returned:

English:

The big dog would not stop chasing the cat
The big dog who was tired chased the cat
The cat was chased by the big dog
The cat always used to be chased by the big dog
The big dog was going to chase the cat
The big dog decided to chase the cat
The cat was afraid of being chased by the big dog
I saw a cat-chasing big dog
The cat the big dog chased was scared
The big dog chasing the cat was a problem
There was a big dog that was chasing a cat
The cat chase by the big dog
There was a big dog and it was chasing a cat.
I saw a big dog. My cat was afraid of being chased by the dog.
There was a big dog. His name was Fido. He was chasing my cat.
A dog appeared. It was chasing a cat. It was very big.
The cat sneaked back into our lounge because a big dog had been chasing her.
Our big dog was excited because he had been chasing a cat.

German:

Der große Hund hat die Katze ständig gejagt
Der große Hund, der müde war, jagte die Katze
Die Katze wurde vom großen Hund gejagt
Die Katze wurde immer wieder durch den großen Hund gejagt
Der große Hund wollte die Katze jagen
Der große Hund entschied sich, die Katze zu jagen
Die Katze, die der große Hund gejagt hatte, hatte Angst
Dass der große Hund die Katze jagte, war ein Problem
Es gab einen großen Hund, der eine Katze jagte
Die Katzenjagd durch den großen Hund
Es gab einmal einen großen Hund, und er jagte eine Katze
Es gab einen großen Hund. Er hieß Fido. Er jagte meine Katze
Es erschien ein Hund. Er jagte eine Katze. Er war sehr groß.
Die Katze schlich sich in unser Wohnzimmer zurück, weil ein großer Hund sie draußen gejagt hatte
Unser großer Hund war aufgeregt, weil er eine Katze gejagt hatte

The demonstration is not complete without trying other sentences that contain the same words but do not express the same idea and observing that they are not matched:

English:

The dog chased a big cat
The big dog and the cat chased about
The big dog chased a mouse but the cat was tired
The big dog always used to be chased by the cat
The big dog the cat chased was scared
Our big dog was upset because he had been chased by a cat.
The dog chase of the big cat

German:

Der Hund jagte eine große Katze
Die Katze jagte den großen Hund
Der große Hund und die Katze jagten
Der große Hund jagte eine Maus aber die Katze war müde
Der große Hund wurde ständig von der Katze gejagt
Der große Hund entschloss sich, von der Katze gejagt zu werden
Die Hundejagd durch die große Katze

In the above examples, Holmes has matched a variety of different sentence-level structures that share the same meaning, but the base forms of the three words in the matched documents have always been the same as the three words in the search phrase. Holmes provides several further strategies for matching at the individual word level. In combination with Holmes's ability to match different sentence structures, these can enable a search phrase to be matched to a document sentence that shares its meaning even where the two share no words and are grammatically completely different.

One of these additional word-matching strategies is named-entity matching: special words can be included in search phrases that match whole classes of names like people or places. Exit the console by typing exit, then register a second search phrase and restart the console:

English:

holmes_manager.register_search_phrase('An ENTITYPERSON goes into town')holmes_manager.start_chatbot_mode_console()

German:

holmes_manager.register_search_phrase('Ein ENTITYPER geht in die Stadt')holmes_manager.start_chatbot_mode_console()

You have now registered your interest in people going into town and can enter appropriate sentences into the console:

English:

Ready for input
I met Richard Hudson and John Doe last week. They didn't want to go into town.
Matched search phrase with text 'An ENTITYPERSON goes into town'; negated; uncertain; involves coreference:
'Richard Hudson'->'ENTITYPERSON' (Has an entity label matching ENTITYPERSON); 'go'->'go' (Matches GO directly); 'into'->'into' (Matches INTO directly); 'town'->'town' (Matches TOWN directly)
Matched search phrase with text 'An ENTITYPERSON goes into town'; negated; uncertain; involves coreference:
'John Doe'->'ENTITYPERSON' (Has an entity label matching ENTITYPERSON); 'go'->'go' (Matches GO directly); 'into'->'into' (Matches INTO directly); 'town'->'town' (Matches TOWN directly)

German:

Ready for input
Letzte Woche sah ich Richard Hudson und Max Mustermann. Sie wollten nicht mehr in die Stadt gehen.
Matched search phrase with text 'Ein ENTITYPER geht in die Stadt'; negated; uncertain; involves coreference:
'Richard Hudson'->'ENTITYPER' (Has an entity label matching ENTITYPER); 'gehen'->'gehen' (Matches GEHEN directly); 'in'->'in' (Matches IN directly); 'die Stadt'->'stadt' (Matches STADT directly)
Matched search phrase with text 'Ein ENTITYPER geht in die Stadt'; negated; uncertain; involves coreference:
'Max Mustermann'->'ENTITYPER' (Has an entity label matching ENTITYPER); 'gehen'->'gehen' (Matches GEHEN directly); 'in'->'in' (Matches IN directly); 'die Stadt'->'stadt' (Matches STADT directly)

In each of the two languages, this last example demonstrates several further features of Holmes:

  • It can match not only individual words, but also multiword phrases like Richard Hudson.
  • When two or more words or phrases are linked by conjunction (and or or), Holmes extracts a separate match for each.
  • When a sentence is negated (not), Holmes marks the match accordingly.
  • Like several of the matches yielded by the more complex entry sentences in the above example about big dogs and cats, Holmes marks the two matches as uncertain. This means that the search phrase was not matched exactly, but rather in the context of some other, more complex relationship ('wanting to go into town' is not the same thing as 'going into town').

For more examples, please see section 5.

2. Word-level matching strategies

The following strategies are implemented with one Python module per strategy. Although the standard library does not support adding bespoke strategies via the Manager class, it would be relatively easy for anyone with Python programming skills to change the code to enable this.

2.1 Direct matching (word_match.type=='direct')

Direct matching between search phrase words and document words is always active. The strategy relies mainly on matching stem forms of words, e.g. matching English buy and child to bought and children, German steigen and Kind to stieg and Kinder. However, in order to increase the chance of direct matching working when the parser delivers an incorrect stem form for a word, the raw-text forms of both search-phrase and document words are also taken into consideration during direct matching.
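As a rough sketch of the idea (not the Holmes implementation itself), direct matching can be thought of as a lemma comparison with a raw-text fallback; the LEMMAS lookup below is a hypothetical stand-in for the parser's lemmatizer output:

```python
# Toy sketch of direct matching: compare stem (lemma) forms first and fall
# back to the raw text, so that an incorrect lemma from the parser does not
# prevent a match. LEMMAS is a hypothetical stand-in for a real lemmatizer.
LEMMAS = {"bought": "buy", "children": "child", "stieg": "steigen", "kinder": "kind"}

def lemma(word: str) -> str:
    return LEMMAS.get(word.lower(), word.lower())

def direct_match(search_phrase_word: str, document_word: str) -> bool:
    # Stem forms match, or the raw-text forms match as a fallback.
    return (lemma(search_phrase_word) == lemma(document_word)
            or search_phrase_word.lower() == document_word.lower())

print(direct_match("buy", "bought"))     # True: both lemmatize to 'buy'
print(direct_match("Kind", "Kinder"))    # True: both lemmatize to 'kind'
print(direct_match("buy", "child"))      # False
```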

2.2 Derivation-based matching (word_match.type=='derivation')

Derivation-based matching involves distinct but related words that typically belong to different word classes, e.g. English assess and assessment, German jagen and Jagd. It is active by default but can be switched off using the analyze_derivational_morphology parameter, which is set when instantiating the Manager class.

2.3 Named-entity matching (word_match.type=='entity')

Named-entity matching is activated by inserting a special named-entity identifier at the desired point in a search phrase in place of a noun, e.g.

An ENTITYPERSON goes into town (English)
Ein ENTITYPER geht in die Stadt (German).

The supported named-entity identifiers depend directly on the named-entity information supplied by the spaCy models for each language (descriptions copied from an earlier version of the spaCy documentation):

English:

ENTITYNOUN: Any noun phrase.
ENTITYPERSON: People, including fictional.
ENTITYNORP: Nationalities or religious or political groups.
ENTITYFAC: Buildings, airports, highways, bridges, etc.
ENTITYORG: Companies, agencies, institutions, etc.
ENTITYGPE: Countries, cities, states.
ENTITYLOC: Non-GPE locations, mountain ranges, bodies of water.
ENTITYPRODUCT: Objects, vehicles, foods, etc. (Not services.)
ENTITYEVENT: Named hurricanes, battles, wars, sports events, etc.
ENTITYWORK_OF_ART: Titles of books, songs, etc.
ENTITYLAW: Named documents made into laws.
ENTITYLANGUAGE: Any named language.
ENTITYDATE: Absolute or relative dates or periods.
ENTITYTIME: Times smaller than a day.
ENTITYPERCENT: Percentage, including "%".
ENTITYMONEY: Monetary values, including unit.
ENTITYQUANTITY: Measurements, as of weight or distance.
ENTITYORDINAL: "first", "second", etc.
ENTITYCARDINAL: Numerals that do not fall under another type.

German:

ENTITYNOUN: Any noun phrase.
ENTITYPER: Named person or family.
ENTITYLOC: Name of politically or geographically defined location (cities, provinces, countries, international regions, bodies of water, mountains).
ENTITYORG: Named corporate, governmental, or other organizational entity.
ENTITYMISC: Miscellaneous entities, e.g. events, nationalities, products or works of art.

We have added ENTITYNOUN to the genuine named-entity identifiers. As it matches any noun phrase, it behaves in a similar fashion to generic pronouns. The differences are that ENTITYNOUN has to match a specific noun phrase within a document and that this specific noun phrase is extracted and available for further processing. ENTITYNOUN is not supported within the topic matching use case.

2.4 Ontology-based matching (word_match.type=='ontology')

An ontology enables the user to define relationships between words that are then taken into account when matching documents to search phrases. The three relevant relationship types are hyponyms (something is a subtype of something), synonyms (something means the same as something) and named individuals (something is a specific instance of something). The three relationship types are exemplified in Figure 1:

Figure 1

Ontologies are defined to Holmes using the OWL ontology standard serialized using RDF/XML. Such ontologies can be generated with a variety of tools. For the Holmes examples and tests, the free tool Protege was used. It is recommended that you use Protege both to define your own ontologies and to browse the ontologies that ship with the examples and tests. When saving an ontology under Protege, please select RDF/XML as the format. Protege assigns standard labels for the hyponym, synonym and named-individual relationships that Holmes understands as defaults but that can also be overridden.

Ontology entries are defined using an Internationalized Resource Identifier (IRI), e.g. http://www.semanticweb.org/hudsonr/ontologies/2019/0/animals#dog. Holmes only uses the final fragment for matching, which allows homonyms (words with the same form but multiple meanings) to be defined at multiple points in the ontology tree.
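Concretely, the fragment used for matching is simply the text after the final hash in the IRI:

```python
# Only the final fragment of an ontology entry's IRI takes part in matching,
# which is why the same surface form can appear at several points in the tree.
iri = "http://www.semanticweb.org/hudsonr/ontologies/2019/0/animals#dog"
fragment = iri.rsplit("#", 1)[-1]
print(fragment)  # dog
```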

Ontology-based matching gives the best results with Holmes when small ontologies are used that have been built for specific subject domains and use cases. For example, if you are implementing a chatbot for a building insurance use case, you should create a small ontology capturing the terms and relationships within that specific domain. On the other hand, it is not recommended to use large ontologies built for all domains within an entire language such as WordNet. This is because the many homonyms and relationships that only apply in narrow subject domains will tend to lead to a large number of incorrect matches. For general use cases, embedding-based matching will tend to yield better results.

Each word in an ontology can be regarded as heading a subtree consisting of its hyponyms, synonyms and named individuals, those words' hyponyms, synonyms and named individuals, and so on. With an ontology set up in the standard fashion that is appropriate for the chatbot and structural extraction use cases, a word in a Holmes search phrase matches a word in a document if the document word is within the subtree of the search phrase word. Were the ontology in Figure 1 defined to Holmes, in addition to the direct matching strategy, which would match each word to itself, the following combinations would match:

  • animal in a search phrase would match hound, dog, cat, pussy, puppy, Fido, kitten and Mimi Momo in documents;
  • hound in a search phrase would match dog, puppy and Fido in documents;
  • dog in a search phrase would match hound, puppy and Fido in documents;
  • cat in a search phrase would match pussy, kitten and Mimi Momo in documents;
  • pussy in a search phrase would match cat, kitten and Mimi Momo in documents.
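The subtree behaviour can be sketched as a simple graph traversal. The relations below are reconstructed from the Figure 1 example and are illustrative only, not taken from the Holmes source:

```python
from collections import deque

# Relations reconstructed from the Figure 1 example (illustrative data).
# Each entry points "downwards" to hyponyms and named individuals, or
# sideways to synonyms.
ONTOLOGY = {
    "animal": [("hyponym", "dog"), ("hyponym", "cat")],
    "dog": [("synonym", "hound"), ("hyponym", "puppy"), ("individual", "Fido")],
    "hound": [("synonym", "dog")],
    "cat": [("synonym", "pussy"), ("hyponym", "kitten"), ("individual", "Mimi Momo")],
    "pussy": [("synonym", "cat")],
}

def subtree(word):
    """All words reachable via hyponym, synonym and named-individual links."""
    seen = {word}
    queue = deque([word])
    while queue:
        for _relation, child in ONTOLOGY.get(queue.popleft(), []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    seen.discard(word)
    return seen

def matches(search_phrase_word, document_word):
    """A document word matches if it lies in the search phrase word's subtree."""
    return document_word in subtree(search_phrase_word)

print(sorted(subtree("hound")))  # ['Fido', 'dog', 'puppy']
```

Running subtree on each word of the toy ontology reproduces the match lists above.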

English phrasal verbs like eat up and German separable verbs like aufessen must be defined as single items within ontologies. When Holmes is analysing a text and comes across such a verb, the main verb and the particle are conflated into a single logical word that can then be matched via an ontology. This means that eat up within a text would match the subtree of eat up within the ontology but not the subtree of eat within the ontology.

If derivation-based matching is active, it is taken into account on both sides of a potential ontology-based match. For example, if alter and amend are defined as synonyms in an ontology, alteration and amendment would also match each other.

In situations where finding relevant sentences is more important than ensuring the logical correspondence of document matches to search phrases, it may make sense to specify symmetric matching when defining the ontology. Symmetric matching is recommended for the topic matching use case, but is unlikely to be appropriate for the chatbot or structural extraction use cases. It means that the hypernym (reverse hyponym) relationship is taken into account as well as the hyponym and synonym relationships when matching, thus leading to a more symmetric relationship between documents and search phrases. An important rule applied when matching via a symmetric ontology is that a match path may not contain both hypernym and hyponym relationships, i.e. you cannot go back on yourself. Were the ontology above defined as symmetric, the following combinations would match:

  • animal in a search phrase would match hound, dog, cat, pussy, puppy, Fido, kitten and Mimi Momo in documents;
  • hound in a search phrase would match animal, dog, puppy and Fido in documents;
  • dog in a search phrase would match animal, hound, puppy and Fido in documents;
  • puppy in a search phrase would match animal, dog and hound in documents;
  • Fido in a search phrase would match animal, dog and hound in documents;
  • cat in a search phrase would match animal, pussy, kitten and Mimi Momo in documents;
  • pussy in a search phrase would match animal, cat, kitten and Mimi Momo in documents;
  • kitten in a search phrase would match animal, cat and pussy in documents;
  • Mimi Momo in a search phrase would match animal, cat and pussy in documents.
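The "may not go back on yourself" rule can be sketched by tracking, for each traversal path, whether a hypernym or a hyponym/individual step has already been taken (relations again reconstructed from the Figure 1 example; not the Holmes implementation):

```python
from collections import deque

# Downward relations reconstructed from the Figure 1 example (illustrative).
DOWN = {
    "animal": [("hyponym", "dog"), ("hyponym", "cat")],
    "dog": [("synonym", "hound"), ("hyponym", "puppy"), ("individual", "Fido")],
    "hound": [("synonym", "dog")],
    "cat": [("synonym", "pussy"), ("hyponym", "kitten"), ("individual", "Mimi Momo")],
    "pussy": [("synonym", "cat")],
}

# Symmetric closure: add a reverse 'hypernym' edge for every downward edge.
EDGES = {word: list(relations) for word, relations in DOWN.items()}
for parent, relations in DOWN.items():
    for relation, child in relations:
        if relation in ("hyponym", "individual"):
            EDGES.setdefault(child, []).append(("hypernym", parent))

def symmetric_subtree(word):
    """Words reachable when hypernym links are also followed, under the rule
    that a single match path may not mix hypernym and hyponym steps."""
    results = set()
    start = (word, False, False)  # (word, used hypernym, used hyponym/individual)
    seen = {start}
    queue = deque([start])
    while queue:
        current, up, down = queue.popleft()
        for relation, neighbour in EDGES.get(current, []):
            new_up = up or relation == "hypernym"
            new_down = down or relation in ("hyponym", "individual")
            if new_up and new_down:
                continue  # the path would go back on itself
            state = (neighbour, new_up, new_down)
            if state not in seen:
                seen.add(state)
                results.add(neighbour)
                queue.append(state)
    results.discard(word)
    return results

print(sorted(symmetric_subtree("puppy")))  # ['animal', 'dog', 'hound']
```

Note that puppy reaches animal (upwards only) but not cat, because reaching cat would require a hypernym step followed by a hyponym step.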

In the supervised document classification use case, two separate ontologies can be used:

  • The structural matching ontology is used to analyse the content of both training and test documents. Each word from a document that is found in the ontology is replaced by its most general hypernym ancestor. It is important to realise that an ontology is only likely to work with structural matching for supervised document classification if it was built specifically for the purpose: such an ontology should consist of a number of separate trees representing the main classes of object in the documents to be classified. In the example ontology shown above, all words in the ontology would be replaced with animal; in an extreme case with a WordNet-style ontology, all nouns would end up being replaced with thing, which is clearly not a desirable outcome!

  • The classification ontology is used to capture relationships between classification labels: that a document has a certain classification implies it also has any classifications to whose subtree that classification belongs. Synonyms should be used sparingly if at all in classification ontologies because they add to the complexity of the neural network without adding any value; and although it is technically possible to set up a classification ontology to use symmetric matching, there is no sensible reason for doing so. Note that a label within the classification ontology that is not directly defined as the label of any training document has to be registered specifically using the SupervisedTopicTrainingBasis.register_additional_classification_label() method if it is to be taken into account when training the classifier.
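The hypernym-ancestor replacement performed with the structural matching ontology can be sketched as follows (toy hypernym links echoing the animal example; not Holmes code):

```python
# Toy sketch: each document word found in the structural matching ontology is
# replaced by its most general hypernym ancestor. The hypernym links below
# are hypothetical, echoing the animal example.
HYPERNYM = {
    "puppy": "dog", "Fido": "dog",
    "kitten": "cat", "Mimi Momo": "cat",
    "dog": "animal", "cat": "animal",
}

def most_general_ancestor(word):
    # Walk upwards until the word has no further hypernym.
    while word in HYPERNYM:
        word = HYPERNYM[word]
    return word

print([most_general_ancestor(w) for w in ["puppy", "kitten", "table"]])
# ['animal', 'animal', 'table'] -- words outside the ontology stay unchanged
```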

2.5 Embedding-based matching (word_match.type=='embedding')

spaCy offers word embeddings: machine-learning-generated numerical vector representations of words that capture the contexts in which each word tends to occur. Two words with similar meaning tend to emerge with word embeddings that are close to each other, and spaCy can measure the cosine similarity between any two words' embeddings expressed as a decimal between 0.0 (no similarity) and 1.0 (the same word). Because dog and cat tend to appear in similar contexts, they have a similarity of 0.80; dog and horse have less in common and have a similarity of 0.62; and dog and iron have a similarity of only 0.25. Embedding-based matching is only activated for nouns, adjectives and adverbs because the results have been found to be unsatisfactory with other word classes.
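Cosine similarity itself is straightforward to compute; the three-dimensional vectors below are invented for illustration (real spaCy vectors have hundreds of dimensions):

```python
import math

def cosine_similarity(v1, v2):
    # Dot product of the vectors divided by the product of their magnitudes.
    dot = sum(a * b for a, b in zip(v1, v2))
    return dot / (math.sqrt(sum(a * a for a in v1)) * math.sqrt(sum(b * b for b in v2)))

# Toy vectors: 'dog' and 'cat' point in similar directions, 'iron' does not.
dog, cat, iron = [1.0, 0.9, 0.1], [0.9, 1.0, 0.2], [0.1, 0.2, 1.0]
print(round(cosine_similarity(dog, dog), 2))  # 1.0: identical vectors
print(cosine_similarity(dog, cat) > cosine_similarity(dog, iron))  # True
```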

It is important to understand that the fact that two words have similar embeddings does not imply the same sort of logical relationship between the two as when ontology-based matching is used: for example, the fact that dog and cat have similar embeddings means neither that a dog is a type of cat nor that a cat is a type of dog. Whether or not embedding-based matching is nonetheless an appropriate choice depends on the functional use case.

For the chatbot, structural extraction and supervised document classification use cases, Holmes makes use of word-embedding-based similarities using an overall_similarity_threshold parameter defined globally on the Manager class. A match is detected between a search phrase and a structure within a document whenever the geometric mean of the similarities between the individual corresponding word pairs is greater than this threshold. The intuition behind this technique is that where a search phrase with e.g. six lexical words has matched a document structure where five of these words match exactly and only one corresponds via an embedding, the similarity that should be required to match this sixth word is less than when only three of the words matched exactly and two of the other words also correspond via embeddings.
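The intuition can be sketched numerically. overall_similarity_threshold is the real Manager parameter described above, but the threshold value and word-level similarities below are invented for illustration:

```python
import math

def geometric_mean(similarities):
    return math.prod(similarities) ** (1 / len(similarities))

def structure_matches(similarities, overall_similarity_threshold=0.85):
    # A structure matches when the geometric mean of the word-level
    # similarities exceeds the global threshold (0.85 is an invented value).
    return geometric_mean(similarities) > overall_similarity_threshold

# Five of six words match exactly (similarity 1.0), one via an embedding:
print(round(geometric_mean([1.0] * 5 + [0.6]), 3))     # 0.918
# Three exact matches and three embedding matches score much lower overall:
print(round(geometric_mean([1.0] * 3 + [0.6] * 3), 3))  # 0.775
```

Exact matches contribute 1.0 each, so a single moderately similar embedding match barely drags the geometric mean down, whereas several of them do.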

Matching a search phrase to a document begins by finding words in the document that match the word at the root (syntactic head) of the search phrase. Holmes then investigates the structure around each of these matched document words to check whether the document structure matches the search phrase structure in its entirety. The document words that match the search phrase root word are normally found using an index. However, if embeddings have to be taken into account when finding document words that match a search phrase root word, every word in every document with a valid word class has to be compared for similarity to that search phrase root word. This incurs a very noticeable performance hit that renders all use cases except the chatbot use case essentially unusable.

To avoid the typically unnecessary performance hit that results from embedding-based matching of search phrase root words, it is controlled separately from embedding-based matching in general using the embedding_based_matching_on_root_words parameter, which is set when instantiating the Manager class. You are advised to keep this setting switched off (value False) for most use cases.

Neither the overall_similarity_threshold nor the embedding_based_matching_on_root_words parameter has any effect on the topic matching use case. Here word-level embedding similarity thresholds are set using the word_embedding_match_threshold and initial_question_word_embedding_match_threshold parameters when calling the topic_match_documents_against function on the Manager class.
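At the word level, these thresholds are compared against cosine similarities between embedding vectors. A minimal illustration of such a comparison, using toy vectors rather than real spaCy embeddings:

```python
from math import sqrt

def cosine_similarity(v1, v2):
    """Cosine of the angle between two vectors: 1.0 = identical direction."""
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = sqrt(sum(a * a for a in v1))
    norm2 = sqrt(sum(b * b for b in v2))
    return dot / (norm1 * norm2)

# Toy 3-dimensional "embeddings"; real spaCy vectors have hundreds of dimensions.
dog = [0.8, 0.1, 0.3]
hound = [0.75, 0.2, 0.35]
teapot = [0.1, 0.9, 0.05]

word_embedding_match_threshold = 0.8
print(cosine_similarity(dog, hound) > word_embedding_match_threshold)   # True
print(cosine_similarity(dog, teapot) > word_embedding_match_threshold)  # False
```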

2.6 Named-entity-embedding-based matching (word_match.type=='entity_embedding')

A named-entity-embedding-based match obtains between a searched-document word that has a certain entity label and a search phrase or query document word whose embedding is sufficiently similar to the underlying meaning of that entity label, e.g. the word *individual* in a search phrase has a similar word embedding to the underlying meaning of the PERSON entity label. Note that named-entity-embedding-based matching is never active on root words regardless of the embedding_based_matching_on_root_words setting.

2.7 Initial-question-word matching (word_match.type=='question')

Initial-question-word matching is only active during topic matching. Initial question words in query phrases match entities in the searched documents that represent potential answers to the question, e.g. when comparing the query phrase *When did Peter have breakfast* to the searched-document phrase *Peter had breakfast at 8 a.m.*, the question word *When* would match the temporal adverbial phrase *at 8 a.m.*.

Initial-question-word matching is switched on and off using the initial_question_word_behaviour parameter when calling the topic_match_documents_against function on the Manager class. It is only likely to be useful when topic matching is being performed in an interactive setting where the user enters short query phrases, as opposed to when it is being used to find documents on a similar topic to a pre-existing query document: initial question words are only processed at the beginning of the first sentence of the query phrase or query document.

Linguistically speaking, if a query phrase consists of a complex question with several elements dependent on the main verb, a finding in a searched document is only an 'answer' if it contains matches to all these elements. Because recall is typically more important than precision when performing topic matching with interactive query phrases, however, Holmes will match an initial question word to a searched-document phrase wherever they correspond semantically (e.g. wherever *when* corresponds to a temporal adverbial phrase) and both depend on verbs that themselves match at the word level. One possible strategy to filter out 'incomplete answers' would be to calculate the maximum possible score for a query phrase and reject topic matches that score below a threshold scaled to this maximum.
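The filtering strategy just suggested could be sketched as a simple post-processing step. The dictionary shape below is an illustrative assumption, not Holmes's actual result format:

```python
def filter_incomplete_answers(topic_matches, maximum_possible_score, quotient=0.7):
    """Keep only topic matches scoring at least *quotient* of the maximum
    score the query phrase could theoretically achieve."""
    cutoff = maximum_possible_score * quotient
    return [tm for tm in topic_matches if tm["score"] >= cutoff]

results = [
    {"document": "doc1", "score": 950.0},  # answers the whole question
    {"document": "doc2", "score": 400.0},  # matches the verb but not the question word
]
surviving = filter_incomplete_answers(results, maximum_possible_score=1000.0)
print(surviving)  # only the doc1 result survives
```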

3. Coreference resolution

Before Holmes analyses a searched document or query document, coreference resolution is performed using the Coreferee library running on top of spaCy. This means that situations are recognised where pronouns and nouns that are located near one another within a text refer to the same entities. The information from one mention can then be applied to the analysis of further mentions:

I saw a *big dog*. *It* was chasing a cat.
I saw a *big dog*. *The dog* was chasing a cat.

Coreferee also detects situations where a noun refers back to a named entity:

We discussed *AstraZeneca*. *The company* had given us permission to publish this library under the MIT license.

If this example were to match the search phrase *A company gives permission to publish something*, the coreference information that the company under discussion is AstraZeneca is clearly relevant and worth extracting in addition to the word(s) directly matched to the search phrase. Such information is captured in the word_match.extracted_word field.
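As an illustration of why extracted_word matters, consider post-processing matches so that the stored value is the coreferred entity name rather than the literal matched word. The structures below are simplified stand-ins for Holmes's word-match records, not real library output:

```python
# Simplified stand-in for a Holmes match: the document word "company"
# corefers with "AstraZeneca", so extracted_word differs from the matched word.
word_matches = [
    {"search_phrase_word": "company", "document_word": "company",
     "extracted_word": "AstraZeneca"},
    {"search_phrase_word": "give", "document_word": "given",
     "extracted_word": "given"},
]

# Prefer extracted_word when storing structured information.
extracted = {wm["search_phrase_word"]: wm["extracted_word"] for wm in word_matches}
print(extracted["company"])  # AstraZeneca
```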

4. Writing effective search phrases

4.1 General comments

The concept of search phrases has already been introduced and is relevant to the chatbot use case, the structural extraction use case and to preselection within the supervised document classification use case.

It is crucial to understand that the tips and limitations set out in Section 4 do not apply in any way to query phrases in topic matching. If you are using Holmes for topic matching only, you can completely ignore this section!

Structural matching between search phrases and documents is not symmetric: there are many situations in which sentence X as a search phrase would match sentence Y within a document but where the converse would not be true. Although Holmes does its best to understand any search phrases, the results are better when the user writing them follows certain patterns and tendencies, and getting to grips with these patterns and tendencies is the key to using the relevant features of Holmes successfully.

4.1.1 Lexical versus grammatical words

Holmes distinguishes between: lexical words like *dog*, *chase* and *cat* (English) or *Hund*, *jagen* and *Katze* (German) in the initial example above; and grammatical words like *a* (English) or *ein* and *eine* (German) in the initial example above. Only lexical words match words in documents, but grammatical words still play a crucial role within a search phrase: they enable Holmes to understand it.

Dog chase cat (English)
Hund jagen Katze (German)

contain the same lexical words as the search phrases in the initial example above, but as they are not grammatical sentences Holmes is liable to misunderstand them if they are used as search phrases. This is a major difference between Holmes search phrases and the search phrases you use instinctively with standard search engines like Google, and it can take some getting used to.

4.1.2 Use of the present active

A search phrase need not contain a verb:

ENTITYPERSON (English)
A big dog (English)
Interest in fishing (English)
ENTITYPER (German)
Ein großer Hund (German)
Interesse am Angeln (German)

are all perfectly valid and potentially useful search phrases.

Where a verb is present, however, Holmes delivers the best results when the verb is in the present active, as *chases* and *jagt* are in the initial example above. This gives Holmes the best chance of understanding the relationship correctly and of matching the widest range of document structures that share the target meaning.

4.1.3 Generic pronouns

Sometimes you may only wish to extract the object of a verb. For example, you might want to find sentences that are discussing a cat being chased regardless of who is doing the chasing. In order to avoid a search phrase containing a passive expression like

A cat is chased (English)
Eine Katze wird gejagt (German)

you can use a generic pronoun. This is a word that Holmes treats like a grammatical word in that it is not matched to documents; its sole purpose is to help the user form a grammatically optimal search phrase in the present active. Recognised generic pronouns are English *something*, *somebody* and *someone* and German *jemand* (and inflected forms of *jemand*) and *etwas*: Holmes treats them all as equivalent. Using generic pronouns, the passive search phrases above could be re-expressed as

Somebody chases a cat (English)
Jemand jagt eine Katze (German).

4.1.4 Prepositions

Experience shows that different prepositions are often used with the same meaning in equivalent phrases and that this can prevent search phrases from matching where one would intuitively expect it. For example, the search phrases

Somebody is at the market (English)
Jemand ist auf dem Marktplatz (German)

would fail to match the document phrases

Richard was in the market (English)
Richard war am Marktplatz (German)

The best way of solving this problem is to define the prepositions in question as synonyms in an ontology.

4.2 Structures not permitted in search phrases

The following types of structures are prohibited in search phrases and result in Python user-defined errors:

4.2.1 Multiple clauses

A dog chases a cat. A cat chases a dog (English)
Ein Hund jagt eine Katze. Eine Katze jagt einen Hund (German)

Each clause must be separated out into its own search phrase and registered individually.

4.2.2 Negation

A dog does not chase a cat. (English)
Ein Hund jagt keine Katze. (German)

Negative expressions are recognised as such in documents and the generated matches are marked as negative; allowing search phrases themselves to be negative would overcomplicate the library without offering any benefits.

4.2.3 Conjunction

A dog and a lion chase a cat. (English)
Ein Hund und ein Löwe jagen eine Katze. (German)

Wherever conjunction occurs in documents, Holmes distributes the information among multiple matches as explained above. In the unlikely event that there should be a requirement to capture conjunction explicitly when matching, this could be achieved by using the Manager.match() function and looking for situations where the document token objects are shared by multiple match objects.
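The suggested technique of detecting conjunction via shared document tokens could look roughly like this. The match-dictionary shape, with explicit token indexes, is a simplified assumption for illustration:

```python
from collections import defaultdict

# Simplified matches for "A dog and a lion chase a cat": Holmes produces one
# match per conjunct, and both matches share the tokens for "chase" and "cat".
matches = [
    {"id": 0, "document_token_indexes": {1, 5, 7}},  # dog ... chase ... cat
    {"id": 1, "document_token_indexes": {4, 5, 7}},  # lion ... chase ... cat
]

token_to_matches = defaultdict(set)
for m in matches:
    for index in m["document_token_indexes"]:
        token_to_matches[index].add(m["id"])

# Tokens shared by more than one match point to conjunction in the document.
conjoined = {index: ids for index, ids in token_to_matches.items() if len(ids) > 1}
print(sorted(conjoined))  # [5, 7]
```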

4.2.4 Lack of lexical words

The (English)
Der (German)

A search phrase cannot be processed if it does not contain any words that can be matched to documents.

4.2.5 Coreferring pronouns

A dog chases a cat and he chases a mouse (English)
Ein Hund jagt eine Katze und er jagt eine Maus (German)

Pronouns that corefer with nouns elsewhere in the search phrase are not permitted as this would overcomplicate the library without offering any benefits.

4.3 Structures strongly discouraged in search phrases

The following types of structures are strongly discouraged in search phrases:

4.3.1 Ungrammatical expressions

Dog chase cat (English)
Hund jagen Katze (German)

Although these will sometimes work, the results will be better if search phrases are expressed grammatically.

4.3.2 Complex verb tenses

A cat is chased by a dog (English)
A dog will have chased a cat (English)
Eine Katze wird durch einen Hund gejagt (German)
Ein Hund wird eine Katze gejagt haben (German)

Although these will sometimes work, the results will be better if verbs in search phrases are expressed in the present active.

4.3.3 Questions

Who chases the cat? (English)
Wer jagt die Katze? (German)

Although questions are supported as query phrases in the topic matching use case, they are not appropriate as search phrases. Questions should be re-phrased as statements, in this case

Something chases the cat (English)
Etwas jagt die Katze (German).

4.3.4 Compound words (relates to German only)

Informationsextraktion (German)
Ein Stadtmittetreffen (German)

The internal structure of German compound words is analysed within searched documents as well as within query phrases in the topic matching use case, but not within search phrases. In search phrases, compound words should be re-expressed as genitive constructions even in cases where this does not strictly capture their meaning:

Extraktion der Information (German)
Ein Treffen der Stadtmitte (German)

4.4 Structures to be used with caution in search phrases

The following types of structures should be used with caution in searchphrases:

4.4.1 Very complex structures

A fierce dog chases a scared cat on the way to the theatre (English)
Ein kämpferischer Hund jagt eine verängstigte Katze auf dem Weg ins Theater (German)

Holmes can handle any level of complexity within search phrases, but the more complex a structure, the less likely it becomes that a document sentence will match it. If it is really necessary to match such complex relationships with search phrases rather than with topic matching, they are typically better extracted by splitting the search phrase up, e.g.

A fierce dog (English)
A scared cat (English)
A dog chases a cat (English)
Something chases something on the way to the theatre (English)

Ein kämpferischer Hund (German)
Eine verängstigte Katze (German)
Ein Hund jagt eine Katze (German)
Etwas jagt etwas auf dem Weg ins Theater (German)

Correlations between the resulting matches can then be established by matching via the Manager.match() function and looking for situations where the document token objects are shared across multiple match objects.

One possible exception to this piece of advice is when embedding-based matching is active. Because whether or not each word in a search phrase matches then depends on whether or not other words in the same search phrase have been matched, large, complex search phrases can sometimes yield results that a combination of smaller, simpler search phrases would not.

4.4.2 Deverbal noun phrases

The chasing of a cat (English)
Die Jagd einer Katze (German)

These will often work, but it is generally better practice to use verbal search phrases like

Something chases a cat (English)
Etwas jagt eine Katze (German)

and to allow the corresponding nominal phrases to be matched via derivation-based matching.

5. Use cases and examples

5.1 Chatbot

The chatbot use case has already been introduced: a predefined set of search phrases is used to extract information from phrases entered interactively by an end user, which in this use case act as the documents.

The Holmes source code ships with two examples demonstrating the chatbot use case, one for each language, with predefined ontologies. Having cloned the source code and installed the Holmes library, navigate to the /examples directory and type the following (Linux):

English:

python3 example_chatbot_EN_insurance.py

German:

python3 example_chatbot_DE_insurance.py

or click on the files in Windows Explorer (Windows).

Holmes matches syntactically distinct structures that are semantically equivalent, i.e. that share the same meaning. In a real chatbot use case, users will typically enter equivalent information with phrases that are semantically distinct as well, i.e. that have different meanings. Because the effort involved in registering a search phrase is barely greater than the time it takes to type it in, it makes sense to register a large number of search phrases for each relationship you are trying to extract: essentially all ways people have been observed to express the information you are interested in or all ways you can imagine somebody might express the information you are interested in. To assist this, search phrases can be registered with labels that do not need to be unique: a label can then be used to express the relationship an entire group of search phrases is designed to extract. Note that when many search phrases have been defined to extract the same relationship, a single user entry is likely to be sometimes matched by multiple search phrases. This must be handled appropriately by the calling application.
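A chatbot dispatcher will typically reduce the returned matches to the set of matched labels before acting on them, so that each relationship triggers its dialog-flow action once regardless of how many search phrases fired. The match-dictionary key below is an illustrative assumption:

```python
# Several search phrases registered under the same label have matched the
# same user entry; the chatbot should react to each label once, not per match.
matches = [
    {"search_phrase_label": "report-accident", "search_phrase": "I crash a car"},
    {"search_phrase_label": "report-accident", "search_phrase": "I have an accident"},
    {"search_phrase_label": "greeting", "search_phrase": "Hello"},
]

matched_labels = {m["search_phrase_label"] for m in matches}
for label in sorted(matched_labels):
    # Each label triggers its dialog-flow action exactly once.
    print(label)
```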

One obvious weakness of Holmes in the chatbot setting is its sensitivity to correct spelling and, to a lesser extent, to correct grammar. Strategies for mitigating this weakness include:

  • Defining common misspellings as synonyms in the ontology
  • Defining specific search phrases including common misspellings
  • Putting user entry through a spellchecker before submitting it to Holmes
  • Explaining the importance of correct spelling and grammar to users

5.2 Structural extraction

The structural extraction use case uses structural matching in the same way as the chatbot use case, and many of the same comments and tips apply to it. The principal differences are that pre-existing and often lengthy documents are scanned rather than text snippets entered ad-hoc by the user, and that the returned match objects are not used to drive a dialog flow; they are examined solely to extract and store structured information.

Code for performing structural extraction would typically perform the following tasks:

  • Initialize the Holmes manager object.
  • Call Manager.register_search_phrase() several times to define a number of search phrases specifying the information to be extracted.
  • Call Manager.parse_and_register_document() several times to load a number of documents within which to search.
  • Call Manager.match() to perform the matching.
  • Query the returned match objects to obtain the extracted information and store it in a database.
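The final step, storing the extracted information, might be sketched as follows. The (acquirer, target) pairs stand in for values that would be read out of Holmes match objects; the takeover example and the table schema are illustrative assumptions:

```python
import sqlite3

# Hypothetical results of matching "A company takes over a company" against
# business articles: one (acquirer, target) pair per Holmes match.
extracted_pairs = [
    ("Acme Inc.", "Foobar Ltd."),
    ("BigCorp", "SmallCo"),
]

connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE takeovers (acquirer TEXT, target TEXT)")
connection.executemany("INSERT INTO takeovers VALUES (?, ?)", extracted_pairs)
connection.commit()

rows = connection.execute("SELECT acquirer, target FROM takeovers").fetchall()
print(rows)
```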

5.3 Topic matching

The topic matching use case matches a query document, or alternatively a query phrase entered ad-hoc by the user, against a set of documents pre-loaded into memory. The aim is to find the passages in the documents whose topic most closely corresponds to the topic of the query document; the output is an ordered list of passages scored according to topic similarity. Additionally, if a query phrase contains an initial question word, the output will contain potential answers to the question.

Topic matching queries may contain generic pronouns and named-entity identifiers just like search phrases, although the ENTITYNOUN token is not supported. However, an important difference from search phrases is that the topic matching use case places no restrictions on the grammatical structures permissible within the query document.

In addition to the Holmes demonstration website, the Holmes source code ships with three examples demonstrating the topic matching use case with an English literature corpus, a German literature corpus and a German legal corpus respectively. Users are encouraged to run these to get a feel for how they work.

Topic matching uses a variety of strategies to find text passages that are relevant to the query. These include resource-hungry procedures like investigating semantic relationships and comparing embeddings. Because applying these across the board would prevent topic matching from scaling, Holmes only attempts them for specific areas of the text that less resource-intensive strategies have already marked as looking promising. This and the other interior workings of topic matching are explained here.

5.4 Supervised document classification

In the supervised document classification use case, a classifier is trained with a number of documents that are each pre-labelled with a classification. The trained classifier then assigns one or more labels to new documents according to what each new document is about. As explained here, ontologies can be used both to enrich the comparison of the content of the various documents and to capture implication relationships between classification labels.

A classifier makes use of a neural network (a multilayer perceptron) whose topology can either be determined automatically by Holmes or specified explicitly by the user. With a large number of training documents, the automatically determined topology can easily exhaust the memory available on a typical machine; if there is no opportunity to scale up the memory, this problem can be remedied by specifying a smaller number of hidden layers or a smaller number of nodes in one or more of the layers.

A trained document classification model retains no references to its training data. This is an advantage from a data protection viewpoint, although it cannot presently be guaranteed that models will not contain individual personal or company names.

A typical problem with the execution of many document classification use cases is that a new classification label is added when the system is already live but that there are initially no examples of this new classification with which to train a new model. The best course of action in such a situation is to define search phrases which preselect the more obvious documents with the new classification using structural matching. Those documents that are not preselected as having the new classification label are then passed to the existing, previously trained classifier in the normal way. When enough documents exemplifying the new classification have accumulated in the system, the model can be retrained and the preselection search phrases removed.
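This interim strategy can be sketched as a simple dispatch function. Both the rule-based preselection and the trained classifier below are hypothetical stand-ins, not Holmes APIs:

```python
def classify(document_text, preselection_rules, trained_classifier):
    """Assign the new label via rule-based preselection where possible;
    otherwise fall back to the previously trained classifier."""
    for label, predicate in preselection_rules:
        if predicate(document_text):
            return label
    return trained_classifier(document_text)

# Stand-in for structural-matching preselection of a new "crypto" label.
rules = [("crypto", lambda text: "blockchain" in text.lower())]
# Stand-in for the existing trained model.
old_classifier = lambda text: "finance"

print(classify("Blockchain startup raises funds", rules, old_classifier))  # crypto
print(classify("Bank announces results", rules, old_classifier))           # finance
```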

Holmes ships with an example script demonstrating supervised document classification for English with the BBC Documents dataset. The script downloads the documents (for this operation and for this operation alone, you will need to be online) and places them in a working directory. When training is complete, the script saves the model to the working directory. If the model file is found in the working directory on subsequent invocations of the script, the training phase is skipped and the script goes straight to the testing phase. This means that if it is wished to repeat the training phase, either the model has to be deleted from the working directory or a new working directory has to be specified to the script.

Having cloned the source code and installed the Holmes library, navigate to the /examples directory. Specify a working directory at the top of the example_supervised_topic_model_EN.py file, then type python3 example_supervised_topic_model_EN.py (Linux) or click on the script in Windows Explorer (Windows).

It is important to realise that Holmes learns to classify documents according to the words or semantic relationships they contain, taking any structural matching ontology into account in the process. For many classification tasks, this is exactly what is required; but there are tasks (e.g. author attribution according to the frequency of grammatical constructions typical for each author) where it is not. For the right task, Holmes achieves impressive results. For the BBC Documents benchmark processed by the example script, Holmes performs slightly better than benchmarks available online (see e.g. here), although the difference is probably too slight to be significant, especially given that different training/test splits were used in each case: Holmes has been observed to learn models that predict the correct result between 96.9% and 98.7% of the time. The range is explained by the fact that the behaviour of the neural network is not fully deterministic.

The interior workings of supervised document classification are explainedhere.

6 Interfaces intended for public use

6.1Manager

holmes_extractor.Manager(self, model, *, overall_similarity_threshold=1.0,
  embedding_based_matching_on_root_words=False, ontology=None,
  analyze_derivational_morphology=True, perform_coreference_resolution=True,
  use_reverse_dependency_matching=True, number_of_workers=None, verbose=False)

The facade class for the Holmes library.

Parameters:

model -- the name of the spaCy model, e.g. *en_core_web_trf*
overall_similarity_threshold -- the overall similarity threshold for embedding-based
  matching. Defaults to *1.0*, which deactivates embedding-based matching. Note that this
  parameter is not relevant for topic matching, where the thresholds for embedding-based
  matching are set on the call to *topic_match_documents_against*.
embedding_based_matching_on_root_words -- determines whether or not embedding-based
  matching should be attempted on search-phrase root tokens, which has a considerable
  performance hit. Defaults to *False*. Note that this parameter is not relevant for topic
  matching.
ontology -- an *Ontology* object. Defaults to *None* (no ontology).
analyze_derivational_morphology -- *True* if matching should be attempted between different
  words from the same word family. Defaults to *True*.
perform_coreference_resolution -- *True* if coreference resolution should be taken into account
  when matching. Defaults to *True*.
use_reverse_dependency_matching -- *True* if appropriate dependencies in documents can be
  matched to dependencies in search phrases where the two dependencies point in opposite
  directions. Defaults to *True*.
number_of_workers -- the number of worker processes to use, or *None* if the number of worker
  processes should depend on the number of available cores. Defaults to *None*.
verbose -- a boolean value specifying whether multiprocessing messages should be outputted to
  the console. Defaults to *False*.

Manager.register_serialized_document(self, serialized_document:bytes, label:str="") -> None

Parameters:

serialized_document -- a preparsed Holmes document.
label -- a label for the document which must be unique. Defaults to the empty string,
  which is intended for use cases involving single documents (typically user entries).

Manager.register_serialized_documents(self, document_dictionary:dict[str, bytes]) -> None

Note that this function is the most efficient way of loading documents.

Parameters:

document_dictionary -- a dictionary from labels to serialized documents.

Manager.parse_and_register_document(self, document_text:str, label:str='') -> None

Parameters:

document_text -- the raw document text.
label -- a label for the document which must be unique. Defaults to the empty string,
  which is intended for use cases involving single documents (typically user entries).

Manager.remove_document(self, label:str) -> None

Manager.remove_all_documents(self, labels_starting:str=None) -> None

Parameters:

labels_starting -- a string starting the labels of documents to be removed,
  or *None* if all documents are to be removed.

Manager.list_document_labels(self) -> List[str]

Returns a list of the labels of the currently registered documents.

Manager.serialize_document(self, label:str) -> Optional[bytes]

Returns a serialized representation of a Holmes document that can be
persisted to a file. If *label* is not the label of a registered document,
*None* is returned instead.

Parameters:

label -- the label of the document to be serialized.

Manager.get_document(self, label:str='') -> Optional[Doc]

Returns a Holmes document. If *label* is not the label of a registered document, *None*
is returned instead.

Parameters:

label -- the label of the document to be returned.

Manager.debug_document(self, label:str='') -> None

Outputs a debug representation for a loaded document.

Parameters:

label -- the label of the document to be debugged.

Manager.register_search_phrase(self, search_phrase_text:str, label:str=None) -> SearchPhrase

Registers and returns a new search phrase.

Parameters:

search_phrase_text -- the raw search phrase text.
label -- a label for the search phrase, which need not be unique.
  If label==None, the assigned label defaults to the raw search phrase text.

Manager.remove_all_search_phrases_with_label(self, label:str) -> None

Manager.remove_all_search_phrases(self) -> None

Manager.list_search_phrase_labels(self) -> List[str]

Manager.match(self, search_phrase_text:str=None, document_text:str=None) -> List[Dict]

Matches search phrases to documents and returns the result as match dictionaries.

Parameters:

search_phrase_text -- a text from which to generate a search phrase, or *None* if the
  preloaded search phrases should be used for matching.
document_text -- a text from which to generate a document, or *None* if the preloaded
  documents should be used for matching.

Manager.topic_match_documents_against(self, text_to_match:str, *,
    use_frequency_factor:bool=True,
    maximum_activation_distance:int=75,
    word_embedding_match_threshold:float=0.8,
    initial_question_word_embedding_match_threshold:float=0.7,
    relation_score:int=300,
    reverse_only_relation_score:int=200,
    single_word_score:int=50,
    single_word_any_tag_score:int=20,
    initial_question_word_answer_score:int=600,
    initial_question_word_behaviour:str='process',
    different_match_cutoff_score:int=15,
    overlapping_relation_multiplier:float=1.5,
    embedding_penalty:float=0.6,
    ontology_penalty:float=0.9,
    relation_matching_frequency_threshold:float=0.25,
    embedding_matching_frequency_threshold:float=0.5,
    sideways_match_extent:int=100,
    only_one_result_per_document:bool=False,
    number_of_results:int=10,
    document_label_filter:str=None,
    tied_result_quotient:float=0.9) -> List[Dict]

Returns a list of dictionaries representing the results of a topic match between an entered
text and the loaded documents.

Parameters:

text_to_match -- the text to match against the loaded documents.
use_frequency_factor -- *True* if scores should be multiplied by a factor between 0 and 1
  expressing how rare the words matching each phraselet are in the corpus. Note that,
  even if this parameter is set to *False*, the factors are still calculated as they are
  required for determining which relation and embedding matches should be attempted.
maximum_activation_distance -- the number of words it takes for a previous phraselet
  activation to reduce to zero when the library is reading through a document.
word_embedding_match_threshold -- the cosine similarity above which two words match where
  the search phrase word does not govern an interrogative pronoun.
initial_question_word_embedding_match_threshold -- the cosine similarity above which two
  words match where the search phrase word governs an interrogative pronoun.
relation_score -- the activation score added when a normal two-word relation is matched.
reverse_only_relation_score -- the activation score added when a two-word relation
  is matched using a search phrase that can only be reverse-matched.
single_word_score -- the activation score added when a single noun is matched.
single_word_any_tag_score -- the activation score added when a single word is matched
  that is not a noun.
initial_question_word_answer_score -- the activation score added when a question word is
  matched to a potential answer phrase.
initial_question_word_behaviour -- 'process' if a question word in the sentence
  constituent at the beginning of *text_to_match* is to be matched to document phrases
  that answer it and to matching question words; 'exclusive' if only topic matches that
  answer questions are to be permitted; 'ignore' if question words are to be ignored.
different_match_cutoff_score -- the activation threshold under which topic matches are
  separated from one another. Note that the default value will probably be too low if
  *use_frequency_factor* is set to *False*.
overlapping_relation_multiplier -- the value by which the activation score is multiplied
  when two relations were matched and the matches involved a common document word.
embedding_penalty -- a value between 0 and 1 with which scores are multiplied when the
  match involved an embedding. The result is additionally multiplied by the overall
  similarity measure of the match.
ontology_penalty -- a value between 0 and 1 with which scores are multiplied for each
  word match within a match that involved the ontology. For each such word match,
  the score is multiplied by the value (abs(depth) + 1) times, so that the penalty is
  higher for hyponyms and hypernyms than for synonyms and increases with the
  depth distance.
relation_matching_frequency_threshold -- the frequency threshold above which single
  word matches are used as the basis for attempting relation matches.
embedding_matching_frequency_threshold -- the frequency threshold above which single
  word matches are used as the basis for attempting relation matches with
  embedding-based matching on the second word.
sideways_match_extent -- the maximum number of words that may be incorporated into a
  topic match either side of the word where the activation peaked.
only_one_result_per_document -- if *True*, prevents multiple results from being returned
  for the same document.
number_of_results -- the number of topic match objects to return.
document_label_filter -- optionally, a string with which document labels must start to
  be considered for inclusion in the results.
tied_result_quotient -- the quotient between a result and following results above which
  the results are interpreted as tied.
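The tied_result_quotient logic can be illustrated in isolation. This is a sketch of the described behaviour, not the library's actual code: consecutive results whose score quotient exceeds the threshold are flagged as tied.

```python
def mark_ties(scores, tied_result_quotient=0.9):
    """Given descending scores, group consecutive results whose quotient
    (following / preceding) exceeds the threshold into the same rank."""
    ranks = [0]
    for previous, current in zip(scores, scores[1:]):
        if current / previous > tied_result_quotient:
            ranks.append(ranks[-1])      # tied with the previous result
        else:
            ranks.append(ranks[-1] + 1)  # strictly lower rank
    return ranks

print(mark_ties([500.0, 480.0, 300.0, 295.0]))  # [0, 0, 1, 1]
```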
Manager.get_supervised_topic_training_basis(self, *, classification_ontology:Ontology=None,  overlap_memory_size:int=10, oneshot:bool=True, match_all_words:bool=False,  verbose:bool=True) -> SupervisedTopicTrainingBasis:Returns an object that is used to train and generate a model for thesupervised document classification use case.Parameters:classification_ontology -- an Ontology object incorporating relationships between    classification labels, or 'None' if no such ontology is to be used.overlap_memory_size -- how many non-word phraselet matches to the left should be    checked for words in common with a current match.oneshot -- whether the same word or relationship matched multiple times within a    single document should be counted once only (value 'True') or multiple times    (value 'False')match_all_words -- whether all single words should be taken into account          (value 'True') or only single words with noun tags (value 'False')          verbose -- if 'True', information about training progress is outputted to the console.
Manager.deserialize_supervised_topic_classifier(self, serialized_model:bytes, verbose:bool=False) -> SupervisedTopicClassifier

Returns a classifier for the supervised document classification use case that will use a supplied pre-trained model.

Parameters:
serialized_model -- the pre-trained model as returned from SupervisedTopicClassifier.serialize_model().
verbose -- if 'True', information about matching is outputted to the console.
Manager.start_chatbot_mode_console(self)

Starts a chatbot mode console enabling the matching of pre-registered search phrases to documents (chatbot entries) entered ad-hoc by the user.
Manager.start_structural_search_mode_console(self)

Starts a structural extraction mode console enabling the matching of pre-registered documents to search phrases entered ad-hoc by the user.
Manager.start_topic_matching_search_mode_console(self, only_one_result_per_document:bool=False, word_embedding_match_threshold:float=0.8, initial_question_word_embedding_match_threshold:float=0.7)

Starts a topic matching search mode console enabling the matching of pre-registered documents to query phrases entered ad-hoc by the user.

Parameters:
only_one_result_per_document -- if 'True', prevents multiple topic match results from being returned for the same document.
word_embedding_match_threshold -- the cosine similarity above which two words match where the search phrase word does not govern an interrogative pronoun.
initial_question_word_embedding_match_threshold -- the cosine similarity above which two words match where the search phrase word governs an interrogative pronoun.
Manager.close(self) -> None

Terminates the worker processes.

6.2 manager.nlp

manager.nlp is the underlying spaCy Language object on which both Coreferee and Holmes have been registered as custom pipeline components. The most efficient way of parsing documents for use with Holmes is to call manager.nlp.pipe(). This yields an iterable of documents that can then be loaded into Holmes via manager.register_serialized_documents().

The pipe() method has an argument n_process that specifies the number of processors to use. With _lg, _md and _sm spaCy models, there are some situations where it can make sense to specify a value other than 1 (the default). Note however that with transformer spaCy models (_trf), values other than 1 are not supported.

6.3 Ontology

holmes_extractor.Ontology(self, ontology_path, owl_class_type='http://www.w3.org/2002/07/owl#Class', owl_individual_type='http://www.w3.org/2002/07/owl#NamedIndividual', owl_type_link='http://www.w3.org/1999/02/22-rdf-syntax-ns#type', owl_synonym_type='http://www.w3.org/2002/07/owl#equivalentClass', owl_hyponym_type='http://www.w3.org/2000/01/rdf-schema#subClassOf', symmetric_matching=False)

Loads information from an existing ontology and manages ontology matching.

The ontology must follow the W3C OWL 2 standard. Search phrase words are matched to hyponyms, synonyms and instances from within documents being searched.

This class is designed for small ontologies that have been constructed by hand for specific use cases. Where the aim is to model a large number of semantic relationships, word embeddings are likely to offer better results.

Holmes is not designed to support changes to a loaded ontology via direct calls to the methods of this class. It is also not permitted to share a single instance of this class between multiple Manager instances: instead, a separate Ontology instance pointing to the same path should be created for each Manager.

Matching is case-insensitive.

Parameters:
ontology_path -- the path from which the ontology is to be loaded, or a list of several such paths. See https://github.com/RDFLib/rdflib/.
owl_class_type -- optionally overrides the OWL 2 URL for types.
owl_individual_type -- optionally overrides the OWL 2 URL for individuals.
owl_type_link -- optionally overrides the RDF URL for types.
owl_synonym_type -- optionally overrides the OWL 2 URL for synonyms.
owl_hyponym_type -- optionally overrides the RDF URL for hyponyms.
symmetric_matching -- if 'True', hypernym relationships are also taken into account.

6.4 SupervisedTopicTrainingBasis (returned from Manager.get_supervised_topic_training_basis())

Holder object for training documents and their classifications from which one or more SupervisedTopicModelTrainer objects can be derived. This class is NOT threadsafe.

SupervisedTopicTrainingBasis.parse_and_register_training_document(self, text:str, classification:str, label:Optional[str]=None) -> None

Parses and registers a document to use for training.

Parameters:
text -- the document text.
classification -- the classification label.
label -- a label with which to identify the document in verbose training output, or 'None' if a random label should be assigned.
SupervisedTopicTrainingBasis.register_training_document(self, doc:Doc, classification:str, label:Optional[str]=None) -> None

Registers a pre-parsed document to use for training.

Parameters:
doc -- the document.
classification -- the classification label.
label -- a label with which to identify the document in verbose training output, or 'None' if a random label should be assigned.
SupervisedTopicTrainingBasis.register_additional_classification_label(self, label:str) -> None

Registers an additional classification label which no training document possesses explicitly but that should be assigned to documents whose explicit labels are related to the additional classification label via the classification ontology.
SupervisedTopicTrainingBasis.prepare(self) -> None

Matches the phraselets derived from the training documents against the training documents to generate frequencies that also include combined labels, and examines the explicit classification labels, the additional classification labels and the classification ontology to derive classification implications.

Once this method has been called, the instance no longer accepts new training documents or additional classification labels.

SupervisedTopicTrainingBasis.train(self, *, minimum_occurrences:int=4, cv_threshold:float=1.0, learning_rate:float=0.001, batch_size:int=5, max_epochs:int=200, convergence_threshold:float=0.0001, hidden_layer_sizes:Optional[List[int]]=None, shuffle:bool=True, normalize:bool=True) -> SupervisedTopicModelTrainer

Trains a model based on the prepared state.

Parameters:
minimum_occurrences -- the minimum number of times a word or relationship has to occur in the context of the same classification for the phraselet to be accepted into the final model.
cv_threshold -- the minimum coefficient of variation with which a word or relationship has to occur across the explicit classification labels for the phraselet to be accepted into the final model.
learning_rate -- the learning rate for the Adam optimizer.
batch_size -- the number of documents in each training batch.
max_epochs -- the maximum number of training epochs.
convergence_threshold -- the threshold below which loss measurements after consecutive epochs are regarded as equivalent. Training stops before 'max_epochs' is reached if equivalent results are achieved after four consecutive epochs.
hidden_layer_sizes -- a list containing the number of neurons in each hidden layer, or 'None' if the topology should be determined automatically.
shuffle -- 'True' if documents should be shuffled during batching.
normalize -- 'True' if normalization should be applied to the loss function.
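How 'minimum_occurrences' and 'cv_threshold' interact can be pictured with a small self-contained sketch. The helper below is hypothetical, not Holmes code; in particular, whether the population or sample standard deviation is used, and whether the minimum applies per label or in total, are assumptions made here for illustration.

```python
# Illustrative sketch of the 'minimum_occurrences' / 'cv_threshold' filtering
# (hypothetical helper; standard-deviation variant is an assumption).
from statistics import mean, pstdev

def phraselet_is_retained(occurrences_per_label,
                          minimum_occurrences=4, cv_threshold=1.0):
    # the phraselet must occur often enough for at least one classification...
    if max(occurrences_per_label) < minimum_occurrences:
        return False
    # ...and vary strongly enough across classifications to be predictive
    coefficient_of_variation = (pstdev(occurrences_per_label)
                                / mean(occurrences_per_label))
    return coefficient_of_variation >= cv_threshold

print(phraselet_is_retained([9, 0, 0]))  # True: frequent and label-specific
print(phraselet_is_retained([3, 3, 3]))  # False: too rare and evenly spread
```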

6.5 SupervisedTopicModelTrainer (returned from SupervisedTopicTrainingBasis.train())

Worker object used to train and generate models. This object could be removed from the public interface (SupervisedTopicTrainingBasis.train() could return a SupervisedTopicClassifier directly) but has been retained to facilitate testability.

This class is NOT threadsafe.

SupervisedTopicModelTrainer.classifier(self)

Returns a supervised topic classifier which contains no explicit references to the training data and that can be serialized.

6.6 SupervisedTopicClassifier (returned from SupervisedTopicModelTrainer.classifier() and Manager.deserialize_supervised_topic_classifier())

SupervisedTopicClassifier.parse_and_classify(self, text:str) -> Optional[OrderedDict]

Returns a dictionary from classification labels to probabilities ordered starting with the most probable, or *None* if the text did not contain any words recognised by the model.

Parameters:
text -- the text to parse and classify.
SupervisedTopicClassifier.classify(self, doc:Doc) -> Optional[OrderedDict]

Returns a dictionary from classification labels to probabilities ordered starting with the most probable, or *None* if the document did not contain any words recognised by the model.

Parameters:
doc -- the pre-parsed document to classify.
SupervisedTopicClassifier.serialize_model(self) -> str

Returns a serialized model that can be reloaded using *Manager.deserialize_supervised_topic_classifier()*.

6.7 Dictionary returned from Manager.match()

A text-only representation of a match between a search phrase and a document. The indexes refer to tokens.

Properties:

search_phrase_label -- the label of the search phrase.
search_phrase_text -- the text of the search phrase.
document -- the label of the document.
index_within_document -- the index of the match within the document.
sentences_within_document -- the raw text of the sentences within the document that matched.
negated -- 'True' if this match is negated.
uncertain -- 'True' if this match is uncertain.
involves_coreference -- 'True' if this match was found using coreference resolution.
overall_similarity_measure -- the overall similarity of the match, or '1.0' if embedding-based matching was not involved in the match.
word_matches -- an array of dictionaries with the properties:
  search_phrase_token_index -- the index of the token that matched from the search phrase.
  search_phrase_word -- the string that matched from the search phrase.
  document_token_index -- the index of the token that matched within the document.
  first_document_token_index -- the index of the first token that matched within the document. Identical to 'document_token_index' except where the match involves a multiword phrase.
  last_document_token_index -- the index of the last token that matched within the document (NOT one more than that index). Identical to 'document_token_index' except where the match involves a multiword phrase.
  structurally_matched_document_token_index -- the index of the token within the document that structurally matched the search phrase token. Is either the same as 'document_token_index' or is linked to 'document_token_index' within a coreference chain.
  document_subword_index -- the index of the token subword that matched within the document, or 'None' if matching was not with a subword but with an entire token.
  document_subword_containing_token_index -- the index of the document token that contained the subword that matched, which may be different from 'document_token_index' in situations where a word containing multiple subwords is split by hyphenation and a subword whose sense contributes to a word is not overtly realised within that word.
  document_word -- the string that matched from the document.
  document_phrase -- the phrase headed by the word that matched from the document.
  match_type -- 'direct', 'derivation', 'entity', 'embedding', 'ontology', 'entity_embedding' or 'question'.
  negated -- 'True' if this word match is negated.
  uncertain -- 'True' if this word match is uncertain.
  similarity_measure -- for types 'embedding' and 'entity_embedding', the similarity between the two tokens, otherwise '1.0'.
  involves_coreference -- 'True' if the word was matched using coreference resolution.
  extracted_word -- within the coreference chain, the most specific term that corresponded to the document_word.
  depth -- the number of hyponym relationships linking 'search_phrase_word' and 'extracted_word', or '0' if ontology-based matching is not active. Can be negative if symmetric matching is active.
  explanation -- a human-readable explanation of the word match from the perspective of the document word (e.g. to be used as a tooltip over it).

6.8 Dictionary returned from Manager.topic_match_documents_against()

A text-only representation of a topic match between a search text and a document.

Properties:

document_label -- the label of the document.
text -- the document text that was matched.
text_to_match -- the search text.
rank -- a string representation of the scoring rank, which can have a form like '2=' in the case of a tie.
index_within_document -- the index of the document token where the activation peaked.
subword_index -- the index of the subword within the document token where the activation peaked, or 'None' if the activation did not peak at a specific subword.
start_index -- the index of the first document token in the topic match.
end_index -- the index of the last document token in the topic match (NOT one more than that index).
sentences_start_index -- the token start index within the document of the sentence that contains 'start_index'.
sentences_end_index -- the token end index within the document of the sentence that contains 'end_index' (NOT one more than that index).
sentences_character_start_index_in_document -- the character index of the first character of 'text' within the document.
sentences_character_end_index_in_document -- one more than the character index of the last character of 'text' within the document.
score -- the score.
word_infos -- an array of arrays with the semantics:
  [0] -- 'relative_start_index' -- the index of the first character in the word relative to 'sentences_character_start_index_in_document'.
  [1] -- 'relative_end_index' -- one more than the index of the last character in the word relative to 'sentences_character_start_index_in_document'.
  [2] -- 'type' -- 'single' for a single-word match, 'relation' if within a relation match involving two words, 'overlapping_relation' if within a relation match involving three or more words.
  [3] -- 'is_highest_activation' -- 'True' if this was the word at which the highest activation score reported in 'score' was achieved, otherwise 'False'.
  [4] -- 'explanation' -- a human-readable explanation of the word match from the perspective of the document word (e.g. to be used as a tooltip over it).
answers -- an array of arrays with the semantics:
  [0] -- the index of the first character of a potential answer to an initial question word.
  [1] -- one more than the index of the last character of a potential answer to an initial question word.
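Consuming the 'word_infos' offsets can be sketched as follows. The dictionary below is hypothetical example data written in the documented shape, not output captured from Holmes, and the helper function is invented.

```python
# Sketch of consuming the 'word_infos' offsets from a topic match dictionary
# (hypothetical example data in the documented shape, not real Holmes output).

def matched_words(topic_match):
    # relative_start_index ([0]) and relative_end_index ([1]) are character
    # offsets into 'text', which begins at
    # 'sentences_character_start_index_in_document' within the document
    return [topic_match["text"][info[0]:info[1]]
            for info in topic_match["word_infos"]]

example = {
    "text": "ACME plans to acquire BetaCorp",
    "word_infos": [
        [0, 4, "relation", False, "..."],
        [14, 21, "relation", True, "..."],
        [22, 30, "relation", False, "..."],
    ],
}
print(matched_words(example))  # ['ACME', 'acquire', 'BetaCorp']
```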

7 A note on the license

Earlier versions of Holmes could only be published under a restrictive license because of patent issues. As explained in the introduction, this is no longer the case thanks to the generosity of AstraZeneca: versions from 4.0.0 onwards are licensed under the MIT license.

8 Information for developers

8.1 How it works

8.1.1 Structural matching (chatbot and structural extraction)

The word-level matching and the high-level operation of structural matching between search-phrase and document subgraphs both work more or less as one would expect. What is perhaps more in need of further comment is the semantic analysis code subsumed in the parsing.py script as well as in the language_specific_rules.py script for each language.

SemanticAnalyzer is an abstract class that is subclassed for each language: at present by EnglishSemanticAnalyzer and GermanSemanticAnalyzer. These classes contain most of the semantic analysis code. SemanticMatchingHelper is a second abstract class, again with a concrete implementation for each language, that contains the semantic analysis code that is required at matching time. Moving this out to a separate class family was necessary because, on operating systems that spawn rather than fork processes (e.g. Windows), SemanticMatchingHelper instances have to be serialized when the worker processes are created: this would not be possible for SemanticAnalyzer instances because not all spaCy models are serializable, and it would also unnecessarily consume large amounts of memory.

At present, all functionality that is common to the two languages is realised in the two abstract parent classes. Especially because English and German are closely related languages, it is probable that functionality will need to be moved from the abstract parent classes to specific implementing child classes if and when new semantic analyzers are added for new languages.

The HolmesDictionary class is defined as a spaCy extension attribute that is accessed using the syntax token._.holmes. The most important information in the dictionary is a list of SemanticDependency objects. These are derived from the dependency relationships in the spaCy output (token.dep_) but go through a considerable amount of processing to make them 'less syntactic' and 'more semantic'. To give but a few examples:

  • Where coordination occurs, dependencies are added to and from all siblings.
  • In passive structures, the dependencies are swapped around to capture the fact that the syntactic subject is the semantic object and vice versa.
  • Relationships are added spanning main and subordinate clauses to capture the fact that the syntactic subject of a main clause also plays a semantic role in the subordinate clause.
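The passive swap can be pictured with a toy representation in which dependencies are triples of parent index, label and child index. This is purely illustrative: the labels and the helper below are invented and do not reflect Holmes's or spaCy's actual inventory.

```python
# Toy illustration (not Holmes code) of the passive swap described above:
# the syntactic subject of a passive becomes the semantic object, and the
# 'by'-agent becomes the semantic subject. Labels here are invented.

def semanticize_passive(dependencies):
    swapped = []
    for parent, label, child in dependencies:
        if label == "nsubjpass":       # syntactic subject of a passive...
            swapped.append((parent, "dobj", child))   # ...is the semantic object
        elif label == "agent_obj":     # the 'by'-agent...
            swapped.append((parent, "nsubj", child))  # ...is the semantic subject
        else:
            swapped.append((parent, label, child))
    return swapped

# "The company was taken over by a competitor" (token indexes are illustrative)
deps = [(3, "nsubjpass", 1), (3, "agent_obj", 7)]
print(semanticize_passive(deps))  # [(3, 'dobj', 1), (3, 'nsubj', 7)]
```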

Some new semantic dependency labels that do not occur in spaCy outputs as values of token.dep_ are added for Holmes semantic dependencies. It is important to understand that Holmes semantic dependencies are used exclusively for matching and are therefore neither intended nor required to form a coherent set of linguistic-theoretical entities or relationships; whatever works best for matching is assigned on an ad-hoc basis.

For each language, the match_implication_dict dictionary maps search-phrase semantic dependencies to matching document semantic dependencies and is responsible for the asymmetry of matching between search phrases and documents.
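The asymmetry such a mapping encodes can be sketched with a minimal lookup. The entries shown are invented examples for illustration, not the actual contents of Holmes's dictionaries.

```python
# Illustrative sketch of the asymmetry: a document dependency matches a
# search-phrase dependency if it is identical or listed as an admissible
# counterpart. These entries are invented, not Holmes's actual contents.

match_implication_dict = {
    # a search-phrase subject may also match a passive subject in the document
    "nsubj": ["nsubjpass"],
}

def dependency_matches(search_phrase_dep, document_dep):
    return (document_dep == search_phrase_dep
            or document_dep in match_implication_dict.get(search_phrase_dep, []))

print(dependency_matches("nsubj", "nsubjpass"))  # True: the document may be passive
print(dependency_matches("nsubjpass", "nsubj"))  # False: the mapping is one-way
```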

8.1.2 Topic matching

Topic matching involves the following steps:

  1. The query document or query phrase is parsed and a number of phraselets are derived from it. Single-word phraselets are extracted for every word (or subword in German) with its own meaning within the query phrase apart from a handful of stop words defined within the semantic matching helper (SemanticMatchingHelper.topic_matching_phraselet_stop_lemmas), which are consistently ignored throughout the whole process.
  2. Two-word or relation phraselets are extracted from the query document or query phrase wherever certain grammatical structures are found. The structures that trigger two-word phraselets differ from language to language but typically include verb-subject, verb-object and noun-adjective pairs as well as verb-noun and noun-noun relations spanning prepositions. Each relation phraselet has a parent (governor) word or subword and a child (governed) word or subword. The relevant phraselet structures for a given language are defined in SemanticMatchingHelper.phraselet_templates.
  3. Both types of phraselet are assigned a frequency factor expressing how common or rare their words are in the corpus. Frequency factors are determined using a logarithmic calculation and range from 0.0 (very common) to 1.0 (very rare). Each word within a relation phraselet is also assigned its own frequency factor.
  4. Phraselet templates where the parent word belongs to a closed word class, e.g. prepositions, can be defined as 'reverse_only'. This signals that matching with derived phraselets should only be attempted starting from the child word rather than from the parent word as normal. Phraselets are also defined as reverse-only when the parent word is one of a handful of words defined within the semantic matching helper (SemanticMatchingHelper.topic_matching_reverse_only_parent_lemmas) or when the frequency factor for the parent word is below the threshold for relation matching (relation_matching_frequency_threshold, default: 0.25). These measures are necessary because matching on e.g. a parent preposition would lead to a large number of potential matches that would take a lot of resources to investigate: it is better to start the investigation from the less frequent word within a given relation.
  5. All single-word phraselets are matched against the document corpus.
  6. Normal structural matching is used to match against the document corpus all relation phraselets that are not set to reverse-matching.
  7. Reverse matching starts at all words in the corpus that match a relation phraselet child word. Every word governing one of these words is a potential match for the corresponding relation phraselet parent word, so structural matching is attempted starting at all these parent words. Reverse matching is only attempted for relation phraselets where the child word's frequency factor is above the threshold for relation matching (relation_matching_frequency_threshold, default: 0.25).
  8. If either the parent or the child word of a relation template has a frequency factor above a configurable threshold (embedding_matching_frequency_threshold, default: 0.5), matching at all of those words where the relation template has not already been matched is retried using embeddings at the other word within the relation. A pair of words is then regarded as matching when their mutual cosine similarity is above initial_question_word_embedding_match_threshold (default: 0.7) in situations where the document word has an initial question word in its phrase, or word_embedding_match_threshold (default: 0.8) in all other situations.
  9. The set of structural matches collected up to this point is filtered to cover cases where the same document words were matched by multiple phraselets, where multiple sibling words have been matched by the same phraselet and one sibling has a higher embedding-based similarity than the other, and where a phraselet has matched multiple words that corefer with one another.
  10. Each document is scanned from beginning to end and a psychologically inspired activation score is determined for each word in each document.
  • Activation is tracked separately for each phraselet. Each time a match for a phraselet is encountered, the activation for that phraselet is set to the score returned by the match, unless the existing activation is already greater than that score. If the parameter use_frequency_factor is set to True (the default), each score is scaled by the frequency factor of its phraselet, meaning that words that occur less frequently in the corpus give rise to higher scores.
  • For as long as the activation score for a phraselet has a value above zero, its value divided by a configurable number (maximum_activation_distance; default: 75) is subtracted from it as each new word is read.
  • The score returned by a match depends on whether the match was produced by a single-word noun phraselet that matched an entire word (single_word_score; default: 50), a non-noun single-word phraselet or a noun phraselet that matched a subword (single_word_any_tag_score; default: 20), a relation phraselet produced by a reverse-only template (reverse_only_relation_score; default: 200), any other (normally matched) relation phraselet (relation_score; default: 300), or a relation phraselet involving an initial question word (initial_question_word_answer_score; default: 600).
  • Where a match involves embedding-based matching, the resulting inexactitude is captured by multiplying the potential new activation score by the similarity measure that was returned for the match multiplied by a penalty value (embedding_penalty; default: 0.6).
  • Where a match involves ontology-based matching, the resulting inexactitude is captured by multiplying the potential new activation score by a penalty value (ontology_penalty; default: 0.9) once more often than the difference in depth between the two ontology entries, i.e. once for a synonym, twice for a child, three times for a grandchild and so on.
  • Where the same word was involved in matches against more than one two-word phraselet, this implies that a structure involving three or more words has been matched. The activation score returned by each match within such a structure is multiplied by a configurable factor (overlapping_relation_multiplier; default: 1.5).
  11. The most relevant passages are then determined by the highest activation score peaks within the documents. Areas to either side of each peak up to a certain distance (sideways_match_extent; default: 100 words) within which the activation score is higher than the different_match_cutoff_score (default: 15) are regarded as belonging to a contiguous passage around the peak that is then returned as a TopicMatch object. (Note that this default will almost certainly turn out to be too low if use_frequency_factor is set to False.) A word whose activation equals the threshold exactly is included at the beginning of the area as long as the next word where activation increases has a score above the threshold. If the topic match peak is below the threshold, the topic match will only consist of the peak word.
  12. If initial_question_word_behaviour is set to process (the default) or to exclusive, where a document word has matched an initial question word from the query phrase, the subtree of the matched document word is identified as a potential answer to the question and added to the dictionary to be returned. If initial_question_word_behaviour is set to exclusive, any topic matches that do not contain answers to initial question words are discarded.
  13. Setting only_one_result_per_document = True prevents more than one result from being returned from the same document; only the result from each document with the highest score will then be returned.
  14. Adjacent topic matches whose scores stand in a quotient above tied_result_quotient (default: 0.9) are labelled as tied.
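The per-phraselet activation tracking described above can be sketched with toy data. This is an illustrative approximation rather than Holmes code: in particular, whether decay is applied before or after a match at the same word, and how per-phraselet activations combine into a per-word total, are guesses made here.

```python
# Toy sketch of activation tracking (not Holmes internals): activation is set
# to the match score at matches, decays by activation/maximum_activation_distance
# per word read, and the per-word total across phraselets gives the peak.

def activation_profile(num_words, matches, maximum_activation_distance=75):
    # matches: {phraselet_label: {word_index: match_score}}
    activations = {phraselet: 0.0 for phraselet in matches}
    totals = []
    for index in range(num_words):
        for phraselet, scored_indexes in matches.items():
            activation = activations[phraselet]
            if activation > 0.0:
                # decay as each new word is read (default distance: 75)
                activation -= activation / maximum_activation_distance
            if index in scored_indexes:
                # a match raises activation to its score, never lowers it
                activation = max(activation, scored_indexes[index])
            activations[phraselet] = activation
        totals.append(sum(activations.values()))
    return totals

profile = activation_profile(200, {"take-over": {10: 300.0}, "company": {12: 50.0}})
peak_index = max(range(200), key=profile.__getitem__)
print(peak_index)  # 12: both activations overlap there, beating either alone
```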

8.1.3 Supervised document classification

The supervised document classification use case relies on the same phraselets as the topic matching use case, although reverse-only templates are ignored and a different set of stop words is used (SemanticMatchingHelper.supervised_document_classification_phraselet_stop_lemmas). Classifiers are built and trained as follows:

  1. All phraselets are extracted from all training documents and registered with a structural matcher.
  2. Each training document is then matched against the totality of extracted phraselets and the number of times each phraselet is matched within training documents with each classification label is recorded. Whether multiple occurrences within a single document are taken into account depends on the value of oneshot; whether single-word phraselets are generated for all words with their own meaning or only for those words whose part-of-speech tags match the single-word phraselet template specification (essentially: noun phraselets) depends on the value of match_all_words. Wherever two phraselet matches overlap, a combined match is recorded. Combined matches are treated in the same way as other phraselet matches in further processing. This means that the algorithm effectively picks up one-word, two-word and three-word semantic combinations. See here for a discussion of the performance of this step.
  3. The results for each phraselet are examined, and phraselets that do not play a statistically significant role in predicting classifications are removed from the model. Phraselets are removed that did not match within the documents of any classification a minimum number of times (minimum_occurrences; default: 4) or where the coefficient of variation (the standard deviation divided by the arithmetic mean) of the occurrences across the categories is below a threshold (cv_threshold; default: 1.0).
  4. The phraselets that made it into the model are once again matched against each document. Matches against each phraselet are used to determine the input values to a multilayer perceptron: the input nodes can either record occurrence (binary) or match frequency (scalar) (oneshot==True vs. oneshot==False respectively). The outputs are the category labels, including any additional labels determined via a classification ontology. By default, the multilayer perceptron has three hidden layers where the first hidden layer has the same number of neurons as the input layer and the second and third layers have sizes in between the input and the output layer with an equally sized step between each size; the user is however free to specify any other topology.
  5. The resulting model is serializable, i.e. it can be saved and reloaded.
  6. When a new document is classified, the output is zero, one or many suggested classifications; when more than one classification is suggested, the classifications are ordered by decreasing probability.
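The default hidden-layer topology described in step 4 can be sketched as follows. The exact rounding arithmetic is an assumption and the helper name is invented; the sketch only illustrates "first layer equal to the input size, then two layers stepping evenly towards the output size".

```python
# Sketch of the documented default topology (rounding arithmetic is assumed,
# 'default_hidden_layer_sizes' is an invented helper, not Holmes code).

def default_hidden_layer_sizes(input_size, output_size):
    # three hidden layers: the first matches the input layer, the next two
    # shrink towards the output layer with an equally sized step
    step = (input_size - output_size) // 3
    return [input_size, input_size - step, input_size - 2 * step]

print(default_hidden_layer_sizes(300, 12))  # [300, 204, 108]
```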

8.2 Development and testing guidelines

Holmes code is formatted withblack.

The complexity of what Holmes does makes development impossible without a robust set of over 1400 regression tests. These can be executed individually with unittest or all at once by running the pytest utility from the Holmes source code root directory. (Note that the Python 3 command on Linux is pytest-3.)

The pytest variant will only work on machines with sufficient memory resources. To reduce this problem, the tests are distributed across three subdirectories, so that pytest can be run three times, once from each subdirectory:

  • en: tests relating to English
  • de: tests relating to German
  • common: language-independent tests

8.3 Areas for further development

8.3.1 Additional languages

New languages can be added to Holmes by subclassing the SemanticAnalyzer and SemanticMatchingHelper classes as explained here.

8.3.2 Use of machine learning to improve matching

The sets of matching semantic dependencies captured in the _matching_dep_dict dictionary for each language have been obtained on the basis of a mixture of linguistic-theoretical expectations and trial and error. The results would probably be improved if the _matching_dep_dict dictionaries could be derived using machine learning instead; as yet this has not been attempted because of the lack of appropriate training data.

8.3.3 Remove names from supervised document classification models

An attempt should be made to remove personal data from supervised document classification models to make them more compliant with data protection laws.

8.3.4 Improve the performance of supervised document classification training

In cases where embedding-based matching is not active, the second step of the supervised document classification procedure repeats a considerable amount of processing from the first step. Retaining the relevant information from the first step of the procedure would greatly improve training performance. This has not been attempted up to now because a large number of tests would be required to prove that such performance improvements did not have any inadvertent impacts on functionality.

8.3.5 Explore the optimal hyperparameters for topic matching and supervised document classification

The topic matching and supervised document classification use cases are both configured with a number of hyperparameters that are presently set to best-guess values derived on a purely theoretical basis. Results could be further improved by testing the use cases with a variety of hyperparameters to learn the optimal values.

8.4 Version history

8.4.1 Version 2.0.x

The initial open-source version.

8.4.2 Version 2.1.0
  • Upgrade to spaCy 2.1.0 and neuralcoref 4.0.0.
  • Addition of the new dependency pobjp linking parents of prepositions directly with their children.
  • Development of the multiprocessing architecture, which has the MultiprocessingManager object as its facade.
  • Complete overhaul of topic matching.
  • Incorporation of coreference information into Holmes document structures so it no longer needs to be calculated on the fly.
  • New literature examples for both languages and the facility to serve them over RESTful HTTP.
  • Numerous minor improvements and bugfixes.

8.4.3 Version 2.2.0
  • Addition of derivational morphology analysis allowing the matching of related words with the same stem.
  • Addition of new dependency types and dependency matching rules to make full use of the new derivational morphology information.
  • For German, analysis of and matching with subwords (constituent parts of compound words), e.g. Information and Extraktion are the subwords within Informationsextraktion.
  • It is now possible to supply multiple ontology files to the Ontology constructor.
  • Ontology implication rules are now calculated eagerly to improve runtime performance.
  • Ontology-based matching now includes special, language-specific rules to handle hyphens within ontology entries.
  • Word-match information is now included in all matches including single-word matches.
  • Word matches and dictionaries derived from them now include human-readable explanations designed to be used as tooltips.
  • In topic matching, a penalty is now applied to ontology-based matches as well as to embedding-based matches.
  • Topic matching now includes a filter facility to specify that only documents whose labels begin with a certain string should be searched.
  • Error handling and reporting have been improved for the MultiprocessingManager.
  • Numerous minor improvements and bugfixes.
  • The demo website has been updated to reflect the changes.

8.4.4 Version 2.2.1
  • Fixed bug with reverse derived lemmas and subwords (only affects German).
  • Removed dead code.

8.4.5 Version 3.0.0
  • Moved to coreferee as the source of coreference information, meaning that coreference resolution is now active for German as well as English; all documents can be serialized; and the latest spaCy version can be supported.
  • The corpus frequencies of words are now taken into account when scoring topic matches.
  • Reverse dependencies are now taken into account, so that e.g. a man dies can match the dead man although the dependencies in the two phrases point in opposite directions.
  • Merged the pre-existing Manager and MultiprocessingManager classes into a single Manager class, with a redesigned public interface, that uses worker threads for everything except supervised document classification.
  • Added support for initial question words.
  • The demo website has been updated to reflect the changes.

8.4.6 Version 4.0.0
  • The license has been changed from GPL3 to MIT.
  • The word matching code has been refactored and now uses the Strategy pattern, making it easy to add additional word-matching strategies.
  • With the exception of rdflib, all direct dependencies are now from within the Explosion stack, making installation much faster and more trouble-free.
  • Holmes now supports a wide range of Python (3.6—3.10) and spaCy (3.1—3.3) versions.
  • A new demo website has been developed by Edward Schmuhl based on Streamlit.
