This is a living document describing the conceptual data model used by WikibaseLexeme. It is not a specification of any concrete binding, implementation, mapping, or serialization.
The data model of WikibaseLexeme describes the structure of the data that is handled as "Lexemes" in Wikibase, such as words and phrases. While it would be theoretically possible to model these things using Items, a more expressive specialized model helps to reduce complexity and to improve re-use and mappings to other vocabularies.
This data model is conceptual ("Which information do we have to support?") and does not specify how this data should be represented technically ("Which data structures should the software use?") or syntactically ("How should the data be expressed in a file?"). Separate documents describe the serialization of the Wikibase data model in JSON (JavaScript Object Notation) and in RDF (Resource Description Framework).
The Lexeme data model defines basic concepts and relationships needed to describe lexemes, which act as a fixed ontology. This ontology provides a minimal scaffolding that allows Items and Statements to be used for detailed modeling of a lexeme. The specification of the Lexeme data model is based on the Wikibase data model, so the Wikidata glossary and the Wikibase data model primer may be helpful in understanding this document.
The Lexeme data model aims to align with the LEMON model by the Ontolex W3C community group, where useful and practical. However, in the spirit of Wikibase, the Lexeme model is designed to be simple and flexible enough for casual collaborative editing, as opposed to the more formalized approach taken by LEMON.
A Lexeme is a lexical element of a language, such as a word, a phrase, or a prefix (see Lexeme on Wikipedia). Lexemes are Entities in the sense of the Wikibase data model. A Lexeme is described using the following information:
An ID, for example L3746552. These IDs are unique within the repository that manages the Lexeme. The ID can be combined with a repository's concept base URI to form a unique URI for the Lexeme.
Editorial Note: We should provide some hint regarding how grammatical gender can be modeled using Statements.
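As a minimal sketch of how an ID and a concept base URI might be combined: the function name `concept_uri` is hypothetical, and the base URI shown is assumed to be the one Wikidata uses for its entities. Other repositories would use their own base URI.

```python
def concept_uri(base_uri: str, entity_id: str) -> str:
    """Join a repository's concept base URI and an entity ID into one URI."""
    # Normalise so that exactly one slash separates the base and the ID.
    return base_uri.rstrip("/") + "/" + entity_id


# Assumption: concept base URI as used by Wikidata.
WIKIDATA_BASE = "http://www.wikidata.org/entity/"

print(concept_uri(WIKIDATA_BASE, "L3746552"))
# http://www.wikidata.org/entity/L3746552
```

Because the ID is only unique within its repository, the base URI is what makes the resulting URI unique across repositories.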
In Wikidata, the community generally uses the most general lexical category possible (e.g. affix) and then describes which type of affix it is using an instance of statement.
In Wikidata, the community decided to keep usage examples in one place, on the lexeme, so that editors know where to look for them. Each usage example has to demonstrate two properties: form (d:Property:P5830) and sense (d:Property:P6072). A lexeme can have multiple examples, e.g. from different time periods (such as different centuries), for formal and informal registers, and for written and spoken language.
The lemma is a human-readable representation of the lexeme (see Lemma on Wikipedia). Typically, the canonical form of the lexeme (e.g. the infinitive form of verbs) will be used as the lemma (see also lemon:canonicalForm). Lemmas are not simple strings, but MultilingualTextValues, since the same lemma may have multiple spellings. This is especially important for languages that use multiple scripts, such as Serbian and Japanese.
A Lemma cannot be entirely empty; at least one variant has to be provided.
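The idea of a lemma as a set of spelling variants with at least one entry could be sketched as follows. This is an illustration only: the class name `MultilingualText`, the use of language codes as variant keys, and the example spellings are assumptions, not part of the data model specification.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class MultilingualText:
    """Hypothetical stand-in for a MultilingualTextValue: variant code -> text."""

    variants: Dict[str, str] = field(default_factory=dict)

    def __post_init__(self) -> None:
        # Ignore variants whose text is empty or only whitespace.
        self.variants = {
            code: text for code, text in self.variants.items() if text.strip()
        }
        # The value must not be entirely empty.
        if not self.variants:
            raise ValueError("at least one variant has to be provided")


# One lemma, two spellings (illustrative values).
lemma = MultilingualText({"en-gb": "colour", "en-us": "color"})
print(sorted(lemma.variants))  # ['en-gb', 'en-us']
```

The same shape would fit a language with several scripts, with one variant per script.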
Note: Lemmas are not unique, nor is the combination of Lemma, Language, and Lexical category. Two distinct lexemes with the same lemma and lexical category can exist in the same language if they differ in other data, such as gender, etymology, or morphology (different forms).
The morphology of the lexeme is understood as a set of Forms. Each form defines how a lexeme changes based on a specific syntactic role or mode it may take in a sentence (see also lemon:Form).
A Form is described using the following information:
An ID, for example L3746552-F7. These IDs are unique within the repository that manages the Lexeme. The ID can be combined with a repository's concept base URI to form a unique URI for the Form.
A form's Representation is its written form, as used in a text (compare lemon:writtenRep). Just like Lemmas, Representations are not simple strings, but MultilingualTextValues, since the same form may have multiple spellings, possibly in multiple scripts.
A Representation cannot be entirely empty; at least one variant has to be provided.
Multiple forms with the same representation are allowed, so that usage examples can be added to demonstrate each of them (see the example in Wikidata).
A form's grammatical features specify under which conditions or in which syntactic role that form is used (see lexinfo:morphosyntacticProperty and grammatical category on Wikipedia). Multiple grammatical features can be combined to express under which conditions the language's grammar requires a given form to be used. Grammatical features are represented as references to Items.
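A form with its representations and grammatical features could be sketched like this. The class `Form`, the helper `forms_with_features`, and the form IDs are hypothetical; the two Item IDs are assumed to be the Wikidata items for "singular" and "plural" and serve only as illustration.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Iterable, List


@dataclass
class Form:
    form_id: str
    representations: Dict[str, str]        # variant code -> written form
    grammatical_features: FrozenSet[str]   # references to Items, by ID


SINGULAR = "Q110786"  # assumed Item ID for "singular"
PLURAL = "Q146786"    # assumed Item ID for "plural"

# Made-up form IDs for an English noun.
forms = [
    Form("L1-F1", {"en": "apple"}, frozenset({SINGULAR})),
    Form("L1-F2", {"en": "apples"}, frozenset({PLURAL})),
]


def forms_with_features(candidates: Iterable[Form], required: Iterable[str]) -> List[Form]:
    """Return the forms whose grammatical features include every required Item."""
    needed = frozenset(required)
    return [form for form in candidates if needed <= form.grammatical_features]


print([form.form_id for form in forms_with_features(forms, {PLURAL})])  # ['L1-F2']
```

Keeping the features in a set reflects that they are combined rather than ordered: a form for a more heavily inflected language would simply carry more Item references.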
Editorial Note: How do we model "a" vs "an"? What item would we use as a feature to describe this? Do we need free text usage notes after all?
Editorial Note: We should note that gender-specific forms like "baroness" can be treated as Forms, or as separate Lexemes, as need be.
The senses of a lexeme are different meanings which it may represent in a text. The senses are given as natural language definitions or glosses (compare intensional definitions on Wikipedia).
A sense is described using the following information:
An ID, for example L3746552-S4. These IDs are unique within the repository that manages the Lexeme. The ID can be combined with a repository's concept base URI to form a unique URI for the Sense.
In Wikidata, an image is also added to provide a culturally adapted picture of the sense, e.g. of a letterbox or a color, which can vary greatly between cultures.
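The example IDs in this document (L3746552, L3746552-F7, L3746552-S4) suggest that Form and Sense IDs extend the ID of their Lexeme. A minimal sketch of taking such IDs apart follows; the pattern is inferred from those examples only, and the names `parse_id` and `ParsedId` as well as the rule against leading zeros are assumptions.

```python
import re
from typing import NamedTuple, Optional

# Assumed shape: "L<number>", optionally followed by "-F<number>" or "-S<number>".
_ID_PATTERN = re.compile(r"L(?P<lexeme>[1-9]\d*)(?:-(?P<kind>[FS])(?P<sub>[1-9]\d*))?")


class ParsedId(NamedTuple):
    lexeme_id: str
    kind: str                  # "lexeme", "form" or "sense"
    sub_number: Optional[int]  # the number after F or S, if any


def parse_id(entity_id: str) -> ParsedId:
    """Split a Lexeme, Form or Sense ID into its parts."""
    match = _ID_PATTERN.fullmatch(entity_id)
    if match is None:
        raise ValueError(f"not a Lexeme, Form or Sense ID: {entity_id!r}")
    kind = {None: "lexeme", "F": "form", "S": "sense"}[match.group("kind")]
    sub = match.group("sub")
    return ParsedId("L" + match.group("lexeme"), kind, int(sub) if sub else None)


print(parse_id("L3746552-S4"))
# ParsedId(lexeme_id='L3746552', kind='sense', sub_number=4)
```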
Editorial Note: We should find a good place to address a common source of misunderstandings: Senses can be connected to Wikidata Items via an appropriate Statement they evoke or denote (compare lemon:denotes and lemon:evokes). However, such a connection should not be interpreted as the lexeme actually representing the concept defined by the item (compare lemon:LexicalSense and lemon:LexicalConcept). In particular, if two lexemes have senses that refer to the same concept in this way, this does not imply that the two lexemes are synonyms.
Example: The lexemes for the English adjectives "hot" and "cold" could both have a sense that refers to Q11466 (temperature), even though they are antonyms.
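The "hot"/"cold" example can be made concrete with a small sketch. The sense IDs below are made up, and the helper `senses_by_item` is hypothetical; the point is only that grouping senses by the Item they refer to yields related lexemes, not synonyms.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# (sense ID, lemma, referenced Item); the sense IDs are invented for illustration.
senses: List[Tuple[str, str, str]] = [
    ("L100-S1", "hot", "Q11466"),
    ("L200-S1", "cold", "Q11466"),
]


def senses_by_item(entries: List[Tuple[str, str, str]]) -> Dict[str, List[str]]:
    """Group lemmas by the Item that one of their senses refers to."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for _sense_id, lemma, item_id in entries:
        grouped[item_id].append(lemma)
    return dict(grouped)


print(senses_by_item(senses))  # {'Q11466': ['hot', 'cold']}
```

A consumer that treated each group as a synonym set would wrongly equate "hot" and "cold"; the shared Item only says that both senses are about temperature.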
Editorial Note: We should describe how word function can be described for things like "to" or "a", using Statements on the Lexeme. We should also explain that function words should not have senses. Do we need free text usage notes?
A sense's gloss gives a natural language definition of the sense (see Gloss on Wikipedia and skos:definition). Glosses cannot be referenced.
Similar to Lemmas, Glosses are not simple strings, but MultilingualTextValues. However, the reason is not to support spelling variants, but to allow the gloss to be given in entirely different languages. E.g. it would be quite useful for a German speaker learning French to have a German gloss for a French sense.
A Gloss cannot be entirely empty; at least one language has to be provided. A good gloss leaves little or no room for ambiguity about the meaning. Lexemes with multiple senses should have glosses that are easily distinguishable from each other.
Short glosses of only one or a few words should be avoided, as they leave too much room for interpretation of the meaning.
In Wikidata, Glosses are often very similar to carefully crafted descriptions on Q-items. E.g. for apple, the Q-item's English description "fruit of the apple tree" is copied as the gloss when using tools like MachtSinn to match lexemes and Q-items and to create missing senses.
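Since glosses are keyed by language rather than by spelling variant, a reader's tool might pick the gloss to display from a list of preferred languages. The following is a sketch under that assumption; `pick_gloss` is a hypothetical helper, and the French and German gloss texts are illustrative translations, not data taken from Wikidata.

```python
from typing import Dict, Optional, Sequence


def pick_gloss(glosses: Dict[str, str], preferred: Sequence[str]) -> Optional[str]:
    """Return the first non-empty gloss available in the preferred languages."""
    for language in preferred:
        text = glosses.get(language)
        if text:
            return text
    return None


# Glosses for one sense of a French lexeme (illustrative wording).
glosses = {
    "fr": "fruit du pommier",
    "de": "Frucht des Apfelbaums",
}

# A German speaker learning French asks for German first, then English.
print(pick_gloss(glosses, ["de", "en"]))  # Frucht des Apfelbaums
```

Returning `None` when no preferred language is covered leaves it to the caller to decide whether to fall back to another language or to show nothing.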