Incomputer science, in the area offormal language theory, frequent use is made of a variety ofstring functions; however, the notation used is different from that used forcomputer programming, and some commonly used functions in the theoretical realm are rarely used when programming. This article defines some of these basic terms.
A string is a finite sequence of characters.Theempty string is denoted by.The concatenation of two string and is denoted by, or shorter by.Concatenating with the empty string makes no difference:.Concatenation of strings isassociative:.
For example,.
Alanguage is a finite or infinite set of strings.Besides the usual set operations like union, intersection etc., concatenation can be applied to languages:if both and are languages, their concatenation is defined as the set of concatenations of any string from and any string from, formally.Again, the concatenation dot is often omitted for brevity.
The language consisting of just the empty string is to be distinguished from the empty language.Concatenating any language with the former doesn't make any change:,while concatenating with the latter always yields the empty language:.Concatenation of languages is associative:.
For example, abbreviating, the set of all three-digit decimal numbers is obtained as. The set of all decimal numbers of arbitrary length is an example for an infinite language.
Thealphabet of a string is the set of all of the characters that occur in a particular string. Ifs is a string, itsalphabet is denoted by
Thealphabet of a language is the set of all characters that occur in any string of, formally:.
For example, the set is the alphabet of the string,and theabove is the alphabet of theabove language as well as of the language of all decimal numbers.
LetL be alanguage, and let Σ be its alphabet. Astring substitution or simply asubstitution is a mappingf that maps characters in Σ to languages (possibly in a different alphabet). Thus, for example, given a charactera ∈ Σ, one hasf(a)=La whereLa ⊆ Δ* is some language whose alphabet is Δ. This mapping may be extended to strings as
for theempty string ε, and
for strings ∈L and charactera ∈ Σ. String substitutions may be extended to entire languages as[1]
Regular languages are closed under string substitution. That is, if each character in the alphabet of a regular language is substituted by another regular language, the result is still a regular language.[2]Similarly,context-free languages are closed under string substitution.[3][note 1]
A simple example is the conversionfuc(.) to uppercase, which may be defined e.g. as follows:
character | mapped to language | remark |
---|---|---|
x | fuc(x) | |
‹a› | { ‹A› } | map lowercase char to corresponding uppercase char |
‹A› | { ‹A› } | map uppercase char to itself |
‹ß› | { ‹SS› } | no uppercase char available, map to two-char string |
‹0› | { ε } | map digit to empty string |
‹!› | { } | forbid punctuation, map to empty language |
... | similar for other chars |
For the extension offuc to strings, we have e.g.
For the extension offuc to languages, we have e.g.
Astring homomorphism (often referred to simply as ahomomorphism informal language theory) is a string substitution such that each character is replaced by a single string. That is,, where is a string, for each character.[note 2][4]
String homomorphisms aremonoid morphisms on thefree monoid, preserving the empty string and thebinary operation ofstring concatenation. Given a language, the set is called thehomomorphic image of. Theinverse homomorphic image of a string is defined as
while the inverse homomorphic image of a language is defined as
In general,, while one does have
and
for any language.
The class of regular languages is closed under homomorphisms and inverse homomorphisms.[5] Similarly, the context-free languages are closed under homomorphisms[note 3] and inverse homomorphisms.[6]
A string homomorphism is said to be ε-free (or e-free) if for alla in the alphabet. Simple single-lettersubstitution ciphers are examples of (ε-free) string homomorphisms.
An example string homomorphismguc can also be obtained by defining similar to theabove substitution:guc(‹a›) = ‹A›, ...,guc(‹0›) = ε, but lettingguc be undefined on punctuation chars. Examples for inverse homomorphic images are
For the latter language,guc(guc−1({ ‹A›, ‹bb› })) =guc({ ‹a› }) = { ‹A› } ≠ { ‹A›, ‹bb› }.The homomorphismguc is not ε-free, since it maps e.g. ‹0› to ε.
A very simple string homomorphism example that maps each character to just a character is the conversion of anEBCDIC-encoded string toASCII.
Ifs is a string, and is an alphabet, thestring projection ofs is the string that results by removing all characters that are not in. It is written as. It is formally defined by removal of characters from the right hand side:
Here denotes theempty string. The projection of a string is essentially the same as aprojection in relational algebra.
String projection may be promoted to theprojection of a language. Given aformal languageL, its projection is given by
Theright quotient of a charactera from a strings is the truncation of the charactera in the strings, from the right hand side. It is denoted as. If the string does not havea on the right hand side, the result is the empty string. Thus:
The quotient of the empty string may be taken:
Similarly, given a subset of a monoid, one may define the quotient subset as
Left quotients may be defined similarly, with operations taking place on the left of a string.[citation needed]
Hopcroft and Ullman (1979) define the quotientL1/L2 of the languagesL1 andL2 over the same alphabet asL1/L2 = {s | ∃t∈L2.st∈L1 }.[7]This is not a generalization of the above definition, since, for a strings and distinct charactersa,b, Hopcroft's and Ullman's definition implies yielding {}, rather than { ε }.
The left quotient (when defined similar to Hopcroft and Ullman 1979) of a singleton languageL1 and an arbitrary languageL2 is known asBrzozowski derivative; ifL2 is represented by aregular expression, so can be the left quotient.[8]
The right quotient of a subset of a monoid defines anequivalence relation, called therightsyntactic relation ofS. It is given by
The relation is clearly of finite index (has a finite number of equivalence classes) if and only if the family right quotients is finite; that is, if
is finite. In the case thatM is the monoid of words over some alphabet,S is then aregular language, that is, a language that can be recognized by afinite-state automaton. This is discussed in greater detail in the article onsyntactic monoids.[citation needed]
Theright cancellation of a charactera from a strings is the removal of the first occurrence of the charactera in the strings, starting from the right hand side. It is denoted as and is recursively defined as
The empty string is always cancellable:
Clearly, right cancellation and projectioncommute:
Theprefixes of a string is the set of allprefixes to a string, with respect to a given language:
where.
Theprefix closure of a language is
Example:
A language is calledprefix closed if.
The prefix closure operator isidempotent:
Theprefix relation is abinary relation such that if and only if. This relation is a particular example of aprefix order.[citation needed]