Example of RecipeML, a simple markup language based on XML for creating recipes. The markup can be converted programmatically into a display format such as HTML, PDF, or Rich Text Format.
A markup language is a text-encoding system which specifies the structure and formatting of a document and potentially the relationships among its parts.[1] Markup can control the display of a document or enrich its content to facilitate automated processing.
A markup language is a set of rules governing what markup information may be included in a document and how it is combined with the content of the document in a way that facilitates use by humans and computer programs. The idea and terminology evolved from the marking up of paper manuscripts (e.g., with revision instructions by editors), traditionally written with a red pen or blue pencil on authors' manuscripts.[2]
Older markup languages, which typically focus on typesetting and presentation, include troff, TeX, and LaTeX. Scribe and most modern markup languages, such as XML, identify document components (for example headings, paragraphs, and tables), with the expectation that technology, such as stylesheets, will be used to apply formatting or other processing.
Some markup languages, such as the widely used HTML, have pre-defined presentation semantics, meaning that their specifications prescribe some aspects of how to present the structured data on particular media. HTML, like DocBook, Open eBook, JATS, and many others, is based on the markup metalanguages XML and SGML. That is, SGML and XML allow designers to specify particular schemas, which determine which elements, attributes, and other features are permitted, and where.[3]
A key characteristic of most markup languages is that they allow combining markup with content such as text and pictures. For example, if a few words in a sentence need to be emphasized, or identified as a proper name, defined term, or another special item, the markup may be inserted between the characters of the sentence.
The word markup is derived from the traditional publishing practice of marking up a manuscript, which involves adding handwritten annotations, in the form of conventional symbolic printer's instructions, in the margins and text of a paper or printed manuscript.
For centuries, this task was done primarily by skilled typographers known as markup men[4] or markers[5] who marked up text to indicate what typeface, style, and size should be applied to each part, and then passed the manuscript to others for typesetting by hand or machine.
The markup was also commonly applied by editors, proofreaders, publishers, and graphic designers, and by authors themselves, all of whom might also mark things such as corrections and changes.
There are three general categories of electronic markup, articulated by James Coombs, Allen Renear, and Steven DeRose in 1987,[6] and Tim Bray in 2003.[7]
Presentational markup is used by traditional word-processing systems. Binary codes embedded within document text produce the WYSIWYG ('what you see is what you get') effect. Such markup is usually hidden from human users, even authors and editors. Such systems use procedural and descriptive markup internally but convert them to present the user with formatted arrangements of type.
Procedural markup is embedded in the text and provides instructions for programs that process it. Well-known examples include troff, TeX, and Markdown. Generally, software processes the text sequentially from beginning to end, following the instructions as encountered. Such text is often edited with the markup visible and directly manipulated by the author. Popular procedural markup systems usually include programming constructs, especially macros, allowing complex sets of instructions to be invoked by a simple name (and perhaps a few parameters). This is much faster, less error-prone, and more maintenance-friendly than re-stating the same or similar instructions in many places.
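The role of macros in procedural markup can be illustrated with a toy processor. The sketch below is written in Python and does not implement troff or TeX: the dot-command syntax (`.de`, `..`, `.center`, `.upper`, `.reset`) and the function name `run` are invented for this illustration and only loosely resemble real systems. It reads the input sequentially, follows instructions as they are encountered, and lets a macro name stand for a stored sequence of instructions.

```python
def run(source: str, width: int = 20) -> list[str]:
    """Process a toy procedural markup line by line, top to bottom."""
    macros: dict[str, list[str]] = {}   # macro name -> stored instruction lines
    output: list[str] = []
    state = {"center": False, "upper": False}
    pending = source.splitlines()
    defining = None                      # name of the macro being defined, if any

    while pending:
        line = pending.pop(0)
        if defining is not None:
            if line == "..":             # end of a macro definition
                defining = None
            else:
                macros[defining].append(line)
        elif line.startswith(".de "):    # start of a macro definition
            defining = line[4:].strip()
            macros[defining] = []
        elif line == ".center":
            state["center"] = True
        elif line == ".upper":
            state["upper"] = True
        elif line == ".reset":
            state = {"center": False, "upper": False}
        elif line.startswith(".") and line[1:] in macros:
            # Invoking a macro splices its stored lines into the input stream.
            pending[:0] = macros[line[1:]]
        else:
            # Anything else is document text, formatted by the current state.
            text = line.upper() if state["upper"] else line
            output.append(text.center(width) if state["center"] else text)
    return output


document = """\
.de title
.center
.upper
..
.title
markup
.reset
Plain body text."""

for out_line in run(document):
    print(out_line)
```

Here the single instruction `.title` replaces the two instructions stored under that name, so changing how titles look requires editing only the macro definition.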
Descriptive markup is specifically used to describe parts of the document for what they are, rather than how they should be processed. Well-known systems that provide many such labels include LaTeX, HTML, and XML. The objective is to decouple the structure of the document from any particular treatment or rendition of it. Such markup is often described as semantic. An example of a descriptive markup is HTML's <cite> tag, which is used to label a citation. Descriptive markup (sometimes called logical markup or conceptual markup) encourages authors to write in a way that describes the material conceptually, rather than visually.[8]
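The decoupling of structure from rendition can be sketched in a few lines of Python. In the example below, the element names `doc`, `title`, and `para`, the two style tables, and the function `render` are assumptions made for illustration; only `cite` is taken from the text above. The same descriptively marked-up source is rendered twice, once as plain text and once as HTML, without changing the source. The sketch does not escape special characters in the output.

```python
import xml.etree.ElementTree as ET

# A document that labels its parts for what they are, not how they look.
SOURCE = (
    "<doc><title>Markup</title>"
    "<para>See <cite>Coombs et al.</cite> for details.</para></doc>"
)

# Two hypothetical renditions: each maps a label to (prefix, suffix).
PLAIN = {"title": ("== ", " ==\n"), "cite": ("_", "_")}
HTML = {
    "title": ("<h1>", "</h1>\n"),
    "para": ("<p>", "</p>"),
    "cite": ("<cite>", "</cite>"),
}


def render(elem, styles):
    """Render an element and its children using the given style table."""
    before, after = styles.get(elem.tag, ("", ""))
    parts = [before, elem.text or ""]
    for child in elem:
        parts.append(render(child, styles))
        parts.append(child.tail or "")   # text that follows the child element
    parts.append(after)
    return "".join(parts)


root = ET.fromstring(SOURCE)
print(render(root, PLAIN))
print(render(root, HTML))
```

The document itself never states that a title is centered, bold, or wrapped in `<h1>`; that decision lives entirely in the style table applied at rendering time.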
There is considerable overlap and concurrent use of markup types. In modern word-processing systems, presentational markup is often saved in descriptive-markup-oriented systems such as XML, and then processed procedurally by implementations. The programming in procedural-markup systems, such as TeX, may be used to create higher-level markup systems that are more descriptive in nature, such as LaTeX.
In recent years, several markup languages have been developed with ease of use as a key goal, and without input from standards organizations, aimed at allowing authors to create formatted text via web browsers, for example in wikis and web forums. These are sometimes called lightweight markup languages. Markdown, BBCode, and the markup language used by Wikipedia are examples of such languages.
The first well-known public presentation of markup languages in computer text processing was made by William W. Tunnicliffe at a conference in 1967, although he preferred to call it generic coding. It can be seen as a response to the emergence of processing programs such as RUNOFF, each of which used its own control notation, often specific to the target typesetting device. In the 1970s, Tunnicliffe led the development of a standard called GenCode for the publishing industry. Book designer Stanley Rice published speculation along similar lines in 1970.[9]
Brian Reid, in his 1980 dissertation at Carnegie Mellon University, developed a theory and working implementation of descriptive markup in actual use. However, IBM researcher Charles Goldfarb is more commonly considered the inventor of markup languages. Goldfarb developed the basic idea while working on a primitive document management system intended for law firms in 1969, and helped invent IBM's Generalized Markup Language (GML) later that same year. GML was first publicly disclosed in 1973.
Standard Generalized Markup Language (SGML), the first standard descriptive markup language, was based on both GML and GenCode. It was the result of an International Organization for Standardization (ISO) committee that was first chaired by Tunnicliffe, and which Goldfarb also worked on beginning in 1974.[10] Goldfarb eventually became chair of the committee. SGML was first released by ISO as the ISO 8879 standard in October 1986.
Some early examples of computer markup languages available outside the publishing industry can be found in typesetting tools on Unix systems such as troff and nroff. In these systems, formatting commands were inserted into the document text so that typesetting software could format the text according to the editor's specifications. Printing a document correctly was an iterative process of trial and error.[11] The availability of WYSIWYG publishing software supplanted much use of these languages among casual users, though professional publishing work still uses markup to specify the non-visual structure of texts, and WYSIWYG editors now usually save documents in a markup-language-based format.
Another major publishing standard is TeX, created and refined by Donald Knuth in the 1970s and 1980s. TeX concentrated on the detailed layout of text and font descriptions to typeset mathematical books. This required Knuth to spend considerable time investigating the art of typesetting. TeX is mainly used in academia, where it is a de facto standard in many scientific disciplines. A TeX macro package known as LaTeX provides a descriptive markup system on top of TeX, and is widely used both among the scientific community and the publishing industry.
The first language to make a clear distinction between structure and presentation was Scribe, developed by Brian Reid and described in his doctoral thesis in 1980.[12] Scribe was revolutionary in a number of ways, introducing the idea of styles separated from the marked-up document, and a grammar that controlled the usage of descriptive elements. Scribe influenced the development of GML and later SGML,[13] and is a direct ancestor to HTML and LaTeX.[a]
In the early 1980s, the idea that markup should focus on the structural aspects of a document and leave the visual presentation of that structure to the interpreter led to the creation of SGML. The language was developed by a committee chaired by Goldfarb. It incorporated ideas from many different sources, including Tunnicliffe's project, GenCode. Sharon Adler, Anders Berglund, and James A. Marke were also key members of the SGML committee.
SGML specifies a syntax for including the markup in documents, as well as one for separately describing what tags are allowed, and where (the document type definition (DTD), later known as a schema). This allows authors to create and use any markup they want, selecting tags that make the most sense to them and are named in their own natural languages, while also allowing automated verification. Thus, SGML is properly a metalanguage, and many markup languages are derived from it. From the late 1980s onward, most substantial new markup languages have been based on SGML, including the Text Encoding Initiative (TEI) guidelines and DocBook. SGML was promulgated as the ISO 8879 standard in 1986.[14]
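The idea of separately describing which tags are allowed, and where, can be sketched as follows. This is a greatly simplified stand-in for a DTD, not SGML or DTD syntax: the rule table `ALLOWED`, the recipe-style element names, and the function `violations` are all hypothetical, and the check covers only which child elements may appear inside which parent (not their order, count, or attributes).

```python
import xml.etree.ElementTree as ET

# Hypothetical rules: element name -> child elements permitted inside it.
ALLOWED = {
    "recipe": {"title", "ingredient", "step"},
    "title": set(),
    "ingredient": set(),
    "step": {"emph"},
    "emph": set(),
}


def violations(elem, allowed=ALLOWED):
    """Return messages for every element that breaks the rule table."""
    if elem.tag not in allowed:
        return [f"unknown element <{elem.tag}>"]
    problems = []
    for child in elem:
        # Known element used in a place where the rules do not permit it.
        if child.tag in allowed and child.tag not in allowed[elem.tag]:
            problems.append(f"<{child.tag}> not permitted inside <{elem.tag}>")
        problems.extend(violations(child, allowed))
    return problems


good = "<recipe><title>Tea</title><step>Boil <emph>fresh</emph> water.</step></recipe>"
bad = "<recipe><title>Tea <step>Boil</step></title><note>hot</note></recipe>"

print(violations(ET.fromstring(good)))   # []
print(violations(ET.fromstring(bad)))
```

Because the rules are data rather than code, a designer can define a different vocabulary by supplying a different table, which is the sense in which such a system acts as a metalanguage.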
SGML found wide acceptance and use in fields with very large-scale documentation requirements. However, many found it cumbersome and difficult to learn, a side effect of a design that attempted to do too much and was too flexible. For example, SGML made end tags (or start tags, or both) optional in certain contexts, because its developers thought markup would be done manually by overworked support staff who would appreciate saving keystrokes.
In 1989, computer scientist Tim Berners-Lee wrote a memo proposing an Internet-based hypertext system,[15] then specified HTML and wrote the browser and server software in late 1990. The first publicly available description of HTML was a document called "HTML Tags", first mentioned on the Internet by Berners-Lee in late 1991.[16][17] It describes 18 elements comprising the initial, relatively simple design of HTML. Except for the hyperlink tag, these were strongly influenced by SGMLguid, an in-house SGML-based documentation format at CERN, and very similar to the sample schema in the SGML standard. Eleven of these elements still exist in HTML 4.[18]
Berners-Lee considered HTML an SGML application. The Internet Engineering Task Force (IETF) formally defined it as such with the mid-1993 publication of the first proposal for an HTML specification: "Hypertext Markup Language (HTML)" by Berners-Lee and Dan Connolly,[19] which included an SGML DTD to define the grammar.[20] Many of the HTML text elements are found in the 1988 ISO technical report TR 9537 Techniques for using SGML, which in turn covers the features of early text formatting languages, such as that used by the RUNOFF command developed in the early 1960s for the Compatible Time-Sharing System operating system. These formatting commands were derived from those used by typesetters to manually format documents. Steven DeRose argues that HTML's use of descriptive markup (and the influence of SGML in particular) was a major factor in the success of the Web, because of the flexibility and extensibility that it enabled.[21] HTML became the main markup language for creating web pages and other information that can be displayed in a web browser and is likely the most used markup language in the world in the 21st century.
XML (Extensible Markup Language) is a widely used meta markup language. It was developed by the World Wide Web Consortium (W3C) in a committee created and chaired by Jon Bosak. The main purpose of XML was to simplify SGML by focusing on a particular use case: documents on the Internet.[22] XML remains a metalanguage like SGML, allowing users to create any tags needed (hence extensible) and then describing those tags and their permitted uses.
XML adoption was hastened by the fact that every XML document can be written so that it is also an SGML document, allowing existing SGML users and software to switch to XML fairly easily. At the same time, XML eliminates many complex features of SGML to simplify implementation environments such as documents and publications. It appears to balance simplicity and flexibility, supports very robust schema definitions and validation tools, and was rapidly adopted for many uses. XML is now widely used for communicating data between applications, serializing program data, hardware communication protocols, vector graphics, and other uses besides documents.
From January 2000 until HTML 5 was released, all W3C recommendations for HTML were based on XML, using XHTML (Extensible HyperText Markup Language). The language specification requires that XHTML documents be well-formed XML documents. This allows for more rigorous and robust documents, by avoiding many syntax errors which historically led to unwanted browser behavior, while still using document components familiar to HTML users.
One of the most noticeable differences between HTML and XHTML is the latter's rule that all tags must be closed: empty HTML tags such as <br> must either be closed with a regular end-tag, or replaced by a special form: <br /> (the space before the slash is optional but frequently used, because it enables some pre-XML web browsers and SGML parsers to accept the tag). Another difference is that all attribute values in tags must be quoted. Both these differences are commonly criticized as verbose but also praised because they make it far easier to detect, localize, and repair errors. Finally, all tag and attribute names within the XHTML namespace must be lowercase to be valid. HTML, on the other hand, was case-insensitive.
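The closing and quoting rules can be observed with any XML parser. The following sketch uses the `xml.etree.ElementTree` module from Python's standard library; the helper name `is_well_formed` is an assumption made for this example. It tests only XML well-formedness, not validity against the XHTML schema, so it would not by itself reject uppercase element names that are used consistently.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text: str) -> bool:
    """Return True if `text` parses as a well-formed XML document."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True


print(is_well_formed("<p>one<br />two</p>"))     # True: empty element is closed
print(is_well_formed("<p>one<br></br>two</p>"))  # True: regular end tag
print(is_well_formed("<p>one<br>two</p>"))       # False: <br> is never closed
print(is_well_formed("<p class=note>x</p>"))     # False: attribute value unquoted
print(is_well_formed("<P>x</p>"))                # False: names are case-sensitive
```

An XML parser stops at the first such error instead of guessing what the author intended, which is why these errors are easier to detect and localize than in traditional HTML processing.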
A common feature of many markup languages is that they intermix the text of a document with markup instructions in the same data stream or file. This is not necessary; it is possible to isolate markup from text content, using pointers, offsets, IDs, or other methods to coordinate the two. Such standoff markup is typical for the internal representations that programs use to work with marked-up documents. However, embedded or inline markup is much more common elsewhere. For example, the following is a small section of text marked up in HTML:
<!DOCTYPE html>
<html>
  <head>
    <meta charset="utf-8">
    <title>My test page</title>
  </head>
  <body>
    <h1>Mozilla is cool</h1>
    <img src="images/firefox-icon.png" alt="The Firefox logo: a flaming fox surrounding the Earth.">
    <p>At Mozilla, we’re a global community of</p>
    <ul>
      <!-- changed to list in the tutorial -->
      <li>technologists</li>
      <li>thinkers</li>
      <li>builders</li>
    </ul>
    <p>working together to keep the Internet alive and accessible, so people worldwide can be informed contributors and creators of the Web. We believe this act of human collaboration across an open platform is essential to individual growth and our collective future.</p>
    <p>Read the <a href="https://www.mozilla.org/en-US/about/manifesto/">Mozilla Manifesto</a> to learn even more about the values and principles that guide the pursuit of our mission.</p>
  </body>
</html>
The codes enclosed in angle-brackets <like this> are markup instructions (known as tags), while the text between these instructions is the actual text of the document. The codes h1 and p, together with em (which does not appear in the example above), are examples of semantic markup, in that they describe the intended purpose or the meaning of the text they include. Specifically, h1 means the enclosed text is a first-level heading, p means a paragraph, and em means an emphasized word or phrase. A program interpreting such structural markup may apply its own rules or styles for presenting the various pieces of text, using different typefaces, boldness, font size, indentation, color, or other styles, as desired. For example, a tag such as h1 might be presented in a large bold sans-serif typeface in an article, or it might be underscored in a monospaced (fixed-width font) document, or it might not change the presentation at all.
In contrast, the i tag in HTML 4 is an example of presentational markup, which is generally used to specify a characteristic of the text without specifying the reason for that appearance. In this case, the i element dictates the use of an italic typeface. However, in HTML 5, this element has been repurposed with a more semantic usage: to denote "a span of text in an alternate voice or mood, or otherwise offset from the normal prose in a manner indicating a different quality of text".[23] For example, it is appropriate to use the i element to indicate a taxonomic designation or a phrase in another language.[23] The change was made to make the transition from HTML 4 to 5 as smooth as possible, so that deprecated uses of presentational elements would preserve the most likely intended meaning.
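The relationship between inline markup and the standoff form described earlier can be sketched by converting one into the other. The example below uses `html.parser.HTMLParser` from Python's standard library; the class name `StandoffExtractor` and the `(tag, start, end)` span format are assumptions made for illustration, not an established standoff format. The sketch assumes every start tag has a matching end tag, so elements that are never closed (such as `<br>` or `<img>`) would produce no span.

```python
from html.parser import HTMLParser


class StandoffExtractor(HTMLParser):
    """Split inline-marked-up text into plain text plus (tag, start, end) spans."""

    def __init__(self):
        super().__init__()
        self.text = ""     # document text with all tags removed
        self.spans = []    # completed annotations: (tag, start, end)
        self._open = []    # stack of (tag, start offset) awaiting an end tag

    def handle_starttag(self, tag, attrs):
        # Record where in the plain text this element begins.
        self._open.append((tag, len(self.text)))

    def handle_endtag(self, tag):
        # Close the most recently opened element with this name.
        for i in range(len(self._open) - 1, -1, -1):
            if self._open[i][0] == tag:
                _, start = self._open.pop(i)
                self.spans.append((tag, start, len(self.text)))
                break

    def handle_data(self, data):
        self.text += data


parser = StandoffExtractor()
parser.feed("<p>Markup can be <em>descriptive</em> or <i>presentational</i>.</p>")
parser.close()

print(parser.text)
for tag, start, end in parser.spans:
    print(tag, start, end, repr(parser.text[start:end]))
```

After parsing, the text contains no tags at all, and each annotation refers to it by character offsets; a program can then restyle or reinterpret the `em` and `i` spans without touching the text itself.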
TEI has published extensive guidelines[24] for how to encode texts of interest in the humanities and social sciences, developed through years of international cooperative work. These guidelines are used for encoding historical documents, and the works of particular scholars, periods, and genres.
Allan Woods, Modern Newspaper Production (New York: Harper & Row, 1963), 85; Stewart Harral, Profitable Public Relations for Newspapers (Ann Arbor: J. W. Edwards, 1957), 76; and Chiarella v. United States, 445 U.S. 222 (1980).
From the Notebooks of H. J. H & D. H. An on Composition, Kingsport Press Inc., undated (1960s).
Rice, Stanley. "Editorial Text Structures (with some relations to information structures and format controls in computerized composition)". American National Standards Institute, March 17, 1970.
Reid, Brian. "Scribe: A Document Specification Language and its Compiler". Ph.D. thesis, Carnegie-Mellon University, Pittsburgh PA. Also available as Technical Report CMU-CS-81-100.