While this may serve as an interesting design document for the now-independent docutils, it is no longer slated for inclusion in the standard library.
This PEP documents design issues and implementation details for Docutils, a Python Docstring Processing System (DPS). The rationale and high-level concepts of a DPS are documented in PEP 256, “Docstring Processing System Framework”. Also see PEP 256 for a “Road Map to the Docstring PEPs”.
Docutils is being designed modularly so that any of its components can be replaced easily. In addition, Docutils is not limited to the processing of Python docstrings; it processes standalone documents as well, in several contexts.
No changes to the core Python language are required by this PEP. Its deliverables consist of a package for the standard library and its documentation.
Project components and data flow:
              +---------------------------+
              |         Docutils:         |
              | docutils.core.Publisher,  |
              | docutils.core.publish_*() |
              +---------------------------+
               /             |             \
              /              |              \
      1,3,5  /            6  |               \  7
    +--------+        +-------------+        +--------+
    | READER | ---->  | TRANSFORMER |  ====> | WRITER |
    +--------+        +-------------+        +--------+
     /      \\                                    |
    /        \\                                   |
  2 /       4 \\                                8 |
 +-------+   +--------+                      +--------+
 | INPUT |   | PARSER |                      | OUTPUT |
 +-------+   +--------+                      +--------+
The numbers above each component indicate the path a document’s data takes. Double-width lines between Reader & Parser and between Transformer & Writer indicate that data sent along these paths should be standard (pure & unextended) Docutils doc trees. Single-width lines signify that internal tree extensions or completely unrelated representations are possible, but they must be supported at both ends.
The docutils.core module contains a “Publisher” facade class and several convenience functions: “publish_cmdline()” (for command-line front ends), “publish_file()” (for programmatic use with file-like I/O), and “publish_string()” (for programmatic use with string I/O). The Publisher class encapsulates the high-level logic of a Docutils system. The Publisher class has overall responsibility for processing, controlled by the Publisher.publish() method:

1. Set up internal settings (may include config files and command-line options) and I/O objects.

2. Call the Reader object to read data from the source Input object and parse the data with the Parser object. A document object is returned.

3. Set up and apply transforms via the Transformer object attached to the document.

4. Call the Writer object, which translates the document to the final output format and writes the formatted data to the destination Output object. Depending on the Output object, the output may be returned from the Writer, and then from the publish() method.

Calling the “publish” function (or instantiating a “Publisher” object) with component names will result in default behavior. For custom behavior (customizing component settings), create custom component objects first, and pass them to the Publisher or publish_* convenience functions.
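As an illustration (not part of the specification), a minimal programmatic use of the convenience functions might look like the following sketch; the exact keyword arguments, such as writer_name, may vary between Docutils releases:

    from docutils.core import publish_string

    # publish_string() drives the Publisher: it selects a Reader, Parser,
    # and Writer by name, applies the transforms, and returns the output.
    html = publish_string(source="A small *reStructuredText* example.",
                          writer_name="html")
    print(html.decode("utf-8"))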
Readers understand the input context (where the data is coming from), send the whole input or discrete “chunks” to the parser, and provide the context to bind the chunks together back into a cohesive whole.
Each reader is a module or package exporting a “Reader” class with a “read” method. The base “Reader” class can be found in the docutils/readers/__init__.py module.
Most Readers will have to be told what parser to use. So far (see the list of examples below), only the Python Source Reader (“PySource”; still incomplete) will be able to determine the parser on its own.
Responsibilities:
Examples:
The “Standalone Reader” has been implemented in module docutils.readers.standalone.
The “PEP Reader” has been implemented in module docutils.readers.pep; see PEP 287 and PEP 12.

Parsers analyze their input and produce a Docutils document tree. They don’t know or care anything about the source or destination of the data.
Each input parser is a module or package exporting a “Parser” class with a “parse” method. The base “Parser” class can be found in the docutils/parsers/__init__.py module.
Responsibilities: Given raw input text and a doctree root node, populate the doctree by parsing the input text.
Example: The only parser implemented so far is for the reStructuredText markup. It is implemented in the docutils/parsers/rst/ package.
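For example, the reStructuredText parser can be driven directly, outside the Publisher, roughly as follows (a sketch; the helper used here to build default settings has been renamed in later Docutils releases):

    from docutils.frontend import OptionParser
    from docutils.parsers.rst import Parser
    from docutils.utils import new_document

    parser = Parser()
    # Build a settings object carrying the parser's runtime options.
    settings = OptionParser(components=(Parser,)).get_default_settings()
    # The caller (normally a Reader) supplies a fresh document tree root.
    document = new_document("<string>", settings)
    parser.parse("Hello, *world*!", document)
    print(document.pformat())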
The development and integration of other parsers is possible and encouraged.
The Transformer class, in docutils/transforms/__init__.py, stores transforms and applies them to documents. A transformer object is attached to every new document tree. The Publisher calls Transformer.apply_transforms() to apply all stored transforms to the document tree. Transforms change the document tree from one form to another, add to the tree, or prune it. Transforms resolve references and footnote numbers, process interpreted text, and do other context-sensitive processing.
Some transforms are specific to components (Readers, Parser, Writers, Input, Output). Standard component-specific transforms are specified in the default_transforms attribute of component classes. After the Reader has finished processing, the Publisher calls Transformer.populate_from_components() with a list of components and all default transforms are stored.
Each transform is a class in a module in the docutils/transforms/ package, a subclass of docutils.transforms.Transform. Transform classes each have a default_priority attribute which is used by the Transformer to apply transforms in order (low to high). The default priority can be overridden when adding transforms to the Transformer object.
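A hypothetical transform, shown only to illustrate the interface (the class name and priority value below are invented for this example), might look like this:

    from docutils import nodes
    from docutils.transforms import Transform

    class UppercaseTitles(Transform):

        """Illustration only: fold every title in the document to upper case."""

        # Transforms are applied in priority order, low to high.
        default_priority = 700

        def apply(self):
            for title in self.document.traverse(nodes.title):
                title[:] = [nodes.Text(title.astext().upper())]

    # A transform is registered with, and later run by, the document's
    # Transformer, e.g.: document.transformer.add_transform(UppercaseTitles)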
Transformer responsibilities:
Transform responsibilities:
Examples of transforms (in thedocutils/transforms/ package):
Writers produce the final output (HTML, XML, TeX, etc.). Writers translate the internal document tree structure into the final data format, possibly running Writer-specific transforms first.
By the time the document gets to the Writer, it should be in final form. The Writer’s job is simply (and only) to translate from the Docutils doctree structure to the target format. Some small transforms may be required, but they should be local and format-specific.
Each writer is a module or package exporting a “Writer” class with a “write” method. The base “Writer” class can be found in the docutils/writers/__init__.py module.
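The division of labor is visible in the convenience functions: publish_doctree() stops after the Reader/Parser/Transformer stages, and publish_from_doctree() hands a finished tree to a Writer. A sketch (the settings_overrides key shown is a standard Docutils setting):

    from docutils.core import publish_doctree, publish_from_doctree

    # Reader + Parser + transforms: source text in, document tree out.
    doctree = publish_doctree("A small *test* document.")

    # Writer: document tree in, formatted output out.  The "pseudoxml"
    # Writer renders the tree structure itself and is handy for testing.
    output = publish_from_doctree(
        doctree, writer_name="pseudoxml",
        settings_overrides={"output_encoding": "unicode"})
    print(output)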
Responsibilities:
Examples:
- Docutils XML (implemented in module docutils.writers.docutils_xml).
- HTML (XHTML; implemented in module docutils.writers.html4css1).
- Pseudo-XML (implemented in module docutils.writers.pseudoxml, used for testing).

I/O classes provide a uniform API for low-level input and output. Subclasses will exist for a variety of input/output mechanisms. However, they can be considered an implementation detail. Most applications should be satisfied using one of the convenience functions associated with the Publisher.
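A rough sketch of the intended usage, with the string-based classes (constructor details may change while the I/O classes are still being worked out):

    from docutils.io import StringInput, StringOutput

    source = StringInput("Some *input* text.")
    print(source.read())               # -> 'Some *input* text.'

    destination = StringOutput(encoding="unicode")
    destination.write("formatted result")
    print(destination.destination)     # -> 'formatted result'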
I/O classes are currently in the preliminary stages; there’s a lot of work yet to be done. Issues:
Responsibilities:
Examples of input sources:
- A file on disk or a stream (implemented as docutils.io.FileInput).
- Multiple files on disk (MultiFileInput?).
- A Python string (implemented as docutils.io.StringInput).

Examples of output destinations:

- A file on disk or a stream (implemented as docutils.io.FileOutput).
- A Python string, returned to a client application (implemented as docutils.io.StringOutput).
- No output at all, useful when only a portion of the processing is needed (implemented as docutils.io.NullOutput).

Within the Docutils package itself:

- The docutils/parsers/ subpackage (base class in docutils/parsers/__init__.py); see Parsers above.
- The docutils/readers/ subpackage (base class in docutils/readers/__init__.py); see Readers above.
- The docutils/writers/ subpackage (base class in docutils/writers/__init__.py); see Writers above.
- The docutils/transforms/ subpackage (Transformer and base Transform classes in docutils/transforms/__init__.py); see Transforms above.
- The docutils/languages/ subpackage (docutils/languages/__init__.py) contains language-dependent mappings.
- extras/optparse.py and extras/textwrap.py provide option parsing and command-line help; they come from Greg Ward’s http://optik.sf.net/ project and are included for convenience.
- extras/roman.py contains Roman numeral conversion routines.
- The tools/ directory contains several front ends for common Docutils processing. See Docutils Front-End Tools for details.
A single intermediate data structure is used internally by Docutils, in the interfaces between components; it is defined in the docutils.nodes module. It is not required that this data structure be used internally by any of the components, just between components as outlined in the diagram in the Docutils Project Model above.
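For instance, a small fragment of such a tree can be built and inspected directly with the docutils.nodes API (an illustrative sketch):

    from docutils import nodes

    # Build a small subtree by hand; normally the Reader and Parser do this.
    section = nodes.section()
    section += nodes.title(text="Example")
    section += nodes.paragraph(text="A paragraph node inside a section.")

    print(section.pformat())   # indented pseudo-XML view of the subtree
    print(section.astext())    # plain-text content only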
Custom node types are allowed, provided that either (a) a transform converts them to standard Docutils nodes before they reach the Writer proper, or (b) the custom node is explicitly supported by certain Writers, and is wrapped in a filtered “pending” node. An example of condition (a) is the Python Source Reader (see below), where a “stylist” transform converts custom nodes. The HTML <meta> tag is an example of condition (b); it is supported by the HTML Writer but not by others. The reStructuredText “meta” directive creates a “pending” node, which contains knowledge that the embedded “meta” node can only be handled by HTML-compatible writers. The “pending” node is resolved by the docutils.transforms.components.Filter transform, which checks that the calling writer supports HTML; if it doesn’t, the “pending” node (and enclosed “meta” node) is removed from the document.
The document tree data structure is similar to a DOM tree, but with specific node names (classes) instead of DOM’s generic nodes. The schema is documented in an XML DTD (eXtensible Markup Language Document Type Definition), which comes in two parts:
The DTD defines a rich set of elements, suitable for many input and output formats. The DTD retains all information necessary to reconstruct the original input text, or a reasonable facsimile thereof.
See The Docutils Document Tree for details (incomplete).
When the parser encounters an error in markup, it inserts a system message (DTD element “system_message”). There are five levels of system messages: level-0 (“DEBUG”), level-1 (“INFO”), level-2 (“WARNING”), level-3 (“ERROR”), and level-4 (“SEVERE”).
Although the initial message levels were devised independently, they have a strong correspondence to VMS error condition severity levels; the names in quotes for levels 1 through 4 were borrowed from VMS. Error handling has since been influenced by the log4j project.
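For example, unbalanced inline markup produces a WARNING-level system_message element in the document tree; a sketch (the settings_overrides keys are standard Docutils settings):

    from docutils.core import publish_string

    output = publish_string(
        "`unclosed interpreted text",      # start-string without end-string
        writer_name="pseudoxml",
        settings_overrides={"output_encoding": "unicode",
                            "report_level": 5,   # don't echo the message to stderr
                            "halt_level": 5})    # and don't abort on it
    # The printed tree contains a <system_message> element of type WARNING.
    print(output)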
The Python Source Reader (“PySource”) is the Docutils component that reads Python source files, extracts docstrings in context, then parses, links, and assembles the docstrings into a cohesive whole. It is a major and non-trivial component, currently under experimental development in the Docutils sandbox. High-level design issues are presented here.
This model will evolve over time, incorporating experience and discoveries.
Abstract Syntax Tree mining code will be written (or adapted) that scans a parsed Python module, and returns an ordered tree containing the names, docstrings (including attribute and additional docstrings; see below), and additional info (in parentheses below) of all of the following objects:

- packages
- modules
- module attributes (and values)
- classes (and inheritance info)
- class attributes (and values)
- instance attributes (and values)
- methods (and parameters & defaults)
- functions (and parameters & defaults)
(Extract comments too? For example, comments at the start of a module would be a good place for bibliographic field lists.)
In order to evaluate interpreted text cross-references, namespaces for each of the above will also be required.
See the python-dev/docstring-develop thread “AST mining”, started on 2001-08-14.
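This PEP predates the standard ast module, but the kind of mining described above can be sketched with it today; the helper below is illustrative only and ignores attribute docstrings, additional docstrings, nesting, and namespaces:

    import ast

    def top_level_docstrings(source):
        """Map the module and its top-level definitions to their docstrings."""
        tree = ast.parse(source)
        found = {"<module>": ast.get_docstring(tree)}
        for node in tree.body:
            if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef,
                                 ast.ClassDef)):
                found[node.name] = ast.get_docstring(node)
        return found

    print(top_level_docstrings(
        '"""Module docstring."""\n'
        'def f(x):\n'
        '    """Docstring for f."""\n'
        '    return x\n'))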
If the “__all__” variable is present in the module being documented, only identifiers listed in “__all__” are examined for docstrings. In the absence of “__all__”, all identifiers are examined, except those whose names are private (names begin with “_” but don’t begin and end with “__”).

Docstrings are string literal expressions, and are recognized in the following places within Python modules:
- At the beginning of a module, function definition, class definition, or method definition, after any comments; this is the standard place for Python __doc__ attributes.
- Immediately following a simple assignment at the top level of a module, class definition, or __init__ method definition, after any comments. See Attribute Docstrings below.
- Additional string literals found immediately after the docstrings above; see Additional Docstrings below.

Whenever possible, Python modules should be parsed by Docutils, not imported. There are several reasons:

- Importing untrusted code is inherently insecure.
- Information from the source is lost when using introspection to examine an imported module, such as comments and the order of definitions.
- Docstrings are meant to be recognized in places where the byte-code compiler ignores string literal expressions (attribute and additional docstrings), and these are invisible to introspection of an imported module.
Of course, standard Python parsing tools such as the “parser” library module should be used.
When the Python source code for a module is not available (i.e. only the .pyc file exists) or for C extension modules, to access docstrings the module can only be imported, and any limitations must be lived with.
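In that case the fallback is introspection of the imported module; a minimal sketch:

    import importlib
    import inspect

    # For C extension modules there is no Python source to parse, so
    # docstrings can only be obtained from the imported objects.
    mod = importlib.import_module("math")
    print(inspect.getdoc(mod))          # module docstring
    print(inspect.getdoc(mod.sqrt))     # function docstring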
Since attribute docstrings and additional docstrings are ignored by the Python byte-code compiler, no namespace pollution or runtime bloat will result from their use. They are not assigned to __doc__ or to any other attribute. The initial parsing of a module may take a slight performance hit.
(This is a simplified version of PEP 224.)
A string literal immediately following an assignment statement is interpreted by the docstring extraction machinery as the docstring of the target of the assignment statement, under the following conditions:
1. The assignment must occur in one of the following contexts:

   a) at the top level of a module (a module attribute);

   b) at the top level of a class definition (a class attribute);

   c) at the top level of the “__init__” method definition of a class: an instance attribute. Instance attributes assigned in other methods are assumed to be implementation details. (@@@ __new__ methods?)

   d) a function attribute assignment at the top level of a module or class definition.

   Since each of the above contexts is at the top level (i.e., in the outermost suite of a definition), it may be necessary to place dummy assignments for attributes assigned conditionally or in a loop.

2. The assignment must be to a single target, not to a list or a tuple of targets.

3. The form of the target:

   a) for contexts 1a and 1b: a simple identifier (not a dotted identifier, a subscripted expression, or a sliced expression);

   b) for context 1c: “self.attrib”, where “self” matches the “__init__” method’s first parameter (the instance parameter) and “attrib” is a simple identifier as in 3a;

   c) for context 1d: “name.attrib”, where “name” matches an already-defined function or method name and “attrib” is a simple identifier as in 3a.

Blank lines may be used after attribute docstrings to emphasize the connection between the assignment and the docstring.
Examples:
    g = 'module attribute (module-global variable)'
    """This is g's docstring."""

    class AClass:

        c = 'class attribute'
        """This is AClass.c's docstring."""

        def __init__(self):
            """Method __init__'s docstring."""

            self.i = 'instance attribute'
            """This is self.i's docstring."""

    def f(x):
        """Function f's docstring."""
        return x**2

    f.a = 1
    """Function attribute f.a's docstring."""
(This idea was adapted from PEP 216.)
Many programmers would like to make extensive use of docstrings for API documentation. However, docstrings do take up space in the running program, so some programmers are reluctant to “bloat up” their code. Also, not all API documentation is applicable to interactive environments, where __doc__ would be displayed.
Docutils’ docstring extraction tools will concatenate all string literal expressions which appear at the beginning of a definition or after a simple assignment. Only the first strings in definitions will be available as __doc__, and can be used for brief usage text suitable for interactive sessions; subsequent string literals and all attribute docstrings are ignored by the Python byte-code compiler and may contain more extensive API information.
Example:
    def function(arg):
        """This is __doc__, function's docstring."""
        """
        This is an additional docstring, ignored by the byte-code
        compiler, but extracted by Docutils.
        """
        pass
Issue: from __future__ import
This would break “from __future__ import” statements introduced in Python 2.1 for multiple module docstrings (main docstring plus additional docstring(s)). The Python Reference Manual specifies:
A future statement must appear near the top of the module. The only lines that can appear before a future statement are:
- the module docstring (if any),
- comments,
- blank lines, and
- other future statements.
Resolution?
1. Should we search for docstrings after a __future__ statement? Very ugly.

2. Redefine __future__ statements to allow multiple preceding string literals?

3. Or should we not worry about this? There shouldn’t be __future__ statements in production code, after all. Perhaps modules with __future__ statements will simply have to put up with the single-docstring limitation.

Rather than force everyone to use a single docstring format, multiple input formats are allowed by the processing system. A special variable, __docformat__, may appear at the top level of a module before any function or class definitions. Over time or through decree, a standard format or set of formats should emerge.
A module’s __docformat__ variable only applies to the objects defined in the module’s file. In particular, the __docformat__ variable in a package’s __init__.py file does not apply to objects defined in subpackages and submodules.
The __docformat__ variable is a string containing the name of the format being used: a case-insensitive string matching the input parser’s module or package name (i.e., the same name as required to “import” the module or package), or a registered alias. If no __docformat__ is specified, the default format is “plaintext” for now; this may be changed to the standard format if one is ever established.
The __docformat__ string may contain an optional second field, separated from the format name (first field) by a single space: a case-insensitive language identifier as defined in RFC 1766. A typical language identifier consists of a 2-letter language code from ISO 639 (3-letter codes used only if no 2-letter code exists; RFC 1766 is currently being revised to allow 3-letter codes). If no language identifier is specified, the default is “en” for English. The language identifier is passed to the parser and can be used for language-dependent markup features.
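For example, a module documented in reStructuredText with English markup might declare (“restructuredtext” is a registered alias for the docutils.parsers.rst parser):

    """Front matter of an example module, written in reStructuredText."""

    __docformat__ = 'restructuredtext en'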
In Python docstrings, interpreted text is used to classify and mark up program identifiers, such as the names of variables, functions, classes, and modules. If the identifier alone is given, its role is inferred implicitly according to the Python namespace lookup rules. For functions and methods (even when dynamically assigned), parentheses (‘()’) may be included:
    This function uses `another()` to do its work.
For class, instance and module attributes, dotted identifiers are used when necessary. For example (using reStructuredText markup):
    class Keeper(Storer):

        """
        Extend `Storer`.  Class attribute `instances` keeps track
        of the number of `Keeper` objects instantiated.
        """

        instances = 0
        """How many `Keeper` objects are there?"""

        def __init__(self):
            """
            Extend `Storer.__init__()` to keep track of instances.

            Keep count in `Keeper.instances`, data in `self.data`.
            """
            Storer.__init__(self)
            Keeper.instances += 1

            self.data = []
            """Store data in a list, most recent last."""

        def store_data(self, data):
            """
            Extend `Storer.store_data()`; append new `data` to a
            list (in `self.data`).
            """
            self.data = data
Each of the identifiers quoted with backquotes (“`”) will become references to the definitions of the identifiers themselves.
Stylist transforms are specialized transforms specific to the PySource Reader. The PySource Reader doesn’t have to make any decisions as to style; it just produces a logically constructed document tree, parsed and linked, including custom node types. Stylist transforms understand the custom nodes created by the Reader and convert them into standard Docutils nodes.
Multiple Stylist transforms may be implemented and one can be chosen at runtime (through a “--style” or “--stylist” command-line option). Each Stylist transform implements a different layout or style; thus the name. They decouple the context-understanding part of the Reader from the layout-generating part of processing, resulting in a more flexible and robust system. This also serves to “separate style from content”, the SGML/XML ideal.
By keeping the piece of code that does the styling small and modular, it becomes much easier for people to roll their own styles. The “barrier to entry” is too high with existing tools; extracting the stylist code will lower the barrier considerably.
A SourceForge project has been set up for this work at http://docutils.sourceforge.net/.
This document has been placed in the public domain.
This document borrows ideas from the archives of the Python Doc-SIG. Thanks to all members past & present.