RemexHtml is a parser for HTML 5, written in PHP.
RemexHtml aims to be:
RemexHtml contains the following modules:
RemexHtml presently lacks:
RemexHtml aims to be compliant with W3C recommendation HTML 5.1, except for minor backported bugfixes. We chose to implement the W3C standard rather than the latest WHATWG draft because our application needs stability more than feature completeness.
RemexHtml passes all html5lib tests, except for parse error counts and tests which reference a future version of the standard.
RemexHtml has been available in MediaWiki as a core composer dependency since MediaWiki 1.29. Its initial use case was as a replacement for HTML Tidy. Output from the wikitext parser is fed into RemexHtml's HTML parser and cleaned up per the HTML 5 tag soup specification. The Tokenizer component is now also used for tag stripping in Sanitizer.
It is also used for HTML postprocessing in the Collection, TEI and Wikibase extensions.
Install the wikimedia/remex-html package from Packagist:
composer require wikimedia/remex-html
Semantic versioning is used. The major version number will be incremented for every change that breaks backwards compatibility.
For full reference documentation, please see the documentation generated from the source (or the source itself).
RemexHtml uses a pipeline model. Each event producer calls the attached callback object when it has an event ready to produce. The pipeline stages are:
In the HTML specification, the tree construction algorithm is imagined as being tightly integrated with creation of a DOM data structure. A major innovation of RemexHtml is to separate tree construction into a phase which generates a tree mutation event stream, and a phase which actually produces the data structure. RemexHtml is able to directly serialize the tree mutation event stream, without needing to store the whole DOM in memory.
When Serializer is used, there is a final pipeline stage:
RemexHtml also provides:
There are optional pipeline stages providing debugging facilities:
RemexHtml's model of a configurable pipeline provides a great deal of flexibility. Applications may subclass the pipeline classes provided by RemexHtml, write their own from scratch by implementing the relevant event receiver interface, or interpose custom pipeline stages between RemexHtml's standard stages.
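As an illustration of interposing a custom stage, the following sketch places a tag-counting stage between the Tokenizer and the Dispatcher. It assumes that the library ships a pass-through base class named RelayTokenHandler in the Tokenizer namespace and that startTag() has the signature shown; both should be checked against the reference documentation for the installed version. TagCounter, countTags() and the autoloader path are hypothetical names chosen for this example.

```php
<?php
// Assumption: dependencies were installed with Composer in the current directory.
require __DIR__ . '/vendor/autoload.php';

use Wikimedia\RemexHtml\DOM\DOMBuilder;
use Wikimedia\RemexHtml\Tokenizer\Attributes;
use Wikimedia\RemexHtml\Tokenizer\RelayTokenHandler;
use Wikimedia\RemexHtml\Tokenizer\Tokenizer;
use Wikimedia\RemexHtml\TreeBuilder\Dispatcher;
use Wikimedia\RemexHtml\TreeBuilder\TreeBuilder;

// Hypothetical stage: counts start tags by name, then forwards each token unchanged.
class TagCounter extends RelayTokenHandler {
	public $counts = [];

	public function startTag( $name, Attributes $attrs, $selfClose, $sourceStart, $sourceLength ) {
		$this->counts[$name] = ( $this->counts[$name] ?? 0 ) + 1;
		// Pass the event on to the next stage (the Dispatcher).
		parent::startTag( $name, $attrs, $selfClose, $sourceStart, $sourceLength );
	}
}

function countTags( $html ) {
	$domBuilder = new DOMBuilder();
	$treeBuilder = new TreeBuilder( $domBuilder );
	$dispatcher = new Dispatcher( $treeBuilder );
	// The custom stage sits between the Tokenizer and the Dispatcher.
	$counter = new TagCounter( $dispatcher );
	$tokenizer = new Tokenizer( $counter, $html );
	$tokenizer->execute();
	return $counter->counts;
}
```

Because the stage sees tokenizer events, it counts only tags that appear in the source text, not elements implied by the tree construction algorithm.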
However, for simple use cases, there is a fair amount of boilerplate. T217850 proposes to add a simplified method for constructing a standard pipeline, but this has not yet been implemented.
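use Wikimedia\RemexHtml\DOM\DOMBuilder;
use Wikimedia\RemexHtml\TreeBuilder\TreeBuilder;
use Wikimedia\RemexHtml\TreeBuilder\Dispatcher;
use Wikimedia\RemexHtml\Tokenizer\Tokenizer;

function parseHtmlToDom( $input ) {
	// Final stage: builds the DOM from tree mutation events.
	$domBuilder = new DOMBuilder();
	// Each earlier stage receives the stage that follows it.
	$treeBuilder = new TreeBuilder( $domBuilder );
	$dispatcher = new Dispatcher( $treeBuilder );
	$tokenizer = new Tokenizer( $dispatcher, $input );
	// Parse the whole input and push events through the pipeline.
	$tokenizer->execute();
	return $domBuilder->getFragment();
}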
In the above code sample, the pipeline is constructed backwards, from end to start. The constructor of each pipeline stage receives the following pipeline stage. Then, with the pipeline fully constructed, $tokenizer->execute() causes the whole input text to be parsed and emitted through the pipeline, eventually reaching the DOMBuilder. After execution, the constructed document is available via $domBuilder->getFragment().
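Once the pipeline has run, the result can be handled with PHP's ordinary DOM API. The sketch below repeats the same pipeline in compact form so that it runs on its own, then lists the text of each paragraph. It assumes a Composer autoloader in the current directory and that a full-document parse returns a DOMDocument from getFragment(); parseToDocument() is a hypothetical helper name.

```php
<?php
// Assumption: dependencies were installed with Composer in the current directory.
require __DIR__ . '/vendor/autoload.php';

use Wikimedia\RemexHtml\DOM\DOMBuilder;
use Wikimedia\RemexHtml\Tokenizer\Tokenizer;
use Wikimedia\RemexHtml\TreeBuilder\Dispatcher;
use Wikimedia\RemexHtml\TreeBuilder\TreeBuilder;

// Hypothetical helper: same pipeline as the sample above, built end to start.
function parseToDocument( $input ) {
	$domBuilder = new DOMBuilder();
	$tokenizer = new Tokenizer( new Dispatcher( new TreeBuilder( $domBuilder ) ), $input );
	$tokenizer->execute();
	$node = $domBuilder->getFragment();
	// Assumption: a document parse yields a DOMDocument; otherwise use the owner document.
	return $node instanceof \DOMDocument ? $node : $node->ownerDocument;
}

// Unclosed tags are repaired by the tree construction stage.
$doc = parseToDocument( '<p>Hello <b>world</b><p>Second paragraph' );
foreach ( $doc->getElementsByTagName( 'p' ) as $p ) {
	echo trim( $p->textContent ), "\n";
}
```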
use Wikimedia\RemexHtml\HTMLData;
use Wikimedia\RemexHtml\Serializer\HtmlFormatter;
use Wikimedia\RemexHtml\Serializer\Serializer;
use Wikimedia\RemexHtml\Serializer\SerializerNode;
use Wikimedia\RemexHtml\Tokenizer\Tokenizer;
use Wikimedia\RemexHtml\TreeBuilder\Dispatcher;
use Wikimedia\RemexHtml\TreeBuilder\TreeBuilder;

function changeLinks( $html ) {
	$formatter = new class extends HtmlFormatter {
		public function element( SerializerNode $parent, SerializerNode $node, $contents ) {
			if ( $node->namespace === HTMLData::NS_HTML
				&& $node->name === 'a'
				&& isset( $node->attrs['href'] )
			) {
				$node = clone $node;
				$node->attrs = clone $node->attrs;
				$node->attrs['href'] = 'http://example.com/' . $node->attrs['href'];
			}
			return parent::element( $parent, $node, $contents );
		}
	};

	$serializer = new Serializer( $formatter );
	$treeBuilder = new TreeBuilder( $serializer );
	$dispatcher = new Dispatcher( $treeBuilder );
	$tokenizer = new Tokenizer( $dispatcher, $html );
	$tokenizer->execute();
	return $serializer->getResult();
}
This example modifies an HTML document on the fly, altering href attributes inside <a> tags and returning an HTML string. It does this by subclassing HtmlFormatter, which is a relatively easy hook point into reserialization. It clones the SerializerNode and Attributes objects to avoid altering the document as seen by Serializer, since it is possible this function may be called more than once on each node, and we don't want to prefix the domain name more than once.
Alternatively we could have used SerializerNode::$snData as a flag, to avoid double-prefixing:
if ( !$node->snData ) {
	$node->snData = true;
	$node->attrs['href'] = 'http://example.com/' . $node->attrs['href'];
}
Various options can be enabled which improve performance, potentially at the expense of correctness:
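A minimal sketch of a pipeline tuned for speed might look like the following. The option names (ignoreErrors, ignoreNulls, skipPreprocess) and the FastFormatter class are believed to match the library, but they are not taken from this page and should be checked against the reference documentation for the installed version. fastReserialize() and the autoloader path are hypothetical.

```php
<?php
// Assumption: dependencies were installed with Composer in the current directory.
require __DIR__ . '/vendor/autoload.php';

use Wikimedia\RemexHtml\Serializer\FastFormatter;
use Wikimedia\RemexHtml\Serializer\Serializer;
use Wikimedia\RemexHtml\Tokenizer\Tokenizer;
use Wikimedia\RemexHtml\TreeBuilder\Dispatcher;
use Wikimedia\RemexHtml\TreeBuilder\TreeBuilder;

// Hypothetical helper: reserialize HTML with speed-oriented options enabled.
function fastReserialize( $html ) {
	// Serialize the event stream directly instead of building a DOM.
	$serializer = new Serializer( new FastFormatter() );
	$treeBuilder = new TreeBuilder( $serializer, [
		'ignoreErrors' => true, // assumed option: do not report parse errors
		'ignoreNulls' => true, // assumed option: skip special handling of null characters
	] );
	$dispatcher = new Dispatcher( $treeBuilder );
	$tokenizer = new Tokenizer( $dispatcher, $html, [
		'ignoreErrors' => true,
		'skipPreprocess' => true, // assumed option: skip input normalization
		'ignoreNulls' => true,
	] );
	$tokenizer->execute();
	return $serializer->getResult();
}
```

Options of this kind are only safe when the input is already known to be free of the conditions they stop checking for, which is where the trade-off against correctness comes from.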
If RemexHtml throws a TokenizerError exception, for example "pcre.backtrack_limit exhausted", this is usually not a bug in RemexHtml. Either the relevant configuration setting should be increased, or the input size should be limited. The pcre.backtrack_limit INI setting should be at least double the input size.
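The rule of thumb above can be applied at run time. The following sketch uses only PHP's built-in ini_get() and ini_set() to raise pcre.backtrack_limit to twice the input size when the current value is lower. ensureBacktrackLimit() is a hypothetical helper, not part of RemexHtml.

```php
<?php
// Hypothetical helper: make pcre.backtrack_limit at least double the input size.
// Returns the limit in effect after the call.
function ensureBacktrackLimit( $input ) {
	$needed = 2 * strlen( $input );
	$current = (int)ini_get( 'pcre.backtrack_limit' );
	if ( $current < $needed ) {
		// Only ever raise the limit; never lower an existing, larger setting.
		ini_set( 'pcre.backtrack_limit', (string)$needed );
	}
	return (int)ini_get( 'pcre.backtrack_limit' );
}
```

A caller would invoke this before $tokenizer->execute(). Where raising the limit is not acceptable, rejecting oversized input before parsing is the alternative the text describes.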