[DomCrawler] Add `DomCrawler` to leverage PHP 8.4's HTML5-compliant DOM parser #61356

Closed

nicolas-grekas wants to merge 1 commit into symfony:7.4 from nicolas-grekas:dom-crawler-84

nicolas-grekas (Member) commented on Aug 7, 2025 (edited)

| Q             | A
| ------------- | ---
| Branch?       | 7.4
| Bug fix?      | no
| New feature?  | yes
| Deprecations? | yes
| Issues        | #53666
| License       | MIT

This PR replaces #54383.

It takes the BC-breaking approach by widening some property and method types to accept / return classes from both the `Dom\*` and `DOM*` sets.

This widening is a hard BC break for child classes that re-declare the corresponding properties/methods.
I expect this situation to be rather uncommon, so I don't think this will be a big-bang BC break.
Fortunately, this change in signature will be easy to spot since the PHP engine will fail early.

In addition, the `Crawler` class is made generic: once an instance is created, it's going to stick to one set, either `Dom\*` or `DOM*` classes.
To further limit the impact, `Crawler::__construct()` allows building only `Crawler<DOMNode>` instances and a `DomCrawler` class is introduced to create `Crawler<Dom\Node>` ones.
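
To illustrate the "generic, bound to one set" idea, here is a minimal sketch. It uses hypothetical stand-in classes (`LegacyNode`, `ModernNode`, `SketchCrawler`) rather than the PR's actual code, and the `@template` annotations only matter to static analyzers; the runtime check is an assumption added to make the binding visible.

```php
<?php
declare(strict_types=1);

// Hypothetical stand-ins for \DOMNode and \Dom\Node, so the sketch runs on any PHP 8 version.
class LegacyNode {}
class ModernNode {}

/**
 * @template T of LegacyNode|ModernNode
 */
class SketchCrawler
{
    /** @var list<T> */
    private array $nodes = [];

    /** @param class-string<T> $nodeClass the node set this instance is bound to */
    private function __construct(private string $nodeClass)
    {
    }

    /** @return self<LegacyNode> */
    public static function legacy(): self
    {
        return new self(LegacyNode::class);
    }

    /** @return self<ModernNode> */
    public static function modern(): self
    {
        return new self(ModernNode::class);
    }

    /** @param T $node */
    public function addNode(LegacyNode|ModernNode $node): void
    {
        // Once created, an instance only accepts nodes from its own set.
        if (!is_a($node, $this->nodeClass)) {
            throw new \InvalidArgumentException(sprintf('Expected %s, got %s.', $this->nodeClass, $node::class));
        }
        $this->nodes[] = $node;
    }

    /** @return T|null */
    public function getNode(int $position): LegacyNode|ModernNode|null
    {
        return $this->nodes[$position] ?? null;
    }
}

$crawler = SketchCrawler::legacy();
$crawler->addNode(new LegacyNode());
```

The native signatures are widened to the union, while the phpdoc narrows them per instance, which is roughly the trade-off the description outlines.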

As a reminder, the goal of this work is to be able to remove the dependency on `masterminds/html5` in Symfony 8.
It is NOT to force moving to the new `Dom\*` API.

Still, in order to achieve this goal and because we want HTML5-parsing to be the default, `BrowserKit` will leverage the new `DomCrawler` class when running on PHP 8.4+.

The new `Dom\*` API has some behavioral changes, all legit and documented at https://wiki.php.net/rfc/opt_in_dom_spec_compliance

The `Crawler` implementation ensures some fundamental behaviors remain, such as node names being returned in lowercase, empty values/texts being returned as the empty string instead of null, or void tags not having closing tags. I don't think we'd gain much by propagating these changes brought by the `Dom\*` API to the `Crawler` one. It'd actually make adopting the new `Dom\*` implementation harder for little to no benefit.
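
As a rough illustration of this kind of normalization, the sketch below wraps two node accessors so that callers get the same shape of result from either API. It requires PHP 8.4+, and `normalizedName()` and `normalizedText()` are hypothetical helpers, not methods from the PR.

```php
<?php
declare(strict_types=1);

// Requires PHP 8.4+ for the Dom\* classes.
// Both helpers are hypothetical and only illustrate the normalization idea.

function normalizedName(\DOMNode|\Dom\Node $node): string
{
    // Always expose node names in lowercase, whatever the underlying API returns.
    return strtolower($node->nodeName);
}

function normalizedText(\DOMNode|\Dom\Node $node): string
{
    // Always expose text as a string, turning a null value into ''.
    return $node->textContent ?? '';
}

$html = '<!DOCTYPE html><html><head><title>Demo</title></head><body><p>Hello</p></body></html>';
$paragraph = \Dom\HTMLDocument::createFromString($html)->querySelector('p');

$name = normalizedName($paragraph);
$text = normalizedText($paragraph);
```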

The BC break is a decision we have to make. I think it's the best possible approach because in practice, the impact will be limited. The alternative taken in #54383 is to create two independent type hierarchies. This will create a situation where suddenly, all the existing crawler-related code will need to be updated. Better not when there's a less costly path - even though it'll break some edge-cases.

carsonbot added the Status: Needs Review, Deprecation, DomCrawler and Feature labels on Aug 7, 2025

carsonbot added this to the 7.4 milestone on Aug 7, 2025

nicolas-grekas force-pushed the dom-crawler-84 branch 4 times, most recently from 2e9f204 to 8b72b64, on August 7, 2025 15:59

nicolas-grekas (Member, Author) commented on Aug 7, 2025

Looks like psalm doesn't understand my annotations. Any proposals to make them understandable to the tool?
Maybe phpstan would do a better job?

nicolas-grekas mentioned this pull request on Aug 7, 2025: [DomCrawler] Support classes from the new DOM extension #54383 (Closed)

alexandre-daubois (Member) commented on Aug 7, 2025

I'll have a deeper look at this PR soon, but the BC break seems definitely the best option. As you said, I also expect that very few projects will be impacted. Thank you for taking over this topic!

nicolas-grekas mentioned this pull request on Aug 8, 2025: [HtmlSanitizer] Use the native HTML5 parser when using PHP 8.4+ #61366 (Merged)

nicolas-grekas added a commit that referenced this pull request on Aug 12, 2025 (0d68b22):

feature #61366 [HtmlSanitizer] Use the native HTML5 parser when using PHP 8.4+ (nicolas-grekas)

This PR was merged into the 7.4 branch.

Discussion
----------

[HtmlSanitizer] Use the native HTML5 parser when using PHP 8.4+

| Q             | A
| ------------- | ---
| Branch?       | 7.4
| Bug fix?      | no
| New feature?  | yes
| Deprecations? | yes
| Issues        | #53666
| License       | MIT

Together with #61356, this PR allows removing any dependency on masterminds/html5 in favor of the native HTML5 capabilities of PHP 8.4 on Symfony 8.

In order to do so, we:
 * Use the native HTML5 parser when using PHP 8.4+
 * Deprecate `MastermindsParser`; use `NativeParser` instead
 * [BC BREAK] `ParserInterface::parse()` can now return `\Dom\Node|\DOMNode|null` instead of just `\DOMNode|null`
 * Add argument `$context` to `ParserInterface::parse()`

Note that `DomVisitor` is internal so no BC breaks there.
And `StringSanitizer::htmlLower()` can leverage `strtolower()` since PHP 8.2 thanks to https://wiki.php.net/rfc/strtolower-ascii

Commits
-------

d0f98ad [HtmlSanitizer] Use the native HTML5 parser when using PHP 8.4+

alexandre-daubois (Member) approved these changes on Aug 12, 2025 (edited)

Looks good! About PHPStan, I'm not sure... To me, the syntax looks correct but I never saw the `T is ...` used elsewhere than with `@return`. Usage with `@return` is the only one documented, so I guess `@var` may be unsupported unfortunately.

carsonbot added the Status: Reviewed label and removed the Status: Needs Review label on Aug 12, 2025

nicolas-grekas force-pushed the dom-crawler-84 branch from 8b72b64 to b3e62b3 on August 19, 2025 06:44

nicolas-grekas (Member, Author) commented on Aug 19, 2025

At least phpstan understands the type alias and the type conditions.
PR is good to go. Votes pending ;)

nicolas-grekas force-pushed the dom-crawler-84 branch 5 times, most recently from c7d6231 to 9f6bc5a, on August 20, 2025 12:04

stof reviewed on Aug 20, 2025

src/Symfony/Component/DomCrawler/Crawler.php

		}

		if ($this instanceof DomCrawler) {
		// remove all void elements as defined by HTML5

stof (Member) commented on Aug 20, 2025

Be careful. A DomCrawler might hold XML content, not just HTML content. You need to also check whether `$this->document` is a `Dom\HtmlDocument`, or check `$this->isHtml`.
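
A minimal sketch of the first check suggested here, assuming PHP 8.4+. `holdsHtml()` is a hypothetical helper, not code from the PR; it only shows that an XML document built with the new API is not an instance of `Dom\HTMLDocument`.

```php
<?php
declare(strict_types=1);

// Requires PHP 8.4+ for the Dom\* classes. holdsHtml() is a hypothetical helper.
function holdsHtml(\Dom\Document|\DOMDocument $document): bool
{
    // Only HTML documents from the new API should get HTML5-specific handling,
    // such as special treatment of void elements.
    return $document instanceof \Dom\HTMLDocument;
}

$html = \Dom\HTMLDocument::createFromString(
    '<!DOCTYPE html><html><head><title>t</title></head><body><br></body></html>'
);
$xml = \Dom\XMLDocument::createFromString('<?xml version="1.0"?><root><br/></root>');

$htmlResult = holdsHtml($html);
$xmlResult = holdsHtml($xml);
```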

src/Symfony/Component/DomCrawler/Crawler.php (outdated, resolved)

UPGRADE-7.4.md

		* properties `FormField::$document`, `$xpath`, `$node` and method `getLabel()`
		* methods `Form::getFormNode()` and `addField()`
		* property `AbstractUriElement::$node`, and methods `getNode()` and `setNode()`
		* methods `Crawler::add()`, `addDocument()`, `addNodeList()`, `addNode()`, `getNode()` and `sibling()`

stof (Member) commented on Aug 20, 2025

Type widening for `getNode` is a BC break impacting caller code rather than child classes, because it widens a return type instead of a parameter type (same for other getters in other classes). This should be highlighted IMO.
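
The difference can be sketched with stand-in classes. Everything below is hypothetical (`LegacyNode`, `ModernNode`, `WidenedCrawler`, `describe()`); it only shows why a widened return type breaks callers that type-hint the old class, even when no child class is involved.

```php
<?php
declare(strict_types=1);

// Hypothetical stand-ins for \DOMNode and \Dom\Node.
final class LegacyNode {}
final class ModernNode {}

final class WidenedCrawler
{
    public function __construct(private LegacyNode|ModernNode $node)
    {
    }

    // The return type used to be LegacyNode only; it is now a union.
    public function getNode(): LegacyNode|ModernNode
    {
        return $this->node;
    }
}

// Caller code written against the old, narrower return type.
function describe(LegacyNode $node): string
{
    return 'legacy node';
}

// Still works when the crawler holds a node from the old set.
$ok = describe((new WidenedCrawler(new LegacyNode()))->getNode());

// Fails at runtime as soon as the crawler holds a node from the new set.
try {
    describe((new WidenedCrawler(new ModernNode()))->getNode());
    $result = 'no error';
} catch (\TypeError $e) {
    $result = 'TypeError';
}
```

Unlike an incompatible child-class declaration, which the engine rejects when the class is loaded, this failure only shows up when the affected code path runs.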

stof (Member) commented on Aug 20, 2025

And this is not just theoretical. https://github.com/minkphp/MinkBrowserKitDriver will be totally broken by this BC break affecting consumers of the library (and immediately because of the BC break in BrowserKit switching to the BC-breaking implementation)

src/Symfony/Component/DomCrawler/Crawler.php (outdated, resolved)

UPGRADE-7.4.md

		BrowserKit
		----------

		* Leverage the native HTML5 parser when using PHP 8.4+

stof (Member) commented on Aug 20, 2025

This is a BC break for code interacting with DomCrawler methods returning a raw node, as the type has changed. This should be documented as such.

src/Symfony/Component/DomCrawler/Crawler.php (outdated, resolved)

[DomCrawler] Add DomCrawler to leverage PHP 8.4's HTML5-compliant DOM parser (0a77e4e)

nicolas-grekas force-pushed the dom-crawler-84 branch from 9f6bc5a to 0a77e4e on August 20, 2025 13:13

nicolas-grekas mentioned this pull request on Aug 20, 2025: [DomCrawler] Use the native HTML5 parser on PHP 8.4 #61475 (Merged)

nicolas-grekas (Member, Author) commented on Aug 20, 2025 (edited)

Thanks for the review @stof
I addressed your comments and then tried another BC-safe approach which appears to work, see #61475
This very PR might still be useful if we want to expose the new `Dom\*` API one day.
Closing for now.

nicolas-grekas closed this on Aug 20, 2025

nicolas-grekas deleted the dom-crawler-84 branch on August 20, 2025 15:47

nicolas-grekas added a commit that referenced this pull request on Aug 21, 2025 (8187625):

feature #61475 [DomCrawler] Use the native HTML5 parser on PHP 8.4 (nicolas-grekas)

This PR was merged into the 7.4 branch.

Discussion
----------

[DomCrawler] Use the native HTML5 parser on PHP 8.4

| Q             | A
| ------------- | ---
| Branch?       | 7.4
| Bug fix?      | no
| New feature?  | no
| Deprecations? | no
| Issues        | -
| License       | MIT

This PR keeps the `DOM*`-based API but uses the native HTML5 parser on PHP 8.4 instead of masterminds/html5.

This works by parsing HTML strings using `Dom\HTMLDocument` then serializing to XML, and loading again using `DOMDocument::loadXML()`.

This basically replaces #61356 since it removes any BC breaks.

The drawback compared to a more native approach is the double-parsing that happens.
This could be worked on later by providing a way to leverage the new `Dom\*` API directly.
To be proved worth it before.

Commits
-------

bd7fb51 [DomCrawler] Use the native HTM5 parser on PHP 8.4
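
The double-parsing approach described in this commit message can be sketched as follows, assuming PHP 8.4+. `parseHtml5AsLegacyDom()` is a hypothetical helper written for illustration; the merged Symfony code may differ in options, error handling and namespace treatment.

```php
<?php
declare(strict_types=1);

// Requires PHP 8.4+. parseHtml5AsLegacyDom() is a hypothetical helper, not Symfony's code.
function parseHtml5AsLegacyDom(string $html): \DOMDocument
{
    // 1. Parse with the HTML5-compliant parser; LIBXML_NOERROR silences parse-error warnings.
    $modern = \Dom\HTMLDocument::createFromString($html, \LIBXML_NOERROR);

    // 2. Serialize the resulting tree as XML.
    $xml = $modern->saveXml();

    // 3. Load that XML into the legacy DOM* API, which is the second parse.
    $legacy = new \DOMDocument();
    if (false === $xml || !$legacy->loadXML($xml)) {
        throw new \RuntimeException('Unable to convert the parsed HTML to a DOMDocument.');
    }

    return $legacy;
}

$document = parseHtml5AsLegacyDom('<!DOCTYPE html><html><body><p>Hello<br>world</p></body></html>');

// Match on the local name in any namespace, since the serialized elements may carry a namespace.
$paragraphs = $document->getElementsByTagNameNS('*', 'p');
```

Callers keep receiving `DOMDocument` and `DOMNode` instances, which is why this route avoids the type widening discussed in the review, at the cost of parsing every document twice.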

nicolas-grekas restored the dom-crawler-84 branch on August 21, 2025 07:56

nicolas-grekas deleted the dom-crawler-84 branch on August 21, 2025 07:57

nicolas-grekas mentioned this pull request on Aug 21, 2025: [DomCrawler] Improve phpdoc annotations #61482 (Merged)

nicolas-grekas added a commit that referenced this pull request on Aug 21, 2025 (d2ccf32):

minor #61482 [DomCrawler] Improve phpdoc annotations (nicolas-grekas)

This PR was merged into the 7.4 branch.

Discussion
----------

[DomCrawler] Improve phpdoc annotations

| Q             | A
| ------------- | ---
| Branch?       | 7.4
| Bug fix?      | no
| New feature?  | no
| Deprecations? | no
| Issues        | -
| License       | MIT

Subset of #61356

Commits
-------

c06142a [DomCrawler] Improve phpdoc annotations

Labels: Deprecation, DomCrawler, Feature, Status: Reviewed

4 participants
