[DomCrawler] Optionally use html5-php to parse HTML #29306
stof commented Nov 24, 2018
As the native implementation uses
fabpot commented Feb 21, 2019
@tgalopin What's the status of this PR?
tgalopin commented Feb 21, 2019
Waiting for Masterminds/html5-php#163 to be merged so that the tests pass here.
fabpot commented Mar 4, 2019
@tgalopin Upstream PR merged :)
stof commented Mar 28, 2019
Due to Masterminds/html5-php#139, shouldn't we use the
fabpot commented Mar 31, 2019
@tgalopin friendly ping
e21e17a to 14a454d
tgalopin commented Apr 3, 2019
Tests are failing for an unrelated reason. I think this is ready to review.
tgalopin commented Apr 3, 2019
Updated
3e61e24 to e0ca69a
    }
    /**
     * Convert charset to HTML-entities to ensure valid parsing.
Converts
fabpot commented Apr 3, 2019
Thank you @tgalopin.
…galopin)
This PR was squashed before being merged into the 4.3-dev branch (closes #29306).

Discussion
----------

[DomCrawler] Optionally use html5-php to parse HTML

| Q             | A
| ------------- | ---
| Branch?       | master
| Bug fix?      | no
| New feature?  | yes
| BC breaks?    | no
| Deprecations? | no
| Tests pass?   | WIP
| Fixed tickets | #29280, #28596
| License       | MIT
| Doc PR        | symfony/symfony-docs#10700

This PR introduces the possibility to parse HTML content in the Crawler using the html5-php library (https://github.com/Masterminds/html5-php). This allows for better support of HTML5 and fixes many unexpected behaviors and inconsistencies of the native DOM extension.

Commits
-------

4050ec4 [DomCrawler] Optionally use html5-php to parse HTML
        "masterminds/html5": "^2.6"
    },
    "conflict": {
        "masterminds/html5": "<2.6"
We should also conflict with > 3 then
        throw new \InvalidArgumentException('The current node list is empty.');
    }
    if (null !== $this->html5Parser) {
There is an issue here. You instantiate the HTML5 parser in the constructor even when the content added is not HTML5 but XML or existing DOM elements (coming from elsewhere than a parent crawler using HTML5). This means you might be saving with the HTML5 parser when it was not used for parsing.
How do you propose to improve this?
Well, we need to distinguish 3 cases:
- we are parsing some HTML5
- we are parsing some older HTML
- we are not parsing HTML at all
The boolean argument in the constructor allows us to decide between the first 2 cases at the time we instantiate. But whether the content is HTML or not is not something the constructor knows (as content can be added later).
The solution might be to store the boolean as a property. Then, based on that, we would decide which parsing strategy to use if we load HTML, and instantiate the HTML5 parser only if needed.
Then, here, we can keep saying "if I used an HTML5 parser, I also use it for saving".
And for subcrawlers, we copy the content of the private property.
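The strategy described in this comment can be sketched roughly as follows. This is only an illustration of the idea, not the code that was merged: `SketchCrawler`, `FakeHtml5Parser`, `parsedWith()` and `savingStrategy()` are hypothetical names, and the stand-in parser does no parsing at all.

```php
<?php
declare(strict_types=1);

// Hypothetical stand-in for the html5-php parser; it does no real parsing.
final class FakeHtml5Parser
{
}

// Hypothetical crawler showing only the decision logic from the comment.
final class SketchCrawler
{
    private bool $useHtml5Parser;                 // stored at construction
    private ?FakeHtml5Parser $html5Parser = null; // created only when HTML is loaded
    private string $parsedWith = 'nothing';

    public function __construct(bool $useHtml5Parser)
    {
        $this->useHtml5Parser = $useHtml5Parser;
    }

    public function addHtmlContent(string $html): void
    {
        if ($this->useHtml5Parser) {
            // Case 1: HTML5 content, instantiate the parser lazily.
            $this->html5Parser ??= new FakeHtml5Parser();
            $this->parsedWith = 'html5';
        } else {
            // Case 2: older HTML, handled by the native DOM extension.
            $this->parsedWith = 'native-html';
        }
    }

    public function addXmlContent(string $xml): void
    {
        // Case 3: not HTML at all, so the HTML5 parser is never created.
        $this->parsedWith = 'xml';
    }

    public function parsedWith(): string
    {
        return $this->parsedWith;
    }

    // "If I used an HTML5 parser, I also use it for saving."
    public function savingStrategy(): string
    {
        return null !== $this->html5Parser ? 'html5' : 'native';
    }

    public function createSubCrawler(): self
    {
        // Subcrawlers copy the private state instead of guessing again.
        $sub = new self($this->useHtml5Parser);
        $sub->html5Parser = $this->html5Parser;
        $sub->parsedWith = $this->parsedWith;

        return $sub;
    }
}

$xmlOnly = new SketchCrawler(true);
$xmlOnly->addXmlContent('<root/>');
echo $xmlOnly->savingStrategy(), "\n"; // native

$html = new SketchCrawler(true);
$html->addHtmlContent('<!DOCTYPE html><p>Hi</p>');
echo $html->createSubCrawler()->savingStrategy(), "\n"; // html5
```

With the flag stored and the parser created lazily, a crawler that only ever receives XML keeps saving with the native code path even though the HTML5 option was enabled.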
    }
    if ($useHtml5Parser ?? class_exists(HTML5::class)) {
        $this->html5Parser = new HTML5(['disable_html_ns' => true]);
When creating a child crawler, you should not rely on guessing but pass the existing value used for the parsing (or even better, assign the actual parser instead of instantiating a new one).
You mean in the createSubCrawler method?
yes
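A minimal sketch of what this thread suggests, assuming a hypothetical `ParentAwareCrawler` class and a `StubParser` stand-in (neither is part of Symfony or html5-php): the child receives the parent's actual parser instance, or its absence, rather than re-running a `class_exists()`-style guess.

```php
<?php
declare(strict_types=1);

// Hypothetical stand-in for a parser object; not the real html5-php class.
final class StubParser
{
}

// Hypothetical crawler that only models how the parser reaches a child.
final class ParentAwareCrawler
{
    private ?StubParser $parser;

    public function __construct(?StubParser $parser = null)
    {
        $this->parser = $parser;
    }

    public function createSubCrawler(): self
    {
        // Hand over the very parser (or null) that the parent used,
        // so parent and child always agree on the parsing strategy.
        return new self($this->parser);
    }

    public function parser(): ?StubParser
    {
        return $this->parser;
    }
}

$parser = new StubParser();
$parent = new ParentAwareCrawler($parser);
$child = $parent->createSubCrawler();
var_dump($child->parser() === $parser); // bool(true)

$legacyChild = (new ParentAwareCrawler())->createSubCrawler();
var_dump(null === $legacyChild->parser()); // bool(true)
```

Sharing the instance also avoids building a new parser for every subcrawler, which matters when filtering produces many of them.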
stof commented Apr 3, 2019
Using a constructor argument has a big drawback (but the previous implementation using a setter that must be called before loading the content has the same drawback): most people in Symfony don't instantiate a Crawler themselves. They use BrowserKit which manages this instantiation. This means they don't have direct access to anything happening before adding content.
…y (tgalopin)
This PR was merged into the master branch.

Discussion
----------

[DomCrawler][WIP] Add note about the HTML5 parser library

Documentation for the PR symfony/symfony#29306.

Commits
-------

6e2f04a [DomCrawler] Add note about the HTML5 parser library
…on (tgalopin)
This PR was merged into the 4.3-dev branch.

Discussion
----------

[DomCrawler] Improve Crawler HTML5 parser need detection

| Q             | A
| ------------- | ---
| Branch?       | master
| Bug fix?      | kind of
| New feature?  | no
| BC breaks?    | no
| Deprecations? | no
| Tests pass?   | yes
| Fixed tickets | -
| License       | MIT
| Doc PR        | -

Live from #eu-fossa

Follow up of #29306

This PR introduces a better detection mechanism to choose when to parse using the HTML5 parser or not, and fixes a subcrawler parsing issue as well.

@stof I'd be super interested by your review :) !

Commits
-------

9bbdab6 [DomCrawler] Improve Crawler HTML5 parser need detection
        "masterminds/html5": "<2.6"
    },
    "suggest": {
        "symfony/css-selector": ""
Shouldn't there be an entry here that describes that you can load masterminds/html5?
indeed, that would make sense.
I'm against adding things under suggest, nobody reads them anyway. I would even go as far as removing the existing entries :)
I always read them!
This PR introduces the possibility to parse HTML content in the Crawler using the html5-php library (https://github.com/Masterminds/html5-php). This allows for better support of HTML5 and fixes many unexpected behaviors and inconsistencies of the native DOM extension.