[DomCrawler] Optionally use html5-php to parse HTML #29306

Merged
fabpot merged 1 commit into symfony:master from tgalopin:html5-parser
Apr 3, 2019

@tgalopin (Contributor) commented Nov 24, 2018 (edited)

| Q             | A
| ------------- | ---
| Branch?       | master
| Bug fix?      | no
| New feature?  | yes
| BC breaks?    | no
| Deprecations? | no
| Tests pass?   | WIP
| Fixed tickets | #29280, #28596
| License       | MIT
| Doc PR        | symfony/symfony-docs#10700

This PR introduces the possibility to parse HTML content in the Crawler using the html5-php library (https://github.com/Masterminds/html5-php). This allows for better support of HTML5 and fixes many unexpected behaviors and inconsistencies of the native DOM extension.

@nicolas-grekas added this to the next milestone Nov 24, 2018

@stof (Member)

As the native implementation uses `validateOnParse`, I think your alternative implementation needs to check `$html5->hasErrors()` and throw based on `$html5->getErrors()` too. Otherwise, parse errors might go unnoticed.
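
A minimal sketch of the check suggested above, assuming the parser object exposes `loadHTML()`, `hasErrors()` and `getErrors()` the way html5-php's `Masterminds\HTML5` class does. The function name `parseHtml5OrFail`, the choice of `\RuntimeException`, and the stub parser are hypothetical and not taken from the PR; the stub only exists so the example runs without Composer.

```php
<?php

declare(strict_types=1);

/**
 * Parses $html with the given parser and throws if the parser reported errors.
 *
 * $html5 is duck-typed on purpose so the sketch runs without Composer; in a
 * real project it would be a Masterminds\HTML5 instance (assumption).
 *
 * @return mixed whatever the parser's loadHTML() returns
 */
function parseHtml5OrFail(object $html5, string $html)
{
    $document = $html5->loadHTML($html);

    if ($html5->hasErrors()) {
        // Surface every reported parse error instead of silently ignoring them.
        throw new \RuntimeException("HTML5 parse errors:\n" . implode("\n", $html5->getErrors()));
    }

    return $document;
}

// Hypothetical stub standing in for the real parser: it reports one error
// whenever the input opens a <b> tag without closing it.
$stubParser = new class {
    private array $errors = [];

    public function loadHTML(string $html): string
    {
        $unclosed = false !== strpos($html, '<b>') && false === strpos($html, '</b>');
        $this->errors = $unclosed ? ['Line 1, Col 0: unclosed <b>'] : [];

        return $html;
    }

    public function hasErrors(): bool
    {
        return [] !== $this->errors;
    }

    public function getErrors(): array
    {
        return $this->errors;
    }
};
```

Calling `parseHtml5OrFail($stubParser, '<p>ok</p>')` returns the stub's result, while `parseHtml5OrFail($stubParser, '<b>oops')` throws. Whether the Crawler should throw or merely report such errors is a design decision this sketch does not settle.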

@fabpot (Member)

@tgalopin What's the status of this PR?

@tgalopin (Contributor, Author)

Waiting for Masterminds/html5-php#163 to be merged to pass tests here.

@fabpot (Member)

@tgalopin Upstream PR merged :)

@stof (Member)

Due to Masterminds/html5-php#139, shouldn't we use the `saveHTML` of the HTML5 library instead of the native one when we use the HTML5 parser to parse the DOM (meaning we also need to remember whether the DOM was created by the HTML5 parser)?

@fabpot (Member)

@tgalopin friendly ping

@tgalopin force-pushed the html5-parser branch 3 times, most recently from e21e17a to 14a454d (March 31, 2019 10:15)
@tgalopin changed the title from "[DomCrawler][WIP] Optionally use html5-php to parse HTML" to "[DomCrawler] Optionally use html5-php to parse HTML" (Mar 31, 2019)

@tgalopin (Contributor, Author)

Tests are failing for an unrelated reason. I think this is ready to review.

@tgalopin (Contributor, Author)

Updated

@tgalopin force-pushed the html5-parser branch 2 times, most recently from 3e61e24 to e0ca69a (April 3, 2019 12:56)
}

/**
* Convert charset to HTML-entities to ensure valid parsing.

Member

Converts

@fabpot (Member)

Thank you @tgalopin.

@fabpot merged commit 4050ec4 into symfony:master Apr 3, 2019

fabpot added a commit that referenced this pull request Apr 3, 2019

…galopin)

This PR was squashed before being merged into the 4.3-dev branch (closes #29306).

Discussion
----------

[DomCrawler] Optionally use html5-php to parse HTML

| Q             | A
| ------------- | ---
| Branch?       | master
| Bug fix?      | no
| New feature?  | yes
| BC breaks?    | no
| Deprecations? | no
| Tests pass?   | WIP
| Fixed tickets | #29280, #28596
| License       | MIT
| Doc PR        | symfony/symfony-docs#10700

This PR introduces the possibility to parse HTML content in the Crawler using the html5-php library (https://github.com/Masterminds/html5-php). This allows for better support of HTML5 and fixes many unexpected behaviors and inconsistencies of the native DOM extension.

Commits
-------

4050ec4 [DomCrawler] Optionally use html5-php to parse HTML

@tgalopin deleted the html5-parser branch April 3, 2019 13:23

"masterminds/html5":"^2.6"
},
"conflict": {
"masterminds/html5":"<2.6"

Member

We should also conflict with > 3 then

throw new \InvalidArgumentException('The current node list is empty.');
}

if (null !== $this->html5Parser) {

Member
There is an issue here. You instantiate the HTML5 parser in the constructor even when the content added is not HTML5 but XML or existing DOM elements (coming from elsewhere than a parent crawler using HTML5). This means you might be saving with the HTML5 parser when it was not used for parsing.

Contributor, Author
How do you propose to improve this?

Member
well, we need to distinguish 3 cases:

  • we are parsing some HTML5
  • we are parsing some older HTML
  • we are not parsing HTML at all

The boolean argument in the constructor allows us to decide between the first 2 cases at the time we instantiate. But knowing whether this is HTML or not is not something the constructor knows (as loading can be done later).

The solution might be to store the boolean as a property. Then, based on that, we would decide which parsing strategy to use if we load HTML, and instantiate the HTML5 parser if needed.
Then, here, we can keep saying "if I used an HTML5 parser, I also use it for saving".

And for subcrawlers, we copy the content of the private property.
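
The proposal above could be sketched roughly as follows. This is only an illustration under stated assumptions: `SketchCrawler`, `FakeHtml5Parser`, `savesWithHtml5()` and the string-based "nodes" are hypothetical stand-ins, not Symfony's `Crawler` or html5-php's API. The point is the control flow: store the flag, create the parser lazily when HTML is actually loaded, let the save path depend on whether the parser was used, and copy the state into subcrawlers.

```php
<?php

declare(strict_types=1);

// Hypothetical stand-in for Masterminds\HTML5 so the sketch runs without Composer.
final class FakeHtml5Parser
{
    public function parse(string $html): string
    {
        return 'html5:' . $html;
    }
}

// Hypothetical class illustrating the idea; this is not Symfony's Crawler.
final class SketchCrawler
{
    private bool $useHtml5Parser;
    private ?FakeHtml5Parser $html5Parser = null;
    private array $nodes = [];

    public function __construct(bool $useHtml5Parser)
    {
        // Only remember the preference; nothing is instantiated yet.
        $this->useHtml5Parser = $useHtml5Parser;
    }

    public function addHtmlContent(string $html): void
    {
        if ($this->useHtml5Parser) {
            // Case 1: HTML5. The parser is created lazily, on first HTML load.
            $this->html5Parser ??= new FakeHtml5Parser();
            $this->nodes[] = $this->html5Parser->parse($html);

            return;
        }

        // Case 2: older HTML, handled by the legacy strategy.
        $this->nodes[] = 'legacy:' . $html;
    }

    public function addXmlContent(string $xml): void
    {
        // Case 3: not HTML at all, so the HTML5 parser is never touched.
        $this->nodes[] = 'xml:' . $xml;
    }

    public function savesWithHtml5(): bool
    {
        // "If I used an HTML5 parser, I also use it for saving."
        return null !== $this->html5Parser;
    }

    public function createSubCrawler(): self
    {
        // Subcrawlers copy the parent's state instead of guessing again.
        $sub = new self($this->useHtml5Parser);
        $sub->html5Parser = $this->html5Parser;

        return $sub;
    }
}
```

With this shape, a crawler constructed with `true` that only receives XML never reports `savesWithHtml5()` as true, which is the mismatch the review comment points out.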

}

if ($useHtml5Parser ?? class_exists(HTML5::class)) {
$this->html5Parser = new HTML5(['disable_html_ns' => true]);

Member
When creating a child crawler, you should not rely on guessing but pass the existing value used for the parsing (or even better, assign the actual parser instead of instantiating a new one).

Contributor, Author

You mean in the `createSubCrawler` method?

Member
yes
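
For reference, the condition in the hunk under review treats `null` as "auto-detect": the `??` operator falls back to `class_exists()` only when no explicit boolean was given. A small sketch of that resolution, and of reusing the resolved value for a child instead of guessing again, might look like this. The function name `resolveUseHtml5Parser` and its `$parserClass` parameter are hypothetical and exist only so the example can run without html5-php installed.

```php
<?php

declare(strict_types=1);

/**
 * Resolves the tri-state flag: true/false are explicit choices, null means
 * "auto-detect" by checking whether the parser class can be loaded.
 *
 * $parserClass is a parameter only so the sketch can be exercised without
 * html5-php installed (assumption; the hunk hard-codes HTML5::class).
 */
function resolveUseHtml5Parser(?bool $useHtml5Parser, string $parserClass = 'Masterminds\\HTML5'): bool
{
    return $useHtml5Parser ?? class_exists($parserClass);
}

// Top-level crawler: guess once. stdClass stands in for an installed parser class.
$resolved = resolveUseHtml5Parser(null, \stdClass::class);

// Child crawler: pass the resolved value along, so class availability is
// no longer consulted even if the lookup would now give a different answer.
$childFlag = resolveUseHtml5Parser($resolved, 'No\\Such\\Parser');
```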

@stof (Member) commented Apr 3, 2019

Using a constructor argument has a big drawback (but the previous implementation using a setter that must be called before loading the content has the same drawback): most people in Symfony don't instantiate a Crawler themselves. They use BrowserKit which manages this instantiation. This means they don't have direct access to anything happening before adding content.

javiereguiluz added a commit to symfony/symfony-docs that referenced this pull request Apr 5, 2019

…y (tgalopin)

This PR was merged into the master branch.

Discussion
----------

[DomCrawler][WIP] Add note about the HTML5 parser library

Documentation for the PR symfony/symfony#29306.

Commits
-------

6e2f04a [DomCrawler] Add note about the HTML5 parser library

fabpot added a commit that referenced this pull request Apr 6, 2019

…on (tgalopin)

This PR was merged into the 4.3-dev branch.

Discussion
----------

[DomCrawler] Improve Crawler HTML5 parser need detection

| Q             | A
| ------------- | ---
| Branch?       | master
| Bug fix?      | kind of
| New feature?  | no
| BC breaks?    | no
| Deprecations? | no
| Tests pass?   | yes
| Fixed tickets | -
| License       | MIT
| Doc PR        | -

Live from #eu-fossa

Follow up of #29306

This PR introduces a better detection mechanism to choose when to parse using the HTML5 parser or not, and fixes a subcrawler parsing issue as well.

@stof I'd be super interested by your review :) !

Commits
-------

9bbdab6 [DomCrawler] Improve Crawler HTML5 parser need detection

"masterminds/html5":"<2.6"
},
"suggest": {
"symfony/css-selector":""

Contributor

Shouldn't there be an entry here that describes that you can load `masterminds/html5`?

Member
indeed, that would make sense.

Member

I'm against adding things under `suggest`, nobody reads them anyway. I would even go as far as removing the existing entries :)

I always read them!

Reviewers

@fabpot approved these changes
@stof left review comments
@nicolas-grekas approved these changes
@phoenixgao left review comments
@Shine-neko left review comments
@apfelbox left review comments

Assignees

No one assigned

Projects

None yet

Milestone

4.3

9 participants

@tgalopin @stof @fabpot @nicolas-grekas @phoenixgao @Shine-neko @apfelbox @javiereguiluz @carsonbot
