Movatterモバイル変換


[0]ホーム

URL:


Skip to content

Navigation Menu

Sign in
Appearance settings

Search code, repositories, users, issues, pull requests...

Provide feedback

We read every piece of feedback, and take your input very seriously.

Saved searches

Use saved searches to filter your results more quickly

Sign up
Appearance settings

[DomCrawler] Cache discovered namespaces#39097

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to ourterms of service andprivacy statement. We’ll occasionally send you account related emails.

Already on GitHub?Sign in to your account

Merged
fabpot merged 2 commits intosymfony:5.xfromsimonberger:crawler_cache_namespaces
Nov 27, 2020

Conversation

@simonberger
Copy link
Contributor

@simonbergersimonberger commentedNov 16, 2020
edited by fabpot
Loading

QA
Branch?5.x
Bug fix?no
New feature?yes
Deprecations?no
TicketsAddress#39067
LicenseMIT

Discovering namespaces is by far the most expensive task when filtering for nodes and the xpath contains prefixes.
WhenCrawler::filterRelativeXPath is called multiple times with (identical) prefixes the slowdown is huge.

This fix brings the runtime of the linked ticket down from 27 seconds to 9 seconds in my test. Compared to a pure PHP version which takes < 0.5 seconds the design of the crawler API is the limiting factor. There are still many repeated namespace queries caused by new Crawler instances. Ideas to solve this are discussed in the ticket.

@carsonbot
Copy link

Hey!

I see that this is your first PR. That is great! Welcome!

Symfony has acontribution guide which I suggest you to read.

In short:

  • Always add tests
  • Keep backward compatibility (seehttps://symfony.com/bc).
  • Bug fixes must be submitted against the lowest maintained branch where they apply (seehttps://symfony.com/releases)
  • Features and deprecations must be submitted against the 5.x branch.

Review the GitHub status checks of your pull request and try to solve the reported issues. If some tests are failing, try to see if they are failing because of this change.

When two Symfony core team members approve this change, it will be merged and you will become an official Symfony contributor!
If this PR is merged in a lower version branch, it will be merged up to all maintained branches within a few days.

I am going to sit back now and wait for the reviews.

Cheers!

Carsonbot

@carsonbotcarsonbot changed the titleCache discovered namespaces in Crawler[DomCrawler] Cache discovered namespaces in CrawlerNov 16, 2020
@simonbergersimonberger changed the title[DomCrawler] Cache discovered namespaces in Crawler[DomCrawler] Cache discovered namespacesNov 16, 2020
$crawler->add($domxpath->query($xpath,$node));
}

$crawler->cachedNamespaces =$this->cachedNamespaces;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Rather than copying the cached namespaced a second time to account for changes done increateDOMXPath, I'd rather ensure that the changes are done before creating the subcrawler.
This can be done by creating a single$domxpath thanks to the invariant that$node->ownerDocument === $this->document in Symfony 4+ (that was not enforced in Symfony 3), which could allow us to create a single$domxpath before thecreateSubCrawler call and use it in the loop.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

@simonberger why is this marked as resolved ?

@simonberger
Copy link
ContributorAuthor

Is the ticket (still) auto-closed by this PR? If so can I write it differently? Otherwise I just remove it.

@stof
Copy link
Member

@simonberger don't putfix just before the issue number

@simonbergersimonbergerforce-pushed thecrawler_cache_namespaces branch 2 times, most recently from4624d83 to840fe47CompareNovember 19, 2020 12:36
@stof
Copy link
Member

this PR looks good to me.

the last potential improvement is the one I suggested in#39067 (comment) to makeall subcrawlers share their namespace cache with the main crawler (instead of copying the parent cache at instantiation time), which helps when the subcrawlers are the one making the discovery.

@simonberger
Copy link
ContributorAuthor

Yeah I tested to use an ArrayIterator instead of an array which seems to solve that shared namespace issue.
I'll push a solution like this tomorrow.

$this->uri =$uri;
$this->baseHref =$baseHref ?:$uri;
$this->html5Parser =class_exists(HTML5::class) ?newHTML5(['disable_html_ns' =>true]) :null;
$this->cachedNamespaces =new \ArrayIterator();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

should beArrayObject rather thanArrayIterator IMO

derrabus reacted with thumbs up emoji

$namespace = ($node =$namespaces->item(0)) ?$node->nodeValue :null;
$this->cachedNamespaces[$prefix] =$namespace;
$this->cachedNamespaces->offsetSet($prefix,$namespace);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

usingthis->cachedNamespaces[$prefix] = $namespace; is fine. that's whatArrayAccess is for.

Copy link
ContributorAuthor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

I am thankful for your valuable improvement ideas but please let us not discuss every unimportant detail of my implementation. I know I have an ArrayIterator here in opposite to most public ArrayIterator|array|... usages so I decided to directly take its functions.

apfelbox reacted with confused emoji
@fabpotfabpot changed the base branch from4.4 to5.xNovember 27, 2020 07:00
@fabpotfabpotforce-pushed thecrawler_cache_namespaces branch fromb9fe3ba toc31d7efCompareNovember 27, 2020 07:03
@fabpotfabpotforce-pushed thecrawler_cache_namespaces branch fromc31d7ef toa8e85ecCompareNovember 27, 2020 07:04
@fabpot
Copy link
Member

Rebase on 5.x (we don't introduce new features in 4.4, and caching is a new feature). I've also made some cosmetic changes.

@fabpot
Copy link
Member

Thank you@simonberger.

@simonberger
Copy link
ContributorAuthor

@fabpot Not a really big thing but this changea8e85ec#diff-940b51ffa31dedac17aa49cd8e04e44aa8b3747782c7f9f27457b0510587c05dR1201 does not find the also cached null results of prefixes. One of the reasons I preferred to stick to the method usage everywhere.

@simonbergersimonberger deleted the crawler_cache_namespaces branchNovember 27, 2020 09:31
derrabus added a commit that referenced this pull requestDec 11, 2020
This PR was merged into the 5.3-dev branch.Discussion----------[DomCrawler] Fix null namespace issue in Crawler| Q             | A| ------------- | ---| Branch?       | 5.x <!-- see below -->| Bug fix?      | yes| New feature?  | no <!-- please update src/**/CHANGELOG.md files -->| Deprecations? | no <!-- please update UPGRADE-*.md and src/**/CHANGELOG.md files -->| Tickets       |Fix#39277 <!-- prefix each issue number with "Fix #", no need to create an issue if none exist, explain below instead -->| License       | MIT| Doc PR        |  <!-- required for new features -->In this case `isset` may cause an issue if null value would be stored in `$this->namespaces` or `$this->cachedNamespaces`, but never fetched, as if condition would return false. This PR is for 5.x, because namespace caching was introduced recently (#39097).Commits-------7b0a183 [DomCrawler] Fix null namespace issue in Crawler
@fabpotfabpot mentioned this pull requestApr 18, 2021
Sign up for freeto join this conversation on GitHub. Already have an account?Sign in to comment

Reviewers

@stofstofstof left review comments

@jderussejderussejderusse left review comments

@derrabusderrabusderrabus left review comments

@fabpotfabpotfabpot approved these changes

@chalasrchalasrAwaiting requested review from chalasr

@dunglasdunglasAwaiting requested review from dunglas

@lyrixxlyrixxAwaiting requested review from lyrixx

@srozesrozeAwaiting requested review from sroze

@wouterjwouterjAwaiting requested review from wouterj

@xabbuhxabbuhAwaiting requested review from xabbuh

@ycerutoycerutoAwaiting requested review from yceruto

Assignees

No one assigned

Projects

None yet

Milestone

4.4

Development

Successfully merging this pull request may close these issues.

7 participants

@simonberger@carsonbot@stof@fabpot@jderusse@derrabus@javiereguiluz

[8]ページ先頭

©2009-2025 Movatter.jp