[DomCrawler] Cache discovered namespaces #39097
Conversation
carsonbot commented Nov 16, 2020
Hey! I see that this is your first PR. That is great! Welcome! Symfony has a contribution guide which I suggest you read. In short:
Review the GitHub status checks of your pull request and try to solve the reported issues. If some tests are failing, try to see if they are failing because of this change. When two Symfony core team members approve this change, it will be merged and you will become an official Symfony contributor! I am going to sit back now and wait for the reviews. Cheers! Carsonbot
d02e765 to d256b2f Compare
    $crawler->add($domxpath->query($xpath, $node));
}
$crawler->cachedNamespaces = $this->cachedNamespaces;
Rather than copying the cached namespaces a second time to account for changes done in createDOMXPath, I'd rather ensure that the changes are done before creating the subcrawler.
This can be done thanks to the invariant that $node->ownerDocument === $this->document in Symfony 4+ (that was not enforced in Symfony 3), which would allow us to create a single $domxpath before the createSubCrawler call and use it in the loop.
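Roughly what that restructuring could look like, as a sketch only: it assumes the filterRelativeXPath() shape shown in the diff above, and treats createDOMXPath(), findNamespacePrefixes() and createSubCrawler() as the crawler's internal helpers referenced in this thread, not confirmed signatures.

```php
private function filterRelativeXPath(string $xpath)
{
    // Symfony 4+ guarantees $node->ownerDocument === $this->document for every
    // registered node, so one DOMXPath (and one round of namespace discovery)
    // is enough for the whole loop.
    $domxpath = $this->createDOMXPath($this->document, $this->findNamespacePrefixes($xpath));

    // Any namespaces discovered while creating the DOMXPath are already in
    // $this->cachedNamespaces at this point, so the subcrawler picks them up
    // at creation time and no second copy after the loop is needed.
    $crawler = $this->createSubCrawler(null);

    foreach ($this->nodes as $node) {
        $crawler->add($domxpath->query($xpath, $node));
    }

    return $crawler;
}
```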
@simonberger why is this marked as resolved?
c73a09f to a26870a Compare
a26870a to b4af1ba Compare
simonberger commented Nov 18, 2020
Is the ticket (still) auto-closed by this PR? If so, can I word it differently? Otherwise I'll just remove it.
stof commented Nov 18, 2020
@simonberger don't put
4624d83 to 840fe47 Compare
stof commented Nov 19, 2020
This PR looks good to me. The last potential improvement is the one I suggested in #39067 (comment): make all subcrawlers share their namespace cache with the main crawler (instead of copying the parent cache at instantiation time), which helps when the subcrawlers are the ones making the discovery.
simonberger commented Nov 19, 2020
Yeah, I tested using an ArrayIterator instead of an array, which seems to solve that shared-namespace issue.
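A minimal standalone sketch (not the PR's code) of why switching the cache from a plain array to an object such as \ArrayIterator or \ArrayObject makes it shared: arrays are assigned by value, objects by handle.

```php
<?php

// Plain array: the subcrawler gets an independent copy, so anything it
// discovers later never reaches the parent crawler's cache.
$parentCache = ['x' => 'http://example.com/ns/x'];
$childCache = $parentCache;
$childCache['y'] = 'http://example.com/ns/y';
var_dump(isset($parentCache['y'])); // bool(false)

// \ArrayIterator: both variables point at the same object, so a namespace
// discovered through the subcrawler is immediately visible to the parent.
$parentCache = new \ArrayIterator(['x' => 'http://example.com/ns/x']);
$childCache = $parentCache;
$childCache['y'] = 'http://example.com/ns/y';
var_dump(isset($parentCache['y'])); // bool(true)
```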
$this->uri = $uri;
$this->baseHref = $baseHref ?: $uri;
$this->html5Parser = class_exists(HTML5::class) ? new HTML5(['disable_html_ns' => true]) : null;
$this->cachedNamespaces = new \ArrayIterator();
Should be ArrayObject rather than ArrayIterator IMO.
$namespace = ($node = $namespaces->item(0)) ? $node->nodeValue : null;
$this->cachedNamespaces[$prefix] = $namespace;
$this->cachedNamespaces->offsetSet($prefix, $namespace);
Using $this->cachedNamespaces[$prefix] = $namespace; is fine. That's what ArrayAccess is for.
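A tiny illustration of that point, for reference: on an \ArrayIterator the bracket syntax and the explicit method call both end up in offsetSet().

```php
<?php

$cache = new \ArrayIterator();

// ArrayAccess sugar and the explicit call are equivalent here.
$cache['svg'] = 'http://www.w3.org/2000/svg';
$cache->offsetSet('xlink', 'http://www.w3.org/1999/xlink');

var_dump($cache->getArrayCopy());
// array(2) { ["svg"] => "http://www.w3.org/2000/svg", ["xlink"] => "http://www.w3.org/1999/xlink" }
```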
I am thankful for your valuable improvement ideas, but please let us not discuss every unimportant detail of my implementation. I know I have an ArrayIterator here, in contrast to most public ArrayIterator|array|... usages, so I decided to call its methods directly.
e1728a5 to b9fe3ba Compare
b9fe3ba to c31d7ef Compare
c31d7ef to a8e85ec Compare
fabpot commented Nov 27, 2020
Rebased on 5.x (we don't introduce new features in 4.4, and caching is a new feature). I've also made some cosmetic changes.
fabpot commented Nov 27, 2020
Thank you @simonberger.
simonberger commented Nov 27, 2020
@fabpot Not a really big thing, but this change a8e85ec#diff-940b51ffa31dedac17aa49cd8e04e44aa8b3747782c7f9f27457b0510587c05dR1201 does not find the cached null results of prefixes. That is one of the reasons I preferred to stick to the method usage everywhere.
This PR was merged into the 5.3-dev branch.

Discussion
----------

[DomCrawler] Fix null namespace issue in Crawler

| Q             | A
| ------------- | ---
| Branch?       | 5.x
| Bug fix?      | yes
| New feature?  | no
| Deprecations? | no
| Tickets       | Fix #39277
| License       | MIT
| Doc PR        |

In this case `isset` may cause an issue: if a null value is stored in `$this->namespaces` or `$this->cachedNamespaces` but never fetched, the if condition would return false. This PR is for 5.x, because namespace caching was introduced recently (#39097).

Commits
-------

7b0a183 [DomCrawler] Fix null namespace issue in Crawler
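For illustration only (not the Symfony code itself), this is the pitfall the description refers to: isset() treats a stored null like a cache miss, so a prefix that was already looked up and found to have no namespace URI would be re-discovered on every call.

```php
<?php

// A prefix whose lookup legitimately returned no namespace URI:
$cachedNamespaces = ['foo' => null];

// isset() cannot tell "cached as null" from "never looked up".
var_dump(isset($cachedNamespaces['foo']));            // bool(false)

// array_key_exists() (or offsetExists() on the object-based cache) can.
var_dump(array_key_exists('foo', $cachedNamespaces)); // bool(true)
```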
Discovering namespaces is by far the most expensive task when filtering for nodes and the XPath contains prefixes. When Crawler::filterRelativeXPath is called multiple times with (identical) prefixes, the slowdown is huge.

This fix brings the runtime of the linked ticket down from 27 seconds to 9 seconds in my test. Compared to a pure PHP version, which takes < 0.5 seconds, the design of the Crawler API is the limiting factor. There are still many repeated namespace queries caused by new Crawler instances; ideas to solve this are discussed in the ticket.
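A hypothetical usage pattern, just to show the shape of workload this change speeds up (the file name and XPath are made up, not taken from the linked ticket): every filter call has to resolve the same prefix, and every new Crawler instance starts discovery from scratch unless the cache is shared.

```php
<?php

use Symfony\Component\DomCrawler\Crawler;

$crawler = new Crawler(file_get_contents('sitemap.xml'));

// DomCrawler exposes the document's default namespace under the "default"
// prefix, so its URI must be known before each prefixed query can run.
foreach ($crawler->filterXPath('//default:urlset/default:url') as $node) {
    // Each new sub-Crawler repeats the namespace discovery unless it can
    // reuse the cache of the crawler it came from.
    $loc = (new Crawler($node))->filterXPath('//default:loc')->text();
}
```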