Hide old docs from search engines via canonical link#24

Closed

JazzTap wants to merge 3 commits into matplotlib:master from JazzTap:rel-canonical

JazzTap commented Jul 15, 2018

Project initiated with @JLegs to point search engines (and, gently, users) at the current docs. The approach is deliberately naive: delete the version string from the URL, and use an absolute link to matplotlib.org to avoid base-URL shenanigans.

All HTML is parsed through lxml by the 'tools/docs_deprecator' notebook or script. Besides whitespace and attribute reordering, the only expected changes are 1) a <link> at the bottom of <head> and 2) a <div> at the top of <body> (the bot-forwarder and human-forwarder, respectively).
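A minimal sketch of the version-stripping idea described above, using only the standard library rather than lxml. The version-directory pattern, the function names, and the `olddocs-message` id are assumptions made for illustration; they are not taken from the 'tools/docs_deprecator' script.

```python
import re
from html import escape

# Absolute base, so the result does not depend on any page's own base URL.
SITE = "https://matplotlib.org"

# Hypothetical pattern for a leading version directory such as "1.5.3/" or "2.0.0rc1/".
VERSION_DIR = re.compile(r"^\d+\.\d+(?:\.\d+)?(?:rc\d+)?/")


def canonical_target(relative_path):
    """Return the unversioned URL that an old page should forward to."""
    path = relative_path.lstrip("/")
    return f"{SITE}/{VERSION_DIR.sub('', path, count=1)}"


def forwarding_tags(relative_path):
    """Build the <link> for crawlers and the <div> banner for human readers."""
    target = escape(canonical_target(relative_path), quote=True)
    link = f'<link rel="canonical" href="{target}" />'
    # The id below is a placeholder name, not necessarily what the PR uses.
    banner = (
        '<div id="olddocs-message">You are reading an old version of the '
        f'documentation. The latest version is at <a href="{target}">{target}</a>.</div>'
    )
    return link, banner
```

For example, `canonical_target("2.0.0/api/pyplot_api.html")` yields `https://matplotlib.org/api/pyplot_api.html`. Note that this simply assumes the same path exists in the current docs; pages that were renamed or removed would need the page database mentioned below.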

Corresponding comment in the issue tracker: matplotlib/matplotlib#10016 (comment)

Note that in an ideal world we'd forward pages using a database of pages and their descendants, replacements, and so on. Computing that database automatically is compute-heavy, as discussed above.

JazzTap added 3 commits

July 14, 2018 22:43

initial munge

3fb62e1

surface 'target' to which old page should forward.

c19392b

Deprecator prefers to munge one site-version at a time.

a0b0973

Member

QuLogic commented Jul 15, 2018

There are a ton of extraneous changes here; is there a way to get it to do only the two things you mentioned? It's not just whitespace changes that are being added.

Author

JazzTap commented Jul 15, 2018

That's lxml normalizing all the docs to its own grammar. I didn't spot anything beyond attribute reordering, though. Are there semantic changes?

As I understand it, all the HTML is (was) machine-generated from source in the first place. Instead of parsing, though, one could carefully regex for the lines of the form </head> <body> and use them as the point of insertion.
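A rough sketch of that regex alternative, assuming each page contains exactly one `</head>` and one `<body ...>` tag. The function and parameter names are hypothetical. Everything outside the two insertion points is left byte-for-byte unchanged, which would keep the diff limited to the two added elements.

```python
import re

HEAD_CLOSE = re.compile(r"</head>", re.IGNORECASE)
BODY_OPEN = re.compile(r"<body\b[^>]*>", re.IGNORECASE)


def insert_forwarders(html, link_tag, banner_div):
    """Insert link_tag just before </head> and banner_div just after <body ...>."""
    # Callables are used as replacements so backslashes in the inserted
    # text are never interpreted as regex escapes.
    html, n_head = HEAD_CLOSE.subn(
        lambda m: link_tag + "\n" + m.group(0), html, count=1
    )
    html, n_body = BODY_OPEN.subn(
        lambda m: m.group(0) + "\n" + banner_div, html, count=1
    )
    if n_head != 1 or n_body != 1:
        # Refuse to write a half-modified page.
        raise ValueError("expected one </head> and one <body> tag")
    return html
```

The trade-off is that a regex does not understand HTML: a `<body` string inside a comment or script that appears before the real tag would be matched instead, so the output is worth spot-checking.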

Member

tacaswell commented Jul 15, 2018

There appears to be a way to get git to not add whitespace-only changes (https://stackoverflow.com/questions/3515597/add-only-non-whitespace-changes).

We should hold off on worrying about the whitespace for now; @JazzTap and I are at the SciPy sprints and agreed in person to focus first on using the cleanup.py script to also add these changes to the files at the top level of the domain.