insource:/regexp/ prefix:Template:Regex
| This template uses Lua: |
This template helps find the details in the wikitext of any page on the wiki. Normally searches ignore non-alphanumeric characters, but regular expressions (regex) accept all characters, plus metacharacters.
This template acts as a doorway by helping to develop a database query before running it on the wiki, and it does this by way of a search link that can also be used to share such discoveries. This template can also be used to learn the regular expression syntax of this version of Cirrus Search. You could use a bare {{search link}} to do all this, but this template saves a lot of typing (see below), so you only need to focus on entering a regexp.
An important alternative to using this template is performing a search directly with insource:"quotes-delimited arguments". These find wikitext without resorting to the regex searches this template does with insource:/slash-delimited arguments/ (which is a common syntax for regex searches). See § About CirrusSearch below for a better understanding of when this template is not needed. See below for other search tools.
Regular expressions are little computer programs, so it is characteristic of regex searches that they must always be tested to achieve their potential precision and thoroughness. But only a few of these intensive searches are technically able to run at a time against the database. This template minimizes your footprint, and guarantees that you will never run an untested regexp on every namespace in the wiki, even if your default search would let you do that. Use of this template enables the smallest possible footprint by using filters to limit the search domain. The first domain it targets is its own page in an ad hoc sandbox. Once your regexp pattern is honed, you add a search domain by setting |prefix=.
|pattern= or {{{1}}} | a regexp search pattern. Pattern is also the first positional parameter. |
|prefix= or {{{2}}} | search domain. Prefix accepts a namespace number, or n for the current namespace, or : for mainspace, plus it has the usual prefix: meaning. Defaults to its current page (fullpagename) if a pattern is given alone. |
|label= or {{{3}}} | search link label. Label is also a positional parameter. |
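The way these parameters combine into a query can be sketched in Python. This is only an illustration, not the template's actual Lua code: the function name `build_query`, the `current_page` default, and the small `NAMESPACES` table are assumptions made for the example.

```python
# Hypothetical subset of namespace numbers; a real wiki defines many more.
NAMESPACES = {0: "", 10: "Template", 12: "Help"}


def build_query(pattern, prefix=None, current_page="Template:Regex"):
    """Compose an insource regexp query restricted by a prefix filter."""
    if prefix is None:
        # No prefix given: stay on the current page (the ad hoc sandbox).
        domain = current_page
    elif isinstance(prefix, int):
        # A namespace number widens the domain to that whole namespace.
        domain = NAMESPACES[prefix] + ":"
    else:
        # Anything else is passed through with the usual prefix: meaning.
        domain = prefix
    return f"insource:/{pattern}/ prefix:{domain}"


print(build_query(". commas"))
print(build_query(r"and\(-\)", prefix=0))
```

With a pattern alone, the sketch targets only the current page; adding a namespace number is the later step of widening the search domain.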
Decide whether you really need a thoroughly precise regexp search, or whether you can find the general wikitext of interest with a plain insource: filter. Examples of the plain insource: search are in § Parameters hastemplate and insource. In those cases, {{search link}} is sufficient, and sandboxing is not being suggested.
Namespace plus pagename equals fullpagename.
The procedure here is an iterative, read-evaluate-modify cycle.
|pattern=. Prefix will be added later.
|prefix=. Start with a namespace. At the complete query, trim results via the first letter(s) of pagenames tacked onto the namespace's automatically given colon.
Step 6 is the core provision of this template.
Caveat emptor: if you change the target, you'll have to re-save it to the database. If you target it again immediately, you'll want to purge that target. You never have to purge if you just change |pattern=. Note that you can target any single page using prefix:.
Regular expressions are little computer programs, so it is characteristic of regex searches that they must be written while studying the target data, and tested to achieve their potential precision and thoroughness. However, only a few of these intensive searches are technically able to run at a time against the database.[1] A sandbox minimizes your footprint, and guarantees that you will never run an untested regexp on every namespace in the wiki, even if your default search would let you do that.
Although a normal search targeting the entire wiki will run quickly, a regexp search should target as few pages as possible by using filters in order to run quickly. A filter is part or whole of a database query. Filters include:
Order is not important because the search is optimized by the software before it is run.
To target just one page while experimenting with or developing a regex search, target a fullpagename. From the search box use the filter prefix:fullpagename. From the edit box (of any section of the page with the target data), you can always just write prefix:{{FULLPAGENAME}} and it will "expand" for you to the fullpagename. Although you can edit a history page, technically a "history page" is not a page (in the database), and so {{FULLPAGENAME}} there will point to the database version (not its own rendering). For the same reason, you cannot search for the wikitext on a page that is not already saved (to the database), although you can certainly change the search parameters again and again with no need to save them.
Fullpagename is namespace:pagename. Knowing this you can adjust your Prefix parameter. Although prefix can filter down to one page, it can filter up to a namespace, and it also accepts the beginning letter(s) of a set of pagenames if you want to reduce the namespace search domain.
Regex sandboxing uses an ad hoc sandbox made by editing any page containing the target data, and using it as a "sandbox" (not editing it to save it). It then develops by adding a search link that includes insource:/regexp/, with the filter prefix:{{FULLPAGENAME}} alongside.
Use of a sandbox enables the smallest possible footprint by using filters to limit the search domain. Once your regexp pattern is honed, you increase the search domain. A regex search is best run with filters, not alone, even if it is a polished regexp.
Rather than use the search box, where entering an equals sign and a pipe character, and "quotes around phrases" is a straightforward matter, it is still easiest to use a regex-based search-link template ({{regex}} or {{tlusage}}) on the page with sample data, because then you can focus on the target data there and on writing the regexp pattern. It is easier, that is, if you already understand how templates "escape" the pipe character and the equals sign. See Help:Template#Parameters for other important details.
The procedure here is an iterative, read-evaluate-modify cycle. Regex development requires that you study the target data while writing and rewriting its pattern.
Caveat emptor: if you change the target for an immediate retesting, you'll have to save and purge, but not if you just change the regexp.
As an ad hoc sandbox, you can show the wikitext of a section like this (already saved in the database), modify some of the patterns in the regex-search-link template calls on this page, do a Show Preview, and see what matches when you click on the newly formed regex search-link, all quite safely, and without changing a thing in the database.
The template calls that produce "1 ft/s, 2 ft2, 3 m/s, 4 m*s-2, 5 ft.s-2, 6 °C/J, and 7 J/C" appear in the wikitext of this section like this:
Note how the above targets are |numbered|, then click on the links below.
| Query | Search link | Answer |
|---|---|---|
| Q1. Using {{search link}}, does this page employ template Val? | {{sl|hastemplate: Val}} → hastemplate: Val | A. No, because this pagename is in Help not Article space. (Search link default.) 1300 search results. |
| Q2. Using {{search link}} responsibly, does this page use Val's fmt parameter? | {{sl|insource:/\{[Vv]al\{{!}}[^}]*fmt/ prefix:{{FULLPAGENAME}}}} → | A2.1. Look for 1 and 3 in the search results in bold text. (Adds an appropriate filter.) |
| Using {{regex}} instead... | {{slre|\{[Vv]al\{{!}}[^}]*fmt}} → | A2.2. Less typing than {{search link}}. |
| Using {{template usage}} instead... | {{tlre|Val|pattern=fmt}} → | A2.3. Easiest for templates. |
| Q3. Who uses u=ft OR ul=ft? (one-letter differs) | {{regex|ul?=ft}} → | A. Look for 1, 2, and 5 in bold text. |
| Using {{template usage}}... | {{tlre|val|pattern = ul?=ft}} → | Finds same pattern, but only inside a Val template. |
| Q4. AND of these, who also uses fmt=commas after that? | {{slre|ul?=ft.*commas}} → | A. No context shown, but article title is shown. A half a Bug? |
| Who has one space before the word "commas"? | {{slre|. commas}} → insource:/. commas/ prefix:Template:Regex | A. 1 but not 2. |
| Q5. Who uses either u or ul with "ft" OR uses "fmt=commas"? | {{slre|(ul? *= *ft{{!}}fmt *= *commas)}} | A. 1, 2, 3, and 5. (The pattern matches all possible spacing.) |
| Q6. Who uses ft or m, in |u= or |ul=? | {{slre|ul? *{{=}} *(ft{{!}}m)}} | A. 1, 2, 3, 4, and 5. Used {{!}} for the alternation metacharacter. Used {{=}}. (Could have used named |
| Q7. Who uses . or * in the unit code? | {{tlre|val|pattern = u *= *(\.{{!}}\*)/}} | A. 4 and 5. |
| Who uses a pipe? | {{regex|\|}} → insource:/\/ prefix:Template:Regex | All of them. |
| Q8. Who uses / or - within the |u= or |ul= parameter? | {{tlre|val|ul? *= *[^{{!}}}]+(\/{{!}}-)}} | A. 1, 3, 4, 5, 6 and 7. |
| Q9. Where is Val used in the template namespace for numbers only (no u, ul, up, or upl parameters)? | {{tlre|val|pattern = ~(u[lp].)|prefix = 10}} → hastemplate:"val" insource:/\{\{ *[Vv]al *\|[^}]*~(u[lp].)/ prefix:Template: | A. In the 30 or so templates listed. |
| Q10. Which articles use {{Convert}}'s and(-) option? | {{tlre|convert|pattern=and\(-\)| prefix=0}} → hastemplate:"convert" insource:/\{\{ *[Cc]onvert *\|[^}]*and\(-\)/ prefix:: | A. Coast Range Arc and Skipjack shad |
In Q2, notice how the MediaWiki software ignores the spaces around parameters, but how in Q4 the same MediaWiki software processes the spaces inside parameters. Q2 might have been solved with a plain insource:val fmt search because "fmt" and "val" are whole words, and fmt is rarely seen apart from inside Val. How about hastemplate:val insource:fmt?
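A few of the table's patterns can be tried offline before they are run on the wiki. The sketch below uses Python's `re` module as a stand-in for the search engine's regexp dialect, assuming the two behave alike for patterns this simple. The three sample lines are hypothetical and are not this page's actual wikitext. Note that `{{!}}` is only needed inside a template call; in plain regex the alternation is written `|`.

```python
import re

# Hypothetical Val calls, invented for illustration only.
samples = [
    "{{val|1|ul=ft/s|fmt=commas}}",
    "{{val|4|ul = m*s-2}}",
    "{{val|6|u=°C/J}}",
]


def matching(pattern, lines):
    """Return the 1-based numbers of the lines where the pattern is found."""
    return [n for n, line in enumerate(lines, 1) if re.search(pattern, line)]


print(matching(r"ul?=ft", samples))                        # strict spacing
print(matching(r"ul? *= *(ft|m)", samples))                # tolerant spacing
print(matching(r"(ul? *= *ft|fmt *= *commas)", samples))   # either condition
```

Comparing the first two calls shows why the later questions in the table add ` *` around the equals sign: the strict pattern misses calls written with spaces inside the parameter.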
The search engine can
A search matches what you see rendered on the screen and in a print preview. The raw "source" wikitext is searchable by employing the insource parameter. For these two kinds of searches a word is any string of consecutive letters and numbers matching a whole word or phrase. All other keyboard characters, like punctuation marks, brackets and slashes, math and other symbols, are not normally searchable.
By default Search will also stem the words and match them too. It automatically sorts results by the frequency and location of these, but also can boost page ranking by time, template usage, or even similarity to other pages.
Search is a search engine that does a full text search by querying an index database. It offers search syntax and parameters exceeding the capabilities and control of other public search engines that could search Wikipedia.
Say the search box is given two words. The search starts with two index lookups, and the two results are combined with a logical AND. But before they are displayed as search results, they must all be assigned a final score before the top twenty (listed on the first page) can be displayed, and they must be formatted with snippets and highlighting. Page ranking deals quickly with very large numbers of pages, by approaching things statistically, and taking several swipes through the data.
These attributes for a word earn that page a higher score:
There can be several other scoring mechanisms. The parameters that you can control are morelike, boost-template, and prefer-recent.
There are now eleven parameters for various approaches to searching the many namespaces. Four of the seven new parameters now offer to target these page characteristics: hastemplate and linksto, insource and insource:/regexp/. The other three now offer to target page ranking: morelike works all alone, a prefer-recent term can be added to any query, and there is now also a boost-template parameter. The other four, preserved in name only from the entirely rewritten previous version of Search, are intitle, incategory, prefix, and namespace.
Any search will feature one of these approaches
The concept of a search domain plays an important part in all this. By default it is just article space, but in general a search domain starts out as a set of namespaces, and ends up as all the pages in the search result.
One term of a query will set the search domain for another term in the same query. The order is optimized by the search engine. The query term1 term2 transforms the search domain twice to get those search results. For example, a bare namespace returns the pages of the namespace. The query term1 term2 regexp relies heavily on the first two terms to reduce the search domain size.
All terms in a query are indexed searches unless they are a regexp. Indexed terms run word-wise instantly, and a regexp runs character-wise slowly. Even the most basic use of a regexp, just to find an exact string, should always limit the size of its search domain to as little as possible. This can be as simple as adding a few terms (as covered below), because each term in a query tends to reduce the number of pages. Never run a bare regexp on the wiki, especially if your user profile is preset to Everything. The search engine limits the number of regexp searches that can run at once. Without the proper filter running alongside a regexp it will run for up to twenty seconds, and then incur an HTML timeout.
On the search results page, the initial search domain on which the query was run is indicated by the following, given in increasing power to override the others:
For example, if the namespace parameter is all, the size of the initial search domain will be the 65,117,124 pages in all namespaces: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 100, 101, 118, 119, 710, 711, 828, 829. A prefix parameter specifies just one of those namespaces, in whole or part. If the initial search domain is the default, Content pages, its size is the 7,138,285 pages in namespace 0 (article space).
A search can be set into a link to specialize and share searches: [[Special:Search/search]]. Such a query should always be fully specified, by specifying an initial search domain, so as to avoid user profile discrepancies. This way it gives the same results. For example, if more than one namespace is needed, use {{search link}}.[3]
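One way to see what "fully specified" means is to build the search URL by hand. The Python sketch below is an illustration under stated assumptions: it assumes the `profile=advanced` and `ns<number>=1` URL parameters select the namespaces explicitly, and the function name `search_url` is made up for the example.

```python
from urllib.parse import urlencode


def search_url(query, namespaces=(0,),
               base="https://en.wikipedia.org/w/index.php"):
    """Build a Special:Search URL that names its own search domain."""
    params = [
        ("title", "Special:Search"),
        ("profile", "advanced"),   # assumed: lets the ns parameters apply
        ("search", query),
        ("fulltext", "Search"),
    ]
    # One ns<number>=1 pair per namespace (assumed URL convention).
    params += [(f"ns{n}", "1") for n in namespaces]
    return base + "?" + urlencode(params)


print(search_url("insource:/ul?=ft/", namespaces=(0, 12)))
```

Because the namespaces travel inside the link, two readers with different default search domains should land on the same search domain.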
Other helpful approaches to the search engine features are
Greyspace characters are the non-alphanumeric characters: ~!@#$%^&*()_+{}|[]\:";'<>?,./. Any string of greyspace characters and/or whitespace characters is "greyspace".
Greyspace is ignored except where it has meaning as a modifier in syntax.
Parameters also accept words and phrases, but each can search their own index and interpret their own arguments, such as for
The delimiters:
Colon : character:
A search is a query with one or more terms. The query does not actually search the page database, but rather, a search queries a prebuilt, constantly maintained, search index database. When creating the search index of words on the wiki, or when entering a query, a word boundary is greyspace. Greyspace characters can create a multi-word_phrase. We must say tab and newline even though we cannot put those characters in our query; this is because of the important fact that the same analysis that is done on the wikitext is also done on the query. A word boundary is whitespace characters (tab, space, or newline) or greyspace characters. Greyspace characters and whitespace characters are all folded together as one, just as special characters like æ (ae) or á (a) are folded into the standard keyboard characters.
A phrase expresses an ordering of words,[4] and there are three ways to make one, depending on how aggressively you want the phrase to match.
Phrases in "quotation marks" are called an "exact phrase" because the wording is exact: stemming, fuzzy search, and wildcards are not used in an "exact phrase". Like the rest of Search, an "exact phrase" tolerates greyspace between words. Joining_with_non-alphanumeric(characters) only will employ stemming on the words. CamelCaseNaming or letter222number transitions match the phrase in greyspace, with stemming, and additionally match the word itself. Parameters can require the quotation marks to include whitespace in their input.
The wikitext is searched by employing the insource parameter. The insource parameter ignores greyspace characters too.
For example, to find the phrase http://en.wikipedia.org/wiki/Search_engine, use http://en.wikipedia.org/wiki/Search_engine, or use insource: "http en wikipedia org wiki search engine".
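The folding of greyspace into word boundaries can be approximated in a few lines of Python. This is a rough model rather than the search engine's real analyzer: it only splits on anything that is not a letter or digit, and it does not attempt the folding of characters such as æ or á. The function name `words` is an assumption for the example.

```python
import re


def words(text):
    """Split text into lowercase alphanumeric words, dropping greyspace."""
    return re.findall(r"[a-z0-9]+", text.lower())


url = "http://en.wikipedia.org/wiki/Search_engine"
print(" ".join(words(url)))
```

The URL and the spaced-out phrase reduce to the same word sequence, which is why either form of the query finds the same wikitext.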
When you search for a word, that word is just looked up in an index. An indexed search instantly concludes with all search result titles, without having to search the wiki itself.
Each word you see in a page's content (a title's content) is already in an index, where it points to all its other prearranged results. A word is indexed to a list of page names, where it is seen in the text, or it is seen in the title only.
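The idea of an index that maps each word to page names can be sketched as a toy inverted index in Python. The page titles and texts below are invented, the function names are assumptions, and the sketch leaves out stemming, scoring, and everything else the real search index does.

```python
import re
from collections import defaultdict


def build_index(pages):
    """Map each lowercase word to the set of titles whose text contains it."""
    index = defaultdict(set)
    for title, text in pages.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(title)
    return index


def search(index, query):
    """Look up every query word and combine the results with a logical AND."""
    terms = re.findall(r"[a-z0-9]+", query.lower())
    results = [index.get(term, set()) for term in terms]
    return set.intersection(*results) if results else set()


# Hypothetical pages, for illustration only.
pages = {
    "Ant": "Ants are eusocial insects.",
    "Bee": "Bees are flying insects.",
    "Cloud": "A cloud is an aerosol.",
}
index = build_index(pages)
print(sorted(search(index, "insects")))
print(sorted(search(index, "flying insects")))
```

Each lookup is a dictionary access, which is why indexed terms return at once while a regexp has to read page text character by character.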
Each indexed word is seen as
For transitions from lower to upper case, (or camelCase), and transitions from letter to number:
for or digit-letter these match singly or together. In other words you don't need the space, but that also works to find either "word" of a camel case or mixed alphanumeric word. You don't need a space, and non-alphanumeric characters are treated as that null space.
We may call these "word" characters or "alphanumeric" characters at times, as opposed to the "non-word" characters, which are ignored except as to function as a word boundary. Usually a word boundary is just a space character.
These words are case-insensitive: a-z is equivalent to A-Z, so Search box will navigate to a pagename regardless of capitalization (even though wikilinks and URLs must match capitalization apart from the initial character).
Each word is aliased to all its word-stems, so cloud, clouding, clouds, clouded, cloudy will all point to the same index entry.
In Search the characters !@#$%^&*()_+-={}|[]\:;'<>,.?/ are ignored. Any mix of whitespace characters and these non-word characters, we may refer to as grey-space. Grey-space, then, is all non-word characters except the double quote character, which is not ignored.
Grey-space is a string of one or more characters such as brackets and math symbols and punctuation and space. Now, a search-indexed word will be found between grey-space, and grey-space is an implied AND of two words in a search query, but the AND is not always implied: when two phrases exist side-by-side the AND is required.
Exceptions to what "words" are indexed are these portioned words:
The word boundary between such numeric portions and alphabetic portions may include grey-space or not, but a phrase search turns off portioning, because it is an "exact phrase search", the words in the phrase matching only alphanumeric words delimited by grey-space.
Words joined only by non-alphanumerics are treated like a phrase. So word1_word2&word3 is the same as "word1 word2 word3". However they will also match camelCase and letter-number transitions. An exact phrase search will not match camelCase or letter-number transitions. For example, terms like wgCanonicalNamespace and !wgCanonicalSpecialPageName can be found looking for canonical page name.
For example:
The following match the single term txt2regEx on a page: txt, 2, regex, reg, ex, txt2, 2reg, 2regex. None of those portions would match in a phrase search; only "txt2regex" would match.[5]
The following match the two terms 2 + 2: 2 or "2", 2 2 or "2 2", "2 2" or "2", "2+2" or 2+2, "2-2" or 2-2, "2.2" or 2.2. Each term is a query, and the grey-space is an AND.
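The portioning of a word like txt2regEx can be modeled roughly in Python. The sketch splits at lowercase-to-uppercase and letter-digit transitions, then lists every run of adjacent portions. It is a simplified guess at the behavior described here, not the search engine's own analyzer, and both function names are assumptions.

```python
import re


def portions(word):
    """Split at camelCase and letter/digit transitions."""
    return re.findall(r"[A-Z]?[a-z]+|[A-Z]+|[0-9]+", word)


def searchable_forms(word):
    """Every run of adjacent portions, joined and lowercased."""
    parts = [p.lower() for p in portions(word)]
    return {
        "".join(parts[i:j])
        for i in range(len(parts))
        for j in range(i + 1, len(parts) + 1)
    }


print(portions("txt2regEx"))
print(sorted(searchable_forms("txt2regEx")))
```

All eight portions listed for txt2regEx appear among the generated forms; the sketch also produces the whole word and txt2reg, which the list above does not mention.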
Stemming is a way to match meaning "ambitiously", to get the numbers up, for possible semantic matching, such that run_shoe also matches running shoes. Stemming is a spelling algorithm only distantly reliant on any dictionary.[6] The algorithm attempts to find the same word, but in all its word endings.
A fuzzy search will match a different word. Words (but not phrases) accept approximate string matching or "fuzzy search". A tilde ~ character is appended for this "sounds like" search. The other word must differ by no more than two letters.
But it can differ by one letter in these ways. A fuzzy search matches the word exactly plus words like it.
With wildcards you can specify which letters change, including the first two letters, and you can increase the number of letters that can change. Wildcards have their own rules:
While the word indexes are being built and updated, stemming automatically adds aliases to most entries. An actual dictionary is not used. Instead it runs an algorithm that applies generic English syntax rules for word endings. The results are imperfect.[7] Even misspelled words, non-words, and words with numbers in them are indexed and stemmed in this way. By adding different forms of the same word to the indexed search query, stemming is a standard method search engines use to aggressively garner more search results to then run a bunch of page-ranking rules against.
For example, stemming will alias cloud, clouds, clouded, and clouding. It will not alias the word cloudy, but it will alias the various forms of cloud to the non-word cloudion, because -ion is a common word ending.
Stemming is automatically turned off for insource searches:
To turn stemming off, put the word in quotation marks; this is an "exact phrase" search.[8]
For example: gameFolks, game!folks, game:folks matches FolksSoul
An "exact phrase" or a word will match in a title. Creating a phrase "with tilde"~ just turns on stemming (which is equivalent to forming a phrase by joining the words with_greyspace). But "exact phrase"~1 matches the wording in that order plus allows any one extra word to fall between the two words.
For example
"hitch4 hiker2" finds the two "words" in that order, (possibly separated by punctuation or brackets or other keyboard symbols like math symbols), and without the quotes finds them in the same article. In both cases the article is listed when the space satisfies the logical AND meaning.
hello_dolly does the same thing as "hello dolly" does, but the double quotes version offers a proximity filter. After the closing quote you add a tilde ~ and a number that indicates the total number of words allowed between all the terms.
Backward proximity works too, but includes the two end words between each segment. Proximity cannot make the last word proximate to the first. The proximity can be a large number, like 500 or 1000.
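The forward case of a proximity search can be sketched in Python as follows. This is a simplified model that only handles words appearing in the given order, with the number after the tilde treated as the total count of extra words allowed between the terms. It does not model backward proximity, and the function name `phrase_within` is an assumption.

```python
def phrase_within(page_words, phrase_words, slop=0):
    """True if phrase_words occur in order on the page with at most
    `slop` extra words, in total, falling between them."""
    if not phrase_words:
        return False

    def extend(pos, k, budget):
        # pos: page index of the previous matched word
        # k: index of the next phrase word to place
        if k == len(phrase_words):
            return True
        for gap in range(budget + 1):
            nxt = pos + 1 + gap
            if (nxt < len(page_words)
                    and page_words[nxt] == phrase_words[k]
                    and extend(nxt, k + 1, budget - gap)):
                return True
        return False

    return any(
        word == phrase_words[0] and extend(i, 1, slop)
        for i, word in enumerate(page_words)
    )


page = "word1 word2 word3".split()
print(phrase_within(page, ["word1", "word3"], slop=0))
print(phrase_within(page, ["word1", "word3"], slop=1))
```

With no slop the two words must be adjacent; allowing one extra word lets word2 fall between them.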
Say a page has word1 word2 word3 in that order.[9]
Two search terms with no quotes is two filters, and a bunch of page-ranking rules.
Truth logic is AND, OR, and not.
Logical OR increases results, whereas logical AND decreases them. Logical not is a good way to refine a query by removing any kind of term except the prefix parameter.
For example while -refining -unwanted search results. For example credit card -"credit card" finds all articles with "card" and "credit", but without the exact phrase "credit card".
Prefix and namespace are the only positional parameters, and namespace is an unnamed search parameter. One or the other of them is used in a query to override the initial search domain set by user profile or by the search bar. They aren't used together: prefix overrides namespace.
The namespace argument must be at the beginning of a query, and the prefix: parameter must be at the end of a query.
Namespace: is an unnamed search parameter that goes at the beginning of a query.[10] The namespace is followed by a colon, followed by zero or more whitespace characters, and matches a namespace name. The namespace names and "all" work as expected, but seeing one in the search box does not guarantee that it represents the search results, as explained below.
In addition to the usual namespace names and their aliases
Pages with namespaces outnumber pages without them 7 to 1.
On the search bar at the search results page
These differ from namespace "all" by matching your search terms inside a pdf on a file page; that item on the search results page says "(matches file content)".
For example file:"885.7 seconds" matches inside a pdf, but all:"885.7 seconds" does not.
prefix:namespace:string filters a namespace down to one or more pages where string matches the pagename's beginning characters.[13] For example, prefix:help:t finds Help pagenames that begin with "T".
Prefix can perform the function of the namespace filter, plus it can isolate a single article whereas intitle cannot. Prefix cannot isolate a single page if it has subpages.
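The effect of a prefix filter, including the subpage caveat, can be illustrated with a short Python sketch. The page names are hypothetical, the function name is made up, and treating the comparison as case-insensitive is an assumption drawn from the prefix:help:t example rather than a documented rule.

```python
def prefix_filter(fullpagenames, prefix):
    """Keep the full page names that begin with the given prefix.
    Case-insensitive comparison is an assumption for this sketch."""
    wanted = prefix.lower()
    return [name for name in fullpagenames if name.lower().startswith(wanted)]


# Hypothetical page names, for illustration only.
pages = [
    "Help:Table",
    "Help:Template",
    "Help:Searching",
    "Help:Searching/Draft",
    "Template:Regex",
]
print(prefix_filter(pages, "help:t"))
print(prefix_filter(pages, "Help:Searching"))
```

The second call returns the page together with its subpage, since the subpage's full name begins with the same characters.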
An alternative to a prefix query is Special:PrefixIndex:
Comparing the namespace and prefix parameters:
The following methods set an initial search domain by namespace:
These are in the order of precedence. A prefix overrides a namespace overrides the GUI. The argument to the prefix parameter is a fullpagename, which conveys a namespace.
When alternating search domains, with the various techniques, and because of their priorities, it deserves repeating: check the search bar indication; it is most subtle.[14] The Advanced namespace selection pane from the search bar is not so subtle. It will remain for as long as the earlier selection "remember selection for future searches" is in effect. You can "remember" article space and then either 1) press Content, 2) choose another search bar search domain, or 3) remove all instances of &profile=advanced from the URL.
These five search parameters filter a namespace according to an input word or phrase.
These parameter names must be in all-lowercase letters.
Intitle finds a word or phrase in a pagename. Like a word or phrase search, stemming and fuzzy searches can apply.
To find a match in a redirect title, or to apply a proximity search to a title, you can rely on page ranking software to boost title matches before content matches. So a basic word or phrase search, or proximity search, is an alternative to intitle.
For example
Incategory has the general format
and selects from the pages section of given category pages, those pages that are also in the search domain.
Because many pages outside the mainspace are also categorized, the counts often won't match the category unless the search domain is the entire wiki:
Multi-category input counts a page only once. The following two categories have 209 pages in article space, with six pages found in both categories:
On the other hand these are disparate categories:
Because of the nature ofWikipedia:categorization these categories share no pages:
Categories and Search are synergistic.
In the following examples, note how the page description in the category namespace shows category sizes instead of page sizes.
Hastemplate finds pages that transclude a given template. It finds template usage, not just a name pattern, because it will find all pages where the template content itself was used in any way. The results differ slightly depending on the alias you give.
Hastemplate
If you don't find the searched template name on the wikitext of the page, it can mean either that you gave the canonical pagename but it found an alias, or that it was called as a secondary template by way of a template that is shown in the wikitext. To find visible (primary) calls only, use insource.
Insource:term finds a word or phrase in wikitext.
Unlike a normal search, insource doesn't find things "sourced" by a transclusion.
Insource targets wikitext in two ways. They look similar, but the regexp form employs the slash / character to delimit the regexp.[16]
A basic regexp is an easy way to find specific /"exact strings"/, as shown below. The double quotes are field delimiters. They are escape characters which quote all the set of characters between them, and keep their interpretation literal (keep any metacharacter interpretation from occurring).
An advanced regexp uses the metacharacters to program general string patterns. It finds everything, even pieces and parts of words, conveying no notion of "words", but only that of a string of characters in a sequence. Metacharacters are interpreted unless quoted by a backslash, double quotes, or square brackets. See the section on regex. The obvious example is that you must quote any slash in your pattern so it won't be interpreted as the closing slash delimiter, using \/ instead of / to match a literal slash. A regexp interprets all metacharacters. Testing a regexp pattern responsibly requires limiting the search domain.
Abusing regexp will not harm Wikipedia performance, but it limits regex search information from flowing elsewhere.
Only regex interprets greyspace characters. The regular insource, as everywhere else, ignores greyspace characters. So insource:"M S" matches m/s, as do insource:"M-S" and insource:"m=s". But insource:/M\/S/ will match it, and the filtered version will too: insource:"M/S" insource:/M\/S/. The insource:"word1 word2" filter is the most obvious filter for insource:/word1 word2/, where the two wikitext words are only separated by punctuation and space. Say the target string is {{Val|9999|ul=m/s|fmt=commas}}:
insource:"val 9999 ul m s fmt commas" → match
hastemplate:val insource:"9999 ul" → match
hastemplate:val insource:"999" → no match
hastemplate:val insource:"fmt commas" → match
hastemplate:val insource:"ul m" → match
hastemplate:val insource:"ul M S" → match
hastemplate:val insource:fmt → match
Insource matches words sequentially, but the match could occur anywhere on the page, not necessarily inside the {{template markup}}. For this there is {{template usage}}, and it matches any regex inside the template.
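The match and no-match outcomes listed for the Val target string can be reproduced with a small word-sequence model in Python. This is an approximation of a plain insource phrase search, assuming only that greyspace is dropped, case is ignored, and whole words must appear consecutively. The function names are invented for the sketch.

```python
import re


def tokens(text):
    """Lowercase alphanumeric words, with greyspace discarded."""
    return re.findall(r"[a-z0-9]+", text.lower())


def insource_phrase(wikitext, phrase):
    """True if the phrase's words appear consecutively in the wikitext."""
    hay, needle = tokens(wikitext), tokens(phrase)
    n = len(needle)
    return n > 0 and any(
        hay[i:i + n] == needle for i in range(len(hay) - n + 1)
    )


target = "{{Val|9999|ul=m/s|fmt=commas}}"
for phrase in ["9999 ul", "999", "fmt commas", "ul m", "ul M S", "fmt"]:
    print(phrase, "->", insource_phrase(target, phrase))
```

The query 999 fails because matching is by whole words, while ul M S succeeds because the slash and the letter case are both ignored.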
For thorough precision, use /regex/. For example, to find any bare URL inside <ref name=name>...</ref>, with [external link brackets label], with possible ref name=name, you then can't use the simpler insource:"ref http server com". Taking a cautious approach, before trying the full regexp, create a search domain under 10,000 pages. Starting with two filters, prefix and insource:
insource: "ref http" prefix:A → 98000 is too many to start.
insource: "ref http" prefix:AA → 1000 is good.
insource:/\<ref[^>]\> *\[?https?:\/\/[^][<> "]+\]? */ → zero for prefix:AA, one for prefix:AB
insource:/\<ref[^>]\> instead, and then try prefix:AA, zero; try AB, one.
[^>]*.
insource: "ref http" insource:/\<ref[^>]*\> prefix:AB. There are 3700, and that is OK.
We have the only possible filter insource: ref prefix:AA. That filter produces a regex search domain of only 2300. The filter insource: ref prefix:A produces a search domain of 264000. Running the regex on that many pages is possible, and produces 64000 results.
To find a more targeted URL, say yahoo.brand.edgar.com, use insource: "http yahoo brand edgar com" (or cut and paste the entire URL, slashes, dots, and all; it doesn't matter). Do another search with the https version. These searches are capable of more flexibility than Special:LinkSearch. No filter is needed, but every search always benefits from extra information: any word, any phrase, and most parameters.
Linksto reports wikilinks to a page name.
Linksto reports wikilinks to a page name, even if the wikilink is
Linksto can differ from the "What links here" tool, because the search domain for "What links here" is all. Linksto search results are in your default search domain. (Also linksto reports the count, as do all searches.)
In addition to wikitext it searches inside a page's transcluded content.
first, and then scan the contents.[17] For example
will report a list of 300 articles that link to it, as will "What links here". But Mozart and scatology is actually linked only 15 times by content authors. The rest are due to Mozart and scatology in Template:Wolfgang Amadeus Mozart on the unwanted pages. The template is wanted, but the "links to" reference is probably not.[18]
The trick to getting around this, and just finding all authorship links to an article, is a regexp search:
That search will find articles only because the initial : limits the initial search domain to article space, no matter how your default search domain happens to be set. It will find all of the links many times more quickly than a bare regexp would, because the first insource term instantly creates the refined search domain that sets the proper limits for the regexp search. A regexp can accommodate the variations found in the wikitext allowed by the permissions of wikilinks: 1) the metacharacter * allows for "zero or more" space characters before and after the title, 2) the [character class] at the beginning allows for the relaxed capitalization of the first character in any pagename, and 3) the character class at the end finds the link whether it is labeled via the pipe character | or closed via the square bracket ] of the wikilink.
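The three regexp features just enumerated can be put together in a Python sketch that builds a wikilink pattern for a given title. This is an illustrative reconstruction, not the exact query used on the wiki; the function name `wikilink_regex` is hypothetical, and Python's `re` stands in for the search engine's regexp dialect.

```python
import re


def wikilink_regex(title):
    """Build a pattern for [[title]] or [[title|label]], allowing spaces
    around the title and either case for its first letter."""
    first, rest = title[0], title[1:]
    if first.isalpha():
        first_part = f"[{first.upper()}{first.lower()}]"
    else:
        first_part = re.escape(first)
    # " *" tolerates spaces; the final class accepts a pipe or a bracket.
    return r"\[\[ *" + first_part + re.escape(rest) + r" *[|\]]"


pattern = wikilink_regex("Mozart and scatology")
for text in [
    "See [[Mozart and scatology]].",
    "See [[ mozart and scatology |his letters]].",
    "{{Wolfgang Amadeus Mozart}}",
]:
    print(bool(re.search(pattern, text)), text)
```

A link written directly in the wikitext matches, while a page that only transcludes the navigation template does not, which is the distinction between authorship links and template-generated links.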
Links to transclusions are handled by hastemplate.
A page's overall score determines its place in the search results.
A better match will raise the score.
Wikiproject "importance" and article quality assessments can factor in. Searching from a page, its categories, wikidata, and geo-location can factor in.
Knowing this, you may be able to better find, for example, a half-remembered title. Using intitle may skew the results too much because of the order of the words. Use those in a word search, and depend on page ranking. The titular words will show up on top.
To get an idea of how CirrusSearch might work, see mw:Search/Old#Search_Weighting_Ideas.
To sort search results by date, use prefer-recent. To sort search results by template usage, use boost-templates.
The morelike search parameter lists all articles that compare in word frequency and word length to one or more given articles.
Morelike calculates a multi-word search.
See the search words highlighted in the snippet.
Morelike looks up the given pagename(s) in the search index, creates a word-frequency aggregate and a word-length aggregate from all the words, and calculates a multi-word search based on those, plus internal, variable settings. It is an expensive search.
For example, say you search for
then pick a name from that list and add it
then add more names, until you have five input pagenames. Then you could begin blindly adjusting this automatically calculated morelike query, saying the following sorts of things: Make the calculated query
Then, say, you adjust the number of input pagenames that have a word to two (out of five). https://en.wikipedia.org/w/index.php?title=Special:Search&profile=default&search=morelike:ant%7Cbee%7Cwasp%7CEusociality%7Ctermite&fulltext=Search&cirrusMtlUseFields=yes&cirrusMltFields=opening_text&limit=1150
It can also find similar articles based on just the title, or on just the headings, or on just the lead section.
The search results depend on internal (Mlt, "More like this") variables, settable via the URL, concerning which words to search with:
| URL parameter | Meaning |
|---|---|
| &cirrusMltMinDocFreq | Minimum number of articles that must contain a search word. |
| &cirrusMltMaxDocFreq | Maximum number of articles that may contain a chosen word. |
| &cirrusMltMaxQueryTerms | Maximum number of search words. |
| &cirrusMltMinTermFreq | Minimum word frequency of a chosen word. |
| &cirrusMltMinWordLength | Minimal length of a term to be considered. Defaults to 0. |
| &cirrusMltMaxWordLength | The maximum word length above which words will be ignored. Defaults to unbounded (0). |
| &cirrusMltFields | A comma-separated list of the fields to use. Allowed fields are title, text, auxiliary_text, opening_text, headings and all. |
| &cirrusMltUseFields (true or false) | Use only the field data. Defaults to false: the system will extract the content of the text field to build the query. |
| &cirrusMltPercentTermsToMatch | The percentage of terms to match on. Defaults to 0.3 (30 percent). |
For example, here is what the address bar (turned search bar) looks like for a morelike search for lead sections of two articles, as compared to other lead sections: https://en.wikipedia.org/w/index.php?title=Special:Search&profile=default&search=morelike:William+H.+Stewart%7CLeroy+Edgar+Burney&fulltext=Search&cirrusMtlUseFields=yes&cirrusMltFields=opening_text Notice the end containing the two added URL parameters that activated a morelike capability.
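A URL of this shape can also be assembled programmatically. The following is a minimal Python sketch, assuming the parameter names exactly as spelled in the table above (cirrusMltUseFields, cirrusMltFields) and the value "yes" as used in the example URL; the helper name morelike_url is hypothetical, and whether the server accepts every combination of settings is not verified here.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

SEARCH_ENDPOINT = "https://en.wikipedia.org/w/index.php"


def morelike_url(pagenames, fields=None, **mlt_settings):
    """Return a Special:Search URL for a morelike query.

    mlt_settings are passed through unchanged as URL parameters,
    for example cirrusMltMaxQueryTerms=10.
    """
    params = {
        "title": "Special:Search",
        "profile": "default",
        # Input pagenames are joined with the pipe character.
        "search": "morelike:" + "|".join(pagenames),
        "fulltext": "Search",
    }
    if fields:
        # Restrict the comparison to the named fields, e.g. opening_text.
        params["cirrusMltUseFields"] = "yes"
        params["cirrusMltFields"] = ",".join(fields)
    params.update(mlt_settings)
    return SEARCH_ENDPOINT + "?" + urlencode(params)


url = morelike_url(["William H. Stewart", "Leroy Edgar Burney"],
                   fields=["opening_text"])
print(url)
# Decoding the query string shows the search term the server would receive.
print(parse_qs(urlsplit(url).query)["search"])
```

Letting urlencode do the percent-encoding avoids hand-typing %7C for each pipe between pagenames.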
You can sort search results by date.
It goes anywhere in the query. It defaults to 160 days as "recent", and applies its boost formula to 60% of the score. The formula is not the usual multiplier; it is an exponential multiplier, potentially much more powerful. This enables it to work where "recent", instead of being the default 160 days, can be as little as 9 seconds. If your "recent" means 9 seconds, use prefer-recent:0.6,0.0001
For example, if you're only interested in the relatively few articles that have changed in the last week, use 7 instead. How this works is that all articles older than seven days are only boosted half as much, and all articles older than 14 days are boosted half as much again, and so on.
The boost is more than the usual multiplier; it is exponential. The factor used in the exponent is the time since the last edit. The bigger the time since the last edit, the less the boost. The formula is $e^{-t}$, where t is the time since the last edit, expressed either in days or in the interval of interest.
Add prefer-recent to the beginning of a search. It will give the more recently edited articles a boost in the search results. The general form is
This parameter accepts two comma-separated arguments that adjust the default settings. By default this will scale 60% of the score exponentially with the time since the last edit, with a half-life of 160 days. So the default is prefer-recent:0.6,160.
This can be changed to increase the weight:
or decrease it:
The proportion_of_score_to_scale must be a number between 0 and 1 inclusive. The half_life_in_days must be greater than 0 but allows decimal points, and so works pretty well to sort close edit times if very small.
For example, prefer-recent:0.6,0.0001 operates with a half-life of 8.64 seconds.
This will eventually be on by default for Wikinews.
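The half-life arithmetic above can be checked with a short Python sketch. It assumes a plain halving model (the boosted portion halves once per half-life, as the text describes); the function names are hypothetical, and the way the search engine combines the boosted portion with the rest of the score is not modelled.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def half_life_seconds(half_life_in_days):
    """Convert a half-life given in days to seconds."""
    return half_life_in_days * SECONDS_PER_DAY


def recency_factor(age_in_days, half_life_in_days=160):
    """Fraction of the boost left for a page last edited age_in_days ago.

    Assumed model: the factor is 1.0 for a fresh edit and halves
    once per half-life.
    """
    if half_life_in_days <= 0:
        raise ValueError("half_life_in_days must be greater than 0")
    return 0.5 ** (age_in_days / half_life_in_days)


def prefer_recent_term(proportion=0.6, half_life_in_days=160):
    """Format the search term, validating the ranges stated in the text."""
    if not 0 <= proportion <= 1:
        raise ValueError("proportion must be between 0 and 1 inclusive")
    if half_life_in_days <= 0:
        raise ValueError("half_life_in_days must be greater than 0")
    return "prefer-recent:{},{}".format(proportion, half_life_in_days)


print(prefer_recent_term())                  # the documented default
print(half_life_seconds(0.0001))             # roughly 8.64 seconds
print(recency_factor(7, 7), recency_factor(14, 7))
```

With a half-life of 7 days, a page edited 7 days ago keeps half of the boost and a page edited 14 days ago keeps a quarter.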
boost-templates:" " adds weight to pages with the given template or templates (plural). Using this search parameter overrides the normal template-boosting function of Search. Don't use this search parameter without supplying the weight-boosting argument unless you mean to disable the template weighting function for the search.
The general format is
You see, normally the system message[19] titled MediaWiki:cirrussearch-boost-templates boosts the score of the following fullpagenames:
Template:Featured article|200%
Template:Featured picture|200%
Template:Featured sound|200%
Template:Featured list|175%
Template:Good article|150%
Template:Sockpuppet category|5%
Template:Maintenance category|5%
Template:Hidden category|5%
Template:Tracking category|5%
Template:Category class|5%
Template:Category importance|5%
Template:CatTrack|5%
Template:Template category|5%
These are the actual template names and their actual boosts. These are replaced during boost-templates usage.
For example, a search for "phenom" AND "lecture", with the templates Search link and regexp having the weighting score of the pages they are on multiplied by 1.5 and 2.25 respectively, ignoring all other templates (halting the addition of any score for any other template):
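A term of this kind can be formatted with a few lines of Python. This is a sketch under assumptions: it follows the Template:Name|NNN% layout of the system message, separates entries with spaces inside one pair of quotes, and uses the illustrative fullpagenames Template:Search link and Template:Regexp; the helper name is hypothetical.

```python
def boost_templates_term(multipliers):
    """Format a boost-templates term from {fullpagename: multiplier}.

    A multiplier of 1.5 becomes 150%, matching the percentage style
    of the system message.
    """
    parts = []
    for template, multiplier in multipliers.items():
        percent = round(multiplier * 100)
        parts.append("{}|{}%".format(template, percent))
    return 'boost-templates:"' + " ".join(parts) + '"'


# Illustrative template names; the two search words come first here,
# although the term may sit anywhere in the query.
query = "phenom lecture " + boost_templates_term({
    "Template:Search link": 1.5,
    "Template:Regexp": 2.25,
})
print(query)
```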
Boost-templates differs from hastemplate in
If you just want your search results to include only pages with certain templates, use hastemplate one or more times instead, to filter out pages that don't. Otherwise, choose a multiplier similar to the system message shown above. Multiplying a page score by 10 is done with 1000%; it will probably mask all other weighting functions, such as "when the search words match in the title", which will then have little effect on the presentation of search results. It is not recommended because it affects the order of the entire list.
Either hastemplate or boost-templates can go anywhere in the query, with other terms on either side of it.
Relevant issues in CirrusSearch:
cm2,m3 does not find m3, where the superscripts are Unicode characters.
Workarounds
Troubleshooting
All pages on Wikipedia are scanned and indexed by Wikipedia's own search engine. The entire wiki is treated as one "full text" kept in a separate database (an "index") built just for searching. It's like the index in a book, but practically every word and every number is indexed to every page.[20]
Since each word in the prebuilt search index already points to the pages that contain it, a keyword search usually corresponds to a single record lookup in the index. (This is also true for phrases, to a certain extent.) "Index searches" take basically no time to execute. They are cheap and plentiful.
There are separate indexes kept updated for:
Any text transcluded from a template is indexed as if it were really present on its target page. (In other words, by default, a keyword search is done on the text of the rendered Wikipedia page, not on the page source itself. However, you can change this by using insource:keyword to search the source markup instead of the rendered page.)
Preparing and maintaining the search indexes is done by Wikipedia's servers, in the background, in near real time. As soon as you save the page, a few seconds later you can search for the changes you just made. For templates that are transcluded onto many pages, the propagation of those changes to all the pages in the index might take a while.
The index is based on alphanumeric characters; it stores no information on non-alphanumeric characters. If you type any punctuation or brackets into the search box when doing an indexed search, those characters will be silently discarded.
A basic indexed search
Instead of doing a basic indexed search on keywords, you can perform a regex search, which bypasses the index. A regex search scans the text of each page on Wikipedia in real time, character by character, to find pages that match a specific sequence or pattern of characters. Unlike keyword searching, regex searching is by default case-sensitive, does not ignore punctuation, and operates directly on the page source (MediaWiki markup) rather than on the rendered contents of the page.
To perform a regex search, use the ordinary search box with the syntax insource:/regex/ or intitle:/regex/. The expression regex denotes a regular expression in MediaWiki-flavored regular expression syntax.
Because regex searching scans each page character by character, it is generally much slower than an index search. You can, and should, add additional search terms when using insource:/regex/ to reduce the amount of text being processed. For example:
polish insource:/polish/ finds pages that match a case-insensitive stemmed keyword search for "polish" (including "polished" or "polishing"); then does a case-sensitive regex search within those pages. Only pages that match both filters are returned.
insource:polish insource:/polish/ is similar, but starts with a case-insensitive search of the source markup instead of the rendered page (so it will find usages like Poles, and not find transclusions).
intitle:, incategory:, and linksto: are excellent filters.[clarification needed]
hastemplate: is a good filter.[clarification needed]
Adding an index-based search term to reduce the amount of text being scanned is important simply to make your own regex search finish in a reasonable amount of time. Regex searches that take too long will "time out" and return only partial results. Overuse of slow regex searches might cause temporary throttling of the feature for yourself and/or everyone on Wikipedia. (However, you cannot affect the site performance of Wikipedia as a whole simply by abusing regex search.) Remember that a single regex search can take multiple seconds, and there are currently 51,551,171 registered users on Wikipedia. Use regex search responsibly.
MediaWiki's regular expression syntax works like this:
insource:/C-3p0/ will search for pages containing the literal string "C-3p0" (case-sensitive).
The metacharacters are . + * ? | { [ ] ( ) " \ # @ < ~. Any metacharacter can be escaped by preceding it with a backslash \. Preceding any other character with a backslash is harmless. For example, insource:/yes\.\no/ will search for pages containing the literal string "yes.no" (case-sensitive). Regex experts should note that \n does not mean "newline", \d does not mean "digit", and so on: in MediaWiki syntax, the only use of \ is to escape metacharacters.
/ is special because it indicates the end of the regex. For example, insource:/yes/no/ is treated the same as insource:/yes/ no (because the keyword search for no/ ignores punctuation). The / character must be backslash-escaped everywhere it appears inside a regex, even inside square brackets or quotation marks.
. matches any single character. For example, insource:/yes.no/ is matched by yes/no, yes no, yesuno, etc.
( ) group a sequence of characters into an atomic unit.
| goes between two sequences and matches either of them. For example, insource:/a(g|ch)e/ matches either age or ache.
+ matches the preceding character or group one or more times. For example, insource:/ab+(cd)+/ is matched by abcd, abbbcd, abbcdcd, etc. insource:/a(g|ch)+e/ matches agge, achgchchggche, etc.
* matches the preceding character or group any number of times (including zero). For example, insource:/ab*(cd)*/ is matched by a, abbb, acdcd, etc.
? matches the preceding character or group exactly zero or one times.
{ } match the preceding character or group a fixed number of times. For example, insource:/[a-z]{2}/ matches exactly 2 lowercase letters in a row. insource:/[a-z]{2,4}/ matches any string of 2, 3, or 4 lowercase letters. insource:/[a-z]{2,}/ matches any string of 2 or more lowercase letters.
[ ] introduce a character class, which matches a single instance of any of the characters in the class. For example, insource:/[Pp]olish/ matches both Polish and polish.
Characters inside square brackets generally don't have to be escaped, although escaping them remains harmless, and / still needs to be escaped everywhere. For example, insource:/[.\/\]\n]/ matches a single instance of ., /, ], or n.
Inside square brackets, ^ (if it appears first of all) represents negation, and the character - (unless it appears first or last) represents a range. For example, insource:/[A-Za-z0-9_]/ matches any alphanumeric character or underscore, and insource:/[^A-Za-z]/ matches any non-alphabetic character.
< > stand for numbers treated as numbers, not characters. For example, insource:/AD <476-1453>/ is matched by AD 476, AD 477, ... AD 1452, AD 1453, but not AD 1474. (But it will also match the first six characters of AD 4760.)
~ "looks ahead" and negates the next character or group. For example, insource:/crab~(cake)c/ should match the first five characters of crabclaw but not the first five characters of crabcake.[clarification needed]
There are a few additional quirks of the syntax:
@ is a synonym for .* (match any sequence of characters at all).
insource:/0/ fails, although insource:/1/ and insource:/\0/ both succeed.
" " are an escape mechanism, like square brackets or the backslash. For example, insource:/".*"/ means the same thing as insource:/\.\*/.
# is also a metacharacter and must be escaped.[clarification needed]
For this template, it is necessary to enter the pipe character using \{{!}} to find a literal pipe character in the wikitext.
\n does not mean "newline", \d does not mean "digit", and so on.
^ does not mean "beginning of text" and $ does not mean "end of text". Searching from the beginning or end of a Wikipedia page is not generally useful.
Although character classes \n, \s, \S are not supported, you may use these workarounds:
| PCRE | MediaWiki | Description |
|---|---|---|
| \n | [^ -⟨U+10FFFF⟩] | A newline (a tabulation character can also be found[1]) |
| [^\n] | [ -⟨U+10FFFF⟩] | Any character except a newline and tabulation |
| \s | [^!-⟨U+10FFFF⟩] | A whitespace character: space, newline, or tabulation |
| \S | [!-⟨U+10FFFF⟩] | Any character except whitespace |
[1] To exclude the tabulation character as well, copy it and add it to the character set.
In these ranges, " " (space) is the character immediately following the control characters, "!" is the character immediately following space, and ⟨U+10FFFF⟩ stands for the literal character U+10FFFF, the last character in Unicode. Thus, the range from " " to U+10FFFF includes all characters except for control characters (of which articles may contain newlines and tabulation), while the range from "!" to U+10FFFF includes all characters except for control characters and space.
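The reasoning behind these ranges can be checked with a short Python sketch. Python's re module is used here only as a stand-in to test which characters fall inside each range; it is not the engine that runs insource: searches, and the constant names are hypothetical.

```python
import re

LAST = "\U0010FFFF"  # the last code point in Unicode

# Each class mirrors one workaround: a range that starts at space or "!"
# and runs to the end of Unicode, optionally negated.
NEWLINE_OR_TAB = re.compile("[^ -" + LAST + "]")  # stands in for \n
NOT_NEWLINE = re.compile("[ -" + LAST + "]")      # stands in for [^\n]
WHITESPACE = re.compile("[^!-" + LAST + "]")      # stands in for \s
NOT_WHITESPACE = re.compile("[!-" + LAST + "]")   # stands in for \S

for name, pattern in [("newline or tab", NEWLINE_OR_TAB),
                      ("whitespace", WHITESPACE)]:
    for ch in ["\n", "\t", " ", "a"]:
        print(name, repr(ch), bool(pattern.fullmatch(ch)))
```

Because newline and tabulation both sit below space in code-point order, the negated range starting at space catches both, which is why the tabulation character has to be excluded separately when only newlines are wanted.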
# < > [ ] | { }. For this template, we need to replace the pipe character with {{!}} so that the "pipe" for the regexp won't confuse this template (or any other template). We need the parentheses at times because an alternation finds the longest pattern, and so the parentheses define that boundary, but it's a boundary you don't have to make if an alternation is the entire regexp pattern.
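The pipe replacement can be automated before pasting a pattern into the template. The Python sketch below is hypothetical: it assumes the template is called as {{regex|pattern=...|prefix=...}}, using the parameter names from the parameter table, and it does not handle other wikitext hazards such as braces inside the pattern.

```python
def template_safe_pattern(regex):
    """Replace every pipe with {{!}} so the template call is not split."""
    return regex.replace("|", "{{!}}")


def regex_template_call(regex, prefix=None):
    """Build the wikitext for a call to the template (assumed name: regex)."""
    parts = ["regex", "pattern=" + template_safe_pattern(regex)]
    if prefix is not None:
        parts.append("prefix=" + prefix)
    return "{{" + "|".join(parts) + "}}"


print(regex_template_call("a(g|ch)e", prefix=":"))
# An escaped pipe \| turns into \{{!}}, the form used for a literal pipe.
print(template_safe_pattern(r"\[\[Wales\|"))
```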
Regexp searches are restricted on the server, so this template reduces the regex search footprint by using the prefix: filter every time, restricting the search domain to a namespace at most. The prefix: parameter can further filter a namespace by specifying pagenames that start with a given letter(s).
A search link stores a query in a link that takes you to live search results for that stored search. They're found on user pages and talk pages. Use one to bring the full feature set of MediaWiki Search, or features of external search engines, to bear for users unfamiliar with their search parameters.
One type of search link is a wikilink with all the capabilities of Search (search box), and with standard wikilink syntax: [[Special:Search/query|label]]. So this search link will (1) navigate: [[Special:search/Wales]] → Special:search/Wales or (2) search: [[Special:search/~Wales | search/~Wales]] → search/~Wales if you prefix a ~ tilde character.
All other search links are made from a template that will build a URL instead of a wikilink. A URL can, for example, call off-site search engines to search Wikipedia.
Search boxes are made by <inputbox> tags. See mw:Extension:InputBox.
For searches with exact matches, exact in upper and lower cases, or in punctuation marks, see Help:Searching § grep.
(term) in parentheses; useful for Wikipedia:disambiguation study <inputbox>...</inputbox>