RELATED APPLICATION
This application is a continuation of U.S. patent application Ser. No. 13/771,087, filed on Feb. 20, 2013, and entitled “EXECUTING A FAST CRAWL OVER A COMPUTER-EXECUTABLE APPLICATION”, the entirety of which is incorporated herein by reference.
BACKGROUND
An application is computer software that is designed to perform a particular task. Exemplary conventional applications include web browsers, word processing applications, spreadsheet applications, presentation applications, etc. Due to the increased popularity of smart phones, tablet computing devices, and other portable computing devices, applications have recently been designed for execution on such types of devices, where the applications are designed to be user-friendly and perform relatively simple tasks. Typically, these applications are available for download from an application repository, where a user can search for and select one or more applications, and cause selected applications to be retrieved from the application repository and installed on the computing device of the user.
Many currently available applications are configured to access content from the Internet by way of a network connection and present such content to a user responsive to the user initiating or interacting with the application. For instance, applications have been developed to provide users with restaurant reviews for restaurants that are relatively proximate to the respective users. Accordingly, the application can take the location of the user as an input, access data by way of the Internet that is relevant to the location of the user, and generate a page that includes the data for presentment to the user.
Further, many applications can generate pages for presentment to a user that include data that is not accessible on the Internet. Such data may be generated by the application developer or retained in a network-accessible repository that is not indexed by a search engine. Currently there are hundreds of thousands of applications that are available in application repositories. Generally, pages generated by these applications are not able to be searched by users; instead, a user must install and execute the application to view such pages. To assist users in ascertaining information about the content of applications, developers of the applications can assign textual metadata thereto that can be retrieved when a search is performed. It is to be understood, however, that content generated by the applications during execution is conventionally not searchable.
SUMMARY
The following is a brief summary of subject matter that is described in greater detail herein. This summary is not intended to be limiting as to the scope of the claims.
Described herein are various technologies pertaining to crawling computer-executable applications such that content (text, images, videos, etc.) of pages generated thereby is searchable. In an exemplary embodiment, an application may have content therein that is static. Static content is content that does not change over different executions of the application, such as different temporal executions of the application or execution of the application from different locations. Static content, in an example, can be extracted from the executable file (binary) of the application. Exemplary static content that can be included in an executable file includes strings, uniform resource locators (URLs) from which the application retrieves data, and the like. Other examples of static content include data retained in a resource file that is accessed by the application during runtime. Such a resource file may include text strings, images, URLs from which the application retrieves content, etc.
Additionally, an application can be configured to generate and display pages that include dynamic content. Dynamic content is content that changes over different executions of the application. In an example, the application can generate a page that includes first dynamic content during a first execution of the application, and the application can generate the page such that the page includes second dynamic content during a second execution of the application. For instance, dynamic content can change over time and/or can change based upon location from which the application is executed.
Technologies described herein pertain to crawling applications that present pages to users that include static content and/or dynamic content, wherein the crawling includes selectively writing content in pages generated by the application during runtime to disk. In connection with retrieving dynamic content included in at least one page generated by the application, the application binary is analyzed to discover type and location of user controls that can be interacted with by a user of the application during runtime thereof, where the user controls may include buttons, sliders, pull-down menus, hyperlinks, selectable lists, etc. Furthermore, for example, a navigation script can be learned or provided by a developer of the application, where the navigation script is an algorithmic traversal through pages of the application. For instance, a navigation script can indicate that a first button is to be selected on a first page, resulting in generation and presentment of a second page, and that a pull-down menu is to be accessed on the second page, and that a particular item in the pull-down menu is to be selected, causing the application to generate a third page, and so on.
The application can be loaded in an emulator, and execution of the application is emulated in the emulator using the locations of the user controls and the navigation script. This causes the application to generate numerous pages in accordance with the navigation script, wherein generation of the pages can include retrieving content from the Internet for inclusion in one or more pages. Each page generated by the application when being executed in the emulator can be written to disk, and a searchable index can be generated based upon the pages written to disk. Thus, content retrieved/generated by the application during runtime can be searched over utilizing a suitable search function. As noted above, the content retrieved/generated by the application can be based upon various parameters, such as location from which the application is emulated to be executing or other user input. Therefore, the application can be executed in the emulator multiple times, with each execution corresponding to a different location.
As can be ascertained, the process of emulating execution of the application may require a relatively large amount of time, particularly if execution of the application is emulated multiple times (for multiple locations) and if the application retrieves content from the Internet. Accordingly, a fast crawl over applications is described herein, wherein a fast crawl over an application comprises executing the application in the emulator utilizing at least one optimization technique, the optimization technique pertaining to crawling the application more quickly when compared to conventional approaches.
For example, when an application or an application update is received from a developer at an application repository, where it can be selected, downloaded, and installed by a user, the application can be subjected to a full crawl thereover. The full crawl over the application refers to an emulated execution of the application where substantially all pages that can be generated by the application (for substantially all locations from which the application outputs different content) are caused to be generated during the emulated execution. In other words, the full crawl is, as much as possible, an exhaustive emulation of the application. At a later point in time, statistics learned from the full crawl, previous full crawls, and/or previous fast crawls can be employed to perform a fast crawl over the application. The fast crawl utilizes at least one optimization technique to cause the fast crawl to be less time consuming than the full crawl. Pursuant to an example, uniform resource locators (URLs) identified as being pointed to by the application during the full crawl thereover can be retained in a list. A first optimization technique when performing the fast crawl comprises pre-fetching content from these URLs (e.g., in parallel), such that during an emulated execution of the application, the application need not access the content by way of the Internet, but may instead access the content from a local repository, which can significantly reduce an amount of time needed to emulate execution of the application.
Another exemplary optimization technique comprises analyzing pages written to disk from at least one previous crawl (full and/or fast), and identifying which of such pages includes the most unique content relative to other pages generated by the application during the crawl or relative to pages generated by the application over previous crawls. For instance, content of some pages generated by the application may change very little over different temporal executions of the application. In another example, a first page generated by the application during runtime may have a significant amount of duplicative content relative to a second page generated by the application during runtime. Thus, the optimization technique employed during the fast crawl can include causing the application to generate fewer pages when compared to the number of pages generated by the application during the full crawl, where the pages generated during the fast crawl are selected to provide a largest amount of unique content given a specified time constraint.
Another exemplary optimization that can be employed in connection with the fast crawl over the application is the identification and use of an appropriate location granularity, such that, with respect to an application that generates pages with content that depends on location of a computing device executing the application, execution of the application is emulated using appropriate location granularity. This leads to a reduction in a number of times that execution of the application needs to be emulated during a crawl. In an exemplary embodiment, the application may generate pages with different content at different locations, with a location granularity at the level of a city. Pages written to disk from previous crawls (full and/or fast) can be analyzed to identify the appropriate location granularity, such that, during emulation, the application is not caused to execute multiple times and provide the same content. As with the full crawl, pages retrieved during a fast crawl can be written to disk, such that a searchable index can be updated and content included in pages generated by the application can be searched over.
Other aspects will be appreciated upon reading and understanding the attached figures and description.
BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1 is a functional block diagram of an exemplary system for crawling a computer-executable application to retrieve content included in pages generated by the application during runtime.
FIG. 2 is a functional block diagram of an exemplary system that facilitates generating a navigation script used when emulating execution of an application.
FIG. 3 is an exemplary diagram that illustrates pages generated by an application during runtime that can be retrieved when performing a crawl over the application.
FIG. 4 is a functional block diagram of an exemplary system that facilitates identifying uniform resource locators (URLs) accessed by an application during runtime.
FIG. 5 is a functional block diagram of an exemplary system that facilitates determining an appropriate location granularity to employ when emulating execution of the application for a fast crawl.
FIG. 6 is a flow diagram that illustrates an exemplary methodology for executing a fast crawl over an application based at least in part upon content retrieved during a full crawl over the application.
FIG. 7 is a flow diagram that illustrates an exemplary methodology for executing a query over an index that comprises data retrieved during a fast crawl over an application.
FIG. 8 illustrates an exemplary graphical user interface that includes a search result comprising data retrieved during a crawl over an application.
FIG. 9 illustrates an exemplary graphical user interface corresponding to a search engine, the graphical user interface comprising a selectable vertical for searching application content.
FIG. 10 is an exemplary graphical user interface for searching content of applications available in an application repository.
FIG. 11 is an exemplary graphical user interface that includes a notification provided to a user based upon content of an application retrieved during a crawl thereover.
FIG. 12 is an exemplary computing system.
DETAILED DESCRIPTION
Various technologies pertaining to executing crawls over computer-executable applications will now be described with reference to the drawings, where like reference numerals represent like elements throughout. In addition, several functional block diagrams of exemplary systems are illustrated and described herein for purposes of explanation; however, it is to be understood that functionality that is described as being carried out by certain system components may be performed by multiple components. Similarly, for instance, a component may be configured to perform functionality that is described as being carried out by multiple components. Additionally, as used herein, the term “exemplary” is intended to mean serving as an illustration or example of something, and is not intended to indicate a preference.
As used herein, the terms “component” and “system” are intended to encompass computer-readable data storage that is configured with computer-executable instructions that cause certain functionality to be performed when executed by a processor. The computer-executable instructions may include a routine, a function, or the like. It is also to be understood that a component or system may be localized on a single device or distributed across several devices.
With reference now to FIG. 1, an exemplary system 100 that facilitates performing crawls over a computer-executable application is illustrated. The crawls performed can be full crawls or fast crawls, where a full crawl is a more exhaustive (and time-consuming) crawl when compared to a fast crawl. During the full crawl, as much content as can possibly be retrieved by the computer-executable application is caused to be retrieved, without regard to time or resource constraints. In contrast, for the fast crawl, the application is selectively caused to output “most important” pages while considering a time or resource constraint. Differences between a full crawl and a fast crawl will be set forth below. The system 100 comprises a data repository 102 that includes previous crawl data 104. The previous crawl data 104 includes content included in pages output by an application 106 when a crawl is performed over the application 106. The application 106 is a computer-executable application that is configured to perform a particular task when executed on a computing device. Generally, the computer-executable application 106, at runtime, is configured to present certain information to a user. The application 106 may be a game, an application for provision of news to the user, an application for provision of restaurant reviews to the user, an application for provision of music/video to the user, etc.
The application 106, when executed on the computing device, is configured to generate pages, wherein content of at least some of the pages may be indiscernible unless the application 106 is executed. Thus, the application 106 may be configured to retrieve content from some network-accessible repository, such as a computing device accessible by way of the Internet 108, and include such content in pages at runtime. In another example, the application 106 can retrieve content customized specifically for the application (which may not be available by way of the Internet 108), such as content generated by a developer or accumulated by the developer. The application 106 can retrieve the content and generate pages upon initialization of the application 106, upon receipt of input from a user of the application 106 (selection of a button, movement of a slider, selection of a menu, selection of an item from a list, etc.), or upon a certain event occurring (passage of a certain amount of time, a data source accessed by the application 106 outputting new content, etc.).
Generally, applications such as the application 106 are not subjected to searching by conventional search engines. This is because search engines are not provided with access to pages generated by the applications during runtime. Instead, the applications are “closed”, such that pages (and content therein) generated by such applications are retrievable therefrom only if the applications are executed and interacted with. The system 100 is configured to crawl the application 106, wherein crawling the application comprises causing the application 106 to generate pages and write content therein for the purposes of forming and updating an index that is searchable by a search engine.
A full crawl over the application 106 will now be described. During a full crawl over the application 106, both static and dynamic content about the application 106 can be obtained. Static content refers to data about the application 106 that does not change as the application 106 is executed at different times and/or with different input parameters, such as geographic location. Dynamic content refers to data output by the application 106 that can change when the application 106 is executed at different times and/or with different input parameters. In some cases, static data is included in the executable file of the application 106 (the binary), and can comprise text strings included in the binary and/or URLs pointed to by the application 106 during runtime to retrieve content. Additionally, static data may be included in a resource file that is accessed by the application 106 during runtime, wherein such resource file is generally provided with the executable file when the application 106 is downloaded to a computing device from an application repository (or during an update to the application 106). Since static data about the application 106 does not change over time, such static data can be retrieved upon, for example, the application 106 being added to the application repository, where such application 106 can be selected and downloaded by users. The static data may thereafter be retrieved only when an update to the application 106 is made by a developer thereof. This static data can be retained as a portion of the previous crawl data 104 in the data repository 102.
The system 100 may additionally comprise an emulator 110 that is employed in connection with obtaining dynamic data about the application 106. The emulator 110 generally refers to resources of a computing device that are allocated to emulating user interaction with the application 106. In some embodiments, the emulator 110 can additionally be configured to emulate a particular operating environment; for instance, the application 106 may be configured to execute on a mobile telephone, which has a certain operating system. The emulator 110 can be configured to emulate such operating system, thereby providing an environment where user interaction with the application 106 can be emulated. In this example, the application 106 generates a plurality of pages 112 during runtime, wherein the plurality of pages 112 can include content retrieved from the Internet 108 responsive to initialization of the application 106 and/or user interaction with respect to controls set forth in the application 106. Such controls can include buttons, sliders, selectable lists, menus, or other suitable controls. The pages 112 include dynamic data, in that content of the pages 112 can change based upon when the application 106 is executed, where the application 106 is executed, etc. Generally, the emulator 110 is configured to provide an environment and computer-executable code that emulates user interaction with respect to the application 106 to cause the application 106 to generate the pages 112.
In an exemplary embodiment, when the application 106 is provided to an application repository, where such application 106 can be located and downloaded by users, the application 106 can be analyzed and instrumented to cause the application 106 to write pages generated during runtime of the application to the data repository 102. Analysis of the application 106 can include identification of user controls in the application 106 and locations thereof. For instance, the application 106 can be analyzed to identify that on a homepage of the application 106 a button exists at certain coordinates. Additionally, the developer of the application 106 can provide a navigation script, which identifies preferred manners in which the application 106 is to be interacted with by end users. Developers often generate navigation scripts for the purposes of testing, and can provide the navigation script to an entity that manages the application repository. The navigation script can identify, for example, that a user will initially be directed towards a homepage of the application 106, and then may select a particular button to go to another page, and then may select a certain button to go to another page, etc. In another example, the navigation script can be automatically learned through analysis of the binary for the application 106. In yet another example, the navigation script can be generated by a third party tester (not the developer) of the application 106.
Once the locations of the user controls in the application 106 are identified and the navigation script is received, the emulator 110 can emulate user interaction with the application 106 during runtime in accordance with the navigation script. With more particularity, the emulator 110 can comprise a content retriever component 114 that utilizes the navigation script and learned locations of user controls in the application 106 to emulate user interaction with the application 106, thereby causing the application 106 to generate the pages 112 that include content retrieved from a network-accessible repository, such as one available by way of the Internet 108. As the content retriever component 114 causes the plurality of pages 112 to be generated, a writer component 116 can write such pages 112 to the data repository 102. Ideally, the content retriever component 114, during a full crawl, causes the application 106 to exhaustively generate the pages 112, such that the pages 112 include all possible content that can be retrieved by the application 106. Accordingly, in an example where the application 106 generates different pages depending upon a geographic location provided to the application 106, during the full crawl the content retriever component 114 can execute the navigation script over the application 106 several times (using different geographic location values). For instance, the application 106 may be configured to provide coupons for retail establishments across different locations. Therefore, the content retriever component 114 can cause the application 106 to generate pages for numerous different locations (e.g., each city in a particular geographic region). As pages are generated by the content retriever component 114, the writer component 116 writes the pages to the data repository 102. Content of the pages can be retained in the data repository 102 as a portion of the previous crawl data 104.
Additionally, the static data and the dynamic data retrieved during the full crawl of the application 106 may be employed to generate/update a searchable index 118. The searchable index 118 is shown as being included in the data repository 102; it is to be understood, however, that the searchable index 118 can be retained in a different data repository from the previous crawl data 104 or spread across numerous repositories. Accordingly, if a user subsequently sets forth a query to search over contents of the application 106, the searchable index 118 can be searched over based upon the query, and search results can be retrieved and provided to the user.
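By way of illustration only, generation and querying of a searchable index from pages written to disk can be sketched as follows. The sketch assumes a simple inverted index over page text; the function names build_index and search, the page identifiers, and the tokenization rule are hypothetical and are not set forth in the description above.

```python
import re
from collections import defaultdict

TOKEN_PATTERN = re.compile(r"[a-z0-9]+")

def build_index(pages):
    """Map each lowercase token to the set of page ids whose text contains it.

    `pages` is a dict of page id -> page text (a stand-in for crawled pages).
    """
    index = defaultdict(set)
    for page_id, text in pages.items():
        for token in TOKEN_PATTERN.findall(text.lower()):
            index[token].add(page_id)
    return index

def search(index, query):
    """Return the ids of pages that contain every token of the query."""
    tokens = TOKEN_PATTERN.findall(query.lower())
    if not tokens:
        return set()
    result = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        result &= index.get(token, set())
    return result

# Hypothetical pages written to disk during a crawl.
pages = {
    "app106/home": "Restaurant reviews near Seattle",
    "app106/coupons": "Coupons for Seattle retail stores",
}
index = build_index(pages)
print(search(index, "seattle coupons"))
```

A production index would typically add ranking and incremental updates; the sketch only shows how page content, once written to disk, becomes searchable.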
It can be ascertained that executing a full crawl over the application 106 can take a relatively large amount of time, as the application 106 may access several URLs to retrieve content, and may generate different pages depending upon an input parameter, such as geographic location. For instance, retrieving pages from the Internet 108 can be a relatively large time sink, in that in some cases it may take several seconds for the application 106 to access a URL and retrieve content therefrom to generate one or more of the pages 112. During the full crawl, however, the application is caused to generate as many different pages as possible, without regard to time or computing resources.
To reduce the amount of time needed to crawl the application 106 and/or to cause the most valuable content to be generated by the application 106 within a given time limit, a fast crawl can be undertaken. The fast crawl is completed in less time than the full crawl (when using the same computing resources) and/or is completed within the given time limit.
The fast crawl over the application 106 can be based upon the previous crawl data 104. As noted above, the previous crawl data 104 includes content retrieved from a previously-executed full crawl over the application 106 (and optionally data retrieved during previously-executed fast crawls over the application 106). During execution of the fast crawl, the content retriever component 114 causes the application 106 to retrieve less content from the Internet 108 when outputting the plurality of pages 112 when compared to the content retrieved from the Internet 108 during the full crawl of the application 106. There are a variety of optimizations that can be undertaken by the content retriever component 114 when undertaking a fast crawl over the application 106.
In a first optimization, the content retriever component 114 can identify which of the pages 112 generated by the application 106 include the most “new” content relative to an amount of time needed to generate such pages. The term “new content” can refer to content that is new relative to other content retrieved during a single crawl (e.g., a first page output by the application 106 may include content that is substantially similar to a second page that is output by the application 106 during a single execution of the application 106). Additionally, “new content” can refer to content that is new over different temporal executions of the application 106 or different locations provided as input to the application 106. For example, during a first execution of the application 106 at a first point in time, a page can be generated that includes first content retrieved from the Internet 108. During a subsequent crawl of the application 106 (at a later point in time), the page can be generated that includes the same or similar content. Therefore, during the fast crawl, the content retriever component 114 need not cause such page to be generated by the application 106, as it is likely that the page includes content that is substantially similar to content previously written to disk from a previous crawl. In another example, the page generated by the application 106 may include substantially different content from different temporal executions of the application 106. In such case, then, it may be desirable for the content retriever component 114 to cause the page to be output by the application 106 during the fast crawl. Therefore, the first optimization relates to identifying a subset of pages that can be generated by the application 106 that include a substantial amount of new content, and causing such subset of pages to be generated by the application 106 during the fast crawl (while not causing other pages that can be generated by the application 106 to be output).
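As a non-limiting illustration, one way to quantify how much “new content” a page includes is to compute the fraction of its tokens that did not appear in versions of the page (or in other pages) written to disk during previous crawls. The measure below, and the names tokens and novelty, are assumptions made for the sketch; the description above does not prescribe a particular measure.

```python
def tokens(text):
    """Split page text into a set of lowercase whitespace-delimited tokens."""
    return set(text.lower().split())

def novelty(current, previous):
    """Fraction of tokens in `current` that appear in none of `previous`.

    `current` is the text of a page from the latest crawl; `previous` is a
    list of page texts from earlier crawls (or other pages of the same crawl).
    Returns a value in [0.0, 1.0]; an empty page scores 0.0.
    """
    current_tokens = tokens(current)
    if not current_tokens:
        return 0.0
    seen = set()
    for text in previous:
        seen |= tokens(text)
    return len(current_tokens - seen) / len(current_tokens)

# A page whose content did not change between crawls has no new content.
print(novelty("sale on shoes today", ["sale on shoes today"]))  # 0.0
# Half of the tokens of this page were not seen before.
print(novelty("new sale on hats", ["old sale on shoes"]))       # 0.5
```

Pages with consistently low scores across crawls are candidates for omission from the fast crawl, while pages with high scores are candidates for inclusion.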
With more particularity, the content retriever component 114 can cause the application 106 to output the subset of pages by choosing and visiting a relatively small portion of the application 106 rather than visiting the application 106 exhaustively (e.g., causing the application 106 to generate a lesser number of pages than the application 106 is capable of generating). Additionally, the content retriever component 114 can cause the application 106 to output the subset of pages by executing the application 106 with a relatively small number of location inputs (if the application is location-aware) instead of all possible location inputs.
In an exemplary embodiment, a dynamic programming based algorithm can be employed to identify which of the pages 112 are to be generated by the application 106 during the fast crawl. Identifying pages generated by the application 106 that include the most “new content” can be particularly useful when setting forth a time limit within which the fast crawl must be completed.
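The particular dynamic programming algorithm is not specified above. As one hedged sketch, page selection under a time limit can be modeled as a 0/1 knapsack problem, in which each page has an estimated amount of new content (its value) and an estimated generation time in whole seconds (its cost). The sketch assumes, for simplicity, that each page can be generated independently of the others, which ignores the fact that reaching a page may first require generating its parent pages; the function name select_pages and the example figures are hypothetical.

```python
def select_pages(pages, time_limit):
    """Choose pages that maximize total new content within a time limit.

    `pages` is a list of (name, new_content, seconds) tuples, where `seconds`
    is a non-negative integer. Returns (total_new_content, chosen_names).
    """
    # best[t] holds the best (value, chosen pages) achievable within t seconds.
    best = [(0, [])] * (time_limit + 1)
    for name, value, cost in pages:
        # Iterate budgets downward so that each page is used at most once.
        for t in range(time_limit, cost - 1, -1):
            candidate = best[t - cost][0] + value
            if candidate > best[t][0]:
                best[t] = (candidate, best[t - cost][1] + [name])
    return best[time_limit]

# Hypothetical statistics learned from previous crawls.
pages = [("home", 5, 2), ("deals", 9, 4), ("about", 1, 3), ("news", 6, 3)]
print(select_pages(pages, 6))  # (14, ['home', 'deals'])
```

With a six-second limit, the sketch selects the two pages whose combined new content is largest while fitting within the limit, and skips the page that rarely changes.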
A second exemplary optimization that can be employed in connection with performing a fast crawl comprises pre-fetching of content from the Internet 108, such that when the emulator 110 emulates execution of the application 106, content at URLs accessed by the application 106 at runtime is already available in local storage. The content retriever component 114 can access the previous crawl data 104 and identify URLs pointed to by the application 106 during runtime (as identified in the previous crawl data 104). The content retriever component 114 can pre-fetch content at such URLs in parallel, such that when the application 106 is loaded into the emulator 110, the application 106 need not access the Internet 108 to retrieve content at the URLs, but can instead access the content directly from local storage. The content retriever component 114 can also identify patterns in URLs in the previous crawl data 104 to pre-fetch content from appropriate URLs. In an example, when different locations are provided to the application 106, respective URLs pointed to by the application 106 may slightly change. Over time, the manner in which URLs change can be identified, thereby facilitating retrieval of content at an appropriate URL when the application 106 is subjected to a fast crawl.
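A minimal sketch of the pre-fetching optimization follows, assuming that the URLs recorded during a previous crawl are fetched concurrently by a thread pool and that requests made during emulation are served from the resulting local cache. The names prefetch, make_cached_fetch, and fake_fetch are hypothetical; fake_fetch stands in for a real network request so that the sketch runs without network access.

```python
from concurrent.futures import ThreadPoolExecutor

def prefetch(urls, fetch, max_workers=8):
    """Fetch all URLs in parallel and return a dict of url -> content.

    `fetch` is any callable that takes a URL and returns its content.
    URLs whose fetch raises an exception are left out of the cache.
    """
    def safe_fetch(url):
        try:
            return url, fetch(url)
        except Exception:
            return url, None

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = pool.map(safe_fetch, urls)
        return {url: body for url, body in results if body is not None}

def make_cached_fetch(cache, fallback):
    """Return a fetch function that serves from `cache` before `fallback`."""
    def cached_fetch(url):
        if url in cache:
            return cache[url]
        return fallback(url)
    return cached_fetch

def fake_fetch(url):
    """Stand-in for a network request (assumption made for illustration)."""
    if "unreachable" in url:
        raise OSError("simulated network failure")
    return "content of " + url

# URLs assumed to have been recorded during a previous crawl.
known_urls = ["http://example.test/a", "http://example.test/b"]
cache = prefetch(known_urls, fake_fetch)
fetch_during_emulation = make_cached_fetch(cache, fake_fetch)
print(fetch_during_emulation("http://example.test/a"))
```

The fallback keeps emulation working when the application requests a URL that was not predicted from the previous crawl data.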
A third optimization that can be undertaken during a fast crawl over the application 106 comprises automatically identifying a granularity of location to provide to the application 106 during the fast crawl. In an example, the application 106 may be an application that outputs current sales for goods or services in respective geographic regions, such that provision of different locations to the application 106 results in different sales being output by the application 106. To cause the application 106 to output all possible sales, different locations must be provided to the application 106 (e.g., different cities). The content retriever component 114 can analyze the previous crawl data 104 to ascertain an appropriate location granularity to use when providing locations to the application 106 during the fast crawl. For instance, the application 106 may output different sales if the location is changed by a city; accordingly, changing the location provided to the application 106 by a city block does not result in new data being included in pages output by the application 106. A desired location granularity can be identified by determining a smallest granularity that causes the application to output different content. In another example, the desired location granularity can be selected to optimize a tradeoff between new data generated by the application and a time constraint for the fast crawl. By identifying an appropriate location granularity, a number of times that the application 106 is executed in the emulator 110 can be reduced (when compared to the number of times that the application 106 is executed during an initial full crawl), thereby decreasing time needed to crawl the application 106.
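One possible way to identify the location granularity from previous crawl data is sketched below. The sketch assumes that each observation pairs a hierarchical location, expressed as a tuple ordered from coarsest to finest level, with a fingerprint of the content output for that location, and it returns the finest level at which a change in location changes the content. The function name pick_granularity, the level names, and the data layout are hypothetical.

```python
def pick_granularity(observations, levels):
    """Return the finest location level at which content actually changes.

    `observations` is a list of (location, content) pairs, where `location`
    is a tuple ordered like `levels` (coarsest first) and `content` is any
    comparable fingerprint of the pages output for that location.
    Returns None if the content does not depend on location at all.
    """
    for depth in range(len(levels) + 1):
        seen = {}
        consistent = True
        for location, content in observations:
            key = location[:depth]
            # Locations sharing this prefix must yield identical content.
            if seen.setdefault(key, content) != content:
                consistent = False
                break
        if consistent:
            return levels[depth - 1] if depth > 0 else None
    # Content differs even for identical locations; use the finest level.
    return levels[-1]

levels = ["region", "city", "block"]
# Hypothetical previous crawl data: content varies by city, not by block.
observations = [
    (("WA", "Seattle", "block-1"), "seattle-sales"),
    (("WA", "Seattle", "block-2"), "seattle-sales"),
    (("WA", "Tacoma", "block-1"), "tacoma-sales"),
    (("WA", "Tacoma", "block-2"), "tacoma-sales"),
]
print(pick_granularity(observations, levels))  # city
```

In this example, one emulated execution per city suffices during the fast crawl, rather than one per city block.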
Turning now to FIG. 2, an exemplary system 200 that facilitates learning a navigation script that can be employed by the content retriever component 114 when the application 106 is crawled over in the emulator 110 is illustrated. The system 200 comprises a data repository 202 that includes a binary 204 for the application 106. The system 200 further includes an application analyzer component 206 that analyzes the binary 204 to identify existence and location of controls that will be presented to users during runtime of the application 106. As noted above, such controls can include buttons, sliders, pull-down menus, selectable lists, hyperlinks, etc.
The system 200 may also comprise a script generator component 208 that outputs a navigation script for the application 106. The script generator component 208 can receive the location of the user controls from the application analyzer component 206 and may select each possible control during an emulated execution of the application 106, thus generating the navigation script. The navigation script may subsequently be employed during full crawls or fast crawls over the application 106. In another embodiment, a third party tester (a person or entity other than the developer) can manually execute the application 106, and the script generator component 208 can record user interaction with the application 106. Such recording can be employed as the navigation script, which may then be employed during emulated execution of the application 106. In still another example, rather than the script generator component 208 generating the navigation script, such script can be provided by a developer of the application 106.
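As a non-limiting sketch of how a navigation script might be derived by selecting each possible control, the following assumes that the controls have already been reduced to a mapping from each page to the controls it presents and the page each control leads to. That mapping, the page and control names, and the function `generate_navigation_script` are hypothetical; the description does not prescribe a breadth-first traversal, which is used here only as one simple way of visiting every reachable page once.

```python
from collections import deque
from typing import Dict, List


def generate_navigation_script(controls: Dict[str, Dict[str, str]],
                               start: str) -> Dict[str, List[str]]:
    """Return, for each reachable page, one sequence of controls that reaches it.

    `controls` maps a page name to {control name: target page name}.
    """
    paths: Dict[str, List[str]] = {start: []}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for control, target in controls.get(page, {}).items():
            if target not in paths:  # first (shortest) path to this page wins
                paths[target] = paths[page] + [control]
                queue.append(target)
    return paths


# Illustrative control map for a small application (assumed values).
example_controls = {
    "home": {"btn_list": "list", "btn_about": "about", "btn_deals": "deals"},
    "deals": {"item_1": "deal_1", "item_2": "deal_2"},
    "list": {"btn_home": "home"},
}
script = generate_navigation_script(example_controls, "home")
```

Each entry of `script` can then be replayed in an emulator by starting the application and selecting the listed controls in order.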
Now referring to FIG. 3, a depiction 300 of exemplary pages 302-314, in the form of a tree structure, that can be generated by the application 106 is illustrated. The depiction 300 illustrates that seven separate pages 302-314 can be generated by the application 106 at runtime when interacted with by a user. For example, the first page 302 is output by the application 106 when the application 106 is initiated by the user. The first page 302 includes data d1, with a time t1 required to generate the first page 302. A control on the first page 302, when selected by the user, may cause the second page 304 to be output by the application 106, wherein the second page 304 includes data d2. The second page 304 is output by the application 106 in time t2. Once provided with the second page 304, the user can return to the first page 302 or exit the application 106.
In another example, when viewing the first page 302, the user may select a second user control and be provided with the third page 306. The third page 306 includes data d3, and requires time t3 to be output by the application 106 (where t3 includes the time t1 to output the first page 302).
Similarly, the first page 302 may have a third user control that, when selected by the user, causes the application 106 to output the fourth page 308. The fourth page 308 includes content d4 that may be different from the content d1 of the first page 302, the content d2 of the second page 304, and the content d3 of the third page 306, although there may be some overlap in content.
The fourth page 308 can include a plurality of user controls (e.g., three), wherein selection of the controls causes other pages 310-314 to be respectively output by the application 106. For instance, if a user selects a first user control in the fourth page 308, the application 106 outputs the fifth page 310, wherein the fifth page 310 includes content d5, and wherein the fifth page 310 is output in time t5 (which includes times t1 and t4). If the user selects a second user control in the fourth page 308, the application 106 outputs the sixth page 312, which includes data d6, and wherein the sixth page 312 is output in time t6 (which includes times t1 and t4). If the user selects a third user control in the fourth page 308, the application 106 outputs the seventh page 314, which includes data d7, and wherein the seventh page 314 is output in time t7 (which includes times t1 and t4).
The previous crawl data 104 in the data repository 102 can indicate how much new data is included in each of the pages 302-314, as well as an amount of time needed for the application 106 to output such pages. The content retriever component 114, in an exemplary embodiment, can perform an optimization to identify which of the pages 302-314 to cause to be output by the application 106 during the fast crawl, wherein the subset of pages identified by the content retriever component 114 can result in obtainment of the most new data in a specified time constraint. For example, the content retriever component 114 can be provided with a constraint that the content retriever component 114 has 30 seconds to obtain as much data as possible about the application 106 during the fast crawl. Through analysis of the previous crawl data 104, identity of a subset of the pages 302-314 can be determined, wherein the subset of pages results in obtainment of a maximum amount of new data that can be output by the application 106 in the time constraint.
A challenge when performing such optimization, for instance, is that the content retriever component 114 cannot cause the application 106 to jump directly to a particular page. For instance, to cause the application 106 to output the seventh page 314, the application 106 must first output the first page 302, and then output the fourth page 308, and thereafter output the seventh page 314. Thus, the content retriever component 114 can have knowledge of a navigation tree of the application 106, and can select a sub-tree of such navigation tree that provides the most new data in the constrained amount of time. During a fast crawl, then, the content retriever component 114 may cause the application 106 to output pages in the sub-tree.
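The sub-tree selection can be illustrated with a small sketch over the seven-page tree of FIG. 3. The per-page amounts of new data and per-page incremental output times below are invented for illustration, as are the function and variable names; the sketch further assumes that the time for a sub-tree is simply the sum of the incremental times of its pages, and it enumerates all subsets exhaustively, which is practical only for small trees.

```python
from itertools import combinations
from typing import Dict, FrozenSet, Optional, Tuple


def best_subtree(parent: Dict[int, Optional[int]],
                 new_data: Dict[int, int],
                 cost: Dict[int, int],
                 budget: int) -> Tuple[FrozenSet[int], int]:
    """Pick the set of pages with the most new data that fits in `budget`.

    A page may be chosen only if its parent is chosen, because the
    application cannot jump directly to a page.
    """
    pages = list(parent)
    best: FrozenSet[int] = frozenset()
    best_gain = 0
    for size in range(1, len(pages) + 1):
        for subset in combinations(pages, size):
            chosen = set(subset)
            # Reject subsets containing a page whose parent is missing.
            if any(parent[p] is not None and parent[p] not in chosen
                   for p in chosen):
                continue
            if sum(cost[p] for p in chosen) > budget:
                continue
            gain = sum(new_data[p] for p in chosen)
            if gain > best_gain:
                best, best_gain = frozenset(chosen), gain
    return best, best_gain


# Navigation tree of FIG. 3: page 302 is the root, 308 has three children.
parent = {302: None, 304: 302, 306: 302, 308: 302,
          310: 308, 312: 308, 314: 308}
# Assumed statistics from a previous crawl (illustrative values only).
new_data = {302: 10, 304: 2, 306: 8, 308: 6, 310: 20, 312: 4, 314: 9}
cost = {302: 5, 304: 4, 306: 6, 308: 5, 310: 8, 312: 3, 314: 10}  # seconds

selected, gain = best_subtree(parent, new_data, cost, budget=30)
```

With these assumed values and a 30 second budget, the selection includes the fourth page 308 chiefly because it unlocks the data-rich fifth page 310, which shows why pages cannot be ranked independently of their ancestors.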
With reference now to FIG. 4, an exemplary system 400 that facilitates identifying URLs from which content is pre-fetched (for utilization when executing a fast crawl over the application 106) is illustrated. The system 400 includes the data repository 102, which comprises the previous crawl data 104. The previous crawl data 104 includes identifications of URLs that were accessed by the application 106 during at least one previous crawl. The system 400 further comprises the content retriever component 114, which can receive the URLs and access the Internet 108 to fetch content at the URLs prior to the application 106 being executed in the emulator 110. During emulation, then, content from the URLs that has been pre-fetched can be quickly retrieved from a local repository by the content retriever component 114, rather than the content retriever component 114 having to access the Internet 108 to obtain such content when the application 106 is executing in the emulator 110.
In an exemplary embodiment, for each page that can be output by the application 106, a list of URLs from which content is to be retrieved by the application 106 to generate a respective page can be maintained. A challenge, however, is that when the application 106 goes from one execution to the next, a URL may not be identical (there may be some slight change). Thus, the application 106, when outputting the same page at different times or locations, may be retrieving content from different URLs. Oftentimes, however, URLs retrieved by the application 106 at different times and/or when the application 106 is emulated as being executed at different locations can be somewhat similar. For instance, only a particular parameter in the URL may change, wherein such parameter pertains to the location at which the application 106 is executed.
The system 400 can comprise a pattern recognizer component 402 that analyzes the previous crawl data 104 to identify patterns in URLs accessed by the application 106 during different crawls. For instance, the pattern may be a relatively slight change in the URL that is based upon the location provided to the application 106. A URL identifier component 404 can provide a URL to be fetched by the content retriever component 114 based at least in part upon a pattern recognized by the pattern recognizer component 402. In an example, the pattern recognizer component 402 can analyze the previous crawl data 104 to ascertain that, for a particular page output by the application 106 at different locations, a certain portion of a URL changes (e.g., the portion of the URL changes from “Seattle” when the application 106 is provided with the location of Seattle to “Chicago” when the application 106 is provided with the location of Chicago).
The pattern recognizer component 402 can recognize that this portion of the URL changes with location, and the URL identifier component 404 can receive such pattern and can identify URLs to be fetched by the content retriever component 114 based upon such pattern. For instance, the URL identifier component 404 can automatically modify the URL in the previous crawl data 104 to include data corresponding to the pattern recognized by the pattern recognizer component 402 (e.g., may change the portion of the URL to “Atlanta”).
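One simple way such a pattern could be recognized and applied is sketched below, under the assumption that the location appears as the value of a query parameter. The description does not limit the changing portion of the URL to a query parameter, and the URLs, the parameter name `city`, and both function names are hypothetical.

```python
from typing import List
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def location_parameters(url_a: str, loc_a: str,
                        url_b: str, loc_b: str) -> List[str]:
    """Return query parameters whose value tracks the supplied location."""
    query_a = dict(parse_qsl(urlsplit(url_a).query))
    query_b = dict(parse_qsl(urlsplit(url_b).query))
    return [name for name in query_a
            if name in query_b
            and query_a[name] == loc_a
            and query_b[name] == loc_b]


def url_for_location(template_url: str, params: List[str],
                     new_location: str) -> str:
    """Rewrite the location-dependent parameters of a previously seen URL."""
    parts = urlsplit(template_url)
    query = [(name, new_location if name in params else value)
             for name, value in parse_qsl(parts.query)]
    return urlunsplit(parts._replace(query=urlencode(query)))


# URLs as they might appear in previous crawl data (illustrative values).
seattle_url = "http://example.com/deals?city=Seattle&page=1"
chicago_url = "http://example.com/deals?city=Chicago&page=1"

params = location_parameters(seattle_url, "Seattle", chicago_url, "Chicago")
atlanta_url = url_for_location(seattle_url, params, "Atlanta")
```

The rewritten URL can then be handed to the pre-fetching step before the application is emulated at the new location.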
Turning now to FIG. 5, an exemplary system 500 that facilitates ascertaining a granularity of location to provide to the application 106 when a fast crawl is executed over the application 106 is illustrated. The system 500 includes the data repository 102, which comprises the previous crawl data 104. The previous crawl data 104 can include content from pages output by the application 106 during previous crawls over the application 106 as well as corresponding locations provided as input to the application 106 during such crawls. The content retriever component 114 can analyze the previous crawl data 104 to identify an appropriate granularity of location to provide to the application 106 during the fast crawl. The content retriever component 114 can include a granularity identifier component 502 that causes the application 106 to be executed with many different locations as input, wherein the locations have varying granularities. Different location-based applications may provide data using different location granularities. For example, if the application 106 is configured to output identities of restaurants, a change in location of a few hundred feet may result in different content being output by the application 106. If the application 106 is directed towards coupons, however, and the location provided to the application 106 is changed by a few hundred feet or a few kilometers, the output of the application 106 may be identical. In other words, the application 106, in such example, will provide new content only if location is changed at least at a city level.
The granularity identifier component 502 then can review outputs of several emulations of execution of the application 106 (at different locations/location granularities) and identify if the content of pages changes during the different emulations. For a relatively small granularity (e.g., location changes on the order of several hundred feet), if the output does not change using different input locations, the granularity identifier component 502 can provide location data at a larger granularity (e.g., a mile) to the application 106. Over time, the granularity identifier component 502 can determine an appropriate location granularity to use when performing a fast crawl over the application 106.
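The escalation from small to larger granularities can be sketched as follows. The sketch assumes a hypothetical `render` callable that stands in for one emulated execution at a given location and returns the page content, treats locations as points on a flat plane measured in kilometers, and probes only a single displacement per granularity; a practical implementation would likely probe several directions and starting points.

```python
from typing import Callable, Hashable, Iterable, Optional, Tuple

Location = Tuple[float, float]


def smallest_effective_granularity(
        render: Callable[[Location], Hashable],
        base: Location,
        granularities: Iterable[float]) -> Optional[float]:
    """Return the smallest step that changes the output, or None if none does."""
    baseline = render(base)
    for step in sorted(granularities):
        moved = (base[0] + step, base[1])
        if render(moved) != baseline:
            return step
    return None


def coupon_app(location: Location) -> int:
    """Hypothetical application whose content changes every 10 km eastward."""
    return int(location[0] // 10)


granularity = smallest_effective_granularity(
    coupon_app, base=(0.0, 0.0), granularities=[0.1, 1, 10, 100])
```

Once a granularity has been identified, the fast crawl can space its input locations by that step rather than by the finest step used during the full crawl.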
In practice, from time to time, the content retriever component 114 can execute a full crawl over the application 106 to ensure that operation of the application 106 is not changing, as well as to update statistics about the application 106. Typically, however, fast crawls can be executed, such that a relatively large amount of content output by the application 106 can be obtained and placed in the searchable index 118 while performing such crawls in a smaller amount of time relative to the full crawl and/or within given time/resource constraints.
With reference now to FIGS. 6-7, various exemplary methodologies are illustrated and described. While the methodologies are described as being a series of acts that are performed in a sequence, it is to be understood that the methodologies are not limited by the order of the sequence. For instance, some acts may occur in a different order than what is described herein. In addition, an act may occur concurrently with another act. Furthermore, in some instances, not all acts may be required to implement a methodology described herein.
Moreover, the acts described herein may be computer-executable instructions that can be implemented by one or more processors and/or stored on a computer-readable storage medium or media. The computer-executable instructions may include a routine, a sub-routine, programs, a thread of execution, and/or the like. Still further, results of acts of the methodologies may be stored in a computer-readable storage medium, displayed on a display device, and/or the like.
Now referring to FIG. 6, an exemplary methodology 600 that facilitates executing a fast crawl over an application is illustrated. The methodology 600 starts at 602, and at 604 an application is received. As described above, the application received at 604 is configured for installment on an end-user computing device, wherein the application, when executed by a user on such device, outputs a plurality of pages that include content retrieved by way of the Internet responsive to receipt of respective user input.
At 606, a full crawl is executed over the application. As described above, execution of the full crawl includes executing the application in an emulator and causing the application to output the plurality of pages. During the full crawl, for example, different input parameters can be provided and the application can be executed in the emulator multiple times, once for each different input parameter. Each page output by the application may then be stored to a data repository and can be used to generate/update a searchable index.
At 608, subsequent to the full crawl being executed over the application, a fast crawl is executed over the application based, at least in part, upon the full crawl. Specifically, output of the full crawl can be employed to update statistics about the execution of the application, including what new data is included in pages output by the application, identities of URLs that are accessed, amounts of time needed by the application to output pages, etc. By analyzing these statistics, the fast crawl can be executed more quickly than the full crawl (when using the same computing resources). For instance, during the fast crawl, a page that includes substantially similar content to another page may not be caused to be output by the application when the application is executing in the emulator. Further, content from URLs can be pre-fetched, such that when the application is executing in the emulator, the application need not access the URLs by way of the Internet, but may instead retrieve content from a local repository (e.g., content retrieved from the URLs can be cached). In another example, when executing in the emulator, the application 106 can be provided with location input at appropriate granularities such that the application 106 is not executed more than necessary in the emulator. The methodology 600 completes at 610.
With reference now to FIG. 7, an exemplary methodology 700 that facilitates outputting a search result that includes data obtained during a fast crawl over an application is illustrated. The methodology 700 starts at 702, and at 704 a fast crawl is executed over the application. Execution of the fast crawl results in writing of pages output by the application when executed in an emulator to disk and generating a searchable index based upon these pages written to disk. At 706, a query is received, wherein the query may be received at a web-based search engine (e.g., accessed by a user through utilization of a browser). In another example, the query can be received at a desktop search engine that is configured to search over content of a machine of a user. In still yet another example, the query can be received at a search engine that is configured to search content of applications in an application repository, wherein users can select applications for downloading and installing on their respective client devices.
At 708, the query is executed over a searchable index that comprises data obtained during the fast crawl over the application. Accordingly, the search result includes data from a page output during the fast crawl. The methodology 700 completes at 710.
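Acts 704 through 708 can be illustrated with a minimal inverted index over page text. The description does not specify the structure of the searchable index; the whitespace tokenization, the page identifiers, and the functions `build_index` and `search` are assumptions made solely for this sketch.

```python
from collections import defaultdict
from typing import Dict, Set


def build_index(pages: Dict[str, str]) -> Dict[str, Set[str]]:
    """Map each lower-cased term to the identifiers of pages containing it."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for page_id, text in pages.items():
        for term in text.lower().split():
            index[term].add(page_id)
    return index


def search(index: Dict[str, Set[str]], query: str) -> Set[str]:
    """Return pages that contain every term of the query."""
    terms = query.lower().split()
    if not terms:
        return set()
    results = set(index.get(terms[0], set()))
    for term in terms[1:]:
        results &= index.get(term, set())
    return results


# Text of pages written to disk during a fast crawl (illustrative values).
crawled_pages = {
    "app1/page3": "cheap flights to Hawaii this weekend",
    "app2/page1": "pizza deals in Seattle",
}
index = build_index(crawled_pages)
hits = search(index, "hawaii flights")
```

Each hit identifies both the application and the page that produced the matching content, which is the information a search result would need in order to point the user back to the application.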
With reference now to FIG. 8, an exemplary graphical user interface 800 is illustrated. The graphical user interface 800 may be a graphical user interface for a general purpose search engine that can be initiated through utilization of a web browser, through utilization of a search application installed on a computing device, or the like. The graphical user interface 800 includes a query field 802 that is configured to receive a user query. The user can place a cursor in the query field 802 and enter a textual query therein. In other embodiments, the user can provide a query to the search engine through a voice command. In the example shown here, the user sets forth a query and is provided with a plurality of search results 804-816. The search results 806-816 may be conventional web search results. The search result 804, however, is a search result that comprises data retrieved from an application during a full crawl or fast crawl. The search result 804 can be highlighted in some manner to indicate to the end-user that the search result includes data outputtable by a computer-executable application.
Selection of the search result 804 may direct the user to a web page that includes a screenshot of the page output by the application that includes data relevant to the query issued by the user. In another example, selection of the search result 804 may direct the user to a location where the application can be downloaded for installment on her computing device. If the application already exists on the computing device of the user, selection of the search result 804 may cause the application to be initiated on the computing device of the user.
With reference now to FIG. 9, another exemplary graphical user interface 900 of a search engine page is illustrated. Conventional search engines include numerous verticals that can be selected by users. When a user selects a vertical, a subsequent query provided by the user is executed only over such vertical. For instance, if the user wishes to obtain images about a particular celebrity, the user can select an “images” vertical, provide a query that includes the name of the celebrity to the search engine, and the search engine will provide images to the user. In the exemplary graphical user interface 900, the search engine includes six verticals: a “web” vertical 902, an “images” vertical 904, a “videos” vertical 906, a “maps” vertical 908, a “news” vertical 910, and an “apps” vertical 912, although a search engine may include more, fewer, or different verticals. The graphical user interface 900 additionally includes a query field 914. In an example, the user can initially select the “apps” vertical 912. Subsequently, the user can set forth a query in the query field 914, which causes the search engine to execute the query over the searchable index 118, which is based upon pages written to disk from a full and/or fast crawl. Thus, in this example, search results returned to the user do not include conventional search results retrieved from the Internet, but would be based upon content retrieved during the full crawl and/or fast crawl.
Now referring to FIG. 10, another exemplary graphical user interface 1000 is illustrated. The graphical user interface 1000 corresponds to an application repository, where a user directs a computing device to a network-accessible location, where the user can select, potentially pay for, and download applications for installment on the computing device. In the exemplary graphical user interface 1000, a query field 1002 can be included, where the user can set forth a query that is to be executed over content of applications in the application repository. In the graphical user interface 1000, the user has issued a query to the query field 1002, causing a plurality of search results 1004-1008 to be retrieved. In an example, the search result 1004 can include a graphical object 1010 that is representative of a first application. The search result 1004 may also include content 1012 from the first application that is relevant to the query set forth by the user in the query field 1002. For instance, the content 1012 may be a screenshot of a page output by the application represented by the graphical object 1010. The graphical object 1010 may be a selectable graphical object that causes, for example, the application to be downloaded and installed on a computing device of the user. In another example, if the application is already installed on the computing device of the user, selection of the graphical object 1010 can cause the application to be initiated on the computing device of the user.
The second search result 1006 includes a second graphical object 1013 corresponding to a second application and second content 1014 that is relevant to the query set forth in the query field 1002. The third search result 1008 includes a third graphical object 1016 corresponding to a third application and third content from the third application that is relevant to the query set forth in the query field 1002.
The graphical user interface 1000, in another embodiment, may correspond to applications installed on the computing device of the user. Therefore, rather than the query set forth in the query field 1002 being executed over all applications in an application repository, the query set forth in the query field 1002 may be executed only over applications installed on the computing device of the user (or applications selected by the user).
Now referring to FIG. 11, an exemplary graphical user interface 1100 is illustrated. For example, a user may register a query, such that the query is executed over applications in an application repository or applications installed on a computing device of a user. For instance, the user may be interested in a vacation to Hawaii, and may register a query “deals on trips to Hawaii”. This query can be executed periodically or from time to time, and a notification 1102 can be presented to the user if a search result that is relevant to the query is located. Continuing with the example set forth above, if an application outputs a page that includes information about a sale on plane tickets to Hawaii, the notification 1102 can be presented on the display screen of a computing device of the user informing such user of the content output by the application.
Now referring to FIG. 12, a high-level illustration of an exemplary computing device 1200 that can be used in accordance with the systems and methodologies disclosed herein is illustrated. For instance, the computing device 1200 may be used in a system that supports executing a full crawl and/or a fast crawl over an application. In another example, at least a portion of the computing device 1200 may be used in a system that supports searching over an index that comprises data retrieved during a full crawl and/or fast crawl over an application. The computing device 1200 includes at least one processor 1202 that executes instructions that are stored in a memory 1204. The memory 1204 may be or include RAM, ROM, EEPROM, Flash memory, or other suitable memory. The instructions may be, for instance, instructions for implementing functionality described as being carried out by one or more components discussed above or instructions for implementing one or more of the methods described above. The processor 1202 may access the memory 1204 by way of a system bus 1206. In addition to storing executable instructions, the memory 1204 may also store a navigation script, identities and locations of user controls of an application, etc.
The computing device 1200 additionally includes a data store 1208 that is accessible by the processor 1202 by way of the system bus 1206. The data store 1208 may be or include any suitable computer-readable storage device, including a hard disk, memory, etc. The data store 1208 may include executable instructions, content retrieved from executing a full crawl and/or fast crawl over an application, etc. The computing device 1200 also includes an input interface 1210 that allows external devices to communicate with the computing device 1200. For instance, the input interface 1210 may be used to receive instructions from an external computer device, from a user, etc. The computing device 1200 also includes an output interface 1212 that interfaces the computing device 1200 with one or more external devices. For example, the computing device 1200 may display text, images, etc. by way of the output interface 1212.
Additionally, while illustrated as a single system, it is to be understood that the computing device 1200 may be a distributed system. Thus, for instance, several devices may be in communication by way of a network connection and may collectively perform tasks described as being performed by the computing device 1200.
Various functions described herein can be implemented in hardware, software, or any combination thereof. If implemented in software, the functions can be stored on or transmitted over as one or more instructions or code on a computer-readable medium. Computer-readable media includes computer-readable storage media. A computer-readable storage medium can be any available storage medium that can be accessed by a computer. By way of example, and not limitation, such computer-readable storage media can comprise RAM, ROM, EEPROM, CD-ROM or other optical disk storage, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer. Disk and disc, as used herein, include compact disc (CD), laser disc, optical disc, digital versatile disc (DVD), floppy disk, and Blu-ray disc (BD), where disks usually reproduce data magnetically and discs usually reproduce data optically with lasers. Further, a propagated signal is not included within the scope of computer-readable storage media. Computer-readable media also includes communication media including any medium that facilitates transfer of a computer program from one place to another. A connection, for instance, can be a communication medium. For example, if the software is transmitted from a website, server, or other remote source using a coaxial cable, fiber optic cable, twisted pair, digital subscriber line (DSL), or wireless technologies such as infrared, radio, and microwave, then the coaxial cable, fiber optic cable, twisted pair, DSL, or wireless technologies such as infrared, radio, and microwave are included in the definition of communication medium. Combinations of the above should also be included within the scope of computer-readable media.
What has been described above includes examples of one or more embodiments. It is, of course, not possible to describe every conceivable modification and alteration of the above devices or methodologies for purposes of describing the aforementioned aspects, but one of ordinary skill in the art can recognize that many further modifications and permutations of various aspects are possible. Accordingly, the described aspects are intended to embrace all such alterations, modifications, and variations that fall within the spirit and scope of the appended claims. Furthermore, to the extent that the term “includes” is used in either the detailed description or the claims, such term is intended to be inclusive in a manner similar to the term “comprising” as “comprising” is interpreted when employed as a transitional word in a claim.